PIP-9: Performance Benchmark Adjustment

Note: The poll below serves only to gauge general sentiment towards the change and is not a binding vote, following the process outlined in PIP-8.

Do you support keeping the Performance Benchmark at the current level (95%)?
  • Yes
  • No
0 voters

PIP-9: Performance Benchmark Adjustment

Authors:

Delroy Bosco

Jackson Lewis

Harry Rook

Mateusz Rzeszowski

Status: Final

Type: Contracts

Abstract

Following on from the favorable signal received for PIP-4, the validator performance metrics were implemented and released as an additional feature to the Polygon Staking Dashboard to better inform delegators about validator health.

This PIP proposes to maintain the Performance Benchmark at its current level of 95%, instead of raising it to 98% at checkpoint number 42,943, as defined in the initially proposed timeline (in PIP-4).

Motivation

If Performance Benchmark 2 (“PB2”) were to be introduced today, 18 validators would underperform PB2, falling into grace periods.

The potential for 18 subsequent offboardings presents a concern as it takes time for 18 new validators to onboard and sync their nodes. This could be detrimental to network performance and overall developer and user experience.

Rationale

PIP-4 specified two performance benchmarks:

  1. PB1 → 95% of the median average of the last 700 checkpoints signed by the validator set (first 2,800 checkpoints)

  2. PB2 → 98% of the median average of the last 700 checkpoints signed by the validator set (continues thereafter)

This implementation method was chosen to ease validators into the process and improve their uptime over time. Pushing validators to achieve an unattainable uptime would have led to a large number of Final Notices and resultant offboardings.

Since the launch of the performance metrics, it can be observed that overall validator performance (in terms of checkpoint signing) has improved.

Currently, it can be observed that:

  1. 5 validators are operating below the current Performance Benchmark 1 (“PB1”);
  2. 18 validators are operating below the maximum Performance Benchmark possible under PB2 (98% of a 100% median average, i.e. 98%); and,
  3. 2 validators have received Final Notices so far.

The current performance benchmark has provided upward pressure on validator performance while also producing a steady amount of churn in the validator set, effectively identifying and removing underperforming validators and allowing for the onboarding of new ones. Over time, this process should strengthen the set as a whole.

Based on the above, we propose to keep the Performance Benchmark at its current level until the network shows it is capable of performing at a higher threshold.

Specification

In order to maintain a balanced performance benchmark, we propose the following:

Maintain the current level of required performance:

  • PB → 95% of the median average of the last 700 checkpoints signed by the validator set
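For illustration only, a minimal sketch of how this benchmark could be evaluated, assuming “median average” means the median of the validators' signing percentages over the last 700 checkpoints (the function and variable names are ours, not part of any implementation):

    from statistics import median

    def performance_benchmark(signing_pcts, factor=0.95):
        # signing_pcts: one value per validator, the % of the last 700
        # checkpoints that validator signed. The benchmark is `factor`
        # (95% under the current PB) of the set's median.
        return factor * median(signing_pcts)

    def meets_benchmark(own_pct, signing_pcts, factor=0.95):
        # True if a validator's own signing % clears the benchmark.
        return own_pct >= performance_benchmark(signing_pcts, factor)

    # Example: performance_benchmark([99.4, 98.7, 96.0, 91.2])
    #          -> 0.95 * 97.35 = 92.48...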

Security Considerations

If the performance benchmark is set too high, it may result in too many validator offboardings, which could temporarily impact network security whilst new validators are being onboarded.

For context on the current state of network performance: 2 validators have received Final Notices since the performance data set began at checkpoint 40,143.

Conversely, if the Performance Benchmark is set too low, it may not provide sufficient upward pressure on checkpoint-signing performance, which is one of the network’s key security features. In the long term, this would reduce the efficacy of the framework.

Conclusion

When validators have adapted to the higher level of performance necessary to maintain a validator slot (expressed by an overall increase in the total % of checkpoints signed by the network), the Performance Benchmark may gradually be amended to a level agreed upon by the community.

Notes

  1. The context provided about validator performance was taken at checkpoint number 42,512.

Copyright

All copyrights and related rights in this work are waived under CC0 1.0 Universal.

4 Likes

I think there need to be concrete steps laid out, both from the teams not meeting the original proposal numbers and from the Foundation, on how these performance issues are being rectified.
The 98% threshold isn’t unattainable, and shifting the requirement downward to accommodate issues not being worked on isn’t a sustainable or virtuous signal long term.
Obviously the network needs to remain viable in terms of the set, but I do think this PIP should be backed up with real, measurable steps for those that would fail the threshold, so that there is a short-term plan for getting to the figure.

2 Likes

Thanks for the proposal; we totally agree.

Yes, the 98% threshold is not unattainable, but I do not think that delaying the transition to a higher threshold is an incentive not to work on issues. Even the 95% threshold has already singled out unscrupulous participants, and as it has worked so far, it will continue to work.

This is a good solution, especially in light of recent events, and given that only 15 validators have now signed 100% of checkpoints.

3 Likes

While replacing 18 validators at once could indeed be problematic, there is also no need to keep the 95% level in this round of adjustment. I’d suggest raising it by 1% every 700 checkpoints until it reaches 98%.

On a separate note (maybe should be in its own thread)

A) I consider 98% too strict; the long-term goal and single threshold should be 97.5% or even 97%. Of the validators currently under 98%, 43% (6) of them are between 97% and 98%. These 6 represent 7.5% of all validators.

B) The network could come under attack, causing a lot of validators to fall below the threshold (be it 98% or 97%). Another unforeseen event could also have similar consequences. A limit on how many validators can be offboarded at once would be useful: for example (completely off the top of my head), 4 validators per 700 blocks, offboarding them in order of performance (worst performance gets offboarded first) until all validators marked for offboarding are gone.
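To make suggestion (B) concrete, a purely hypothetical sketch of such a cap, assuming a limit of 4 offboardings per benchmark window and that each flagged validator's signing percentage over the window is known (the limit, names, and data structures are illustrative only):

    def select_for_offboarding(flagged, limit=4):
        # flagged: list of (validator_id, signing_pct) marked for offboarding
        # in the current window. At most `limit` are offboarded now, worst
        # performers first; the remainder roll over to the next window.
        ordered = sorted(flagged, key=lambda v: v[1])   # lowest signing % first
        return ordered[:limit], ordered[limit:]         # (offboard now, deferred)

    # Example: with 6 flagged validators, the 4 worst are offboarded this
    # window and the other 2 are re-evaluated in the next one.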

7 Likes

I agree.
My only real point is that delaying it, in coordination with a concrete set of steps to work with those who have been struggling on uptime but are otherwise actively engaged, would be a good solution.
Perhaps there is a common thread to the issues… maybe AWS or some other provider is the cause? Maybe there has been a shift in the required specifications that hasn’t been broadly disseminated?
I simply think the action should be taken in a way that simultaneously puts forth a strategy to help resolve any underlying technical issues being experienced.

2 Likes

The performance benchmark is based on 700 checkpoints, which we believe is sufficient to accommodate events such as block reorg or temporary validator downtime.

Although many validators may fall into Grace Period 1 if we stick with the plan to implement 98%, we do not think we should further delay the implementation of PB2. Falling into Grace Period is merely a warning of potential future offboarding, and we believe that validators in Grace Period 1 will work to improve their performance.

There are advantages and disadvantages to this vote, but we believe that advancing towards PB2 is necessary to improve the network’s health.

1 Like

For any validator to sustain operations at high uptime, there are several factors at play. Having good infrastructure and good monitoring is a given. At the same time, recovering from issues quickly is critical, and this is an area that is very inadequate in Polygon. Issues happen and will happen in the future as well (of course, product improvements will reduce occurrences). When an issue does happen though (like the one 2 days ago), the recovery time was literally 6-7 hours for many of the affected validators, and mind you, this was not because of a mistake made by the validators.

At the same time, validator operations will run into issues of their own making, or into situations where the storage size needs to be reduced. I run validators with 4 TB of storage, and storage reduction is a once-a-quarter exercise. Restoring nodes from a snapshot can take days, and reliance on public RPC is not recommended and not of good quality as it stands.

I think you need to revisit why the performance benchmark was introduced. In my mind, it is only to weed out inefficient operators, not to keep validators under the constant stress of falling into a grace period (a stigma).

I would recommend keeping the benchmark level at 95% for the foreseeable future until:
1 - there are no widespread issues for a period of 6 months in which lots of validators are affected by a network event
2 - smaller snapshots are available for both bor (a must) and heimdall (target 1-2 hour turnaround time for download, extract, and sync-up)
3 - bor sync-up times are drastically improved → it took 6+ hours just to catch up from a snapshot that was 3 days old, plus another 6-8 hours to download and extract the snapshot
4 - snapshots are made available on a daily basis → perhaps a bounty/grant for providing a service like this

5 Likes

Agreed on the snapshotting services, @smartstake. As it is, it is the slowest recovery process across any chain. It needs a lot of work:

  • Old, inefficient and slow compression (gzip, single threaded)
  • Not mirrored in EU and US to increase DL speeds.
  • Not daily
  • Decompression requires 2x the allotted space; it is better to let the transport layer compress it on the fly. Overall, the amount of time required is longer when a separate decompression step is needed, and database files are not very compressible to start with.
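As a rough illustration of the streaming idea behind that last point, a minimal Python sketch that extracts a gzip-compressed snapshot directly from the download stream, so the archive never occupies a second, full-size chunk of disk space (the URL and target path are placeholders; serving an uncompressed archive with transport-layer compression would achieve the same effect):

    import tarfile
    import urllib.request

    SNAPSHOT_URL = "https://example.com/bor-snapshot.tar.gz"   # placeholder
    TARGET_DIR = "/var/lib/bor"                                # placeholder

    # Stream mode ("r|gz") reads and decompresses the archive sequentially
    # from the HTTP response, so the compressed file is never stored in full.
    with urllib.request.urlopen(SNAPSHOT_URL) as resp:
        with tarfile.open(fileobj=resp, mode="r|gz") as archive:
            archive.extractall(path=TARGET_DIR)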

At 98%, it is unlikely a validator will be able to recover in time to not be penalized.

3 Likes

2% of a ~15-day window (assuming 30-minute checkpoints) is ~7 hours. Today’s allowance of ~16 hours of downtime over the same period is enough to keep validators attentive to their uptime, but still lenient enough to account for the multi-hour downtime events that even the most attentive validators have experienced.
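For reference, the arithmetic behind those figures, assuming 30-minute checkpoints and the 700-checkpoint window (real allowances also depend on the set's median, per the PIP, so treat these as approximations):

    CHECKPOINT_MINUTES = 30      # assumed average checkpoint interval
    WINDOW_CHECKPOINTS = 700     # benchmark window

    window_hours = WINDOW_CHECKPOINTS * CHECKPOINT_MINUTES / 60   # 350 h, ~14.6 days

    for threshold in (0.95, 0.98):
        tolerated = (1 - threshold) * window_hours
        print(f"{threshold:.0%} threshold -> ~{tolerated:.1f} h of tolerated downtime")
    # 95% -> ~17.5 h (roughly the ~16 h cited above), 98% -> ~7.0 h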

I do not see how a 3% increase in required uptime improves network stability in any measurable way. The vast majority of validators are attentive to their uptime, and extended downtime can be out of their control (sudden node failure, network/peer instability, etc.). It’s my opinion that any near-future move to a required 98% will start to punish (or at least increase the stress of) good validators instead of achieving the intended goal of holding poor-performing validators accountable.

1 Like

“Sudden node failure” is within the validator’s control; it is recommended to keep a spare node ready to deploy.

1 Like