PROPOSAL: Validator Performance Management

Eric · June 10, 2022, 2:44pm

Executive Summary

In its current state, the PoS validator network is largely permissioned in that the previously-selected set of validators during testnet has largely persisted. Following the spirit of gradual decentralization, certain steps need to be taken with the end state being the validators assuming care for the network.

To arrive at a state of decentralized self-governance, the validators will have to self-regulate network participation to an agreed set of parameters.

The Polygon Governance Team wants to now propose, discuss, and gather consensus around a framework seeking to aid that aim.

Introduction

Self-regulation in this context refers to the setting and administration of conditions for the admission, participation, and as the case may be, the forced exit of validators from the “club” - with the last part formalizing previously-achieved consensus on the subject.

This includes setting parameters in a fair, transparent, and self-enforcing standard for:

measuring performance
compliance with conditions of participation
choosing and acting on remedial measures, and
actions to address breaches of compliance if and when they arise

The purpose of this initial proposal is to invite and assemble the wisdom of the validators and to collectively arrive at answers to questions that will eventually lead to a state of network self-governance.

The Preliminary Path to Self-Regulation

The Performance Management Proposal is divided into two parts. Part A proposes the preliminary parameters for network monitoring. Part B proposes the preliminary standards for remedial action for non-compliance with the standards proposed in Part A, up to and including a forced exit of a validator node from the network through the unbonding of their stake.

Part A: Network monitoring

The aim of Part A is to develop a fair framework to manage validator performance through a self-enforced performance standard across the network.

There could be reasons, technical or social, that lead to temporary conditions in which validator nodes underperform the common standard. To be fair, a process that leads to the forced exit of a validator for technical underperformance should accommodate these realities and be approached with some caution.

Q1) In a self-governed network, what are the parameters for the technical performance measurement of a validator node?

The proposed parameters are:

checkpoints signed expressed as a percentage, and
a time interval over which the checkpoint compliance, once established, is measured.

Q2) In a self-governed network, what is a non-compliant validator?

The proposed parameters are:

less than 98% checkpoints signed, and
measured over a continuous 14 day interval.
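
As a rough illustration of the Q2 parameters, the check below treats a validator as non-compliant when its share of signed checkpoints over the interval falls below 98%. The function names and the input counts are hypothetical; how checkpoints are counted over the 14 day interval is not specified in this proposal.

```python
# Sketch only: names and inputs are illustrative, not part of the proposal.
THRESHOLD_PCT = 98.0   # proposed minimum share of checkpoints signed
INTERVAL_DAYS = 14     # proposed continuous measurement interval


def signed_percentage(signed: int, total: int) -> float:
    """Share of the interval's checkpoints that were signed, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * signed / total


def is_non_compliant(signed: int, total: int) -> bool:
    """True when fewer than 98% of the interval's checkpoints were signed."""
    return signed_percentage(signed, total) < THRESHOLD_PCT
```

For example, signing 979 of 1,000 checkpoints in the interval (97.9%) would count as non-compliant, while signing 980 (98.0%) would not.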

Q3) In a self-governed network, how should the technical performance measurement period be monitored?

The proposed parameters are:

initially, to manually monitor performance on the Polygon Web Wallet v2 page; and
in a future proposal, establish an automated Performance Deficiency Report (“PDR”) that measures performance over the 14 day interval.

Q4) In a self-governed network, who should be responsible to monitor the technical performance of validators?

The proposed parameters are:

initially, the Performance Monitor (“PM”) will continue to be members of the Polygon team; and
in a future proposal, the validators will transition to assume the responsibility for the PM and self-monitor performance.

Q5) In a self-governed network, how should non-compliance with the performance standard be recognized and communicated to a validator operator? And by whom?

The proposed parameters are:

the PM will maintain a call-out list of the validator node operators’ contact information;
validator node operators will have a positive responsibility to keep their contact information accurate and up to date;
the PM will periodically test the call-out list to confirm its currency;
on the generation of a PDR, the PM will communicate a Notice of Deficiency (“NOD”) directly to the delinquent validator node operator by the means recorded in the call-out list; and
in a future proposal, the NOD will be automatically self-generated and delivered to the delinquent validator.

Before moving to the next step of adoption of the parameters by a vote of the validator community, here is a poll to gather a measure of soft consensus.

Yes - I agree with the parameters in Part A
Yes, I agree but see my comments below for consideration for inclusion
No - I do not agree with the parameters in Part A - see my comments below


Part B - Remedial measures and corrective action

The aim of Part B is to develop a fair framework to manage validator performance through a self-enforced performance standard across the network that incorporates the technical performance parameters from Part A, and additionally incorporate remedial measures for underperformance, up to and including the forced exit of validators by unbonding their stake.

The health of the validator network is connected to its efficiency, and its efficiency is connected to validator checkpoints and validator communications. When a validator is offline, or does not respond to communications when prompted, this can have an adverse effect on the network and by extension it can affect the success of the other members of the validator community.

In a prior post Off-boarding Offline Validator, the community already expressed a preference for using a multi-sig kick mechanism. When triggered, this would unbond the stake of a validator if the occasion was necessary. This proposal is to establish a consensus across the validator community of what the parameters should be and what qualifies as the “occasion” to unbond the stake of a validator. The above post also describes the technical implementation of validator offboarding.

Q6) In a self-governed network, what is the process for a remedial response to a non-compliant validator?

The proposed parameters are:

the Grace Period (“GP”) is 7 days;
on issuance of a Notice of Deficiency (“NOD”) from the Performance Monitor (“PM”) the operator will have a grace period to correct the deficiency noted in the NOD;
if the deficiency is corrected within the GP there is no further action;
if the deficiency is not corrected within the initial GP, then the delinquent validator will be issued a Final Notice (“FN”) of the intent of the community to implement a forced exit procedure by offboarding the validator from the network by unbonding their stake.
the FN is followed by a second GP.
if the deficiency is corrected within the second GP there is no further action.
if the deficiency is not corrected at the end of the second GP, the validator’s stake will be unbonded and the validator will be off-boarded from the network.
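
The Q6 steps can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name and return values are hypothetical, days are counted from the issuance of the NOD, and a correction on the last day of a grace period is assumed to count as "within" it.

```python
from typing import List, Optional, Tuple

GRACE_PERIOD_DAYS = 7  # proposed GP length; applies to both grace periods


def remedial_outcome(days_to_correct: Optional[int]) -> Tuple[List[str], str]:
    """Return (notices issued, outcome).

    days_to_correct is the number of days after the NOD at which the
    deficiency was corrected, or None if it was never corrected.
    """
    notices = ["NOD"]
    # Corrected within the initial GP: no further action.
    if days_to_correct is not None and days_to_correct <= GRACE_PERIOD_DAYS:
        return notices, "no further action"
    # Otherwise a Final Notice is issued and a second GP begins.
    notices.append("FN")
    if days_to_correct is not None and days_to_correct <= 2 * GRACE_PERIOD_DAYS:
        return notices, "no further action"
    # Not corrected by the end of the second GP: forced exit.
    return notices, "unbonded"
```

Under these assumptions, a correction on day 3 ends the process after the NOD alone, a correction on day 10 ends it after the FN, and no correction by day 14 leads to unbonding.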

Before moving to the next step of adoption of the parameters by a vote in the validator community, here is a poll to gather a measure of soft consensus.

Yes - I agree with the parameters in Part B
Yes - I agree but see my comments below for consideration and inclusion
No - I do not agree with the parameters in Part B - see my comments below


Conclusion

In summary, under the proposed parameters a validator operator who has been underperforming the common standard for 14 consecutive days will have a second 14 day period to correct the deficiency before a process to unbond their stake is implemented.

The parameters in Part A and Part B are proposals for ideation by the validator community specifically and the Polygon community at large. Suggestions for suitable alternate parameters are invited and encouraged during the incubation period to adjust the proposals into a consensus prior to moving to the next step of adoption of the parameters by a community vote. Once adopted, the framework will allow further decentralization of the network by means of validator self-regulation.

BlocksUnited · June 24, 2022, 5:02pm

Our concern with Part A requiring 98% or better checkpoints signed over only 14 days is that it takes the system time to catch up and get a validator back up to 100% checkpoints signed. Perhaps a little more wiggle room is needed, like 17 days.

Our belief is that Part B should allow a validator more time to comply, like 10 days.

Eric · July 1, 2022, 8:10pm

ADDENDUM

Some instances of validator underperformance may be beyond the control of individual validators. For example, underperformance could span the entire network of validators for external or technical reasons unrelated to validator behaviour.

In these cases, actual performance up to the theoretical maximum of 100% may not be possible, and a fair and accurate measurement system should not rely on a 100% theoretical maximum.

For this reason, measuring the delta between one validator’s performance and the central tendency of the performance of all of the validators in the network will better reflect the true state of underperformance, and will be a fairer measurement system because it accounts for deficiencies that affect the performance of all validators.

I would propose to amend the above proposal so that the benchmark is the median performance of all of the validators in the network, and underperformance is signalled when a validator’s performance falls below 98% of the median performance value of the network.
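
A minimal sketch of the median-relative benchmark follows. The function name, the validator labels, and the percentages are made up for illustration; only the "98% of the network median" rule comes from the amendment above.

```python
from statistics import median
from typing import Dict, List


def underperformers(signed_pct: Dict[str, float], factor: float = 0.98) -> List[str]:
    """Validators whose signed percentage is below factor * network median."""
    benchmark = factor * median(list(signed_pct.values()))
    return sorted(name for name, pct in signed_pct.items() if pct < benchmark)


# Hypothetical network-wide degradation: nobody reaches a fixed 98%,
# but only validator D is far below the rest of the network.
network = {"A": 90.0, "B": 88.0, "C": 91.0, "D": 80.0, "E": 89.0}
print(underperformers(network))
```

In this made-up example the median is 89%, so the benchmark is about 87.2% and only validator D is flagged, whereas a fixed 98% cut-off would have flagged every validator.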

AlgoRhythm · July 14, 2022, 12:50am

Part A)
2% of 14 days is less than 7 hours. Even taking into account Eric’s addendum above, a validator can use up all their downtime due to a missed notification overnight. I believe either the 98% or the 14 day metric (or both) have to change. My initial recommendation would be somewhere in the 92-95% range over 14 days, or around 95-98% over 21 days.
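
The arithmetic behind these figures can be sketched as follows. This assumes checkpoints are evenly spaced in time, so that a signed-percentage threshold maps linearly onto hours of downtime; the function name is hypothetical.

```python
# Sketch only: assumes evenly spaced checkpoints, so the unsigned share of
# checkpoints corresponds directly to a share of wall-clock time.

def downtime_budget_hours(threshold_pct: float, interval_days: int) -> float:
    """Hours a validator can be offline in the interval and still meet the threshold."""
    return (100 - threshold_pct) * interval_days * 24 / 100


# Threshold and interval combinations mentioned in this thread.
for pct, days in [(98, 14), (95, 14), (92, 14), (98, 21), (95, 21)]:
    print(f"{pct}% over {days} days: {downtime_budget_hours(pct, days):.2f} h")
```

With 98% over 14 days the budget is 6.72 hours, which is the "less than 7 hours" figure above; 92% over 14 days allows 26.88 hours and 95% over 21 days allows 25.20 hours.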

Part B)
I agree with Part B’s schedule if Part A is more lenient.

Eric · July 14, 2022, 12:12pm

Thank you for your feedback AlgoRhythm.

To shed some light on how the proposal is evolving from external and internal feedback, and to address your comment, here is how I understand this proposal will operate: a Notice of Deficiency will be generated whenever a validator’s checkpoint percentage drops below a threshold set relative to the median of the entire network.

Operationally, this means that if a validator drops below 98% of the median performance level of the network, they will then have the initial grace period to remedy the deficiency.

The proposed initial grace period was arbitrarily chosen at 7 days. On feedback from validators, the initial grace period is now proposed to be 10 days.

If the deficient validator does not bump above the 98% threshold by the end of the 10 day initial grace period, a second notice, the Final Notice, will be pushed out, and the final 10 day grace period will begin. If during the final grace period the deficient validator does not bump above the 98% threshold, the next step of unbonding the stake will be activated without further communications.

By way of example only:

Validator 1 performance drops to 55% of the network median and the unbond clock begins. After 4 days the performance bumps above 98% and the remedial measure is terminated and validator continues as usual.

Validator 2 performance drops to 70% of the network median and the unbond clock begins. After 19 days the performance bumps above the 98% network threshold, and the remedial process is terminated and the validator continues as usual.

Validator 3 performance drops below the 98% network threshold and does not bump above 98% for 20 consecutive days. The remedial process does not terminate and proceeds to the next step without further communications or appeal.

The process is still open for input and inputs from validators are highly encouraged. The validators will be asked to ratify whatever the final form of the policy will be as part of the progressive process of taking ownership of the network.

Eric · July 14, 2022, 12:34pm

On-boarding validators is purposely not included in this proposal. At the moment we are thinking of a multi phase process which looks like:

Phase 1 (current) Polygon team sets the parameters and chooses the validators
Phase 2 (next step) Validator team sets and approves the parameters, and the Polygon team acts on the direction of the validators
Phase 3 Validator team sets the parameters and onboards validators themselves.

The on-board process was intended to be PART C of the performance management proposal.

AlgoRhythm · July 14, 2022, 3:24pm

Thanks Eric. Maybe I’m not following the time measurement of the 98% threshold. Today, the Web Wallet UI measures uptime based on the prior 200 checkpoints (roughly 4 days). I read your statement in Part A “less than 98% checkpoints signed, and measured over a continuous 14 day interval” to mean the new measurement would be 14 days. Are you considering a validator to be compliant if they reach 98% uptime during the current 4 day window or a new 14 day window?

Eric · July 14, 2022, 4:01pm

Hi AlgoRhythm,

I was proposing they become compliant anytime in the new 14 day (initial grace) window.

Your input is very valuable and I thank you for following up. We would like to see these parameters not just agreed on by the validators but, to the extent possible, formulated by them, and therefore any comments made by a validator are persuasive.

Essentially, we would like to see an agreed performance threshold, and the ability to off-board for cause, but not off-board by mistake. Anything you or your peers have to add in shaping this is well received.

H_Rook · July 15, 2022, 3:27pm

Further to the conversation and considering the feedback received, the information received has been transposed into the following technical document:

Validator Performance Metrics

The performance benchmark (“PB”) is 98% of the Median Average of checkpoints signed by Validators in the Measurement period (“MP”).
The MP is the last 700 checkpoints on a rolling basis.
Currently this data is shown on Polygon Web Wallet v2 and is based on the last 200 checkpoints (this can be changed to 700 checkpoints).
At each checkpoint we calculate the % of checkpoints signed in the MP by each validator and measure against the PB.
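
A rolling Measurement Period can be sketched with a fixed-length buffer of per-checkpoint results. The class and function names below are hypothetical, and this is not how Polygon Web Wallet v2 computes its figures; it only illustrates the "last 700 checkpoints" and "98% of the median" rules stated above.

```python
from collections import deque
from statistics import median
from typing import Iterable


class RollingSigned:
    """Tracks whether a validator signed each of the last `window` checkpoints."""

    def __init__(self, window: int = 700):
        # deque with maxlen drops the oldest entry once the window is full.
        self._results = deque(maxlen=window)

    def record(self, signed: bool) -> None:
        self._results.append(signed)

    def percent_signed(self) -> float:
        if not self._results:
            raise ValueError("no checkpoints recorded yet")
        return 100.0 * sum(self._results) / len(self._results)


def performance_benchmark(percentages: Iterable[float], factor: float = 0.98) -> float:
    """PB: 98% of the median signed percentage across validators in the MP."""
    return factor * median(list(percentages))
```

At each new checkpoint, every validator's `record` would be called once, and its `percent_signed()` compared against `performance_benchmark(...)` computed over all validators.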

Scenario of Validator Underperformance:

The scenario below assumes that the median average of validator performance is 100%, meaning the PB is 98%.

If a Validator falls below the PB in the MP (700 checkpoints ~14 days)

Validator A signed 95% of checkpoints in the MP.
Validator A enters into the Grace Period (“GP”).
They receive a notice stating they have 700 checkpoints to rectify; or they will receive a Notice of Deficiency (“NOD”).

If a Validator falls below the PB for a further 700 checkpoints (1400 checkpoints total / ~28 days)

In the following 700 checkpoints, Validator A signed 90% of checkpoints.
Validator A receives the NOD, stating: “You have not been compliant with the PB for 1400 checkpoints. If not compliant with the PB in the following 700 checkpoints, you will receive a Final Notice (“FN”) and be kicked with no further recourse.”

If a Validator falls below the PB for a further 700 checkpoints (2100 checkpoints total / ~42 days)

In the following 700 checkpoints, Validator A signed 92% of checkpoints.
Validator A receives a FN of the intent of the community to implement a forced exit procedure by offboarding the validator from the network by unbonding their stake.
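
The escalation in this scenario can be sketched as a mapping from consecutive non-compliant periods to notices. This is an illustration only: it treats the MP as back-to-back 700-checkpoint blocks as in the scenario (not a rolling window), it assumes a compliant period resets the count, and the function and label names are hypothetical.

```python
from typing import List

# Notice issued after the 1st, 2nd and 3rd consecutive period below the PB.
NOTICES = {
    1: "grace period notice",
    2: "Notice of Deficiency (NOD)",
    3: "Final Notice (FN)",
}


def escalation(period_pcts: List[float], benchmark: float = 98.0) -> List[str]:
    """Notices issued as consecutive 700-checkpoint periods fall below the PB."""
    issued = []
    streak = 0
    for pct in period_pcts:
        if pct >= benchmark:
            streak = 0  # assumption: a compliant period resets the escalation
            continue
        streak += 1
        issued.append(NOTICES[streak])
        if streak == 3:
            break  # after the FN the validator is off-boarded
    return issued


# Validator A from the scenario: 95%, then 90%, then 92%.
print(escalation([95.0, 90.0, 92.0]))
```

For Validator A this yields the grace period notice, then the NOD, then the FN, after 2100 checkpoints (about 42 days) below the PB.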