PIP-10: Increase StateSync Confirmations

PIP-10: Increase StateSync Confirmations

Authors:

Krishna Upadhyaya
Manav Darji
Pratik Patil

Type: Core

Table of Contents:

  • Abstract
  • Rationale
  • Specification
    • Genesis File
    • Bor Consensus Rules
  • Security Considerations
  • Backward Compatibility
  • Appendix
  • Copyright

Abstract

At block #39599624, The Polygon PoS chain experienced a reorg with a depth of 157 blocks. It was observed that the cause of this was due to two forks running with more than 16 blocks of depth, which is greater than one sprint length. As a result of this, the value of to (which is used to fetch the state sync events and calculated based on sprint length) was different on both forks, causing a bad block error to occur.

Following the bad block error, reorgs were occurring at a greater frequency than normal and block propagation was affected as peers would reject blocks and fail to broadcast them further.

This caused issues with the user experience and chain stability which we seek to address with this proposed solution.

Rationale

At the start of each sprint, when Bor commits the state (which includes state sync transactions), it fetches state sync events from Heimdall using the two arguments below:

  1. fromID - value which is a unique, incremental identity of the state, and;
  2. to - value, which is a timestamp.

Using these two arguments Bor is directed to fetch from Heimdall all the state sync events that occurred between fromID and to.

The calculation of to occurs in Bor and depends on the sprint length, to is the timestamp of the block (n) which is a ‘sprint length’ behind the ‘current block’:

  • n = (current block number) - (sprint length)
    

Once the block number of n is calculated, Bor fetches the block from its local database.

The issue can occur during network partitions when two forks are running in parallel. During a partition, there is a high chance that the timestamp of block n will differ between parallel forks. As the forks would be running in isolation, the block formation would not happen simultaneously, causing unequal timestamps, and by extension, values for to.

The Delhi Hardfork reduced the sprint length from 64 to 16 blocks. As a result of this, Bor now fetches state sync events from Heimdall more frequently as it occurs at the start of each new sprint. The result of this is that Bor nodes running different forks could receive a different number of state sync events (due to the calculation ‘to’) from Heimdall during shallower partitions that tend to occur more frequently.

Example 1.

Assuming that we have two different forks, “A” and “B”, with a different number of state sync transactions at the start of a particular sprint:

If there are 2 state syncs in fork A and 3 state syncs in fork B, when forks A and B merge back, let’s say fork B merges back to fork A, then B will import A’s state.

When nodes on fork B import the state from fork A, Heimdall will attempt to fetch the state sync from Bor. At this point (as mentioned above) Bor will calculate the block as n = (current block number) - (sprint length).

When nodes from fork B fetch block n from the local database, the calculation will be based on the state of fork B and not the one from the incoming chain. This causes the value of to to differentiate between the two forks and Heimdall will return 3 state syncs instead of 2. This will result in a BAD BLOCK error (invalid merkle root).

Specification

This PIP proposes to Introduce a new genesis parameter called stateSyncConfirmationMultiplyer. This will be used in conjunction with the current calculation to increase the block range used to calculate the to timestamp while fetching the state-sync events from Heimdall.

Genesis File

We will introduce a new genesis parameter stateSyncConfirmationMultiplyer which will be stored as a map[string]unit64 and used to calculate the block interval for state-sync confirmations.

"stateSyncConfirmationMultiplyer": {
    "0": 1,
    "HF_BLOCK": 8
},

Bor Consensus Rules

Current Implementation:

// Currently sprintLength is 16
to = GetHeader(currentBlockNumber - sprintLength).Time

Proposed Change:

// fetch stateSyncConfirmationMultiplyer from genesis file
stateSyncConfirmationMultiplyer := FetchStateSyncMultiplyer(currentBlockNumber)
to = GetHeader(currentBlockNumber - (sprintLength*stateSyncConfirmationMultiplyer)).Time

Summary:

  • Current sprint length = 16
  • State sync confirmation multiple = 8 (Proposed)
  • The block range to calculate the to value will increase from 16 to 128 (16*8)

Security Considerations

The proposed state sync confirmation multiple would lead to state sync transactions being delayed by ((128-16) * 2.25) ~ 252 seconds as the interval would be increased from 16 to 128 blocks.

Backward Compatibility

This PIP will not be backward compatible with the current implementation of Bor and will therefore require a Hard Fork.

It should be considered that due to the use of sprint length within the stateSyncConfirmationMultiplyer, any future changes to the sprint length will require a revisit to the calculation logic.

Appendix

Copyright

All copyrights and related rights in this work are waived under CC0 1.0 Universal.

6 Likes

Taking the assumptions made here as correct, the proposal’s tradeoff seems like a worthy one to implement.

But I am curious why the new sprint length was decided on 128 blocks whereas before 64 seemed to be sufficient to address the problem?

Hi @Gaia,

The above proposal does not seek to change sprintlength itself. It proposes to change the calculation of a ‘to’ value, which determines how frequent the state sync mechanism between Bor and Heimdall is activated.

Through analysis of the deep reorg and bad block error that occurred earlier this year, it was found that this value was the root cause (see the updated rational).

There have been some recent amendments to this PIP by the authors which have made this more clear.

2 Likes

Reserving this for later comment (if any) and including a previous gov forum discussion around this issue.

Thanks @H_Rook for flagging this!


1 Like