Proposal: Increase block interval & wiggle time

Hi, we operate Web3Nodes - the 4th largest validator (by stake) on the Polygon PoS chain. We have been collaborating closely with the Polygon Foundation to diagnose and debug the recent network issues.

To ensure the reliability and stability of the chain, we strongly recommend increasing the block time to 5 seconds and wiggle time to 10 seconds as soon as possible.

With yesterday’s announcement from the Polygon team and the issues the Polygon chain is currently experiencing, such as frequent reorgs, unpredictable block intervals, and overwhelmed mempools, we think it is essential to increase the block time on mainnet as soon as possible. This change was already implemented and tested on the Mumbai testnet by the Polygon team, and it greatly improved reliability.

The current target block interval (2 seconds) is too short for blocks to propagate in a timely manner across the entire validator network. Not every validator runs on beefy hardware in a reliable network environment, so not all of them can reliably produce and broadcast each block in under 2 seconds. Due to the network’s decentralized and permissionless nature, the hardware and network requirements are not something that can be enforced without a protocol update.

As a result, the actual block interval is closer to 2.7 seconds on average. If the selected validator for the current sprint does not produce (and successfully propagate) a block within the current wiggle time (4 seconds), other validators in the validator set of the current span are permitted to kick in and produce blocks out-of-turn, potentially causing a reorg.

Until we have a system in place that can dynamically react to network conditions in the protocol, we should perform a network parameter change as soon as possible to increase the block time to 5 seconds and wiggle time to 10 seconds.

Realistically, this should not affect the throughput of the network because of the recent increase of the max gas limit (block size) from 15M to 30M. It will also bring back MATIC burns, which have slowed to a negligible pace since that increase.

We urge all validators to participate in this conversation so that we can make the change as soon as possible.

2 Likes

5 seconds is more than double the current block time; it could significantly increase confirmation times.

Can’t we go lower, like 4 seconds?

1 Like

I disagree with this proposal - it seems like a very heavy-handed approach that treats the symptoms of the current issue rather than the cause, whilst simultaneously changing fundamental aspects of the Polygon blockchain when there are many less severe options that I believe should be exhausted first.

I have a few thoughts:

  1. Increasing the block time would worsen the problems in the mempool due to processing transactions more slowly.
  2. When you say ‘the current target block interval is too short for the blocks to propagate’ - I strongly disagree with this. The chain was running smoothly and not having these reorg issues for quite some time before these issues started happening. We therefore know that the block interval isn’t too short for the blocks to propagate - although I do believe that validators should use a specially modified Bor client on their sentries to aid in the propagation.
  3. Unless the current bottleneck with slow validators is the actual block building - which I don’t believe it is - then increasing the block time to 5s will not significantly decrease the frequency of the reorgs - only the depth. A validator hanging for 20s and then catching up and reorging its backup is currently a 10 block deep reorg. With 5s blocks and 5s wiggle it’d be a 4 block reorg.
  4. I also don’t know that I agree with your reasoning around some validators using slow hardware. I do not think that the chain should be modified due to a small minority of validators not running adequate hardware. A 5s blocktime would significantly increase the costs for users of the Polygon chain, and I object to the thought of slowing the blockchain down for everyone while simultaneously passing costs from the slow validators to the users of the chain. This isn’t Ethereum - it’s a proof-of-stake network and being a validator is a privilege that comes with plenty of rewards. If a validator can’t afford to run the hardware needed for a working validator/sentry node ($200 x2 / month for a good OVH VPS in Virginia) then they shouldn’t be a validator. I can’t emphasize enough how much I am against the idea of slowing down the entire blockchain and increasing costs for all users just to accommodate a small minority of validators who aren’t willing to upgrade their hardware.
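
The arithmetic in point 3 collapses to a simple closed form: reorg depth is roughly the hang time divided by the difference between the backup and primary block cadences. A quick sketch (my own restatement, treating the race as an idealized continuous one and ignoring tie-breaking at block boundaries):

```python
def reorg_depth(hang, primary_bt, backup_bt):
    """Blocks reorged when the primary hangs for `hang` seconds and then
    resumes at `primary_bt` s/block, while the backup has been producing
    at `backup_bt` s/block (block time plus wiggle)."""
    # Backup height: t / backup_bt; primary height: (t - hang) / primary_bt.
    # They meet at t = hang * backup_bt / (backup_bt - primary_bt), so the
    # backup has produced hang / (backup_bt - primary_bt) blocks by then.
    return hang // (backup_bt - primary_bt)

# Current parameters: 2s blocks, 4s effective backup cadence
assert reorg_depth(20, 2, 4) == 10   # 20s hang -> 10-block reorg
# Proposed parameters: 5s blocks, 10s effective backup cadence
assert reorg_depth(20, 5, 10) == 4   # same hang -> 4-block reorg
```

This matches the 10-block vs. 4-block figures above: the proposal shrinks reorg depth by the ratio of the cadence gaps, but a reorg still happens whenever the hang outlasts the wiggle time.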

The fundamental issue is the stress and delay being put on validators that is causing them to fail to propagate their blocks on time. I would argue that the only items we know aren’t the cause are the 2s block time / wiggle time, simply because we have so much data from the last year of those settings working just fine.

Sure, 5s blocks would help… slightly. But I don’t think the benefit is worth the cost and worth fundamentally changing such a core Polygon mechanism. I’d rather spend more time investigating the root cause and fix that.

7 Likes

Thank you @web3nodes for starting this discussion and @Thogard for the thoughtful and thorough response.

I’d appreciate some perspective on the following points. This feels like a pretty big change.

In an ideal world, I’d like to -

1 - Understand the severity of the problem on the network, particularly user impact

Is this causing a significant degradation of service to users?

2 - Understand the risk/reward benefit of the proposed change

The proposal states that the network performance shouldn’t be impacted in a detrimental way. Is that what the testnet proved?

3 - See this tested on a testnet run by the validator set or have the full internal testnet results published for review.

Depending on the above, I could see a few paths forward -

A - Proceed with the original proposal, at least as a temporary solution to the issue

B - Proceed with an alternate proposal, i.e. different block interval and wiggle time durations or increasing validator hardware requirements

C - Do nothing for now, while a more complete and deterministic analysis is conducted

Right now, I don’t feel I have enough information to make an informed decision one way or another.

Hi Chris -

A few thoughts.

  1. Testnet will not be a viable simulation for testing whether or not this proposal fixes the current problem. Testnet does not have the massive txpool buildup, nor does it have any of the bots or listen-only nodes that plague Mainnet and cause propagation issues.

  2. While I am against increasing the default blocktime, I’m not against increasing the wiggle time. At present, the backup validators are effectively neutered - their blocks get reorged anyway, so why release them? Better to have a very late block than a fast block that gets reorged.

A 2s blocktime with a 10s wiggle time that incrementally decreases based on how far into the sprint the head is would be ideal imo.

1 Like
  1. We still think that the block time should be raised. If 5 seconds seems too drastic of a change, 3-4 seconds might be good too (with the wiggle of 10s). Please note that the average block time has been well above 2 seconds and closer to 3 seconds in many cases recently, across all validators:

    Some validators are even worse, often taking 4+ seconds per block.

  2. Another issue we’re hoping to solve is the problem of the base fee being close to zero and not scaling since we raised the block gas limit from 15M to 30M. This basically halted $MATIC burns. We think the minimum priority fee of 30 Gwei should stay to discourage spam and to cover the increasing cost of running validators (which has gone up dramatically over the past year), but we definitely also have to either adjust the block space down to 20M or increase the block interval to restore the healthy, organic scaling of the base fee.
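
The base-fee point can be illustrated with a simplified EIP-1559-style update rule (a sketch with made-up numbers, not Polygon's exact implementation: the fee rises when blocks use more gas than the target, conventionally half the limit, and falls otherwise):

```python
def next_base_fee(base_fee, gas_used, gas_limit, denom=8):
    """Simplified EIP-1559-style update: base fee moves toward the
    gas target (half the limit) at up to 1/denom per block."""
    target = gas_limit // 2
    return max(base_fee + base_fee * (gas_used - target) // target // denom, 0)

# Hypothetical steady demand of ~10M gas per block:
fee_15 = fee_30 = 100  # arbitrary starting base fee
for _ in range(10):
    fee_15 = next_base_fee(fee_15, 10_000_000, 15_000_000)  # 7.5M target
    fee_30 = next_base_fee(fee_30, 10_000_000, 30_000_000)  # 15M target
# With the 15M limit, demand exceeds the target and the fee climbs;
# with the 30M limit, the same demand sits below target and the fee decays.
```

This is why doubling the gas limit without a corresponding change in demand pinned the base fee near zero and halted the burns.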

You are misinterpreting that block time graph. If a primary is down, the backup will be releasing 4s blocks. If the first backup is down, the next will release 6s blocks… etc. That doesn’t mean that the validators who made the backup blocks took 4/6s to do so… just that a primary had failed.

If you are concerned about costs, DM me in discord.

1 Like

I’m not a validator, but Polygon asked the Aavegotchi team what we thought about the issue, and here is my reply:

I agree with @Thogard that changing the blocktime is probably not going to alleviate the issue, and may end up making it worse. It’s going to lead to more tx buildup in the mempool as txns won’t be getting processed as quickly.

I’ve had several discussions with Thogard in the #validators channel in Polygon discord. I even suggested that he apply for an ecosystem grant to publish some of the research he has done, and the modifications he and his team have made to their validator nodes to help improve performance.

I believe validators should explore every possible optimization before changing such an important constant in the ecosystem.

1 Like

Main dev of Gains Network (gTrade leveraged trading) here. I agree with Thogard that it is not obvious this will fix the reorg issue.

The first step of fixing this problem is identifying the real cause of the issue. The polygon chain has worked well for months with an even higher transaction count than now without these reorg issues.

Therefore I do not think it is obvious that the problem comes from the low block time, and I would like to see at least some evidence of it, and more investigation into multiple possible causes.

I’m not an expert in this field, but to me, with 100 validators, the current state of the tech should be able to handle 2-second blocks.

From a UX perspective, I also think it is a big mistake, because use cases like ours benefit greatly from the 2-second blocks. With our Chainlink architecture, it means a trade can be opened in as fast as 4 seconds (one tx for the user, one tx for our Chainlink nodes). With this change, it would go to 10 seconds minimum.

The Polygon PoS chain makes a compromise on decentralization to bring fast transactions and low fees. Increasing the transaction times is a big threat to its use case, in my opinion.

I would be very surprised if the conclusion is we need to increase block time, I’m sure a lot of things can be optimized, including the mechanism that causes the reorgs with the other validators producing blocks.

That being said, as mentioned, I’m not an expert - just sharing my opinion and feedback, and how this would affect the UX and the value proposition of Polygon PoS.

Thanks.

1 Like

From my perspective, the most frustrating element of the reorgs is that in the majority of situations they are detectable before they happen. It is my view that this shouldn’t be the case - if I can predict a reorg before it happens, that means the nodes should be able to as well.

Most reorgs are caused from a scenario like this:
(EDIT: note that validator sprints are 64 blocks long. Whichever chain - the backup or the primary - makes it to block 64 first and is used by the primary of the next sprint is the main chain.)

  1. Validator A suffers some sort of failure at the start of their sprint and is unable to propagate blocks.
  2. Validator B is validator A’s backup. After 2s go by with no block from validator A, validator B makes the backup block. Then another 2s go by and validator B makes another backup block. Validator B’s blocktime is 4s because validator B has to wait and make sure it doesn’t receive a block from validator A before it starts on its own. Let’s say for the sake of this example, validator B makes 10 ‘backup’ blocks and it takes them 40s to do so.
  3. Validator A comes back online and starts making their own blocks. Validator A does not accept any of validator B’s blocks
    Note: if validator B isn’t past block 32 at this point, then I know there will be a reorg… it just hasn’t happened yet.
  4. Another 20 seconds go by. During those 20s, validator B has made an additional 5 blocks (because its blockTime is 4s: 20/4=5) and they are now on block 15. Validator A has made 10 blocks during those 20 seconds (its blocktime is 2s due to being the primary. 20/2=10).
  5. Another 20 seconds go by. Validator B made another 5 blocks, and validator A made another 10 blocks. They’ve now each made a total of 20 blocks. As soon as the block number from validator A reaches or exceeds the block number for validator B, validator A’s ‘chain’ becomes the primary chain. Everyone’s node will now recognize validator A’s chain as the main one and everyone will experience a 20 block reorg as the 20 blocks we received from validator B are ‘undone’ and replaced with the 20 received from validator A.
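
The timeline in steps 2-5 can be tallied directly (a simplified model: the backup chain grows one block per 4s from t=0, and the primary grows one block per 2s once its 40s hang ends):

```python
hang, primary_bt, backup_bt = 40, 2, 4  # seconds, per the example above

for t in (40, 60, 80):
    backup_height = t // backup_bt
    primary_height = max(0, (t - hang) // primary_bt)
    print(f"t={t}s  backup={backup_height}  primary={primary_height}")

# t=40s: backup=10, primary=0   (step 2: ten backup blocks)
# t=60s: backup=15, primary=10  (step 4)
# t=80s: backup=20, primary=20  (step 5: primary catches up, 20-block reorg)
```

The reorg depth is set entirely by the race: the primary gains one block on the backup every 4 seconds, so it always eventually catches up within a sprint unless the backup reaches block 64 first.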

Would a 5s blocktime decrease the number of reorgs? Well… only if the initial ‘freeze’/‘hang’ time of validator A was less than 10s. That’s because with 5s blocktimes, the backup blocktime would be 10s, so validator A would come out of the freeze and release block 1 around the same time as validator B’s block 1, and it would only overwrite the top block (in other words, validator B wouldn’t have released a second block prior to validator A’s first block). (backup: 15/10 = 1.5, round down to 1; primary: (15-10)/5 = 1)

Is this common? Well, we can tell the duration of the average validator reorg-causing hang/freeze by looking at how many blocks get reorged. In the current system, a 10s hang would mean the backup validator has propagated 4 blocks by the time the primary catches up (at 18s, the backup has 4: 18/4 = 4.5, round down to 4; the primary has 4: (18-10)/2 = 4).

So how many reorgs would this change prevent? Go here and take a look:

While I am against increasing the blocktime, I am 100% on board with increasing the wiggle time. Wiggle time is how long the backup must wait on the primary before it can start making its own block.

What I’d like to see is an incrementally decreasing wiggle time based on how many blocks into the sprint we are. For example, if we are on the first block of the sprint, then it will be very easy for the primary to catch up and reorg the backup… but if we’re towards the end of the sprint, then it will be quite difficult, as the primary has less time to catch up to the backup.

As an extreme example: imagine a scenario where the wiggle time was 10s for the first block, 8s for the second, 6s for the third… etc., until you get to 0s for the 6th. From how I understand it (and I am not by any means an expert on consensus here, so if this is wrong someone please correct me), this would have a significant impact on the depth of the reorgs. If the backup makes it to the 6th block before the primary recovers, then there is no chance that the primary could then pass the backup, since both will now have 2s blocktimes. Before then, however, the primary would have 12 + 10 + 8 + 6 + 4 + 2 = 42s to make 6 blocks, meaning that with 2s blocktimes it could have a hang/freeze of at most 30s (42 - 6x2 = 30).
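
The arithmetic for this extreme example checks out quickly (the 12s figure is the first backup block's effective cadence: 10s wiggle plus the 2s block time):

```python
BLOCK_TIME = 2                 # seconds per block
wiggles = [10, 8, 6, 4, 2, 0]  # decreasing wiggle for the first six backup blocks

backup_cadence = [w + BLOCK_TIME for w in wiggles]  # 12, 10, 8, 6, 4, 2
backup_total = sum(backup_cadence)                  # 42s for six backup blocks
primary_total = len(wiggles) * BLOCK_TIME           # 12s for six primary blocks
max_hang = backup_total - primary_total             # longest survivable freeze
print(backup_total, max_hang)  # 42 30
```

So under this hypothetical schedule, any primary freeze longer than 30s still loses the race to the backup, while shorter freezes no longer cause shallow reorgs at all.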

I like this scheme because it lessens the depth of the reorgs at all levels and removes the possibility for small reorgs (it’s even more effective than the 5s blocktime/wiggletime for preventing small reorgs) and it prevents large reorgs as well.

Consensus would have to be changed, and I’m sure there might be some ways to ‘game’ the system that would need to be accounted for… we’d also need to make sure the wiggle time’s incremental decrease gets reset back to 12s whenever the backup receives a block from the primary that matches their block number (to prevent issues if the primary is slightly late halfway through and the backup immediately takes over).

I haven’t really modelled this out yet - it’s just kind of an idea that’s been bouncing around in my head. But I think it would help the chain far more in every measurable metric than the 5s blocktime proposal would. Its only downside is the possibility of 12s / 10s blocks at the beginning of a sprint… but you’d be able to rest comfortably knowing that if it was a 4s block it’d probably get reorged anyway (and the reorg’d replacement block would probably be more equivalent to a 30s block lol). I bet someone else could probably come up with an even better version of this concept.

But yeah - I’d love to model this out. This isn’t related to the tx propagation stuff I’m working on. It would be fun to do.

1 Like

Also, I want to point out that if you look at that ‘Forked Blocks’ page you’ll see a validator listed along with the time and depth of the reorg. Please be aware that the validator listed is the backup validator who was the victim of the reorg. The backup released blocks on time like they were supposed to, and these blocks were picked up by the explorer’s RPC. They were then reorg’d when they were overwritten by a primary validator coming back online after experiencing a long freeze/hang. The ‘perpetrator’ of the reorg can be found by putting the reorg’d block number into the explorer and seeing who the actual validator was.

So don’t be mad at 0xbdbd4347b082d9d6bdf2da4555a37ce52a2e2120
Instead, look at 0xbc6044f4a1688d8b8596a9f7d4659e09985eebe6

1 Like

Looks like a very interesting starting point. I’m wondering, however, why the Polygon team itself doesn’t have a team working full-time to optimize this tech? Or if that’s already the case, it would be great to know about the progress. I’ve heard Polygon PoS v3 is being worked on, and maybe it also fixes this problem.

It seems there’s a ton of work to fix the reorgs in a clean manner, and it will require more than just a community effort, imo. I’m not even sure why we’re proceeding this way using a forum; after 2 months it looks like nearly no progress has been made.

Any update on this from the team would be greatly appreciated. I love Polygon; that’s why I want it to succeed and why I’m giving my feedback in an honest manner. Thanks.

2 Likes

@Seb your first question / point is the easiest :smiley: Polygon does indeed have a full-time team actively building and optimizing new and existing tech. There is a focused effort on researching the cause of these re-orgs - with related issues of blocktime - and strategies to mitigate or fix them. We greatly appreciate community feedback and analysis as well.
@Thogard’s analysis is quite good, particularly wrt the possible role of the primary block producer in causing re-orgs. Wiggle time is also one of the mitigation strategies we are researching and testing. More to come soon and I will be dropping into the validator channel(s) on Discord to discuss there as well.