Polygon zkEVM: Recent Network Outage Report

J_Nicolas · March 28, 2024, 3:19pm

On March 22, the core engineers for Polygon zkEVM began receiving reports from RPC infrastructure providers on the network that they were having issues synchronizing the state of the network. The Polygon zkEVM team was unable to reproduce these errors. On March 23, the team attempted to resolve the issue by resyncing the network using L1 sequenced data, which led to detection of two different correct Global Exit Roots.

In the process of resolving this issue, the network was down for a total of 14 hours. Polygon PoS, chains built with Polygon CDK, and chains connected to the AggLayer were unaffected.

Resolving the outage resulted in an emergency upgrade to the Verifier, as well as a reindexing on Etherscan and a resynchronization of permissionless nodes. For dApps on Polygon zkEVM: some transactions made on March 22 and March 23 were affected by a network reorg and may have been processed in a different block or may not have been processed at all. If this is the case for you, please get in touch below. Approximately 4,000 transactions may have been affected.

Description of Incident

The network outage was caused by a reorg on the underlying L1, a relatively common situation, usually involving a depth of 1 or 2 blocks. The original version of the reorged block included a deposit transaction from Polygon zkEVM that generated an update on the Global Exit Root, which is the root of the bridge’s global exit Merkle Tree. The reorged block that followed did not include this deposit transaction.

The synchronizer for Polygon zkEVM did not correctly detect this reorg, and the stale record was neither deleted from nor updated in the State Database for over two epochs, approximately 12 minutes. As a result, the sequencer included the incorrect Global Exit Root in the next L2 block. The actual state of the network was different from the one that was published.
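The detection step that failed here can be illustrated with a minimal sketch. It is not the zkEVM node's implementation: `StoredBlock`, `find_reorg_point`, and `rollback` are hypothetical names, and the sketch assumes the synchronizer keeps the hash of every L1 block it has indexed and can ask L1 for the current canonical hash at the same height.

```python
from dataclasses import dataclass, field


@dataclass
class StoredBlock:
    number: int
    block_hash: str
    # Global Exit Root updates recorded from this L1 block (hypothetical field).
    ger_updates: list = field(default_factory=list)


def find_reorg_point(stored, canonical_hashes):
    """Return the lowest block number whose stored hash disagrees with L1, or None."""
    for block in sorted(stored, key=lambda b: b.number):
        canonical = canonical_hashes.get(block.number)
        if canonical is not None and canonical != block.block_hash:
            return block.number
    return None


def rollback(stored, from_number):
    """Drop every stored block at or above from_number, together with its GER updates."""
    return [b for b in stored if b.number < from_number]


# Block 101 carried a GER update, then was replaced by a different block on L1.
stored = [
    StoredBlock(100, "0xaa"),
    StoredBlock(101, "0xbb", ["ger-1"]),
    StoredBlock(102, "0xcc"),
]
canonical = {100: "0xaa", 101: "0xb2", 102: "0xc2"}

point = find_reorg_point(stored, canonical)
if point is not None:
    stored = rollback(stored, point)  # the stale GER update is discarded here
```

In this sketch the rollback removes the orphaned Global Exit Root update before anything downstream can consume it, which is the step that did not happen during the incident.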

As a result of the reorg, many transactions following this block returned an invalid nonce. Valid proofs for these transactions were generated, but the transactions themselves resulted in no-ops, as if they had never been submitted. The analysis of the incident, development of the fixes, and synchronization of these no-op transactions delayed recovery of the network.
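To make the no-op behavior concrete, here is a simplified sketch of nonce checking during batch execution. It is an illustration under assumptions, not the Prover/Executor's logic: `apply_batch` is a hypothetical helper, and a transaction is reduced to a `(sender, nonce)` pair.

```python
def apply_batch(nonces, txs):
    """Apply txs in order; a tx with an unexpected nonce leaves the state untouched.

    nonces: dict mapping sender -> next expected nonce.
    txs: list of (sender, nonce) pairs.
    """
    state = dict(nonces)
    results = []
    for sender, nonce in txs:
        if nonce == state.get(sender, 0):
            state[sender] = nonce + 1
            results.append("applied")
        else:
            # The tx stays in the batch, but it changes nothing.
            results.append("no-op")
    return state, results


state, results = apply_batch({"alice": 5}, [("alice", 5), ("alice", 7), ("alice", 6)])
```

Here the second transaction is skipped because its nonce does not match the sender's expected nonce, while the batch as a whole remains processable.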

Resolving these issues ultimately required putting the network into an Emergency State, a security mechanism that allows the network to be upgraded without a timelock, conditional on the approval of the Security Council. The Security Council is a 6/8 multisig, with two members from Polygon Labs. On March 24, at 23:00 UTC, the network was halted and upgraded. This is the first time that the Emergency State mechanism has been used. For more on the network’s security mechanisms, see zkEVM’s governance model: Security Council.

Solution

Because the synchronizer handled L1 reorgs incorrectly, updated versions of the Node and Prover were released. Those repos and changelogs can be found here:

Node: v0.6.4
Prover/Executor: v6.0.0

To prevent this issue from recurring, the Polygon zkEVM team introduced an additional protection in the sequencer that ensures that the L1InfoTreeIdx, which contains the Global Exit Root, timestamp, and L1root, is correctly minted on the underlying L1, even in the event of a reorg.
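One way such a protection could work is to let the sequencer use only L1 info tree entries whose L1 block can no longer be reorged. The sketch below assumes exactly that; it is not taken from the Node release, and `L1InfoTreeLeaf`, `latest_safe_leaf`, and the `finalized_l1_block` parameter are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class L1InfoTreeLeaf:
    index: int             # position of the entry in the L1 info tree
    global_exit_root: str
    timestamp: int
    l1_block_number: int   # L1 block that carried the update


def latest_safe_leaf(leaves, finalized_l1_block) -> Optional[L1InfoTreeLeaf]:
    """Return the newest leaf whose L1 block is at or below the finalized block."""
    safe = [leaf for leaf in leaves if leaf.l1_block_number <= finalized_l1_block]
    return max(safe, key=lambda leaf: leaf.index, default=None)


leaves = [
    L1InfoTreeLeaf(0, "0x01", 1000, 50),
    L1InfoTreeLeaf(1, "0x02", 1012, 60),
]
# With L1 finalized up to block 55, only the first leaf may be used.
chosen = latest_safe_leaf(leaves, finalized_l1_block=55)
```

The trade-off in this design is latency: new Global Exit Roots become usable on L2 only after the L1 block that carried them is considered final.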

Timeline of Events

March 22, 2024

18:39 UTC: A reorg affecting a Global Exit Root update transaction occurred on the L1, but the trusted synchronizer could not handle it correctly.

This caused the timestamp and Global Exit Root used for the block production to be incorrect.
The network continued to process transactions with a virtual state that was incorrectly timestamped.

March 23

09:02 UTC: A resync of the network using L1 sequenced data detected two different correct Global Exit Roots. Downtime began.

10:59 UTC: It was discovered that some L2 batches were invalid because they did not have the correct timestamp.

18:07 UTC: A new version of the Prover/Executor was released to address this.

20:08 UTC: A new resync with the latest version of the Prover/Executor provided the final state of the network.

21:42 UTC: Two invalid batches were identified for which, as a result of the reorg, it was not possible to generate a proof.

March 24

00:03 UTC: Recovery of the RPCs and the sequencing of batches on L1. Network activity resumed with withdrawals halted.

01:00 UTC: A reindexing by Etherscan was performed.

The Security Council was subsequently informed of the network’s issues and approved placing the network into an Emergency State to upgrade the Prover and Verifier.

23:00 UTC: The network was halted to deploy a new Prover and Verifier.

23:40 UTC: All smart contract operations were completed and the Emergency State was lifted, with withdrawals enabled.

March 25

00:00 UTC: It was determined that a new version of the synchronizer was needed to update the external RPC infrastructure providers.

02:30 UTC: Node v0.6.4 and Prover v6.0.0 were released.

02:35 UTC: The pending batches were correctly verified and the network continued operating normally.