Recent Downtime on Astar zkEVM

Earlier this week, an outage on Astar zkEVM required an emergency upgrade of the network. The issue was caused by the improper handling of an L1 reorg. To ensure full transparency, a Root Cause Analysis (RCA) will be made public on the Astar forum shortly.

Until then, here’s a summary of the incident, an update, and a look at the next steps. Please note, the upgrade and fix to Astar zkEVM have been implemented and the network is operating smoothly.

  • Resolving the issue required upgrading the network, which took approximately 5.5 hours due to an unforeseen issue when restarting the synchronizer.
  • While a limited number of wallets were affected by the reorg, those assets will be reinstated. More details are coming soon on the Astar forum.
  • Additional context is available on the Astar forum, here: https://forum.astar.network/t/astar-zkevm-network-upgrade-report/6633
  • Acknowledging concerns about on-chain data, the Astar zkEVM team is actively collaborating with ecosystem projects to address them effectively.
  • The Yoki Origins campaign is going live on April 14, at noon (JST). Stay tuned to the Astar forum for more.

Next steps for impacted users
User funds are safe and assistance is readily available. The Astar Foundation and Polygon Labs teams are committed to supporting users affected by the outage.

Please fill out this form if you need support: https://forms.gle/GLWWBH7xkeLZzrY26

Check the Astar blog for a comprehensive update on the incident, which will be shared with the community shortly.

Next steps for projects
The Astar Foundation has asked developers to update their nodes following completion of the upgrade. More details will be shared on the Astar forum soon.

Next steps for Astar & Polygon Labs
Polygon Labs and the Astar zkEVM team will continue to maintain open communication with the community, keeping it informed of the latest updates and ensuring transparency. The teams are also committed to providing a detailed breakdown of the incident.


As mentioned, here is the full Root Cause Analysis of the recent outage on Astar zkEVM:

Summary
At 06:24 UTC on April 5, the Polygon Labs team was notified of issues on the Astar zkEVM mainnet: the permissionless synchronizer had halted at block 1,132,274. When the Polygon Labs team began investigating, they found that Astar nodes were unable to synchronize the latest state of the network, while the trusted Gelato nodes were operating normally. Upon additional inspection, the Polygon Labs team discovered a discrepancy between the batches generated by the Astar trusted sequencer and those obtained from the RPC. Specifically, block #1131566 and batch #18540 were identified as mismatching.
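
For illustration, this kind of divergence can be surfaced by comparing the same block on the trusted RPC and on a permissionless node. The sketch below is a minimal example using the standard eth_getBlockByNumber call; the endpoint URLs are placeholders, not the actual infrastructure involved in the incident.

```python
import json
import urllib.request

# Placeholder endpoints -- substitute the trusted RPC and a permissionless node.
TRUSTED_RPC = "https://trusted-rpc.example"
LOCAL_RPC = "http://localhost:8545"

def block_hash(rpc_url: str, block_number: int) -> str:
    """Fetch a block's hash via the standard eth_getBlockByNumber call."""
    payload = json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getBlockByNumber",
        "params": [hex(block_number), False],
    }).encode()
    req = urllib.request.Request(
        rpc_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]["hash"]

# Block 1131566 was the first mismatching block identified in this incident.
block = 1131566
trusted, local = block_hash(TRUSTED_RPC, block), block_hash(LOCAL_RPC, block)
print("match" if trusted == local else f"MISMATCH trusted={trusted} local={local}")
```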

On April 07, following Polygon Labs’ analysis of the Astar database provided on April 05, the root cause was discovered. The exit_root table was affected by an L1 reorg at block 19594622: onchain data showed there was no exit root update in that block, so the entry the synchronizer had recorded no longer matched the canonical L1 chain. This was the reason permissionless nodes were unable to synchronize the state of the network.
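
To illustrate how such a discrepancy can be confirmed against L1, the hypothetical sketch below queries Ethereum for logs emitted by the rollup’s global exit root contract in the block the synchronizer had recorded. The endpoint and contract address are placeholders; the point is only that, after the reorg, the canonical chain no longer contained the expected update at that block.

```python
import json
import urllib.request

L1_RPC = "https://l1-rpc.example"  # placeholder Ethereum endpoint
# Placeholder address for the rollup's global exit root contract.
EXIT_ROOT_CONTRACT = "0x0000000000000000000000000000000000000000"
REORGED_BLOCK = 19594622

def logs_in_block(rpc_url: str, address: str, block: int) -> list:
    """Return all logs emitted by `address` in a single L1 block (eth_getLogs)."""
    payload = json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getLogs",
        "params": [{
            "address": address,
            "fromBlock": hex(block),
            "toBlock": hex(block),
        }],
    }).encode()
    req = urllib.request.Request(
        rpc_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]

if not logs_in_block(L1_RPC, EXIT_ROOT_CONTRACT, REORGED_BLOCK):
    # The synchronizer's exit_root table records an update here, but the
    # canonical (post-reorg) chain contains no matching event.
    print(f"No exit root update in block {REORGED_BLOCK}; local state is stale")
```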

The Polygon Labs team recommended re-executing the resyncing operation. Ultimately, the Polygon Labs, Astar, and Gelato teams determined that the resync should be performed on April 08, during the already-planned network upgrade from Fork ID7 to Fork ID9. As an additional precaution to avoid proving the suspected faulty batch, the teams agreed to decrease the number of provers until after the resync was complete.

When attempting to re-sync on April 08, the Polygon Labs team discovered that a batch from April 05 had been incorrectly sequenced with an invalid state. This confirmed the state synchronization issues that permissionless nodes were seeing.

Resolution
When the resync failed, core engineers for Polygon zkEVM recommended proceeding with the mainnet upgrade from Fork ID7 to Fork ID9, which would allow for the required resync of the invalid batch. This upgrade was expected to take 3.5 hours; there were approximately 4,000 batches in the backlog.

The Astar team was informed that resolving the issue completely required a reorg, the size of which was difficult to estimate without additional investigation. Additionally, the longer Astar held off on triggering the upgrade, the greater the reorg would be; moving quickly with the upgrade was determined to be the best course of action. Astar’s management team agreed and requested that Polygon team members be on-call for any issues that arose following the upgrade.

The upgrade began at 18:00 UTC and ultimately took 5.5 hours.

An unexpected delay in restarting the sequencer was caused by missing migration hotfixes. After the synchronizer was successfully restarted, the teams encountered a knock-on issue in the pool database, which was deadlocked due to the migration. The pool database contained over 500,000 transactions. The Gelato team suggested dropping the pool table so the migration could complete.
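
As an aside, this kind of lock contention can be inspected directly in PostgreSQL (the database backing the node’s pool service) by listing which sessions are blocked and which sessions are blocking them. The connection string below is a placeholder; the pg_blocking_pids / pg_stat_activity query itself is standard PostgreSQL.

```python
import psycopg2  # assumes the psycopg2 driver is installed

# Placeholder DSN -- point this at the node's pool database.
conn = psycopg2.connect("dbname=pool_db user=pool_user host=localhost")

# Standard PostgreSQL introspection: list blocked sessions and the sessions
# holding them up, which is how a stuck migration typically shows itself.
QUERY = """
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for blocked_pid, blocked_q, blocking_pid, blocking_q in cur.fetchall():
        print(f"pid {blocked_pid} is blocked by pid {blocking_pid}: {blocking_q[:80]}")
```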

When the synchronizer was restarted, Fork ID detection did not work. To address this, the Polygon Labs team manually modified the fork_id table to specify that the synchronizer should begin processing blocks from before the upgrade transaction. This worked, but then immediately failed with a different error, because sanity checks in the node prevent processing a batch with a different Fork ID.
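
The manual intervention described here amounts to editing a row in the node’s state database. The sketch below is purely illustrative: the table and column names are hypothetical (the actual schema depends on the node version), the block number is a placeholder, and this is not the exact statement that was run during the incident.

```python
import psycopg2  # assumes the psycopg2 driver is installed

# Placeholder DSN -- point this at the node's state database.
conn = psycopg2.connect("dbname=state_db user=state_user host=localhost")

# Hypothetical schema: a fork_id table that records the L1 block from which
# each fork applies. Lowering block_num tells the synchronizer to start
# processing blocks from before the upgrade transaction.
NEW_FORK_ID = 9
BLOCK_BEFORE_UPGRADE = 0  # placeholder, not the real L1 block number

with conn, conn.cursor() as cur:
    cur.execute(
        "UPDATE state.fork_id SET block_num = %s WHERE fork_id = %s",
        (BLOCK_BEFORE_UPGRADE, NEW_FORK_ID),
    )
```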

When the teams attempted to start the new version of the bridge service, another fix was found to be missing. The quickest solution was to downgrade the bridge service. Following that, the synchronizer was restarted and normal network operation was recovered.

Scope of Impact
Resolving the issue required upgrading the network, resulting in 5.5 hours of downtime.

Whys
On April 06, in an effort to get Astar zkEVM mainnet back online quickly, the Polygon team recommended a resync. The recommendation was well intentioned, but it was made without a complete understanding of the real impact of the L1 reorg. Had that been fully understood, the Polygon team would have recommended rolling the network back further.

In retrospect, several factors exacerbated the situation:

  • The verification gap in this particular network was large (see the sketch after this list). If the network had been verifying batches faster, it would have halted on its own on April 06, which may have been the preferred outcome.
  • The network’s Data Availability Committee (DAC) and every other RPC node outside of Gelato were unreachable, which made it difficult to assess whether the trusted state was indeed wrong or whether it was an RPC issue with the synchronizer or executor.
  • Testing or simulating these procedures is time-consuming and complex. The process of taking a snapshot, exchanging it, and restoring it to create a simulation environment had not been operationalized.
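
For reference, the verification gap mentioned in the first bullet can be monitored through the extended JSON-RPC namespace exposed by Polygon zkEVM nodes (zkevm_batchNumber, zkevm_virtualBatchNumber, zkevm_verifiedBatchNumber). The endpoint in the sketch below is a placeholder.

```python
import json
import urllib.request

RPC = "https://zkevm-node.example"  # placeholder node endpoint

def rpc_call(method: str) -> int:
    """Call a zero-argument zkEVM JSON-RPC method and decode the hex result."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": []}
    ).encode()
    req = urllib.request.Request(
        RPC, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return int(json.load(resp)["result"], 16)

trusted = rpc_call("zkevm_batchNumber")           # latest trusted batch
virtual = rpc_call("zkevm_virtualBatchNumber")    # latest batch sequenced to L1
verified = rpc_call("zkevm_verifiedBatchNumber")  # latest batch proven on L1

# The verification gap: batches sequenced to L1 but not yet proven.
print(f"trusted={trusted} virtual={virtual} verified={verified} "
      f"gap={virtual - verified}")
```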

Repair items
Issue 1: RPC providers, block explorers and indexers are out of sync.

  • Resolution: CDK team to provide a new validium client package that allows RPC providers to seamlessly sync the node from scratch without any manual intervention.
  • Status: The package has been built (CDK tag v0.6.4+cdk.6) and is currently being tested; it has been shared with the Astar team so they can test concurrently if they wish.

Issue 2: The sequencer balance is now similar to what it was three days ago. The exact number of transactions affected by the reorg is still being determined.

  • Resolution: The Polygon Labs team conducted an analysis of database logs to identify impacted transactions.
  • Status: The Polygon Labs team has verbally shared high-level figures for the number of potentially affected transactions. The product team will share that data directly with Astar.

Issue 3: The reorg also affected transactions on third party bridges. How best to address these is a work in progress.

  • Status: The Polygon Labs and Astar teams reviewed the financial impact of identified third-party bridge issues. Efforts are underway to explore compensation options.

I believe you will do better. Thank you for your contributions.