Postmortem - Polygon PoS Outage & Learnings

On March 10, 2022 at 04:49 UTC, our Core Development team received an alert indicating an issue with the state-sync mechanism on the Polygon PoS Mainnet. The Heimdall layer in Polygon PoS handles state-sync communication between the bridge contracts on Ethereum Mainnet and the Polygon PoS side chain. A state-sync transaction carrying a large data payload could not be processed because it exceeded Heimdall's transaction gas limit. The following is an account of the remediation steps taken by Polygon's Core Development team in coordination with its validator and infrastructure partners.

Root Cause
Heimdall v0.2.5 had an error in its data size check that allowed a large transaction to clog the state-sync mechanism. Heimdall v0.2.6, with PR #781, was released to fix the size check by reducing the limit from 100 KB to 50 KB. Because the change touches a check in transaction validation, it required a mandatory upgrade (hard fork). However, the team erroneously released it as an optional upgrade.

The incremental rollout of Heimdall v0.2.6 caused a state mismatch when the original large transaction reappeared among Mainnet nodes that were now running different Heimdall versions (v0.2.5 and v0.2.6, each with different data size limit logic). This mismatch exacerbated the original issue and caused the Heimdall chain to halt. Bor depends on Heimdall for subsequent spans, so it halted too until hotfix Bor v0.2.14 was released. Over the next few days the team pushed further Heimdall releases that ultimately returned the chain to normal operation.
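To illustrate why a partially deployed validation change stalls a chain, here is a minimal Python sketch. It is not Heimdall code: the function `apply_block`, the byte limits and the hashing scheme are all assumptions made for illustration. It only shows that two nodes applying different size limits to the same block end up with different state hashes, which is the kind of disagreement that surfaces as an AppHash mismatch.

```python
import hashlib
from typing import Sequence

# Hypothetical size limits in bytes, standing in for the two versions' checks.
LIMIT_OLD = 100_000  # roughly the limit before the fix
LIMIT_NEW = 50_000   # roughly the limit after the fix


def apply_block(prev_app_hash: str, txs: Sequence[bytes], size_limit: int) -> str:
    """Fold every tx that passes the size check into a running state hash."""
    digest = hashlib.sha256(prev_app_hash.encode())
    for tx in txs:
        if len(tx) <= size_limit:
            digest.update(tx)  # accepted txs change the state
        # rejected txs leave the state untouched
    return digest.hexdigest()


# One small tx and one tx that only the old limit accepts.
block = [b"a" * 1_000, b"b" * 80_000]

hash_old = apply_block("genesis", block, LIMIT_OLD)
hash_new = apply_block("genesis", block, LIMIT_NEW)

print(hash_old == hash_new)  # prints False: the two versions disagree on state
```

If every transaction in the block were under both limits, the two hashes would match, which is why the mismatch only appeared once the large transaction resurfaced.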

Actions Taken
Approximately six hours after the original alert, the team released Heimdall v0.2.6 to address the data size limit bug by lowering the transaction size limit by 50% (from around 100 KB to 50 KB). The upgrade was distributed to the top nodes first. When a small number of them had upgraded and come back online, they caused an AppHash mismatch at Heimdall height 8588756. At this point the Heimdall chain was no longer progressing and the Polygon bridges were stopped, so the team decided to stop all official web wallet deposits.

Bor stopped after the subsequent span was completed, as no additional span information was available from Heimdall. The team decided to work on a new Heimdall release while also pushing a hotfix for Bor that included around 50 hardcoded spans. This returned block production to the Polygon Mainnet, but the bridges were still offline.
At this point, internal and external RPC nodes were experiencing issues and Polyscan was also behind. Once the Bor hotfix was applied to these nodes, they started coming back online.
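The idea behind the hardcoded spans can be sketched as a lookup with a fallback. This is a simplified Python illustration, not the Bor hotfix itself: the span IDs, the `HARDCODED_SPANS` table, the placeholder producer set and the `resolve_span` helper are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical table of about 50 spans shipped with the binary.
# Real spans carry validator and producer data; a string stands in for it here.
HARDCODED_SPANS: Dict[int, str] = {
    span_id: "last-known-producer-set" for span_id in range(100, 150)
}


def resolve_span(span_id: int, heimdall_spans: Dict[int, str]) -> Optional[str]:
    """Prefer the span reported by Heimdall, else fall back to the hardcoded one."""
    span = heimdall_spans.get(span_id)
    if span is not None:
        return span
    # Heimdall is halted or has no data for this span: use the shipped table.
    return HARDCODED_SPANS.get(span_id)


# While Heimdall is halted it returns nothing, so the fallback keeps blocks flowing.
print(resolve_span(120, heimdall_spans={}))
# Past the end of the table there is nothing left to fall back to.
print(resolve_span(200, heimdall_spans={}))
```

A fixed table like this only buys time: once the hardcoded spans run out, block production depends on Heimdall again.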

Validators also passed Heimdall Proposal 9 to temporarily allow larger state-sync transaction data. This was accomplished by increasing the transaction gas limit to 2.5m (up from 1m) until after the Heimdall hard fork.

The team then made Heimdall v0.2.7 available. This fix included an overwrite of the 50 hardcoded spans plus a small rollback function to return Heimdall on all nodes to a previously working block (height 8588755).

From the release notes: “Rollback overwrites a state height n with the state n-1. The application also rolls back to height n-1. No blocks are removed, so upon restarting…transactions in block n will be re-executed against the application.”
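The rollback semantics quoted above can be sketched with a toy node in Python. Everything here is an assumption for illustration: the `ToyNode` class, the integer "state" and the two validation rules are invented, and the real rollback operates on the node's stored state rather than on a dictionary.

```python
from typing import Callable, Dict, List

ApplyTx = Callable[[int, int], int]


class ToyNode:
    """Keeps blocks and per-height application state in memory."""

    def __init__(self) -> None:
        self.blocks: Dict[int, List[int]] = {}
        self.states: Dict[int, int] = {0: 0}
        self.app_height = 0

    def execute(self, height: int, txs: List[int], apply_tx: ApplyTx) -> None:
        state = self.states[height - 1]
        for tx in txs:
            state = apply_tx(state, tx)
        self.blocks[height] = txs
        self.states[height] = state
        self.app_height = height

    def rollback(self) -> None:
        """Overwrite state at height n with state n-1; blocks are kept."""
        n = self.app_height
        if n < 1:
            raise ValueError("nothing to roll back")
        self.states[n] = self.states[n - 1]
        self.app_height = n - 1

    def replay(self, apply_tx: ApplyTx) -> None:
        """Re-execute the retained block n against the rolled-back state."""
        n = self.app_height + 1
        self.execute(n, self.blocks[n], apply_tx)


def old_rule(state: int, tx: int) -> int:
    return state + tx  # accepts every tx


def new_rule(state: int, tx: int) -> int:
    return state + tx if tx <= 50 else state  # skips oversized txs


node = ToyNode()
node.execute(1, [10], old_rule)
node.execute(2, [20, 80], old_rule)  # state at height 2 is 110
node.rollback()                      # state at height 2 becomes 10, app height 1
node.replay(new_rule)                # block 2 re-executed: 10 + 20 = 30
print(node.states[2], node.app_height)  # prints: 30 2
```

The point of the sketch is that the block itself survives the rollback; only the result of executing it changes once every node applies the same rule.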

We had to instruct validators to restart slowly and to change ports and enode IDs so that any node still running Heimdall v0.2.6 on the network would not pollute the newly restarted v0.2.7 nodes. Once 2/3+1 was achieved and state-sync transitions had been replayed, the Polygon Mainnet was back online. The bridges were active again and web wallet deposits were turned back on. The team kept monitoring, and after a while checkpoints resumed normally.
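As a rough illustration of the 2/3+1 threshold, the following Python sketch checks whether the validators that have restarted on the new version hold more than two thirds of the total voting power. It assumes a Tendermint-style rule where the threshold is measured in voting power rather than node count; the validator names, the power values and the `has_quorum` helper are hypothetical.

```python
from typing import Dict, Set


def has_quorum(power_by_validator: Dict[str, int], online: Set[str]) -> bool:
    """True when the online validators hold strictly more than 2/3 of all power."""
    total = sum(power_by_validator.values())
    online_power = sum(
        power for name, power in power_by_validator.items() if name in online
    )
    # Integer comparison avoids floating point rounding at the boundary.
    return 3 * online_power > 2 * total


# Hypothetical voting powers for four validators.
powers = {"val-a": 40, "val-b": 30, "val-c": 20, "val-d": 10}

print(has_quorum(powers, {"val-a", "val-c"}))  # prints False: 60 of 100
print(has_quorum(powers, {"val-a", "val-b"}))  # prints True: 70 of 100
```

Until the upgraded set crosses this threshold, the restarted nodes cannot commit blocks on their own, which is why a slow, coordinated restart was workable.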

Heimdall v0.2.7 was then widely released to other full nodes, and RPCs came back online.

With normal functionality restored, the team began work on Heimdall v0.2.8, which was released with PRs #792 and #791 and includes a hard fork at height 8664000. At that height, this release reconciles the spans that were hardcoded during the hotfix, and it introduces a new, reduced maximum size for state-sync data.

Future changes
In the short term, we are making changes in Heimdall to support better upgrade mechanisms, and we are working with the VitWit team to improve the upgrade process in future. The team is also making the state-sync mechanism more robust by adding a dynamic transaction gas limit, increasing the block gas limit and adding bulk state-sync transactions.

The team is also reviewing its testing procedures, internal audits and proactive peer reviews for complex changes and upgrades. Lastly, we are considering upgrades to the Mumbai testnet to make it a more suitable testing ground for node and network layer improvements.

In the long term, we have been brainstorming a redesign of the Heimdall/Bor architecture and an implementation of the next version of the chain in which the bridge mechanism is not tightly coupled to the consensus and core system of the chain. The next version, tentatively codenamed v3, will merge the Heimdall and Bor nodes and chains and will remove the span mechanism.

Acknowledgements
We would like to thank the VitWit team (especially Anil, Kaustubh and Sai) for helping the team throughout the incident: sending PR #787, reviewing subsequent PRs and Heimdall releases, coordinating the efforts to restart the Heimdall chain and brainstorming possible paths to recover Heimdall. Special thanks to Pete Kim for supporting the team throughout the incident, including implementing hotfix PR #786 on Heimdall and reviewing subsequent PRs during the last week. We would also like to thank the Informal Systems team, who quickly joined an emergency call to guide us through possible solutions to recover Heimdall.
A big thank you to all the validators who helped and supported us during this time. And many thanks to the developers and users who stayed patient during the whole incident.

We intend to revisit our release and communication processes to reduce the likelihood of similar issues in the future.
