PIP-32: Ancient data pruning

manav2401 · December 21, 2023, 1:31pm

Authors:

Type: Core

The most recent data (the last 90,000 blocks by default; the limit is configurable) is stored in LevelDB for faster reads and writes of recent block data and state. Data beyond this limit is moved to the Freezer Database, which stores historical data. For a full node, the Freezer Database stores block headers, bodies and receipts in a flattened raw binary format so that RPC requests over them can be served. Because bor is derived from geth, it already supports state pruning, which removes the old state that accumulates in the key-value store as the chain progresses. This reduces disk usage by removing state data that is no longer required.

Motivation

Currently, the disk sizes of the bor database on mainnet and testnet are as follows:

Mainnet:

Total datadir: 5 TB
Ancient dir: 1.7 TB

Testnet:

Total datadir: 461 GB
Ancient dir: 191 GB

The ancient data accounts for 34% and 41% of the total size in mainnet and testnet respectively.
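As a quick check of these percentages, here is a minimal Python sketch that recomputes them from the figures quoted above. The function name `ancient_share` is only illustrative, and the "after a full prune" figure is an upper bound that assumes all ancient data is removed.

```python
def ancient_share(total, ancient):
    """Return the ancient dir's share of the datadir as a percentage."""
    return 100.0 * ancient / total

# Figures quoted above (mainnet in TB, testnet in GB).
mainnet_share = ancient_share(5.0, 1.7)
testnet_share = ancient_share(461.0, 191.0)

# Upper bound on the mainnet datadir after pruning: all ancient data removed.
mainnet_after_full_prune = 5.0 - 1.7

print(f"mainnet: {mainnet_share:.0f}%, testnet: {testnet_share:.0f}%")
print(f"mainnet datadir after a full ancient prune: ~{mainnet_after_full_prune:.1f} TB")
```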

While this data is required for nodes to serve RPC requests on historical blocks, it is not a necessity for a node to store this data if its purpose is to produce/validate a block and participate in consensus (i.e. a validator node) or a regular sentry node. Currently, there’s no way to prune the Freezer/Ancient database from a full node.

Having a way to prune this data reduces the disk space needed by normal full nodes that don’t serve historical RPC requests. Moreover, it would be a step towards EIP-4444 for Polygon PoS (the approach is a bit different, but it moves in the same direction). This PIP proposes that the bor client gain the ability to prune ancient data, reducing overall disk space consumption.

Specification

This PIP proposes to introduce a new CLI command for pruning the ancient data as follows:

Usage: bor snapshot prune-block [options...]
  This command will prune ancient blockchain data at the given datadir location. Options:
  -datadir <value>                  Path of the data directory to store information
  -datadir.ancient <value>          Path of the old ancient data directory
  -block-amount-reserved <value>    Sets the expected reserved number of blocks for offline block prune (default: 1024)
  -cache <value>                    Megabytes of memory allocated to internal caching (default: 1024)
  -cache.triesinmemory              Number of block states (tries) to keep in memory (default: 128)
  -check-snapshot-with-mpt          Enable checking between snapshot and Merkle Patricia Tree (default: false)

This command should be run offline, i.e. while the node is stopped.
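To illustrate how the options combine, here is a small Python sketch that assembles an invocation using only the flags listed in the usage text. The helper name `build_prune_command` and the example path are hypothetical, and the sketch only builds the argument list; it does not run bor.

```python
def build_prune_command(datadir, ancient_dir=None, blocks_reserved=1024):
    """Assemble the argument list for an offline ancient-data prune.

    Flag names are taken from the usage text above; the paths passed in
    are placeholders chosen by the operator.
    """
    cmd = ["bor", "snapshot", "prune-block", "-datadir", datadir]
    if ancient_dir is not None:
        # Only needed when the ancient directory lives outside the datadir.
        cmd += ["-datadir.ancient", ancient_dir]
    cmd += ["-block-amount-reserved", str(blocks_reserved)]
    return cmd

# Hypothetical datadir path; keep the last 100 ancient blocks.
example = build_prune_command("/var/lib/bor", blocks_reserved=100)
print(" ".join(example))
```

The resulting list could be handed to a process runner once the node has been stopped.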

Rationale

The prune command keeps the last N blocks in the Freezer database and removes the rest. N is set using the block-amount-reserved flag and can range from 0 to K, where K is the length of the chain excluding the blocks held in LevelDB (90k by default). Pruning begins at genesis and moves towards the most recent block. Because each round starts where the previous one ended, pruning can be run multiple times.

For example, suppose the Freezer database holds blocks [0, 1, …, 999, 1000]. If block-amount-reserved is set to 100, then [0, 1, …, 900] will be pruned and the database will be left with [901, …, 1000]. On a second round of pruning with block-amount-reserved set to 50, the remaining blocks will be [951, …, 1000].
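The arithmetic in this example can be sketched in a few lines of Python. This is only a model of the range calculation, not the bor implementation; `prune_round` and its parameters are illustrative names.

```python
def prune_round(first_block, last_block, blocks_reserved):
    """Model one pruning round over the inclusive range [first_block, last_block].

    Returns (pruned, kept) as inclusive (start, end) tuples, or None for an
    empty range.
    """
    if blocks_reserved < 0:
        raise ValueError("blocks_reserved must be non-negative")
    total = last_block - first_block + 1
    keep = min(blocks_reserved, total)   # cannot keep more than what exists
    new_first = last_block - keep + 1    # first block that survives the round
    pruned = (first_block, new_first - 1) if new_first > first_block else None
    kept = (new_first, last_block) if keep > 0 else None
    return pruned, kept

# First round: blocks 0..1000, reserve 100.
round1 = prune_round(0, 1000, 100)
# Second round starts from what the first round kept, reserve 50.
round2 = prune_round(round1[1][0], round1[1][1], 50)
print(round1, round2)
```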

The pruner maintains an offset in the key-value store, which records the start block number of the Freezer db containing ancient data. The pruner updates this value after each pruning round and uses it in the next round to determine the starting point. The pruner opens a backup Freezer db and moves the blocks to be kept into it from the old db location. Upon completion, it validates that the Freezer db and the key-value db (LevelDB) are contiguous and then deletes the old db entirely.
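The flow described above (read the offset, copy the kept blocks to a backup db, check continuity, update the offset, delete the old db) can be modelled roughly as follows. This is a toy sketch under stated assumptions: the Freezer is a plain Python list, the key-value store is a dict, and the keys `"offset"` and `"first_block"` are hypothetical names, not bor's actual schema.

```python
def prune_with_backup(kv, old_freezer, blocks_reserved):
    """Toy model of one pruning round using a backup freezer.

    kv["offset"]      -- block number of old_freezer[0] (0 if never pruned)
    kv["first_block"] -- first block number held in the key-value store
    Returns the new (backup) freezer and updates kv["offset"].
    """
    offset = kv.get("offset", 0)
    head = offset + len(old_freezer)      # first block number NOT in the freezer
    keep = min(blocks_reserved, len(old_freezer))
    new_offset = head - keep

    # Copy only the blocks to be kept into the backup freezer.
    backup = list(old_freezer[new_offset - offset:])

    # Continuity check: the freezer must end exactly where the kv store begins.
    if new_offset + len(backup) != kv["first_block"]:
        raise RuntimeError("freezer and key-value store are not contiguous")

    kv["offset"] = new_offset             # used as the starting point next round
    old_freezer.clear()                   # stands in for deleting the old db
    return backup

kv = {"first_block": 1001}
freezer = list(range(0, 1001))            # blocks 0..1000
freezer_after_1 = prune_with_backup(kv, freezer, 100)
offset_after_1 = kv["offset"]
freezer_after_2 = prune_with_backup(kv, freezer_after_1[:], 50)
offset_after_2 = kv["offset"]
```

In this model, a client that ignores the stored offset would treat index 0 of the pruned freezer as block 0, which is one way to picture why older versions cannot open a pruned db.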

Although the implementation is straightforward, it opens up some security concerns which are addressed in the security considerations section below.

Backwards Compatibility

Although this feature doesn’t need a hard fork, the changes are backwards incompatible (unless a separate fork is maintained). Once pruning has been done at least once, the offset value is updated and used to open the Freezer db. The offset also records that the old historical block data is no longer available. Versions of bor that don’t support pruning will fail to open the Freezer db, because it no longer contains the ancient data they expect and they don’t know the offset value.

Security Considerations

This section lists some concerns raised by this change, as well as some alternatives that achieve the same goal.

This change impacts the data availability of the chain, as fewer nodes will maintain and serve that data.

It will be difficult to find peers who will have (and hence serve) historical data in the p2p network (a variation of EIP 4444).
A simple counterargument for this concern is that node operators can rely heavily on snapshots for initial synchronization instead of syncing from scratch.
For querying historical data via RPC, people will move to RPC providers who are incentivized to serve and keep that data.

Disk costs can be reduced/mitigated by moving Freezer db containing historical data to a cheaper storage medium like HDD with a simple configuration change, removing the need for this feature.

Copyright

All copyrights and related rights in this work are waived under CC0 1.0 Universal.

manav2401 · December 21, 2023, 1:34pm

The reference implementation is available as a part of this pull request. Huge shoutout and thanks to external contributor jsvisa for implementing this feature.

n8wb · December 31, 2023, 8:50pm

I agree with this proposal. While it may impact the availability of historical data on the chain, it would allow more people to run their own nodes. 5 TB is a hefty storage requirement, and this will keep increasing. 3.3TB is much more reasonable. A good consumer-grade Samsung SSD is roughly $150/2TB, so an average user would spend ~$300 instead of ~$450 for the storage to have a local bor node. Since 4TB SSDs have become widely available and cheap, they could do it with a single 4TB SSD. (No software RAID!) Of course, this is speculation. However, the benefits of dropping the barriers of entry (and the network benefits) likely outweigh the lower availability of historical data.

web3nodes · March 12, 2024, 5:04pm

I strongly recommend we proceed with this proposal.

pete · March 12, 2024, 5:20pm

App developers often need to operate full nodes, but this is becoming prohibitively expensive and difficult. Running a full node currently requires at least 4TB of storage ($500 on AWS), with daily increases of 20-30GB. Most developers don’t need access to very old blocks. For those rare instances where they do, options include simply not pruning these blocks or using archival nodes.

The concerns about data availability seem misplaced for several reasons. Firstly, syncing a node from the beginning via peer-to-peer is already unfeasible, and developers are advised to use snapshots instead.

Secondly, the trend of blockchain tech, as indicated by EIP-4844 (ephemeral blobs) and EIP-4444 (pruning of historical data older than a year), favors pruning and off-chain storage of historical data, with the exception being those few who require archival nodes. Ethereum calls it “The Purge” and considers it a very important step to maintain decentralization.

0xVitek · April 4, 2024, 12:10pm

Hi.

The disk space usage data presented in the proposal are misleading and lack information about when the database was pruned.

Here is data for a sentry node that I pruned 45 days ago.
Freshly pruned:

Total datadir: 2.7 TB
Ancient dir: 1.8 TB (67%)

45 days later:

Total datadir: 3.4 TB
Ancient dir: 1.9 TB (55%)

Conclusion: the disk usage can be reduced by 50% by pruning the whole database (including ancient) every 2 months.

PS: Moving ancient to a cheaper storage may seem like a good idea but it is often impossible or complex. Not every server provider has it as a standard option to add a 4 TB HDD to the server. Configuring this over a network is probably possible but increases complexity in setting up a node.

yorick · May 26, 2024, 12:01pm

For disk space reference, a node configured with Pebble and PBSS no longer needs manual pruning runs and uses:

Total: 2.9 TiB
Ancient: 2.0 TiB

With sufficient time it can be synchronized on a single 4TB drive, but if a snapshot is to be used, a 2x4TB (or 2x3.8TB) setup is required.

Community providers could start offering a Pebble/PBSS snapshot (alongside or instead of the current Level/Hash) immediately.

It would still be beneficial to be able to prune ancient.