Proposal: Decentralized sharing of validator health metrics

henri · March 7, 2023, 3:55pm

Thanks for highlighting Tenderduty as well as the existing Geth monitoring tools. I think that’s valuable and people should be aware of those; however, my goal here is to propose something that’s more like a platform than yet another tool - and that unlocks quite different outcomes. I’ll explain below:

The tools that I’ve seen (and please correct me if I’m wrong here) are meant to be installed by node operators to monitor their own nodes. Polygon validators can use a mix of these or Prometheus-based tooling to monitor their own nodes. For example, for our own validator we are using Prometheus and Grafana internally, and that’s all fine and working well. However, the keyword there is internally.

While there are tools that validators can leverage for themselves, or even centralized SaaS tools that do some of the work for you, none of the existing solutions seek to open up the data for everyone to build on, enable data sharing among validators, or data sharing with the community. Our proposal essentially aims to create an open firehose of data from validators which anyone can analyze or build tooling on - it’s more like a platform or ecosystem, as opposed to proposing to build a new tool or adopt a particular tool.

Just introducing a new tool probably wouldn’t improve the status quo much, but allowing an ecosystem to emerge on top of opened-up data might.

In a nutshell:

In the current model, validators have to install and operate whatever monitoring tools they need themselves, and the data is not accessible to others.
In the proposed model, any third party can independently build and operate monitoring tools to benefit all validators, and the data is available to everyone equally.

Or, looking at it from a slightly different angle:

In the current model, validators need to set up elaborate tool chains if they wish to be notified when something’s wrong with their node.
In the proposed model, anyone can detect if something’s wrong with your node and tell you about it
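To make the second point concrete, here is a rough sketch of what a third-party watcher could look like once metrics are shared openly. Everything in it is an assumption made for illustration: the message fields (validator_id, timestamp, block_height), the thresholds, and the idea that messages arrive as a simple list are all hypothetical, since the proposal doesn't fix a message schema or transport here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricsMessage:
    # Hypothetical schema: the real shared message format is not defined here.
    validator_id: str
    timestamp: float   # unix seconds at which the validator published
    block_height: int  # latest block height the validator reports


def find_unhealthy(messages, now, max_silence=60.0, max_lag=10):
    """Return {validator_id: reason} for validators that look unhealthy.

    A validator is "silent" if its newest message is older than max_silence
    seconds, and "lagging" if its reported height trails the best reported
    height by more than max_lag blocks. Both thresholds are illustrative.
    """
    # Keep only the newest message per validator.
    latest = {}
    for msg in messages:
        prev = latest.get(msg.validator_id)
        if prev is None or msg.timestamp > prev.timestamp:
            latest[msg.validator_id] = msg
    if not latest:
        return {}

    best_height = max(m.block_height for m in latest.values())
    problems = {}
    for validator_id, msg in latest.items():
        if now - msg.timestamp > max_silence:
            problems[validator_id] = "silent"
        elif best_height - msg.block_height > max_lag:
            problems[validator_id] = "lagging"
    return problems
```

The point is that this code needs no access to anyone's infrastructure: it only reads the shared data, so anyone could run it and alert the affected operator.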

Also, will you guys be seeking funding for this?

We wouldn’t mind a small grant to cover our costs, but otherwise no, we don’t need funding to take this to the PoC phase. From there, we would need help from the Polygon team to roll it out to validators.

The proposed solution is quite simple really, as it requires just a bit of application code on top of existing mature building blocks. The Metrics node will be just 50-100 lines of code, pretty tiny! Curiously, in this case, the effort of writing the proposal and discussing it is probably 10x more work than writing the software itself.
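To give a feel for the size, here is a minimal sketch of the core of such a Metrics node. It is not the actual implementation. It assumes the node reads metrics in the Prometheus text exposition format (which validators using Prometheus already expose) and hands a JSON message to some publish function. The scrape and publish callables, the message layout, and the metric names in the usage note are placeholders, as the transport is not specified here.

```python
import json
import math
import time


def parse_prometheus_text(text):
    """Parse un-labelled samples from Prometheus text exposition format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue  # skip labelled series to keep the sketch small
        try:
            value = float(parts[1])
        except ValueError:
            continue
        if math.isfinite(value):  # NaN/Inf are not valid JSON numbers
            samples[parts[0]] = value
    return samples


def build_message(validator_id, metrics_text, now=None):
    """Wrap parsed metrics in a JSON message (hypothetical layout)."""
    return json.dumps(
        {
            "validator": validator_id,
            "timestamp": time.time() if now is None else now,
            "metrics": parse_prometheus_text(metrics_text),
        },
        sort_keys=True,
    )


def run_once(validator_id, scrape, publish, now=None):
    """Scrape local metrics once and publish them.

    scrape() returns the metrics text; publish(message) sends it to the
    shared data stream. Both are placeholders supplied by the caller.
    """
    publish(build_message(validator_id, scrape(), now))
```

A real node would call run_once on a timer, with scrape fetching the local metrics endpoint and publish writing to whatever shared stream is chosen. For example, a scrape returning "chain_height 123" would be published as a message whose "metrics" field is {"chain_height": 123.0}.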