Proposal: Decentralized sharing of validator health metrics

Authors

Henri Pihkala
Matthew Fontana
Matthew Rossi

Abstract

This document proposes a decentralized, open, and secure data pipeline for sharing a broad set of Polygon validator metrics. It enables an ecosystem of early-warning and node monitoring tools to be built by anyone, thereby improving network health, transparency, and decentralization. The data sharing is based on a peer-to-peer data transport protocol.

Motivation

This proposal improves on and expands the idea presented in Michael Leidson’s proposal to create a system for gathering health metrics from validator nodes. The motivation and goals of this proposal are similar to those presented by Leidson, but the technical approach differs significantly in adopting the following design principles:

  1. Decentralization & trustlessness - the system used to gather information from the network must not contain a single point of failure nor rely on a single party to run the infrastructure;
  2. Transparency & openness - the raw data must be accessible to everyone equally, and anyone must be able to build tooling on top of the metrics data;
  3. Security & robustness - the data must be tamper-proof and cryptographically attributable to the node that signed it, and the data collection and distribution system must be censorship-resistant.

Currently, there exist some validator health dashboards such as this one, which shows limited information about Bor nodes. The system proposed here improves on the status quo by enabling very detailed dashboards and alerting systems to be constructed by anyone for both Bor and Heimdall nodes, as well as both validator and sentry nodes, all based on a decentralized stream of detailed metrics shared voluntarily and securely by the validators themselves.

This proposal maintains a non-invasive approach that requires no changes to the Heimdall and Bor software, and uses a separately running Metrics node to extract, prepare, and share the data at regular intervals.

Instead of just the most recently seen block information proposed by Leidson, this proposal suggests sharing a broader set of metrics available from the Prometheus-compatible metrics API on both Bor and Heimdall nodes (as leveraged by the Prometheus exporters for Polygon nodes available in the Ansible playbooks for Polygon).

The benefits of this proposal are as follows: The proposed approach enables and powers an ecosystem of early-warning systems, helps troubleshoot problems with validator performance, and gives the community access to validator health data, boosting innovation and confidence in the Polygon ecosystem.

The technical architecture presented here is scalable and robust, contains no single point of failure, allows anyone to access the data, and is extensible to include any details available from the nodes in the future. This proposal is not opinionated as to what kind of end-user tools should be built on top of the data, it simply describes a method for collecting and openly distributing the data, enabling any kind of tooling to emerge.

Specification

The only component involved in the solution is a Metrics node. It can be co-located with the Bor and Heimdall nodes or run on a separate, lower-performance machine for best isolation. The Metrics node collects the metrics by periodically querying the metrics API present on both Heimdall and Bor nodes. This API was originally intended for Prometheus metrics collection, but serves this purpose equally well.

The Metrics node publishes the metrics data over a decentralized peer-to-peer network that implements the publish/subscribe (a.k.a. pubsub) messaging pattern. In pubsub, data is published to a named ‘topic’, and anyone can join the topic as a subscriber to receive the stream of data. The peers, consisting of Metrics nodes and subscribers, form a mesh of connections with other peers. Each peer forwards the data to a number of other peers, so the data eventually reaches every peer in the network. Since each node is connected to a limited number of peers, and the data travels through multiple redundant connections, such networks scale very well, are fault tolerant, and most importantly, don’t depend on any centralized server.
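The propagation behavior described above can be illustrated with a toy simulation (the topology and parameters here are illustrative only, not those of any real protocol): each node keeps a small, fixed set of peer connections and forwards every new message to its peers, so a message published at one node floods to the entire network in a bounded number of hops.

```javascript
// Toy gossip simulation: n nodes arranged in a ring, each connected to
// `degree` nearby peers. A message published at node 0 is forwarded hop by
// hop until every node has seen it, with no central server involved.
function buildMesh(n, degree) {
  const peers = Array.from({ length: n }, () => new Set());
  for (let i = 0; i < n; i++) {
    for (let d = 1; d <= degree / 2; d++) {
      const j = (i + d) % n;
      peers[i].add(j);
      peers[j].add(i);
    }
  }
  return peers;
}

function broadcast(peers, origin) {
  const seen = new Set([origin]);
  let frontier = [origin];
  let hops = 0;
  while (frontier.length > 0) {
    const next = [];
    for (const node of frontier) {
      for (const p of peers[node]) {
        if (!seen.has(p)) {
          seen.add(p); // each peer forwards the message onward exactly once
          next.push(p);
        }
      }
    }
    frontier = next;
    if (frontier.length > 0) hops++;
  }
  return { reached: seen.size, hops };
}

const peers = buildMesh(100, 4);
const { reached, hops } = broadcast(peers, 0);
console.log(`reached ${reached}/100 nodes in ${hops} hops`);
```

Even with only 4 connections per node, the message reaches all 100 nodes; real pubsub networks add random long-range links, which shortens the hop count further.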

Each data point is cryptographically signed, ensuring that data cannot be tampered with or spoofed, and it can always be traced back to the source node. Subscribers validate the signatures to ensure that the data was indeed produced by one of the validators in the Polygon network.

The subscribers can be any kind of applications: dashboards, analytics backends, alerting and monitoring systems, and so forth. The data is public and can be subscribed to by anyone, at massive scale. Applications built on top of the raw data can do whatever they wish with it, for example aggregate data, store a history of data, or even publish the aggregated data in realtime to another topic on the same data network. The network itself can also store the raw data points published by validator nodes for later retrieval and batch processing by applications.

The Metrics node can be distributed as a Docker image, making deployment and installation easy regardless of platform. The Metrics node has minimal CPU and memory requirements and does not consume disk space over time. While the Bor and Heimdall nodes need to be run on heavy-weight machines, a humble VM is enough to host the Metrics node, making it inexpensive for validators to run.

Rationale / Technology choices

The main technology choice here is the decentralized pub/sub messaging protocol to be used. For the P2P data distribution in the Polygon Metrics network, the Streamr protocol is proposed. Compared to alternatives such as libp2p, it offers the following advantages for this particular use case:

  • It has access control features backed by on-chain permission registries
  • It has QoS features like message order guarantees and detecting missed messages
  • It supports storage and playback of historical data
  • It uses the same cryptographic primitives as Polygon and Ethereum, meaning identity equivalence (all Polygon/Ethereum private keys/addresses are also Streamr private keys/ids)
  • It supports adding incentives to data sharing, which can be useful in case validators don’t otherwise opt-in to metrics sharing

(Full disclosure: the authors of this proposal are Streamr contributors, which obviously introduces bias to the technology recommendation, but on the other hand it’s probably fair to say that the authors are experts on the subject matter as well as Polygon advocates.)

For further reading, here’s a thought piece about collecting metrics data from decentralized systems.

Proposed work phases

We propose to divide the work into the following phases:

  1. [COMPLETED] Proof-of-concept: A Metrics node implementation is demonstrated to pull data from a Bor and Heimdall node and make it available over the peer-to-peer network to any number of subscribers
  2. Metrics testnet: Metrics nodes are rolled out to a small number of Polygon sentry nodes
  3. Sentry rollout: Metrics nodes are rolled out to all Polygon sentry nodes
  4. Validator rollout: finally Metrics nodes are rolled out to all Polygon validator nodes

Proof-of-concept

There’s an initial implementation of the Metrics node in this Github repository. As a proof-of-concept, one Metrics node has been deployed and connected to the Polygon validator nodes run by Streamr. The node is publishing metrics data every 10 seconds to the following streams:

polygon-validators.eth/validator/bor
polygon-validators.eth/validator/heimdall
polygon-validators.eth/sentry/bor
polygon-validators.eth/sentry/heimdall

Here are snapshot examples of what the data looks like: Bor, Heimdall. As you can see, the data contains a wealth of metrics. If the Polygon team adds new metrics in Bor/Heimdall updates, they will automatically show up in the streams.

Builders seeking to use the data can easily subscribe to the above streams using one of the following Streamr protocol resources:

As a bonus, to also show something simple built on top of the data, here’s a quick 5-minute demo dashboard on JSbin that shows a few selected metrics, including current block, peer count, and some CPU- and memory-related variables. Of course, it currently shows data for only one validator, because only one validator is pushing data into the streams. If there were more validators, it would be easy to see for example when some nodes are lagging behind just by looking at the latest block number of each validator - something which is hard to determine just by looking at your own node.

Cost of operation

The cost depends on the cloud/data center solution used by each validator. A virtual machine with, say, 4 GB RAM and 2 vCPUs should be fine for this, although this needs to be verified in the PoC phase. The cost of running such a node ranges from around $5-8/month on cheap providers like Hetzner or Contabo to maybe $20-30/month on more expensive providers like AWS or Azure.

In any case, the costs of running the Metrics node are negligible compared to the heavyweight machines with large, fast disks required to run Bor. The total increase in validator operating expenses will likely be less than 1%.

Limitations to the applicability of the data

The validator nodes self-report the data, meaning that they could simply lie. The attached cryptographic signatures prove that the data originates from a certain Metrics node, but not that the values shared correctly represent the actual state of their Bor and Heimdall nodes. This is a fundamental limitation and not a technical shortcoming of the system.

We propose that slashing and other ‘hard’ measures continue to be strictly based on the validators’ adherence to the Bor and Heimdall protocols, such as signing checkpoints and so forth, like they are now. The metrics data complements this by helping reach ‘softer’ goals, such as helping the validators themselves (“Are my nodes healthy?”), other validators (“Are my nodes performing as well as other validators?”), and the Polygon community (“Are Polygon validators healthy and reliable?”).

Security considerations

It’s recommended as a security best practice to run the Metrics node on a separate VM to isolate it from the Bor and Heimdall nodes and networks. This way, the Metrics system can not disrupt or influence their operation in any way. The Metrics node only needs access to the Prometheus metrics port on Bor and Heimdall nodes in order to query metrics, which is easy to accomplish via firewall rules that allow those ports to be accessed from the Metrics machine.

Similarly to Bor and Heimdall, the Metrics node uses a private key which is stored on disk. The Metrics key can be a different key than the Signer key used with Heimdall/Bor. If a Metrics private key gets compromised, the key is easy to revoke from the metrics streams’ access control lists and replace with a new one. While care should definitely be taken to safeguard the key, the damage from a compromised Metrics key is much less compared to an Owner or Signer key of the actual validator nodes getting compromised, as those can lead to theft of stake or heavy slashing.

The Metrics node does not need to open any inbound ports, which helps secure it against DoS attacks - although it does need to allow traffic on a range of UDP ports.

With proper isolation from the Bor and Heimdall nodes as described above, possible attacks can only disrupt the metrics system: either individual Metrics nodes or the whole Metrics network itself. Disrupting individual nodes does not compromise the data flow in the network as a whole, as data always travels through many redundant paths through the network.

On the network level, all P2P networks are by their nature vulnerable to certain types of attacks, in particular eclipse attacks. This applies, for example, to Bitcoin, Ethereum, Polygon Bor, Polygon Heimdall, and the proposed Metrics network. The chosen network parameters play a large role in how robust the network is against attacks. For example, blockchain networks typically defend against eclipse attacks by having a high number of connections to other peers (up to hundreds), which is a performance vs. security tradeoff.

Unlike blockchains, the Metrics network is ‘only’ a data distribution network and does not secure digital assets nor maintain consensus. It is therefore a lower-value target and lower-risk overall: a nice-to-have, but not connected to the operation of the Bor or Heimdall networks in any way. This also allows the Metrics network to choose a slightly less defensive P2P parameterization to fit the use case better, improve efficiency, and reduce bandwidth consumption.


Hi @StakePool thanks for your feedback in the other thread! Let’s continue the conversation here.

I agree that it’s important that validators start using it. There are strong network effects: The more validators contribute data, the more valuable it will be for everyone. The installation of the Metrics node itself should be very easy, in any case MUCH easier than setting up Bor and Heimdall. As a second point, I think the Polygon team and official docs play a large role in the adoption: If docs and the people onboarding validators instruct and encourage people to run a Metrics node, then I believe people will happily do it.

Just to provide an update to you and others reading this, my next steps are:

  • Build a proof-of-concept (POC) implementation, hopefully ready by the end of this week
  • Gather feedback and organize a demo session (maybe as part of the Polygon Builder Sessions)
  • Submit the formal PIP in Github

What is the expected cost to run the metrics node? I.e., I assume the metrics node performs fine on CPU-limited / low-memory instances, is that true?

Is it appropriate to run the metrics node on the same machine as one of the other nodes, or does it need to be fully isolated?

It seems to me this isn’t totally true. A compromised metrics machine could technically compromise private keys used to sign metrics, or provide the actor with a back door where they’re able to generate metrics events on that machine itself. While this won’t affect the blockchain network, it could affect any downstream governance based on those metrics.

If I remember correctly the proposal for using these metrics during governance was to have them being of low weight, so perhaps there’s not good reason to execute such an attack, but the strategy itself does seem possible and should be acknowledged.

Is it appropriate to run the metrics node on the same machine as one of the other nodes, or does it need to be fully isolated?

Nothing prevents running it on the same machines, but if we aim for the most solid choice from a security standpoint (which we should), then I’d recommend a best practice of running it on a separate VM. It can be a cheap one, see below.

What is the expected cost to run the metrics node? I.e., I assume the metrics node performs fine on CPU-limited / low-memory instances, is that true?

The cost depends on the cloud/data center solution used by each validator, and I’m not sure which data centers are most popular among Polygon validators. A virtual machine with, say, 4 GB RAM and 2 vCPUs should be fine for this. The cost of running one ranges from around $5-8/month on cheap providers like Hetzner or Contabo to maybe $20-30/month on more expensive providers like AWS or Azure.

In any case, the costs of running the Metrics node are negligible compared to the heavy-weight machines with large, fast disks needed to run Bor.

A compromised metrics machine could technically compromise private keys used to sign metrics, or provide the actor with a back door where they’re able to generate metrics events on that machine itself. While this won’t affect the blockchain network, it could affect any downstream governance based on those metrics.

Correct, and another point to make: given that validators self-report the metrics, they can simply lie. The cryptographic signatures prove that the data originates from a validator’s Metrics node, but not that the values posted are correct.

Therefore, regarding the idea of basing downstream governance on the metrics, I think it’s important that slashing and other ‘hard’ measures are strictly based on the validators’ adherence to the Bor and Heimdall protocols, such as signing checkpoints and so forth, like it is now. The metrics data complements this with ‘softer’ goals, such as servicing the validators themselves (“Are my nodes healthy?”), other validators (“Are my nodes performing as well as other validators?”), and the Polygon community (“Are Polygon validators healthy and reliable?”).

As a small note, if a Metrics private key gets compromised, the key is easy to revoke from the metrics streams’ access control lists and replace with a new one. While care should definitely be taken to safeguard the key, the damage from a compromised Metrics key is much less compared to an Owner or Signer key of the actual validator nodes getting compromised, as those can lead to theft of stake or heavy slashing.

I’ll do a round of updates to the proposal text based on your questions, they were very good ones!

It would be great to have another monitoring option, but let’s not forget that Tenderduty is already the standard for monitoring the Cosmos SDK side, and that there are already multiple Geth monitoring solutions which could be easily adapted to monitor Bor.

If something will be custom built, it would be wise to have it work also for other Cosmos SDK and Geth based chains, so that it accepts PRs from other teams in other spaces and keeps evolving.

Also, will you guys be seeking funding for this?

Thanks for highlighting Tenderduty as well as existing Geth monitoring tools. I think that’s valuable and people should be aware of those. However, my goal here is to propose something that’s more like a platform than yet another tool - and that unlocks quite different outcomes. I’ll explain below:

The tools that I’ve seen (and please correct me if I’m wrong here) are meant to be installed by node operators to monitor their own nodes. Polygon validators can use a mix of these or Prometheus-based tooling to monitor their own nodes. For example, for our own validator we are using Prometheus and Grafana internally, and that’s all fine and working well. However, the keyword there is internally.

While there are tools that validators can leverage for themselves, or even centralized SaaS tools that do some of the work for you, none of the existing solutions seek to open up the data for everyone to build on, enable data sharing among validators, or data sharing with the community. Our proposal essentially aims to create an open firehose of data from validators which anyone can analyze or build tooling on - it’s more like a platform or ecosystem, as opposed to proposing to build a new tool or adopt a particular tool.

Introducing just a new tool probably wouldn’t improve the status quo much actually, but allowing an ecosystem to emerge on top of opened-up data might achieve such goals.

In a nutshell:

  • In the current model, validators have to install and operate whatever monitoring tools they need themselves, and the data is not accessible to others.

  • In the proposed model, any third party can independently build and operate monitoring tools to benefit all validators, and the data is available to everyone equally.

Or looking at a slightly different angle:

  • In the current model, validators need to set up elaborate tool chains if they wish to be notified when something’s wrong with their node

  • In the proposed model, anyone can detect if something’s wrong with your node and tell you about it

Also, will you guys be seeking funding for this?

We wouldn’t mind a small grant to cover our costs, but otherwise no - we don’t need funding to take this through the PoC phase. From there, help from the Polygon team will be needed to roll it out to validators.

The proposed solution is quite simple really, as it requires just a bit of application code on top of existing mature building blocks. The Metrics node will be just 50-100 lines of code, pretty tiny! Curiously in this case, the effort of writing the proposal and discussing it is probably 10x more work than writing the software itself :smiley:


Awesome to hear it’s a minimal adaptation to the current code!

And I referred to existing tools as a means of building the data sharing layer on top of them, but if you already have robust data collection then all the better.

The one question I have left is: what is the purpose of sharing individual validator data in this manner? What is the benefit to the ecosystem, delegators, and validators, individually?

What is the benefit to the ecosystem, delegators and validators, individually?

Here are a few quick takes off the top of my head, but anyone else feel free to add insights!

Benefit to validators

  • Creating better tooling becomes possible, because the data is readily available. Polygon invests quite heavily in hackathons for example - imagine what kind of dashboards and analytics could come out of that.
  • Get notified of problems earlier.
  • Find out root causes faster (by e.g. comparing your node’s data to data from other nodes).
  • The Polygon team (or anyone else in the community) has more means to help you if your nodes are having trouble, as they can easily look at the data from your node.

Benefit to delegators

  • More visibility into validator health. Is their CPU burning hot because they run on an underpowered server? Will they soon run out of disk space? Maybe I should undelegate now, before that happens?

Benefit to ecosystem

  • Boosts confidence in Polygon stability. Anyone can observe high uptimes and reliable node performance.
  • Better decentralization. Reduce need for centralized backends and tools. Allow people to support each other.
  • Pioneer a novel way to expose metrics from a decentralized blockchain network. Establish Polygon as a trendsetter in the area of web3 devops and transparency.
  • Open data inspires creativity.

PoC complete

I’m happy to announce that I now have a proof-of-concept to show! I’ve completed an initial Metrics node implementation and connected our validator node to the metrics streams.

Here’s how it works

The Metrics node is configured with the URLs to the Prometheus metrics endpoints on Bor and Heimdall nodes on both Sentry and Validator machines. The machine that runs the Metrics node must be able to access these ports through the firewall. By default the URLs are:

http://VALIDATOR-IP:7071/debug/metrics/prometheus
http://VALIDATOR-IP:26660/metrics
http://SENTRY-IP:7071/debug/metrics/prometheus
http://SENTRY-IP:26660/metrics

For each of the 4 node types, there’s a pre-created stream ID on the Streamr Network. Anyone can subscribe to these streams, but only whitelisted Metrics keys can publish to them.

polygon-validators.eth/validator/bor
polygon-validators.eth/validator/heimdall
polygon-validators.eth/sentry/bor
polygon-validators.eth/sentry/heimdall
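The publish-side access control can be sketched as follows. This is a local stand-in for the on-chain permission registry, and the addresses used are hypothetical placeholders:

```javascript
// Stand-in for the on-chain permission registry: only whitelisted Metrics
// keys may publish to a stream, while anyone may subscribe.
const publishWhitelist = {
  'polygon-validators.eth/validator/bor': new Set([
    '0xValidatorA', // hypothetical Metrics node address of a piloting validator
  ]),
};

function canPublish(streamId, address) {
  const allowed = publishWhitelist[streamId];
  return allowed !== undefined && allowed.has(address);
}

// Whitelisted key is accepted; an unknown key is rejected.
console.log(canPublish('polygon-validators.eth/validator/bor', '0xValidatorA'));
console.log(canPublish('polygon-validators.eth/validator/bor', '0xMallory'));
```

In the real system this check is enforced by the network against the streams’ on-chain access control lists, which is also what makes revoking a compromised Metrics key straightforward.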

The Metrics node polls the Prometheus endpoints on Bor and Heimdall nodes on both Sentry and Validator machines every 10 seconds, transforms the data a bit, and publishes it as JSON to one of the four streams depending on the node type.
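The poll-transform-publish cycle described above can be sketched as follows. This is a hedged illustration, not the actual implementation: `fetchMetricsText` and `publish` are stubs standing in for an HTTP GET against a Prometheus endpoint and the Streamr client, respectively, and the metric names in the sample are only examples.

```javascript
// Parse the Prometheus text exposition format into a flat JSON object:
// "# HELP" / "# TYPE" comment lines are skipped, and each
// "name{labels} value" sample line becomes a { name: value } entry.
function parsePrometheusText(text) {
  const metrics = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '' || trimmed.startsWith('#')) continue;
    const match = trimmed.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/);
    if (match) {
      const key = match[1] + (match[2] || '');
      metrics[key] = Number(match[3]);
    }
  }
  return metrics;
}

// One iteration of the Metrics node main loop: poll, transform, publish.
async function pollOnce(fetchMetricsText, publish, streamId) {
  const raw = await fetchMetricsText();
  const metrics = parsePrometheusText(raw);
  await publish(streamId, { timestamp: Date.now(), metrics });
  return metrics;
}

// Usage with stubs in place of the HTTP fetch and the Streamr client:
const sampleResponse = [
  '# HELP chain_head_block Current head block',
  '# TYPE chain_head_block gauge',
  'chain_head_block 42000000',
  'p2p_peers 32',
].join('\n');

pollOnce(
  async () => sampleResponse,
  async (streamId, msg) => console.log(streamId, Object.keys(msg.metrics).length, 'metrics'),
  'polygon-validators.eth/validator/bor'
);
```

In the real node this runs on a 10-second interval against each configured endpoint, publishing to the stream matching the node type.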

Using the data

You can check out the raw data live in your browser via the Streamr UI. In this case your browser becomes a light node on the Streamr network. Note that there is no backend involved. Wait a bit to see data points, as they’re published every 10 seconds:

Builders seeking to use the data can easily subscribe to the above streams using one of the following Streamr protocol resources:

Here are snapshot examples of what the data looks like: Bor, Heimdall. As you can see the data contains a wealth of metrics. If the Polygon team adds new metrics in Bor/Heimdall updates, they will automatically show up in the streams.

Demo dashboard

To illustrate how easy it is to build a dashboard on top of the data, here’s a quick 5-minute demo dashboard on JSbin that shows a few selected metrics, including current block, peer count, and some CPU- and memory-related variables. Of course, it currently shows data for only one validator, because only one validator is pushing data into the streams.

Here’s the source code to that JSbin.

Code and Github

Here’s the Github repository for the Metrics node.

The code is quite simple. Here’s the main part, which polls each Prometheus endpoint, parses and transforms the metrics, and publishes messages to the corresponding stream. The rest of the code is mostly just setting up stuff.

Next steps

  • What’s still missing from the implementation is Docker packaging, which will make it easier for people to install and run. Installation instructions also need to be written.
  • Optionally, onboard a few other pilot validators to show that multiple validators can send data to the same firehose, and to practice the onboarding process. (If there are early adopters here willing to help by trying it out, please let me know!)
  • Continue gathering feedback and submit formal PIP on Github.

Updated the proposal draft to include a “Proof-of-concept” section. It contains the same information as my previous post to this thread, just more compressed.


Ok, I’ve finished wrapping the node into a Docker container and documented the env variables accepted by the node. There are also step-by-step instructions for installing and running the image using the docker command line tool for those who aren’t already using Kubernetes or some other container orchestration tool.
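As a deployment-configuration sketch, a typical invocation might look something like the following. Note that the image name and every environment variable name below are hypothetical placeholders - use the actual names documented in the repository's README:

```shell
# Illustrative only: image name, tag, and variable names are hypothetical
# placeholders, not the documented ones from the repository.
docker run -d --restart unless-stopped \
  -e BOR_METRICS_URL=http://VALIDATOR-IP:7071/debug/metrics/prometheus \
  -e HEIMDALL_METRICS_URL=http://VALIDATOR-IP:26660/metrics \
  -e NODE_TYPE=validator \
  -e PRIVATE_KEY_FILE=/secrets/metrics.key \
  -v /path/to/secrets:/secrets:ro \
  polygon-metrics-node:latest
```

Running with `--restart unless-stopped` keeps the node publishing across reboots, and mounting the key read-only limits the blast radius if the container is compromised.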

Just a security note: Obviously, it’s very wise not to trust some random piece of software written by a guy on the internet! :sweat_smile: Here’s why there’s no risk to your Bor and Heimdall nodes if you try running the Metrics node:

  • Run the Metrics node on a different machine and only open up access to the (read-only) metrics API of your Heimdall and Bor instances - this way the Metrics node has no way of accessing anything on those machines or disrupting them in any way.
  • The Metrics node runs inside a Docker virtual environment, meaning that it’s fully sandboxed from anything else on the local machine.
  • You can also get started by only connecting your Sentry nodes to the Metrics node, and leave the actual Validator nodes for later, if you’d like.

Any early adopters here ready to join the pilot? Please ping me here or on Polygon Discord and I can help you get set up. In particular, I need to whitelist your Metrics node address on the streams.
