Proposal: Blocks Monitoring

Author: Michel Leidson
Abstract:
Following the considerations of the performance management proposal PIP-4, we propose an improvement to the monitoring of validator nodes.

Motivation:
One of the criteria for being a performant validator is keeping the node at high uptime. Validator performance is based on checkpoint production and attestation on Heimdall, and on block production and attestation on Bor. Any drop in these activities degrades network health. Besides missing rewards, validators lose performance and damage their reputation. With node monitoring in its current state, failures can only be mitigated after they have happened.

Benefits:
Monitoring makes it possible to anticipate instabilities in the node, since these failures directly lower the validator’s performance. With this information in hand, operators can correct problems early, reducing downtime and avoiding missed checkpoints.

Architecture:
The system uses a client-server architecture. The client application runs on the validator side, collects block heights from a log file, and sends them to the server application. The server receives the information, persists it, and generates alert messages based on how validators perform when signing Bor blocks.

Applications present in the system

1 – Client Application (Log information collection, and sending via HTTP call)

2 – Server Application (Receives information from validators, persists and generates alert messages)

Provide alert on deficient status:

  • Missing Block Monitoring (“MBM”): Occurs when a Validator fails to sign blocks.

  • Low Block Performance (“LBP”): May arise in situations where the node cannot keep up with the current height of the network.
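The proposal does not say how the server decides that a node is falling behind. As a rough sketch of an LBP check, the function below compares a validator's reported height with a reference network height. The function name `check_lag` and the `MAX_LAG` threshold are hypothetical, not part of the proposed system, and MBM detection is left out because it would need signing data that the post does not describe.

```shell
#!/bin/bash
# Hypothetical threshold: how many blocks a node may trail before an LBP alert is raised.
MAX_LAG=${MAX_LAG:-64}

# Print "LBP" when the reported height trails the reference height by more
# than MAX_LAG blocks, otherwise print "OK". Both arguments are plain integers.
check_lag() {
    local reported=$1 reference=$2
    if [ $(( reference - reported )) -gt "$MAX_LAG" ]; then
        echo "LBP"
    else
        echo "OK"
    fi
}
```

For example, `check_lag 100 200` reports LBP with the default threshold, while `check_lag 190 200` reports OK.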


Specification

Monitoring should cover two parts: the PoS node, known as Heimdall, and the EVM node, called Bor. That said, keeping Bor in full working order requires intervention on the validator side.

Client Application Script

#!/bin/bash
# Collects the latest Bor block height from the node log and sends it to the monitoring API.
source /etc/cbbc/config

if [ -z "$API_URL" ] || [ -z "$FILE_PATH" ] || [ -z "$SIGNER_KEY" ]
then
    echo "Set your variables in the /etc/cbbc/config file!"

    if [ -z "$API_URL" ]
    then
        echo "Variable API_URL is empty!"
    fi
    if [ -z "$FILE_PATH" ]
    then
        echo "Variable FILE_PATH is empty!"
    fi
    if [ -z "$SIGNER_KEY" ]
    then
        echo "Variable SIGNER_KEY is empty!"
    fi
    exit 1
fi

# The API path uses the signer address in lower case.
SIGNER_LOWER=$(echo "$SIGNER_KEY" | awk '{ print tolower($0) }')
COMPLETE_API_URL="$API_URL/validator/$SIGNER_LOWER/blocks/bor"

while true; do
    # Block number of the most recent "Imported new chain segment" log line, commas removed.
    BLOCK=$(tail -n 1000 "$FILE_PATH" | grep "Imported new chain segment" | tail -1 |
        awk -F'number=' '{ print $2 }' | awk '{ print $1 }' | sed -e 's/,//g')
    TIMESTAMP=$(date +'%Y-%m-%dT%H:%M:%S.%N')
    JSON='{ "block": '$BLOCK', "timestamp": "'$TIMESTAMP'" }'
    if [ -z "$BLOCK" ]
    then
        echo -e "No block found in log file: $FILE_PATH\n"
    else
        echo -e "Block $BLOCK collected from file $FILE_PATH\n"
        echo -e "Request: Send JSON to API $COMPLETE_API_URL\n$JSON\n"
        echo "Response: "
        curl -X POST "$COMPLETE_API_URL" -d "$JSON" -H 'Content-Type: application/json'
        echo -e "\n"
    fi

    sleep 2
done

You can access the repository through the link:
GitHub - Michel-Leidson/collect-bor-blocks-client
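The script reads its settings from /etc/cbbc/config. A minimal sketch of that file might look like the following. The three variable names come from the script and the API host from the request example in this post, while the log path and the signer address are placeholders chosen for illustration and must be replaced with your own values.

```shell
# /etc/cbbc/config (illustrative values only)

# Base URL of the monitoring server, without a trailing slash.
API_URL="https://server-domain.com"

# Path to the Bor log file that the client reads (hypothetical location).
FILE_PATH="/var/log/bor/bor.log"

# Signer address of the validator (placeholder value).
SIGNER_KEY="0x0000000000000000000000000000000000000000"
```

Because the client loads the file with `source`, it has to be valid shell: no spaces around `=`, and values containing spaces need quotes.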

The application first reads the configuration file, located by default at /etc/cbbc/config, which defines the API_URL, FILE_PATH and SIGNER_KEY variables. Right after reading the file, the script validates all variables needed for execution; if any of them is not defined, it reports an error stating which variable is missing. With the variables loaded, the script collects the block height and saves it in the BLOCK variable. It then saves the date and time of collection, obtained with the “date” command, in the TIMESTAMP variable. Once all the necessary information has been collected, “curl” sends an HTTP POST request whose body is a JSON document with the block number and the date and time of collection, as in the example below:

POST https://server-domain.com/validator/<SIGNER_KEY_OF_VALIDATOR>/blocks/bor

{
"block": 9999999,
"timestamp":"9999-01-01T00:00:00.000"
}
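The collection and payload steps described above can be tried in isolation, without a running node or server. The sketch below wraps them in two helper functions, `extract_block` and `build_payload`, which are names introduced here for illustration. The sample log line is an assumed shape based only on the "Imported new chain segment" and "number=" fields the script looks for, not a line copied from a real node.

```shell
#!/bin/bash
# Print the block number from the last "Imported new chain segment" line on stdin.
# Thousands separators are stripped; nothing is printed when no line matches.
extract_block() {
    grep "Imported new chain segment" | tail -n 1 |
        awk -F'number=' '{ print $2 }' | awk '{ print $1 }' | sed -e 's/,//g'
}

# Build the JSON request body from a block number ($1) and a timestamp ($2).
build_payload() {
    printf '{ "block": %s, "timestamp": "%s" }' "$1" "$2"
}

# Assumed log line shape, for demonstration only.
SAMPLE='INFO [01-01|00:00:00.000] Imported new chain segment blocks=1 txs=10 number=9,999,999 hash=abc'

BLOCK=$(echo "$SAMPLE" | extract_block)
build_payload "$BLOCK" "9999-01-01T00:00:00.000"
echo
```

Running it prints a body matching the example request, with the commas removed from the block number.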

FAQ:

1 - On which notification channels will block monitoring be implemented?

Telegram and Discord.

2 - Are there risks for my validator?

There are no risks. Today, the metrics are sent to a central instance through a subscription mechanism, where the Bor block information is collected.

Note: In a future version, if approved, real-time monitoring of Bor blocks will be implemented through: https://monitor.stakepool.dev.br/

Thanks for putting this together. Sounds good to us.

Great initiative, Michel, and thank you. Early detection systems are a must!

Monitoring has always been a key challenge for us to ensure uptime. Thanks for putting this together, Stakepool.
The Heimdall dashboard https://monitor.stakepool.dev.br/ and now this should help ensure much higher uptime.

Always a fan, mate!

I think it’s great - nice and lean.

Monitoring at the block level is very important. This tool will help validators monitor their nodes and improve both their performance and the reliability of the network. Great job, Stakepool.

Gathering block information (and potentially other lower-level metrics) from the validators is a fantastic goal, but the technical solution seems to have some aspects which need further consideration:

  1. The client-server structure makes the system centralized, adds a single point of failure (the server), and creates a trust relationship between each validator and the operator of that server. Such an approach is incompatible with a network that strives for full decentralization, like Polygon.

  2. The proposed system is opaque and places the data in a silo. The community cannot obtain the raw information provided by validators, and no one can build additional or alternative tooling, which creates vendor lock-in.

A better approach would be for the validators to publish the information over a decentralized pubsub protocol, making it freely accessible to everyone and allowing anyone to build analytics, monitoring, and alerting tools on top of the stream of data. With this approach, any alerting backend including the proposed one can receive the information via the decentralized messaging protocol instead of directly from the validators, and other backends and frontends can subscribe to the information equally well, creating an open and fair environment with no lock-in and no trust required.

To provide an example, the Streamr Network uses a similar approach to share node metrics across the community, allowing anyone to build tooling such as this explorer, where real-time information about the network nodes is available without any centralized backend collecting the data.

I would be happy to work on and put forward over the next few weeks an improved proposal around the same idea with two important improvements:

  1. The data published by validators is distributed over a decentralized protocol in an open and accessible way,

  2. The data content is extensible and flexible, allowing it to include a set of metrics - for example CPU & memory usage, or whatever the community finds useful for detecting problems in the validator set.

Hey Henri, nice to meet you, and thank you for suggesting improvements.
Let me sum up the proposal. As you well know, validators are responsible for securing the network and making sure everything works properly. What was proposed is aimed at validators: a monitoring system to assist in performance analysis, with the focus on monitoring the efficiency of each validator. In other words, it complements the existing checkpoint bot on the Discord channel.

I understand your approach and what you want to propose, and I think it is very valid and beneficial for the community. However, please note that we are talking about data unique to each validator, so I don’t know how much validators will be willing to share. In addition, any integration made on the validator goes against the principles of many here. But if this data is to be made available in a decentralized way, the best place for the integration would be the Bor layer itself.

Any thoughts are welcome. Thanks in advance!

Yes, my feedback was definitely from a validator point of view - we (Streamr) are validators too. I’m thinking of an approach that will enable the goals presented here, but with broader capabilities, no single point of failure, and no trust in any third party required. Like this proposal, it will not require any changes to Bor or Heimdall software or interfere with their operation in any way. I think this non-invasive approach is important.

Let me post something more formal later this week, I’m currently gathering feedback internally. Cheers!

Here’s my proposal that relates to this one but seeks to decentralize, expand, and unsilo the health metrics system. It still fully enables the alerter tool imagined here to be built, and much more besides. Please let me know your feedback. Cheers!

Hello Henri. I really like your proposal; it is very valid and should solve the problems of collecting metrics. However, it is essential that all Polygon validators implement it, and making such data available to the public remains a challenge, even though having such data would give us many possibilities for creating tools.
So what I propose is a meeting, organized by the team, with the validators. That way you could present more about the project and clear up any doubts, since understanding the pros and cons will be of paramount importance. Just tell me what you think and I’ll be happy to help.
