Bor 0.3.3: connection inconsistencies (not limited to this version)
While building a new Polygon server environment we observed the following behaviour:
One of several sentries holds up the validator nodes, partly or altogether.
2 validators served by 3 sentries
The sentries are isolated from each other (no connections between them); each of them has 200 connections to the outside world.
Each validator has 3 connections, one to each of the sentries.
Sentries and validators run on a setup of 4 physical servers. All nodes are Ubuntu 22.04 virtual machines on Linux KVM/VMM. The network is a local LAN. Connections are configured to go via the physical switches (not over KVM internal networks), at 10 Gbit.
The validators have separate VMs for heimdall and bor.
2 sentries have separate VMs for heimdall and bor.
1 sentry runs heimdall and bor combined in one VM.
The validators are on different physical servers, and the sentries are likewise spread over separate physical servers. One server hosts both a sentry and a validator.
The behaviour is not limited to the above configuration. In our former configuration we noticed the same behaviour over the last 2 years.
The configuration is meant to be resilient against failure of any one of its components.
Misbehaviour in one component should stay isolated to that component.
If 2 sentries fail or fall behind, the validators should still be fine.
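The resilience expectation above can be made concrete with a small sketch. It models only the connection graph described in this report (2 validators, each linked to all 3 sentries) and checks that every validator keeps at least one reachable sentry when any 2 sentries fail. The node names are hypothetical labels, and the check says nothing about how bor actually behaves; it only states what the topology should tolerate on paper.

```python
from itertools import combinations
from typing import Dict, Iterable, Set

# Hypothetical labels for the nodes described in this report.
SENTRIES = ["sentry-1", "sentry-2", "sentry-3"]
LINKS: Dict[str, Set[str]] = {
    "validator-1": set(SENTRIES),
    "validator-2": set(SENTRIES),
}


def all_validators_reachable(links: Dict[str, Set[str]], failed: Iterable[str]) -> bool:
    """True if every validator still has at least one sentry that has not failed."""
    down = set(failed)
    return all(len(peers - down) > 0 for peers in links.values())


def tolerates(links: Dict[str, Set[str]], sentries: Iterable[str], n_failures: int) -> bool:
    """True if the topology survives every combination of n_failures sentry failures."""
    return all(
        all_validators_reachable(links, combo)
        for combo in combinations(list(sentries), n_failures)
    )


print(tolerates(LINKS, SENTRIES, 2))  # any 2 of 3 sentries down: each validator keeps 1 link
print(tolerates(LINKS, SENTRIES, 3))  # all sentries down: the validators are isolated
```

On paper the topology tolerates the loss of any 2 sentries, which is why a validator stalling after the loss of a single sentry is unexpected.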
We monitor the nodes by following the log files in a Linux terminal
(journalctl -fu bor)
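Watching the logs by hand makes it easy to miss the moment a validator stops progressing. A minimal sketch of an automated check follows, assuming the bor node has its HTTP JSON-RPC endpoint enabled and answers the standard Ethereum `eth_blockNumber` method. The URL, the polling interval, the 30-second threshold and the names `StallDetector`, `fetch_block_number` and `watch` are all assumptions made for illustration, not part of our setup.

```python
import json
import time
import urllib.request
from typing import Optional


def fetch_block_number(rpc_url: str, timeout: float = 5.0) -> int:
    """Ask a node for its current block height via the eth_blockNumber JSON-RPC call."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    request = urllib.request.Request(
        rpc_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The result is a hex string such as "0x1b4".
        return int(json.load(response)["result"], 16)


class StallDetector:
    """Reports a stall when the block height has not increased for max_idle_seconds."""

    def __init__(self, max_idle_seconds: float) -> None:
        self.max_idle_seconds = max_idle_seconds
        self.last_height: Optional[int] = None
        self.last_progress: float = 0.0

    def observe(self, height: int, now: float) -> bool:
        """Record one observation; return True if the node looks stalled."""
        if self.last_height is None or height > self.last_height:
            self.last_height = height
            self.last_progress = now
            return False
        return now - self.last_progress >= self.max_idle_seconds


def watch(rpc_url: str = "http://127.0.0.1:8545",
          poll_seconds: float = 5.0,
          max_idle_seconds: float = 30.0) -> None:
    """Poll one node forever and print a warning while it is stalled."""
    detector = StallDetector(max_idle_seconds)
    while True:
        height = fetch_block_number(rpc_url)
        if detector.observe(height, time.monotonic()):
            print(f"no new block since height {height}")
        time.sleep(poll_seconds)
```

Running `watch()` against each validator and each sentry would show which node stops first and at which height, which is the information we currently read off the logs by eye.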
1) Failure on graceful shutdown
While gracefully shutting down 1 sentry:
When a sentry is shut down gracefully, the validator(s) can get stuck immediately or soon afterwards. It does not happen every time, but when a validator stops, it simply stops progressing and is NOT able to recover from the absence of that one sentry, EVEN though the same validator is still connected to 2 healthy sentries.
The remedy: restart bor on the validator. This is of course NOT a real solution, since the validator should recover from a disconnected or faulty sentry by itself.
1a) Failure on Sentry slow-down
If a Sentry falls behind for whatever reason, the same effect can be noticed.
In the bor monitor (bor.vitwit…) it is easy to see how strictly a validator follows a sentry.
A validator connected to 3 sentries can follow the one sentry that is falling behind the other 2, at exactly the same time and block numbers.
The validator should instead follow the fastest (most progressed) sentry.
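The selection rule we expect can be sketched as follows. This illustrates the expected behaviour only and is not bor's actual sync logic; the sentry names, the block heights and the lag threshold are made-up example values.

```python
from typing import Dict, List


def most_progressed(heads: Dict[str, int]) -> str:
    """Return the peer reporting the highest block height."""
    return max(heads, key=lambda name: heads[name])


def lagging(heads: Dict[str, int], max_lag_blocks: int) -> List[str]:
    """Return the peers that are more than max_lag_blocks behind the best peer."""
    best = max(heads.values())
    return sorted(name for name, height in heads.items() if best - height > max_lag_blocks)


# Example values: sentry-3 has fallen 52 blocks behind the best sentry.
heads = {"sentry-1": 1000, "sentry-2": 1002, "sentry-3": 950}
print(most_progressed(heads))              # sentry-2
print(lagging(heads, max_lag_blocks=10))   # ['sentry-3']
```

Under this rule a single slow sentry would simply be ignored until it catches up, instead of dragging the validator down to its block height.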
The symptoms are not limited to the latest version. The behaviour has been present since we started validating for Polygon 2 years ago.
** Looking for connections
Remarkable: we notice that bor is looking for connections during graceful shutdown. Although the node has 200 connections, these seem to be dropped instantly at graceful shutdown, after which bor starts looking for connections again.
Are these the same connections, or do they serve a different function?
Looking for connections can take as much as a minute, slowing down the restart process.
The dangerous way:
Adding more connections between the nodes makes the behaviour show up earlier and more quickly.
The setup in Figure b should be very robust: the quickest node leads the other nodes.
The result is the opposite: all nodes are affected if 1 node disengages or falls behind.
Concept: Abstract process schema