Connection inconsistencies | Bug

Bor 0.3.3: connection inconsistencies (not limited to this version)

While building a new Polygon server environment, we experienced the following behaviour:

One of several sentries can hold up the validator nodes, partly or altogether.

Figure A:

2 validators served by 3 sentries

The sentries are isolated from each other (no connections between them); each of them has 200 connections to the outside world

Both validators have 3 connections, 1 to each of the sentries

Sentries and validators run on a setup of 4 physical servers. All nodes are Ubuntu 22.04 virtual machines in Linux KVM/VMM. The network is a local LAN (10 Gbit). Connections are configured to go via the physical switches, not over KVM internal networks.

Validator setups:

Each validator has separate VMs for heimdall and bor

Sentry setups:
2 sentries have separate VMs for heimdall and bor
1 sentry has a combined heimdall-bor VM
The validators are on different physical servers, and the sentries are likewise spread over separate physical servers. 1 server holds both a sentry and a validator.

The behaviour is not limited to the above configuration. In our former configuration we noticed the same behaviour over the last 2 years.

Goal:
The configuration is meant to be resilient and strong against failure in one of the components.
Misbehaviour in 1 of the components should be isolated to the component itself.
If 2 sentries fail or fall behind, the validators should be fine.

We monitor the nodes by following the log files in a Linux terminal
(journalctl -fu bor)
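
Following the logs by eye could be complemented by comparing block heights across nodes. The sketch below is only an illustration. It assumes each bor node has its HTTP JSON-RPC endpoint enabled and answers the standard Ethereum `eth_blockNumber` call; the node names and URLs are hypothetical and not taken from our setup.

```python
import json
import urllib.request


def parse_block_number(response_body: str) -> int:
    """Extract the block height from an eth_blockNumber JSON-RPC reply."""
    reply = json.loads(response_body)
    # The result field is a hex string such as "0x1b4".
    return int(reply["result"], 16)


def fetch_height(url: str, timeout: float = 5.0) -> int:
    """Ask one node for its current block height (assumes RPC is reachable)."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_block_number(response.read().decode())


def lag_report(heights: dict[str, int]) -> dict[str, int]:
    """Return how many blocks each node is behind the most progressed node."""
    best = max(heights.values())
    return {name: best - height for name, height in heights.items()}


# Hypothetical usage (URLs are placeholders):
# heights = {name: fetch_height(url) for name, url in {
#     "sentry-1": "http://10.0.0.11:8545",
#     "validator-1": "http://10.0.0.21:8545",
# }.items()}
# print(lag_report(heights))
```

A validator that reports the same lag as one slow sentry, while the other sentries report a lag of 0, would match the behaviour described in this report.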

1) Failure on graceful shutdown
Scenario: 1 sentry is shut down gracefully.

When a sentry is shut down gracefully, the validator(s) can get stuck immediately or soon afterwards. This does not happen every time, but when it does, the validator simply stops progressing and is NOT able to recover from the absence of that 1 sentry, EVEN though the same validator is still connected to 2 healthy sentries.
The remedy: restart bor on the validator. This is of course NOT a solution, since the validator should recover from a disconnected or faulty sentry by itself.
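
Until the cause is found, the stuck state could at least be detected automatically instead of by watching the logs. The following is a minimal sketch under stated assumptions: block heights of the validator and of its most progressed sentry are sampled periodically by some external means, and the `Sample` type, the function name and the 60 second window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float          # seconds since some fixed start
    validator_height: int     # block height reported by the validator
    best_sentry_height: int   # highest block height among the sentries


def validator_is_stuck(samples: list[Sample], window_seconds: float) -> bool:
    """True if the validator did not advance over the window while a sentry did.

    Assumes samples are ordered by timestamp, oldest first.
    """
    if not samples:
        return False
    newest = samples[-1]
    # Samples that are at least window_seconds older than the newest one.
    older = [s for s in samples if newest.timestamp - s.timestamp >= window_seconds]
    if not older:
        return False  # not enough history yet
    reference = older[-1]
    validator_frozen = newest.validator_height == reference.validator_height
    sentries_advanced = newest.best_sentry_height > reference.best_sentry_height
    return validator_frozen and sentries_advanced


# Validator frozen at 100 while the sentries moved from 100 to 125.
history = [Sample(0, 100, 100), Sample(30, 100, 112), Sample(60, 100, 125)]
print(validator_is_stuck(history, window_seconds=60))
```

Requiring that the sentries advanced avoids a false alarm when the whole chain pauses; only the case where healthy sentries move on and the validator does not is flagged.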

1a) Failure on Sentry slow-down

If a Sentry falls behind for whatever reason, the same effect can be noticed.
In the bor monitor (bor.vitwit…) it is easy to see how strictly a validator follows a sentry.
A validator connected to 3 sentries can follow the one sentry that is falling behind the other 2, at exactly the same times and block numbers.

Expected behaviour:
The validator should follow the fastest (most progressed) sentry
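
To make the expectation concrete, here is a small sketch of the selection policy we have in mind. It only illustrates the expected outcome and does not describe how bor actually chooses which peer to sync from; the function name and the sentry names are hypothetical, and `None` stands for a sentry that is down or unreachable.

```python
from typing import Optional


def pick_sync_target(sentry_heights: dict[str, Optional[int]]) -> Optional[str]:
    """Return the name of the most progressed reachable sentry, or None."""
    # Ignore sentries that are shut down or not answering.
    reachable = {
        name: height
        for name, height in sentry_heights.items()
        if height is not None
    }
    if not reachable:
        return None
    # On equal heights, the first sentry listed wins.
    return max(reachable, key=reachable.get)


# One sentry lags, one is shut down: the validator should follow sentry-1.
print(pick_sync_target({"sentry-1": 500, "sentry-2": 497, "sentry-3": None}))
```

With this policy a single slow or stopped sentry never determines the validator's progress, and the validator keeps going as long as 1 of the 3 sentries is healthy.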

The symptoms are not limited to the latest version. The behaviour has been present since we started validating for Polygon 2 years ago.

( ** Looking for connections
Remarkable: we notice that bor is looking for connections during graceful shutdown. Although the node has 200 connections, these seem to be dropped instantly at graceful shutdown, after which bor starts looking for connections again.
Are these the same connections, or do they serve a different function?
Looking for connections can take as much as a minute, slowing down the process of a restart.
)

Figure B:
The dangerous way:
More connections between the nodes make the behaviour show up earlier and quicker.
The configuration in Figure B should lead to a very robust setup, in which the quickest node leads the other nodes.
The result is the opposite: all nodes are affected if 1 node disengages or falls behind.

Concept: Abstract process schema