In a distributed system, we cannot believe that a server is down just because another server says so.
Better solution is to use decentralized failure detection methods like gossip protocol.
Working of gossip protocol
Each node maintains a node membership list, which contains member IDs and heartbeat counters.
Each node periodically increments its heartbeat counter
Each node periodically sends heartbeats to a set of random nodes, which in turn propagate to another set of nodes
Once nodes receives heartbeats, membership list is updated to the latest info.
If the heartbeat has not increased for more than predefined periods, the member (node) is considered to be “offline”.
Demonstration/Example
Node s0 maintains a node membership list shown on the left side.
Node s0 notices that the heartbeat counter of s2 has not increased for a long time.
Node s0 sends heartbeats that include the info of s2 to a set of random nodes. Once other nodes confirm that s2‘s heartbeat counter has not been updated for a long time, node s2 is marked as “down”, and this information is propagated to other nodes.