Fault-tolerant Distributed Systems with Diagnostics Algorithms

To provide consistent actions in distributed systems with faulty nodes, the Byzantine agreement protocol (algorithm) is widely used. When a message exchange scheme without authentication is used, the Byzantine agreement algorithm reaches agreement only if the number of faulty nodes is less than 1/3 of the total number. The proposed algorithms, based on diagnostic procedures, are used to reach an agreement in distributed models with 2k + 2 nodes and no more than k faulty nodes. The hierarchical diagnostic procedures make it possible to vary the hardware and software overhead according to the required level of fault tolerance.


INTRODUCTION
The Byzantine Generals problem involves reaching an agreement among n nodes of a distributed system, some of which may be faulty. It can be stated as follows: given a set of n nodes that send messages to one another, find an algorithm by which one of the nodes, the general, can transmit a message a to all other nodes such that:
a. If the general is nonfaulty, then every nonfaulty node gets the message a.
b. If nodes i and j are nonfaulty, then both get the same message.
Nonfaulty nodes are assumed to correctly follow their algorithm, but faulty nodes may do anything.
To provide consistent actions in distributed systems with faulty nodes, the Byzantine agreement protocol (algorithm) is widely used. When a message exchange scheme without authentication is used, the Byzantine agreement algorithm reaches agreement only if the number of faulty nodes is less than 1/3 of the total number. Byzantine agreement is a powerful method: the failure model it considers is the most general one, so if we can handle it satisfactorily, we can be sure that most types of failures can be masked. Reaching an agreement when the number of faulty nodes reaches or exceeds one third of the total is an interesting open problem.
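The gap between the two fault bounds can be made concrete with a little arithmetic. The sketch below compares the classical unauthenticated bound (n >= 3m + 1 for m faulty nodes) with the configuration of 2k + 2 nodes and at most k faulty nodes considered in this paper. The function names are hypothetical, and rounding down for node counts that are not exactly of the form 2k + 2 is an assumption made for illustration only.

```python
def classic_max_faulty(n: int) -> int:
    """Largest m such that n >= 3m + 1 (unauthenticated Byzantine agreement)."""
    return max(0, (n - 1) // 3)


def diagnostic_max_faulty(n: int) -> int:
    """Largest k such that n >= 2k + 2 (the configuration studied here).

    Assumption: for n not of the form 2k + 2 we simply round down.
    """
    return max(0, (n - 2) // 2)


if __name__ == "__main__":
    # Tabulate both bounds for systems of exactly 2k + 2 nodes.
    for k in (1, 2, 3):
        n = 2 * k + 2
        print(n, classic_max_faulty(n), diagnostic_max_faulty(n))
```

For the smallest system (4 nodes) the two bounds coincide; the difference appears from 6 nodes onward.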

DISTRIBUTED MODEL
Let us consider a distributed system of 2k + 2 nodes, of which no more than k are faulty. Let node i be the general. Node i sends a message to each of the other nodes (the lieutenants). The lieutenants then exchange the messages received from the general. After the analysis performed by each lieutenant, the nodes that do not meet the loyalty condition (that is, nodes that sent contradictory messages to different lieutenants) are classified as traitors. The remaining lieutenants are loyal.
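The loyalty condition can be illustrated with a minimal sketch. It is not the paper's algorithm 1: it assumes a simplified global view in which the message each sender delivered to each receiver is known exactly, whereas a real lieutenant only has its own received messages plus the relays of others. The name `find_equivocators` and the record layout are hypothetical.

```python
from typing import Dict, Hashable, Set


def find_equivocators(sent: Dict[str, Dict[str, Hashable]]) -> Set[str]:
    """Return the senders that delivered different values to different receivers.

    `sent[s][r]` is the value node s sent to node r (assumed to be known
    exactly, which is a simplification of the real message exchange).
    """
    return {
        sender
        for sender, by_receiver in sent.items()
        if len(set(by_receiver.values())) > 1
    }


# Example with 2k + 2 = 4 nodes (k = 1): general G and lieutenants L1..L3.
sent = {
    "G": {"L1": "attack", "L2": "attack", "L3": "attack"},
    "L1": {"L2": "attack", "L3": "attack"},
    "L2": {"L1": "attack", "L3": "retreat"},  # contradictory relays
    "L3": {"L1": "attack", "L2": "attack"},
}
traitors = find_equivocators(sent)
loyal = set(sent) - traitors
```

In this example only L2 violates the condition, so the general and the other two lieutenants are treated as loyal.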

DIAGNOSTICS ALGORITHMS
When the diagnostics of the nodes performed by each loyal lieutenant leads to a unique solution concerning the traitors, an agreement between the general and the loyal lieutenants can be achieved. Algorithm 1 should be performed by each node of the distributed system. It can easily be shown that applying algorithm 1 may yield several solutions, which raises the problem of choosing the right one. The following theorem has been proved:
Theorem 1: The right solution is always present among the solutions given by cond_sub_M p.
Corollary: If cond_sub_M p contains only one solution, that solution is correct.
If the class of faults is broader and includes malicious node behavior, algorithm 2 can be applied. Algorithm 2 consists of several stages, and each stage includes several phases. We assume that there is a mechanism for assigning the node that performs the diagnostics of the distributed system. Such a node is referred to as the tester. A different tester is assigned at each stage.

RESULTS
As mentioned above, when a message exchange scheme without authentication is used, the Byzantine agreement algorithm reaches agreement only if the number of faulty nodes is less than 1/3 of the total number. We are able to overcome this limitation if the faulty nodes are diagnosed and their messages are then excluded from consideration when reaching an agreement. The suggested algorithms lead to an agreement in distributed systems with 2k + 2 nodes, where the number of faulty nodes does not exceed k.
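The idea of excluding diagnosed nodes before agreeing can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the set of diagnosed nodes is taken as given, the decision rule is a plain majority over the remaining reports, and the name `decide` is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Set


def decide(
    reports: Dict[str, Hashable], diagnosed_faulty: Set[str]
) -> Optional[Hashable]:
    """Pick the most common value among reports from nodes not diagnosed as faulty.

    Returns None when no trusted report remains. Ties are broken arbitrarily
    in this sketch (first value encountered wins).
    """
    trusted = [
        value for node, value in reports.items() if node not in diagnosed_faulty
    ]
    if not trusted:
        return None
    return Counter(trusted).most_common(1)[0][0]


# View of one loyal lieutenant in a 6-node system (k = 2):
# values reported by the general G and by lieutenants L1..L4.
reports = {"G": "a", "L1": "b", "L2": "b", "L3": "a", "L4": "a"}
decision = decide(reports, diagnosed_faulty={"L1", "L2"})
```

Once the messages of L1 and L2 are discarded, every remaining report carries the same value, so the decision no longer depends on how many faulty nodes there were.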

CONCLUSION
Multilevel diagnostics algorithms for distributed systems are suggested. The most attractive feature of the diagnostics is that the power of the diagnostic procedures can be increased adaptively, depending on the class of faults. The algorithms suggested in the paper lead to an agreement in distributed systems with 2k + 2 nodes, where the number of faulty nodes does not exceed k. Algorithm 1 is applied first. It appears to be rather effective in the case of hardware faults. If the class of faults is broader and includes malicious node behavior, algorithm 2 is used.