Reliability Evaluation of Distributed Computer Systems Subject to Imperfect Coverage and Dependent Common-Cause Failures

Imperfect coverage (IPC) occurs when a malicious component failure causes extensive damage due to inadequate fault detection, fault location or fault recovery. Common-cause failures (CCF) are multiple dependent component failures within a system due to a shared root cause. Both IPC and CCF can exist in distributed computer systems (DCS), can contribute significantly to the overall system unreliability, and complicate the reliability analysis. In this study, we propose an efficient approach to the reliability analysis of DCS with both IPC and CCF. The proposed methodology decouples the effects of IPC and CCF from the combinatorics of the solution, so that the computationally efficient binary decision diagram (BDD) based method can be applied to the reliability analysis of DCS. We provide a concrete analysis of an example DCS to illustrate the application and advantages of our approach. Because it considers IPC and CCF, our approach can evaluate a wider class of DCS than existing approaches. Owing to the nature of the BDD and the separation of IPC and CCF from the solution combinatorics, our approach has high computational efficiency and is easy to implement, which means that it can be readily applied to the accurate reliability analysis of large-scale DCS subject to IPC and CCF. DCS without IPC or CCF are special cases of our approach.


INTRODUCTION
A distributed computer system (DCS) is a collection of interconnected independent computers (hosts) that appears to its users as a single coherent system [1] . DCS provide an efficient way to achieve fault-tolerance and share system resources such as processing elements, memory modules, data files, and so on. A successful execution of a distributed program usually requires one or more of the resources that reside on multiple hosts at different geographic sites of the DCS.
It is possible that some faults of hosts or communication links are not adequately detected and located, so that the distributed program cannot be executed successfully despite the presence of adequate redundancies (other operational hosts and links). This phenomenon is known as imperfect coverage (IPC) [2]. IPC introduces additional failure modes that must be considered for accurate reliability analysis of DCS. In other words, the analysis must allow multiple failure modes including operational (not failed), failed covered, and failed uncovered, rather than the traditional binary designation of operational and failed. This consideration poses unique challenges to existing analysis methods. Because failure to consider IPC in the reliability analysis leads to overestimated system reliability, considerable research has been performed on IPC in the reliability analysis of fault-tolerant systems [2][3][4][5][6][7], but only a few of these methods [5,7] are applicable to DCS, and their complexity can increase rapidly with the size of the DCS, i.e., the number of hosts and links.
The challenges increase when common-cause failures are incorporated in the model. Common-cause failures (CCF) are multiple dependent component failures within a system that are a direct result of a common cause (CC) or a shared root cause [8], such as extreme environmental conditions or operation and maintenance errors. Examples abound in the real world: sabotage, a lightning strike, or a power outage can cause the simultaneous failure of numerous components in a DCS. Many reliability studies have shown that CCF increase a system's joint failure probabilities and thus contribute significantly to the overall unreliability of systems subject to CCF [9]. Therefore, failure to consider CCF in the reliability analysis of such systems leads to underestimated system unreliability. Considerable research effort has been expended on CCF in the reliability modeling and analysis of computer-based systems. However, the existing CCF models are mainly applicable to non-DCS systems, and they have various limitations, such as being concerned with a specific system structure [10][11][12][13]; being applicable only to systems with exponential time-to-failure distributions [14][15][16]; limiting analysis to components affected by at most one common cause, i.e., components belonging to at most a single common-cause group (CCG) [9,17]; having a single CC that affects all components of a system [12,18]; or defining CC as being statistically independent or mutually exclusive. In this study, we seek to address some of these limitations in developing a model for the reliability analysis of DCS subject to CCF by allowing for multiple CC that can affect different subsets of system components and that can be statistically dependent.
As discussed above, a great deal of work has been done to address IPC or CCF separately in system reliability analysis. To the best of our knowledge, however, only a few studies [12,18] have considered both IPC and CCF in solving reliability problems. Moreover, the existing methods did not consider IPC and CCF in a DCS, and they share the restrictive assumption that a single elementary CC leads to simultaneous failures of all components of a system. In this study we relax this restriction by utilizing our generalized CCF model for DCS.
We also propose a separable and efficient reduced ordered binary decision diagram (ROBDD) based approach to the reliability analysis of DCS with both IPC and dependent CCF.
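In this study we use the following acronyms, and we assume the singular and plural of an acronym are always spelled the same:
* BDD: binary decision diagram
* CC: common cause
* CCE: common-cause event
* CCF: common-cause failure
* CCG: common-cause group
* DCS: distributed computer system
* DPR: distributed program reliability
* DPUR: distributed program unreliability
* DSR: distributed system reliability
* FST: file spanning tree
* IPC: imperfect coverage
* IPCM: imperfect coverage model
* MFST: minimal file spanning tree
* ROBDD: reduced ordered binary decision diagram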

PROBLEM STATEMENT
This study considers the problem of assessing distributed program reliability for distributed computer systems. Distributed program reliability (DPR) is defined as the probability that at least one minimal file spanning tree (MFST) of a distributed program is operational within the time interval (0, t) [7] . A file-spanning tree (FST) is defined as a spanning tree that connects the root node, i.e., the host running the program under consideration to other nodes such that its vertices hold all the required resources for successful execution of the program. An FST is an MFST if there exists no other FST that is a subset of it. An MFST is said to be operational when all its components are operational [7,19] . The approach developed in this study is also applicable to evaluate distributed system reliability (DSR), which is defined as the probability that at least one MFST for all programs is operational [7] .
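The definition above can be written compactly. The notation here is an assumed shorthand, not taken from the source: let $W_j$ be the event that every component of $MFST_j$ is operational throughout $(0, t)$, and let $k$ be the number of MFST of the program. The complement of DPR is the distributed program unreliability (DPUR).

```latex
DPR = \Pr\Big(\bigcup_{j=1}^{k} W_j\Big), \qquad
DPUR = 1 - DPR = \Pr\Big(\bigcap_{j=1}^{k} \overline{W_j}\Big)
```

Because different MFST generally share nodes and links, the events $W_j$ are not independent, which is why a combinatorial method such as a BDD is needed rather than a simple product of probabilities.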

Assumptions
* The DCS is modeled by a probabilistic undirected graph G(V,E), in which vertices represent the hosts and edges represent the communication links [20]. By probabilistic we mean that failure probabilities are assigned to each node and link in the graph.
* Links or nodes in DCS fail s-independently with known probabilities.
* The failure probability for each link or node is given as a fixed probability for a given mission time or in terms of a lifetime distribution.
* The imperfect coverage behavior is described using Dugan et al.'s imperfect coverage model (IPCM, Fig. 1) [2].

Note that our CCF model is more general and thus more practical than the existing CCF models, which usually require some restrictive assumptions.

Problem inputs:
The following lists all the required input parameters for solving the problem:
* DCS configuration in the probabilistic graph
* Mission time t
* Failure parameters of each link and each node
* Fault coverage factors ($r_i$, $c_i$, $s_i$) of each link and each node
* Statistical relationship between elementary CC: s-independent or s-dependent
* Probabilities of elementary CC occurring, or conditional probabilities of a CC occurring conditioned on the occurrence of another CC when they are s-dependent
The accurate reliability analysis of a fault-tolerant DCS depends heavily on realistic estimates of its input parameters. Fault injection [21,22] is a commonly used technique for estimating the component failure parameters and fault coverage factors. The occurrence probabilities of CC and their statistical relationship can usually be obtained from sufficient weather data or equipment data [23]. In this study, we consider them as given input parameters of the problem.

AN ILLUSTRATIVE EXAMPLE
We use a simple example (adapted from [7]) to illustrate the proposed methodology for DCS reliability analysis. Figure 2 shows the probabilistic graph of the example DCS.
[Figure 2: Example DCS (adapted from [7]; $FN_i$ denotes the set of files required by a program $P_i$; system resources are abstracted into files).]
The [...] same coverage factors; our methodology is applicable to arbitrary link (node) failure distributions and coverage factors. In addition, the DCS is subject to CCF from two independent common causes, earthquakes (denoted by $CC_1$) and power failures (denoted by $CC_2$). An earthquake of sufficient intensity would cause links $e_2$ and $e_5$ and node $n_4$ to fail (i.e., $CCG_1 = \{e_2, e_5, n_4\}$); a power failure would cause nodes $n_1$ and $n_2$ to fail (i.e., $CCG_2 = \{n_1, n_2\}$). We assume that the following information can be extracted from the available weather and power data: the probability of an earthquake is $P_{CC_1} = 0.001$, and the probability of a power failure is $P_{CC_2} = 0.003$. The problem is to find the DPR for program $P_1$ in the example DCS for a mission time of t = 1000 hours. In the following sections, the example is analyzed step by step to illustrate our method.

SEPARABLE AND EFFICIENT DPR ANALYSIS
We present our separable ROBDD-based approach for analyzing the DPR of DCS with both IPC and CCF. The methodology is to separate both IPC and CCF, in two phases, from the combinatorics of the solution based on the "Total Probability Theorem". The resulting reduced DCS reliability problems are free of both CCF and IPC and can be solved using computationally efficient ROBDD methods. Finally, the results of all reduced sub-problems are integrated to obtain the DPR measure of the entire DCS.
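Separating IPC: Consider two mutually exclusive and complete events: $E_1$ (one or more components, including links and nodes in the DCS, fail uncovered) and $E_2$ (no component experiences an uncovered failure). Let $E$ denote the failure of a given distributed program, whose occurrence probability is the distributed program unreliability (DPUR). According to the "Total Probability Theorem", and because an uncovered failure of any component fails the program, i.e., $\Pr(E|E_1) = 1$, we have:

$$DPUR = \Pr(E) = \Pr(E|E_1)\Pr(E_1) + \Pr(E|E_2)\Pr(E_2) = 1 - \Pr(E_2) + \Pr(E_2)\Pr(E|E_2) \qquad (1)$$

Under the IPCM, each link/node $i$ is in one of three states at time $t$, with probabilities

$$\Pr(i \text{ operational}) = 1 - q_i(t) + r_i q_i(t), \quad \Pr(i \text{ failed covered}) = c_i q_i(t), \quad \Pr(i \text{ failed uncovered}) = s_i q_i(t) \qquad (2)$$

where $q_i(t)$ is the failure probability of the link/node $i$ within the time interval (0, t), which can be obtained directly or calculated from the input failure parameters, and $r_i$, $c_i$, $s_i$ are the fault coverage factors given as input parameters. Based on Eq. (2), we can calculate $\Pr(E_2)$ in Eq. (1) as

$$\Pr(E_2) = \prod_i \big(1 - s_i q_i(t)\big)$$

$\Pr(E|E_2)$ in Eq. (1) is the unreliability of the corresponding perfect-coverage DCS that ignores IPC. It should be evaluated given that no link or node experiences an uncovered failure. Therefore, before calculating $\Pr(E|E_2)$ we modify each node/link's failure function $q_i(t)$ to a conditional probability $\tilde{q}_i(t)$, conditioned on no uncovered failure occurring:

$$\tilde{q}_i(t) = \frac{c_i q_i(t)}{1 - s_i q_i(t)}$$

which evaluates to 9.8995e-5 for the nodes and 1.8998e-4 for the links of the example DCS. Using these modified component failure probabilities, we can calculate $\Pr(E|E_2)$ by any approach that ignores IPC but considers CCF.

With $m$ elementary CC, the common-cause event (CCE) space consists of the $2^m$ mutually exclusive combinations of occurrence and non-occurrence of the elementary CC. The set of components affected by $CCE_i$, denoted $A_{CCE_i}$, is simply the union of those CCG whose corresponding elementary common causes occur, as shown in the second column of Table 1.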
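The IPC separation step can be sketched in a few lines of Python. This is a minimal illustration, assuming s-independent components and the standard IPCM relations $\Pr(E_2) = \prod_i (1 - s_i q_i)$ and $\tilde{q}_i = c_i q_i / (1 - s_i q_i)$. The function names and the numeric inputs are hypothetical and are not the parameters of the example DCS.

```python
import math


def separate_ipc(q, c, s):
    """Return (Pr(E2), modified failure probabilities).

    q, c, s map a component name to its failure probability q_i(t),
    its covered-failure factor c_i and its uncovered-failure factor s_i.
    Components are assumed to fail s-independently.
    """
    # Pr(E2): no component experiences an uncovered failure.
    pr_e2 = math.prod(1.0 - s[k] * q[k] for k in q)
    # Failure probability conditioned on "no uncovered failure".
    q_mod = {k: c[k] * q[k] / (1.0 - s[k] * q[k]) for k in q}
    return pr_e2, q_mod


def combine(pr_e2, pr_e_given_e2):
    """Eq. (1): DPUR = 1 - Pr(E2) + Pr(E2) * Pr(E|E2)."""
    return 1.0 - pr_e2 + pr_e2 * pr_e_given_e2


# Hypothetical inputs for one node and one link (illustration only).
q = {"n1": 0.01, "e1": 0.02}
c = {"n1": 0.9, "e1": 0.8}
s = {"n1": 0.1, "e1": 0.2}

pr_e2, q_mod = separate_ipc(q, c, s)
print(pr_e2, q_mod)
```

Any solver that ignores IPC can then be run on `q_mod`, and its result passed to `combine` together with `pr_e2`.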
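Based on the CCE space we developed and the Total Probability Theorem, we calculate $\Pr(E|E_2)$ in Eq. (1) as:

$$\Pr(E|E_2) = \sum_i \big[ DPUR_i \cdot \Pr(CCE_i) \big] \qquad (3)$$

where $DPUR_i$ is the conditional program unreliability given that $CCE_i$ occurs and no component fails uncovered. In evaluating $DPUR_i$, no consideration of IPC and CCF is required. Since both IPC and CCF are out of the picture, traditional DCS reliability analysis approaches that ignore both IPC and CCF can now be applied to solve those reduced reliability problems $DPUR_i$. In the following, we present an efficient ROBDD-based approach for solving $DPUR_i$.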
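For the example DCS, the CCE space can be enumerated as sketched below. The sketch assumes that the elementary CC are s-independent, as in the example; s-dependent CC would need conditional probabilities in place of the simple products. The function name and data layout are hypothetical, and the bitmask ordering (none, $CC_1$ only, $CC_2$ only, both) is chosen to match the indexing of the reduced problems discussed for the example.

```python
def cce_space(cc_prob, ccg):
    """Enumerate all 2^m common-cause events for s-independent CC.

    cc_prob maps a CC name to its occurrence probability;
    ccg maps the same name to the set of components it fails.
    Returns a list of (Pr(CCE_i), affected component set).
    """
    names = list(cc_prob)
    space = []
    for mask in range(2 ** len(names)):
        prob, affected = 1.0, set()
        for j, name in enumerate(names):
            if (mask >> j) & 1:            # this elementary CC occurs
                prob *= cc_prob[name]
                affected |= ccg[name]      # union of the affected CCG
            else:                          # this elementary CC does not occur
                prob *= 1.0 - cc_prob[name]
        space.append((prob, frozenset(affected)))
    return space


# Values taken from the example: earthquake (CC1) and power failure (CC2).
CC_PROB = {"CC1": 0.001, "CC2": 0.003}
CCG = {"CC1": {"e2", "e5", "n4"}, "CC2": {"n1", "n2"}}

space = cce_space(CC_PROB, CCG)
for i, (prob, affected) in enumerate(space, start=1):
    print(f"CCE_{i}: Pr = {prob:.6f}, affected = {sorted(affected)}")
```

The probabilities of the events sum to one, which is a quick sanity check that the enumeration is complete and mutually exclusive.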
Solving reduced problems $DPUR_i$: It has been shown by many studies that in most cases, ROBDD-based algorithms require less memory compared with other methods and can perform exact and efficient calculation for large system reliabilities [2,7,24]. In the following we present a four-step ROBDD-based approach to the evaluation of $DPUR_i$ in Eq. (3).
Step 1: Obtain the set of MFST using the algorithm based on a breadth-first search, described in [20] .
Step 2: Order all the DCS components including nodes and links using a good variable ordering heuristic. A heuristic is good in the sense that it yields a compact BDD [25] .
Step 3: Generate the ROBDD for the failure function of the distributed program from the MFST using an algorithm similar to the one described in [7].
Step 4: Evaluate $DPUR_i$ recursively from the ROBDD using the modified failure probability $\tilde{q}_i(t)$. The evaluation algorithm is the same as the traditional BDD evaluation [6].
For the example DCS, the set of MFST for program $P_1$ includes: $MFST_1 = \{n_1, e_1, n_2\}$, $MFST_2 = \{n_4, e_5, n_3\}$, $MFST_3 = \{n_1, e_4, n_3, e_3, n_2\}$, $MFST_4 = \{n_4, e_2, n_2, e_3, n_3\}$. We use the ordering $n_1 < n_2 < e_1 < e_4 < n_3 < n_4 < e_5 < e_3 < e_2$ to generate the ROBDDs. There are four reduced problems for the example DCS: $DPUR_i$, i = 1, 2, 3, 4. $DPUR_4$ is simply 1 because when $CCE_4$ occurs, all MFST for program $P_1$ fail. According to the components affected by $CCE_i$ (i.e., $A_{CCE_i}$), the ROBDD of $DPUR_1$ is generated from all four MFST, the ROBDD of $DPUR_2$ is generated from $MFST_1$ and $MFST_3$, and the ROBDD of $DPUR_3$ is generated from $MFST_2$. Figure 3 and Figure 4 show the ROBDD for the first three reduced problems. Evaluating them gives: $DPUR_1$ = 7.6825e-8, $DPUR_2$ = 1.9807e-4 and $DPUR_3$ = 3.8793e-4.
Integrating results: Based on the discussion above, we integrate the results of $DPUR_i$ with $\Pr(CCE_i)$ using Eq. (3) to generate $\Pr(E|E_2)$. Then we integrate $\Pr(E|E_2)$ with $\Pr(E_2)$ using Eq. (1) to complete the distributed program reliability analysis for program $P_1$. Table 2 summarizes the integration process and the results. As a comparison, the distributed program unreliability for program $P_1$ without considering IPC and CCF is 8e-8 for the example DCS, which shows that IPC and CCF contribute significantly to the system unreliability and must be properly considered for the accurate analysis of DCS reliability.
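The evaluation and integration steps can be sketched as follows. This is not an ROBDD implementation: it applies Shannon decomposition over the stated variable ordering with memoization, which is the recursion a BDD evaluation performs, but it does not build or reduce a diagram. It assumes s-independent CC and takes the MFST, the ordering, the modified failure probabilities and the CC probabilities from the text; all function and variable names are hypothetical. With these rounded inputs, the computed $DPUR_i$ should agree with the values reported above to within rounding.

```python
from functools import lru_cache

# MFST of program P1 and the variable ordering, as listed in the text.
MFSTS = [
    frozenset({"n1", "e1", "n2"}),
    frozenset({"n4", "e5", "n3"}),
    frozenset({"n1", "e4", "n3", "e3", "n2"}),
    frozenset({"n4", "e2", "n2", "e3", "n3"}),
]
ORDER = ["n1", "n2", "e1", "e4", "n3", "n4", "e5", "e3", "e2"]
# Modified failure probabilities quoted in the text (nodes, links).
Q = {x: 9.8995e-5 if x.startswith("n") else 1.8998e-4 for x in ORDER}

# (Pr(CCE_i), components failed by CCE_i), assuming s-independent CC.
P1, P2 = 0.001, 0.003
CCG1, CCG2 = frozenset({"e2", "e5", "n4"}), frozenset({"n1", "n2"})
CCES = [
    ((1 - P1) * (1 - P2), frozenset()),
    (P1 * (1 - P2), CCG1),
    ((1 - P1) * P2, CCG2),
    (P1 * P2, CCG1 | CCG2),
]


def reduced_dpur(failed):
    """Probability that every MFST fails, given the components in
    `failed` are already down because of a common-cause event."""
    alive = frozenset(m for m in MFSTS if not (m & failed))

    @lru_cache(maxsize=None)
    def fail_prob(level, trees):
        if not trees:                 # every MFST has lost a component
            return 1.0
        if frozenset() in trees:      # some MFST is fully operational
            return 0.0
        x = ORDER[level]
        x_failed = frozenset(t for t in trees if x not in t)
        x_works = frozenset(t - {x} for t in trees)
        return (Q[x] * fail_prob(level + 1, x_failed)
                + (1 - Q[x]) * fail_prob(level + 1, x_works))

    return fail_prob(0, alive)


# Reduced problems DPUR_1..DPUR_4 and their integration via Eq. (3).
dpur = [reduced_dpur(affected) for _, affected in CCES]
pr_e_given_e2 = sum(p * d for (p, _), d in zip(CCES, dpur))


def program_unreliability(pr_e2, pr_e_given_e2):
    """Eq. (1); pr_e2 must be computed from the coverage factors."""
    return 1.0 - pr_e2 + pr_e2 * pr_e_given_e2


print(dpur, pr_e_given_e2)
```

The final call to `program_unreliability` is left to the reader because it needs $\Pr(E_2)$, which depends on the component failure parameters and coverage factors of the example rather than on the modified probabilities alone.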

Summary of the DCS analysis approach:
Our DPR analysis approach for DCS subject to IPC and CCF first separates the consideration of IPC from the solution combinatorics and then decomposes the resulting simplified problem into a number of reduced problems according to the Total Probability Theorem. The effects of both IPC and CCF are factored out through these two-phase reductions. The reduced problems are solved using an efficient ROBDD-based method. Figure 5 shows a conceptual overview of the proposed separable approach. The advantage of our approach is that it allows reliability engineers to use their favorite software package that ignores both IPC and CCF for computing distributed program reliability, and to adjust the input and output of the program slightly to produce the DPR measure considering both IPC and CCF. As shown through the example, owing to the nature of the ROBDD and the separation of IPC and CCF from the solution combinatorics, our approach has higher computational efficiency and is easier to implement than other potential methods such as Markov chain based methods, which can accommodate IPC and CCF only by expanding the state space and the number of transitions, worsening the state explosion problem.

CONCLUSION
In this study, we presented a separable and efficient ROBDD-based approach for DPR analysis of DCS with IPC and dependent CCF. Our approach enables the analysis of multiple common causes that can affect different subsets of system components and that may be s-dependent. We illustrated the proposed approach with the DPR analysis of a DCS subject to two common causes. The efficiency of our approach means that it can be readily applied to accurate reliability analysis (of both DPR and DSR) for large-scale DCS subject to IPC and CCF.