Determining the Optimal Software Rejuvenation Schedule via Semi-Markov Decision Process

Software rejuvenation is a preventive and proactive maintenance policy that is particularly
useful for counteracting the phenomenon of software aging. In this study we consider an operational
software system with multistage degradation and derive the optimal software rejuvenation policy
minimizing the expected operation cost per unit time in the steady state, via the dynamic programming
approach. In particular, we show analytically that a control-limit type of software rejuvenation policy is
optimal. A numerical example is presented to construct a decision table and to perform a sensitivity
analysis of the cost parameters.


INTRODUCTION
Software faults should ideally have been removed during the debugging phase. Even if a piece of software has been thoroughly tested, it may still have some design faults that are yet to be revealed. Such software faults are called bohrbugs and may exist even in mature software such as commercial operating systems. Even mature software can also be expected to have what are known as heisenbugs [1]. These are bugs that are revealed only during specific collisions of events. For example, a sequence of operations may leave the software in a state that results in an error on the next operation executed. Simply retrying a failed operation or, if the application process has crashed, restarting the process might resolve the problem. Another type of fault observed in software systems is due to the phenomenon of resource exhaustion.
Operating system resources such as swap space and free memory available are progressively depleted due to defects in software such as memory leaks and incomplete cleanup of resources after use. These faults may exist in operating systems, middleware and application software.
When a software application executes continuously for long periods of time, some of the faults cause the software to age due to the error conditions that accrue with time and/or load. Software aging affects the performance of the application and eventually causes it to fail [2][3][4]. Software aging has also been observed in widely used communication software like Internet Explorer, Netscape and xrn as well as commercial operating systems and middleware. A complementary approach to handle software aging and its related transient software failures, called software rejuvenation, is becoming popular [3][4][5]. Software rejuvenation is a preventive and proactive solution that is particularly useful for counteracting the phenomenon of software aging. It involves occasionally stopping the running software, cleaning its internal state and restarting it. Cleaning the internal state of software might involve garbage collection, flushing operating system kernel tables, reinitializing internal data structures and hardware reboot.
In this study we consider an operational software system with multistage degradation and derive the optimal software rejuvenation policy minimizing the expected operation cost per unit time in the steady state, via the dynamic programming approach. This can be considered as an extension of the classical two-step failure models with time-based rejuvenation policies [5][6][7][8].
Vaidyanathan et al. [9] treat a multistep failure model, but do not discuss the optimal software rejuvenation policy from the analytical point of view. We suppose that the state of the software system deteriorates stochastically and is described by a right-skip free continuous-time Markov chain (CTMC) with an absorbing state [10]. For this dynamic operating environment, we formulate a semi-Markov decision process and derive the optimal software rejuvenation policy minimizing the expected operation cost per unit time in the steady state. In particular, we show analytically that a control-limit type of software rejuvenation policy is optimal.
The rest of this study is organized as follows. First, we summarize the related work, describe a multistage degradation software system with a CTMC and define the software rejuvenation scheme with condition-based monitoring. Next, we formulate the cost minimization problem via the familiar semi-Markov decision process and propose an iteration algorithm to derive the optimal software rejuvenation policy which minimizes the expected operation cost per unit time in the steady state. Third, we study the optimality structure carefully and characterize some mathematical properties of the optimal software rejuvenation policy under several parametric assumptions. Numerical examples are presented to construct a decision table and to perform a sensitivity analysis of the cost parameters. Finally, the study is concluded with some remarks.

RELATED WORK
Huang et al. [5] consider a degradation phenomenon as a two-step stochastic process. From the clean state the software system jumps into a degraded state from which two actions are possible: rejuvenation with return to the clean state or transition to the complete failure state. They model a four-state process as a CTMC and derive the steady-state system availability and the expected operation cost per unit time in the steady state. Avritzer and Weyuker [11] discuss aging in telecommunication switching software, where the effect manifests as gradual performance degradation. Garg et al. [7] introduce the idea of periodic rejuvenation (a deterministic interval between successive rejuvenations) into the model of Huang et al. [5] and represent the stochastic behavior by using a Markov regenerative stochastic Petri net. Dohi et al. [6] and Suzuki et al. [8] extend the seminal two-step software degradation models in Huang et al. [5] and Garg et al. [7], respectively, by using semi-Markov processes.
Other studies consider two effects of aging together: crash/hang failures, referred to as hard failures, and soft failures that lead to performance degradation. Pfening et al. [12] model a performance degradation process by the gradual decrease of the processing rate in a non-stationary Markovian queueing system and formulate the problem of determining the optimal software rejuvenation schedule as a Markov decision process. Garg et al. [13] consider a transaction-based software system, which involves arrival and queueing of jobs, and analyze both effects of aging: hard failures that result in unavailability and soft failures that result in performance degradation. Park and Kim [14] carry out the availability analysis for active/standby cluster systems with rejuvenation. Li et al. [15] analyze an aging phenomenon in a real web server application. Liu et al. [16] and Vaidyanathan et al. [17] model a cable modem termination system and a cluster software system with rejuvenation, respectively. Recently, Xie et al. [18] develop a two-level software rejuvenation scheme with service-level rejuvenation and box-level rejuvenation.
Fujio et al. [19] and Okamura et al. [20] also formulate control-limit rejuvenation policies and compare them numerically with the corresponding time-based policies [13,21]. The control-limit policy triggers the software rejuvenation at the time instant when the system state reaches a threshold level, while the time-based policy triggers it at a pre-specified time. As expected intuitively, it is shown in [19,20] that control-limit rejuvenation policies can provide better performance than time-based ones. The main purpose of this study is to prove that the control-limit type of software rejuvenation policy is the best policy among all the Markovian policies in a simple multistage degradation model. In real-time applications, the static models [5][6][7][8] are difficult to use, because they do not allow the decision on whether to trigger the software rejuvenation to be made at an arbitrary time. On the other hand, the dynamic rejuvenation policy in Pfening et al. [12] is useful for triggering the software rejuvenation sequentially while observing the system state, although it does not take account of hard (system) failures. In other words, Pfening et al. [12] represent a performance degradation process by the decreasing processing rate in a Markovian queueing system, whereas we model it by a right-skip free CTMC for a non-transaction-based software system. Our approach in this study can be classified as condition-based preventive maintenance with observation of the system state [22].

Fig. 1: Markovian transition diagram of software degradation level
Consider an operational software system which deteriorates with time. The state of the software system deteriorates stochastically and changes from i to j, where states 0 and s + 1 are the normal (robust) state and the system down state, respectively. Without any loss of generality, the level j is degraded more severely than the level i (< j). Suppose that the state of the software at time t, {N(t), t ≥ 0}, is described by a right-skip free CTMC with state space I = {0, 1, …, s + 1}, in which a transition rate is defined from each state i to each higher state j (see Fig. 1). That is, it is assumed that the system state with degradation level i can make a transition to all the upper levels j (> i). When the system failure occurs, the system is down (j = s + 1) and the recovery operation starts immediately, where the time to complete the recovery operation is an independent and identically distributed (i.i.d.) random variable having the cumulative distribution function (c.d.f.) H_{s+1}(x) and mean 1/ω_{s+1} (> 0). On the other hand, one decides whether to trigger the software rejuvenation at the time instant when the state of the software system changes from i to j (= i + 1, i + 2, …, s). If one decides to continue operation, the state is monitored until the next change of state; otherwise, the software rejuvenation is preventively triggered, where the time to complete the rejuvenation is also an i.i.d. random variable with the c.d.f. H_i(x) and mean 1/ω_i (> 0), depending on the state i (= 0, 1, …, s). Let x_1 (> 0) and x_2 (> 0) be the rejuvenation cost per unit time and the recovery cost per unit time, respectively. In both periods of rejuvenation and recovery operation, the system operation is stopped. Also, it is assumed that the state-dependent cost a_i (> 0) is incurred per unit operation time for i = 0, 1, …, s. Note that the system state can be described by only the index j (0 < j ≤ s + 1).
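The degradation mechanism described above can be illustrated by a minimal simulation sketch. It assumes that the transition rates are stored in a hypothetical nested list gamma, where gamma[i][j] is the rate from state i to a higher state j. The function name sample_degradation_path and all numerical values are illustrative assumptions, not part of the model specification. The sketch samples one path from the normal state 0 to the down state s + 1 when no rejuvenation is triggered.

```python
import random


def sample_degradation_path(gamma, rng):
    """Sample one path of the degradation chain from state 0 until failure.

    gamma[i][j] is an assumed transition rate from state i to state j (> i);
    rows cover the operational states 0..s, so the failure state is s + 1.
    """
    failure = len(gamma)
    state, clock = 0, 0.0
    path = [(clock, state)]
    while state != failure:
        # Right-skip free structure: only higher levels can be reached.
        targets = [j for j in range(state + 1, failure + 1) if gamma[state][j] > 0]
        rates = [gamma[state][j] for j in targets]
        clock += rng.expovariate(sum(rates))            # exponential sojourn time
        state = rng.choices(targets, weights=rates)[0]  # jump proportional to rate
        path.append((clock, state))
    return path


# Hypothetical rates for s = 2 (states 0, 1, 2 operational, state 3 down).
example_gamma = [
    [0.0, 1.0, 0.5, 0.1],
    [0.0, 0.0, 1.5, 0.3],
    [0.0, 0.0, 0.0, 2.0],
]
example_path = sample_degradation_path(example_gamma, random.Random(0))
```

Each entry of the returned path is a pair (jump time, new state). These jump times are the decision epochs at which the rejuvenation decision is made.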
At each time instant when the state changes from i to j, one has an option to choose Action 1 (rejuvenation) or Action 2 (continuation of processing). When the system failure occurs, i.e. the state of system becomes j = s + 1, the recovery operation (Action 3) is taken. Let Q (δ) (i, j) denote the probability that the state changes from i to j under Action δ (= 1, 2, 3). Then it is seen that where the mean rejuvenation time (overhead) is given by where the mean recovery time (overhead) is given by After completing rejuvenation and recovery operations, the state of software system becomes as good as new, i.e. j = 0 in Eqs. (1) and (4) and the same cycle repeats again and again over an infinite time horizon. We define the time interval from the initial point to the completion of rejuvenation or recovery operation whichever occurs first, as one cycle.

SEMI-MARKOV DECISION PROCESS
Observing the state of the software system, we sequentially determine the optimal timing to trigger the software rejuvenation so as to minimize the expected operation cost per unit time in the steady state. Define the following cost component: V From the preliminary above, the Bellman equation based on the principle of optimality [10] is given by It is well known that the software rejuvenation policy satisfying Eq. (7) is the best policy among all the Markovian policies [10]. To solve the above functional equation numerically, we can easily develop the well-known value iteration algorithm for the semi-Markov decision process. Define ε: tolerance level, τ: design parameter satisfying 0 ≤ τ/h_i, τΓ_i, τ/h_{s+1} ≤ 1 for all i (Tijms [10]).
In general, V_n(i) denotes the value function at the n-th iteration. Then, the value iteration algorithm is given in the following: Value Iteration Algorithm: Step 1: Step 2: Step 3: If 0 ≤ B(n) - A(n) ≤ εA(n), then stop the procedure; otherwise, set n := n + 1 and go to Step 1.
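As a minimal sketch of how such a procedure could be coded, the following Python fragment uses the standard data-transformation form of value iteration for semi-Markov decision processes (cf. Tijms [10]), with the same stopping rule as Step 3. Several elements are assumptions made for illustration and are not taken from the algorithm statement: the function names build_smdp and value_iteration, the one-step costs (x_1/ω_i for a rejuvenation, a_i/Γ_i for a sojourn in state i, x_2/ω_{s+1} for a recovery), the choice of τ as 0.9 times the smallest mean transition time, and all numerical values.

```python
def build_smdp(gamma, a, omega, x1, x2):
    """Return {state: {action: (expected cost, expected time, {next: prob})}}.

    Actions follow the text: 1 = rejuvenation, 2 = continuation, 3 = recovery.
    gamma[i][j] is an assumed transition rate from state i to state j (> i).
    """
    s = len(a) - 1                      # operational states 0..s, failure s + 1
    model = {}
    for i in range(s + 1):
        total = sum(gamma[i][i + 1:])   # total outflow rate Gamma_i
        jumps = {j: gamma[i][j] / total
                 for j in range(i + 1, s + 2) if gamma[i][j] > 0}
        model[i] = {
            1: (x1 / omega[i], 1.0 / omega[i], {0: 1.0}),
            2: (a[i] / total, 1.0 / total, jumps),
        }
    model[s + 1] = {3: (x2 / omega[s + 1], 1.0 / omega[s + 1], {0: 1.0})}
    return model


def value_iteration(model, eps=1e-6, max_iter=100_000):
    """Estimate the minimal cost rate and a minimizing action per state."""
    # tau strictly below every mean transition time keeps the transformed
    # chain aperiodic (illustrative choice of the design parameter).
    tau = 0.9 * min(t for acts in model.values() for (_, t, _) in acts.values())
    V = {i: 0.0 for i in model}
    for _ in range(max_iter):
        new_V, policy = {}, {}
        for i, acts in model.items():
            best = None
            for act, (cost, time, probs) in acts.items():
                ratio = tau / time
                q = (cost / time
                     + ratio * sum(p * V[j] for j, p in probs.items())
                     + (1.0 - ratio) * V[i])
                if best is None or q < best:
                    best, policy[i] = q, act
            new_V[i] = best
        diffs = [new_V[i] - V[i] for i in model]
        lower, upper = min(diffs), max(diffs)   # bounds A(n) and B(n)
        V = new_V
        if 0.0 <= upper - lower <= eps * lower:
            break
    return 0.5 * (lower + upper), policy


# Hypothetical instance with s = 1: states 0 and 1 operational, state 2 down.
example_model = build_smdp(
    gamma=[[0.0, 1.0, 0.0], [0.0, 0.0, 2.0]],
    a=[1.0, 2.0], omega=[2.0, 2.0, 1.0], x1=5.0, x2=10.0,
)
example_rate, example_policy = value_iteration(example_model)
```

For this hypothetical instance, rejuvenating on entry to state 1 gives a cost rate of 3.5/1.5 = 7/3, against 12/2.5 = 4.8 for never rejuvenating, so the iteration should settle on continuation in state 0 and rejuvenation in state 1.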
It would be possible to derive the optimal software rejuvenation schedule by applying the above value iteration algorithm, if there exists a unique optimal solution. However, it is worth noting that an analytical approach to characterize the optimal rejuvenation policy, without solving the Bellman equation directly, is possible by making some parametric (but reasonable) assumptions. In the following section, we investigate some mathematical properties for the optimal software rejuvenation policy and prove its optimality.

OPTIMALITY OF CONTROL-LIMIT POLICY
We prove the optimality of the control-limit type of policy. We make the following assumptions: The assumption (A-1) implies that the mean sojourn time in each state decreases as the system deteriorates. The assumption (A-2) seems to be somewhat technical, but is intuitively reasonable. For instance, let f_j be any cost parameter depending on state j. In this case, the expected cost incurred when the system state makes a transition tends to increase as the degradation level i progresses. In the assumption (A-3), one expects in the sense of probability that the recovery time from system failure is strictly greater than the rejuvenation overhead. The assumption (A-4) means that the operation cost increases, but according to (A-5) the advantage of triggering rejuvenation of the software increases gradually, as the software system deteriorates. In the assumption (A-6), we require that the recovery cost is greater than the rejuvenation cost, where both cost parameters, x_1 and x_2, are greater than the expected operation cost per unit time in the steady state, say, z. In fact, if z > x_2 > x_1, the system is down in the steady state with probability one. That is, due to the deductive argument, the assumption (A-6) necessarily has to hold.
We give the main results of this study.

Lemma 4.1: The function V(i) is increasing in i.
Proof: It is evident from (A-3) and , it can be seen that which is due to (A-1), (A-2) and (A-4). Thus, it can be shown that V(i) ≤ V(i+1). From the inductive argument, it can be proved that V(i) ≤ V(i+1) for an arbitrary i.
Proof: From (A-1), (A-2), (A-5) and Lemma 4.1, it is seen that W(i) - M(i) is a monotonically increasing function of i. Hence, the proof is completed.
From Theorem 4.2, the problem reduces to obtaining the optimal control-limit N* + 1 that minimizes the expected operation cost per unit time in the steady state. In fact, this type of rejuvenation policy is essentially the same as the workload-based rejuvenation policy in the transaction-based software system [20]. In other words, even for our non-queueing system framework, the control-limit type of rejuvenation policy is better than any time-based one [13,21].
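Once attention is restricted to control-limit policies, the search for the optimal threshold can be sketched by direct enumeration. The following Python fragment is an illustration under stated assumptions: the function names cycle_cost_and_length and best_control_limit, the rate matrix gamma and all numerical values are hypothetical. The cost of a cycle is taken as the sum of a_i per unit operation time, x_1 per unit rejuvenation time and x_2 per unit recovery time. The argument n plays the role of N + 1 in the notation above, i.e. rejuvenation is triggered on entering any state i ≥ n, and the cost rate of each threshold is the expected cost per cycle divided by the expected cycle length (renewal-reward argument).

```python
def cycle_cost_and_length(n, gamma, a, omega, x1, x2):
    """Expected cost and duration of one cycle when rejuvenating in states i >= n."""
    s = len(a) - 1                       # operational states 0..s, failure s + 1
    cost = [0.0] * (s + 2)               # expected cost from entering i to cycle end
    length = [0.0] * (s + 2)             # expected time from entering i to cycle end
    cost[s + 1] = x2 / omega[s + 1]      # recovery after a failure
    length[s + 1] = 1.0 / omega[s + 1]
    for i in range(s, -1, -1):           # backward pass: jumps only go upward
        if i >= n:                       # rejuvenate on entering state i
            cost[i], length[i] = x1 / omega[i], 1.0 / omega[i]
        else:                            # keep operating until the next jump
            total = sum(gamma[i][i + 1:])
            cost[i] = (a[i] + sum(gamma[i][j] * cost[j]
                                  for j in range(i + 1, s + 2))) / total
            length[i] = (1.0 + sum(gamma[i][j] * length[j]
                                   for j in range(i + 1, s + 2))) / total
    return cost[0], length[0]


def best_control_limit(gamma, a, omega, x1, x2):
    """Return (n, cost rate) minimizing the cost rate over n = 1, ..., s + 1."""
    s = len(a) - 1
    rates = {}
    for n in range(1, s + 2):            # n = s + 1 means "never rejuvenate"
        c, t = cycle_cost_and_length(n, gamma, a, omega, x1, x2)
        rates[n] = c / t
    n_best = min(rates, key=rates.get)
    return n_best, rates[n_best]


# Hypothetical instance with s = 1: states 0 and 1 operational, state 2 down.
example_gamma = [[0.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
example_limit, example_rate = best_control_limit(
    example_gamma, a=[1.0, 2.0], omega=[2.0, 2.0, 1.0], x1=5.0, x2=10.0
)
```

In this hypothetical instance the threshold n = 1 yields a cost rate of 3.5/1.5 = 7/3, while n = 2 (no rejuvenation) yields 12/2.5 = 4.8, so the enumeration selects n = 1. Because only s + 1 thresholds need to be evaluated, the enumeration is far cheaper than a search over all Markovian policies.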
In Table 1, we obtain the so-called decision table to characterize the optimal software rejuvenation policy. From this result, it is optimal to trigger the software rejuvenation the first time the system state reaches i = N* + 1 = 2. The associated minimum expected operation cost per unit time in the steady state is then given by z* = z(N*) = z(1) = 4.255 ($). Next, we examine the dependence of the optimal software rejuvenation policy on the cost ratio x_1/x_2. Table 2 presents the dependence of the expected operation cost on x_1/x_2, where the corresponding optimal threshold level N* is insensitive to the change of x_1/x_2 within each of the ranges 7/15 to 10/15 and 11/15 to 15/15. This is because the optimal threshold level takes only integer values. As the rejuvenation cost per unit time becomes larger relative to a fixed recovery cost, the expected operation cost increases monotonically.
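A sensitivity analysis of this kind can be sketched by re-optimizing the threshold for each value of the rejuvenation cost while the recovery cost is held fixed. In the following Python fragment, the recovery cost x_2 = 15 and the range x_1 = 7, …, 15 mirror the ratios quoted above. The transition rates, operation costs, mean overheads and the function names cost_rate and sweep_rejuvenation_cost are hypothetical, so the output is not expected to reproduce Table 2.

```python
def cost_rate(n, gamma, a, omega, x1, x2):
    """Long-run cost per unit time when rejuvenating on entry to states i >= n."""
    s = len(a) - 1
    cost = {s + 1: x2 / omega[s + 1]}    # expected cost to the end of the cycle
    time = {s + 1: 1.0 / omega[s + 1]}   # expected time to the end of the cycle
    for i in range(s, -1, -1):
        if i >= n:
            cost[i], time[i] = x1 / omega[i], 1.0 / omega[i]
        else:
            total = sum(gamma[i][i + 1:])
            cost[i] = (a[i] + sum(gamma[i][j] * cost[j]
                                  for j in range(i + 1, s + 2))) / total
            time[i] = (1.0 + sum(gamma[i][j] * time[j]
                                 for j in range(i + 1, s + 2))) / total
    return cost[0] / time[0]


def sweep_rejuvenation_cost(gamma, a, omega, x2, x1_values):
    """Return rows (x1 / x2, best threshold n, minimal cost rate) for each x1."""
    s = len(a) - 1
    rows = []
    for x1 in x1_values:
        rates = {n: cost_rate(n, gamma, a, omega, x1, x2) for n in range(1, s + 2)}
        n_best = min(rates, key=rates.get)
        rows.append((x1 / x2, n_best, rates[n_best]))
    return rows


# Hypothetical model with s = 2; only x2 = 15 and the x1 range follow the text.
example_rows = sweep_rejuvenation_cost(
    gamma=[[0.0, 1.0, 0.5, 0.1], [0.0, 0.0, 1.5, 0.3], [0.0, 0.0, 0.0, 2.0]],
    a=[1.0, 2.0, 3.0], omega=[2.0, 2.0, 2.0, 1.0],
    x2=15.0, x1_values=range(7, 16),
)
for ratio, n_best, rate in example_rows:
    print(f"x1/x2 = {ratio:.3f}  threshold n = {n_best}  cost rate = {rate:.3f}")
```

For every fixed threshold the cost rate is non-decreasing in x_1, so the minimal cost rate is non-decreasing as well. The optimal threshold changes only at isolated values of the ratio, because it is restricted to integers.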

CONCLUSION
In this study, we have considered a dynamic rejuvenation policy for a multistage degradation software system. We have formulated the underlying optimization problem by a semi-Markov decision process and proved the optimality of control-limit type of software rejuvenation policy. A numerical example has been presented to illustrate the dynamic rejuvenation policy and its associated expected operation cost per unit time in the steady state. Here we have derived the decision table to characterize the optimal policy and performed the sensitivity analysis of cost parameters on it.
The result can be applied to the preventive maintenance problem with garbage collection for application software, if the degradation level can be quantified by the total amount of memory leak. Based on the optimality of the control-limit policy, the software user monitors the level of resource exhaustion and can trigger the garbage collection at the best timing in terms of cost minimization. Of course, it is essential to estimate the transition rates from field data with high accuracy. If one fails to collect such data on the degradation time, the applicability of our stochastic model to real software fault management is limited. Also, it is worth noting that the optimality of the control-limit policy cannot be guaranteed if the assumptions (A-1)-(A-6) do not hold, so these parametric assumptions have to be checked carefully in practice.
In the future, we will consider an adaptive software rejuvenation policy in the same modeling framework as this study. In practice, it is not easy for an arbitrary software user to model software aging phenomena (number of degradation states and transition architecture) and to estimate the transition rates from his or her operational experience. To address this issue, adaptive algorithms such as non-parametric statistics and reinforcement learning should be applied to design an autonomic rejuvenation protocol with adaptive prediction ability.