Survival Analysis of Data of HIV Infected Persons Receiving Antiretroviral Therapy Using a Model-Based Binary Tree Approach

Discrete-time approach is used in survival data analysis when only the time interval in which the event of interest has occurred is known or when this event occurs in a discrete - time scale. The work presented in this paper is motivated by the analysis of HIV/AIDS follow-up data collected in Burkina Faso during the 5-YEAR Global Fund program implemented to fight AIDS, Tuberculosis and Malaria. The research question that motivated the work is the likely existence of different mortality risk profiles of people infected with HIV/AIDS, depending on their characteristics and health status at the beginning of their care. In order to answer these questions, we considered a binary tree regression approach for survival data analysis since such a model owns the ability to handle interaction effects between the outcome covariates without a tight specification of such effects during the model statement step. This helps to prevent specification and interpretation errors. The fitted model resulted in splitting patients into three disjoint subgroups, corresponding each to a specific hazard profile.


Introduction
From 2003 to 2007, the Global Fund supported health institutions in Burkina Faso to promote the access of HIV infected persons to Antiretroviral Therapy (ART), considered as a therapeutic advance in the fight against HIV/AIDS (Kouanda et al., 2008). Every six months, health data were recorded during clinical visits. An evaluation of this program was done in 2008 in order to assess the efficiency of the program and it involved the analysis of follow-up data gathered during the program execution. We sought to address two research questions in this paper: are there groups of patients with different and specific risk profiles? Which characteristics, among those that are recorded, are correlated with this risk and can help to predict accurately the hazard of death? To achieve this goal, we will use survival tree methods for the analysis of survival data. The main characteristic of this approach is its ability to capture interaction effects between predictors, specifically when there is a large number of predictors considered for modeling the distribution of a response variable. A binary tree is fitted to the dataset by recursively splitting covariates to create partitions of covariate space in order to obtain homogeneous groups with respect to the studied response.
Contributions in tree-based methods for discrete-time survival analysis include (Bou-Hamad et al., 2009) and (Schmid et al., 2016). Both methods consider that timeto-event data are observed jointly with covariates that describe individuals and consider that hazard of event occurrence is a function of time and covariates. Let us consider X = (X1, X2,…, Xp) a vector of covariates, and {Nk, k = 1,…, K} a partition of the covariates space denoted by D(X). Let h denote the hazard function. In Bou-Hamad et al. (2009) approach, hazard probability has been modeled as follows: for all x  D(X) and time index t,

355
where, I(x  A) is the indicator function of the set A. In the Schmid et al. (2016) approach, the time variable denoted T is a candidate splitting variable for tree construction. Schmid et al. (2016) consider a partition {Nk, k = 1,…, K} of the input space D(T,X) and a hazard model specified by: where t is a time index and x is an observed value of the covariate vector X.
Our present work proposes a semiparametric modelbased binary tree for the analysis of the follow-up data of the HIV infected people who were included in the Global Fund program, with the aim of analyzing the correlation between the survival of the people who had access to this therapy, and their health status at the time of admission to the program, as well as the means by which these patients arrived at the program, the characteristics of the patients and the attributes of care facilities. Unlike (Bou-Hamad et al., 2009) and (Schmid et al., 2016) methods, the proposed approach distinguishes two groups of covariates: the first group is involved in the binary tree construction, and the second group is involved in the statement of a parametric regression model in each node of the tree.
The rest of the paper unfolds as follows. Section 2 is devoted to the exposition of the statistical model proposed for discrete-time survival analysis and the model fitting algorithm; in section 3, survival data are analysed on the basis of the fitted model; section 4 is devoted to the discussion of the results.

Notations
Let us consider a sequence of time intervals [0,t1[,[t1,t2[,…,[ts-1,ts[,... where the subscript s indexes a time interval. The covariates are splited into two parts. The first one is made of covariates Z1, Z2,…, Zq, called moderators according to (Fokkema et al., 2018) and will be involved in the binary tree construction. These covariates are used to characterize subgroups of the population through a partition of the input space Z2,…,Zq) and D(Zj) denotes the set of all possible outcomes of Zj, j = 1,…, q. The second set of covariates X1, X2,…,Xp, is involved in the statement of a parametric regression model, the linear prediction of a generalized linear model (glm). Let X = (X1, X2,…, Xp). We consider TE = 1, 2,… and TC = 1, 2,…, a random event time interval index, and a random censoring time interval index respectively. Let T := min (TE, TC). We denote by := I(TE ≤ TC) the random binary variable that indicates whether TE, the time to event, is censored 1 or not. The observed sample is denoted by {(ti, i, xi, zi), i = 1,…, n} where ti is the observed time of individual i, i indicating whether the individual is censored or not, zi = (z1i, z2i,…, zqi), xi = (x1i,x2i,…, xpi) are the observed values of Z and X.

Binary Regression Tree Model for Discrete-Time Hazard
The discrete-time hazard function h is the conditional probability that a randomly selected individual will experience death in time interval s, given that he didn't experience death prior to s. For a time index s, the discrete hazard is defined by: where X is the vector of the predictors, Z is the vector of the moderators, TE is the time-to-event variable and s is the observed time. We consider a discrete hazard model specified by: where s is a time index, x is an observed value of the covariate vector X, z is an observed value of the moderator vector Z and g is a strictly monotically increasing link function. It clearly appears that the model is a varying-coefficient model (Hastie and Tibshirani, 1993): coefficients αs and l change with the value of moderator variables Zm, m = 1,…, q. Those variables, called effect modifiers by (Hastie and Tibshirani, 1993) are distinct from covariates Xl, l = 1,…, p involved in the linear model.
which is the model considered by Bou-Hamad et al. (2009). It is claimed in (Singer and Willett, 1993) that the specification of time effect as αk(s) in (6) is the most general parameterization of time effect. Classical link functions include probit, logit and cloglog functions. The latter leads to discrete-time counterpart of an extended Cox proportional hazard model with respect to a link function f(h(s|x)) = -log(1-h(s | x)) = exp((s))exp(x) where exp(α(s)) stands for the baseline hazard.

Augmented Design Matrix
One can consider (8) as the likelihood of a binomial model with probabilities h(s | xi, zi) and independent observations yis of independent statistical binary variables yis. It turns out that the model parameters can be estimated by using binary response regression techniques. For that purpose, we need to create an (augmented) data matrix (Berger and Schmid, 2017) with ti rows derived from the initial data {(ti, i, xi, zi), i = 1,…, n}. In this matrix, the sth row contains information about the sth time interval. The first column of the data matrix is related to , the second is related to T (observed time), the p subsequent columns consist of the observed values of X and the last q columns are the observed values of Z. The rows of the augmented design matrix corresponding to a subject i that has experienced the event (i = 1) are given by: In the case of a censored subject i (i = 0), the rows corresponding to i are given by: By stacking these augmented matrices one obtains a design matrix that will be used by the model fitting algorithm. Singer and Willet (1993) proposed a similar design matrix. In short, the difference is on the Time variable. Instead of one variable, the Time variable is

Model Fitting
Model-Based recursive partitioning (MOB) is an algorithm that aims to produce unbiased binary tree where fitted parametric models are associated with each terminal node (Zeileis et al., 2008;Hothorn et al., 2006). We have used MOB to build the tree. The algorithm finds the partitions after completion of four steps (Algorithm 1). More details on parameter instability tests can be found in (Hornik and Zeileis, 2007;Hothorn et al., 2006). Unlike other binary regression tree methods, MOB does not require post-pruning to avoid overfitting and it results in an optimally sized tree. For tree construction, two hyperparameters are provided to run the algorithm: the significance level for parameter instability tests denoted  and the minimum terminal node size denoted minsize. The terminal node size refers to rows of the augmented data matrix. The final tree is obtained by the selection of an optimal value among hyperparameter values. This can be achieved by looking for an optimal joint value of a minimal terminal node size minsize and a numeric significance level  by using Bayesian Information Criterion (BIC).

Algorithm 1 MOB Algorithm
Step 1: Fit the model to the dataset. Let us consider the augmented data matrix previously described to be the dataset. It consists of two sets of variables: model variables and moderator variables. The model can be fitted by maximizing the loglikelihood resulting from (8). We denote by  the parameter estimate.
Step 2: Test for parameter instability with respect to every partitioning variable.
Let  be the current node and ˆ  the estimate of parameter   in . We denote by L and R the children nodes resulting from a binary partition of .  (Hornik and Zeileis, 2007). The sup LM statistic is used for numerical moderators and a  2 statistic is used for categorical moderators (Hornik and Zeileis, 2007). See supplementary material for details on sup LM statistic.
Step 3: If there is some overall parameter instability, split the dataset with respect to the variable associated with the highest instability (the smallest p-value) into two children nodes.
To determine whether there is some overall instability, it is checked whether the minimal p-value falls below , a pre-specified significance level. The Bonferroni method can be used to adjust for multiple testing. The split point is found by applying an exhaustive search procedure: for every conceivable split point, the parametric model is fitted in each one of the two children nodes generated by this split point and then the split associated with the maximum sum of the two observed log-likelihoods in the children nodes is chosen.
Step 4: Repeat the procedure in each of the resulting subgroups until no significant instability is detected or a minimum terminal node size criterion is met.
This method allowed us to cover a large set of hyperparameter values and then to select the optimal tree using the BIC criterion. We used the partykit package ) to fit the model. Hazard probabilities were estimated using (6).

Stability Analysis
The success of binary trees among statistical methods of decision-making should not obscure the potential instability of models resulting from the execution of the algorithm used to fit the data. Therefore, it is essential to ensure the stability of the fitted model before its subsequent use, as for prediction task. Stability assessment can be done by fitting the same model to bootstrapped samples from the training dataset. Bootstrap trees may select variables and cutpoints that were not selected by the original tree. Metrics for stability assessment include the relative variable selection frequency, the mean frequency of the variable selections per tree and the frequency of each cutpoint over the trees (Philipp et al., 2016):  The relative variable selection frequency for a splitting variable zj, j = 1,…, q equals the total number of bootstrap trees that have selected zj at least once, divided by the total number of bootstrap trees.  The mean frequency of the variable selections per tree for zj is the total number of times zj is selected for splitting by a bootstrap tree over the repetitions, divided by the total number of bootstrap trees.
 The relative frequency of a cutpoint c(zj) equals the total number of bootstrap trees that have selected c(zj) to split the variable zj, divided by the total number of bootstrap trees.
A variable selection is stable if its frequency of selection is close to 100% and its average split count is close to its number of selections in the original tree. Different graphics are used to hightlight a variable cutpoint variability depending on the nature of the variable (categorical, numerical). For an ordered categorical variable, a barplot is used to show the frequency of all possible cutpoints. A histogram is used to illustrate the cutpoint variability when the splitting variable is numerical. It is expected that the cutpoints selected in the original tree have the highest frequencies (one or more peaks in the histogram). For an unordered categorical variable, a specific plot is used to visualize the partitions' variability over the repetitions. The plot uses the same color for categories that belong to the same node. The combination of categories that corresponds to a partition observed in the original tree is marked on the right side of the plot by a solid red line. In addition, two dashed lines enclose the area representing the partition. The level(s) of the corresponding split(s) in the original tree are indicated by the number(s) on the right side of the area. To sum up, a cutpoint is stable if it is selected by most resampling trees. More details on the approach can be found in (Philipp et al., 2016).
We resort to a semiparametric bootstrap method which consists of two steps: a sampling step and an assignment step. During the first step, we sample with replacement from the survival data (ti, i), i = 1,…, n. In the second step, we assign covariate vectors xi, zi conditional on the (ti, i) sampled on the basis of a probability distribution determined by the fitted model (Zelterman et al., 1996). The fitted hazard probability is used for uncensored individuals (i = 1) and the fitted survival probability is used for censored individuals (i = 0).

Study Population and Measurements
The population concerned by the study is the overall HIV infected patients that had started antiretroviral therapy in Burkina Faso between 1st January 2003 and 31st December 2007. The country health system is splitted into 13 health regions totalizing 55 health districts and 77 ART centers. Among other initiatives, the ART centers had received funding from the Global Fund to fight AIDS, Tuberculosis and Malaria. For the purpose of the Global 5-Year program evaluation, those centers were classified as low, medium and high scale, according to the level of funding received by health districts in 2007. In each ART center, all the patients older than 15 that had laboratory documentation of HIV infection and that had received Highly Active Antiretroviral Therapy (HAART) in the center for at least 6 months of follow-up were included in the data collection. Among the 5608 patients initially studied by (Kouanda et al., 2011), 1267 patients with missing initial CD4 count or unknown WHO clinical stage were excluded for the present analysis. People who had begun ART before they joined the ART centers involved in the evaluation were not included in this study. We didn't consider the covariates Body Mass Index (BMI) and timedependent CD4 count because of the large proportion of missing values. At the date of collection, 77.5% were alive, 7.0% were lost-to-follow-up and 11.6% were dead. The empirical hazard function decreased from 7.49% in semester 1 to 2.02% in semester 2 ( Table 2).  For statistical analysis, time indexes 7, 8, 9, 10 and 11 have been grouped into a single category 7. For the model building, we consider the variables Gender (Genre), Age, Entry mode (EntryMod) that describes the patients as predictors in the linear model. Potential partitioning variables were baseline CD4 count (inCD4), WHO clinical stage (StadeOMS), Intensity of the intervention (Scale) and health district category (District).

Identified Subgroups and Hazards of Death Profiles
The fitted tree results in three terminal nodes (Fig. 1). The WHO clinical stage (StadeOMS) is selected for the first split, revealing that among the predictors, the baseline disease stage is the most closely correlated with the mortality hazard trajectories. Within the group of patients with one of the first three WHO clinical stages, the baseline CD4 count (inCD4) is the most correlated with the mortality hazard trajectories and induces a new split of the group. CD4 count is known to be a good predictor of the HIV dynamics during treatment in resource-limited countries. Patients with baseline CD4 count ≤ 50 cells/µL have a risk profile that is different from that of patients with baseline CD4 count > 50 cells/µL. Figures 2 to 4 illustrate the correlation between covariates and the hazard function. In the subgroup of patients with WHO clinical stage 4, the hazard function highly decreases from semester 1 to semester 2 (Fig. 2). The hazards of death in semester 1 are slightly higher for male patients, compared to female patients in all categories of patients and higher for 30-40 age group compared to the other age groups. But there is no significant difference in hazards of death between the categories of patients from the semester 2.
In the subgroup of patients with baseline CD4 count ≤ 50 cells/µL and WHO clinical stage ≤ 3, patients supported by NGOs or Relatives have similar hazard profiles and the lowest hazard estimates in semester 1 (Fig. 3). Transferred patients have the highest hazard estimates in semester 1. Patients aged 40 and over have the highest hazard estimates. The difference in risk between age categories is lower in patients followed by NGOs or supported by relatives than in transferred patients. In addition, the hazard function increases between semester 6 and time interval 7. The increase is lower in patients followed by NGOs or supported by relatives compared to patients from other modes of entry.
In the subgroup of patients with baseline CD4 count > 50 cells/µL and one of the first three WHO clinical stages (Fig. 4), the hazards of death in semester 1 are slightly higher for male patients compared to female patients and significantly higher for transferred patients compared to patients from other modes of entry. The hazard estimates are also higher for patients between the ages of 30 and 40. Patients from other age categories have similar hazard profiles. For all categories, the hazard function remains constant after the third semester.
A comparison of the results from the three subgroups reveals main features. First, in all subgroups, the hazard function decreases significantly between semester 1 and semester 2. The hazard estimates during these two semesters are higher in the subgroup of patients with baseline WHO clinical stage 4, compared to the two other subgroups of patients. The subgroup with baseline CD4 count > 50 cells/µL and one of the first three disease stages has the lowest hazard estimates during these two semesters. These findings underline, on the one hand the treatment efficacy for all HIV infected persons and on the other hand, that the efficacy is best for patients that initiate treatment at an early stage of infection. Secondly, hazard profiles differ significantly depending on how patients are entered into the active list of persons living with HIV/AIDS under ART. Patients supported by parents or NGOs have similar risk profiles. Patients recruited through health facilities and transfers also have comparable profiles. Thirdly, except in the subgroup of patients with baseline CD4 count ≤ 50 cells/µL and WHO clinical stage ≤ 3, hazard estimates in semester 1 are higher for male patients compared to female patients. Lastly, for all subgroups, there is a difference in hazard between age categories in the first semester. This difference is well illustrated in the subgroup of patients with baseline CD4 count > 50 cells/µL and one of the first three disease stages (Fig. 4).

Assessment of the Fitted Binary Tree Stability
We used 500 bootstrap samples for the stability analysis. Figure 5 and Table 3 show that the relative frequency of selecting WHO clinical stage (StadeOMS on the table), was 100%. In addition, Fig. 7 shows that For the variable inCD4, the relative frequency of selection is also 100% but, unlike in the original tree, the variable is selected 1.9 times for splitting by each bootstrap tree (Table 3). Most bootstrapped trees have selected the same cutpoint as in the original tree (CD4 count ≤ 50 cells/µL) on the second level. In most cases, an additional cutpoint (between 80 and 200) is also selected. This second cutpoint indicates that the subgroup of patients with CD4 count > 50 cells/µL may be splitted into two categories with specific survival profiles by a cutpoint between 80 cells/µL and 200 cells/µL. The upper bound of the possible cutpoints is known to be a threshold limit under which a HIV infected person is immunodepressed and in very urgent need of treatment. To sum up, the variable CD4 count is definitely relevant for predicting survival of ART patients although its split is less stable than the split of the variable WHO clinical stage. About 69.8% of the trees were built by selecting only the variables WHO clinical stage and baseline CD4 count as in the original tree (Fig. 6).      (Table 3 and Fig. 6). This occurred mostly in the subgroup of patients with CD4 count > 50 cells/μL.

363
In most cases, the Boulmiougou district forms a first node and Bogodogo and other districts are directed to the second node. As District was not selected by the original tree and its selection frequency is less than 50%, this means that this variable may carry some information that is useful for predicting survival, but it is not among the most important ones. For instance, patients in the Boulmiougou district were followed-up by Médecins Sans Frontières, an NGO from Luxembourg. As a result, health workers involved in the care of HIV infected persons in that district received salaries three times higher than those of other public health centers (Perelman, 2003). Thus, patients may receive a better care compared to patients in the rest of the districts.
Finally, very few bootstrap trees (1.2%) have selected the variable Scale for splitting. It need not be retained.

Concluding Discussion and Remarks
In this study, we proposed a tree-based approach for the analysis of discrete time-to-event data. The method is related to previously proposed methods by (Bou-Hamad et al., 2009;Schmid et al., 2016). But there are important differences between our method and the two others. First, our method distinguishes two groups of covariates: those used to build a binary tree and those that define a linear model in the nodes of the binary tree; the former are called moderators and the latter have kept the qualifiers of covariates. For example, unlike (Schmid et al., 2016) where Time is a splitting covariate (moderator variable in our approach) and each terminal node corresponds to one time interval, our method uses Time as a model covariate and each terminal node corresponds to the whole set of time intervals. When Time alone is included as a covariate, our model is reduced to the Bou-Hamad et al. (2009) The second difference is found in the algorithm used to fit the model. An important advantage of our method is that it uses Model-Based algorithm (MOB) (Zeileis et al., 2008). The objective is to search for subsets of the data that have the best fits of the hazard model, assuming that the model may not fit perfectly all the dataset. It turns out that each leaf is associated with a fitted model. The algorithm uses the model likelihood function both for parameter estimation and split point search. A benefit of this approach is that the parameter estimates and the corresponding score functions have to be evaluated once in a node. Score functions are then reordered and aggregated into a scalar test statistic each time a parameter instability test is performed. Finally, a Model-Based algorithm is suitable for the identification of subgroups of individuals with similar survival behaviors (Seibold et al., 2016).
We have used our method for the analysis of HAART data from Burkina Faso. The model has identified three subgroups of patients with different survival behaviors. The subgroups are determined by the combination of baseline WHO clinical stage and baseline CD4 count. They differ in the shape of the hazard function as well as in the existence and the amount of correlation between the hazard function and at least one predictor variable among Age, Gender and Entry Mode. In each subgroup, the hazard of death is highest in the 1st semester. This early mortality is probably explained by late presentation. Most patients started ART with an advanced HIV infection level. The median baseline CD4 count was 122 (60; 180). As expected, the WHO clinical stage 4 subgroup had the highest within 6-months mortality. Some of the clinical criteria used to assign the disease stage 4 were found to be the main causes of ART patients' deaths in Burkina Faso (Kouanda et al., 2011). Patients in other WHO clinical stages that had a low baseline CD4 count (CD4 count < 50 cells/µL) were the second high-risk subgroup. CD4 count < 50 cells/µL was found to be associated with a higher risk of death in other studies (Lawna et al., 2008;Kouanda et al., 2011). Gender and Age identified as correlated with the hazard function in the subgroup of patients with WHO clinical stage ≤ 3 and baseline CD4 > 50 cells/µL have been reported by other studies as predictors of mortality (De Beaudrap et al., 2008;Kouanda et al., 2011). Bila and Egrot (2008) reported that the representations of masculinity in Burkina Faso are a factor of men's reluctance to attend health care for persons living with HIV/AIDS (Bila and Egrot, 2008). On the other hand, in each subgroup, categories of patients' hazard profiles differ by the hazard of death estimate in the 1st semester. Patients supported by NGOs or Relatives have lower hazards of death. Social support is important for persons living with HIV/AIDS in the Subsaharan context, where fear of exclusion may lead to poor adherence to treatment (Merten et al., 2010). In contrast, transferred patients were found to have the highest hazard of death in the semester 1. The explanation can be the fact that transfer generally occurs in emergency circumstances. The stability analysis showed that a slight instability occurred in the third subgroups of patients defined by WHO clinical stage ≤ 3 and CD4 count > 50 cells/µL. In contrast, the subgroup of patients defined by WHO clinical stage 4 and the one defined by WHO clinical stage ≤ 3 and CD4 count ≤ 50 cells/µL were stable. Therefore the fitted model is fairly stable. On the other hand, we have analysed data for HIV patients on treatment in the thirteen health districts selected for the evaluation of the 5-YEAR Global fund program and in the Boulmiougou district. So our findings should be valid for HIV patients in Burkina Faso and in low-income Subsaharan countries with a similar health system.