Survival and Mediation Analysis with Correlated Frailty

Corresponding Author: Dr. Usha S Govindarajulu Center for Biostatistics, Department of Population Health Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA Email: usha.govindarajulu@mountsinai.org Abstract: Many studies have focused on mediation effects, but fewer have focused on these in a time-to-event analysis and even fewer have focused on these effects in a survival model with frailty effects. The purpose of this study was to show unique modeling of correlated frailty effects with mediation and to demonstrate their use through simulations as well as on a real data application. We focused on a technique introduced by (Lange et al., 2012), which involves a procedure based on Marginal Structural Models (MSM) that directly parameterize the natural direct and indirect effects of interest. Using their method of specifying the MSM structure along with a correlated frailty model allowed us to incorporate the unexplained heterogeneity still unaccounted for by these models. We were able to ascertain direct and indirect mediating effects in the simulations. In the real dataset, we were also able to demonstrate stronger frailty effect in the correlated frailty model accounting for mediation.


Introduction
Mediation is the process through which an exposure causes disease. It is up to researchers to decide that some or all, of the total effect on an outcome operates through a mediator, which is defined as an effect of the exposure and cause of the outcome (Robins and Greenland, 1992). When a mediator is hypothesized, such as in our study, the total effect can be broken down in to two parts, direct and indirect effect. The direct effect is the effect of exposure on the outcome minus the mediator. The indirect effect is the effect of exposure on the outcome that works through the mediator. Galea (2013) states that in public health, it has been said that epidemiologists should seek to estimate parameters that have a logical correspondence with some realistic intervention that might be taken to improve population health. Similarly, causal inference suffers from the idea that one cannot estimate causal effects without some clearly defined exposure intervention (Glass et al., 2013). This fueled the debate on the usefulness of decomposing effects into their natural direct and indirect components. One argument is that because natural direct and indirect effects cannot be identified using intervention-based causal models and cannot be estimated in a randomized trial and, therefore, they cannot be interpreted as effects that have a logical correspondence with some public health action (Robins and Greenland, 1992). Others argue that natural direct and indirect effects are identifiable using causal inference models that make stricter assumptions in order to infer mechanistic relations and that many causal effects cannot be estimated in a randomized intervention trial for logistical or ethical reasons, and thus, should not be ruled out as a means of providing information on actions of different mechanisms (Pearl, 2009).
Casual mediation analysis have existed in the survival analysis world. We employed a technique to allow for survival and causal mediation analysis. This technique was introduced by Lange et al., 2012, which is in the realm of causal mediation. Although one of the best known methods that also works with survival analysis is that of Valeri and VanderWeele (2015), which uses a technique for causal inference of mediation to implement analysis in the presence of exposure-outcome interaction accounting for different type of outcomes; such as survival. The causal inference method allows for effect decomposition even in the presence of exposure-mediator interaction. The issue is that the outcome has to be rare to work in the Cox model format and their method is cumbersome to adapt for frailty analysis. The limitations of their methods did not allow for the Cox proportional hazard model to be employed unless the outcome is rare.
The technique introduced by Lange et al. (2012) involves directly modeling the natural direct and indirect effects of interest. Their method functions via a procedure based on Marginal Structural Models (MSM), which will directly parameterize the (Glass et al., 2013) natural direct and indirect effects of interest. It has tended to produce 22 more parsimonious results than current techniques, greatly simplifies testing for the presence of a direct or an indirect effect and has the advantage that it can be implemented in standard software like SAS or R. This approach can be used for any type of outcome (binary, continuous, survival, categorical, etc.,) and any type of mediator, even when exposure-mediator interactions exists. However, its simplicity comes at the price of relying on correct specifications of models for the distribution of the mediator and exposure and accepting some loss of precision compared with more complex methods.
We have previously published our frailty model framework (Govindarajulu et al., 2011) and demonstrated its flexibility in handling univariate, multivariate, or correlated frailty. Frailty models have been described as essentially survival models with both fixed and random effect terms. While the fixed effects comprised the explained or observed portion of the model, the random effect term accounted for the unexplained variability of the model. In other words, the random effect, or frailty, modeled the unexplained heterogeneity in the model, which could be exhibited by the heterogeneity of hazard rates beyond recorded covariates shown in a population. The frailty term (Vaupel et al., 1979) was initially developed to describe heterogeneity at the individual level, but was expanded to describe heterogeneity among groups of individuals or within an individual. These were then considered levels of clustering, where an individual is one level of clustering (Duchateau, 2008) and then groups or observations within an individual were additional forms of clustering.

Mediation Framework
We employed an already published technique to determine the survival and causal mediation analysis and then enhanced this method. The second technique introduced by Lange et al. (2012) involved a direct modeling of the natural direct and indirect effects of interest. This approach was applicable to survival outcomes with any type of mediator, even when exposure-outcome interactions exist. The method described by Lange et al. (2012) allowed us to model the natural direct and indirect effect yielding results simpler for reporting. We adapted the technique for a continuous outcome though they had mainly demonstrated a categorical outcome.
We then incorporated our prior published frailty model, which allows for univariate, multivariate, or correlated frailty. With this we focused on the correlated frailty aspects with mediation. We have presented a Directed Acrylic Graph (DAG) of the causal structure of variable relationships along with frailty in Fig. 1 which is modified from Lange et al. (2012). Since frailty is generally multiplied onto the function of covariates in the hazard function, we attempted to portray its causal effects on the covariates but even on the mediation. We report our steps of setting up the MSM structure proposed by Lange et al. (2012) and with our final incorporation of the frailty: Step 1: Start out with g(): being a link function, in this case, a logit link, where a is the exposure, c are the covariates represented by  ′X and m is the mediator: (1) Step 2: Continuous variable comes from a normal distribution with mean and variance from the exposure: Step 3: Using their derivation for computing the weights. Generally, there is a no unmeasured confounders assumption but we had to modify these assumptions to include the frailty effect (f): Yamf || A|C|F, Ma || A|C|F, Yamf ||M|A = a, C = c, and Yamf || Ma*|C. They eventually derive that: where, W refers to the weights. These weights from Equation 3 are then utilized in the modeling in Equation 4.
Step 4: Utilizing the weights computed in Step 3 in the Cox proportional hazards model with correlated frailty, which are multiplied onto the hazard: where, hij is the hazard function for the jth observation from the ith cluster where the cluster can be an individual or a group and wij is a vector of random effects associated with the covariates vector zij for cluster i observation j, which usually includes an intercept. The random effects account for the heterogeneity effects of the zij. When zij includes only an intercept, the model is a simple A C M Y F 23 multiplicative frailty model. The wij accounts for the correlation among individuals within a cluster. Therefore, this term represents frailty as univariate or multivariate. In univariate survival data, each cluster has only one individual with only one survival outcome. Multivariate survival data consists of a cluster of more than one individual. The clusters may be multiple survival outcomes for a single individual or one or more survival outcomes for multiple correlated individuals, such as relatives (Govindarajulu et al., 2011).
The correlated frailty can be considered to be an extension of the additive polygenic model often used to estimate heritability for quantitative phenotypes (Govindarajulu et al., 2011). The assumptions of a familyspecific frailty model is that everyone within the same family shares the same risk level. Persons that are not genetically related but are related through marriage will share the same family-specific effect. For the model described in Equation 4, in the shared frailty model, the random effect, wij, is the same for all individuals (j) in pedigree i, while in the correlated frailty model, the degree of genetic relationship between individuals is specified through the frailty. Individuals that are more closely related are expected to have more similar frailty, whereas more distant relatives and unrelated individuals have less similar frailty. This model might be appropriate when individuals in a family are correlated because of unmeasured genotypic effects.

Simulation Framework
We utilized a simulation framework akin to methods previously used to simulate survival data (Govindarajulu and Malloy, 2015;Govindarajulu et al., 2007;2009) adapted to correlated frailty models. We created 50 families. For each family, each of which ranged in size from 1 to 5, we simulated family id, father id and mother id, each generated from sampling without replacement between a particular range of integer values for each. We then randomly sampled covariates with replacement, allowing age to range from 25 to 90 years old and systolic blood pressure (SBP) from 116 to 156 units. Gender and smoking status were binary so we sampled those from a uniform distribution. Bender et al. (2005) discussed the use of different distributions for generating survival times. We allowed for a baseline Weibull hazard (Bender et al., 2005;Klein, 1997), so that: depending on parameters  and ν, are chosen to generate realistic survival data. Survival times are found from this generating distribution by using the relationship between the hazard function, survival function, S(t) and cumulative distribution function (CDF), F(t), of the survival time random variable, T. Given the baseline hazard function, h0(t), the CDF is found through the relationship: where, H(t) is the cumulative hazard function. This in turn is given by: where, g(x,y,w) = Tx + (1|id) is the true log hazard ratio defined in equation for H(t) above. Using the probability integral transformation, for a given t, the CDF, F, has a Uniform distribution on the interval from 0 to 1. Generating U1 from a Uniform (0,1) distribution and solving for t gives the simulated survival time: for given values of x, y, w, θ and ν. We also incorporated a competing risk into the simulations. Competing risk times were generated in a similar manner based on an exponential distribution with scale parameter γ, giving: The observed survival time was taken to be the minimum of t0, tcr and a pre-specified end-of-study time, τ. An observation was considered censored if t0 was larger than the minimum of tcr and τ. The user-specified parameters, θ, ν,γ and τ were set to generate realistic survival time distributions. We used τ = 20 years of follow-up time and θ = 5 which is considered typical used value. The parameters, ν and γ, were chosen to control the amount of censoring in the simulated data. We had 71 cases on average in the dataset where the average number of cases increased across the scenarios from 47 to 91. The final simulated datasets contain: all covariates as well as the clustering variables for id, family id, father and mother ids, survival time, event indicator, center number, and physician indicator.

Real Dataset Framework
We examined age-at-death data from the original cohort of the Framingham Heart Study, the longest running cohort study in the United States. There were a total of n = 5205 individuals with age-at-death or last contact data, coming from 2601 nuclear and extended families. Of the 5205 individuals, 4653 had died, and 552 were censored at age of last contact for this analysis. Families ranged in size from having a single individual to twenty-five individuals with age data. Most families (96%) contributed five or fewer members, and 38% of families contributed two individuals. Among the 5205 individuals, there are 1307 first degree relative pairs (sib and parent-offspring pairs), 61 second degree relative 24 pairs (avuncular and half-sib pairs), and 11 third degree relative pairs (cousins).
All computations were performed in R for both the simulations and real data analysis. In building the correlated frailty models, we utilized the kinship package to fit the models. The kinship package has a coxme function fits mixed-effect Cox models, including those with correlated frailties. In addition, the kinship package uses sparse block diagonal matrices to compute and represent kinship matrices.

Results
From the real dataset application from the Framingham Heart Study, we utilized the dataset described in the Methods section with n = 5205 individuals. We were then able to model univariate at an individual level as well as multivariate correlated frailty models with the family and sib-pair information. These models have been described in the Methods section.
The simulations were based off the covariates from this dataset as described in the Methods, which were then emulated around the covariates from the real dataset previously described. The simulation results show that the natural direct effect of age is statistically significant with a Hazard Ratio (HR) around 2.7 (p<0.0001), while for 2 out of the 20 simulations, the natural indirect mediating effect was significant in the correlated frailty model without weighting (Table 1). In the correlated frailty model with the weighting (Table 2), it was statistically significant for 5 out of 20 of the simulations with a reduced HR rate as compared to the direct effect. The other covariates, gender, smoke and SBP were generally statistically significant in the model.
Real dataset results showed potentially a significant natural direct effect and a nonsignificant natural indirect effect of age in all three models shown for correlated frailty: univariate, multivariate and multivariate with weighting. The natural direct effect appears to become more significant in the multivariate correlated frailty models while the indirect effect stays the same, including in our version of the correlated frailty model adjusted for mediation. Meanwhile the other covariates stayed statistically significant in each model.
The AIC results varied across simulations and do not appear correlated with a significant indirect mediating effect (i.e., lower AIC ≈ significant indirect mediating effect). In the real dataset analysis, it does not seem like AIC improved from just shared frailty only to a multivariate frailty model. We also included variance of frailty results separately for both individual frailty, 2 p  and shared frailty, 2 f  , between families. The frailty variances for both effects are generally very small in the simulations on the order of 10 4 and 10 5 (Table 1) Table 2 where the weighting is included. In the real data analyses, the initial correlated frailty shows 23% of variance accounted for by the frailty (Table 3) but then Table 4 shows small effect mainly for individual frailty of about 0.03% while the shared frailty is about 14%, which stays about the same in Table 5 with the weighted correlated multivariate frailty model. However, the individual frailty variance went up to > 11% so at least from an individual level, accounting for causal mediation induces a stronger frailty effect.

Discussion
The purpose of this study was to show unique modeling of these frailty effects with mediation and to demonstrate their use through simulations as well as on a real data application. We focused on a technique employed introduced by Lange, Vandsteelandt and Baeker 1 which involves a procedure based on Marginal Structural Models (MSM) that directly parameterize the natural direct and indirect effects of interest. Using these MSM models along with a correlated frailty model allowed us to incorporate the unexplained heterogeneity still unaccounted for by these models. We demonstrated this in both simulations and a real dataset with a correlated frailty model including weighting using the methodology of Lange et al. (2012). We were able to ascertain direct and indirect mediating effects in the simulations. In the real dataset we are able to ascertain also stronger frailty effect in the correlated frailty model accounting for mediation.
There is a broad and sustained interest in mediation analysis from many areas of public health and other fields. In particular, for time-to-event studies, employing survival analysis for modeling has been a standard and with the advent of frailty models, they could additionally incorporate unexplained heterogeneity, which generally is thought to be due to unmeasured confounding. While mediation analysis in the counterfactual methodology helps to define causal relationships between outcome, exposure and predictors, the idea of unexplained heterogeneity seems lost in this world, mainly relying without that possibility. In fact, Valeri and Vanderweele (2013) made these assumptions of no confounding as an identifiability assumption for their models incorporating mediation (Valeri and VanderWeele, 2015;Valente et al., 2017;Valeri and Vanderweele, 2013).
In terms of future work, this research can be extended to other datasets which can be modeled with correlated frailty and which also have potential mediating effects. Modeling different datasets with different genetic associations and some factor with mediating influence would be interesting to see how they perform with this new modeling. What is especially important is ascertaining the degree of associations amongst measures as we had attempted in the beginning of this article with a DAG to explain the counterfactual diagram. It might be interesting also to explore the weighting mechanisms behind the MSM structure.