General Linear Models in a Missing-Outcome Environment of Clinical Trials, Incorporating Splines for Time-Invariant Continuous Adjustment

Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA; Department of Biostatistics, The University of Texas School of Public Health, Houston, Texas, USA; Department of Epidemiology, Human Genetics and Environmental Sciences, The University of Texas School of Public Health, Houston, Texas, USA; Department of Biostatistics, The University of Texas School of Public Health, Houston, Texas, USA


Introduction
Missing data plague health care research, weakening a study's conclusions. A review of randomized controlled trials reported that 89% of 71 trials had partially missing primary outcomes and 18% had more than 20% missing outcomes (Wood et al., 2004). Although missing data occur for a variety of understandable reasons (e.g., subject non-response, subject dropout, technical device error, or data entry error), the absence of information undermines the research effort (Schafer and Graham, 2002). Missing data problems become more acute in longitudinal studies, where subjects' commitment to the study over time can lag. Many helpful approaches for handling missing data have therefore been introduced over the decades.
Missing data analysis has expanded remarkably since the missing data mechanisms were formalized by Little and Rubin (1987): Missing Completely At Random (MCAR), when missingness depends on neither observed nor unobserved values; Missing At Random (MAR), when missingness depends only on observed values; and Missing Not At Random (MNAR), when missingness depends on unobserved data. Many advanced imputation methods have been developed around these mechanisms. Multiple Imputation (MI), introduced by Rubin (1978), is a distribution-based method that samples values from a distribution of the observed data; hence, it is valid only under MAR. The Expectation-Maximization (EM) algorithm approach is a likelihood-based method (Rubin, 1976) that finds maximum likelihood estimates for fractional data and is also valid under MAR (Dempster et al., 1977). Mixed effect models and the Generalized Estimating Equations (GEE) are commonly used for repeated measurements in longitudinal studies. Both can operate in unbalanced designs and therefore incorporate all available data without deletion or imputation; the former is valid when missing data are ignorable (i.e., MCAR or MAR), while the latter is valid only when MCAR holds (Liang and Zeger, 1986). Pattern mixture models (Little, 1993; 1995) and shared-parameter models (Wu and Bailey, 1988; 1989) have been developed for the MNAR situation, which is more complex since it requires specification of the missingness process (Verbeke et al., 2001; Michiels et al., 2002; Hogan and Laird, 1997).
Yet despite superlative efforts to identify proper methods for managing missing data, utilization remains limited in practice. Among studies with missing primary outcomes (Wood et al., 2004), 65% of trials carried out a Complete Case (CC) analysis for the primary outcome and 24% used the following imputation methods, in order of frequency: Last Observation Carried Forward (LOCF), worst case value, regression-based imputation and MI. In addition, 8% of trials used repeated measures analyses such as the GEE approach (Wood et al., 2004). Others have also noted that most studies apply CC analysis or LOCF imputation for primary analyses (Fielding et al., 2008). A major reason why CC analysis and simple imputations such as LOCF remain the most commonly used procedures is the complexity of the assumptions underlying the more advanced models (Little and Rubin, 1987). For researchers who are not specialists in missing data analysis, CC analysis and simple imputations are relatively easy to implement. However, these simple approaches are valid only under MCAR, which is a less plausible assumption than MAR in clinical trials. Moreover, dropouts, which commonly occur in clinical trials, are usually regarded as MAR, so simple approaches cannot avoid serious bias unless the MCAR condition is met. It is therefore critical to find an approach that is as simple and straightforward as CC analysis or simple imputations such as LOCF, that incorporates all available data while avoiding the additional assumptions introduced by imputing non-existing data (as in MI or the EM algorithm approach) and that is valid under MAR.
The purpose of this study is to introduce an alternative method for managing incomplete outcome data, yielding the Best Linear Unbiased Estimate (BLUE) by applying the Generalized Least Squares (GLS) method to matrices rearranged to account for the missing outcomes, without imputation or deletion of cases. In addition, the B-spline is employed for managing non-linear relationships between the outcome and a continuous covariate, which can be a potential confounder.

Proposed Approach
The proposed approach employs the GLS method, which yields the BLUE in general linear models (Seber and Lee, 2003). For example, suppose a trial evaluates blood pressure (Y) changes over time using two longitudinal measures at baseline (X = 0) and endpoint (X = 1) with n subjects, i.e., a design similar to the paired t-test. Assuming a constant variance σ² at both time points, this model can be specified as E(Y) = β0 + β1X and Var(Y) = σ²V, where V is a correlation matrix allowing correlated outcomes (Diggle et al., 2002). In the proposed approach, repeated time is included in the model as a categorical variable and time-invariant covariates, e.g., gender and race, are included if needed.
In the presence of missing data, we remove the rows of the design matrix, and both the rows and columns of the covariance matrix, corresponding to the missing elements yij of the outcome vector, where yij is the outcome value at the jth time point for the ith subject. Assuming there are l missing values at baseline and m missing values at endpoint, the dimension of the outcome vector becomes (2n−l−m) and the design matrix and correlation matrix become (2n−l−m)×2 and (2n−l−m)×(2n−l−m), respectively, after removing the components corresponding to the missing outcomes and collapsing the resulting structure. The GLS method can then be applied to the collapsed model, yielding the BLUE, since the matrices remain full rank. The GLS estimator with missing data takes the usual form β̂ = (X′V⁻¹X)⁻¹X′V⁻¹y, where X, V and y are the collapsed design matrix, correlation matrix and outcome vector; the correlation parameter is estimated from the A complete pairs of observations, scaled by an appropriate constant c.
Once the estimated correlation matrix cV̂ is obtained, it is used to solve for the parameter estimates.
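As an illustration, the collapsing step and the GLS fit can be sketched in a few lines of numpy. This is our own minimal sketch under a known correlation matrix, not the study's implementation; the function and variable names, the toy two-time-point data and the 20% MCAR rate are all illustrative.

```python
import numpy as np

def gls_with_missing(X, y, V):
    # Drop the rows of X and y, and the matching rows/columns of V, for
    # missing outcomes (NaN), then apply generalized least squares to the
    # collapsed, still full-rank system: (X'V^-1 X)^-1 X'V^-1 y.
    obs = ~np.isnan(y)
    Xo, yo = X[obs], y[obs]
    Vo = V[np.ix_(obs, obs)]
    Vinv = np.linalg.inv(Vo)
    return np.linalg.solve(Xo.T @ Vinv @ Xo, Xo.T @ Vinv @ yo)

# Toy two-time-point trial: E(Y) = b0 + b1*time with b0 = 1, b1 = 0,
# Var(Y) = sigma^2 * V, sigma^2 = 2, within-subject correlation rho = 0.7.
rng = np.random.default_rng(0)
n, rho, sigma2 = 500, 0.7, 2.0
time = np.tile([0.0, 1.0], n)
X = np.column_stack([np.ones(2 * n), time])          # intercept + time
R = np.array([[1.0, rho], [rho, 1.0]])
L = np.linalg.cholesky(sigma2 * R)
eps = (L @ rng.standard_normal((2, n))).T.ravel()    # correlated within subject
y = X @ np.array([1.0, 0.0]) + eps
y[rng.random(2 * n) < 0.2] = np.nan                  # about 20% outcomes MCAR
V = np.kron(np.eye(n), R)                            # block-diagonal correlation
beta_hat = gls_with_missing(X, y, V)                 # close to (1, 0)
```

Because only rows and columns are deleted, no value is ever imputed; all observed outcomes, including the unpaired ones, contribute to the estimate.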

B-Spline
Linearity in the parameters is one of the underlying assumptions of general linear models when a continuous covariate is included in the model. When the model includes a continuous covariate having a non-linear relationship with the outcome, adding polynomial terms, e.g., x², x³, …, xᵖ, is a commonly used way to capture the nonlinear shape. As an alternative, we use the B-spline, which is composed of non-negative polynomial pieces joined at knots, where a knot is a joint connecting two consecutive curves (de Boor, 2001). Let the knots be non-decreasing real numbers t0 ≤ t1 ≤ ⋯ ≤ tk ≤ tk+1, where t1, …, tk are the k interior knots and t0, tk+1 are the two endpoints, and define the step function as B_{i,0}(x) = 1 if ti ≤ x < ti+1 and B_{i,0}(x) = 0 otherwise.
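From this step function, the higher-degree basis functions follow by the standard Cox-de Boor recursion (de Boor, 2001); for completeness:

```latex
B_{i,0}(x) =
\begin{cases}
1, & t_i \le x < t_{i+1},\\
0, & \text{otherwise},
\end{cases}
\qquad
B_{i,p}(x) = \frac{x - t_i}{t_{i+p} - t_i}\, B_{i,p-1}(x)
           + \frac{t_{i+p+1} - x}{t_{i+p+1} - t_{i+1}}\, B_{i+1,p-1}(x),
```

so a degree-p B-spline basis function is a weighted average of two degree-(p−1) functions on adjacent knot spans.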

Comparison of Missing Analyses
Results from the above approach were compared to several popularly used alternative models.
The naïve way of managing missing data is the CC analysis, which discards all incomplete cases; it was applied under MCAR only. The LOCF imputation method, a simple imputation method that replaces each missing value with the latest previous observation, was also applied under MCAR only.
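For concreteness, LOCF can be sketched as follows; this is an illustrative implementation of the carry-forward rule, not the code used in the study.

```python
import numpy as np

def locf(y):
    # Last Observation Carried Forward: replace each missing value (NaN)
    # with the most recent observed value for the same subject.
    # y has shape (subjects, time points); baseline is assumed observed.
    out = y.copy()
    for t in range(1, out.shape[1]):
        gap = np.isnan(out[:, t])
        out[gap, t] = out[gap, t - 1]
    return out

y = np.array([[1.0, np.nan, np.nan],
              [2.0, 2.5,    np.nan]])
filled = locf(y)  # -> [[1.0, 1.0, 1.0], [2.0, 2.5, 2.5]]
```

The sketch makes the key drawback visible: the imputed values carry no variability, which is why LOCF understates uncertainty unless the outcome truly stays flat after dropout.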
Among advanced imputation methods, MI and the EM algorithm approach were applied under all missing data mechanisms. The MI method replaces missing values with inferences from n datasets drawn from the conditional distribution of the missing data; in the simulation study, the averaged values of five imputed datasets were used via the R function mice (van Buuren and Groothuis-Oudshoorn, 2011). In the EM algorithm approach, the missing elements were drawn from a multivariate normal model with estimates obtained by maximum likelihood using the EM algorithm, a general iterative algorithm, via the R package NORM (Dempster et al., 1977).
The LMM and the GEE approach were applied under all missing mechanisms. The LMM included a random intercept to explain subject-specific variability and was fitted with the R function lme; for the correlation estimate, the gold standard was used as the initial value. The GEE approach was fitted with the R function geeglm (Højsgaard et al., 2005) as a standard longitudinal data analysis incorporating all available data without imputation.
The random-effect pattern mixture model was applied only under the MNAR situation (Hedeker and Gibbons, 1997). To define the missing patterns, we generated a dummy variable (D) for each pattern and included it in the model. In the example of a trial with two time points and no missing values at baseline, there are two possible patterns: subjects observed at both time points and subjects observed at baseline only.

Data

A simulation study was conducted in R version 2.15.2 to assess the validity of the proposed approach by comparison to commonly used missing data analyses (CC analysis, LOCF, LMM, MI, the EM algorithm approach and the GEE approach) under each missing data mechanism, with various missing rates, in scenarios defined by the number of parameters. The comparison between models was based on bias, Standard Deviation (Std.Dev) of the estimates, mean Standard Error (SE) and 95% Coverage Probability (CP), the proportion of the time that the 95% confidence interval contains the true value of interest. For convenience, outcomes were sampled as Y ~ N(Xβ, σ²V), where σ² is a constant variance and V is a positive definite correlation matrix. One thousand individuals were randomly generated in each dataset and one thousand datasets were generated for each scenario. Since this is a longitudinal setting, the total number of outcomes was 1000 × (time points), e.g., two thousand outcomes for a model including two time points.
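The sampling condition above can be sketched in numpy using the gold-standard values reported with the tables (σ² = 2, ρ = 0.7, AR(1), β0 = 1 and no time effects); the variable names are illustrative and this is a sketch, not the study's R code.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho, sigma2 = 1000, 0.7, 2.0
t = np.arange(3)
V = rho ** np.abs(t[:, None] - t[None, :])           # AR(1): V[j,k] = rho^|j-k|
L = np.linalg.cholesky(sigma2 * V)                   # so Cov(Y_i) = sigma^2 * V
beta = np.array([1.0, 0.0, 0.0])                     # baseline mean 1, no time effect
X = np.column_stack([np.ones(3), np.eye(3)[:, 1:]])  # intercept + time dummies
mean = X @ beta                                      # (1, 1, 1)
Y = mean + (L @ rng.standard_normal((3, n))).T       # 1000 subjects x 3 outcomes
```

With 1000 subjects and three time points, each dataset contains 3000 outcomes, matching the "1000 × (time points)" count described above.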
Missing data were generated according to each missing data mechanism. Under MCAR, missingness was generated randomly at the given missing rates. For MAR or MNAR, missingness was generated by applying a cut point, set at the upper quantile corresponding to the missing rate, to the gold-standard normal distribution N(1, 2), assuming no missing values at baseline. For example, for a missing rate of 10%, the cut point for determining missing data is 2.81.
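The cut point can be reproduced from the Python standard library: for a 10% missing rate it is the 90th percentile of N(mean 1, variance 2), i.e., standard deviation √2.

```python
from math import sqrt
from statistics import NormalDist

# Upper 10% quantile of the gold-standard outcome distribution N(1, 2);
# outcomes above this value are set to missing under MAR/MNAR.
cut = NormalDist(mu=1.0, sigma=sqrt(2.0)).inv_cdf(0.90)
print(round(cut, 2))  # 2.81
```

This agrees with the 2.81 quoted above, confirming that the "2" in N(1, 2) denotes the variance.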
A simulation was conducted to compare the proposed approach with the popularly used missing data analyses described in earlier sections (CC, LOCF, MI, EM, LMM, GEE). The proposed approach was superior to CC or LOCF under MCAR and performed better than or similar to MI, EM, LMM and GEE regardless of the missing data mechanism. Under MCAR or MAR, the B-spline model produced smaller biases with high coverage probabilities; however, it did not show a substantial difference under MNAR (Appendix I).

Simulation Results
Complete Case Analysis

Table 1 presents the results of three-time-point models with the first-order autoregressive correlation structure (AR(1)) for three scenarios of different missing rates when missingness is MCAR. Under MCAR, the CC analysis produced a larger standard deviation and standard error than the proposed approach as the missing rate increased, while its coverage probability remained high because of the large standard deviation caused by discarding incomplete cases.

Table 1. Estimates of three-parameter missing data analyses with missing rates of (time 1, time 2, time 3) under MCAR. Gold standard: β0 (baseline) = 1, β1 (time(2−1)) = 0, β2 (time(3−1)) = 0, σ² = 2, ρ = 0.7; correlation structure: autoregressive(1)

When the missing rate at each time point was 25% in Table 1, for example, the standard deviation and standard error of β1 were 0.0531 and 0.0533 in the CC analysis, distant from the 0.0434 and 0.0428 of the proposed approach. However, the coverage probabilities were close to 95% in both approaches and the CC analysis retained above 94% coverage for all parameters of interest even after 50% of the data were missing. Hence, a high coverage probability in the CC analysis does not reflect its accuracy. Table 1 shows that the proposed approach with AR(1) was superior to the CC analysis in both bias and coverage probability, most notably when the missing rates varied across the three time points (10%, 25%, 50%). The bias was -0.0015 for both β1 and β2 in the CC analysis, whereas the proposed approach produced smaller biases of -0.0006 and 0.0007 for β1 and β2, respectively. The coverage probabilities for β1 and β2 in the proposed approach were also higher than those in the CC analysis.
Under MCAR, the EM algorithm approach produced larger bias than the proposed approach, especially under AR(1); it also yielded a larger difference between the standard deviation and standard error, causing lower coverage probability under both correlation structures (Table 1). When each time point had 25% missing in Table 1, the bias of β1 was 0.0018 in the EM algorithm approach versus 0.0002 in the proposed approach, and the EM algorithm approach retained 85.8% coverage probability, lower than the 94.6% of the proposed approach. The difference between standard deviation and standard error in the EM algorithm approach was 0.0113, much larger than the 0.0006 of the proposed approach. Although the EM algorithm approach produced the smallest bias and highest coverage probability under MAR, the substantial difference between standard deviation and standard error remained a problem: with a 25% missing rate, this difference was 0.0122 in the EM algorithm approach versus 0.0016 in the proposed approach (Table 2).
Under MNAR, the MI method showed greater bias and inferior coverage probability, while the EM algorithm approach yielded results similar to the proposed approach; however, both showed a larger difference between standard deviation and standard error (Appendix).

Comparisons among Approaches using all Available Data
The LMM behaved very similarly to the proposed approach under MCAR, except that it obtained a substantially biased correlation estimate (Table 1). When each time point had 10% missing in Table 1, the bias of the correlation in the LMM was -0.0362 while the proposed approach yielded -0.0055. When MAR holds, the proposed approach with AR(1) was superior to the LMM. With a 10% missing rate in Table 2, the LMM produced biases of -0.0472 and -0.0983 for β1 and β2, respectively, larger than the -0.0263 and -0.0516 from the proposed approach, and the coverage probabilities in the proposed approach were 90.4% and 82.2% for β1 and β2, much higher than the 74.5% and 45.6% from the LMM. Furthermore, the proposed approach performed remarkably better in bias and coverage probability when the true correlation parameter was low (ρ = 0.2) (Table 3). When the missing rates of group 1 and group 2 were 10% and 25% in Table 3, the LMM yielded a bias of -0.0720 for β2, larger than the -0.0326 from the proposed approach, and the coverage probability was also lower in the LMM (78.8%) than in the proposed approach (91.6%); moreover, the difference between standard deviation and standard error was smaller in the proposed approach. The results showed similarities between the LMM and the proposed approach under MNAR; however, estimates from the proposed approach with AR(1) tended to be less biased (Appendix, Tables 9.1-9.4).
The GEE approach also showed results analogous to the proposed approach under MCAR, except for the correlation, which was much less biased in the proposed approach (Table 1). With a 25% missing rate at each time point in Table 1, the GEE approach produced a bias of -0.0147 in the correlation while the proposed approach produced -0.0011. When MAR holds, the proposed approach performed better than the GEE approach, especially when the correlation structure is AR(1) (Table 2). When the missing rate was 10% in Table 2, the GEE approach yielded biases of -0.0588 and -0.1180 for β1 and β2, respectively, much larger than the -0.0263 and -0.0516 from the proposed approach, and the coverage probabilities in the GEE approach (65.7% and 33.1%) were much lower than those in the proposed approach (90.4% and 82.2%). Under MNAR, the GEE and proposed approaches produced similar results; however, the proposed approach tended to attain smaller bias and higher coverage probability (Appendix, Tables 9.1-9.4). Overall, the proposed approach was superior to the GEE approach when the correlation structure is AR(1), regardless of the missing data mechanism.
The random-effect pattern mixture model yielded a smaller bias on the time effect (β1), yet produced greater biases for the other estimates (Table 4). When ρ = 0.7 in Table 4, the bias of β1 in the random-effect pattern mixture model was -0.0919, smaller than in both the LMM (-0.1644) and the proposed approach (-0.1734); however, β2 was much more biased (-0.1842) than in the LMM (-0.0480) and the proposed approach (-0.0508). Hence, the random-effect pattern mixture model does not seem to be a compatible alternative to the proposed approach under the MNAR mechanism unless the time effect itself is the primary interest of the study.

B-Spline
Under MCAR or MAR, the B-spline model produced smaller biases with high coverage probabilities, except for the baseline (β0), and its standard deviation and standard error also became much smaller when the B-spline was applied (Table 5). Under MAR with 25% missing in Table 5, the bias and standard deviation of β3 were reduced from -0.3018 to 0.0014 and from 0.3465 to 0.0773, respectively. The results did not differ much among the three B-spline models with different numbers of knots, meaning that adding more knots did not improve the model fit in this example. Under MNAR, the B-spline model did not make a substantial difference compared to the model without the B-spline (Appendix II).
TIME Trial

A total of 120 post-AMI patients who satisfied the inclusion and exclusion criteria for TIME were enrolled from 5 sites and randomized to treatment and control therapy in a ratio of 2:1 within each time of administration (Traverse et al., 2009). The study population was adults at least 21 years of age.

Results
The distribution of the population in the TIME study was summarized in a previous article (Traverse et al., 2012). There were substantially more males than females in the randomized cohort and patients were on average 57 years of age. Cell therapy produced no significant changes in LV function in either the Day 3 or Day 7 evaluations. There were no missing values at SPI and four missing values at endpoint on Day 3, while two and three missing values occurred at SPI and endpoint, respectively, on Day 7; hence, the total missing rate was 3.0% on Day 3 and 4.7% on Day 7. Table 6 shows the effect estimates from the model adjusted for gender and age, separately by time of administration (Day 3 and Day 7). The BMMNC therapy effect did not differ greatly from the basic model, except for the MI and EM algorithm approaches, which fluctuated after the adjustment. For example, on Day 7, the point estimate of the BMMNC therapy effect decreased by 0.33% after adjusting for gender and age in the proposed approach, while it decreased by 15.49% in MI and by 43.33% in the EM algorithm approach. This suggests that imputation methods may be more sensitive to the covariates included in the model as adjustments than methods that incorporate all available data. The time effect changed slightly after the adjustment, yet none of the seven missing data analyses applied (CC, LOCF, MI, EM, LMM, GEE and the proposed approach) showed a change in its significance. Global LV function among females was significantly larger than among males on Day 3, but not on Day 7. Age did not show statistical significance on the improvement of global LV function. The absolute effect size of BMMNC therapy increased after adjustment for gender and age at both times of administration, but not significantly.
Scatter plots of age against global LV function at SPI are presented in Fig. 1 with the predicted lines (solid lines) from B-spline models with eight and five internal knots for Day 3 and Day 7, respectively. The number of internal knots was determined based on the improvement in model fit from the spline. The two dashed lines in each plot represent the upper and lower 95% confidence limits. Table 7 presents the results of the proposed approach with and without the B-spline function. The mean global LV function at SPI changed from 43.150 to 41.964 on Day 3 and from 48.640 to 51.202 on Day 7 when the B-spline function was applied. The B-spline moved the BMMNC therapy effect towards the null value at both times of administration, changing it from 4.447 to 4.0 on Day 3 and from -4.629 to -3.743 on Day 7. However, the other effects were analogous between the models with and without the B-spline.

Table 7. Estimates from the proposed approach in the basic model adjusting for gender and age, with and without the spline, using the generalized least squares method at each time of administration

Discussion
Public health relies on clinical trials to confirm the effect of a therapy or drug. However, clinical trial interpretation is vitiated in the presence of missing data, a common occurrence in clinical trials. Since missing outcomes occur more frequently in longitudinal studies, it is important for statisticians to perform the most compatible longitudinal data analysis in a missing-outcome environment.
The proposed approach, based on the generalized least squares method, produces the best linear unbiased estimate if the covariance matrix V is known, even in the presence of missing outcomes. In practice, however, V is usually unknown. In the proposed approach, the REML method was applied to estimate the correlation parameter, assuming a common variance σ² under a pre-specified correlation structure (i.e., compound symmetry or AR(1)), using all paired data. Since only complete pairs are used to estimate the correlation, bias may arise if the paired data do not represent the entire sample in terms of the within-subject correlation over time. However, this is not a problem unique to the proposed approach; all other missing data analyses incorporating all available data, e.g., the LMM and the GEE approach, also have difficulty estimating covariance parameters from fractional data. The problem can be even worse in imputation methods because of the additional uncertainty introduced by imputing values that were never observed.
Another limitation of the proposed approach is that it only allows time-invariant covariates; hence, further research is required for managing time-variant covariates. Nevertheless, the simulation study showed the validity of the proposed approach, especially its superiority when the correlation structure is AR(1) under the MAR mechanism, which is the more plausible mechanism in longitudinal health research. The proposed approach is also compatible when missing data are MCAR, regardless of the correlation structure; however, it may not be valid under MNAR.
Each missing data analysis applied in this study has critical pitfalls along with strengths and benefits. The CC analysis is easy to execute, which is a great advantage it shares with the proposed approach; however, it yields a larger variance because of the loss of information, which can be drastic in a longitudinal setting, and it can be seriously biased unless MCAR holds (Demissie et al., 2003). Imputation can be useful when its conditions are met, but it can also lead to severe bias and inflate the uncertainty of the parameter estimates by imputing unobserved values. Simple imputations such as LOCF are also easy to carry out, yet they ignore the variability of the missing data, which is not plausible in reality.
LOCF also introduces serious bias unless MCAR holds (Lane, 2008). MI reflects the variability of the missing data, complementing the drawback of simple imputations; nevertheless, its plausibility is still debated (Sterne et al., 2009; Engels and Diehr, 2003; Schafer and Olsen, 1998). Moreover, since MI is a distribution-based method, it can be biased if the distribution is misspecified, whereas the proposed approach is not restricted to a distribution since it is based on the Gauss-Markov theorem. The EM algorithm is a likelihood-based imputation method guaranteed to increase the likelihood at every iteration; however, it has shortcomings that the proposed approach does not, e.g., it is not guaranteed to yield a closed form (an analytical expression with a finite number of standard functions) and it can have convergence issues. Moreover, both MI and the EM algorithm approach showed a substantial difference between standard deviation and standard error in the simulation study, indicating unreliable estimates; this was caused by the additional uncertainty from imputed values that were never observed.
Missing data analyses incorporating all available data behaved similarly to the proposed approach. The LMM, which yields the BLUE for fixed effects when the covariance is known, allows unbalanced data when MAR holds (Beunckens et al., 2005; Laird and Ware, 1982). However, it can fail to converge for several reasons, such as ill-conditioned samples, a mis-specified model or violation of the normality assumption, and the models are more restricted because of the additional assumptions on the random effects. Therefore, the proposed approach can be used more widely than the LMM since it is less constrained in its assumptions. The GEE approach is less restricted, employing estimating equations with a working correlation instead of a fully specified joint multivariate distribution (Liang and Zeger, 1986); yet it produces inconsistent estimates unless MCAR holds. Robins et al. (1995) developed the weighted GEE, which yields consistent estimates under MAR; however, it is only valid when the model for the missingness probability is correctly specified and its superiority is still debated (Qu et al., 2011). We applied the random-effect pattern mixture model for comparison with the proposed approach when missingness is MNAR. The random-effect pattern mixture model was not superior to the proposed approach unless the time effect is the primary interest of the study; in addition, it may not be valid without a sufficient number of missing patterns. Several other methods have been introduced for managing missing not at random data and further research could compare them with the proposed approach (Troxel et al., 1998). Finally, the purpose of the simulation study was to understand some of the asymptotic properties of the estimates, and hence it used large sample sizes; estimator properties in small samples are worthy of further examination.
The simulation study showed that applying the B-spline can help obtain a smaller variance of the estimates, which yields more efficient hypothesis testing under MCAR or MAR. However, it did not show much improvement in model fit under MNAR; hence, further research may be required for that setting.
Although most missing data analyses are restricted to a missing data mechanism, it is difficult to identify the mechanism (MCAR or MAR) in reality. Hence, it is hard to determine which missing data analysis is the most compatible for the TIME trial. Moreover, with the small number of missing values in the TIME trial, the missing data analyses may not be easily distinguishable with respect to the BMMNC therapy effect. However, we expect more variability among missing data analyses in future studies with higher missing rates, since the loss of information, or the extra uncertainty added to the estimates by imputation, tends to carry a greater risk of distorting the truth than approaches using all available data. Although the LMM and GEE approaches also incorporate all available data, the proposed approach is less restricted than the LMM, since it does not include random effects, and more flexible than the GEE approach with regard to the missing data mechanism. Therefore, the proposed approach can be an alternative to current missing data analyses for a continuous outcome.
There is no perfect solution to missing data problems, because information that should have been collected is lost. Although missing information can be categorized, it is difficult to identify the three types of missing data mechanism in practice, except for a few cases where MCAR and MAR are distinguishable. Moreover, current models are properly established for complete data; hence, none can promise the correct solution in a missing data environment. Therefore, we should first and foremost try our best to collect complete data. In the presence of missing data, we need to fully comprehend the study and apply the most compatible approach. We believe the proposed approach can be a powerful alternative for managing missing outcomes in longitudinal studies where the within-subject correlation over time can be expressed as AR(1).

Conclusion
Since missing data occur commonly in health research, many statistical methodologies have been introduced to manage missing data problems effectively. Although more advanced methods are recommended, simple approaches are still used for the primary analysis in the majority of studies because of their simplicity in understanding and execution. The proposed approach is as straightforward as the simple approaches, incorporating all available data without deletion or imputation of information. The simulation study showed its validity; in particular, it was superior to the current missing data analyses applied for comparison when the correlation structure was first-order autoregressive under the missing at random mechanism.