AN EFFECTIVE TECHNIQUE OF MULTIPLE IMPUTATION IN NONPARAMETRIC QUANTILE REGRESSION

In this study, we consider the nonparametric quanti le regression model with the covariates Missing at Random (MAR). Multiple imputation is becoming an in creasingly popular approach for analyzing missing data, which combined with quantile regression is no t well-developed. We propose an effective and accur ate two-stage multiple imputation method for the model based on the quantile regression, which consists of  initial imputation in the first stage and multiple imputation in the second stage. The estimation proc edure makes full use of the entire dataset to achieve inc reased efficiency and we show the proposed two-stag e multiple imputation estimator to be asymptotically normal. In simulation study, we compare the performance of the proposed imputation estimator wi th Complete Case (CC) estimator and other imputatio n estimators, e.g., the regression imputation estimat or nd k-Nearest-Neighbor imputation estimator. We conclude that the proposed estimator is robust to t he initial imputation and illustrates more desirabl e performance than other comparative methods. We also apply the proposed multiple imputation method to a n AIDS clinical trial data set to show its practical application.


INTRODUCTION
Quantile regression has been widely used in analyzing the relationship between response and covariates since its first introduction in (Koenker and Bassett, 1978). Compared with mean regression, quantile regression is able to depict the impact of covariates on various quantiles of the response, which provides more information for analysis. Furthermore, quantile regression is robust to outliers in data and distributionfree for error term. Due to its advantages, quantile regression has illustrated its increasingly importance in modeling and has attracted great attention in data analysis and empirical applications, nonparametric quantile regression modeling is such an example. Consider the following nonparametric regression model: where quantile τ∈(0, 1), Q τ (Y|X = x) is the τ-th conditional quantile of Y given X = x. c τ is the τ-th quantile of error term ∈ and satisfies c τ = F -1 (∈), where F(∈) is the unknown distribution function of ∈.
Here, without loss of generality, covariate vector x does not contain constant 1, which means there is no intercept term in m(x) and ensures the identification of the model. This model overcomes many disadvantages of usually used parametric models in

Science Publications
JMSS which misspecification could be encountered. Nonparametric regression does not assume that the relationship between response and covariates to be linear or satisfy some specified form, which might be more reasonable for most of data set and thus more flexible than parametric models. Especially when data set does not present some kind of parametric form, nonparametric regression model could be a plausible choice since it avoids the great bias due to the wrong model form assumption and brings increasing accuracy and more reasonable explanations. The above nonparametric quantile regression model can be widely applied to many empirical data analysis, where the data set is complete. However, it is unavoidable to face with data set with missing data, so it is necessary to extend the above model to deal with missing data. In practice, missing data is very pervasive and the reasons for missing are various. More details Little and Rubin (1978); Robins et al. (1994) and Vach (1994). In this study, we pay more attention to the nonparametric quantile regression model (1.1) with the covariates missing at random, which has the following form Equation (1.2): where, (X, Z) are covariate vectors, X may be missing whereas Z is all observed in sample interval. Denote n as the sample size. For notation simplicity, we suppose that the first n 1 observations are complete while the remaining n 0 are missing in X. Therefore, rewrite the sample as {(Y i , X i , Z i ): i = 1,···, n 1 } and {(Y j ,·, Z j ): j = n 1 +1,..., n}. Let δ be a missing indicator whose value is 1 when X is observed and else 0 when X is missing. Then, δ i = 1 for i = 1,···, n 1 while δ i = 0 for i = n 1 +1,..., n. Here we assume that X is MAR which takes the form of conditional independence, i.e., X and δ are conditionally independent given (Y, Z) Equation ( In order to estimate model (1.2), we may just consider the observed data and ignore the observations with missing values, which is called the CC analysis. Although we can obtain consistent estimator for m(x, z) through CC analysis under MAR assumption, it may be misleading and inefficient when missing rate is high. Therefore, it is necessary to construct a more reasonable estimator to make use of the information in data set, e.g., imputation estimator. Particularly, the multiple imputation methods often bring more reliable inference than single imputation methods and perform better in missing data problems. In this study, we focus on the estimation of model (1.2) under MAR assumption based on local linear fitting and propose an effective and easyto-use two stage multiple imputation estimator, which improves the estimation efficiency to a large extent.
In the context of mean regression, parametric or nonparametric regression models with missing data have been studied in many papers. Anderson (1957) derived the maximum likelihood estimates of parametric models and Cheng (1994) and Chu and Cheng (1995) studied the nonparametric regression estimation with missing response. Wang and Rao (2001; and Wang et al. (2004) studied the estimation of generalized linear models, linear models, semiparametric models with missing response, respectively. Furthermore, quantile regression models with missing data also have been considered in literature. It should be noted that the above research mainly consider the models with missing response rather than missing covariates. Under mean regression, Liang et al. (2004) considered the partially linear model with covariate missing depending on other complete covariates and response. Wu and Wu (2001) proposed a multiple imputation method for missing covariates in non-linear mixed-effects models and applied the proposed method to HIV Dynamics. Robins et al. (1994) studied the regression coefficients estimation with missing covariates. Wang (2009) also considered the estimation of partial linear models with covariables data missing at random. With respect to quantile regression, Wei et al. (2012) studied the multiple imputation for parametric quantile regression model with missing covariates, which provided a new imputation method. However, nonparametric quantile regression with missing covariates has not been considered up to now. Based on the existing research and methods, we propose a two-stage multiple imputation method for nonparametric quantile regression with missing covariates, which greatly enriches the methods to cope with missing data in quantile regression.
The rest of the paper is organized as follows. In Section 2, we develop nonparametric quantile regression with missing covariates based on a two-stage multiple imputation method and present main results of the asymptotic properties for the proposed estimator. Section 3 compares our methods with regression imputation method, k-Nearest-Neighbour and Nearest-Neighbour methods through simulation study. Discussion is available in Section 4.

ESTIMATION WITH MULTIPLE IMPUTATION
In this section, we present the estimation of model (1.2) under CC case and propose a two stage multiple imputation estimator. For the CC estimator and the proposed estimator, we further study its large sample properties.

CC Estimator
For model (1.2), we first consider the model estimation under CC case, which is the basis of the twostage multiple imputation estimator.
To estimate Q τ (Y|X = x), the conditional quantile of Y given X = x in model (1.2) under CC case, we apply classical local linear fitting and quantile regression method and consequently have the following objective function Equation (2.1):

Two-Stage Multiple Imputation Estimator
In this subsection, we propose a two-stage multiple imputation estimator for model (1.2). The basic idea of the two-stage multiple imputation estimator is to impute the missing data via the estimated conditional density ( ) f x | y, z and then estimate model (1.2) based on the complete data including imputed data.
To obtain this estimator, two stages are performed, where initial imputation values are realized in the first stage while multiple imputation values are obtained based on these initial imputation values. Then we discuss about these two stages in detail.

First-Stage Imputation
In the first stage, we can obtain initial imputation values through many imputation methods. Here we consider the following three methods to get initial imputation values: • Regression Imputation. Based on the MAR assumption in Section 1 and the dependence of x on z, construct linear regression model for x given z with the complete data and obtain the parametric estimates. Then impute the missing x via the prediction values based on the corresponding z • k-Nearest-Neighbor Imputation. For j = n 1 + 1,...,n, find the k nearest data pairs (y l , z l ) (l = 1,..., k) of data pair (y j , z j ) in the complete data and the corresponding points x l (l = 1,..., k) are the k nearest points in distance of missing data x j . Then impute x j by averaging these points, i.e., k j l l 1 1 x x k = = ∑ ɶ • Nearest-Neighbor Imputation. Different with k-Nearest-Neighbor imputation, the Nearest-Neighbor imputation just considers the nearest one point of missing x as the imputation value, i.e., k = 1. For j = n 1 + 1,...,n, find the nearest data pair (y l , z l ) of data pair (y j , z j ) in the complete data and the corresponding point x l are the nearest point in distance of missing data x j . Then impute x j through this point, i.e., j l x x = ɶ

Remark:
The first imputation method is based on regression imputation while the third method belongs to matching method. The above regression imputation requires the linear relationship between missing covariate and regression variables. Matching is nonparametric imputation method which allows imputation without estimating conditional distribution of missing variable. Further information about regression imputation and matching method, Little and Rubin (1987) and Chen and Shao (2000).
It should be noted that a reasonable two-stage imputation estimator should be insensitive to the above initial imputation methods. In other words, if our proposed Science Publications JMSS two-stage multiple imputation estimator is effective and reasonable, it should be stable under different initial imputation methods. In the simulation study, we will illustrate the robustness of the proposed estimator to initial imputation. Based on these initial imputation values of missing x, we can estimate the conditional density f(y|x, z) and then obtain the estimated conditional density ( ) f x | y, z . We discuss the above estimation process in the following second-stage imputation.

Second-Stage Imputation
In this stage, we realize the multiple imputation based on the estimated conditional density ( ) f x | y, z and estimate model (1.2) using the whole data after multiple imputation. This stage can be concluded as the following steps: Step 1: Estimate conditional density f(x|y, z).
According to Bayes formula, f(x|y, z) ∝ f(y|x, z)f(x|z). It is reasonable to estimate f(x|y, z) through estimating f(x|z) and f(y|x, z) respectively, which can be realized via the following steps.
Step 1a: Estimate conditional density f(x|z). Model x given z parametrically as f(x|z, η) and obtain the estimate η and the estimated conditional density ( ) f x | z of x given z can be denoted as ( ) f x | z,η .
Step 1b: Estimate conditional density f(y|x, z). The quantile function is the inverse distribution function, so the density function can be expressed as the reciprocal of the first derivative of the quantile function at the corresponding quantile level. Here we choose K n quantile levels τ k = k/(K n + 1) (k = 1,..., K n ), similarly and approximate the conditional density f(y|x, z) as follows Equation (2.3): is the estimated τ k -th conditional quantile of Y in model (1.2) with the whole data set including initial imputed data from first-stage imputation.
At last, normalize ( ) ( )f y | x, z f x | z to be a density, then we get the estimated conditional density ( ) f x | y, z : Step 2 as the quantile levels and obtain the corresponding quantiles from empirical distribution function ( ) F x | y, z , which can be seen as the imputation values.
Step 3: Estimation of model (1.2) using the whole data after multiple imputation. Consider a new objective function including the observed data and the l-th imputed data set as follows: as the estimated coefficient under the l-th imputation data. Repeat the imputation estimation step L times and obtain the two-stage multiple imputation estimator Equation (2.4): For the two-stage multiple imputation estimator ˆ* ( ) β τ obtained based on the above two-stage Science Publications JMSS imputation, we derive its asymptotic properties in the following subsection.

Large Sample Properties
In this section, we give the asymptotic distribution of the two-stage multiple imputation estimator ˆ* ( ) β τ . Let h(τ; X,Z) = 1/Q τ (X, Z) be the density of Y given X and Z at τ-th quantile. Recall that l n ( ) β τ is the CC estimator in section 2.1 and n(l) ( ) β τ is the estimator obtained from the objective function based on the whole data including the l-th imputed data of missing values in section 2.2.2. Define the following objective function: The above three estimators are the basis for the twostage multiple imputation estimator ˆ* ( ) β τ . Then define the functions as follows: To obtain the asymptotic properties for ˆ* ( ) β τ , we list the following assumptions needed in proof.

Assumption 3:
The covariate X has bounded support χ. The true conditional density f(x|z) = f(x|z, η = η 0 ), where f(x|z, η) is a continuous function of η uniformly for (x, z) in a neighbourhood of η 0 and is bounded away from zero and infinity for all (x, z).

Assumption 6:
The d-dimensional kernel function K(·) is a bounded density function with a compact support C d within the interior of the support of f(x) such that ∫K(u)du = 1, ∫uK(u)du = 0 d , ∫uu T K(u)du > 0 d×d .

Assumption 7:
The bandwidth matrix H of the kernel function satisfies det(H) → 0 and n · det(H) → ∞, as n→∞.

Remark:
The Assumption 1 and Assumption 2 ensure the existence of solutions for objective functions. Assumption 3 and Assumption 4 focus on the conditional density f(y|x, z). Assumption 6 and Assumption 7 are common in nonparametric estimation, which represent the assumptions for kernel functions and bandwidths, respectively.
Additionally, we also make the following definitions: Based on the above regularity conditions, the twostage multiple imputation estimator ˆ* ( ) β τ has the asymptotic distribution in the following theorem.

Theorem 1:
Under (1.3) and the above Assumptions 1-7, for K n →∞ and K n n -1 → 0, the multiple estimator Based on the Theorem 1, we gives the large sample property of conditional quantile estimator Q (x, z) τ as follows.

Bandwidth Selection
It is well known that the selection of bandwidths in nonparametric regression estimation is of vital significance. The nonparametric estimation results depend on the bandwidth selection to a large extent. Silverman (1986) pointed out that the choice of bandwidth is much more important than the choice of kernel function. Thus, it is necessary to choose reasonable bandwidths to improve the performance of estimation. There are many bandwidth selection methods, such as Plug-in method and cross-validation method. Based on the bandwidth selection in mean regression and quantile regression proposed in Yu and Jones (1998) and Silverman (1986), we discuss about the selection of bandwidths in estimating model (1.2).
According to Yu and Jones (1998), we have the following bandwidth selection formula for quantile regression Equation (2.5): where, τ is the quantile level, h τ is the optimal bandwidth for the τ-th quantile regression, h mean is the optimal bandwidth for the mean regression estimation, φ(·) and Φ(·) are the probability density function and cumulative distribution function of the standard normal distribution respectively. In terms of the optimal bandwidth for the mean regression estimation h mean , we choose the Silverman's rule-of-thumb bandwidth, i.e., h mean , where σ can be the sample estimator of standard deviation σ. Based on (2.5) and the rule-ofthumb bandwidth, we obtain the optimal bandwidth for model (1.2).

NUMERICAL SIMULATION
In this section, we implement three simulation examples to illustrate the finite sample performance of the two-stage multiple imputation estimator and compare the performance of the proposed imputation estimator with the CC estimator and other imputation estimators. Specifically, we utilize the two-stage multiple imputation based on the three initial imputation methods in section 2.2.1 and compare these results with that of the first-stage imputation and the CC estimator. Denote the CC estimator, the three first-stage imputation estimators (the regression imputation estimator, the k-Nearest-Neighbor imputation estimator and the Nearest-Neighbor imputation estimator) and the two-stage multiple imputation estimator based on the above three initial imputation methods as CC, RI, kNN, NN, TSMI1, TSMI2 and TSMI3, respectively.
The first example represents the linear case of function m(x, z) to be estimated while the second example is on behalf of nonlinear case of function m(x, z) and the third example stands for the heteroscedastic case. In both the three simulation examples, we consider different sample sizes n = 60, 120 and 200, respectively and distinct missing probability function P(y, z) under different quantile levels τ = 0.25, 0.5 and 0.75. In terms of kernel function in estimation, we choose Gaussian kernel and product kernel K h (x, z) = K h (x)K h (z). For the selection of bandwidths, here we choose bandwidths for the above 7 estimators according to the selection rule in section 2.4. For the missing probability function P(y, z), we choose the following three functions Case 1: where AMSECC is the AMSE of the CC estimator while AMSETMSI1 is the AMSE of TMSI1.

Example 1:
Consider the following linear quantile regression model Equation (3.1): where the covariate (X, Z) are jointly normal with mean vector (4, 4) T , variance (1, 1) T and correlation 0.5 and ∈ is from standard normal distribution N(0, 1). For this model, we consider the above three missing cases. where the covariate (X, Z) are jointly normal with mean vector (4, 4) T , variance (1, 1) T and correlation 0.5 and ∈ ~ N(0, 1). For model (3.2), we also choose the above three missing probability functions. The AMSE values of the 7 estimators and ARE values of our proposed estimators with other estimators for model (3.2) are given in Table 2 and 5, respectively.

Example 3:
A remarkable advantage of quantile regression is that it does not require strict assumptions on error distribution, which brings us convenience to analyze model with heteroscedasticity. Thus, here we consider the following heteroscedastic model Equation (3.3): where the covariate (X, Z) are jointly normal with mean vector (4, 4) T , variance (1, 1) T and correlation 0.5 and ∈ is from standard normal distribution. For model (3.3), we still use the above three missing cases. The AMSE values of the 7 estimators and ARE values of our proposed estimators with other estimators for model (3.3) are shown in Table 3 and 6 respectively. Table 1-3 illustrate the estimation results AMSE of CC, RI, kNN, NN, TSMI1, TSMI2 and TSMI3 estimators for model (3.1), model (3.2), model (3.3), respectively, with different sample sizes n = 60, 120 and 200, respectively and distinct missing probability function p(y, z) under different quantile levels τ = 0.25, 0.5 and 0.75. From these tables, overall, the estimation effects are the best under τ = 0.5 for all the 7 estimators, which is consistent with the conclusions for quantile regression models. Via the comparison of AMSE values for the 7 estimators under the same sample size, the same missing function and the same quantile level, we conclude that the estimation performance of our proposed estimators TSMI1, TSMI2 and TSMI3 is uniformly better than that of the CC estimator and the initial imputation estimators. Compared with the CC estimator, the initial imputation estimators RI, kNN and NN have similar estimation results and even perform worse than the CC estimator, while our proposed estimators improve a lot than the CC estimator. Apparently, it is necessary to use our proposed estimators to improve estimation performance.   In terms of the different missing functions, we can see that, on the whole, all the 7 estimators perform worse as the missing rates increase under the same sample size and the same quantile level, which is common for analyzing data sets with missing values. However, for this conclusion some exceptions exist as sample size increases and when it is big enough. These two conclusions reflect the relative importance of imputation when sample size is small and missing rate is high.

JMSS
Furthermore, Table 4-6 show ARE values of TSMI1, TSMI2 and TSMI3 with CC and corresponding firststage imputation estimators for model (3.1), model (3.2), model (3.3), respectively, with different sample sizes and various missing probability functions under distinct quantile levels. According to Table 4-6, with all the ARE values are larger than 1, we find that our proposed two-stage multiple imputation estimators are uniformly more effective than the CC estimator and the first-stage multiple imputation estimators for all the models considered. For any one of the above models, under the same quantile level and the same missing function, overall, the relative efficiency of our proposed estimators increases as sample size increases. Additionally, under the same quantile level and the same sample size, overall, the relative efficiency of our proposed estimators increases as missing rate increases. What's more, the advantages of our proposed estimators are more obvious when model is nonlinear or heteroscedastic and at extreme quantile levels.
In addition, we can see that the estimation results of our proposed estimators TSMI1, TSMI2 and TSMI3 are very close, which reflects the robustness of our two-stage multiple imputation estimator to the initial imputation methods. This point is of vital importance for the application of our methods. Based on this good property, we can choose one kind of initial imputation methods to realize our two-stage multiple imputation, which provides great convenience for implementation and applications.

EMPIRICAL DATA ANALYSIS
In this section, we apply the proposed two-stage multiple imputation method to the ACTG 315 data set, which can be found on the website:

JMSS
http://www.urmc.rochester.edu/biostat/people/faculty/wu site/datasets/data/ACTG315LongitudinalDataNLME Data3.cfm. Meanwhile we analyze this data set using CC method for comparison. The ACTG 315 data set comes from an AIDS clinical trial group (ACTG 315) study which aimed to investigate the relationship between virologic and immunologic responses in AIDS clinical trials. In this data set, virologic response RNA was measured by viral load while immunologic response was measured by CD4 cell count. The ACTG 315 data set has been analyzed by many papers. Liang et al. (2004) analyzed this data set via partially linear models. Wu and Wu (2002) used non-linear mixed effects models for this data set in which more details about the data can be found. Similar research Wu and Wu (2001) and Zeger and Diggle (1994). Recently, Grun and Hornik (2012) used a mixed effects model while accounting for censored longitudinal data. Guo et al. (2014) considered the multi-index regression models with missing covariates at random to study the effect of the tumor necrosis factor. However, these papers just constructed mean regression models to analyze this data set, we may want to obtain more information from the analysis. For instance, we may be more interested in the influence of covariates on different quantiles of response variable; we may want to explore the influence pattern without specifying the model form in advance. Such analysis aims can be realized by the nonparametric quantile regression model, which is the interested model in this article. The data set we used here has 317 observations in total with 20.19% CD4 cell counts missing. Similar to the analysis in Liang et al. (2004), we here choose the viral load as the response Y while CD4 cell count as the missing covariate X and time as the complete covariate Z. According to the related research, the missingness of CD4 cell counts is due to the distinct measure times of CD4 cell counts and viral load. Thus, it is reasonable to assume this data as MAR.
Since the missing rate is relatively high, CC analysis may lead to information loss to some extent and hence imputation for the missing data can be necessary to consider. Based on the above mentioned analysis aims and data imputation requirement, we apply model (1.2) to this data set and utilize the proposed two-stage multiple imputation methods to estimate the model. To verify and compare the performance of our proposed two-stage multiple imputation methods, CC method is also implemented. Here we consider quantile levels τ = 0.25, 0.5 and 0.75 and choose Gaussian kernel and product kernel. In terms of the bandwidths selection, we use the bandwidths obtained according to the selection rule in Section 2.4. and TSMI3 represent the ARSSs of the CC method and the proposed two-stage multiple imputation methods, respectively. Figure 1-3 show the estimation results of quantile function Q(x, z) based on different methods under τ = 0.5, 0.25 and 0.75, respectively. From Table 7, we can know that, overall, the smaller values of ARSS of our proposed two-stage multiple imputation methods show that our methods perform better than CC method in terms of data fitting. We also calculate the relative efficiency of our methods compared with the CC method, which is measured via the ratio of ARSSs and we find that our proposed twostage multiple imputation methods can improve about 5% under τ = 0.5. In addition, the estimation result under τ = 0.5 is best, which is common on quantile regression. From Fig. 1, we can see, under τ = 0.5, our three multiple imputation methods show similar results, which reflects the robustness of the proposed imputation method to the initial imputation. Furthermore, our estimation results represent bigger variation of viral load between different time, which shows the distinct influence of cd4 cell count on virologic response under different time. Therefore, our proposed two-stage multiple imputation methods reflect more helpful information to some extent due to their full use of more data information. Similar conclusion can be obtained from Fig. 2 and 3.
In addition, from the comparison of these three figures, we can see the different influence patterns among distinct quantile levels of the viral load. In other words, at different virologic response levels, immunologic responses show diversity. Such additional information and conclusion from our analysis can provide more useful signal for relevant research.

CONCLUSION
In this study, we study the nonparametric quantile regression model with the covariates missing at random. We propose an effective and convenient two-stage multiple imputation method for the model and construct the two-stage multiple imputation estimator and give the asymptotic properties of the proposed estimator. Via several simulation examples, we compare the finite sample performance of the proposed estimators under different initial imputation methods with CC estimator, the regression imputation estimator, k-Nearest-Neighbor imputation estimator and the Nearest-Neighbor imputation estimator, which reflects the accuracy and efficiency of the proposed method. In empirical analysis, we construct nonparametric quantile regression model and apply the proposed multiple imputation methods to analyze the ACTG 315 data set and we find that our methods could fit better and give more useful information than CC method.
In addition, the research can be extended to additive quantile regression models with missing covariates, which can avoid the "curse of dimensionality" in nonparametric regression. Furthermore, we can apply the proposed two-stage multiple imputation method to semiparametric models which have more flexibility and interpretation.