MISSING DATA IMPUTATION USING WEIGHTED OF REGIME SWITCHING MEAN AND REGRESSION

Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. The purposes of this work were, first, to develop the Weighted of Regime Switching Mean and Regression (WRSMRI) method for missing data estimation and, second, to compare its efficiency of estimation and the statistical power of a test, under Missing Completely at Random (MCAR) data and simple random sampling, with other methods, namely Mean Imputation (MI), Regression Imputation (RI), Regime Switching Mean Imputation (RSMI), Regime Switching Regression Imputation (RSRI) and Average of Regime Switching Mean and Regression Imputation (ARSMRI). Using simulated data, the comparisons were made under the following conditions: (i) three sample sizes (100, 200 and 500), (ii) three levels of correlation between variables (low, moderate and high) and (iii) four percentages of missing data (5, 10, 15 and 20%). WRSMRI gave the best imputation in terms of MSE and the estimated sample correlation, whereas RSRI gave the best results in terms of MAE, MAPE, power of the test and the estimated sample mean and variance.


INTRODUCTION
Missing data are a common problem in quantitative research (Heeringa, 2010), even when preventive and corrective mechanisms are rigidly controlled (Huisman, 2000). Streiner (2013) showed that 10% missingness in multivariate random variables led to analytical errors of up to 59%. Estimating missing data can substantially improve the quality of research in education services (Peng, 2006). For example, missing marks in examination papers are critical because they can cause both type I and type II errors (Robitszsch and Rupp, 2009).
Regarding the strong points of the missing data methods, Sentas and Angelis (2006) described listwise deletion (LD), in which cases with missing values for any of the variables are omitted from the analysis. The procedure is common in practice because of its simplicity, but when the percentage of missing values is high it leaves only a small complete subset of the initial data set and therefore makes it difficult to construct a valid cost model. The Mean Imputation (MI) method replaces the missing observations of a variable with the mean of the observed values of that variable. It is a simple method that generally performs well, especially when the valid data are normally distributed. In the Regression Imputation (RI) method, the missing values are estimated by multiple regression, where the variable with missing data is treated as the dependent variable and all other variables as predictors.
Regarding the weak points, Little and Rubin (2002) explained that the variance obtained from the LD (listwise deletion) technique is underestimated. Similarly, Brockmeier et al. (2003) found that the variance obtained from the MI technique is underestimated. Little (2005) showed that the RI method suffers from the same underestimation and is, in addition, exposed to the problem of multicollinearity.

JMSS
This study presents a novel approach to the recovery of missing data that employs the Weighted of Regime Switching Mean and Regression (WRSMRI) method. The objective is to compare its efficiency in estimating the missing values, the sample mean, the sample variance and the sample correlation, as well as the power of the test, under both Missing Completely at Random (MCAR) data and Simple Random Sampling (SRS), with other methods, namely Mean Imputation (MI), Regression Imputation (RI), Regime Switching Mean Imputation (RSMI), Regime Switching Regression Imputation (RSRI) and Average of Regime Switching Mean and Regression Imputation (ARSMRI).

Data Set
In this section, we introduce and describe the data set:
• Population: Three populations of 10,000 units each were simulated by the Monte Carlo technique, one for each level of correlation between variables (low ρ = 0.3, moderate ρ = 0.5 and high ρ = 0.7) (Chaimongkol, 2004; Heeringa, 2010; Little and Rubin, 2002; Viragoontavan, 2000).
• Sampling method: We used Simple Random Sampling (SRS) with sample sizes of 100, 200 and 500 units (Chaimongkol, 2004; Viragoontavan, 2000). The sampled data set is denoted by $y_1, y_2, \dots, y_n$.
• Missing data pattern: We generated missing data under Missing Completely at Random (MCAR) at 5, 10, 15 and 20% of the sample (Viragoontavan, 2000). The missing data set was created from the complete data set by MCAR, which splits the data into two groups: the complete data $y_1, y_2, \dots, y_r$ and the missing data $y_{r+1}, y_{r+2}, \dots, y_n$.
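The data-generation steps listed above can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the paper does not state the marginal distributions, so standard normal variables are assumed here, only y is made missing while x stays fully observed, and the function names and the seed are hypothetical.

```python
import math
import random


def simulate_population(size, rho, rng):
    """Draw (x, y) pairs from a standard bivariate normal with correlation rho.

    ASSUMPTION: standard normal marginals; the source only gives rho and size.
    """
    pairs = []
    for _ in range(size):
        x = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        # y has unit variance and correlation rho with x by construction
        y = rho * x + math.sqrt(1.0 - rho ** 2) * e
        pairs.append((x, y))
    return pairs


def draw_srs(population, n, rng):
    """Simple random sampling without replacement."""
    return rng.sample(population, n)


def apply_mcar(sample, missing_rate, rng):
    """Replace y by None for a uniformly chosen share of the sampled units."""
    n_missing = round(len(sample) * missing_rate)
    missing_idx = set(rng.sample(range(len(sample)), n_missing))
    return [(x, None if i in missing_idx else y)
            for i, (x, y) in enumerate(sample)]


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
population = simulate_population(10_000, rho=0.5, rng=rng)
sample = draw_srs(population, n=200, rng=rng)
incomplete = apply_mcar(sample, missing_rate=0.10, rng=rng)
```

Because the units to blank out are chosen uniformly at random and independently of x and y, the missingness mechanism in this sketch is MCAR by construction.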

Methods
In this section, we describe the methods applied to impute the original incomplete data set, including the proposed WRSMRI method. The subsequent subsections are organized as follows. First, several general considerations are made to explain how the imputation methods have been implemented. Then, the five imputation techniques applied are described: MI, RI, RSMI, RSRI and ARSMRI. Finally, the WRSMRI method for imputing missing values is described, together with the statistical measures commonly used to evaluate the accuracy of the methods.

Regimes Switching Model
Hamilton (2005) considers a dramatic change in the behavior of a single variable $y_t$. Suppose that the typical historical behavior could be described with a first-order autoregression (Equation 1):

$$y_t = c_1 + \phi y_{t-1} + \varepsilon_t \qquad (1)$$

with $\varepsilon_t \sim N(0, \sigma^2)$, which seemed to adequately describe the observed data for $t = 1, 2, \dots, t_0$. Suppose that at date $t_0$ there was a significant change in the average level of the series, so that we would instead wish to describe the data according to Equation 2:

$$y_t = c_2 + \phi y_{t-1} + \varepsilon_t \qquad (2)$$

for $t = t_0 + 1, t_0 + 2, \dots$. This fix of changing the value of the intercept from $c_1$ to $c_2$ might help the model to get back on track with better forecasts, but it is rather unsatisfactory as a probability law that could have generated the data. We surely would not want to maintain that the change from $c_1$ to $c_2$ at date $t_0$ was a deterministic event that anyone would have been able to predict with certainty looking ahead from date $t = 1$. Instead there must have been some imperfectly predictable forces that produced the change. Hence, rather than claim that expression (1) governed the data up to date $t_0$ and (2) after that date, what we must have in mind is that there is some larger model encompassing them both (Equation 3):

$$y_t = c_{s_t} + \phi y_{t-1} + \varepsilon_t \qquad (3)$$

where $s_t$ is a random variable that, as a result of institutional changes, assumes the value $s_t = 1$ for $t = 1, 2, \dots, t_0$ and $s_t = 2$ for $t = t_0 + 1, t_0 + 2, \dots$ A complete description of the probability law governing the observed data would then require a probabilistic model of what caused the change from $s_t = 1$ to $s_t = 2$. The simplest such specification is that $s_t$ is the realization of a two-state Markov chain with (Equation 4):

$$\Pr(s_t = j \mid s_{t-1} = i, s_{t-2} = k, \dots, y_{t-1}, y_{t-2}, \dots) = \Pr(s_t = j \mid s_{t-1} = i) = p_{ij} \qquad (4)$$

Assuming that we do not observe $s_t$ directly, but only infer its operation through the observed behavior of $y_t$, the parameters necessary to fully describe the probability law governing $y_t$ are then the variance of the Gaussian innovation $\sigma^2$, the autoregressive coefficient $\phi$, the two intercepts $c_1$ and $c_2$ and the two state transition probabilities, $p_{11}$ and $p_{22}$.
The specification in (4) assumes that the probability of a change in regime depends on the past only through the value of the most recent regime, though nothing in the approach described here precludes looking at more general probabilistic specifications. But the simple time-invariant Markov chain (4) seems the natural starting point and is clearly preferable to acting as if the shift from $c_1$ to $c_2$ had been a deterministic event. Permanence of the shift would be represented by $p_{22} = 1$, though the Markov formulation invites the more general possibility that $p_{22} < 1$. Certainly in the case of business cycles or financial crises, we know that the situation, though dramatic, is not permanent. Furthermore, if the regime change reflects a fundamental change in monetary or fiscal policy, the prudent assumption would seem to be to allow the possibility for it to change back again, suggesting that $p_{22} < 1$ is often a more natural formulation for thinking about changes in regime than $p_{22} = 1$.
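To make the regime-switching specification concrete, the following minimal sketch simulates a first-order autoregression whose intercept depends on a two-state Markov chain with staying probabilities p11 and p22. The function name, the starting values and the parameter values in the usage lines are illustrative assumptions, not taken from the paper.

```python
import random


def simulate_regime_switching_ar1(n, c1, c2, phi, sigma, p11, p22, rng):
    """Simulate y_t = c_{s_t} + phi * y_{t-1} + eps_t, eps_t ~ N(0, sigma^2).

    s_t follows a two-state Markov chain: it stays in state 1 with
    probability p11 and in state 2 with probability p22.
    """
    intercept = {1: c1, 2: c2}
    stay_prob = {1: p11, 2: p22}
    state, y_prev = 1, 0.0  # ASSUMPTION: start in regime 1 with y_0 = 0
    states, series = [], []
    for _ in range(n):
        # One Markov transition is applied before each observation is drawn.
        if rng.random() >= stay_prob[state]:
            state = 2 if state == 1 else 1
        y = intercept[state] + phi * y_prev + rng.gauss(0.0, sigma)
        states.append(state)
        series.append(y)
        y_prev = y
    return states, series


# Recurrent regimes: both staying probabilities are below 1.
states, series = simulate_regime_switching_ar1(
    300, c1=0.0, c2=3.0, phi=0.5, sigma=1.0, p11=0.95, p22=0.90,
    rng=random.Random(1))

# Permanent shift: with p22 = 1 the chain never leaves regime 2 once it enters.
permanent_states, _ = simulate_regime_switching_ar1(
    50, c1=0.0, c2=3.0, phi=0.5, sigma=1.0, p11=0.0, p22=1.0,
    rng=random.Random(1))
```

Setting p22 = 1 reproduces the permanent shift discussed in the text, while p22 < 1 allows the series to return to the first regime.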

Mean Imputation (MI)
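In the general approach to mean imputation, the mean of the observed (non-missing) values of a variable is used to fill in all of its missing values (Equation 5):

$$\hat{y}_j = \bar{y}_r = \frac{1}{r} \sum_{i=1}^{r} y_i; \quad j = r+1, r+2, \dots, n \qquad (5)$$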
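As a small illustration of mean imputation, the sketch below fills every missing entry with the mean of the observed entries. Representing a missing value by None and the function name mean_impute are choices made for this example only.

```python
def mean_impute(values):
    """Replace each None by the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [mean if v is None else v for v in values]


completed = mean_impute([1.0, None, 3.0, None])
```

Every missing unit receives the same value, which is why the variance of the completed variable tends to be understated by this method.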

Regime Switching Mean Imputation (RSMI)
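The mean of the observed values within each group is used to fill in the missing values of all observations in that group (Equation 7):

$$\hat{y}_{j s_t} = \frac{1}{r_{s_t}} \sum_{i=1}^{r_{s_t}} y_{i s_t}; \quad s_t = 1, 2, \dots, k; \quad j = r+1, r+2, \dots, n \qquad (7)$$

where $r_{s_t}$ is the number of observed values in group $s_t$.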
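The group-wise version of mean imputation can be sketched as below. The paper does not say here how units are assigned to regimes, so the sketch simply assumes that a regime label is available for every unit, and that each regime contains at least one observed value. The function name is hypothetical.

```python
from collections import defaultdict


def regime_mean_impute(groups, values):
    """Replace each None by the mean of the observed values in its own group.

    ASSUMPTION: `groups` holds a known regime label for every unit.
    """
    observed = defaultdict(list)
    for g, v in zip(groups, values):
        if v is not None:
            observed[g].append(v)
    group_mean = {g: sum(vs) / len(vs) for g, vs in observed.items()}
    return [group_mean[g] if v is None else v
            for g, v in zip(groups, values)]


rsmi = regime_mean_impute(
    groups=[1, 1, 1, 2, 2, 2],
    values=[1.0, 3.0, None, 10.0, None, 20.0])
```

In this toy input the missing unit in group 1 receives 2.0 and the one in group 2 receives 15.0, whereas plain mean imputation would give both the overall mean of 8.5.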

Regime Switching Regression Imputation (RSRI)
The complete data in each group, $(y_{1 s_t}, x_{1 s_t}), (y_{2 s_t}, x_{2 s_t}), \dots, (y_{r s_t}, x_{r s_t})$, are used to construct a regression equation for imputing the missing data in that group (Equation 8).
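A minimal sketch of group-wise regression imputation follows. It assumes a simple linear regression of y on a single fully observed x, fitted by ordinary least squares within each regime; the exact regression form used in the paper is not recoverable from the text, so this form, the known regime labels and the function names are assumptions.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes x is not constant within the group
    return mean_y - slope * mean_x, slope


def regime_regression_impute(groups, xs, ys):
    """Replace each missing y by the prediction of its own group's fitted line."""
    complete = defaultdict(list)
    for g, x, y in zip(groups, xs, ys):
        if y is not None:
            complete[g].append((x, y))
    coef = {}
    for g, pairs in complete.items():
        gx, gy = zip(*pairs)
        coef[g] = fit_line(gx, gy)
    imputed = []
    for g, x, y in zip(groups, xs, ys):
        if y is None:
            b0, b1 = coef[g]
            imputed.append(b0 + b1 * x)
        else:
            imputed.append(y)
    return imputed


rsri = regime_regression_impute(
    groups=[1, 1, 1, 2, 2, 2],
    xs=[1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    ys=[2.0, 4.0, None, 5.0, 4.0, None])
```

Each regime gets its own intercept and slope, so the two missing units in the toy input are predicted from different lines even though they share the same x value.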

Average of Regime Switching Mean and Regression Imputation (ARSMRI)
ARSMRI uses the average of the values imputed by (7) and (8).

Weighted of Regime Switching Mean and Regression (WRSMRI)
WRSMRI uses a weighted combination of the values imputed by (7) and (8), where the weight $w$ is defined in terms of variances $\mathrm{Var}(y)$.
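The two combined methods can be sketched with a single helper that mixes the mean-based and regression-based imputed values. With w = 0.5 it gives the plain average used by ARSMRI. For the weighted case, the exact definition of the weight is not recoverable from the text, so the inverse-variance weight below is an assumption chosen only for illustration, and both function names are hypothetical.

```python
def combine_imputations(mean_based, regression_based, w):
    """Return w * mean_based + (1 - w) * regression_based, element by element."""
    return [w * m + (1.0 - w) * r
            for m, r in zip(mean_based, regression_based)]


def inverse_variance_weight(var_mean_based, var_regression_based):
    """Weight placed on the mean-based value.

    ASSUMPTION: inverse-variance weighting, so the estimate with the smaller
    variance gets the larger weight. This is not confirmed as the paper's
    definition of w.
    """
    return var_regression_based / (var_mean_based + var_regression_based)


# Plain average of the two imputed values (the ARSMRI case).
averaged = combine_imputations([2.0, 4.0], [4.0, 8.0], w=0.5)

# Weighted combination under the assumed inverse-variance weight.
w = inverse_variance_weight(var_mean_based=1.0, var_regression_based=3.0)
weighted = combine_imputations([2.0, 4.0], [4.0, 8.0], w=w)
```

Keeping the weight as an explicit argument makes it easy to substitute whichever definition of w is actually intended.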

Model Evaluation
The accuracy of the missing data imputation methods is evaluated by the Mean Square Error (MSE), the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE) and the power of the test. To evaluate more precisely the differences in accuracy among the imputation methods, the mean square errors of the estimated sample mean, sample variance and sample correlation were also evaluated.
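The three error measures can be computed as in the sketch below, which uses their standard textbook definitions. It assumes that the comparison is made between the true values that were deleted and the values imputed in their place, that MAPE is expressed in percent, and that no true value is zero. The function names are illustrative.

```python
def mse(true_values, imputed_values):
    """Mean square error between true and imputed values."""
    n = len(true_values)
    return sum((t - p) ** 2 for t, p in zip(true_values, imputed_values)) / n


def mae(true_values, imputed_values):
    """Mean absolute error between true and imputed values."""
    n = len(true_values)
    return sum(abs(t - p) for t, p in zip(true_values, imputed_values)) / n


def mape(true_values, imputed_values):
    """Mean absolute percentage error, in percent (true values must be nonzero)."""
    n = len(true_values)
    return 100.0 * sum(abs((t - p) / t)
                       for t, p in zip(true_values, imputed_values)) / n


true_y = [2.0, 4.0]
imputed_y = [1.0, 5.0]
scores = (mse(true_y, imputed_y), mae(true_y, imputed_y), mape(true_y, imputed_y))
```

MSE penalizes large individual errors more heavily than MAE, which is one reason the two criteria can rank imputation methods differently.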

RESULTS
The missing data imputation methods MI, RI, RSMI, RSRI, ARSMRI and WRSMRI were applied to impute the missing data. The goal was to analyze the improvements in accuracy when different algorithms were applied to impute missing data values. Tables 1-3 report the average MSE, MAE and MAPE of the imputation methods classified by sample size, correlation level and percentage of missing data, respectively. Table 4 reports the overall average MSE, MAE and MAPE of the imputation methods. Table 5 reports the average power of the test of the imputation methods. Tables 6-8 report the average MSE, MAE and MAPE of the sample mean, variance and correlation of the imputation methods classified by sample size, correlation level and percentage of missing data, respectively. Table 9 reports the overall average MSE, MAE and MAPE of the sample mean, variance and correlation of the imputation methods.
In terms of MSE, WRSMRI performed best overall and at sample sizes 100 and 200, at low and high correlation and at 5, 15 and 20% missing data. RSRI performed best at sample size 500, at moderate correlation and at 10% missing data. In terms of MAE and MAPE, RSRI performed best. In terms of power of the test, RSRI also performed best.
In terms of the estimated sample mean, variance and correlation, WRSMRI performed best when estimating the sample correlation and RSRI performed best when estimating the sample mean and variance.

Table 6. Average MSE of sample mean, variance and correlation of imputation methods classified by sample sizes

CONCLUSION
We applied six imputation methods to treat the problem of missing data. We reviewed and provided technical details of the methods used, namely MI, RI, RSMI, RSRI, ARSMRI and WRSMRI.
As shown in Tables 1-9, all imputation methods led to an improvement in prediction accuracy, as measured by MSE, MAE, MAPE, power of the test and the MSE of the estimated sample mean, variance and correlation. WRSMRI gave the best imputation in terms of MSE and the estimated sample correlation, whereas RSRI gave the best results in terms of MAE, MAPE, power of the test and the estimated sample mean and variance.