PARTIAL LEAST SQUARES REGRESSION BASED VARIABLES SELECTION FOR WATER LEVEL PREDICTIONS

Floods are a common phenomenon in the Kuala Krai district of Kelantan, Malaysia. Every year, floods affect the biodiversity of the region and cause property loss in its residential areas. Residents of Kelantan suffer from floods because water overflows into the areas adjoining rivers, lakes or dams. Months, average monthly rainfall, temperature, relative humidity and surface wind were used as predictors, while the water level of the Galas River was used as the response. Selecting suitable predictor variables is an important issue in developing a prediction model, since the data include many variables from the meteorological and hydrological departments. In this study, we use K-fold Cross-Validation (CV) to select the important variables for water level prediction. A suitable prediction model is needed to forecast the water level in the Galas River, and we adopt Ordinary Linear Regression (OLR) and Partial Least Squares Regression (PLSR) for this purpose. Because the original data contain missing values, the datasets must be pre-processed first. We apply two types of pre-processing to the missing data: replacement by the mean of the corresponding months (type I pre-processing) and replacement by OLR predictions (type II pre-processing). Based on the experiment, PLSR is a more suitable model than OLR for predicting the water level in the Galas River, and type I pre-processing gives higher accuracy than type II pre-processing.


INTRODUCTION
Floods are a common phenomenon and can be defined as the presence of excess water in a place that is normally dry. Floods are often cited as the most lethal of all natural disasters (Noji, 1997; Alexander, 1993; Jonkman and Kelman, 2005). The flooding of Malaysian rivers is mainly due to the high amount of rainfall in the river basins, because the climate is strongly influenced by the monsoon winds. The worst flood in Malaysia was recorded in 1926 and has been described as having caused the most extensive damage to the natural environment. Subsequent major floods were recorded in 1931, 1947, 1954, 1957, 1967 and 1971. Most states in Malaysia suffer from floods during the monsoon season, especially Kedah, Kelantan, Terengganu, Pahang and Johor. Kelantan, a state on the east coast of Peninsular Malaysia, experiences flooding every year during the northeast monsoon period.
Floods affect many engineering structures such as bridges, embankments, tanks and reservoirs, and they significantly disrupt or interfere with human and societal activity. Kuala Krai is one of the districts in Kelantan that is regularly affected by floods. Flooding in the Kuala Krai district is caused by a combination of physical factors, such as elevation and close proximity to the sea, together with the heavy rainfall experienced during the monsoon period. The severe floods all over Kelantan result from heavy rainfall during the northeast monsoon season, especially in November and December. In order to facilitate flood prediction for the river and to allow advance warning, this study aims to build a model of the relation between the selected predictors and the water level of the Galas River by adopting OLR and PLSR.

Science Publications AJAS
Kelantan state has more than 25 rivers and seven major river basins: the Galas, Kelantan, Golok, Semerak, Pengkalan Chepa, Pengkalan Datu and Kemasin river basins. The Kelantan River Basin is the biggest river basin in Kelantan; it drains a catchment area of about 12,000 km² in north-east Malaysia, including part of the National Park, and flows northwards into the South China Sea (Rohasliney, 2010). The Kelantan River is about 248 km long and its basin occupies more than 85% of the State of Kelantan. It divides into the Galas and Lebir Rivers near Kuala Krai, about 100 km from the river mouth, which means that the Kelantan River is the main river while the Galas and Lebir Rivers are its tributaries. In this study, we focus on one main tributary of the Kelantan River, the Galas River in Kuala Krai, Kelantan. Figure 1 shows the location of the study area.
The data for this analysis were collected from the Water Resources Management and Hydrology Division and the Malaysian Meteorological Department. The original data contain missing values. Missing data are a common data quality issue, and most real datasets contain them. There are four types of serious data quality problems in real datasets: incomplete, redundant, inconsistent and noisy data. In our case, the data are incomplete, with values missing in certain months. Because of the missing data, the two methods cannot be applied directly for water level prediction; therefore, the dataset must be pre-processed first. Five factors were identified as related to the level of the Galas River, which can lead to flooding in Kuala Krai, Kelantan: (1) months from January to December for the 11 years from 2001 to 2011, (2) monthly mean rainfall, (3) monthly mean temperature, (4) monthly mean relative humidity and (5) monthly mean surface wind.

Different approaches to water level prediction can be found in the hydrology literature. The most common approaches are stepwise regression (Zou et al., 2010), Artificial Neural Networks (ANN) (Bustami et al., 2007) and ANN combined with PLSR (Shu et al., 2008). Previous research on flood monitoring used GPS data to monitor the severe flood in Kuala Krai, Kelantan in order to detect the influence of heavy rainfall on severe floods (Suparta et al., 2012). Another study reviewed groundwater flow systems in relation to climate change in order to recommend more economical and environmentally sound solutions for managing flood water (Carrillo-Rivera and Cardona, 2012). A number of papers have addressed variable selection, such as N-PLSR as an empirical downscaling tool in climate change studies (Bergant and Kajfez-Bogataj, 2005) and the application of PLSR as a downscaling tool for Pichola lake in India (Goyal and Ojha, 2010). PLSR has been most successful in chemometrics, since its origin lies in chemistry. It is useful for constructing predictive models when the factors are many and highly collinear. In this study, we apply this method to variable selection in order to develop water level models. The following sections present an approach to the development of the water level models. Materials and methods are discussed in Section 2, while the results are described in Section 3. The discussion is reported in Section 4 and the conclusion is given in Section 5.

MATERIALS AND METHODS
The linear regression model is given in Equation (1) (Mevik and Cederkvist, 2004):

$$y = \beta_0 1_n + X\beta_1 + \varepsilon \qquad (1)$$

where $y$ is an $n \times 1$ vector of observations on the response variable, $X$ is an $n \times p$ matrix consisting of $n$ observations on $p$ predictors, $\beta_0$ is an unknown constant, $\beta_1$ is a $p \times 1$ vector of unknown regression coefficients, $1_n$ is an $n \times 1$ vector of ones and $\varepsilon$ is an $n \times 1$ vector of errors, independently and identically distributed with mean zero and variance $\sigma^2 > 0$.

Ordinary Linear Regression
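OLR is often used to fit models to observations; it works by minimizing the sum of the squared residuals between the predicted and actual responses. When the matrix $X_1 = [1_n \; X]$ has full column rank, the OLR estimator of $\beta = [\beta_0, \beta_1^T]^T$ is given in Equation (2):

$$\hat{\beta}_{OLR} = (X_1^T X_1)^{-1} X_1^T y \qquad (2)$$

The prediction of $y$ is given in Equation (3):

$$\hat{y}_{OLR} = X_1 \hat{\beta}_{OLR} \qquad (3)$$

The model for OLR can be represented by Equation (4):

$$\hat{y}(x) = \hat{\beta}_0 + \hat{\beta}_1^T x \qquad (4)$$

where $x = [x_1, x_2, \ldots, x_p]^T \in R^p$.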
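The least-squares fit described above can be sketched in a few lines of NumPy. This is only an illustration: the function names `fit_olr` and `predict_olr` and the toy data are assumptions made for the example, not part of the study, and `numpy.linalg.lstsq` is used in place of an explicit matrix inverse for numerical stability.

```python
import numpy as np

def fit_olr(X, y):
    """Least-squares estimate of the intercept b0 and slopes b1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Augment X with a column of ones: X1 = [1_n  X]
    X1 = np.column_stack([np.ones(len(y)), X])
    # Minimize ||y - X1 b||^2
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[0], coef[1:]

def predict_olr(b0, b1, X):
    """Predicted response b0 + X b1 for new rows of X."""
    return b0 + np.asarray(X, dtype=float) @ b1

# Synthetic, noise-free toy data (hypothetical, for illustration only)
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y_demo = 1.5 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1]
b0_demo, b1_demo = fit_olr(X_demo, y_demo)
y_hat_demo = predict_olr(b0_demo, b1_demo, X_demo)
```

Because the toy response is an exact linear function of the two columns, the fitted coefficients recover the generating values up to rounding error.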

Partial Least Squares Regression
Partial Least Squares (PLS) has proven to be an effective approach to problems in chemometrics, such as predicting the bioactivity of molecules to facilitate the discovery of novel pharmaceuticals. The PLS approach was originated around 1975 by Herman Wold for modeling complicated datasets in terms of blocks of matrices, called path models (Joreskog, 1982). The PLS method was introduced in the chemical literature as an algorithm, and only recently have its numerical and statistical properties become more apparent (Stone, 1974). PLSR is a technique for modeling a linear relationship between a set of output variables (responses) of dimension $L$ and a set of $p$ input variables (regressors) (Rosipal and Trejo, 2002). As a first step in performing PLSR, the data matrices $X$ and $y$ in this analysis are assumed to be centered.
In this study, we use only a one-dimensional response, that is, $L = 1$. PLS models the relations between sets of observed variables by means of latent variables, which are linear combinations of the original regressors that retain most of the information in the input variables. PLS is useful when the number of explanatory variables exceeds the number of observations and when a high level of multicollinearity among those variables is expected. The weights used to determine the linear combinations of the original regressors are proportional to the covariance between the input and output variables (Helland, 1988).

Partial Least Squares Regression Using SIMPLS Algorithm
The SIMPLS algorithm was used to compute the regression coefficients of the model for predicting the water level of the Galas River. SIMPLS-type algorithms work very well: they are resistant to overfitting, fast, easy to implement and simple to tune (Bennett and Embrechts, 2003). In the PLSR approach, we need to obtain the PLSR estimator, say $\hat{\beta}_{PLSR}$. The procedure starts by computing the cross-product of $X$ and $y$ (Jong, 1993; Ibrahim and Wibowo, 2012), as shown in Equation (5):

$$S = X^T y \qquad (5)$$

The iteration then runs over the latent variables $a = 1, \ldots, A$, where $A$ is determined in advance and $1 \le A \le p$. The SIMPLS algorithm is as follows. For $a = 1$ to $A$:

• If $a = 1$, compute the singular value decomposition (svd) of $S$: $[u, s, v] = svd(S)$. Otherwise, if $a > 1$, compute the svd of $S - P(P^T P)^{-1} P^T S$
• Get the weights $r$, the first left singular vector: $r = u(:, 1)$
• Compute the scores: $t = Xr$
• Compute the loadings: $p = X^T t/(t^T t)$
• Store the vectors $r$, $t$ and $p$ as columns of $R$, $T$ and $P$, respectively

The last step computes the regression coefficients, as shown in Equation (6):

$$\hat{\beta}_{PLSR} = R (T^T T)^{-1} T^T y \qquad (6)$$

The PLSR estimate of the response is then given in Equation (7):

$$\hat{y}_{PLSR} = X \hat{\beta}_{PLSR} \qquad (7)$$

The model for PLSR can be represented by Equation (8):

$$\hat{y}(x) = \bar{y} + \sum_{j=1}^{p} \hat{\beta}_j (x_j - \bar{x}_j) \qquad (8)$$

where $\bar{y}$ is the mean of the responses $y_i$, $\bar{x}_j$ is the mean of the observed values of predictor $x_j$ and $\hat{\beta}_j$ is the $j$th element of $\hat{\beta}_{PLSR}$.
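As a rough illustration of the SIMPLS steps for a single response, the following NumPy sketch centers the data, deflates the cross-product by projecting out the stored loadings, and assembles the coefficients from the stored weights and scores. The function names `simpls_fit` and `simpls_predict` are hypothetical, and the deflation by projection onto the span of the loadings is the standard SIMPLS form, assumed here rather than taken from the authors' code.

```python
import numpy as np

def simpls_fit(X, y, n_comp):
    """SIMPLS-style PLS regression for one response; returns (b, x_mean, y_mean)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # center X and y first
    S = Xc.T @ yc.reshape(-1, 1)             # cross-product, p x 1
    R, T, P = [], [], []
    for a in range(n_comp):
        if a == 0:
            S_a = S
        else:
            # Remove from S its projection onto the span of the stored loadings
            Pm = np.column_stack(P)
            S_a = S - Pm @ np.linalg.solve(Pm.T @ Pm, Pm.T @ S)
        u, _, _ = np.linalg.svd(S_a, full_matrices=False)
        r = u[:, 0]                          # weights: first left singular vector
        t = Xc @ r                           # scores
        p = Xc.T @ t / (t @ t)               # loadings
        R.append(r)
        T.append(t)
        P.append(p)
    Rm, Tm = np.column_stack(R), np.column_stack(T)
    b = Rm @ np.linalg.solve(Tm.T @ Tm, Tm.T @ yc)   # regression coefficients
    return b, x_mean, y_mean

def simpls_predict(b, x_mean, y_mean, X_new):
    """Prediction on the original scale: y_mean + (x - x_mean) b."""
    return y_mean + (np.asarray(X_new, dtype=float) - x_mean) @ b
```

A useful sanity check on such a sketch is that, when the number of components equals the number of predictors and the predictors are not collinear, the PLSR coefficients coincide with the least-squares coefficients.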

Evaluating the Quality of the Prediction
The quality of the prediction is evaluated by comparing the predicted values $\hat{y}_i$, obtained with $A$ latent variables, with the observed values $y_i$ (Helland, 1988; Ibrahim and Wibowo, 2012). CV is used to estimate the predictive capacity: the data are separated into a training set used to build the model and a testing set used to test it. CV is applied in three cases: performance estimation, model selection and tuning of learning model parameters. In this study, CV is used for predictor selection and model selection in predicting the water level of the Galas River. CV is a statistical method that evaluates algorithms by dividing the data into two segments, one for training and one for validation; its basic form is K-fold CV. The idea of CV originated in the 1930s (Larson, 1931; Refaeilzadeh et al., 2008; Ibrahim and Wibowo, 2012). In the 1970s, CV was employed as a means of choosing proper model parameters, as opposed to purely estimating model performance (Geisser, 1975; Sjöström et al., 1983; Ibrahim and Wibowo, 2012).
Stratified 10-fold CV has been recommended as the best model selection method, since it tends to provide a less biased estimate of accuracy than regular cross-validation, leave-one-out CV and bootstrap methods (Refaeilzadeh et al., 2008; Ibrahim and Wibowo, 2012). For this analysis, we used 10-fold CV because it gives accurate performance estimates and is suitable for small samples. We used it to choose an appropriate model between the normalized original data and the cleaned data by comparing the Mean Squared Error of Cross-Validation (MSECV) of OLR and PLSR. The data are divided into K segments of roughly equal size, and the inner sum of the MSECV is taken over the observations in the kth segment (Davison and Hinkley, 1997; Mevik and Cederkvist, 2004; Ibrahim and Wibowo, 2012). In each of the K experiments, K-fold CV uses K-1 folds for training and the remaining fold for testing. An advantage of K-fold CV is that every example in the dataset is eventually used for both training and testing. To obtain the MSECV, we used the Matlab function 'crossval', which returns a scalar containing a 10-fold CV estimate of the mean squared error. We select the model with the lowest MSECV, which measures how well the model predicts observations that were held out of the fit.
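The K-fold procedure can be sketched as follows. This is a plain NumPy illustration, not a reproduction of Matlab's 'crossval': the function name `kfold_msecv`, the random shuffling with a fixed seed, and the `fit`/`predict` callback interface are all assumptions made for the example. The squared errors of every held-out fold are summed and divided by the total number of observations.

```python
import numpy as np

def kfold_msecv(X, y, fit, predict, k=10, seed=0):
    """Mean squared error of K-fold cross-validation.

    fit(X_train, y_train) returns a model object;
    predict(model, X_test) returns predictions for X_test.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Shuffle the row indices, then cut them into k folds of roughly equal size
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    sq_err = 0.0
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False         # k-1 folds train, one fold tests
        model = fit(X[train_mask], y[train_mask])
        resid = y[test_idx] - predict(model, X[test_idx])
        sq_err += float(resid @ resid)       # inner sum over the held-out fold
    return sq_err / n

def fit_ols(X, y):
    """Least-squares coefficients with an intercept in position 0."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]
```

Passing a different `fit`/`predict` pair (for example a PLSR fit with a fixed number of components) lets the same routine compare models on identical folds, which is the kind of comparison the MSECV criterion relies on.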

Data
Months ($x_1$), average monthly rainfall ($x_2$), temperature ($x_3$), relative humidity ($x_4$) and surface wind ($x_5$) were identified as predictors of the water level of the Galas River and as factors related to the occurrence of floods in Kuala Krai, Kelantan. Observed predictors and responses for the period 2001-2011 were obtained from the Water Resources Management and Hydrology Division in Kuala Lumpur and the Malaysian Meteorological Department in Selangor. Variable selection is performed to choose suitable predictors of the water level based on the MSECV of OLR and PLSR. The data contain missing values for rainfall and water level, and we performed data cleaning to replace them. The data are separated into two subsets: 120 observations for model development and variable selection using 10-fold CV, and 12 observations for validating the models. The data used in this analysis are shown in Table 1.

Original Data
The data set covers January to December for 11 years, giving a total of 132 monthly observations. Tables 1-4 describe the predictors and response used over the training period in predicting the water level of the Galas River. The first to sixth columns represent months, rainfall, temperature, relative humidity, surface wind and water level, respectively. The raw data are shown; the 47th month is November 2004, and the NA values indicate missing rainfall values in November and December 2004.

Pre-Processing Data
Data pre-processing is performed on the original data to prepare them for the subsequent processing steps. It transforms the data into a format that is more effective for the purpose of the analysis. Pre-processing is important because real-world data are normally noisy, containing errors and outliers. Data pre-processing comprises five tasks: data cleaning, data integration, data transformation, data reduction and data discretization. For this analysis, we performed two types of data cleaning to replace the missing values of rainfall and water level: the mean of the corresponding months over the 11 years, and OLR.

Pre-processing Data Using Mean of the Corresponding Months
In this subsection, the missing values are replaced by the mean of the corresponding months, which we call type I pre-processing. For example, the NA values of rainfall in November and December 2004 for the Galas River are replaced by the means of the November and December rainfall, respectively, over the 11 years. Table 4 presents a snapshot of the data pre-processed using the mean of the corresponding months for the Galas River.
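A minimal sketch of this kind of imputation is given below. It assumes that missing entries are coded as NaN and that the monthly mean is taken over the years in which the value was actually observed; the function name `impute_monthly_mean` and the toy rainfall values are hypothetical.

```python
import numpy as np

def impute_monthly_mean(values, months):
    """Replace NaN entries by the mean of the observed values of the same calendar month."""
    values = np.array(values, dtype=float)   # copy, so the input is left untouched
    months = np.asarray(months)
    for m in np.unique(months):
        in_month = months == m
        missing = in_month & np.isnan(values)
        if missing.any():
            # Mean over the years in which this month was observed
            values[missing] = np.nanmean(values[in_month])
    return values

# Hypothetical rainfall for November (11) and December (12) of three years
rain_demo = [10.0, 20.0, 30.0, np.nan, 50.0, 60.0]
month_demo = [11, 12, 11, 12, 11, 12]
rain_filled = impute_monthly_mean(rain_demo, month_demo)
```

In the toy series, the missing December value is replaced by the average of the two observed December values, while all observed entries are left unchanged.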

Pre-processing Data Using Ordinary Linear Regression
The second type of data cleaning uses OLR, which we call type II pre-processing. We fitted OLR models to replace the missing values in the Galas River dataset. Table 3 shows a snapshot of the data pre-processed using OLR for the Galas River.
The model used to replace the missing values of the water level for the Galas River is given in Equation (9), and the model used to replace the missing values of rainfall is given in Equation (10).
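Regression-based imputation can be sketched as follows: a least-squares model is fitted on the rows where the target variable is observed and then used to predict the rows where it is missing. Which variables enter the authors' imputation models is not stated here, so the choice of predictors, the function name `impute_by_regression` and the toy data are assumptions for illustration only.

```python
import numpy as np

def impute_by_regression(target, predictors):
    """Fill NaN entries of target with least-squares predictions from predictors."""
    target = np.array(target, dtype=float)   # copy, so the input is left untouched
    Z = np.column_stack([np.ones(len(target)),
                         np.asarray(predictors, dtype=float)])
    missing = np.isnan(target)
    # Fit only on the rows where the target is observed
    coef, *_ = np.linalg.lstsq(Z[~missing], target[~missing], rcond=None)
    # Predict the missing rows from their (complete) predictor values
    target[missing] = Z[missing] @ coef
    return target

# Hypothetical example: the target follows 1 + 2x exactly, one value is missing
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
level_demo = np.array([3.0, 5.0, np.nan, 9.0, 11.0])
level_filled = impute_by_regression(level_demo, x_demo)
```

The sketch assumes the predictor columns themselves are complete; rows with missing predictors would need separate handling.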

Selection of Predictors
The selection of appropriate predictors is one of the most important steps in predicting the water level of the Galas River. The predictors are chosen based on the smallest MSECV, and the results are compared between type I and type II pre-processing. Table 5 shows that the five predictor variables, namely months ($x_1$), average monthly rainfall ($x_2$), temperature ($x_3$), relative humidity ($x_4$) and surface wind ($x_5$), attain their lowest MSECV with type I pre-processing when ncomp equals five. Hence, these variables are used in the water level predictions.
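One way to make selection by smallest MSECV concrete is an exhaustive search over predictor subsets, scoring each subset with K-fold CV on the same folds. The sketch below uses a least-squares fit for brevity; the search strategy, the function names `cv_mse_ols` and `select_predictors`, and the fixed seed are assumptions for illustration, and the procedure used in the study may differ.

```python
from itertools import combinations
import numpy as np

def cv_mse_ols(X, y, folds):
    """Cross-validated mean squared error of a least-squares fit on given folds."""
    n = len(y)
    sq_err = 0.0
    for test_idx in folds:
        train = np.setdiff1d(np.arange(n), test_idx)
        X1 = np.column_stack([np.ones(len(train)), X[train]])
        coef = np.linalg.lstsq(X1, y[train], rcond=None)[0]
        resid = y[test_idx] - (coef[0] + X[test_idx] @ coef[1:])
        sq_err += float(resid @ resid)
    return sq_err / n

def select_predictors(X, y, k=10, seed=0):
    """Score every non-empty subset of columns; return the best subset and all scores."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # The same folds are reused for every subset so that the scores are comparable
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    scores = {}
    for size in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), size):
            scores[cols] = cv_mse_ols(X[:, list(cols)], y, folds)
    best = min(scores, key=scores.get)       # subset with the smallest MSECV
    return best, scores
```

With five candidate predictors this search evaluates 31 subsets, which is cheap; for many predictors an exhaustive search becomes impractical and a stepwise strategy would be needed instead.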

Models Development
The models for water level predictions of Galas River were developed using OLR and PLSR. The results were compared between these two approaches and between two types of pre-processing data.

Ordinary Linear Regression
OLR is performed in this experiment to build the model for the water level in the Galas River. This subsection presents the results of the experiment, namely the prediction models for the water level over the training period based on the two types of pre-processing. The prediction model for the water level of the Galas River using type I pre-processing is given by Equation (11), and the model using type II pre-processing is given by Equation (12).

Partial Least Squares Regression
PLSR is the other method used in this experiment to obtain the prediction model, and the results of the two methods are compared between the original data and the cleaned data. Cross-validation is used to choose the number of PLS components, and the model with the lowest MSECV is considered optimal. The prediction model for the water level of the Galas River using type I pre-processing is given by Equation (13), and the model using type II pre-processing is given by Equation (14).

Model Selection
In this study, we restrict ourselves to the common variant of CV called K-fold CV, in which the calibration objects are divided into K segments; for this experiment we use K = 10 (Breiman, 1984; Wiklund et al., 2007; Ibrahim and Wibowo, 2012). Provided that K-fold CV correctly finds the range of suitable numbers of components, the actual number selected is immaterial as long as the prediction error is close to its minimum (Wiklund et al., 2007; Ibrahim and Wibowo, 2012). We used 10-fold CV to obtain the appropriate model for predicting the water level of the Galas River at Kuala Krai using the two types of pre-processing. The data were analyzed using OLR and PLSR, and the results for the two types of pre-processing were compared to find the model with the lowest MSECV. Table 6 compares the MSECV for the water level in the Galas River using 10-fold CV of OLR and PLSR. The results show that PLSR with type I pre-processing and ncomp equal to 5 has the smallest MSECV. Therefore, this PLSR model is considered the best model. Figure 2 compares the actual and predicted monthly water levels for the Galas River on the 2011 test data using type I pre-processing, and Fig. 3 presents the same comparison using type II pre-processing. From these graphs, it is clear that type I pre-processing achieves closer agreement between the actual and predicted water levels than type II pre-processing.

Fig. 3. A comparison between actual and predicted monthly water level for Galas River with test data of 2011 using type II pre-processing data

CONCLUSION
In the Kuala Krai district, rising river water levels are a critical issue since they can cause floods and extensive damage. We compared two types of pre-processing, type I and type II, using the OLR and PLSR approaches for variable selection and model selection. The experiment showed that PLSR is a suitable method for variable selection and model development, since it gives higher accuracy than OLR. Our further research will focus on nonlinear methods and compare them with the PLSR model.

Zamalah scholarship for supporting her master by research program.