Hydrological Forecasting Using Hybrid Data-Driven Approach

Corresponding Author: Sungwon Kim, Department of Railroad and Civil Engineering, Dongyang University, Yeongju, 36040, South Korea. Email: swkim1968@dyu.ac.kr

Abstract: This study develops a hybrid model, EEMD-FANN, coupling a Feed-forward Artificial Neural Network (FANN) with Ensemble Empirical Mode Decomposition (EEMD) to improve the accuracy of daily river stage forecasting. The original river stage series is broken down into Intrinsic Mode Functions (IMFs) and a residue using EEMD, and a separate FANN is developed as the forecasting model for each decomposed IMF and for the residue. The final forecasted time series is produced by the ensemble aggregation of the forecasted IMFs and residue. The efficiency of the EEMD-FANN model is assessed by comparison with a single Adaptive Neuro-Fuzzy Inference System (ANFIS) and a single FANN to demonstrate the applicability of the hybrid approach in daily river stage forecasting. The EEMD-FANN model, which uses time series decomposition by EEMD and ensemble aggregation, performs better than the single ANFIS and FANN models that use the original river stage time series as inputs. The results also indicate that coupling EEMD and FANN can significantly enhance the forecasting ability of the single FANN model and can serve as an effective modeling methodology for forecasting river stage precisely.


Introduction
Forecasting river stage precisely plays an important role in hydrologic practices such as dam operation, water supply, river management, and flood and drought prevention. For decades, data-driven approaches have gained attention for estimating hydrological variables including rainfall-runoff, river discharge, soil moisture, evaporation and reservoir water level. In particular, Support Vector Machine (SVM), ANFIS and FANN have received attention as effective approaches for analyzing complicated and nonlinear hydrologic phenomena (Kisi, 2007; Kim and Kim, 2008; Othman and Naseri, 2011; Kim et al., 2012; Kaltech, 2015; Seo et al., 2015b).
Hybrid models that combine statistical approaches with data-driven models have increasingly been developed to improve the efficiency of conventional forecasting models. In particular, time series decomposition using wavelet and wavelet packet transforms is known to further improve the forecasting ability of conventional data-driven models (Amiri and Asadi, 2009; Adamowski and Sun, 2010; Gokhale and Khanduja, 2010; Kisi et al., 2011; Nourani et al., 2012; Ravikumar and Tamilselvan, 2014; Seo, 2015; Seo et al., 2015a).
Recently, nonlinear data analysis and hybrid model development based on Empirical Mode Decomposition (EMD) have been performed successfully in various fields. EMD, a self-adaptive and empirical technique for breaking down a time series, can be used to examine nonlinear and nonstationary meteorological and hydrologic data (Tang et al., 2012). EEMD is a data processing technique based on noise addition (Wu and Huang, 2009) that was devised to improve EMD. Huang et al. (2009) applied EMD and signal analysis based on the Hilbert transform (Hilbert spectral analysis) to investigate the characteristics of nonlinear river flow time series. Karthikeyan and Kumar (2013) assessed the forecasting of nonstationary time series with models based on wavelets and EMD. Kisi et al. (2014) presented a nonparametric method that combines an Artificial Neural Network (ANN) and EMD to forecast monthly river stage. Huang et al. (2014) examined a conjunction model combining modified EMD and SVM for monthly stream flow prediction. Shabri and Samsudin (2015) proposed a hybrid modeling methodology that combines the Least Square Support Vector Machine (LSSVM) with EMD to predict water demand. Wang et al. (2015) suggested a time series forecasting model combining the autoregressive integrated moving average and EEMD to predict annual runoff. Zhao and Chen (2015) proposed a hybrid model based on EEMD and an autoregressive forecasting model for annual runoff forecasting. Wang et al. (2013) presented an approach that decomposes annual rainfall series with EEMD and performs rainfall-runoff modeling with SVM. Guo et al. (2016) applied EEMD to examine the intrinsic multiscale properties of precipitation variability. Ouyang et al. (2016) proposed a hybrid modeling approach using EEMD, Support Vector Regression (SVR) and phase-space reconstruction for monthly rainfall forecasting. Xu et al. (2016) presented a conjunction model integrating EEMD, a Back-Propagation (BP)-based ANN (BPANN) and a nonlinear regression equation for annual runoff simulation.
This study presents a hybrid modeling approach, EEMD-FANN, coupling EEMD and FANN to enhance model performance in river stage forecasting. To assess the efficiency of EEMD-FANN, the model accuracy was evaluated using widely used statistical performance indexes. A comparative analysis of EEMD-FANN against the single forecasting models, ANFIS and FANN, was also carried out to assess the applicability of EEMD-FANN.

Used Data
The Andong Dam watershed, located in the upper Nakdong River basin in the southeastern inland region of South Korea, was selected as the study area for forecasting daily river stage with the EEMD-FANN model. The time series data used to develop the model were gathered from two hydrological observatories (Socheon and Dosan) in the watershed (Fig. 1) through the Water Management Information System (WAMIS), an internet-based portal that provides water resources information for South Korea. Figure 1 depicts the study area, the Andong Dam watershed, including the structure of the stream network and the locations of the hydrological observatories. The daily river stage data were collected between 2002 and 2013 (12 years). For model training and testing, they were split into two sub-periods: 2002-2010 (9 years, 75%) and 2011-2013 (3 years, 25%).

EMD
EMD, proposed by Huang et al. (1998), is a data analysis technique that breaks down a nonlinear and nonstationary signal into several components. Unlike other approaches such as wavelet and Fourier transforms, the technique is empirical and self-adaptive (Tang et al., 2012). The key concept of EMD is to break down an input data series into two kinds of components: Intrinsic Mode Functions (IMFs) and a residue (Yu et al., 2008). Based on the concept proposed by Huang et al. (1998), the IMFs should comply with the following requirements:
• Over the whole data set, the number of zero-crossings and the number of local extrema (local maxima and minima) should be equal or differ by at most one
• The mean value of the two envelopes, created by interpolating the local maxima and the local minima, respectively, should be zero at every data point
The EMD algorithm, also called the sifting procedure, follows the key steps given by Huang et al. (1998; 2003). Steps (1)-(5) of that procedure are repeated until no further IMF can be derived, that is, until R(t) has a single local extreme point or becomes monotonic.
Once the EMD process is completed, the original time series can be recovered as the sum of the IMFs, C_j(t), and the final residue, R_k(t), as in Equation 1:

$$X(t) = \sum_{j=1}^{k} C_j(t) + R_k(t) \tag{1}$$

where X(t) denotes the original time series and k denotes the number of IMFs. For detailed information on EMD, readers can refer to Huang et al. (1998; 2003).
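The sifting idea and the reconstruction property of Equation 1 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes piecewise-linear envelopes instead of the usual cubic splines, a fixed number of sifting passes instead of a convergence criterion, and hypothetical names (`local_extrema`, `envelope`, `emd`, `max_imfs`, `n_sift`).

```python
import math

def local_extrema(x):
    """Indices of interior local maxima and minima of the sequence x."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] >= x[i + 1]:
            maxima.append(i)
        elif x[i - 1] > x[i] <= x[i + 1]:
            minima.append(i)
    return maxima, minima

def envelope(x, idx):
    """Piecewise-linear envelope through (i, x[i]) for i in idx.
    Assumption: held constant before the first and after the last extremum."""
    env, k = [], 0
    for t in range(len(x)):
        if t <= idx[0]:
            env.append(x[idx[0]])
        elif t >= idx[-1]:
            env.append(x[idx[-1]])
        else:
            while idx[k + 1] < t:
                k += 1
            a, b = idx[k], idx[k + 1]
            w = (t - a) / (b - a)
            env.append((1 - w) * x[a] + w * x[b])
    return env

def emd(x, max_imfs=8, n_sift=10):
    """Return (imfs, residue) such that the IMFs plus the residue sum to x."""
    residue, imfs = list(x), []
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residue)
        if not maxima or not minima:
            break  # residue is monotonic or has a single extremum: stop
        h = list(residue)
        for _ in range(n_sift):  # fixed pass count (simplification)
            maxima, minima = local_extrema(h)
            if not maxima or not minima:
                break
            upper, lower = envelope(h, maxima), envelope(h, minima)
            # subtract the mean of the upper and lower envelopes
            h = [v - 0.5 * (u + l) for v, u, l in zip(h, upper, lower)]
        imfs.append(h)
        residue = [r - c for r, c in zip(residue, h)]
    return imfs, residue

# Synthetic test signal: two oscillations plus a linear trend
signal = [math.sin(2 * math.pi * t / 10) + 0.5 * math.sin(2 * math.pi * t / 50)
          + 0.01 * t for t in range(200)]
imfs, residue = emd(signal)
reconstructed = [sum(c[t] for c in imfs) + residue[t] for t in range(len(signal))]
max_error = max(abs(a - b) for a, b in zip(signal, reconstructed))
```

By construction, `max_error` stays at floating-point rounding level, which is the numerical counterpart of Equation 1.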

EEMD
EEMD, a data analysis technique proposed by Wu and Huang (2009), was developed to resolve mode mixing, an obvious disadvantage of EMD. Mode mixing means that a single IMF contains signals covering a broad frequency band, or that signals in a similar frequency band are spread over more than one IMF (Ren et al., 2015). The key point of EEMD is signal decomposition with noise addition. EEMD rests on the concept that an observed signal is the true value plus Additive White Gaussian Noise (AWGN), so that the ensemble mean of the series decomposed with different white noise series yields a better estimate of the true time series. According to Wu and Huang (2009), the EEMD algorithm comprises four key phases:
• Generate an AWGN series and add it to the original time series
• Break down the noise-added time series from phase (1) into IMFs and a residue using EMD
• Repeat phases (1)-(2) with a different white noise series each time
• Take the ensemble means of the corresponding IMFs and residues as the final result
For more detailed information on EEMD, readers can refer to Wu and Huang (2009).
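The four phases can be sketched as a noise-assisted ensemble loop. This is an illustration under stated assumptions, not the authors' code: `moving_average_split` is a hypothetical stand-in for a real EMD routine (it always returns one detail component and a smooth residue), the noise standard deviation is assumed to be a ratio of the series' own standard deviation, and every trial is assumed to return the same number of components.

```python
import math
import random
import statistics

def moving_average_split(x, window=5):
    """Hypothetical stand-in for EMD: one detail component plus a smooth residue."""
    n, half = len(x), window // 2
    trend = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        trend.append(sum(x[lo:hi]) / (hi - lo))
    detail = [v - m for v, m in zip(x, trend)]
    return [detail], trend

def eemd(x, decompose, ensemble_size=100, noise_ratio=0.2, seed=0):
    """Ensemble mean of decompositions of x plus independent white noise."""
    rng = random.Random(seed)
    sigma = noise_ratio * statistics.pstdev(x)  # assumed: ratio of signal std
    sum_imfs, sum_res = None, [0.0] * len(x)
    for _ in range(ensemble_size):                      # phase 3: repeat
        noisy = [v + rng.gauss(0.0, sigma) for v in x]  # phase 1: add AWGN
        imfs, res = decompose(noisy)                    # phase 2: decompose
        if sum_imfs is None:
            sum_imfs = [[0.0] * len(x) for _ in imfs]
        for acc, comp in zip(sum_imfs, imfs):
            for t, v in enumerate(comp):
                acc[t] += v
        for t, v in enumerate(res):
            sum_res[t] += v
    # phase 4: ensemble means of corresponding components
    mean_imfs = [[v / ensemble_size for v in acc] for acc in sum_imfs]
    mean_res = [v / ensemble_size for v in sum_res]
    return mean_imfs, mean_res

x = [math.sin(0.3 * t) + 0.02 * t for t in range(120)]
mean_imfs, mean_res = eemd(x, moving_average_split)
recon = [sum(c[t] for c in mean_imfs) + mean_res[t] for t in range(len(x))]
max_dev = max(abs(a - b) for a, b in zip(recon, x))
```

Because the added noise averages out over the ensemble, `max_dev` is much smaller than the noise level of any single trial. With a real EMD routine the number of IMFs can differ between trials, which this sketch does not handle.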

FANN
FANN is a data-driven modeling system that mathematically emulates the architecture and function of the neural system of the human brain. The Multilayer Perceptron (MLP) type of FANN generally comprises input, hidden and output layers. For example, a FANN architecture with five input neurons, 13 hidden neurons and one output neuron is depicted schematically in Fig. 2. According to Günther and Fritsch (2010), an MLP with one hidden layer of k neurons calculates the output based on the following equation:

$$o(\mathbf{x}) = f\left(w_0 + \sum_{j=1}^{k} w_j \, f\left(w_{0j} + \mathbf{w}_j^{T}\mathbf{x}\right)\right)$$

where f is the activation function, w_0 and w_{0j} are the intercepts of the output neuron and of the jth hidden neuron, respectively, w_j is the connection strength between the jth hidden neuron and the output neuron, \mathbf{w}_j = (w_{1j}, ..., w_{nj}) is the vector of connection strengths between the input neurons and the jth hidden neuron, and \mathbf{x} = (x_1, ..., x_n) is the input vector.
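The output equation of the one-hidden-layer MLP can be written out directly as a forward pass. The sketch below assumes the logistic sigmoid for f in both layers; the function and argument names are hypothetical and the weights are illustrative, not fitted values.

```python
import math

def logistic(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_output(x, w0, w, w0_hidden, w_hidden, f=logistic):
    """o(x) = f(w0 + sum_j w[j] * f(w0_hidden[j] + w_hidden[j] . x)).

    x         : input vector (x_1, ..., x_n)
    w0        : intercept of the output neuron
    w         : hidden-to-output connection strengths (w_1, ..., w_k)
    w0_hidden : intercepts of the k hidden neurons
    w_hidden  : k weight vectors, each of length n (input-to-hidden)
    """
    hidden = [
        f(b + sum(wi * xi for wi, xi in zip(wj, x)))
        for b, wj in zip(w0_hidden, w_hidden)
    ]
    return f(w0 + sum(wj * hj for wj, hj in zip(w, hidden)))

# One hidden neuron whose pre-activation is 0.5*1 - 0.25*2 = 0
out = mlp_output(x=[1.0, 2.0], w0=-1.0, w=[2.0],
                 w0_hidden=[0.0], w_hidden=[[0.5, -0.25]])
```

Here the hidden neuron outputs f(0) = 0.5, so the output neuron receives -1 + 2 * 0.5 = 0 and `out` equals 0.5.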

EEMD-FANN
EEMD-FANN is a hybrid model coupling EEMD and FANN. As depicted in Fig. 3, the EEMD-FANN approach comprises three key phases:
• Decomposition: The original river stage time series is broken down into n IMFs and a residue using EEMD, as described in the previous section
• Single forecasting: A FANN is used as the forecasting model for each IMF and for the residue, and each FANN performs one-day-ahead forecasting. The input variables for each FANN are selected based on the optimal lag time determined by the partial autocorrelation and cross-correlation functions
• Ensemble forecasting: The final forecasted time series is obtained by aggregating the single forecasts for the IMFs and the residue from phase 2
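The data flow of the single forecasting and ensemble forecasting phases can be sketched as follows. The input layout (lagged Dosan values plus the current Socheon value as inputs, the current Dosan value as target) is assumed from the model configuration in Table 1, aggregation is assumed to be a simple sum of the component forecasts (consistent with Equation 1), and the helper names are hypothetical. The FANN itself is omitted.

```python
def lagged_samples(ds, sc, n_lags):
    """Build (inputs, target) pairs for one decomposed component.

    Inputs for time t: ds[t-n_lags], ..., ds[t-1], sc[t]; target: ds[t].
    ds and sc are lists holding the same component at the two stations.
    """
    inputs, targets = [], []
    for t in range(n_lags, len(ds)):
        inputs.append(ds[t - n_lags:t] + [sc[t]])
        targets.append(ds[t])
    return inputs, targets

def aggregate(component_forecasts):
    """Sum the forecasts of all IMFs and the residue, time step by time step."""
    return [sum(values) for values in zip(*component_forecasts)]

# Toy component with two lags
X, y = lagged_samples(ds=[1, 2, 3, 4, 5], sc=[10, 20, 30, 40, 50], n_lags=2)
# Toy forecasts for two IMFs and a residue over two days
final = aggregate([[1, 2], [10, 20], [100, 200]])
```

In this toy case the first sample is `[1, 2, 30]` with target `3`, and the aggregated forecast is `[111, 222]`.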

Analysis
The original time series gathered from the two hydrological observatories were decomposed using EEMD to build the EEMD-FANN model. In the decomposition, the ensemble size and the standard deviation of the AWGN were set to 100 and 0.2, respectively, following previous studies (Tang et al., 2012; Xu et al., 2016).
In daily river stage forecasting with the EEMD-FANN approach, one of the significant phases is choosing efficient input variables. The input variables of the FANN models for the IMFs and the residue were selected based on the optimal lag time determined by statistical correlation analysis, using the cross-correlation function (also known as the sliding inner product) and the partial autocorrelation function, following previous studies (Sudheer et al., 2002; Shabri and Samsudin, 2015). Table 1 summarizes the model configuration for the IMFs and residue.
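Lag selection by the partial autocorrelation function can be illustrated with the Durbin-Levinson recursion. This is a generic sketch, not the procedure of the cited studies: the 1.96/sqrt(N) significance bound is an assumed, commonly used rule, the function names are hypothetical, and the series is synthetic.

```python
import math
import random

def autocorr(x, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    c0 = sum(v * v for v in d)
    return [sum(d[t] * d[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def pacf(x, max_lag):
    """Partial autocorrelations at lags 1..max_lag (Durbin-Levinson)."""
    r = autocorr(x, max_lag)
    phi_prev, out = [], []
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk, phi = r[1], [r[1]]
        else:
            num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
            phi_kk = num / den
            phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                   for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
        phi_prev = phi
    return out

def significant_lags(x, max_lag):
    """Lags whose partial autocorrelation exceeds the assumed 95% bound."""
    bound = 1.96 / math.sqrt(len(x))
    return [k for k, v in enumerate(pacf(x, max_lag), start=1) if abs(v) > bound]

# Synthetic first-order autoregressive series (coefficient 0.8)
rng = random.Random(1)
series = [0.0]
for _ in range(1999):
    series.append(0.8 * series[-1] + rng.gauss(0.0, 1.0))
values = pacf(series, 5)
lags = significant_lags(series, 5)
```

For this synthetic series the partial autocorrelation is large at lag 1 and close to zero at higher lags, so lag 1 would be retained as an input.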
This study employed the MLP type of FANN model for IMF and residue forecasting. In the MLP modeling, the number of hidden neurons was optimized by an iterative approach that examines the RMSE values obtained with different numbers of hidden neurons. The neurons used the logistic sigmoid activation function, and the MLP model was trained with the widely used Back-Propagation (BP) learning algorithm. The training and testing data were normalized to [0, 1] to enhance the efficiency of the BP algorithm (Dawson and Wilby, 2001).
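The data preparation described here, together with the 75%/25% chronological split from the data section, might look like the sketch below. One assumption is labeled loudly: the scaling bounds are taken from the training period only, so testing values can fall slightly outside [0, 1]. The paper does not state which bounds were used, and the function names and stage values are hypothetical.

```python
def chronological_split(series, train_fraction=0.75):
    """Split a time-ordered series into training and testing parts."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

def normalize(values, lo, hi):
    """Min-max scaling with given bounds."""
    return [(v - lo) / (hi - lo) for v in values]

def denormalize(values, lo, hi):
    """Inverse of normalize, used to bring forecasts back to stage units."""
    return [v * (hi - lo) + lo for v in values]

stage = [2.0, 4.0, 6.0, 3.0, 5.0, 4.0, 7.0, 2.5]   # illustrative values
train, test = chronological_split(stage)
lo, hi = min(train), max(train)   # assumption: bounds from training data only
train_n = normalize(train, lo, hi)
test_n = normalize(test, lo, hi)
restored = denormalize(train_n, lo, hi)
```

Using the overall minimum and maximum instead would keep every value inside [0, 1], at the cost of letting testing-period information into the scaling.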
The performance of the EEMD-FANN approach was assessed quantitatively using seven model efficiency indexes and compared with that of the single ANFIS and FANN models investigated by Seo et al. (2015a). The indexes applied in this study are three dimensionless indexes (IA, r^2 and CE) and four residual error-based indexes (MAE, RMSE, MS4E and MSRE). For specific information on the indexes, readers can refer to Dawson and Wilby (2001). The result demonstrated that, in terms of the indexes, the EEMD-FANN model coupling EEMD and FANN produced better forecasting performance than the single ANFIS and FANN models (Seo et al., 2015a; Dawson and Wilby, 2001). The result also signified that EEMD can further boost the forecasting ability of the single FANN model.

Table 1. Model configuration for the IMFs and residue
Component | Input variables | Output variable
IMF2 | IMF2 DS (t-4), IMF2 DS (t-3), IMF2 DS (t-2), IMF2 DS (t-1), IMF2 SC (t) | IMF2 DS (t)
IMF3 | IMF3 DS (t-4), IMF3 DS (t-3), IMF3 DS (t-2), IMF3 DS (t-1), IMF3 SC (t) | IMF3 DS (t)
IMF4 | IMF4 DS (t-3), IMF4 DS (t-2), IMF4 DS (t-1), IMF4 SC (t) | IMF4 DS (t)
IMF5 | IMF5 DS (t-2), IMF5 DS (t-1), IMF5 SC (t) |
RES | RES DS (t-1), RES SC (t) | RES DS (t)
Note. DS, Dosan; SC, Socheon; RES, Residue

Evaluation of Model Performance
Figures 4-6 show that the degree of dispersion around the 45-degree slope line (red line) is smaller for the EEMD-FANN model than for the ANFIS and FANN models. When the straight lines (blue lines), y = ax + b, fitted to the scatter points were examined, the slope (a) and intercept (b) for EEMD-FANN were also closer to 1 and 0, respectively. The figures make it apparent that the values forecasted by EEMD-FANN are closer to the observed river stage values, with smaller errors, than those of the single ANFIS and FANN. From this graphical comparison, EEMD-FANN was found to provide better forecasting ability than the single ANFIS and FANN. The results also indicate that EEMD can further raise the forecasting efficiency of the single FANN model in daily river stage forecasting.
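The quantities used in this evaluation can be computed as in the following sketch. The formulas are the standard definitions of the seven indexes (IA as Willmott's index of agreement, CE as the Nash-Sutcliffe coefficient of efficiency) and are assumed to match those used in the study. The fitted line takes observed values as x and forecasted values as y, which is also an assumption, and the numbers are illustrative only.

```python
import math

def efficiency_indexes(obs, pred):
    """Seven model efficiency indexes for observed and forecasted values."""
    n = len(obs)
    o_mean, p_mean = sum(obs) / n, sum(pred) / n
    err = [o - p for o, p in zip(obs, pred)]
    sse = sum(e * e for e in err)
    s_oo = sum((o - o_mean) ** 2 for o in obs)
    s_pp = sum((p - p_mean) ** 2 for p in pred)
    s_op = sum((o - o_mean) * (p - p_mean) for o, p in zip(obs, pred))
    ia_den = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2
                 for o, p in zip(obs, pred))
    return {
        "IA": 1.0 - sse / ia_den,                # index of agreement
        "r2": s_op ** 2 / (s_oo * s_pp),         # coefficient of determination
        "CE": 1.0 - sse / s_oo,                  # coefficient of efficiency
        "MAE": sum(abs(e) for e in err) / n,     # mean absolute error
        "RMSE": math.sqrt(sse / n),              # root mean squared error
        "MS4E": sum(e ** 4 for e in err) / n,    # mean fourth-power error
        "MSRE": sum((e / o) ** 2 for e, o in zip(err, obs)) / n,  # needs obs != 0
    }

def fit_line(obs, pred):
    """Least-squares line pred = a * obs + b through the scatter points."""
    n = len(obs)
    o_mean, p_mean = sum(obs) / n, sum(pred) / n
    a = (sum((o - o_mean) * (p - p_mean) for o, p in zip(obs, pred))
         / sum((o - o_mean) ** 2 for o in obs))
    return a, p_mean - a * o_mean

obs = [1.0, 2.0, 3.0]
pred = [2.0, 3.0, 4.0]   # forecasts with a constant bias of +1
scores = efficiency_indexes(obs, pred)
a, b = fit_line(obs, pred)
```

In this biased example r^2 is 1 while CE is negative, and the fitted line has slope 1 but intercept 1. This shows why the study looks at several indexes and at both a and b rather than at correlation alone.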

Conclusion
This research investigates the efficiency of a hybrid data-driven approach, EEMD-FANN, which integrates EEMD and FANN for forecasting daily river stage. The specific purposes are to build the hybrid data-driven model for improving the accuracy of daily river stage forecasting in the Andong Dam watershed, located in the eastern inland region of South Korea, and to assess the model applicability by comparison with the performance of the single ANFIS and FANN models. The efficiency of the EEMD-FANN, ANFIS and FANN models is evaluated using dimensionless indexes (IA, r^2 and CE) and residual error-based indexes (MAE, RMSE, MS4E and MSRE). The EEMD-FANN model achieves higher efficiency than the single ANFIS and FANN models in terms of both the model efficiency indexes and the graphical comparison. The results indicate that the hybrid data-driven approach coupling EEMD and FANN can further boost the forecasting ability of the single FANN model and can be an effective hydrological forecasting approach.