A Hybrid Model of Bidirectional Long-Short Term Memory and CNN for Multivariate Time Series Classification of Remote Sensing Data

Abstract: Classification of multivariate time series has attracted considerable attention over the last decade. Traditional classifiers struggle with complicated patterns and are unable to capture the dependencies in multivariate time series data. To capture both the effective features and the embedded relationships in multivariate time series, this study proposes a new hybrid model, named Conv-BiLSTM, which combines a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory (BiLSTM) network. The proposed Conv-BiLSTM is applied to classifying land cover multivariate time series from Landsat 8 satellite images. Its efficacy is verified by comparison with state-of-the-art methods using different cases of the training dataset. The suggested network outperforms the Random Forest (RF), BiLSTM and CNN classifiers, with an average classification accuracy that is 6.5, 8 and 8.7% higher than those classifiers, respectively. Moreover, the F-Score of the proposed Conv-BiLSTM network is on average 1.38% higher than that of the state-of-the-art WEASEL+MUSE technique.


Introduction
Time series data attract particular attention in different fields of study, such as weather readings and psychological signals (Cui et al., 2015) (Kadous, 2002) (Sharabiani et al., 2017). A time series data set is either univariate, where a series of measurements is collected from one variable, or multivariate, where the measurements are collected from multiple variables or multiple sensors (Prieto et al., 2015). Multivariate time series are therefore more complicated and harder to predict than univariate ones. Multivariate Time Series (MTS) have received considerable attention in data mining because of their wide applicability in domains such as medical diagnosis, motion detection, anomaly detection, financial prediction and remote sensing with satellite images (Huai-Shuo Huang et al., 2018) (Wang et al., 2016) (Spiegel et al., 2011). Over the past decade, the classification of multivariate time series has received great attention in many domains, such as healthcare (Kang and Choi, 2014), sound classification (Kang and Choi, 2014), phoneme classification (Graves and Schmidhuber, 2005), object recognition, human activity recognition and action recognition (Fu, 2015) (Geurts, 2001) (Yu and Lee, 2015). One common approach to multivariate time series classification is to apply dimensionality reduction techniques or to concatenate all dimensions of the multivariate time series into a univariate time series (Karim, 2019). The Symbolic representation for Multivariate Time Series (SMTS) approach (Baydogan and Runger, 2015) applies a random forest to the multivariate time series to partition it into leaf nodes, and each leaf is represented by a word to form a codebook. Every word is then used with another random forest to classify the multivariate time series.
A similar classifier, Learned Pattern Similarity (LPS) (Baydogan and Runger, 2016), extracts segments from the multivariate time series. Regression trees are trained on these segments to explore the dependencies between them. Each node in the tree is represented by a word, and a similarity measure over these words is used to classify unknown multivariate time sequences. The Ultra-Fast Shapelets (UFS) technique (Wistuba et al., 2015) extracts random shapelets from the multivariate time series and uses a linear SVM or a Random Forest as the classifier. UFS was later improved by calculating derivatives as additional features (dUFS). The Auto-Regressive (AR) kernel approach (Cuturi and Doucet, 2011) uses an AR kernel-based distance measure to classify multivariate time series. Auto-Regressive forests (Tuncel and Baydogan, 2018), on the other hand, use a tree ensemble to model multivariate time series, where the trees are trained with different time lags. Most recently, the method combining the WEASEL and MUSE techniques (Schäfer and Leser, 2017) builds a multivariate feature vector by applying a classical bag-of-patterns approach to each variable to capture discrete features such as words and pairs of words. The final classification result is produced by a logistic classifier applied to the final feature vector.
In recent years, deep learning has gained remarkable popularity and many achievements are reported in the literature (Längkvist et al., 2014) (Bengio, 2013) (Bengio, 2009) (Schmidhuber, 2015). As one type of deep network, the Convolutional Neural Network (CNN) has been used successfully in pattern recognition. Instead of relying on human-designed features, a CNN automatically mines and generates deep features from input images. It is also robust to transformation, scaling and rotation. Three properties make the CNN more discriminative than traditional feed-forward neural networks: local receptive fields, weight sharing and pooling (Swietojanski et al., 2014). Inspired by the CNN structure for image recognition, (Zhao B. et al., 2017) proposed a novel CNN framework for time series classification. Convolution and pooling operations are used alternately to create deep features, which are then connected to a Multilayer Perceptron (MLP) to perform classification. Experimental results on both simulated and real data sets show that the CNN achieves the best results compared to the state-of-the-art methods in terms of classification accuracy and noise tolerance. Previous work on CNNs for multivariate time series classification is summarized as follows. Zheng et al. (2014) proposed a Multi-Channel Deep Convolutional Neural Network (MC-DCNN), which separates the multivariate time series into univariate ones and then learns latent features from each variable individually. Finally, classification is performed by feeding the latent features into an MLP layer, and the experiments showed that the method achieved good classification performance. However, one major limitation of this method is that it cannot extract the interrelationships between the different univariate time series.
The work by (Zhao et al., 2017) modified the previous algorithm: instead of learning features from each variable individually, the multivariate time series is trained jointly for feature extraction, using a typical CNN architecture for three-variate time series classification with two convolutional layers and two pooling layers.
In this study, a new hybrid model named Conv-BiLSTM, which combines a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory (BiLSTM) network, is suggested for classifying land cover multivariate time series. Compared to other techniques, the proposed Conv-BiLSTM network has the advantage of capturing both the effective features and the embedded relationships in long time sequences. Its average classification accuracy is 6.5, 8 and 8.7% higher than that of the Random Forest (RF), BiLSTM and CNN techniques, respectively.

Related Work
Over the last decades, Earth observation has been used to study the surface of our planet and follow its development. Characterizing the surface changes that result from deforestation, the evolution of agricultural activity or urbanization is essential for estimating population growth and climate change (Running, 2008). Land cover and land use mapping are among the applications of remote sensing for monitoring the Earth's surface. Data and information about land cover, such as buildings, rivers, fields, mountains and trees, are very important for studying human resources and human life. Because of the use of unclean energy sources, human activity, rapid urbanization and climate change, the Earth's surface changes considerably on both regional and global scales (Schafer et al., 2018). Therefore, accurate and timely analysis of land resource use and of any surface changes plays an important role in improving human societies.
Remote sensing satellites are among the tools used to capture and collect time series data, especially for monitoring the Earth's surface. They produce up-to-date maps of land use and land cover changes such as urbanization, deforestation and desertification (Schafer et al., 2018). The Earth surface data collected by satellites form large sets of hyper-spectral or multi-spectral data with a given spatial resolution and temporal density, in the form of either images or time series, and these data need to be classified. Balancing the runtime and the accuracy of classification therefore becomes more challenging when each pixel has multiple readings from the periodic scans of the satellite (Schafer et al., 2018). Automatic classification of remote sensing data receives particular attention because the labelling phase is restricted by the following issues: (a) the high cost in both human resources and time, whether labels come from field campaigns or from expert labelling, (b) the huge amount of data to be labelled and (c) the fast changes in the landscape, which require maps to be updated at a rate that cannot be achieved manually (Bailly, 2018). The machine learning methods used in land use and land cover classification (Rajendran, 2020) are as follows.
Random Forest (RF) is one of the most widely used machine learning algorithms (Breiman, 2001) for classification and regression. Among its applications is Earth science, which includes modeling forest cover (Betts et al., 2017), land use (Araki et al., 2018), land cover (Nitze et al., 2015) and object-oriented mapping (Kavzoglu, 2017). The work by (Rodriguez-Galiano et al., 2012) showed that the RF algorithm outperformed classification by decision trees and achieved an accuracy of up to 92%. The high accuracy of the RF algorithm compared to other classification trees was attributed to its ensemble architecture, in which several trees are trained on subsets of the training data.
The Support Vector Machine (SVM) also has a high capability to generalize complicated features, and accordingly it outperformed other classifiers in the works of (Shao and Lunetta, 2012) and (Mountrakis et al., 2011). In a land cover classification study involving six land cover classes of Landsat-8 data, SVM achieved a relatively high overall accuracy of 88% (Goodin et al., 2015). Recently, (Mansaray et al., 2020) analyzed the effect of the training sample size on the overall accuracies of both SVM and RF, with an application to mapping paddy rice in China in 2015 and 2016. For the 2015 mapping, the overall accuracies of the SVM and RF classifiers reached 90.8 and 89.2%, respectively, using 10 observations from the Landsat-8 and Sentinel-1A satellites. For the 2016 mapping, SVM and RF achieved overall accuracies of 93.4 and 95.2%, respectively, using 14 observations from the Landsat-8, Sentinel-1A and Sentinel-2A satellites.
Neural networks have also been used for classifying satellite images, usually with one or two hidden layers, but their efficiency remained low because of the cost of data and inadequate computing power (Mas and Flores, 2008). Since the beginning of the 21st century, with the growth of Earth observation data and computing resources, deeper hidden layers and more complex network architectures have become worth using. Deep Learning (DL) covers a family of different architectures constructed from neural networks. These architectures include multi-layer perceptrons, deep belief networks, stacked auto-encoders, deep neural networks, restricted Boltzmann machines and others. DL has been widely used in many applications since 2015, such as mapping land cover (Li et al., 2016) and crops (Kussul et al., 2017) (Zhong, 2019), estimating crop yields (Kuwata and Shibasaki, 2015) and detecting oil palm trees (Li et al., 2017) and plant diseases (Mohanty et al., 2016), with accuracies reaching 90%. A review of different deep learning methods for classifying the land cover and land use of remote sensing data was presented in (Abebaw Alem and Shailender, 2020). A systematic review of the application of transfer learning to scene classification, using different land cover and land use datasets and different deep learning models, was presented in (De Lima and Marfurt, 2020). In the study by (Unnikrishnan et al., 2019), the Normalized Difference Vegetation Index (NDVI) concept is used, and consequently only the information in the red and Near Infrared (NIR) bands is used to classify the publicly available SAT-4 and SAT-6 datasets. That study suggested new deep learning architectures based on three common networks, AlexNet, ConvNet and VGG, by tuning the network hyper-parameters with the two-band data as input.

Data Sets
Monitoring changes in land use is a significant area of research because land cover is the main variable that drives the balance of the Earth's energy, carbon and hydrological cycles and the supply of natural resources (Bengio, 2009). Since land surfaces have various structures and different chemical properties, they absorb and reflect sunlight differently depending on the wavelength, and consequently land cover information can be extracted from the spectral bands. For example, a water surface absorbs much of the near-infrared radiation; these wavelengths are therefore useful for delineating land-water boundaries that are not clear in visible light. Similarly, green vegetation absorbs much of the incoming radiation in the red spectrum, while about 50% of the radiation is reflected in the near-infrared spectrum. Multi-spectral sensors for Earth surface observation receive the solar energy reflected by a surface in a few distinct spectral wavelength ranges named bands, e.g., blue, green and red in the visible spectrum (400 to 700 nm), near infrared (700 to 1100 nm) and short-wave infrared (1100 to 3000 nm). Among the multispectral sensors are the American Landsat 8 sensor and the two European Sentinel-2 sensors. Landsat 8 takes images of the Earth at a spatial resolution of 30 m every 16 days, while Sentinel-2 acquires images at a spatial resolution of 10-20 m every 5 days.
In this study, the public dataset published by (Bailly, 2018) for the Time Series Land Cover Classification Challenge (TiSeLaC, 2017) is used. The dataset was collected by the Landsat 8 satellite over Reunion Island in 2014 at intervals of 16 days, giving time series of 23 steps. The island is shown in Fig. 1.
For each time step, ten spectral features were collected: seven reflectance bands and three indices. The indices are the Normalized Difference Vegetation Index (NDVI), the Normalized Difference Water Index (NDWI) and the Brightness Index (BI). Consequently, each time sequence consists of 23 time steps with 10 features per time step, giving a total length of 230. The Landsat 8 time series data were reshaped into image format, with 2866 rows and 2633 columns and 10 bands for each time step, as shown in Fig. 2. Fig. 3 shows examples of the NDVI band over the twenty-three time steps for different land cover classes.
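The relationship between the flat length of 230 and the 23-by-10 layout can be illustrated with a short sketch. It assumes the values are stored time step by time step, and the position of NDVI at index 7 is purely hypothetical; neither detail is stated in the text, and the function names are illustrative.

```python
# Illustrative sketch: the storage order and the feature index are assumptions.
N_STEPS = 23      # Landsat 8 acquisitions, one every 16 days
N_FEATURES = 10   # 7 reflectance bands + NDVI, NDWI and BI


def to_time_major(flat):
    """Split a flat 230-value pixel record into 23 rows of 10 features."""
    if len(flat) != N_STEPS * N_FEATURES:
        raise ValueError("expected 23 * 10 = 230 values")
    return [flat[t * N_FEATURES:(t + 1) * N_FEATURES] for t in range(N_STEPS)]


def feature_series(sequence, index):
    """Return the 23-step series of a single feature across time."""
    return [step[index] for step in sequence]


flat_record = list(range(230))              # placeholder values, not real data
sequence = to_time_major(flat_record)
ndvi_like = feature_series(sequence, 7)     # index 7 is a hypothetical NDVI slot
print(len(sequence), len(sequence[0]), len(ndvi_like))
```

If the data were instead stored feature by feature, the slicing would change, but the 23-by-10 structure would be the same.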
As shown in Fig. 3, NDVI values range from -1.0 to +1.0. Low NDVI values of 0.1 or less correspond to rocks, sand or snow. Medium NDVI values represent sparse vegetation such as shrubs, grasslands or crops in the low season. Finally, high NDVI values of 0.6 or more correspond to dense vegetation such as forests and crops before harvest.
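As a small illustration of these ranges, the sketch below computes NDVI with its standard definition, (NIR - Red) / (NIR + Red), and maps a value to the three coarse categories. Treating everything strictly between 0.1 and 0.6 as "medium" is an assumption, since the text does not give exact bounds for the medium range; the function names and the zero-denominator convention are also illustrative.

```python
def ndvi(nir, red):
    """Standard NDVI from near-infrared and red reflectance values."""
    total = nir + red
    if total == 0:
        return 0.0  # convention chosen here to avoid division by zero
    return (nir - red) / total


def describe_ndvi(value):
    """Map an NDVI value to the coarse categories described in the text."""
    if value <= 0.1:
        return "rocks, sand or snow"
    if value >= 0.6:
        return "dense vegetation"
    return "sparse vegetation"  # assumed to cover the whole range in between


print(describe_ndvi(ndvi(nir=0.5, red=0.1)))
```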
Each time series is labeled with a class name from a set of predefined labels. The reference dataset (TiSeLaC, 2017) contains 81714 readings categorized into 9 classes, which are used for training the classifier model as detailed in Table 1, while a further 17973 readings are used for testing the classifier model, as listed in the same table.
The different classes of the dataset are described by (Bailly, 2018) as follows:
• The Rocks and bare soil class has samples with low NDVI values, close to or below 0.
• The Grassland class combines both grazed and mowed grasslands, which have different time series signatures due to variation in their re-growth.
• The time series of the Sugarcane crop class have the most distinctive profiles because the NDVI values drop after harvest. Sugarcane crops represent 60% of the cultivated area on Reunion Island (Denize, 2015).
• The Other crops class is formed from many crops, such as pineapple, banana and mango, and this diversity results in high intra-class variability, which makes classification difficult.
• The Water class has high intra-class variability, which results in perturbed NDVI values.
In the classification process, a classifier is a module whose function is learned from a set of labeled time series, the training data set; in the test phase, it takes unlabeled time series as input and outputs their labels. The proposed network for land cover classification is detailed in the next section.

Convolutional Neural Network
A Convolutional Neural Network (CNN) is a type of multi-layer neural network that usually includes two main parts (Pires de Lima et al., 2020). The first is a feature extractor, which learns features automatically from the raw data. The second is a trainable fully connected Multi-Layer Perceptron (MLP), which performs the classification based on the features learned by the first part. Generally, the feature extractor consists of multiple identical stages, and each stage comprises three cascaded layers: a filter layer, an activation layer and a pooling layer. The output of each layer is called a feature map (LeCun and Bengio, 1995). The CNN is described in more detail in (Krizhevsky, 2012).
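The three cascaded layers of one feature-extraction stage can be sketched in a few lines. This is a minimal one-dimensional illustration, not the network used in this study: the ReLU activation, the pooling size of 2, the kernel values and the function names are all assumptions made for the example.

```python
def conv1d(signal, kernel, bias=0.0):
    """Filter layer: valid (no padding) 1-D cross-correlation."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Activation layer: element-wise max(0, v)."""
    return [max(0.0, v) for v in values]


def max_pool(values, size=2):
    """Pooling layer: non-overlapping max pooling; a partial tail is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


def stage(signal, kernel):
    """One stage of the feature extractor; its output is a feature map."""
    return max_pool(relu(conv1d(signal, kernel)))


# A difference kernel responds to rising values in the input series.
feature_map = stage([0.1, 0.3, 0.7, 0.6, 0.2, 0.1], kernel=[-1.0, 1.0])
print(feature_map)
```

Stacking several such stages and feeding the final feature map into an MLP gives the two-part structure described above.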

Recurrent Neural Network
A Recurrent Neural Network (RNN) is a category of Artificial Neural Network (ANN) that differs from feed-forward neural networks in that it can use its internal state (memory) to process sequences of inputs. An RNN can remember the state of an input from preceding time steps, which helps it to make a decision at a later time step. The Long Short-Term Memory (LSTM) network is a variant of the RNN that addresses the vanishing gradient problem of earlier RNNs and retains information over long sequences. Two aspects differentiate the LSTM from the basic RNN. First, the cell state is divided into two parts, the long-term state c(t) and the short-term state h(t). Second, the LSTM has three control gates, the forget gate, the input gate and the output gate, all of which lie along the state path to regulate the cell states, as introduced in (Gharghory, 2020).
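The role of the two states and the three gates can be made concrete with a single-unit sketch. It follows the commonly used LSTM formulation with a scalar input; the weights, the `params` layout and the input values are illustrative assumptions, not values from this study.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell with a scalar input.

    params maps each gate name ("f", "i", "o", "g") to a tuple
    (input weight, recurrent weight, bias); this layout is illustrative.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("f"))     # forget gate
    i = sigmoid(pre_activation("i"))     # input gate
    o = sigmoid(pre_activation("o"))     # output gate
    g = math.tanh(pre_activation("g"))   # candidate update of the cell
    c = f * c_prev + i * g               # long-term state c(t)
    h = o * math.tanh(c)                 # short-term state h(t)
    return h, c


params = {name: (0.5, 0.1, 0.0) for name in "fiog"}  # arbitrary weights
h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.6]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because c(t) is updated additively through the forget and input gates, information can be carried across many time steps, which is the mechanism usually credited with easing the vanishing gradient problem.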

Bidirectional Long Short-Term Memory
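Bidirectional Long Short-Term Memory (BiLSTM) learns bidirectional long-term dependencies between the time steps of time series data, which is useful when the network needs to learn from the complete time series at each time step. BiLSTM processes input sequences in both directions using two recurrent sub-layers (Schuster and Paliwal, 1997): the first processes the input sequence in the forward direction, while the second processes it backwards. Both sub-layers are connected to the same output layer, so the BiLSTM network has access to complete information about the preceding and following data points. The two sub-layers compute the forward and backward hidden sequences $\overrightarrow{h}$ and $\overleftarrow{h}$ respectively, which are then combined to compute the output sequence $y$, as in the standard bidirectional formulation given in the following equations. The structures of the LSTM cell and the BiLSTM network are shown in Fig. 4 and 5, respectively:

$\overrightarrow{h}_t = H\left(W_{x\overrightarrow{h}}\, x_t + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\right)$ (1)

$\overleftarrow{h}_t = H\left(W_{x\overleftarrow{h}}\, x_t + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}\right)$ (2)

$y_t = W_{\overrightarrow{h}y}\, \overrightarrow{h}_t + W_{\overleftarrow{h}y}\, \overleftarrow{h}_t + b_y$ (3)

Where: $W_{x\overrightarrow{h}}$ denotes the connection weight between the input layer and the forward hidden layer, $W_{x\overleftarrow{h}}$ denotes the connection weight between the input layer and the backward hidden layer, $W_{\overrightarrow{h}\overrightarrow{h}}$ denotes the connection weight between successive steps of the forward hidden layer and $W_{\overrightarrow{h}y}$ denotes the connection weight between the forward hidden layer and the output layer. The backward weights $W_{\overleftarrow{h}\overleftarrow{h}}$ and $W_{\overleftarrow{h}y}$ are defined analogously, $H$ is the hidden-layer function and the $b$ terms are biases.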
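The way the forward and backward hidden sequences are combined can be sketched as follows. To keep the example short, a scalar tanh recurrence stands in for the LSTM cell, so this illustrates only the bidirectional wiring and not the gating; all weights, names and inputs are illustrative assumptions.

```python
import math


def run_direction(xs, w_x, w_h, reverse=False):
    """Run a scalar tanh recurrence over xs (a stand-in for an LSTM layer).

    Returns the hidden values indexed by time step, regardless of direction.
    """
    order = reversed(range(len(xs))) if reverse else range(len(xs))
    hidden = [0.0] * len(xs)
    h = 0.0
    for t in order:
        h = math.tanh(w_x * xs[t] + w_h * h)
        hidden[t] = h
    return hidden


def bidirectional(xs, fwd, bwd, w_fy, w_by, b_y=0.0):
    """Combine forward and backward hidden sequences into the outputs y_t."""
    h_forward = run_direction(xs, *fwd)
    h_backward = run_direction(xs, *bwd, reverse=True)
    return [w_fy * hf + w_by * hb + b_y
            for hf, hb in zip(h_forward, h_backward)]


ys = bidirectional([0.1, 0.5, -0.2], fwd=(1.0, 0.5), bwd=(1.0, 0.5),
                   w_fy=1.0, w_by=1.0)
print(ys)
```

Changing only the last input value changes the first output as well, because the backward pass carries information from later time steps to earlier ones.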

The Architecture of the Proposed Conv-BiLSTM Network Model
A hybrid model combining CNN and BiLSTM networks, named Conv-BiLSTM, is proposed in this study for the classification of multivariate time series of land cover data. The suggested model is used as a semantic classifier to classify time series samples of length 23 with 10 bands each into 9 classes. The proposed network consists of multiple layers: the suggested convolutional layers and the BiLSTM layer, followed by the fully connected layer and the softmax and classification layers. To perform the convolutional operations on each time step independently, a sequence folding layer is included before the convolutional layers. To restore the sequence structure and reshape the output of the convolutional layers into sequences of feature vectors, a sequence unfolding layer and a flatten layer are inserted between the convolutional layers and the BiLSTM layer. The layers of the proposed network structure and their parameter values are as follows:
• A sequence input layer with an input size of dimensions [10 23 1]
• A sequence folding layer
• A CNN block consisting of a convolutional layer, a batch normalization layer and a ReLU layer, with 30 filters of size 5-by-5. Two such blocks are used in series in the proposed network
• A sequence unfolding layer, which restores the sequence structure of the input data after sequence folding
• A flatten layer
• A BiLSTM layer with 350 hidden units that outputs the last time step only
• A fully connected layer of size 9, which is the number of classes, followed by a softmax layer and a classification layer
The structure of the proposed hybrid Conv-BiLSTM network, together with the structure and graph of its layers, is shown in Fig. 6 and 7.
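To make the data flow through these layers easier to follow, the sketch below traces the output size of each layer for a single time step. The padding mode of the convolutional layers is not stated in the text, so it is exposed as a parameter and both common choices can be tried; the stride of 1 and the doubling of the BiLSTM output size (forward plus backward hidden units) are likewise assumptions of this sketch, and the function names are hypothetical.

```python
def conv_output(height, width, kernel=5, padding="same"):
    """Spatial size after a stride-1 convolution (padding mode is assumed)."""
    if padding == "same":
        return height, width
    if padding == "valid":
        return height - kernel + 1, width - kernel + 1
    raise ValueError("padding must be 'same' or 'valid'")


def trace_shapes(input_size=(10, 23, 1), filters=30, kernel=5, n_conv=2,
                 hidden_units=350, n_classes=9, padding="same"):
    """Return (layer name, output size) pairs for one time step."""
    h, w, c = input_size
    shapes = [("sequence input", (h, w, c))]
    for k in range(n_conv):
        # Batch normalization and ReLU keep the size produced by the filter.
        h, w = conv_output(h, w, kernel, padding)
        c = filters
        shapes.append((f"conv block {k + 1}", (h, w, c)))
    shapes.append(("flatten", (h * w * c,)))
    shapes.append(("BiLSTM (last step)", (2 * hidden_units,)))
    shapes.append(("fully connected + softmax", (n_classes,)))
    return shapes


for name, size in trace_shapes():
    print(name, size)
```

Under the "same" padding assumption the flattened feature vector has 10 x 23 x 30 = 6900 elements, while without padding it shrinks to 2 x 15 x 30 = 900, so the choice noticeably affects the size of the BiLSTM input.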

The Experimental Results
The suggested Conv-BiLSTM classifier is verified using four different training data sets of land cover. In the first case, all 81714 training samples listed in Table 1 are used. [...] (Breiman, 2001). Precision, Recall and F-score are used as the evaluation metrics, as given in Equations (4) to (6). The classification results of the suggested Conv-BiLSTM classifier for the four training cases, compared to the BiLSTM classifier and the Random Forest classifier in terms of Precision, Recall and F-score, are given in Tables 2 to 5. In addition, the average F-Score and the accuracy of the suggested Conv-BiLSTM network, compared to those of the BiLSTM, CNN and RF classifiers, are plotted against the ratio of the training data used in each case to the total training data in Fig. 8 and 9, respectively. The classification accuracy of the suggested Conv-BiLSTM classifier compared to that of the other classifiers is given in Table 6.

Table 4: Classification accuracy of the suggested Conv-BiLSTM network compared to the BiLSTM network and the RF classifier in terms of precision, recall and F-score using the third training case: 1600 samples from each class of the training data set are used for training the classifiers and 999 samples for testing the classifiers

In addition, the efficiency of the proposed Conv-BiLSTM is verified by comparing its F-Score with that of the state-of-the-art method WEASEL + MUSE (Word ExtrAction for time SEries cLassification + MUltivariate Unsupervised Symbols and dErivatives) (Schäfer and Leser, 2017). The average F-Score of the proposed Conv-BiLSTM network reached 87.8% when trained with 1600 samples from each class of the training data set, while the WEASEL + MUSE technique reached 86.6% when trained with half of the training samples. The F-Score of the proposed Conv-BiLSTM network is therefore larger than that of the state-of-the-art WEASEL + MUSE by 1.38%.
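The three metrics can be computed per class from a confusion matrix, as sketched below using their standard definitions: Precision = TP / (TP + FP), Recall = TP / (TP + FN) and F-score = 2 x Precision x Recall / (Precision + Recall). The assumption that Equations (4) to (6) take this standard form, the function names and the small two-class matrix are all illustrative and are not results from this study.

```python
def per_class_metrics(confusion):
    """Precision, recall and F-score per class from a square confusion matrix.

    confusion[i][j] counts the samples of true class i predicted as class j.
    """
    n = len(confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, wrong
        fn = sum(confusion[k]) - tp                       # true k, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f_score))
    return results


def accuracy(confusion):
    """Overall accuracy: correctly classified samples over all samples."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[k][k] for k in range(len(confusion))) / total


# Made-up two-class example, for illustration only.
cm = [[8, 2],
      [1, 9]]
metrics = per_class_metrics(cm)
average_f_score = sum(m[2] for m in metrics) / len(metrics)
print(metrics, average_f_score, accuracy(cm))
```

Averaging the per-class F-scores without weighting, as in the last lines, is one common way to obtain a single average F-score; whether the study weights classes by their size is not stated.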

Results and Discussion
The results in Tables 2 to 5 show that the suggested network for multivariate time series classification of nine land cover categories outperforms the compared BiLSTM, CNN and Random Forest classifiers. The classification accuracy of the suggested Conv-BiLSTM network reaches 90.1%. The suggested network achieves its best F-Score when it is trained with an equal number of samples from each class of the training dataset, with F-Score values ranging from 87.8 to 85.8%. It achieves a smaller F-Score when trained with unequal numbers of samples per class, using either all of the training samples or half of them, with F-Score values of 84.4 and 82.2%, respectively. Concerning the other classifiers, BiLSTM and RF have the same F-Score when both are trained with 1000 samples from each class of the land cover dataset. The accuracy of BiLSTM is on average 0.55% larger than that of RF for the last two cases of training data, while the F-Score of the RF classifier is on average 11.65% larger than that of the BiLSTM classifier for the first two cases of training data. On the other hand, the Precision and F-Score of all classifiers on the Other crops class were very poor with the first cases of the training dataset, because the diversity of crops inside the class results in high intra-class variability, which makes classification difficult. Overall, the classification accuracy of the suggested Conv-BiLSTM network across all training cases is on average 6.5, 8 and 8.7% larger than that of the RF, BiLSTM and CNN classifiers, respectively. In addition, the F-Score of the proposed Conv-BiLSTM network is on average 1.38% larger than that of the state-of-the-art WEASEL+MUSE technique.

Conclusion
In this study, a hybrid model for multivariate time series classification of land cover is suggested. The suggested model combines a convolutional network with a BiLSTM network, which enables it to extract the spatial and temporal features of the small land cover data set used. The simulation results for the suggested network in terms of F-Score, Precision and Recall, given in the previous tables, show that the suggested model outperforms the other techniques in the comparison. The classification accuracy of the suggested network over the different cases of the training data set ranges from 90.1 to 85.8%. Moreover, the F-Score of the proposed Conv-BiLSTM network is on average 1.38% better than that of the state-of-the-art WEASEL+MUSE technique in the literature.