Comparative Analysis of Deep Learning Models for Multi-Step Prediction of Financial Time Series

Abstract: Financial time series prediction has been a key topic of interest among researchers, given the complexity of the domain and its significant impact on a wide range of applications. In contrast to one-step-ahead prediction, multi-step forecasting is more desirable in industry, but the task is also more challenging. In recent years, advances in deep learning have produced impressive results across various tasks, including sequence learning and time series forecasting. Although most previous studies focus on applications of deep learning models for single-step-ahead prediction, multi-step financial time series forecasting has not been explored exhaustively. This paper extensively evaluates the performance of various state-of-the-art deep learning models for multiple multi-step-ahead prediction horizons on real-world stock and forex market datasets. Specifically, we focus on Long Short-Term Memory (LSTM) networks and their variations, encoder-decoder based sequence-to-sequence models, the Temporal Convolution Network (TCN), the hybrid Exponential Smoothing-Recurrent Neural Network (ES-RNN) and Neural Basis Expansion Analysis for interpretable Time Series forecasting (N-BEATS). Experimental results show that the latest deep learning models, such as N-BEATS, ES-LSTM and TCN, produced better results for all stock market datasets, obtaining around 50% lower Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) scores for each prediction horizon compared to the other models. However, the conventional LSTM-based models still prove dominant in the forex domain, achieving around 2% lower error values.


Introduction
Financial time series forecasting has drawn significant attention among researchers from both academia and the financial industry. It is a complex domain that requires modelling of nonlinear behaviour and stochastic patterns while learning the temporal dependencies in the data. The signal-to-noise ratio is considerably low, which adds to the difficulty of forecasting. Researchers and stakeholders are consistently working on new methodologies for improving the accuracy of predictive models, driven by high demand from the financial market. Numerous studies have been carried out on both statistical and machine learning based forecasting techniques. Deep learning based models have achieved commendable results across various fields, including natural language processing (Devlin et al., 2018; Brown et al., 2020), speech processing (Ogunfunmi et al., 2019), neural machine translation (Wu et al., 2016), image classification (Krizhevsky et al., 2012) and reinforcement learning (Silver et al., 2017). Moreover, recent deep architectures have also demonstrated significant improvements in accuracy over traditional time series models for time series forecasting (Rangapuram et al., 2018; Salinas et al., 2020), and specifically in the financial setting (Yan and Ouyang, 2018; Chen et al., 2019). One key reason these models handle the non-stationary nature and complications of the changing financial environment effectively is their ability to learn representations through a hierarchical hidden-layer structure. The multi-layer architecture allows deep models to process and analyze complex nonlinear temporal dependencies and establish proper latent representations.
In most real-world applications, multi-step or multi-horizon forecasts are more valued than single or one-step-ahead predictions. Long-term prediction mechanisms can provide key insights for optimizing resource allocation and assisting the decision-making process. Investors and financial firms can re-evaluate and efficiently plan their investment strategy to gain maximum profit by observing the longer predicted trajectory. Two major multi-horizon forecasting approaches built on deep learning architectures have been explored lately: iterative and direct (sequence-to-sequence). The iterative method recursively applies one-step-ahead prediction, feeding the predicted output back as input for the next forecast, while the direct approach overcomes the shortcoming of the recursive technique by forecasting the whole prediction vector at once.
While there have been numerous investigations of multi-step forecasting across diverse fields, such as electricity load consumption (Masum et al., 2018), traffic flow (Lv et al., 2014), renewable energy production (Ghaderi et al., 2017) and Electrocardiogram (ECG) analysis (Chauhan and Vig, 2015), very little research has focused on financial applications (Ouyang and Yin, 2018). A comprehensive analysis of the performance of sophisticated deep learning techniques for multi-step financial time series forecasting is lacking in the literature.
This work considers the most novel and relevant deep architectures and compares them in terms of prediction accuracy on financial benchmarks. Specifically, we focus on LSTM and its two variations (Bidirectional LSTM and stacked LSTM), the encoder-decoder architecture, TCN, ES-RNN and N-BEATS. The experiments are carried out on stock market datasets, including S&P 500, DJIA and NASDAQ 100, and forex markets, such as EURUSD, EURGBP and EURJPY, for multiple multi-step-ahead forecast horizons (2, 3, 5, 7 and 10 steps). To the best of our knowledge, such a comparison of multi-step forecasting using deep learning models for financial markets has not been done yet.
The remainder of the paper is structured into four sections. Work related to multi-step forecasting and financial multi-horizon forecasting using deep learning is described in section 2. Section 3 describes the deep learning models considered in this research. In section 4 we present the details of the experiments conducted and discuss the results. The conclusion, limitations and future work are provided in section 5.

Related Work
In the field of multi-horizon time series prediction, deep learning models have been employed increasingly due to their performance dominance over statistical and traditional time series models. The direct forecasting strategy generally consists of a sequence-to-sequence (Sutskever et al., 2014; Cho et al., 2014) architecture, where the encoder encodes the historical inputs to provide a compressed representation and the decoder is used to generate future predictions based on this vector. The overall model is jointly trained to generate the vector of forecasts for a pre-defined horizon. The Multi-horizon Quantile Recurrent forecaster (MQRNN) (Wen et al., 2017) generates a hidden latent representation of the historical time series using an LSTM, which is then fed to Multi-Layer Perceptrons (MLPs) to produce multiple quantile forecasts for multiple horizons. The authors in (Fox et al., 2018) propose a novel deep multi-output forecasting framework called DeepMo for predicting blood glucose trajectories. They introduce the concept of function forecasting, which predicts a representation of the data in contrast to learning the distribution of future values based on the past. To complement this model, the authors also develop a new architecture to model temporal dependencies and allow information propagation across the prediction window.
A new LSTM based architecture (Laptev et al., 2017) for extreme event forecasting at Uber uses an autoencoder for feature extraction; the extracted features are then combined using an ensembling technique and fed to an LSTM-based forecaster. This architecture provides a framework that is trained on heterogeneous time series and achieves significant improvement over traditional stacked LSTMs. Similarly, a novel Diffusion Convolutional Recurrent Neural Network (DCRNN) has been introduced for traffic forecasting. The framework integrates both spatial dependency, using bidirectional random walks on a directed graph, and temporal dependency, using the encoder-decoder architecture. The Higher-Order Tensor RNN (HOT-RNN) has been presented to address long-term forecasting challenges. It captures higher-order nonlinear dynamics using higher-order state interactions of previous hidden states, and its authors show that it is more expressive and accurate than the standard Recurrent Neural Network (RNN) and LSTM. With recent advancements, attention-based models have also been used, allowing the network to focus on relevant and important time steps and patterns in the historical data. In addition, Transformer-based architectures (Lim et al., 2019) have also been explored.
The iterative strategy, in contrast, operates by recursively feeding the single-step-ahead forecast back as a future input to obtain multiple forecasts. Inspired by the WaveNet (Oord et al., 2016) architecture, the authors in (Borovykh et al., 2017) extended it to predict financial time series, achieving better results than autoregressive and LSTM-based recurrent networks. The proposed model employs dilated convolutions followed by residual skip connections and uses the ReLU activation function to optimize training time. To account for correlation between financial time series, the model takes multivariate time series as input, which allows conditioning the forecast of a time series on its own past data as well as on that of other time series. The authors in (Hussein et al., 2016) utilized a coevolutionary RNN for multi-step time series prediction using the recursive technique, where cooperative coevolution and Back-Propagation Through Time (BPTT) are employed to train the neural network model.
There have been studies on multi-step-ahead probabilistic forecasting as well. DeepAR (Salinas et al., 2020) uses autoregressive recurrent neural networks to obtain a global model, trained on the historical data of all related time series, which generates a Gaussian distribution for the forecast. Deep State-Space Models (DSSM) (Rangapuram et al., 2018) follow a similar approach by exploiting recurrent neural networks to parameterize a pre-defined linear state-space model with a Kalman-filtering-based predictive distribution.
With regard to multi-step financial time series prediction, the authors in (Ouyang and Yin, 2018) extended the concept of self-organizing Autoregressive (AR) models to Varied Length Mixture (VLM) models to forecast financial time series over multiple steps. One significant advantage of such varied-length models is that they preserve the relationships among the input points within the forecast horizon. A comprehensive review of deep learning based financial time series prediction across various domains is presented in (Sezer et al., 2020). The authors observed that deep learning models performed better than machine learning models in most of the studies. Also, most research targets movement prediction of financial assets over the short term, and the literature on multi-step price prediction is still scarce. Specifically, a detailed overview of the applicability of deep models in the stock market domain is given in (Jiang, 2020).
In a recent study (Chatigny et al., 2020) on the multivariate multi-step setting, a novel variable-length attention mechanism is proposed to improve the performance of RNNs based on the Dynamic Factor Graph (DFG) framework, from which a new class of self-supervised generative neural architectures is also introduced. The overall model can effectively capture temporal dependencies in multivariate time series and performs well even with limited data. Hwang (2020) used an LSTM model with trainable initial hidden states, which allows the model to reconstruct an abstract representation of the time series along with its parameters and forecast future values based on the latent representation. A comparative analysis of the Autoregressive Integrated Moving Average (ARIMA), LSTM and Bidirectional LSTM (BiLSTM) for predicting various stock indices is performed in (Siami-Namini et al., 2019), which showed that BiLSTM performs better than the others.

Methodology
In this section, we provide an overview of the different deep learning models used in this study. We also briefly describe the multi-step prediction technique utilized for the comparative analysis.

Long Short-Term Memory
LSTM networks (Hochreiter and Schmidhuber, 1997) belong to a special category of the RNN family that overcomes the exploding and vanishing gradient limitations of simple RNNs (Hochreiter, 1998). By introducing an internal cell state (memory state) and gating mechanisms, they can capture long-range dependencies in the data while retaining short-term memory. They have achieved state-of-the-art performance in sequence learning domains such as machine translation (Sutskever et al., 2014), language modeling (Sundermeyer et al., 2015), signal processing (Yildirim, 2018) and audio and video processing (Eck and Schmidhuber, 2002; Liu et al., 2019). Applications of LSTM models in the financial domain (Fischer and Krauss, 2018; Heaton et al., 2017; Bao et al., 2017) have also shown promising results, outperforming other traditional statistical models.
The inner working structure of an LSTM cell is shown in Fig. 1. There are three main gates in each cell that contribute to the cell state ct: the input gate (it), the output gate (ot) and the forget gate (ft).

The forget gate controls how much information from the previous cell state should be removed. It takes two inputs, the output of the previous hidden state (ht-1) and the input at the current step (xt), and passes them through a sigmoid activation, which outputs a value between 0 and 1 for each element of the cell state. An output of 0 for a particular cell state value means the information is completely removed, whereas 1 means the information is fully retained:

ft = σ(Wf [ht-1, xt] + bf)

The input gate deals with updating the cell state with new information. It performs a similar sigmoid computation on the same inputs and acts as a filter on the information coming from the previous hidden state (ht-1) and the current input (xt). A candidate memory cell c̃t is also created by applying the tanh activation to the same inputs, squeezing the result between -1 and 1. It generates a vector of all candidate information that could be added to the cell state; negative values from the tanh function indicate dropping information from the cell state, while positive values suggest adding new information:

it = σ(Wi [ht-1, xt] + bi)
c̃t = tanh(Wc [ht-1, xt] + bc)

The element-wise product (denoted ⊙) of the input gate activation and the candidate vector defines how much each cell state value should be updated with the new information. The new cell state is then obtained by multiplying the forget vector (ft) element-wise with the previous cell state (ct-1) and adding the product of the input gate and the candidate vector:

ct = ft ⊙ ct-1 + it ⊙ c̃t

The output gate updates the hidden state (ht) of the cell and decides which information from the cell state is passed on to the next step. The cell state vector is passed through a tanh transformation to scale its values between -1 and 1 and is then multiplied by the sigmoid activation of the output gate, which determines whether each cell state value is emitted as the output for the next step and as the hidden state of the next cell:

ot = σ(Wo [ht-1, xt] + bo)
ht = ot ⊙ tanh(ct)
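As a concrete illustration, the gating operations above can be sketched in a few lines of NumPy. This is a minimal, generic LSTM step written for clarity; the fused weight matrix W and bias b are our own notation, not details taken from the paper:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps the concatenated [h_{t-1}, x_t] to the four gate pre-activations
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t, i_t, o_t, g_t = np.split(z, 4)
    f_t, i_t, o_t = sigmoid(f_t), sigmoid(i_t), sigmoid(o_t)  # forget, input and output gates
    c_tilde = np.tanh(g_t)                                    # candidate memory cell
    c_t = f_t * c_prev + i_t * c_tilde                        # new cell state
    h_t = o_t * np.tanh(c_t)                                  # new hidden state
    return h_t, c_t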

Stacked LSTM
Deep LSTM or stacked LSTM (Pascanu et al., 2013; Graves et al., 2013b) is an extension of the simple LSTM cell in which multiple LSTM layers are stacked on top of each other. Adding several layers brings more depth to the architecture and increases the level of abstraction of the input sequence over time (Pascanu et al., 2013). Figure 2 shows the structure of three stacked LSTM layers. The output of the lower hidden layer cell (h1t) is passed as input to the successive layer, while each layer maintains its own hidden state and cell state.
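In Keras, stacking amounts to setting return_sequences=True on every LSTM layer except the last, so that each layer passes its full output sequence to the layer above. The sketch below uses sizes matching the experimental settings described later, but is only illustrative:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(100, return_sequences=True, input_shape=(16, 1)),  # layer 1 feeds its full sequence upward
    LSTM(100, return_sequences=True),                        # layer 2
    LSTM(100),                                               # layer 3 returns only the last hidden state
    Dense(5),                                                # multi-step forecast vector (e.g., 5 steps)
])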

Bidirectional LSTM
Another variation of the LSTM network is the Bidirectional LSTM (Graves and Schmidhuber, 2005), which processes the sequential data in both the forward and backward directions using two separate hidden LSTM layers. BiLSTM connects both layers to the same output layer. The forward layer processes the information in the same direction as the given sequence, while the backward layer computes its operations on the inputs in reverse order. Given an input sequence (x) with time steps from t-n to t-1, the hidden state of the forward layer (→h) traverses the inputs from t-n to t-1, while the hidden state of the backward layer (←h) propagates from t-1 to t-n. Both layers consist of LSTM cells performing the standard operations. The final output of the BiLSTM layer is given by Equation (7), where the function φ() used to combine the two hidden states can be concatenation, summation, averaging or multiplication:

yt = φ(→ht, ←ht)                                              (7)

The architecture of an unfolded BiLSTM layer is shown in Fig. 3. BiLSTMs have achieved great success in sequential domains such as speech recognition (Graves et al., 2013a) and in time series forecasting tasks such as traffic speed prediction (Cui et al., 2018).
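In Keras this corresponds to wrapping an LSTM layer in the Bidirectional wrapper, whose merge_mode argument plays the role of the combining function in Equation (7) (concatenation by default). The sizes below are illustrative:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense

model = Sequential([
    Bidirectional(LSTM(100), merge_mode="concat", input_shape=(16, 1)),  # forward and backward passes
    Dense(5),                                                            # multi-step forecast vector
])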

Encoder-Decoder Model
The encoder-decoder architecture, or sequence-to-sequence model (Sutskever et al., 2014; Cho et al., 2014), was first introduced to overcome the limitation of RNNs in producing output sequences of arbitrary length. Since then, it has been widely used in neural machine translation (Bahdanau et al., 2014; Wu et al., 2016), speech recognition (Graves et al., 2013b; Chorowski et al., 2015; Bahdanau et al., 2016) and also in time series forecasting tasks (Qin et al., 2017; Liang et al., 2018). At the heart of this framework lie two sequential networks, namely the encoder and the decoder.

Fig. 4: Encoder-decoder model
The encoder network processes the input sequence X of length t one time step at a time and produces a fixed-dimensional compressed vector representation c, commonly termed the context vector or latent vector; this process of obtaining the context vector is called encoding. The context vector is usually the last hidden state of the encoder network (het). The decoder network then produces the output sequence (ŷ) given the context vector. While the decoder maintains its own hidden state, in a basic encoder-decoder setting the final hidden state of the encoder network (the context vector) is replicated across each decoder time step as input. Both the encoder and decoder networks can be a simple LSTM cell or stacked LSTM layers performing the standard gating operations, and they are jointly trained to minimize the cost function. A general overview of the architecture is depicted in Fig. 4.
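A minimal Keras sketch of this basic setting (assumed sizes, not the paper's exact configuration) uses RepeatVector to copy the context vector across the h decoder steps and a TimeDistributed dense layer to emit one value per step:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

h = 5  # forecast horizon
model = Sequential([
    LSTM(100, input_shape=(16, 1)),   # encoder: compresses the input window into a context vector
    RepeatVector(h),                  # replicate the context vector for each decoder time step
    LSTM(100, return_sequences=True), # decoder: maintains its own hidden state
    TimeDistributed(Dense(1)),        # one output value per forecast step
])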

Temporal Convolution Network
Belonging to the family of Convolutional Neural Networks (CNNs), which were initially dedicated to image datasets and computer vision tasks (Krizhevsky et al., 2012; Gu et al., 2018), TCN (Bai et al., 2018) is an extension adapted to sequential datasets and problems. After a series of thorough experiments, the authors claimed that TCN outperformed regular RNNs such as LSTMs on various benchmark datasets and tasks while demonstrating longer effective memory.
TCN rests on two principles. The first is that, given an input sequence of arbitrary length, the network maps it to an output sequence of the same length. This is achieved using a 1D fully-convolutional architecture, where each hidden layer has the same length as the input layer and zero padding of length (kernel size - 1) is added so that subsequent layers keep the same length as the previous one. The second principle is that there is no information leakage from the future to the past. To this end, the standard convolution operator is replaced by causal convolution, so that only information from the past is used for forecasting and the network has no access to future samples. In order to capture long-term dependencies in the sequential data and build a long effective history size, TCN makes use of dilated convolutions (Oord et al., 2016). A dilated convolution can skip outputs from the previous layer, which allows it to cover values farther back in the sequence (i.e., increase the receptive field). The dilated convolution F on an element s of the 1-D sequence x with filter f of size k can be expressed as:

F(s) = Σ_{i=0}^{k-1} f(i) · x(s - d·i)

where d is the dilation factor and k is the filter size. The receptive field can therefore be enlarged by choosing a larger filter size k or by increasing the dilation factor d. Figure 5 shows an example of a dilated causal convolution with a kernel size of 2 and dilation factors of [1, 2, 4]. In addition to dilated causal convolutions, TCN uses residual blocks (He et al., 2016) in place of a single convolutional layer to stabilize deeper and larger networks. A TCN residual block consists of two layers of dilated causal convolution, weight normalization, a rectified linear unit and spatial dropout. A 1×1 convolution is also added in each residual block to account for mismatched input and output sizes.
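A sketch of a dilated causal stack in Keras is shown below. It is an illustrative configuration of ours, chaining Conv1D layers with causal padding and growing dilation rates, not the exact residual-block implementation of Bai et al. (2018):

from tensorflow import keras
from tensorflow.keras.layers import Conv1D, Dense, Flatten

model = keras.Sequential()
model.add(keras.Input(shape=(16, 1)))        # input window of 16 time steps
for d in [1, 2, 4, 8]:                       # dilation factors; with kernel size 2 the
    model.add(Conv1D(filters=100,            # receptive field reaches 16 past steps
                     kernel_size=2,
                     dilation_rate=d,
                     padding="causal",       # no information leakage from the future
                     activation="relu"))
model.add(Flatten())
model.add(Dense(5))                          # multi-step forecast vector (e.g., 5 steps)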

Exponential Smoothing-Long Short Term Memory
This hybrid model (Smyl, 2020), an effective combination of the statistical Exponential Smoothing (ES) model and the modern neural-network-based LSTM model, won the M4 competition (Makridakis et al., 2020) by a significant margin. It is a hierarchical model that can be used to forecast multiple series, where the ES component captures local parameters of each series, such as seasonality and level, whereas the weights of the LSTM model account for global parameters shared by all series. A high-level architecture of ES-LSTM is shown in Fig. 6. Initially, Holt-Winters exponential smoothing (Hyndman et al., 2008) with multiplicative seasonality is computed; however, the trend component is omitted, as the model does not consider a linear trend in the series:

lt = α (yt / st-m) + (1 - α) lt-1
st = β (yt / lt) + (1 - β) st-m

where yt is the time series, lt is the smoothing or level component, st is the multiplicative seasonality coefficient, m is the number of observations per seasonal period and α, β are smoothing coefficients between zero and one.
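The level and seasonality recursion above can be sketched in plain NumPy as follows. The seeding of the initial level and seasonal coefficients is a crude simplification of ours, not the initialization used by Smyl (2020):

import numpy as np

def holt_winters_no_trend(y, m, alpha=0.5, beta=0.1):
    # Multiplicative-seasonality exponential smoothing without a trend component
    y = np.asarray(y, dtype=float)
    seasonal = list(y[:m] / y[:m].mean())   # crude seed for the first m seasonal coefficients
    level = y[:m].mean()
    levels = []
    for t in range(m, len(y)):
        s_prev = seasonal[t - m]
        new_level = alpha * (y[t] / s_prev) + (1.0 - alpha) * level
        seasonal.append(beta * (y[t] / new_level) + (1.0 - beta) * s_prev)
        level = new_level
        levels.append(level)
    return np.array(levels), np.array(seasonal)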
To produce a non-linear multi-step-ahead trend forecast, a neural network (RNN) is used instead; its output is subsequently re-seasonalized and de-normalized to produce the final forecast. The Holt-Winters component is thus combined with the RNN to obtain the forecast of the final hybrid model:

ŷt+1,...,t+h = RNN(Xt) · lt · st+1,...,t+h

where h is the forecasting horizon and Xt is a vector of deseasonalized and normalized time-series-derived features, each scalar component xi of which is calculated as:

xi = yi / (lt · si)

The neural network component employs a stack of dilated LSTM networks (Chang et al., 2017) interlinked with residual connections (He et al., 2016). Each block contains a sequence of one to four layers, with each layer belonging to one of the dilated LSTM categories: standard dilated LSTM (Chang et al., 2017), dual-stage attention-based LSTM (Qin et al., 2017) and residual LSTM (Kim et al., 2017).

Neural Basis Expansion Analysis for Interpretable Time Series Forecasting
At the fundamental level, the model consists of blocks, each of which is a multi-layered fully connected network with Rectified Linear Unit (ReLU) activations producing two outputs: the block's standard output over the given horizon (the forecast) and the best estimate of its own input given the functional constraints it can operate under (the backcast). The blocks are combined using a novel hierarchical doubly residual stacking topology. Unlike common residual architectures, which either concatenate the input of a layer to its output before passing it to the subsequent layer or add new connections from the output of each layer to the input of every layer that follows it, this architecture introduces two residual branches. The backcast residual branch makes forecasting easier for subsequent blocks by removing the backcast signal from each block's input, while the forecast outputs of the blocks are first aggregated at the stack level and finally at the overall network level to produce the global forecast. The high-level architecture of the model is depicted in Fig. 7.
The model is also designed to produce interpretable outputs for each stack by decomposing the series into trend and seasonality components. In addition, the model is associated with the concept of meta-learning, where the inner training loop is enclosed inside the basic building blocks while the outer training procedure is carried by the parameters of the overall network, learned through gradient descent.
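A minimal sketch of one generic block and the doubly residual wiring, written with the Keras functional API, is shown below. It is an illustrative simplification of ours, not the reference implementation; the block count and layer sizes mirror the experimental settings described later:

from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense, Subtract, Add

backcast_len, horizon, units, n_blocks = 16, 5, 100, 2

def generic_block(x):
    h = x
    for _ in range(4):                      # fully connected stack with ReLU activations
        h = Dense(units, activation="relu")(h)
    backcast = Dense(backcast_len)(h)       # best estimate of the block's own input
    forecast = Dense(horizon)(h)            # the block's contribution to the forecast
    return backcast, forecast

inp = Input(shape=(backcast_len,))
residual, forecasts = inp, []
for _ in range(n_blocks):
    backcast, forecast = generic_block(residual)
    residual = Subtract()([residual, backcast])   # backcast residual branch
    forecasts.append(forecast)
out = Add()(forecasts)                            # block forecasts are summed into the global output
model = Model(inp, out)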

Multi-Step Prediction Strategy
While there are several strategies defined in the literature for multi-step forecasting (Taieb et al., 2012), we focus only on the Multiple-Input Multiple-Output (MIMO) strategy (Bontempi, 2008), which has outperformed the other techniques and achieved the best results for the task (Taieb et al., 2012). This strategy employs a single model to output the vector of future values (the forecast) in one shot:

ŷt = [yt+1, ..., yt+h] = F(xt)

where xt is the input series vector at time t, F is the trained model and ŷt is the vector of predicted outputs for the input sequence.

Experiments and Results
We first describe the datasets used in this study. Then, the experimental settings of the different models are introduced, followed by the test strategy and evaluation metrics used. Finally, we compare and analyze the performance results of the various models on our benchmark datasets.

Dataset and Pre-Processing
In order to carry out a thorough comparison and analyze the ability of the models to produce long-term forecasts, we use six financial benchmark datasets from the stock market and exchange rate domains. These data have been widely used in the financial time series forecasting literature (Sezer et al., 2020). A description of the datasets is given in Table 1. All datasets are publicly available online and can be downloaded from the Yahoo Finance website 1. In our experiments, each dataset is split chronologically into a training set (80%), validation set (10%) and test set (10%). We perform a univariate analysis by considering only the closing price of each asset; the historical closing prices are used to predict the future values. A sliding window approach is used to create the supervised dataset from the training set, as shown in Fig. 8. We also preprocess the data to account for the large unscaled values, which affect training and slow down convergence. For each dataset, we normalize the values by subtracting the mean (μ) and dividing by the standard deviation (σ), so that the series has zero mean and unit standard deviation, as shown in Equation (17). The normalization is fit and applied on the training set, while the validation and test sets are only transformed, to prevent look-ahead bias:

zt = (yt - μ) / σ                                              (17)
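The windowing and scaling steps can be sketched as follows (the function name and the synthetic series are ours; the window length of 16 and the 5-step horizon match the settings described later):

import numpy as np

def make_windows(series, n_lags=16, horizon=5):
    # Each X row holds n_lags past values; each y row holds the next `horizon` values
    X, y = [], []
    for i in range(len(series) - n_lags - horizon + 1):
        X.append(series[i:i + n_lags])
        y.append(series[i + n_lags:i + n_lags + horizon])
    return np.array(X), np.array(y)

close_prices = np.cumsum(np.random.randn(1000)) + 100.0   # stand-in for an asset's closing prices
train = close_prices[: int(0.8 * len(close_prices))]
mu, sigma = train.mean(), train.std()                      # fit the scaler on the training split only
X_train, y_train = make_windows((train - mu) / sigma)      # validation/test sets reuse mu and sigma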

Evaluation Metrics
We use two evaluation metrics widely adopted in the financial time series forecasting domain (Guo et al., 2014; Sezer et al., 2020), the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE):

RMSE = sqrt( (1/N) Σ_{t=1}^{N} (yt - ŷt)² )                    (18)

MAE = (1/N) Σ_{t=1}^{N} |yt - ŷt|                              (19)

where N is the number of samples and yt and ŷt are the actual and predicted values, respectively. Both metrics are scale-dependent and measure how close the predicted values are to the actual ones. Hence, there is no definitive maximum value to use as a threshold; larger values indicate lower accuracy, while values closer to zero indicate higher accuracy and better performance. Since we compare the error scores of the different models in the same setting, the model achieving the lowest scores can be regarded as the best performing one.
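In code, the two metrics are a direct restatement of Equations (18) and (19):

import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))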

Walk-Forward Validation
The test dataset is evaluated using the walk-forward validation sliding window approach. In this method, we take the first n values of the test set, where n is the input time lag, for which the model predicts the next h future values at once. The window then shifts one step to the right, taking the actual values to predict again. This process continues until the end of the test set. In this manner, the model always predicts using the available true data. In our case, where we forecast multiple steps ahead, h = [2, 3, 5, 7, 10]. This process is similar to the sliding window approach in Fig. 8.
The RMSE and MAE are calculated at each sliding instance and, finally, averaged over the entire test set. We then compare the average RMSE and MAE of each model for each horizon.
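A sketch of this evaluation loop, reusing the rmse/mae helpers and the NumPy import from the snippets above (the function name and reshaping convention are ours):

def walk_forward_evaluate(model, test_series, n_lags=16, horizon=5):
    # Slide over the test set one step at a time, always forecasting from true past values
    rmses, maes = [], []
    for i in range(len(test_series) - n_lags - horizon + 1):
        window = test_series[i:i + n_lags].reshape(1, n_lags, 1)
        y_true = test_series[i + n_lags:i + n_lags + horizon]
        y_pred = model.predict(window, verbose=0).ravel()
        rmses.append(rmse(y_true, y_pred))
        maes.append(mae(y_true, y_pred))
    return float(np.mean(rmses)), float(np.mean(maes))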

Experimental Details
While training the models, several parameters have to be defined for each model. As a general setting for all models, the batch size is set to 32 and mean squared error is selected as the loss function. We adopt Adam (Kingma and Ba, 2014) as the optimization algorithm with the learning rate set to 0.001. All models are trained for up to 1000 epochs, with early stopping implemented as a callback to prevent overfitting. Specifically, we monitor the validation loss at the end of each epoch and stop training if the loss does not improve for 50 epochs. Based on the experiments conducted, we selected an input sliding window size (t) of 16 days, which represents the best trade-off between prediction accuracy and computational requirements. All LSTM-based models use 100 hidden units. The simple LSTM and BiLSTM models use a single hidden layer, while the deep LSTM model uses three layers stacked on top of each other. Both the encoder and decoder networks consist of a single LSTM layer with the same number of hidden units, used to obtain the compressed representation and to generate the output vector, respectively. In the ES-LSTM network, the seasonality was empirically set to 30 for all financial assets, and the same LSTM configuration with one layer and 100 neurons is used.
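The shared training setup can be expressed compactly in Keras; the sketch below shows it for the simple LSTM model (the builder function name is ours, and the fit call assumes windowed arrays such as those produced earlier):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

def build_simple_lstm(n_lags=16, horizon=5):
    model = Sequential([LSTM(100, input_shape=(n_lags, 1)), Dense(horizon)])
    model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")   # Adam, lr 0.001, MSE loss
    return model

model = build_simple_lstm()
early_stop = EarlyStopping(monitor="val_loss", patience=50)          # stop after 50 epochs without improvement
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1000, batch_size=32, callbacks=[early_stop])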
For the TCN model, the dilation factors are set to [1, 2, 4, 8] with a kernel size of 2, and the number of filters is set to 100 to have common ground with the LSTM models. The model employs a single stack of residual blocks and allows skip connections from the input to each block. With these parameters, the receptive field of the TCN network equals the selected input window size of 16.
In the N-BEATS architecture, the backcast length is set to 16 for every forecast length. Two blocks are used per stack, and the number of hidden units in each block is set to 100. Moreover, generic stack types are used, which do not rely on time-series-specific knowledge.
All models were implemented with the Keras library using the TensorFlow backend, and the experiments were conducted on a machine with an Intel(R) Core(TM) i7-9700F CPU and an Nvidia GeForce RTX 2080 Ti GPU. The models were trained multiple times to account for the random initialization of weights, and the average performance on the test set was recorded for comparison.

Results and Discussion
Our experimental results on the test sets of the six financial datasets are shown in Tables 2 to 7. Each table summarizes the average RMSE and MAE scores of the multi-step forecasts across the deep learning models used in this study. The error values are computed after post-processing, where the model predictions are re-scaled back to the original range of the actual values. The stock index values lie in a much higher range and exhibit much larger price movements than the forex values; hence, we observe a substantial difference between the RMSE and MAE scores of the stock and forex datasets. The best scores for each forecast horizon are highlighted in bold.
The results show that the deep neural models perform inconsistently depending on the domain and behaviour of the financial markets. For the S&P 500 index, the temporal convolutional model drastically reduces the RMSE score, by more than 50%, for all forecast horizons. The ES-LSTM and the purely neural N-BEATS models also show significant improvements. The deep LSTM model exhibited the poorest results, followed by the simple LSTM. The BiLSTM architecture beats the encoder-decoder on short-term forecasts but falls behind when the prediction horizon is longer (7 and 10 days).
The LSTM-based models, along with the encoder-decoder architecture, also performed poorly when applied to the DJIA index, with 57% higher error rates. Although the best performing model in this case was ES-LSTM, the RMSE and MAE scores of TCN and N-BEATS were also relatively low. However, it can be noted that the errors of the classical LSTM models are much higher for the DJIA stock index than for the S&P 500. Based on the results, it is also worth noting that the simple LSTM and BiLSTM outperform the sequence-to-sequence architecture, which is designed to capture complex temporal relationships effectively.
Similar to the other two stock markets, the deep LSTM recorded the highest error scores of all models for NASDAQ 100. The TCN architecture outperformed the other models for all prediction horizons except the 7-day-ahead forecast, for which ES-LSTM obtained a better result. N-BEATS also exhibited considerable accuracy compared to the remaining models. Among the memory-based models, the error metrics of BiLSTM are relatively lower than those of LSTM and the encoder-decoder.
Overall, we can observe that the state-of-the-art TCN, ES-LSTM and N-BEATS architectures heavily outperformed the other, more traditional models in the stock market domain for almost all forecast horizons. We can also note that the RMSE and MAE scores do not always increase gradually with the forecast horizon. This is consistent with previous findings on forecasting chaotic time series over multiple horizons (Bao et al., 2014).
The forex datasets behave differently from the stock indices. LSTM-based models prove more dominant and display better results in most cases. N-BEATS recorded the lowest errors for the short-term and long-range forecast horizons (2, 3 and 10 days) on EURUSD, whereas the deep LSTM outperformed the others for the mid-range (5- and 7-day) forecasts. Similarly, LSTM, stacked LSTM, BiLSTM and the encoder-decoder architecture performed better for most forecast horizons on the Euro to Pound market (EURGBP). In some scenarios, multiple models achieve similar accuracy for the same forecast horizon; for example, stacked LSTM and BiLSTM achieved the same score for the 2-day-ahead forecast, and both the encoder-decoder and TCN models recorded the lowest RMSE and MAE when forecasting 5 days ahead. The stacked LSTM exhibits the lowest error scores for the 3-, 5- and 7-day-ahead forecasts of the EURJPY exchange rate, while BiLSTM and the simple LSTM outperform the other models for the 2- and 10-day-ahead forecasts, respectively. Unlike the other two forex datasets, the RMSE and MAE scores are relatively high for all models for the long-range (10-day) forecast of Euro to Yen. It is interesting to note that ES-LSTM was the least accurate model for all three forex markets. Also, unlike the stock market datasets, the errors of almost all models grow gradually and nearly linearly as the forecast horizon increases.

Analysis of the results shows that sophisticated deep models provide a promising avenue for effectively capturing the underlying dynamics and patterns of the stock market. Specifically, the inherent hierarchical learning ability of the N-BEATS and TCN architectures, as well as the exponential smoothing preprocessing combined with the LSTM model, leads to remarkable results. However, the gate-based pure LSTM, Bidirectional LSTM and the sequence-to-sequence architecture outperform the state-of-the-art models in the forex market by a small margin.

Conclusion
In this study, we investigated the performance of the most relevant deep learning models for multi-step-ahead forecasting in the financial domain. The experiments were carried out on six real-world datasets from the stock market and forex domains. After a comprehensive assessment, we observe that the recent benchmark models for time series prediction, such as ES-LSTM, TCN and N-BEATS, showcase exceptional results in the stock market domain, while the classical LSTM, BiLSTM and encoder-decoder models still have the upper hand in predicting the forex markets. The results also indicate that the relationship between forecast horizon and error values is non-linear for most models in the stock domain, while the forex market manifests an approximately linear relationship for almost all networks.
This research is, however, limited to univariate analysis, where only historical price data are considered as input. The stochastic and dynamically driven financial domain is significantly impacted by several other external factors, such as news (Du and Tanaka-Ishii, 2020) and the interrelationship between multiple time series (Borovykh et al., 2017), which could be incorporated and examined in future work. Also, hyperparameter tuning for each model could be automated using an optimization algorithm (Bergstra et al., 2011).