Modeling Positive Time Series Data: A Neglected Aspect in Time Series Courses

Corresponding Author: Maman A. Djauhari, Institute for Mathematical Research (INSPEM), Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia; Email: maman_abd@upm.edu.my and maman_djauhari@yahoo.com

Abstract: Something has been forgotten in time series courses, in particular when dealing with positive datasets. To describe the pattern hidden in this type of dataset, before using a sophisticated modeling method such as Autoregressive Integrated Moving Average (ARIMA), we propose to check first whether the data represent a Geometric Brownian Motion (GBM) process. If they do, GBM time series modeling might provide the desired model through a simple, easy-to-digest procedure with lower cost and faster computation. Because of its simplicity and practicality, even non-statisticians with a very limited background in statistics can readily benefit from this method. In this study, unlike the standard approach found in the literature, the GBM process is approached from the log-normal process. This is the first result of this paper, which shows the simplicity of the GBM process. A strong indication that a process is a GBM process is the value of the serial correlation: the smaller the serial correlation of the log-returns, the higher the tendency that the process is a GBM process. As the second result, for practical purposes, a new procedure for time series modeling of positive data is introduced. These results show that, when dealing with positive datasets, GBM time series modeling is worth including in any introductory Time Series course, especially for non-statistics students. To illustrate the practical advantages of GBM time series modeling, real case studies from industry, government agencies and the internet are presented and discussed.


Introduction
A postgraduate student in biology came to our laboratory and requested consultation on how to find a good time series model to describe the hidden pattern behind a collection of time series data. She told us: "I feel uncomfortable when I use the methods of time series modeling available in the textbooks and statistical packages. As a non-statistician, the procedure is too complicated for me. Some statistical packages are like a black box. Now, I have to perform ARIMA. Could you suggest a simpler method that is computationally simple and fast?". After a short discussion, we found that standard methods of time series modeling such as ARIMA are too sophisticated and beyond her statistical knowledge.
We were not surprised by her request, as it is well known that ARIMA modeling requires special expertise and experience (Armstrong, 1984) and that the model identification procedure is very subjective (Gomez and Maravall, 2001). ARIMA needs human intervention in the model building process. Furthermore, in terms of computation, it might be time consuming and costly (Makridakis et al., 1982) and thus, in general, it is not apt for model building in a short period of time (Lusk and Neves, 1984). Although numerous statistical packages are available and some even provide automatic procedures, using an automatic procedure to find a good model under a complex method is not recommended, especially when comparable results can be achieved with simple models (Armstrong, 1985). Moreover, ARIMA could carry a greater risk of overfitting (Adhikari and Agrawal, 2013) since the number of parameters involved might be large. In this regard, during the early development of ARIMA, Box and Jenkins (1976) warned that a simple model with fewer parameters should be selected rather than a complicated one. These authors remarked that the goal of time series modeling is to find a simple method that is computationally cheap and fast and yields a simple model with fewer parameters and comparable accuracy. A model meeting these criteria is what the biology postgraduate student is actually seeking.
The problem encountered is not a simple matter. It is very common in practice for a student with a very limited background in statistics to face a sophisticated statistical problem. For statisticians who often work with non-statisticians, helping people like her remains one of the most challenging problems to handle. It is a considerable task (i) to provide our counterparts with a simple statistical method that is easy to digest and computationally cheap and fast and (ii) to enable them to benefit from statistics. In this regard, according to Hindls and Hronová (2015), having experience in teaching statistics to non-statisticians is a must. Therefore, we are indebted to her question, which motivated us to conduct this research.
There is no doubt that those who have a good background in statistics and are familiar with time series modeling can help her easily. They can find the best model without difficulty by using, for example, the Box-Jenkins method (Box et al., 2008). But for a non-statistician it is not easy to digest the process of such model building or to understand why a particular statistical package such as, for instance, the R programming language has chosen a certain model for a given dataset.
Time series modeling is one of the most exciting statistics subjects to teach. The students are invited to use their imagination and develop their critical thinking and creativity in modeling before doing any further data analysis. The success of modeling depends largely on personal creativity in finding the desired model. There is no single way to reach such a model. Researchers are free to find the desired model in their own ways. The only clue is given in the so-called Box-Jenkins general guidance (Box et al., 2008), which consists of: (i) model identification, (ii) parameter estimation and (iii) model validation. Therefore, it is not rare that two different experts in time series modeling choose two different models for the same time series dataset (Anderson, 1977).
One important thing that we noted during the presentation given by the biology postgraduate student is that the time series data of her concern are positive. In this study, we show that when we are dealing with positive time series datasets, instead of directly using a sophisticated modeling method such as ARIMA, it is better to check first whether the data represent a GBM process. If they do, GBM time series modeling could be very attractive due to its simplicity and practicality. Moreover, its accuracy is comparable and the computation is cheap and very fast. This is an aspect of time series courses that has long been neglected.
The rest of the paper is devoted to showing this claim. Our aim is to find a simple modeling method that provides a simple model with fewer parameters but comparable accuracy for positive time series data. In the next section, we begin our discussion with the motivation that led us to conduct this research, followed by the methodology of GBM time series modeling. To start with, this model is approached from the log-normal time series by considering the difference $\Delta X_t = X_{t+\Delta t} - X_t$ when $\Delta t$ tends to 0. Section 3 presents and discusses five case studies to illustrate the advantages of this model. Concluding remarks in Section 4 close this presentation. They consist of a pedagogical issue and a proposal to include GBM modeling in an introductory Time Series course.

Motivation
Positive time series data can be found easily in many scientific investigations using time series analysis. Interestingly, when we are dealing with positive time series data, log-normally distributed data are also common in practice. As we will show in the next subsection, for log-normal time series data, there is a good chance that they are governed by a GBM process.
The idea to conduct this study was inspired primarily by the work of economists. GBM is used to model the price behavior of economic commodities, under which the log-return follows an AR(1) process with a constant term. Since the day GBM was popularized by the Nobel laureate Paul Samuelson (Taqqu, 2001), a great number of applications in different areas have been developed, not only for continuous but also for discrete processes. Examples include modeling the wear of a machine (Rishel, 1991), forecasting river flow (Lefebvre, 2002), accelerated life testing and failure models (Park and Padgett, 2005), supply chain management to predict future procurement and sales trends (Wattanarat et al., 2010), dynamic capacity planning (Chou et al., 2007), energy prices (Esunge and Snyder-Beattie, 2011) and modeling the number of aircraft passengers (Asrah and Djauhari, 2013). In what follows we will see how GBM time series modeling works.

Methodology
Let $\{X_t\}$ be a time series satisfying the lag-1 difference model:

In a general form, for any time interval $T$, we have:

As a consequence, if we write

In the above models, $X_t$ is normally distributed. Thus, its value could be positive, zero or negative. Now, an interesting property emerges. If $X_t$ is positive and log-normally distributed, $\ln(X_t)$ satisfies the model in Equation 1 and $\Delta t \to 0$, then it satisfies this SDE:

Since the noise term of this SDE is driven by a Wiener process, it is well known that the stochastic process $\{X_t\}$ satisfying Equation 3 is a GBM process.
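For orientation, the textbook form of a GBM, as found in standard references such as Oksendal (2002), can be written as below. The symbols $\mu$ (drift), $\sigma$ (volatility) and $W_t$ (Wiener process) are assumptions of this sketch and may differ from the notation used in Equations 1 to 3.

```latex
% Standard GBM with drift \mu, volatility \sigma > 0 and Wiener process W_t.
% The second identity follows from Ito's lemma applied to \ln X_t.
dX_t = \mu X_t \, dt + \sigma X_t \, dW_t ,
\qquad
d\ln X_t = \left( \mu - \tfrac{1}{2}\sigma^{2} \right) dt + \sigma \, dW_t .
```

The second identity shows that $\ln X_t$ is a Brownian motion with drift, which is the link between log-normal time series and the GBM process used in this section.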
The solution of the SDE in Equation 3 is well known and can be found in the standard SDE literature such as, for example, Oksendal (2002), Wilmott (2007) and Ross (2011). If $X_0$ is the initial value of $X_t$ satisfying Equation 3, then the general solution is:

From this solution, the following properties are straightforward:
• Consequently, $\{X_t\}$ is memoryless, a desired property in time series modeling
• We can equivalently write, which is similar to the model in Equation 2
• The log-returns mean 0 and constant variance which is similar to the model in Equation 1

The third property is a natural consequence when we are dealing with (i) a positive and log-normally distributed time series $X_t$, (ii) $\ln(X_t)$ satisfying Equation 1 and (iii) $\Delta t \to 0$ (Wilmott, 2007; Ross, 2011), i.e.:

Under this model, the predicted value of $X_t$ is:

In the next section, five examples are presented to show that (i) positive time series data might be governed by a GBM process and (ii) there is a good chance of describing the data by a GBM time series model.
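The modeling steps described in this section can be sketched in a few lines of Python. This is a minimal sketch under a stated assumption: the regression model is taken to be a lag-1 regression of the log-returns with a constant term, $R_t = c + \theta R_{t-1} + e_t$, which matches the description in the Motivation section but is not confirmed by the equations available here. All function names are hypothetical.

```python
import math

def log_returns(x):
    """R_t = ln(X_t / X_{t-1}) for a strictly positive series."""
    if any(v <= 0 for v in x):
        raise ValueError("GBM modeling requires strictly positive data")
    return [math.log(b / a) for a, b in zip(x, x[1:])]

def fit_lag1_regression(r):
    """Least-squares fit of R_t = c + theta * R_{t-1} + error (assumed form)."""
    prev, curr = r[:-1], r[1:]
    n = len(curr)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    theta = sxy / sxx
    c = mean_curr - theta * mean_prev
    return c, theta

def one_step_predictions(x, c, theta):
    """Predict X_t from X_{t-1} and R_{t-1}; the first prediction is for x[2]."""
    r = log_returns(x)
    return [x[t - 1] * math.exp(c + theta * r[t - 2]) for t in range(2, len(x))]

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)
```

For a positive series `x`, one would call `c, theta = fit_lag1_regression(log_returns(x))` and then compare `one_step_predictions(x, c, theta)` with `x[2:]` using `mape`. Only a logarithm and a simple linear regression are involved, which is the point made in the text about the simplicity of the method.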

Case Study, Results and Discussion
Five case studies are used to present and discuss the advantages of GBM time series modeling. Three of them are real examples from three different industries in which we have worked, one is borrowed from a specialized textbook on time series and the last one was downloaded from the official website of a government agency.

Case 1 (Moisture Content)
Moisture content in the cocoa powder industry is an important quality characteristic that needs careful attention from management. Since it is time dependent, careful and rigorous time series modeling is necessary to understand its behavior over time. Furthermore, since the data are positive, a GBM model could possibly be used.
During a preliminary study, the presence of autocorrelation in the data is visualized by using lag-1 scatter plot as suggested in NIST/SEMATECH (2012). The plot is presented in Fig. 1.
To confirm the independence and normality of $R_t$, two statistical tests are used: the Durbin-Watson test D for testing the absence of autocorrelation and the Anderson-Darling test AD (Anderson and Darling, 1954) for testing the departure from normality. Figure 1 strongly indicates that the autocorrelation cannot be neglected. To test its significance, the Durbin-Watson test D (Durbin and Watson, 1950; 1951; 1971) is used. The result confirms that moisture content is significantly autocorrelated (D = 0.0032 and, at the 5% significance level, the critical points are $D_L$ = 1.6540 and $D_U$ = 1.6944).
Interestingly, as we can see in Fig. 2, the run chart of the log-return $R_t$ shows a typical condition for a process to be considered as a GBM. Figure 3, which represents the (a) lag-1 scatter plot and (b) QQ-plot of $R_t$, clearly indicates that the $R_t$ are i.i.n.d. In fact, at the 5% significance level:
• The autocorrelation is not present (D = 2.3174 with $D_L$ = 1.6522 and $D_U$ = 1.6930)
• The normality of $R_t$ cannot be rejected (Anderson-Darling test AD = 0.3190 with p-value 0.5298)

Thus, moisture content is a GBM process.
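The serial-correlation check used above can be sketched as follows. The paper does not state whether D is computed on mean-centred values or on regression residuals; this sketch assumes mean-centred values, and the function names are hypothetical.

```python
def durbin_watson(series):
    """D = sum((e_t - e_{t-1})^2) / sum(e_t^2), with e_t the mean-centred values.

    Values near 2 suggest no lag-1 autocorrelation; values near 0 suggest
    strong positive autocorrelation.
    """
    mean = sum(series) / len(series)
    e = [v - mean for v in series]
    num = sum((b - a) ** 2 for a, b in zip(e, e[1:]))
    den = sum(v ** 2 for v in e)
    return num / den

def lag1_autocorrelation(series):
    """Sample lag-1 serial correlation of the series."""
    mean = sum(series) / len(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    den = sum((v - mean) ** 2 for v in series)
    return num / den
```

In practice one would apply `durbin_watson` first to the raw data and then to the log-returns, and compare each value with the tabulated critical points $D_L$ and $D_U$ for the sample size at hand. For long series, D is approximately 2(1 - r), where r is the lag-1 serial correlation.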
To find the fitted model of the process data, we calculate the estimates $\hat{c}$ and $\hat{\theta}$ in the regression model in Equation 5 and we find $\hat{c} = -0.000977$ and $\hat{\theta} = -0.160692$. Accordingly, from Equation 6, the model is:

with MAPE = 4.59% and running time 0.11 sec (CPU time).

According to the Diebold-Mariano test (Diebold and Mariano, 1995), the prediction accuracy of the GBM and ARIMA models is not significantly different (DM = -1.4357 and p-value is 0.1543). Therefore, the GBM model in Equation 7 is preferable to the ARIMA model in Equation 8 due to its simplicity and practicality with shorter running time. This result shows that the GBM model might be as accurate as the ARIMA model.

Case 2 (Brush Housing)
Brush housing is an important part of a vacuum cleaner. It is produced in the plastics industry. During the production process, its bending must be monitored over time. From the data that we have collected, the lag-1 scatter plot presented in Fig. 4 indicates the presence of autocorrelation.

Fig. 4. Lag-1 scatter plot for brush housing

Accordingly, we strongly suspect that the bending of brush housing is autocorrelated. In fact, the process is significantly autocorrelated (D = 0.0059 and, at the 5% significance level, the critical points are $D_L$ = 1.6345 and $D_U$ = 1.6794). Therefore, a time series model is needed to understand the behavior of the process.
Since the data are positive, we now claim that the process is a GBM process. Figure 5, which shows the run chart of the log-returns, strongly supports that claim.
Indeed, the data give us:
• D = 2.1167 with $D_L$ = 1.6324 and $D_U$ = 1.6778 (at the 5% significance level), which means that the $R_t$ are independent
• AD = 0.2948 with p-value = 0.5902, which implies that the normality of $R_t$ cannot be rejected at the 5% level of significance

These results show that the data represent a GBM process and, accordingly, the model is:

with MAPE = 6.46%. The running time to get this model is 0.17 sec. What if ARIMA is used? When an ARIMA model is used, the ACF and PACF lead to the following best model:

Although, in terms of MAPE, the model in Equation 10 is seemingly better than the GBM model in Equation 9, the Diebold-Mariano test shows that the two are not significantly different (DM = -0.7339 with p-value = 0.465). Therefore, in this example also, GBM modeling is preferable to ARIMA modeling due to its simplicity, speed and practicality with comparable accuracy.
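The comparison of prediction accuracy used in Cases 1 and 2 can be sketched as below. This is a simplified one-step-ahead version of the Diebold-Mariano statistic. The squared-error loss and the normal approximation for the p-value are assumptions of this sketch, since the paper does not state which loss function or reference distribution was used, and the function name is hypothetical.

```python
import math

def diebold_mariano(actual, pred_a, pred_b):
    """One-step DM statistic with squared-error loss and a two-sided normal p-value.

    A negative statistic means forecast A has the smaller average loss.
    """
    d = [(y - a) ** 2 - (y - b) ** 2 for y, a, b in zip(actual, pred_a, pred_b)]
    n = len(d)
    d_bar = sum(d) / n
    var_d = sum((v - d_bar) ** 2 for v in d) / n  # lag-0 autocovariance of d
    dm = d_bar / math.sqrt(var_d / n)
    p_value = math.erfc(abs(dm) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|dm|))
    return dm, p_value
```

A large p-value, as reported in both cases, means that the data give no evidence that one model predicts better than the other, so the simpler model can be retained.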

Case 3 (Electricity Consumption)
Maximum daily electricity consumption in Malaysia over the one-year period from September 2005 to August 2006 is investigated. The data belong to Malaysia National Power Ltd. (TNB) and were provided by Professor Zuhaimy Ismail, Department of Mathematical Sciences, Universiti Teknologi Malaysia. We thank him for the opportunity to use those data. From the data we derive that the GBM model is:

with MAPE = 6.89% and running time 0.12 sec. This is a highly accurate model. On the other hand, if we use ARIMA, the model is ARIMA(4,1,5):

with MAPE = 4.17% and running time 5.24 sec. Although both models in Equations 12 and 13 are of high accuracy, the former is more parsimonious than the latter.

A more surprising result is obtained if further analysis is conducted to investigate the seasonal effect. It is confirmed that this effect occurs with a seasonal period of 7. If this effect is incorporated in the model, the appropriate SARIMA model is SARIMA(1,0,0)(0,1,1)$_7$:

with MAPE = 2.66%. In terms of MAPE, this model is better than the two previous ones. However, if we use both the GBM model and ARIMA on deseasonalized data and then transform back to the original data, we come up with the GBM model:

with MAPE = 2.61% and the ARIMA model:

with MAPE = 2.57%. Here:

Interestingly, the MAPE of this model is 1.034%, which is highly accurate, and only 0.19 sec are needed to obtain it. This result is exactly the same as that given by the ARIMA model because daily oil palm price data are governed by an AR(1) process. However, ARIMA needs 5.55 sec to give the desired model. This example, like the previous one, demonstrates the advantage of GBM modeling compared to ARIMA.

Case 5 (Chemical Process Viscosity)
In this example, we show further evidence that GBM time series modeling is advantageous. This example concerns time series modeling of hourly chemical process viscosity readings borrowed from Box et al. (2008), Series D. We find that the predicted GBM model is:

Pedagogical Issues
The development set forth in this study has been presented to and discussed with the biology student mentioned earlier. We are happy to find that she is very satisfied with the above results. GBM modeling is really helpful in solving her problem: it offers model building that is easy to digest, simple to implement and computationally efficient, and it might give a highly accurate model. She wrote: "Once I am exposed to the method of GBM modeling, I prefer to use it since it is easy to digest, provides faster solution and might give the desired prediction accuracy." The method developed and the results obtained in this study have also been introduced in a postgraduate Time Series class. The way we conduct the class has been revised so that positive datasets are given special emphasis. At the end of the class, all students were asked to write down their reactions. Their reactions were very encouraging and thus we believe that we were on the right track when we modified our way of lecturing.
GBM modeling has made the biology student feel comfortable and satisfied. The same holds for the students in the postgraduate Time Series class. We suggest that, when dealing with positive time series data, they:
• Use the GBM model in Equation 6 first before searching for an ARIMA model. As a useful indication, if the data represent a GBM, the serial correlation of the log-returns is small
• Use the GBM model for further analysis if its accuracy is as desired. Otherwise, use an ARIMA model or another model

Time Series Modeling
If the time series data are positive, GBM modeling can be considered first before using another modeling method such as ARIMA. Due to its simplicity and practicality with shorter computational running time, if the accuracy of the GBM model is as desired, then it is preferable to the latter. GBM modeling does not need special statistical skills except logarithmic transformation and parameter estimation in simple linear regression. Therefore, it is easy to digest, computationally cheap and fast, and simple to implement even by those who have a very limited background in statistics. Due to these advantages, to close this presentation, we propose a procedure for modeling positive time series data that might be included in an introductory Time Series course. Figure 6 shows the flow chart of this procedure. In this figure, the yellow area refers to the current steps of ARIMA model building while the green area refers to the proposed steps.

Step 2. Compute the ratio of $X_t$ and $\mathrm{CMA}_t$ obtained in Step 1. Example:
Step 3. Compute the unadjusted factor UnAdj($F_t$):
Step 5. Deseasonalized data at time $t$ is:
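The deseasonalization steps listed above resemble the classical ratio-to-moving-average method, which can be sketched as follows. Steps 1 and 4 are not available here, so this sketch assumes that Step 1 computes a centred moving average $\mathrm{CMA}_t$ over one seasonal period and that Step 4 rescales the unadjusted factors so that they average 1. Both are assumptions, as are the function names.

```python
def centered_moving_average(x, period):
    """CMA_t for an odd period: mean of the window centred at t (None at the edges)."""
    if period % 2 == 0:
        raise ValueError("this sketch handles odd periods only, e.g. 7")
    half = period // 2
    cma = [None] * len(x)
    for t in range(half, len(x) - half):
        cma[t] = sum(x[t - half:t + half + 1]) / period
    return cma

def seasonal_factors(x, period):
    """Adjusted multiplicative seasonal factors, one per position in the period.

    Needs at least 2 * period - 1 observations so every position has a ratio.
    """
    cma = centered_moving_average(x, period)
    ratios = [[] for _ in range(period)]
    for t, (value, avg) in enumerate(zip(x, cma)):
        if avg is not None:
            ratios[t % period].append(value / avg)      # ratio X_t / CMA_t
    unadjusted = [sum(r) / len(r) for r in ratios]      # unadjusted factors
    scale = period / sum(unadjusted)                    # assumed adjustment step
    return [f * scale for f in unadjusted]

def deseasonalize(x, period):
    """Divide each observation by the factor of its position in the period."""
    factors = seasonal_factors(x, period)
    return [value / factors[t % period] for t, value in enumerate(x)]
```

For the electricity consumption data in Case 3, `period` would be 7. A model fitted to `deseasonalize(x, 7)` yields predictions that are multiplied by the corresponding factor to return to the original scale.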