Model Comparison for the Prediction of Stock Prices in the NYSE

The stock market is an integral part of investments as well as the economy. The prediction of stock prices is an exciting and challenging problem that has been considered by many due to the complexity and noise within the market and to the potential profit that can be yielded from accurate predictions. We aim to construct and compare models used for the prediction of weekly closing prices for some of the top stocks in the New York Stock Exchange (NYSE) and to discuss the relationship between stock prices and the predictor variables. Relationships explored in the study include those with macroeconomic variables such as the Federal Funds Rate and the M1 money supply, and with market indexes such as the CBOE Volatility Index, the Wilshire 5000 Total Market Full Cap Index, the CBOE interest rate for 10-year T-notes and bonds, and NYSE commodity indexes including XOI and HUI. Models are built using methods of regression analysis and time series analysis. Models are analyzed and compared with one another by considering their predictive ability, accuracy, fit to the underlying model assumptions, and usefulness in application. The final models considered are a pooled regression model involving the median weekly closing price across all stocks, a varying intercept model considering the weekly closing price for each individual stock, and an ARIMA time series model that predicts the median weekly closing stock price based on past prices.

When it comes to creating an investment portfolio to build up held assets, there are several different asset classes to choose from, including bonds, cash equivalents, and equities. Equities, or stocks, are the most volatile of these asset classes but also have the ability to create a large profit. The key is to know when to buy and when to sell. This is why understanding patterns in equity prices is vital to those who wish to invest in the stock market.
The purpose of this study is to estimate the values of time series data within each selected stock of the New York Stock Exchange (NYSE). The study focuses on various stocks within a single exchange rather than within an index such as the S&P 500, because we want the estimated values to be consistent within the same market. Stocks within the S&P 500 are sold on different exchanges, which can affect the estimated price of a stock. Because of this, the model provides insight exclusively for participation within the NYSE. Out of all possible exchanges, the NYSE was chosen since it is the largest stock exchange in the world.
Stock data is time series since it contains prices over time where each time point is related to the previous time point. It is possible to look at stock prices over different time intervals such as hourly, daily, weekly and monthly. This study looks at weekly time intervals. The reason is that when looking at daily stock data, there are some days that are missing. Excluding these data points would make the intervals between time points unequal, which would not be appropriate for time series analysis. A solution to this problem would be to fill in the missing data. However, this study chooses to instead use weekly time points since there can be a great amount of variation among stock prices per day. This variation could be largely due to white noise which we are not necessarily interested in tracking. This study is interested in looking at trends over longer periods of time.
There are approximately 2,800 companies that have equities listed in the NYSE. Instead of trying to fit a model that encompasses all of these stocks, this model focuses on a subset of stocks within the NYSE. There are two options for doing this. The first is to take a random sample of stocks and create a model looking at the stocks within that sample. The second is to take a selection of the most common or most popular stocks within the exchange. This study follows the latter option. Using a model to create predictions for more popular stocks makes the study more relevant in application for those who participate in the NYSE. On the other hand, a random selection could include more obscure stocks, which, although it could provide a more diverse look at equities, would be less practical in application. The selection of common stocks modeled in this study are those within the NYSE 100. Some of these stocks belong to the Utilities (UT) sector, which includes electric, gas, and water utilities. Sectors are treated as categorical variables in the model and are discussed later in this study.

Macroeconomic Variables
Changes in stock prices are connected to several aspects of the economy, including those at the highest levels. Macroeconomic variables are those features of a national or international economy that describe the state of the market as a whole. These variables tend to be recorded monthly or annually rather than weekly. Because of this, macroeconomic variables are more useful for observing trends over long periods of time rather than accounting for variation in the short run.
Within this modeling process, we consider several of these factors to examine possible relations between the changes in equities and the economy at a macroeconomic level. One such factor is the CBOE Volatility Index (VIX), which measures the market's expectation of volatility, that is, how quickly prices are expected to change. A higher amount of volatility also indicates increased uncertainty and risk in the market, which can deter investment and spending. The data obtained is weekly.
The VIX data was originally released by the CBOE and was retrieved from FRED. This study defines VIX as VIX_t for each week t = 1, ..., 939 (Online, c). The CBOE also has an index that measures the interest rate for 10-year T-notes and bonds.
TNX is the ticker symbol for this index. Equities, or stocks, are a type of asset class along with bonds. Since both equities and bonds are used in financial portfolios, it is possible that changes in bond rates can affect whether a person decides to invest in stocks or bonds. The data obtained is weekly and was retrieved from Yahoo! Finance. This study defines TNX as TNX_t for each week t = 1, ..., 939.
The Wilshire 5000 Total Market Full Cap Index is one of several Wilshire Indexes. The Total Market Index is known as a comprehensive measure of equity in the U.S. market, including the average price of nearly 5000 different stocks from various exchanges. The data obtained is weekly, was originally published by Wilshire Associates, and was retrieved from FRED. This study defines the Wilshire 5000 as WIL_t for each week t = 1, ..., 939 (Online, e).

Stock prices can also be related to prices of major commodities within the United States. The NYSE provides current prices of various commodity groups alongside the prices of their equities for reference. This study looks at three different commodity groups. The first is softs and includes goods such as coffee, cocoa, sugar, and cotton. The NYSE does not have an index for summarizing the prices of softs, so instead this model looks to the other two commodity groups.
The second commodity group is energy and includes fuels such as gas and oil. In this study, the changes in prices of fuels are modeled using the NYSE ARCA Oil and Gas Index (XOI). The index provides the average prices of major oil and gas components within the market. The data is weekly and was retrieved from Yahoo! Finance. This study defines XOI as XOI_t for each week t = 1, ..., 939. The third and final commodity group is precious metals and includes rates for gold, silver, and platinum. In this study, the changes in prices of precious metals are modeled by the NYSE ARCA Gold Bugs Index (HUI). The index provides the average prices of stocks in companies within the gold mining industry. The data obtained is weekly. This study defines HUI as HUI_t for each week t = 1, ..., 939.

The prediction of stock prices is an interesting and challenging endeavor that has been considered by economists, financial analysts, statisticians, and computer scientists alike. The stock market has intrigued many due to the complexity and uncertainty of the market as well as the potential financial gain that can come from accurate predictions. A classic dream is to "make it big" on the stock market, and data modelers have considered many different techniques to achieve this end.
In the literature, perhaps the most common method for stock prediction is through machine learning and neural networks due to their versatile nature in using many predictor variables and their lenient model assumptions. The most common neural network is the Artificial Neural Network (ANN), seen in studies such as one by Moghaddam, Moghaddam, and Esfandyari (2016), who consider the prediction of daily NASDAQ rates using the day of the week and historical prices as inputs to produce accurate predictions. The drawback with neural networks is that they act as a sort of "black box" building relations between the stock prices and the predictors, meaning that it is difficult to interpret the model and the relationships therein. So although neural networks and machine learning tend to be the most popular choice for stock predictions, this study turns to more classical methods, including multiple linear regression and time series analysis, to provide more meaningful interpretations. Chang, Wang, and Zhou (2012), who study daily stock trends using another type of neural network, the evolving partially connected neural network (EPCNN), explain that "mining stock market trend is a challenging task due to its high volatility and noisy environment". Stock prices can be very volatile, especially in the short run, which is why, unlike many other studies, this study considers a longer time interval using weekly data rather than daily data to account for this noisy short-term environment. Chang, Wang, and Zhou (2012) also express the strong relationship between stock trends and other outside factors, which is why in this study we consider many other economic variables as described in Chapter 1.
This study takes advantage of the ability of classical models to provide insight into the relationship between stock prices and other predictor variables. We also consider longer time periods to account for the excess noise in the short term.
Previous studies such as those by Al-Tamimi, Alwan, and Rahman (2011) and Sharif, Purohit, and Pillai (2015) consider regression analysis for the prediction of stock prices using other predictor variables even though it is understood that there is a dependent relationship within the data due to its time series nature. However, as this study will show, despite failing to meet the underlying assumption of independence, regression can still be used to statistically show the relationships between stocks and other variables that are known from an economic standpoint.

Outline of the Study
In this section we briefly discuss what you can expect to see in this study. As stated previously, the goal of this study is to compare various models used for the prediction of the weekly closing stock price for selected stocks in the NYSE.
In Chapter 1, you become familiar with the data that was collected and is used for the modeling process as well as for testing the fit of the models. In Chapter 2, we discuss previous literature related to the prediction of stock prices. Chapters 3 through 5 present the pooled regression model, the varying intercept model, and the ARIMA time series model, respectively. The study is concluded in Chapter 6, which gives comparisons of the three models as well as discusses further considerations for modeling.

CHAPTER 3
POOLED REGRESSION MODEL

Multiple Linear Regression
Multiple linear regression is a model used to create predictions based on information that is known about other variables. This study uses regression models to show how the variation in stock prices is related to our predictor variables and how these variables can be used to explain variation in prices.
Since the data are time series, we consider time as one of these predictor variables. In this study we consider several regression models; however, this chapter focuses on a pooled model. To create the pooled model, we consider predicting Y_t, which represents the median closing stock price for each week t. When pooling the closing price over all stock indexes, we consider the median rather than the average. Averaging smooths away the variability within the data over time, while the median retains the patterns of variation.
Furthermore, the distribution of prices at time t tends to be right skewed. This is because there are some larger and more popular companies within the NYSE 100 that have much higher stock prices than other companies. Figure 3.1 illustrates the difference in the distribution of closing prices when looking at the price for each stock for each week as opposed to considering the median price for each week. For the original price data, we see that the distribution is highly right skewed.
For the median price, we see that the distribution is no longer skewed, but it does appear to be bimodal. This could be due to fluctuations in the economy over time causing prices to sit at different levels over different periods. Considering the median is a different perspective on pooling index prices compared to the calculations usually used for stock market indexes, such as the S&P 500, which considers a weighted average price that depends on the number of stock shares for each company. We also remember that one of our predictor variables, the volume of stock i sold at time t, is a variable that depends on the stock index. As with the closing price, in this model we consider the median volume sold at time t over each stock i. In other words, we are looking at V_t rather than V_it.
When conducting regression modeling, we first partition the data into two parts: training and testing data. The training data is used in the process of creating the model. Once created, we use the chosen model to make predictions for the testing data in order to check the fit of our model and make sure that the model is not overfit to the training data. Since the pooled data contains one observation for each week t, the data set for this model contains 939 data points. The data are randomly partitioned into the two groups with a 70%-30% split. In other words, there are 657 observations in the training data and 282 observations in the testing data. For the purpose of creating the model, we use only the training data. We will look at the testing data later in our analysis.
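The partitioning step can be sketched in a few lines of Python. This is a minimal illustration with a fixed random seed (an assumption made here for reproducibility; the study does not state one):

```python
import numpy as np

rng = np.random.default_rng(42)        # seed is an assumption, for reproducibility

n_weeks = 939                          # one pooled observation per week
shuffled = rng.permutation(n_weeks)

n_train = int(round(0.7 * n_weeks))    # 70% of 939 weeks -> 657
train_idx = np.sort(shuffled[:n_train])
test_idx = np.sort(shuffled[n_train:])

print(len(train_idx), len(test_idx))   # 657 282
```

Sorting the indices afterward keeps each subset in time order, which is convenient when plotting predictions against the observed series.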
Before creating the model, we look at the mathematical qualities that go into the creation of a multiple linear regression model. For this model, we assume that the relationship between the median closing stock price at week t and our selected set of predictor variables, X_t1, ..., X_tk, roughly follows the linear regression model

Y_t = β_0 + β_1 X_t1 + ... + β_k X_tk + ε_t,

where ε_t is a random variable that represents the error. We suppose E(ε_t) = 0, such that the expected median closing stock price is a function of our predictor variables alone. β_0 is the intercept and β_1, ..., β_k are the slopes. All of the beta coefficients are defined as fixed and unknown parameters. The regression model uses the least squares estimates of β_0, β_1, ..., β_k, which are the values that minimize the residual sum of squares

RSS(β) = Σ_{t=1}^{657} (Y_t − β_0 − β_1 X_t1 − ... − β_k X_tk)².

The notation for a multiple regression model can be simplified by writing the model in terms of vectors and matrices. We set Y as the vector of median closing prices Y_t for t = 1, ..., 657, and X as the 657 × (k + 1) matrix containing the k predictor variables X_tj for t = 1, ..., 657 and j = 1, ..., k.
The first column of the matrix is a vector of all ones corresponding to the intercept. We then have β as the vector of the unknown parameters β_0, ..., β_k and ε as the vector of error terms for t = 1, ..., 657. Defined with matrices, we can rewrite the model as

Y = Xβ + ε

and the residual sum of squares as

RSS(β) = (Y − Xβ)ᵀ(Y − Xβ).

Finally, we define the least squares estimate of the model's beta parameters as

β̂ = (XᵀX)⁻¹ XᵀY.

The k predictor variables in X come from the group of original predictor variables discussed in the first chapter. We decide which predictor variables to include in the model based on variable selection methods.
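The least squares computation can be sketched as follows. The data here are synthetic stand-ins (the study's actual predictors are the economic indexes from Chapter 1); the point is only the formula β̂ = (XᵀX)⁻¹XᵀY:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: n = 657 weeks, k = 2 predictors.
n, k = 657, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # column of ones -> intercept
beta_true = np.array([5.0, 2.0, -1.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)           # Y = X beta + error

# Least squares estimate: solve (X'X) beta_hat = X'Y rather than inverting X'X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat.round(2))  # close to [5, 2, -1]
```

Solving the normal equations with `np.linalg.solve` (or using `np.linalg.lstsq`, which works through an orthogonal decomposition) is numerically preferable to forming the inverse explicitly.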

Stepwise Variable Selection
Classically, there are three popular methods of variable selection: forward selection, backward elimination, and stepwise selection. Each of these methods determines which predictor variables should be included in the model based on some selection criterion. We consider variable selection with the goal of minimizing the Akaike Information Criterion (AIC), defined as

AIC = 2k − 2 ln L(β̂),

where k is the number of parameters and L(β) is the likelihood function. The median closing prices are assumed to be normally distributed such that Y ∼ N(Xβ, σ²I). This means that the likelihood function of the beta coefficients given the observed data can be written as

L(β | Y) = (2πσ²)^(−n/2) exp( −(Y − Xβ)ᵀ(Y − Xβ) / (2σ²) ).

AIC is a criterion used for model comparison where the ideal model is the one with the smallest AIC. The criterion considers the fit of the model to the data using the likelihood function.
Subtracting the log-likelihood term means that the AIC will decrease with an increased likelihood, while the 2k term accounts for the cost of adding an additional variable. In the end, the procedure leaves us with the full model.
Looking at some of the specifics of the stepwise selection, we see that there are 9 iterations, and for each iteration a variable was introduced to the model. We also see that for each iteration, the AIC decreases at a slower rate. This indicates that there are diminishing returns for the reduction of AIC due to the cost of adding an additional variable. Finally, we note that the variables introduced into the model first are the variables that increase the likelihood function the most.
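For a Gaussian linear model, the AIC that drives the stepwise procedure can be computed directly from the residual sum of squares, since the likelihood is maximized at β̂ with σ̂² = RSS/n. The sketch below uses synthetic data, and the parameter-count convention (coefficients plus σ²) is an assumption; software packages differ slightly on it:

```python
import numpy as np

def gaussian_aic(y, X):
    """AIC = 2k - 2 ln L(beta_hat) for an ordinary least squares fit,
    counting the regression coefficients plus sigma^2 as parameters."""
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta_hat) ** 2)
    sigma2 = rss / n                                  # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * (p + 1) - 2 * loglik

rng = np.random.default_rng(1)
n = 657
x1 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 + rng.normal(size=n)

X_null = np.ones((n, 1))                     # intercept only
X_with = np.column_stack([np.ones(n), x1])   # intercept plus an informative predictor
print(gaussian_aic(y, X_with) < gaussian_aic(y, X_null))  # True: x1 carries real signal
```

Each added parameter costs 2 AIC points, so a variable is worth including only if it raises 2 ln L by more than 2, which mirrors the diminishing returns seen across the stepwise iterations.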
Even though the stepwise process gives a full model, this is not the final model. To select the final model, we must consider the significance of the predictor variables and possible multicollinearity between predictor variables. We see from the previous section that the stepwise process gives the full model as the model with the lowest AIC. A reason for this outcome could be possible multicollinearity. Multicollinearity in a model occurs when predictor variables are highly correlated with one another. This means that each predictor variable can be used to explain the variation in our response variable, however, the information that is explained by each variable will be the same. In other words, highly correlated predictor variables tell us the same information about our response. Multicollinearity is undesirable because it adds unnecessary complexity to the model.
One way to check for multicollinearity is by assessing the correlation between each pair of predictor variables. To deal with multicollinearity, it is best to remove predictor variables that are highly correlated with other predictors. The Variance Inflation Factor (VIF) for each variable can be used to determine which predictors should be removed from the model and is defined as

VIF_j = 1 / (1 − R²_j),

where, considering the predictor variables X_1, ..., X_k, R²_j is the coefficient of determination for the fit of X_j on the remaining k − 1 variables. The coefficient of determination will be discussed later in further detail, but it represents the amount of variation in X_j explained by the remaining predictors. If X_j is highly correlated with the other predictors, R²_j will be close to 1, meaning that VIF_j will be large.
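The VIF can be computed by regressing each predictor on the others, as sketched below on synthetic data (not the study's actual predictor set):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j is the coefficient of
    determination from regressing column j of X on the other columns."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(Z, y, rcond=None)[0]
        resid = y - Z @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)   # c is nearly a copy of a -> severe collinearity
print(vif(np.column_stack([a, b, c])).round(1))  # VIFs for a and c far exceed 10; b is near 1
```

In practice the same numbers are available from statsmodels' `variance_inflation_factor`, but the direct computation makes the definition explicit.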
A general rule of thumb is that a VIF greater than 10 indicates a problem of multicollinearity. After removing variables due to multicollinearity, we check whether each remaining variable is significant, that is, whether it can be used to explain the variation in the median closing price. In terms of the model, the slope coefficient for a significant variable will be significantly different from zero. When testing for the significance of a variable, we can test the null hypothesis that the beta coefficient for that variable is equal to zero versus the alternative hypothesis that the beta coefficient is not equal to zero. We can also think of the test as a comparison of two models. The null hypothesis is that the expected value of the median closing price follows a reduced model where one variable is removed, leaving 6 predictor variables. The alternative hypothesis is that the expected value follows the full model where all 7 predictor variables are included. The 7 variables in the model for the alternative hypothesis are the 7 variables remaining after M1 and TIME were removed due to multicollinearity. The test statistic is

F = (SSR_F − SSR_R) / (SSE_F / (n − k − 1)),

where SSR_F and SSR_R are the regression sums of squares for the full and reduced models. These represent the amount of variation explained by each model. SSE_F is the error sum of squares for the full model, or the amount of variation that is not explained by the model. n − k − 1 is the degrees of freedom under the full model, where n = 657 is the number of observations in the training data and k = 7 is the number of predictor variables that we are considering for the full model. We are interested to see if the gain in the regression sum of squares is significant enough to justify keeping that additional variable in the model. The corresponding p-value for the test is

p = P(F_{1, n−k−1} > F),

where 1 represents the additional variable that the full model has over the reduced model. Thus F_{1, n−k−1} represents an F-distribution with 1 and n − k − 1 degrees of freedom, and the p-value is the probability that we obtain a value from this distribution that is greater than our test statistic.
Table 3.3 gives the F-statistics and corresponding p-values for each of the 7 variables that we are considering. For most of the predictor variables, the p-values are close to zero, indicating that for each of those tests there is sufficient evidence to reject the null hypothesis and conclude that the predictor variable is significant within the model. The only insignificant variable in the model is the median volume sold in week t, which has an F-statistic of 0.1194 and a corresponding p-value of 0.7298. Since the p-value is far greater than any standard significance level, there is not sufficient evidence to reject the null hypothesis, and we conclude that the median volume is not a significant variable for predicting the median closing price when the other predictors are also considered.
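The partial F-test described above can be sketched as follows, again on synthetic data with a hypothetical extra predictor standing in for volume:

```python
import numpy as np
from scipy import stats

def partial_f_test(y, X_full, X_reduced):
    """F = (SSR_F - SSR_R) / (SSE_F / (n - k - 1)) for models differing by
    one variable.  Since SST is shared, SSR_F - SSR_R = SSE_R - SSE_F."""
    def sse(X):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ beta
        return e @ e
    n = len(y)
    k = X_full.shape[1] - 1                      # predictors in the full model
    sse_f, sse_r = sse(X_full), sse(X_reduced)
    F = (sse_r - sse_f) / (sse_f / (n - k - 1))
    return F, stats.f.sf(F, 1, n - k - 1)        # upper-tail p-value

rng = np.random.default_rng(3)
n = 657
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)          # x2 plays no role in y

X_full = np.column_stack([np.ones(n), x1, x2])
F, p = partial_f_test(y, X_full, np.column_stack([np.ones(n), x1]))
print(round(F, 3), round(p, 3))  # a small F and a large p would mirror the volume result
```

Dropping a genuinely informative variable instead produces a very large F and a p-value near zero, which is the pattern reported for the other six predictors in Table 3.3.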
After removing volume, we can define the final pooled model as

Ŷ_t = β̂_0 + β̂_1 WIL_t + β̂_2 XOI_t + β̂_3 VIX_t + β̂_4 HUI_t + β̂_5 TNX_t + β̂_6 DFF_t,

where Ŷ_t is the predicted median closing price at week t and β̂_0, ..., β̂_6 are the parameter estimates for the y-intercept and slope coefficients. The chosen predictor variables are the Wilshire 5000 Index (WIL), the Oil and Gas Index (XOI), the Volatility Index (VIX), the Gold Bugs Index (HUI), the interest rate for 10-year T-notes and bonds (TNX), and the Federal Funds Rate (DFF).
Before considering the specifics of the model such as interpretations and applications, we first discuss the diagnostics of the model. We wish to see whether the selected model satisfies the underlying assumptions for a linear regression model. The four main assumptions of the model as shown by Dielman (2005), are linearity between the closing price and our predictor variables, independence of the error terms or residuals, constant variance or homoscedasticity of the residuals, and normality for the distribution of the residuals. We analyze each assumption individually.
The chosen pooled model is a linear regression model meaning that it is assumed that there is a linear relationship between the weekly median closing stock price and the predictor variables.
This assumption can be visualized by looking at a plot of the residuals against the fitted values. If the relationship is linear, then the residuals should be centered around a horizontal line at zero across all fitted values. The plot gives a red line that represents the center of the residuals for each set of values. It appears that the relationship is linear except for large fitted values, where the red line appears to be slightly curved.
Next we wish to test whether the residuals are independent of each other. This means that each residual should be unrelated to the residuals that precede it. We can check for the independence of the residuals with formal testing using the Durbin-Watson test, where we test the null hypothesis that the autocorrelation among the residuals is zero against the alternative that the autocorrelation is greater than zero, or

H_0: ρ = 0 versus H_a: ρ > 0.

The test statistic is defined as

d = Σ_{t=2}^{n} (e_t − e_{t−1})² / Σ_{t=1}^{n} e_t²,

the ratio of the sum of squared differences between each residual and the previous residual to the sum of squares of the errors. The value of the test statistic obtained from the residuals of the pooled model is d = 0.39664 with a corresponding p-value of nearly zero.
These results indicate that there is significant evidence to reject the null hypothesis and conclude that the autocorrelation of the residuals is greater than zero. In other words, we conclude that the residuals are not independent and that the model assumption of independence is violated. One possible reason that we obtain this result is that the data are time series data. For time series data, there is generally correlation with previous values, which can induce correlation in the residuals. For example, the closing price for one week will be correlated with the closing price of the previous week since, in general, there is not going to be a major shift in the economy that drastically changes the median stock price in one week as opposed to a longer time period.
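The Durbin-Watson statistic is simple to compute directly from a residual series. A short sketch comparing white-noise residuals with strongly autocorrelated ones:

```python
import numpy as np

def durbin_watson(e):
    """d = sum_{t=2}^n (e_t - e_{t-1})^2 / sum_{t=1}^n e_t^2.
    Values near 2 suggest no autocorrelation; values near 0 suggest
    strong positive autocorrelation."""
    diff = np.diff(e)
    return (diff @ diff) / (e @ e)

rng = np.random.default_rng(4)
white = rng.normal(size=1000)       # independent residuals
ar1 = np.empty(1000)                # AR(1) residuals with rho = 0.8
ar1[0] = white[0]
for t in range(1, 1000):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

print(round(durbin_watson(white), 2), round(durbin_watson(ar1), 2))  # first near 2, second well below 2
```

For an AR(1) series, d ≈ 2(1 − ρ), so a pooled-model value of about 0.40 corresponds to a lag-one residual autocorrelation near 0.8.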
The third model assumption that we must acknowledge is that of constant variance of the residuals, or homoscedasticity. We assume that the residuals have variances that are equal and unknown. This can be checked visually by looking at a plot of the residuals versus the fitted values. If the variances are equal, then we should expect the residuals to appear randomly and equally spread among all values. We note from the plot that the residuals across all values tend to be centered around zero, showing that the model tends to provide predicted prices centered around the actual prices. We also notice that a few values have extremely high or low residuals, which could be due to the presence of outliers where the model was not as accurate in its predictions. Finally, for larger values, such as predicted prices over $70, the variance appears to be smaller than for other values. When the median closing stock price is predicted to be a high value such as above $70, the corresponding residuals tend to be smaller, meaning that these high predictions tend to be higher than the actual closing price.
In conclusion, the variance appears to be constant except for some extremely large fitted values.
Finally, we consider the assumption of normality for the distribution of the errors or residuals, which we assume in the creation of the model. In other words, ε_t ∼ N(0, σ²) for some variance σ².
For the model to be correct, it is assumed that E(ε_t) = 0. We must furthermore assume that the errors follow a normal distribution when testing for the significance of the model parameters as well as when forecasting with the model. Normality can be checked both graphically and with formal testing. Visually, we turn to the normal Q-Q plot of the standardized residuals versus the theoretical quantiles under the normal distribution. If the assumption of normality is met, then the standardized residuals and the theoretical quantiles should be very similar, meaning that the normal Q-Q plot shows values that do not stray far from a straight line. Outliers, on the other hand, appear as extreme values that stray from the normal line. Since the data we are analyzing for the pooled model is the median closing stock price, the effects of outliers present across individual stock indexes are excluded. However, outliers can still be present in the form of extremely large or small median prices. These unusual prices can be due to external effects or anomalies in the market or across the economy as a whole. For example, we see that there is a curve of many smaller values that stray from the line. These could indicate unusually low prices due to some economic downturn such as the Great Recession, which was brought on by a crash in the housing market.
The normality assumption can also be checked with formal testing using the Anderson-Darling test for normality. This test is versatile because it can be used to check if a sample distribution fits any probability distribution. This means that this test can be applied specifically for the testing of normality. The null hypothesis that the residuals come from a normal distribution is tested against the alternative hypothesis that the errors are not normally distributed.
H_0: ε_t are normally distributed versus H_a: ε_t are not normally distributed.
The Anderson-Darling test considers the distance between the sample distribution of the observed residuals and the hypothesized normal distribution. The test statistic A² is used to quantify the discrepancy rather than simply looking at the plot. The test also places a higher weight on the tail values, which is where there tends to be the greatest difference from the normal. The test statistic is defined as

A² = −n − (1/n) Σ_{i=1}^{n} (2i − 1) [ ln F(Y_(i)) + ln(1 − F(Y_(n+1−i))) ],

where F(·) is the cumulative normal distribution function and Y_(1) ≤ ... ≤ Y_(n) are the ordered standardized residuals. Therefore, if the standardized residual distribution closely follows the normal, then it is expected to have a smaller test statistic. The value of the test statistic for the residuals based on the pooled model is A² = 0.96234. The corresponding p-value for the test statistic is 0.01516. If we base our decision on a standard 95% confidence level, we conclude that there is sufficient evidence to reject the null hypothesis and conclude that the residuals do not follow a normal distribution. Holding a 95% level of confidence indicates that we allow the type I error, or the probability of falsely rejecting the null hypothesis, to be up to 5%. In other words, the p-value tells us that the probability of obtaining a test statistic at least as large as 0.96234, given that the residuals come from a normal distribution, is approximately 1.5%. In conclusion, the test results indicate that the normality assumption is not satisfied. This result could possibly be related to the problem with linearity in the model.
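The Anderson-Darling test is available in scipy. Note that `scipy.stats.anderson` reports the A² statistic together with critical values at fixed significance levels rather than an exact p-value, so the sketch below compares against the 5% critical value (the residual samples here are synthetic stand-ins):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
normal_sample = rng.normal(size=657)       # stand-in for well-behaved residuals
skewed_sample = rng.exponential(size=657)  # clearly non-normal residuals

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    result = stats.anderson(sample, dist="norm")
    crit_5pct = result.critical_values[2]  # significance levels are [15, 10, 5, 2.5, 1]%
    print(name, round(result.statistic, 3), "reject at 5%:", result.statistic > crit_5pct)
```

A borderline statistic like the pooled model's A² = 0.96 lies above the 5% critical value but below the 1% value for large samples, consistent with its reported p-value of about 0.015.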

Interpretation and Fit of the Model
In this final section, we consider the fit of the model by considering the error and analyzing the testing data. We also discuss interpretations of the model coefficients and what the model tells us about how the median stock prices relate to the predictor variables.
Perhaps the most common way to assess the fit or the predictive capability of the model is with the coefficient of determination, a value that represents the percentage of the variability in the response that can be explained by the variability in the predictor variables. The coefficient of determination is defined as

R² = 1 − SSE/SST,

where SSE = Σ_t e_t² is the sum of the squared residuals and SST = Σ_t (y_t − ȳ)² is the total sum of squares. The residuals represent the amount of error caused by the discrepancies between the estimated values and the actual values. Therefore the ratio of the residual sum of squares to the total sum of squares gives the percentage of variation related to the residuals. This represents the unexplained variance, which is not accounted for by the model. R² then gives the proportion of the variation in the median stock price that is accounted for by the model. Models that have a higher coefficient of determination tend to be a better fit and give better predictions because the model is able to explain more of the variation in the stock price. The coefficient of determination for the pooled model is R² = 0.9664. This indicates that 96.64% of the total variance in the weekly median stock price for the top stocks in the NYSE 100 is linearly associated with the variance in the Wilshire 5000 Index (WIL), the Oil and Gas Index (XOI), the Volatility Index (VIX), the Gold Bugs Index (HUI), the interest rate for 10-year T-notes and bonds (TNX), and the Federal Funds Rate (DFF). The percentage of explained variation is very high, indicating that the model provides a good fit for the median stock price for the top stocks in the NYSE.
The problem with considering R^2 as a measurement of the fit of the model is that it will always increase when more predictor variables are added to the model. Based solely on R^2, a model with more predictor variables would always appear better. However, this is not true. As discussed previously, when creating a model, the goal is to have a well fit model that is as simple as possible. Additional variables can bring additional explanation of the variation in the response variable; however, there is a point where adding another variable is not worth the increase in the complexity of the model. An example would be predictor variables that are highly correlated with each other. If we have too many variables, it is likely we will see high correlation among predictors. Highly correlated predictors give largely the same information about the variability in the response, so it would be unnecessary to include all of them in the model. To account for the complexity of the model when considering the fit, we look instead to the adjusted coefficient of determination, which is defined as

R^2_adj = 1 - (SSR / (n - K - 1)) / (SST / (n - 1)).

Looking at the equation, the adjusted value is similar to R^2. The difference is that the residual sum of squares and the total sum of squares are divided by their respective degrees of freedom.
Therefore, the adjusted coefficient of determination considers both the sample size, n, and the number of predictor variables in the model, K. The adjusted value R^2_adj will never be larger than R^2, and for the pooled model it remains high, indicating a good fit for the median stock prices. Furthermore, we then expect the model to provide accurate price forecasts, which will be tested later using the testing data.
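As a sketch with simulated data (the variable names and dimensions are hypothetical, not the study's), R^2 and its adjusted counterpart can be computed directly from the residual and total sums of squares:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: n observations, K predictors.
n, K = 200, 3
X = rng.normal(size=(n, K))
beta = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta + rng.normal(scale=0.5, size=n)

# Ordinary least squares fit with an intercept column.
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ coef

SSR = np.sum(resid ** 2)            # residual sum of squares
SST = np.sum((y - y.mean()) ** 2)   # total sum of squares

r2 = 1 - SSR / SST
# Adjusted R^2 divides SSR and SST by their degrees of freedom,
# penalizing the model for each added predictor.
r2_adj = 1 - (SSR / (n - K - 1)) / (SST / (n - 1))
print(round(r2, 4), round(r2_adj, 4))
```

Since K ≥ 1 and the fit is imperfect, the adjusted value comes out strictly below the plain R^2, illustrating the penalty described above.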
Considering the coefficient of determination is useful for determining the fit of the model and how well the predictors overall explain the variation in the response; however, we are also interested in considering each predictor individually to gain information on the relationship between each of the macro- and microeconomic variables in the model and the weekly median closing price. This is done by interpreting the beta coefficients for the predictor variables from the final pooled model.
The coefficients give the relationship between each predictor and the median closing price by analyzing the change or variability in stock price related to the change in our predictors.

Before we begin our time series analysis, let us clarify the data being considered. In this analysis we are looking to forecast or predict the weekly median closing price for the top stocks in the NYSE, as is done in the pooled regression model from the previous chapter. At the end of this section, we consider what would change in our analysis if we were to use the weekly closing price for each index rather than the median.
The first step in the time series analysis is a first look at the data to consider any patterns, trends, cycles, or abnormalities that occur in the weekly closing stock price over time. This consideration is the exploratory data analysis of the time series data before modeling can occur. We visualize these patterns through a time series plot of the data, as shown in Figure 4.1, which includes the median closing price for the NYSE for each week from January 01, 2000 through December 23, 2017.
Based on the time series plot, there is an overall increasing trend in the median stock price from January 01, 2000 through December 23, 2017. This indicates that over time, the median weekly closing price tends to increase. Based on economic theory, this makes sense because over time, inflation will drive prices higher. The data analyzed in this study have been recorded over 17 years, which is a substantial period of time such that the effects of inflation are visible. Any sequence of random variables, Y_t, is defined as a stochastic process, and this includes time series data such as the weekly median closing stock price. For a time series or stochastic process, the mean function or expected value at time t is E(Y_t) = µ_t. This indicates that the expected median closing stock price can differ from week to week. We have pointed out the overall increasing trend as well as the abnormality in the median price that occurred during the Great Recession. Lastly, we use the time series plot to analyze the variation in the closing price over time. When we consider variation, we are interested in the change in the median closing price. Based on the plot, we see that the variation does not appear to be constant.
Over short periods of time, there can be small or large changes in the stock prices. In the short run, stock prices can vary greatly leading to unpredictable prices. These unpredictable short run changes are referred to as white noise. For example, a policy change such as the change in the federal funds rate can influence the economy including stock prices. Short term effects can be stabilized in the long run which is a reason why we consider longer periods of time in this study.
Here we consider weekly stock prices rather than daily data. First we consider the variation in price for a single week or time point t. For the analysis we consider the value or price at each time point as a random variable. The variance for the median closing price at time t can be defined in the same way as for any random variable. Here we define the variance as

V(Y_t) = γ_{t,t} = E[(Y_t - µ_t)^2],

which is the expected value of the squared difference between the observed closing price and the true mean closing price at time t. For time series data, we also consider the variation between time periods, or the covariance. The auto-covariance function (ACVF) between the median closing price at times t and t + h is defined as

γ_{t,t+h} = Cov(Y_t, Y_{t+h}) = E[(Y_t - µ_t)(Y_{t+h} - µ_{t+h})],

which represents the expected value of the product of the differences between the closing price and the true mean closing price for weeks t and t + h. Here h represents a time period or lag after week t. The auto-covariance function can be used to measure the linear dependence of the median closing price at various weeks. From the covariance, we can define the autocorrelation function (ACF) between the median closing price at times t and t + h as

ρ_{t,t+h} = γ_{t,t+h} / √(γ_{t,t} γ_{t+h,t+h}).

The correlation between the median closing price at times t and t + h represents the amount of variation in each variable that can be explained by the other. We see from the formula that the correlation is a ratio of the covariance and the product of the standard deviations, which represent the overall variations of both variables. We note that γ_{t,t} = V(Y_t) and γ_{t+h,t+h} = V(Y_{t+h}) since the covariance of a variable with itself is simply the variance.
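The sample versions of these quantities are straightforward to compute. A minimal sketch in plain numpy (with a simulated price series standing in for the study's data):

```python
import numpy as np

def sample_acvf(y, h):
    """Sample auto-covariance at lag h: average of (y_t - ybar)(y_{t+h} - ybar)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    ybar = y.mean()
    return np.sum((y[: n - h] - ybar) * (y[h:] - ybar)) / n

def sample_acf(y, h):
    """Sample autocorrelation at lag h: gamma_h / gamma_0."""
    return sample_acvf(y, h) / sample_acvf(y, 0)

rng = np.random.default_rng(2)
# A persistent series (a random walk around 100) has strong low-lag autocorrelation.
prices = np.cumsum(rng.normal(size=500)) + 100

print("rho_0:", sample_acf(prices, 0))           # always exactly 1
print("rho_1:", round(sample_acf(prices, 1), 3)) # close to 1 for a random walk
```

Note the divisor n (rather than n - h) in `sample_acvf`, which is the conventional biased estimator used for sample ACFs.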
Next we discuss some of the properties of the correlation. First notice that |γ_{t,t+h}| ≤ √(γ_{t,t} γ_{t+h,t+h}). Since the covariance, γ_{t,t+h}, measures the variation between the median closing prices at weeks t and t + h, it is bounded by the variation of the two variables. Therefore, the correlation between the two random variables is bounded between -1 and 1. In other words, |ρ_{t,t+h}| ≤ 1. When |ρ_{t,t+h}| is closer to one, the covariance or interaction between the two variables is close to the overall variation of each. This indicates that the overall variance can be closely explained by the variation between the variables, indicating a strong relationship between them. In other words, the median closing stock price at week t + h is related to the closing price at a previous week t. A value of ρ_{t,t+h} close to zero indicates that the variation between the two variables is small relative to the overall variations of both variables. In other words, the closing prices would be considered uncorrelated.
Before we can model the time series process, we need to be able to make an assumption about the behavior or structure of the process over time. If we are to build a model based on the observed time series data with the purpose of predicting future values, we must assume that the structure of the process remains the same. This type of process is considered to be stationary. Stationarity can be defined strictly; however, we simply wish to model a weakly stationary process, which is mathematically weaker but holds similar assumptions. For a stochastic process to be weakly stationary, the mean function must be constant over time, and the variance must be constant over time. Both of these requirements indicate that, if met, the process maintains the same structure over time. If the mean function is constant over time, then the expected value of the closing price at any time point is the same. In other words, E(Y_t) = µ. This is stronger than the definition of the expected value or mean function that we previously gave. For the second requirement, if the variances are equal, then V(Y_t) = V(Y_{t+h}) = γ_0, such that the variance is not a function of time but is a constant for all weeks, which we write in terms of covariance as γ_0. Also, when we consider constant variance over time, the covariance between random variables that are equal distances apart should be equal. In other words, Cov(Y_t, Y_{t+h}) = Cov(Y_{t+k}, Y_{t+k+h}) for any lag h and any time shift k. This is the same as saying γ_{t,t+h} = γ_{t+k,t+k+h} = γ_h. Therefore, the covariance is not dependent on time; instead, the covariance between the closing prices Y_t and Y_{t+h} is based only on the time lag h between them. We can also extend the implications of constant variance to the correlation between two random variables in the stationary time series process. Suppose we consider the correlation between weeks t and t + h for a stationary process.
Then, based on our redefined values for the variance and covariance, we obtain ρ_{t,t+h} = γ_h / γ_0 = ρ_h. Based on these redefined formulas, notice that for a stationary process the variance is a constant value, and the covariance and correlation between two variables depend only on the lag h between them.
Based on our observations of the time series plot, we first noticed that, due to inflation, the median closing stock price increases over time. This indicates that rather than being constant over time, the mean function is instead dependent on time. This would then indicate that the median stock price over time is not a stationary process because the requirement of a constant mean is not met.
We also noticed periods where the change in the closing stock price is either larger or smaller.
The most obvious example of this was the extreme decrease in price observed during the Great Recession. Because the occurrence of the recession is an anomaly in the economy, not only can it be related to a non-stationary process, it also has the potential to create a model that is overfitted to this anomaly. For the sake of providing a model that better meets the underlying assumptions and makes accurate predictions, we will consider time series models based only on the data after the Great Recession as well as models based on the entirety of the data. In other words, we will consider the modeling process based on the data from January 1, 2000 through December 23, 2017 as well as the data from June 6, 2009 through December 23, 2017. Stationarity can also be formally tested using the Augmented Dickey-Fuller (ADF) test, which tests the null hypothesis that the data are not stationary against the alternative hypothesis that the data are stationary. For the entire set of data for all 939 weeks, the test statistic is -2.0018 with a p-value of 0.5776, which indicates that there is not sufficient evidence to reject the null hypothesis, and we conclude that the data are non-stationary. For the data occurring after the Great Recession, the test statistic is -2.5896 with a p-value of 0.3282, which again indicates that there is not sufficient evidence to reject the null hypothesis, and we conclude that the data are non-stationary. Both of these conclusions align with those drawn from the time series plots. Also, the p-value for the data after the Great Recession is smaller than that for the entire data set, which suggests that removing the drastic drop in price that occurred during the recession has reduced the non-stationarity in the data.
As with the construction of the pooled model, the data for both time intervals must be partitioned into training and testing data. Since the data are being retained as time series, the data are not partitioned randomly. Instead, the data for the last year are reserved for testing so that the actual data can be compared with the model predictions. For the entire data set, the 888 weeks from January 1, 2000 through December 31, 2016 make up the training data, while the 51 weeks from January 7, 2017 through December 23, 2017 make up the testing data. For the post-recession data, the 396 weeks from June 6, 2009 through December 31, 2016 make up the training data, while the year 2017 again represents the testing data set. Both models use the same testing data, which makes it simpler to compare the fits of the models based on the accuracy of their predictions.
Since it has been established that the data are non-stationary, it is necessary to make the data stationary before modeling can be done. The most common way to obtain stationarity is through differencing. This means that instead of modeling the price at week t, which is increasing over time, we consider the difference or change in price from each week to the next. In other words, we wish to model the first order differences, ∆Y_t = Y_t - Y_{t-1}, which are plotted for both data sets. From the graph, we see that for both time periods, the average value appears to be centered around zero. This indicates that the median closing price tends to be similar to the median closing price for the previous week. In other words, the stock price does not differ greatly on a weekly basis. Furthermore, this indicates that the first order differences are stationary data because the mean is not dependent on time but is rather a constant. By looking at the variation, we see that it tends to be constant for both data sets. One exception is that for the entire data set, the impact of the Great Recession is still visible, although not as drastic as in the data before the differencing. At the point of the recession there is an extremely small difference, illustrating the catastrophic decrease in price when the housing market crashed. Some change in variation is also slightly visible for the data occurring after the recession. There appears to be greater price variability for more recent weeks, which is illustrated by the larger and smaller differenced values for the more current data. In conclusion, based on the time series plots, the first order differences appear to be stationary.
We can formally test the null hypothesis that the first order differences are not stationary against the alternative hypothesis that the first order differences are stationary using the ADF test. Performing the test for the entire data set yields a statistic of -10.818 with a corresponding p-value of less than 0.01, which means that there is sufficient evidence to reject the null hypothesis and conclude that the differenced data for the entire data set are stationary. The test statistic for the first order differences of the post-recession data is -7.7475 with a corresponding p-value of less than 0.01, which means that there is sufficient evidence to reject the null hypothesis and conclude that the differenced post-recession data are stationary. Since the data are stationary, we continue to the modeling process using the first order differences for both data sets.

Model Identification and Selection
It has been determined that the differenced data, ∆Y_t = Y_t - Y_{t-1}, are stationary. When fitting a model for a stationary time series, we model the data based on past observations and past errors.
In other words, the differenced data can be expressed as

∆Y_t = φ_1 ∆Y_{t-1} + ... + φ_p ∆Y_{t-p} + ε_t + θ_1 ε_{t-1} + ... + θ_q ε_{t-q},

where ε_t represents the white noise that is assumed to be normally distributed with mean 0 and variance σ^2. The past p observations included in the model and the corresponding coefficients φ represent the components of an auto-regressive process of order p. The past q white noise terms and the corresponding coefficients θ represent the components of a moving average process of order q. These types of models are called Univariate Box-Jenkins (UBJ) models. They are also referred to as ARIMA(p, d, q) models, where AR(p) indicates the auto-regressive component, MA(q) indicates the moving average component, and d is the degree of differencing for non-stationary data. As discussed in the previous section, first degree differencing is satisfactory, meaning that we consider the median closing stock price to follow an ARIMA(p, 1, q) process. The goal is to identify possible candidate models of auto-regressive and moving average components and then select the model with the best fit.
Since the differenced data are stationary, we can simplify the notation. We can write the mean as E(∆Y_t) = µ_t = µ since there is a constant mean difference. We write the variance as V(∆Y_t) = γ_{t,t} = γ_0 since the variance is constant over time. The covariance between any two observations ∆Y_i and ∆Y_j where |i - j| = h can be written simply as γ_h since the covariance is a function of the lag. Based on these observations, we define the correlation between any two observations ∆Y_i and ∆Y_j where |i - j| = h as ρ_h = γ_h / γ_0, which is the autocorrelation function (ACF).
First we consider the auto-regressive model of degree p, AR(p), which we define as

∆Y_t = φ_1 ∆Y_{t-1} + ... + φ_p ∆Y_{t-p} + ε_t.

From the equation, we see that the process is based on the previous p observations. Let us consider a simple AR(1) model. If we take the equation for the AR(1), multiply each side by ∆Y_{t-h}, and take expected values, the autocorrelation function can be derived as ρ_h = φ^h, as shown in the text by Cryer and Chan (2008). Based on the theoretical values of the autocorrelation function for an AR(1) model, we can identify a process as being auto-regressive if the ACF exhibits exponential decay. However, the autocorrelation function does not allow us to identify the degree of the auto-regressive process. For this, we turn to the partial autocorrelation function (PACF). The partial autocorrelation at lag k, φ_{k,k}, is defined as the correlation of two observations ∆Y_t and ∆Y_{t-k} after accounting for the effect of the variables in between, ∆Y_{t-1}, ..., ∆Y_{t-k+1}. In other words,

φ_{k,k} = Corr(∆Y_t, ∆Y_{t-k} | ∆Y_{t-1}, ..., ∆Y_{t-k+1}).

As shown in the text by Cryer and Chan (2008), this means that for an auto-regressive process of degree p, φ_{k,k} = 0 for k > p. Furthermore, an AR(p) model can be identified by a cutoff of the PACF after lag p.
Next we discuss the moving average model of degree q, MA(q), which we define as

∆Y_t = ε_t + θ_1 ε_{t-1} + ... + θ_q ε_{t-q}.

The moving average process defines the differenced median closing price as a function of the current random error and previous random error terms. Since the error terms are assumed to be normally distributed with mean zero, the expected value of the differenced data is also zero. Let us consider the simplest moving average model, the model of degree one, where ∆Y_t = ε_t + θ_1 ε_{t-1}. Then the variance is γ_0 = σ^2(1 + θ_1^2). The autocorrelation function is then ρ_h = θ_1 / (1 + θ_1^2) for h = 1 and ρ_h = 0 for h > 1. This can be expanded to the general case for any MA(q) model: if a process follows an MA(q) model, the autocorrelation is zero for any lag greater than q. Therefore, the moving average component can be identified from a plot of the ACF. Now that we have defined the auto-regressive and moving average components of the UBJ model, we can analyze the sample autocorrelation functions (SACF) and sample partial autocorrelation functions (SPACF) of the differenced data in order to identify appropriate ARIMA(p, 1, q) candidate models.
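The MA(1) identity is easy to check by simulation. Under the plus-sign parameterization ∆Y_t = ε_t + θ ε_{t-1} used here, the lag-1 autocorrelation should be θ/(1 + θ^2) and all higher lags should be near zero (the sketch uses an arbitrary θ = 0.6 and simulated noise, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(5)

theta = 0.6
n = 20000
eps = rng.normal(size=n + 1)

# MA(1) process: Delta Y_t = eps_t + theta * eps_{t-1}.
y = eps[1:] + theta * eps[:-1]

def acf(series, h):
    """Sample autocorrelation at lag h."""
    s = series - series.mean()
    return np.sum(s[: len(s) - h] * s[h:]) / np.sum(s * s)

rho1_theory = theta / (1 + theta ** 2)   # 0.6 / 1.36, about 0.441
print("sample rho_1:", round(acf(y, 1), 3), "theory:", round(rho1_theory, 3))
print("sample rho_2:", round(acf(y, 2), 3), "(theory: 0 for lags > 1)")
```

The sharp cutoff of the ACF after lag q is exactly the fingerprint used below to identify a moving average component.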
We begin by observing the autocorrelation for the first order difference of the entire data set, which is plotted in Figure 4.4. From the autocorrelation function, we first see that the correlation for a lag of zero is 1. This is because any point is 100% correlated with itself. Also, there is a significant correlation when the lag is 1. This means that there is a significant relationship between the difference in price between two weeks and the difference in price for the week prior.
In other words, there is a significant correlation between ∆Y_t and ∆Y_{t-1} for any week t. For lags of 2 and greater, the correlation becomes insignificant. In other words, there is not a significant relationship between ∆Y_t and ∆Y_{t-2} for any week t. These observations indicate the presence of a moving average component of order 1 within the model. Therefore, the first candidate model that we consider is an ARIMA(0, 1, 1). This means that we consider a model where the price difference is a function of the previous error term. In other words, we consider ∆Y_t = ε_t + θ ε_{t-1}.
Next, we analyze the sample partial autocorrelation function. From the plot, we notice that the SPACF is significant for a lag of 1 and then becomes insignificant for any lag greater than 1. This means that when accounting for the effect of all intervening variables, there is only a significant relationship between the difference in price and the difference in price for the previous week. These observations indicate the presence of an auto-regressive component of degree 1 within the model. Therefore, the second candidate model that we consider is an ARIMA(1, 1, 0). This means that we consider a model where the difference in price is a function of the previous price difference. In other words, we consider ∆Y_t = φ ∆Y_{t-1} + ε_t.
Since the correlation plots indicate the presence of a moving average component and an auto-regressive component, the third candidate model that we include is an ARIMA(1, 1, 1), where the price difference is a function of the previous price difference as well as the previous error term. In other words, we include the model ∆Y_t = φ ∆Y_{t-1} + ε_t + θ ε_{t-1}. Now that we have identified candidate models for the entirety of the data, we turn to the SACF and SPACF of the price difference for the post-recession data, which are plotted in Figure 4.5. From the autocorrelation function, we see that again there is a correlation of 1 for lag 0, reflecting the perfect correlation of a value with itself. There is also significant correlation for lags 1 and 10, while the correlations for all other lags are insignificant. Based on these observations, the first candidate model that we consider is an ARIMA(0, 1, 1), which includes the moving average component of degree 1. Even though a lag of 10 is significant, we will not include an MA(10) component in our candidate model because it would require including all intervening lags, which we find to be insignificant. In other words, the first candidate model is ∆Y_t = ε_t + θ ε_{t-1}.
From the sample partial autocorrelation function, we notice that, similarly to the SACF, there is a significant correlation for lags 1 and 10, while the correlations for all other lags are insignificant.
Based on these observations, we choose an ARIMA(1, 1, 0) model as the second candidate model, which includes an auto-regressive component of degree one. This means that the model defines the price difference as a function of the previous price difference. In other words, the second candidate model is ∆Y_t = φ ∆Y_{t-1} + ε_t. An AR(10) component is not included in the candidate model since all intervening lags are insignificant. Finally, the third candidate model is an ARIMA(1, 1, 1) model which combines the auto-regressive and moving average components of the previous two candidate models. In other words, the third candidate model is ∆Y_t = φ ∆Y_{t-1} + ε_t + θ ε_{t-1}. Notice that the candidate models chosen are the same whether we consider the entirety of the data or only the post-recession data. However, this does not mean that the fitted models will be the same for both data sets. Because the models will be built using different training data, the model fits will differ, as will the parameter estimates.
Three candidate UBJ models have been identified for both sets of training data, which means that a final model can be selected for each using a selection criterion. For consistency, we use the same criterion used for regression modeling. We again use the Akaike Information Criterion (AIC), defined as AIC = 2k - 2 log(L), where k is the number of unknown parameters and L is the likelihood function. As discussed previously, the model with the smallest AIC is selected since a model that is least complex with the best fit is desirable. Table 4.1 gives the value of the selection criterion for each of the candidate models based on the entire data set.

    Model          Equation                                    AIC
    ARIMA(0,1,1)   ∆Y_t = ε_t + θ ε_{t-1}                      2947.85
    ARIMA(1,1,0)   ∆Y_t = φ ∆Y_{t-1} + ε_t                     2947.35
    ARIMA(1,1,1)   ∆Y_t = φ ∆Y_{t-1} + ε_t + θ ε_{t-1}         2948.98

Based on the AIC, all of the models have similar fits. The largest AIC is for the ARIMA(1, 1, 1) model, meaning that adding the additional unknown parameter does not improve the fit enough to justify the added complexity. The ARIMA(0, 1, 1) and ARIMA(1, 1, 0) are very similar, with AIC values of 2947.85 and 2947.35 respectively. However, since the ARIMA(1, 1, 0) has the smallest AIC, it is chosen as the best fit. This means that the difference in closing price is best modeled as a function of the previous price difference when considering the entire data set. Next we consider the model selection process for the candidate models based on the post-recession data. Table 4.2 gives the value of the AIC for each of the candidate models. Even though the candidate models are the same for the post-recession data, the fits will differ since different training data are involved.

    Model          Equation                                    AIC
    ARIMA(0,1,1)   ∆Y_t = ε_t + θ ε_{t-1}                      1337.058
    ARIMA(1,1,0)   ∆Y_t = φ ∆Y_{t-1} + ε_t                     1337.651
    ARIMA(1,1,1)   ∆Y_t = φ ∆Y_{t-1} + ε_t + θ ε_{t-1}         1333.520

The first thing that we notice is that the AIC values are significantly lower for each of the candidate models when the post-recession data are used rather than the entirety of the data.
In other words, we are able to produce models with better fits when modeling based only on the data occurring after the recession. This is what we would expect, since the recession is an anomaly and does not represent the typical trend for the closing stock prices. When considering the post-recession data, the model with the best fit is the ARIMA(1, 1, 1) since it has the smallest AIC, with a value of 1333.520. This is interesting since this model contains more unknown parameters than the other candidate models, meaning that the improvement in fit outweighs the added complexity of the model. The ARIMA(1, 1, 1) is therefore chosen as the final model since it is the model with the smallest AIC.

Diagnostics
Before we interpret the final model, we first must consider the diagnostics of the models, including the significance of the parameter estimates and the assumptions of the model. We have described the model as ∆Y_t = φ ∆Y_{t-1} + ε_t + θ ε_{t-1}; however, the error for the current week is not obtainable. Therefore, to use the model, we write it in terms of the predicted price difference,

∆Ŷ_t = φ ∆Y_{t-1} + θ e_{t-1},

where the error e_{t-1} is represented by the difference between the predicted price difference and the actual value. The values of the estimates for the unknown parameters are given in Table 4.3. We use the Anderson-Darling test to test the null hypothesis that the standardized residuals are normally distributed against the alternative hypothesis that the residuals are not normally distributed. The value of the test statistic is A^2 = 1.5122 with a corresponding p-value of 0.0007, which indicates that there is sufficient evidence to reject the null hypothesis and conclude that the residuals are not normally distributed. This means that the assumption of normality is not satisfied.
To check for the independence and constant variance of the residuals, we look to some diagnostics plots shown in Figure 4.7.
The second model assumption is that the variance of the residuals is constant over time. From the plot of the standardized residuals, we see that they tend to be centered around zero or the mean.
The variance over time is represented by how large or small the residuals are. It appears that there tends to be a larger amount of variance for more recent periods of time, an observation that was also noticeable from the original time series plot. This indicates a possible problem with the assumption of constant variance in the model. The third model assumption is that the residuals are independent of each other. To check this assumption, we first look at the autocorrelation function of the residuals. Based on the plot, it appears that the correlation is insignificant for all lag values except for zero, which is expected. Based on the ACF, it would appear that the assumption of independence is satisfied. Independence can also be formally tested using the Ljung-Box test, which tests the null hypothesis that the residuals are independent against the alternative hypothesis that the residuals are correlated. The p-values for the test are visualized in the plot since the test is performed for every lag value. For each value of the lag, the p-value corresponding to the Ljung-Box test is well above 0.05, indicating that at a 5% significance level, there is not sufficient evidence to reject the null hypothesis, and we conclude that the residuals are independent. Therefore, we conclude that the independence assumption is satisfied. These diagnostic results are noteworthy because the time series model, unlike the regression model, is able to provide independent residuals.

Forecasting and Model Interpretation
Now that the final model is selected, we wish to interpret the model, test its fit, and check the accuracy of its predictions. We have defined the model in terms of the first order difference in price; however, the model can be rewritten in terms of the median closing price by simply substituting ∆Y_t with Y_t - Y_{t-1}. The substitution yields

Y_t = (1 + φ) Y_{t-1} - φ Y_{t-2} + ε_t + θ ε_{t-1}.

The forecasts from the model can then be compared with the actual median closing prices for 2017. These actual values represent the testing data set and were not used in the building of the model.

The pooled models predict the overall state of the stock market and can be used as a market index for popular stocks in the NYSE. However, these models cannot be used to predict the prices for individual stocks, which is why we also consider a regression model with a varying intercept. If we consider each stock separately, the model can then be used by anyone who is invested in a particular stock within the top stocks in the NYSE. This type of model is also useful for comparing the differences in prices over time for different stocks. Analyzing these patterns can be helpful to investors who are considering various stocks and which stocks tend to have higher or steadier prices.
First we look at the data that will be used for the modeling process. The pooled models used the median price meaning that there was only one observation for each of the 939 weeks collected.
Since we are interested in each individual stock, this model will use the entire data set collected, which includes the closing stock price for each of the 939 weeks for each of the 85 stocks chosen from the NYSE 100. For the modeling process, the data set is randomly partitioned into a 70%-30% split between the training data, which is used in the creation of the model, and the testing data, which is used to analyze the fit of the model. In other words, out of the 79,815 observations, 60,882 are in the training data and 18,993 are in the testing data. As described previously when discussing the pooled model, the distribution of stock prices is skewed right due to a minority of companies that on average have high stock values. To normalize the distribution, the log transformation of the stock price is used, as shown in Figure 5.1.

Figure 5.1: Distribution of Log Price
In the previous chapter, the specifics of the format of a multiple linear regression model were discussed. Here we highlight the differences in the multiple regression model when considering the median price over all stocks versus the closing price for each stock individually. The pooled regression model took the form

log(Y_t) = β_0 + β_1 x_{1,t} + ... + β_k x_{k,t} + ε_t,

where β_0 is the intercept and k is the number of predictor variables. For the varying intercept model, we consider

log(Y_{it}) = β_0 + Σ_{j=1}^{84} α_j I_j + β_1 x_{1,t} + ... + β_k x_{k,t} + ε_{it},

where i = 1, ..., 85 represents the 85 stock indexes in the NYSE 100 that we are considering and I_j is an indicator function relating to each index j. In other words, the stock index is a dummy variable which corresponds to a unique coefficient α_i such that α_i + β_0 represents a different intercept for each stock. It is for this reason that the model is referred to as having a varying intercept. Notice that the model does not include a coefficient for the 85th index so that there is not an issue of multicollinearity among the variables. The intercept for the 85th stock index is represented simply by β_0. Also notice that the equation models log(Y_{it}), which is the log of the closing price for each index i for each week t.
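A minimal numpy sketch of the dummy-variable mechanics, using three hypothetical stocks and one predictor (for simplicity this version uses one dummy column per stock with no shared constant, so each dummy coefficient is that stock's full intercept; the thesis's formulation with β_0 plus 84 dummies is an equivalent reparameterization):

```python
import numpy as np

rng = np.random.default_rng(8)

# Three hypothetical stocks with different intercepts but a shared
# slope on a single predictor (a stand-in for the market indexes).
n_per = 100
stock_ids = np.repeat([0, 1, 2], n_per)
x = rng.normal(size=3 * n_per)
true_alpha = np.array([1.0, 2.0, 3.0])   # per-stock intercepts
log_price = true_alpha[stock_ids] + 0.5 * x + rng.normal(scale=0.1, size=3 * n_per)

# Design matrix: one dummy column per stock plus the shared predictor.
dummies = (stock_ids[:, None] == np.arange(3)[None, :]).astype(float)
X = np.column_stack([dummies, x])

coef, *_ = np.linalg.lstsq(X, log_price, rcond=None)
intercepts, slope = coef[:3], coef[3]
print("intercepts:", np.round(intercepts, 2), "slope:", round(slope, 3))
```

The fit recovers a distinct intercept per stock while estimating a single slope, which is exactly the structure of the varying intercept model.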

Stepwise Variable Selection
As with the modeling process for both the pooled regression model and the time series model, the Akaike Information Criterion (AIC) is again used as the selection criterion. The stepwise selection process begins with the null model, and each step of the process then adds or removes variables until the smallest AIC is obtained. The variables considered in the process are identical to those considered in the pooled model except for two; the remaining variables (such as TIME) are the same for every stock, since they are not related to a specific stock and therefore do not differ across stock indexes. The process also shows that the categorical variable representing the stock index is the first variable added into the model. This indicates that the stock index is the variable most related to the closing price. This makes sense since some stocks see increasing prices and others decreasing prices over time, depending on the financial health of the corresponding company.
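A minimal forward-only version of AIC-based stepwise selection might look like the following sketch; the study's procedure also allows removals at each step, and the variable names here are toy stand-ins.

```python
import numpy as np

def ols_aic(X, y):
    """AIC (up to an additive constant) of an OLS fit: n*log(RSS/n) + 2p."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    n, p = X.shape
    return n * np.log(rss / n) + 2 * p

def forward_stepwise(y, candidates):
    """Greedy forward selection: start from the null (intercept-only) model
    and repeatedly add the variable that lowers AIC most, stopping when no
    addition improves the criterion."""
    n = len(y)
    selected = []
    best_aic = ols_aic(np.ones((n, 1)), y)
    improved = True
    while improved:
        improved = False
        for name, col in candidates.items():
            if name in selected:
                continue
            cols = [np.ones(n)] + [candidates[s] for s in selected] + [col]
            aic = ols_aic(np.column_stack(cols), y)
            if aic < best_aic:
                best_aic, best_name, improved = aic, name, True
        if improved:
            selected.append(best_name)
    return selected, best_aic

# Toy check: y depends on x1 and x2 but not on the noise column.
rng = np.random.default_rng(1)
x1, x2, junk = rng.normal(size=(3, 300))
y = 2.0 * x1 - 1.0 * x2 + rng.normal(0, 0.1, 300)
chosen, aic = forward_stepwise(y, {"x1": x1, "x2": x2, "junk": junk})
```

On this toy data the strongest predictor (`x1`) enters first, mirroring how the stock index entered first in the study's selection.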
Before selecting the final model, we must first analyze the relationship between the predictor variables to check for any possible problem of multicollinearity. As discussed previously, to remove the effects of multicollinearity, it is necessary to remove predictor variables that are highly correlated with other predictors. Again we rely on the Variance Inflation Factor (VIF) to quantify the correlation of each predictor relative to the other predictors. After removing the problematic predictors, there no longer appears to be a problem of multicollinearity within the model. The third assumption is that the residuals are independent of one another. Independence is tested formally using the Durbin-Watson test, where the null hypothesis that the autocorrelation among the residuals is zero is tested against the alternative hypothesis that the autocorrelation is greater than zero. The test statistic for the varying intercept model is d = 0.037461 with a corresponding p-value of less than 0.0001. This indicates that there is sufficient evidence to reject the null hypothesis and conclude that the autocorrelation among the residuals is greater than zero.
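The VIF can be computed directly from its definition by regressing each predictor on the others; the data below are synthetic, constructed so that two columns are nearly collinear.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X: regress column j on
    the remaining columns (plus an intercept) and return 1 / (1 - R^2_j)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Toy check: x3 is nearly a copy of x1, so both get large VIFs,
# while the unrelated x2 stays near 1.
rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=(2, 500))
x3 = x1 + rng.normal(0, 0.05, 500)
vifs = vif(np.column_stack([x1, x2, x3]))
```

A common rule of thumb flags predictors with VIF above 5 or 10 for removal, which is the kind of screening applied here.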
In other words, we conclude that the independence assumption has been violated. The normality of the residuals is also tested formally, yielding a p-value of less than 0.0001. This indicates that there is sufficient evidence to reject the null hypothesis and conclude that the residuals are not normally distributed. This means that the normality assumption has been violated.
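The Durbin-Watson statistic itself is straightforward to compute from the residuals; the series below are synthetic illustrations rather than the model's actual residuals.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2).
    Values near 2 suggest no autocorrelation; values near 0 suggest strong
    positive autocorrelation, as found for the varying intercept model."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Toy check: a highly persistent AR(1) series gives d near 0,
# while white noise gives d near 2.
rng = np.random.default_rng(3)
ar = np.zeros(2000)
for t in range(1, 2000):
    ar[t] = 0.98 * ar[t - 1] + rng.normal(0, 0.1)
white = rng.normal(size=2000)
```

Since d ≈ 2(1 − ρ) for lag-1 autocorrelation ρ, the observed d = 0.0375 corresponds to residual autocorrelation of roughly 0.98.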
From the diagnostics, it is shown that each of the predictor variables is significant in the model but that there are serious issues with the model assumptions. In the next section, we examine how useful the model is for making predictions given these violations.

Model Fit and Interpretation
The final model obtained for the varying intercept regression can be written as

log(Y_it) = β_0 + α_i I_i − (1.608 × 10^-9) V_it + 0.00051 XOI_t − 0.007607 VIX_t − 0.1047 TNX_t + 0.03756 DFF_t.

In this section we consider the fit of the model by analyzing the coefficient of determination and by comparing the values predicted by the model to the testing data. We also look at the interpretation of the model as a way to explain the relationship between the closing stock price and the predictor variables.
The coefficient of determination for the varying intercept model is R² = 0.6433, indicating that 64.33% of the total variance in the log of the weekly closing stock price for the top stocks in the NYSE 100 is linearly associated with the variation in the weekly traded volume for each stock index (V), the Oil and Gas Index (XOI), the Volatility Index (VIX), the interest rate for 10-year T-notes and bonds (TNX), and the Federal Funds Rate (DFF). The coefficient of determination adjusted for the degrees of freedom is R²_adj = 0.6428, indicating that, accounting for the complexity and sample size of the model, 64.28% of the total variance in the log weekly closing price is explained by the same predictors. Even after adjusting for the degrees of freedom, a majority of the variance is explained, which suggests the model provides a reasonable fit.
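As a quick arithmetic check, the reported adjusted R² follows from the ordinary R² under the assumption (inferred from the text, not stated explicitly) of p = 89 estimated slopes, i.e. 84 stock dummies plus the 5 continuous predictors, and n = 60,882 training observations:

```python
# Reproduce the adjusted R^2 from the ordinary R^2.
# n = 60,882 training rows; p = 89 slopes (84 dummies + 5 predictors)
# are assumptions inferred from the text.
r2, n, p = 0.6433, 60882, 89
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2_adj, 4))  # 0.6428, matching the reported value
```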
Interpreting the coefficients for the predictor variables gives insight on the relationship between the closing price for each stock and each predictor individually. The coefficient for XOI is 0.00051 and indicates that holding all other variables constant, when the NYSE ARCA Oil and Gas Index increases by one point, it is predicted that the closing stock price will increase by 0.0510%. This means that there is a positive relationship between the overall oil and gas prices and the prices of individual stocks in the NYSE.
The regression coefficient for V is -1.608e-09 and indicates that holding all other variables constant, when the weekly volume increases by one unit, it is estimated that the closing stock price will decrease by a percentage that is near zero, or 1.608e-07%. This means that when more volume of a stock is sold, the price tends to be lower. This makes sense because people tend to buy more when prices are lower. Also notice that the coefficient for volume is extremely small yet still significant based on the p-value. This is because the weekly volume of stock sold is extremely high, so a change of one unit is very small. However, larger changes in volume yield a more substantial predicted decrease in the price.
The slope coefficient for VIX is -0.007607 and indicates that holding all other variables constant, when the CBOE Volatility Index increases by one point, it is estimated that the closing stock price will decrease by an average of approximately 0.7578%. This indicates that there is a negative relationship between volatility and price, meaning that when there is more volatility or uncertainty about the future, buyers tend to hold off and prices decrease.
The coefficient for DFF is 0.03756 and indicates that holding all other variables constant, when the Federal Funds Rate increases by one percent, it is predicted that the closing stock price will increase by an average of 3.8274%. This illustrates a positive relationship between the interest rate and price, as also seen in the pooled model.
The coefficient for TNX is -0.1047 and indicates that holding all other variables constant, when the CBOE interest rate for 10-year T-notes and bonds increases by 1%, it is estimated that the closing stock price will decrease by an average of 9.94%. Similarly to the pooled model, we again see a negative relationship.
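These percentage interpretations all follow from the same log-linear identity, 100(e^β − 1), and the quoted figures can be reproduced directly:

```python
import math

# In a log-linear model, a one-unit change in predictor x_j multiplies the
# predicted price by exp(beta_j); the percentage change is 100*(exp(beta_j)-1).
coefs = {"XOI": 0.00051, "V": -1.608e-09, "VIX": -0.007607,
         "DFF": 0.03756, "TNX": -0.1047}
pct = {name: 100 * (math.exp(b) - 1) for name, b in coefs.items()}
# XOI -> +0.0510%, VIX -> -0.7578%, DFF -> +3.8274%, TNX -> -9.94%
```

For coefficients very close to zero (such as V), the percentage change is numerically almost identical to 100β, which is why the two readings of that coefficient agree.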
Lastly, we consider the intercepts of the model represented by the dummy variables for each individual stock. Since the coefficients for all but one of the dummy variables are significant, this indicates significant differences in the closing stock prices across the individual stocks.
Before the modeling process was started, the data were split into the training data, from which the model is built, and the testing data. Since the testing data were not used in building the model, we compare the values predicted for the testing data by the chosen varying intercept regression model against the actual testing data. From Figure 5.4, we see that there is indeed a rather strong correlation between the predicted and actual values. However, the relationship does not appear to be as strong as the one seen for the pooled regression model. This makes sense since the varying intercept model has a smaller coefficient of determination.

Now that each of the three models has been thoroughly explored, we compare the benefits and drawbacks of the models. Table 6.1 gives the equations for the pooled regression, time series, and varying intercept regression models; the pooled regression model, for example, is

Y_t = β_0 + β_1 WIL_t + β_2 XOI_t + β_3 VIX_t + β_4 HUI_t + β_5 TNX_t + β_6 DFF_t.

First let us consider the time series model against the regression models. The benefit of the regression models over the time series model is that multiple regression allows the consideration of other variables as predictors and provides insight on the relationship between the closing stock price and these additional factors. A benefit of the time series model over the regression models is that the time series better fits the nature of the data, where the closing price is highly correlated with the closing price of the previous week. For this data, more of the underlying assumptions of the time series model are satisfied than those of linear regression. The time series model also uses previous observations, which are more readily available than the current data used for prediction in the regression models. A benefit of the regression models is that time is not used as a variable but rather as an index. In other words, the model only requires knowledge of the predictor values for the week of interest, not the time relative to other data points.
Now that we have compared the time series against the regression models, we consider the differences between the two regression models. The pooled regression model uses the median or 'pooled' weekly closing stock price over all of the stock indexes considered from the NYSE.
The benefit of this is that the model can be used to give a comprehensive overview of the trends of these selected stocks. The drawback to pooling the data is that the model cannot be used to predict individual stocks. On the other hand, the varying intercept model considers each stock individually, which allows investors interested in specific NYSE stocks to compare the trends of each. However, the varying intercept model includes the stock index as a categorical variable, which adds 84 dummy variables to the model, making it much more complex than the pooled regression. Finally, the regression models can be compared by their predictive ability by considering the coefficient of determination, the amount of variation in the closing price that is explained by each model. The amount of variation explained by the pooled model is 96.61%, while the amount explained by the varying intercept model is much lower, at 64.28%. In conclusion, the pooled model gives a general, comprehensive view of the selected NYSE stocks with high predictive power, while the varying intercept model gives more in-depth information on individual stocks at the cost of lower predictive ability.

Further Considerations
In the future, it would be interesting to consider different models, such as neural networks, to predict the weekly closing stock price for each index, since these types of models have more lenient assumptions. This could yield a more appropriate analysis, since the regression models do not meet all of the underlying model assumptions.
It would also be interesting to consider individual time series models for each stock index rather than a single model for the pooled median price. Such a study would be useful to investors interested in a specific stock among those selected from the NYSE.
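As a rough sketch of that idea, one could fit a separate autoregressive model to each stock's series. The AR(1) least-squares fit below is a simplified stand-in for a full per-stock ARIMA order search, and the series are synthetic:

```python
import numpy as np

def fit_ar1(series):
    """Return (c, phi) of y_t = c + phi * y_{t-1}, fitted by least squares.
    A stand-in for a full ARIMA fit with per-stock (p, d, q) selection."""
    y, ylag = series[1:], series[:-1]
    X = np.column_stack([np.ones(len(ylag)), ylag])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # (intercept, AR coefficient)

# Three synthetic "stocks", each an AR(1) series of 939 weeks.
rng = np.random.default_rng(4)
stocks = {}
for i, phi in enumerate([0.9, 0.7, 0.5]):
    y = np.zeros(939)
    for t in range(1, 939):
        y[t] = phi * y[t - 1] + rng.normal(0, 0.1)
    stocks[f"stock_{i}"] = y

fits = {name: fit_ar1(series) for name, series in stocks.items()}
# One-step-ahead forecast for each stock: c + phi * last observation.
forecasts = {name: c + phi * stocks[name][-1]
             for name, (c, phi) in fits.items()}
```

In practice each stock would get its own order selection (for example by AIC, as in the pooled time series model) rather than a fixed AR(1).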