1. Introduction
Stock market performance is one key indicator of the economic condition of a country [
1]. A stock market is a public and open market in which stocks and derivatives of a company are traded [
2]. Stock prices fluctuate in the stock market because there are many factors influencing the stock market, such as general economic conditions, political events, and traders’ expectations [
3]. Moreover, stock price movements are characterized by nonlinearities and high-frequency fluctuations. Stock price forecasting is important for investors and stockbrokers. Thus, many researchers and financial analysts have tried to predict stock price trends with various techniques proposed over the years [
4].
A time series model is often used to forecast stock prices. Such a model is built from past observations of the same variable and is then used to predict the future trend; it does not require information on other variables. One of the most efficient and widely used forecasting techniques for time series in social science is the autoregressive integrated moving average (ARIMA) model. The popularity of this model is owed to its statistical properties and the well-known Box–Jenkins methodology [
5] in the modeling process. Recently, based on an analysis of three stock markets, Alshawarbeh and Abdulrahman [
6] revealed that the hybrid ARIMA-ANN is overall superior to individual ANN and ARIMA models. Chen et al. [
7] built an advanced hybrid model based on ARIMA to predict the prices of game stocks. ARIMA models are relatively more efficient and robust than complex structural models for short-run forecasting [
8]. However, there is a major limitation for ARIMA, that is, the model presumes a linear correlation structure, and the nonlinear structure cannot be captured by the ARIMA model effectively. Thus, it is not satisfactory for many complex real-world problems. To circumvent this shortcoming and to model the nonlinear relationships observed in the real world, some nonlinear models such as the autoregressive conditional heteroscedastic (ARCH) model [
9] and the threshold autoregressive (TAR) model [
10] have been proposed. However, these nonlinear models only apply to some specific nonlinear relationships [
11], and they may not work for other nonlinear structures of time series.
On the other hand, artificial neural networks (ANNs) and support vector machines (SVMs) are widely used in stock price forecasting because of their ability to capture the nonlinear characteristics in the data. Stock volatility plays a crucial role in asset pricing models, portfolio management, and trading strategies. D’Ecclesia and Clementi [
12] found that the artificial neural network (ANN) was the most accurate in tracking the implied stock volatility. Based on the characteristics of the same stock in different periods and the characteristics of different stocks in the same period, an adaptive SVR was presented by Guo et al. [
13]. However, these methods still have some defects. They are good at capturing the nonlinear features in the data but may neglect some linear features.
In recent years, with the development of computer technology, deep learning has become a popular approach, which has been applied in image recognition, autonomous driving, and other fields. It has also been applied in stock forecasting. Li et al. [
14] proposed a clustering-enhanced deep learning framework to improve the accuracy of stock price prediction. Agrawal et al. [
15] built an evolutionary deep learning model (EDLM), which is used to identify stock price trends. The model implements the deep learning model and establishes the concept of the related tensor. Although the deep learning algorithm shows good accuracy in stock price prediction, it has many internal parameters, and the process of adjusting parameters is complex and time-consuming.
Because of the non-stationary and chaotic characteristics of the stock price data [
16], the time series data of stock prices may contain both linear and nonlinear features. Therefore, the combination of linear and nonlinear models can account for the complexity in modeling the stock market and overcome the limitation of individual technology [
17]. Pai et al. [
18] proposed a hybrid methodology based on the SVM model and ARIMA model to predict stock price. Based on the stock prices of 10 companies, it was found that the hybrid methodology performed better than the ARIMA model or the SVM model in predicting stock prices. Hajirahimi and Khashei proposed a new parallel hybrid model to implement an integrated hybrid framework in which all pure linear and nonlinear patterns in real time series can be appropriately simulated [
19]. Li et al. [
20] built some combined prediction models including ARIMA + SVM and ARIMA + ANN based on the artificial intelligence technique.
Adapting to the advantages and disadvantages of the algorithm or the characteristics of the data, a combined model can overcome the limitations of the single techniques and improve the prediction effect. Although some combination models have been applied in stock price forecasting, it is difficult to find an effective model with strong adaptability for the time series of stock prices with nonlinearity, interruption, and high-frequency fluctuation. Therefore, investors and financial institutions are keen to seek more effective forecasting models.
Gaussian process regression (GPR) is a highly valuable technique in the field of machine learning, and it is widely applied in many nonlinear problems [
21,
22,
23,
24] due to its advantages of being flexible, probabilistic, and nonparametric in the nonlinear modeling process. A GPR model is completely specified by its mean function and covariance function. Therefore, compared with neural networks and support vector machines, GPR has far fewer parameters, and its parameter optimization and convergence are easier. However, little research has been conducted on applying the Gaussian process regression algorithm in the field of stock market prediction.
In this paper, the ARIMA model which deals with linear characteristics and the GPR model which focuses on nonlinear characteristics are combined to form a new stock price forecasting model. It is expected that both linear and nonlinear characteristics of stock price movements can be captured by the proposed model, and better forecasting results can thus be obtained.
Specifically, the ARIMA model is first fitted to the sample data to extract the linear features and produce a forecast. The residual sequence is then obtained as the difference between the original data and the ARIMA fitted values. Next, the GPR model is trained on the residual sequence to capture the nonlinear characteristics of the data. Finally, the ARIMA results are corrected with the GPR output to improve the accuracy of forecasting, which is important for making investment decisions.
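The two-stage procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: a least-squares AR(2) fit stands in for the full ARIMA model, the GPR step uses a single RBF covariance with fixed hyperparameters, the GPR input is assumed to be the previous residual, and the price series is synthetic. All function names are hypothetical.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares AR(p) fit, used here as a simple stand-in for ARIMA."""
    n = len(y)
    lags = [y[p - k - 1 : n - k - 1] for k in range(p)]  # lag 1 .. lag p
    X = np.column_stack([np.ones(n - p)] + lags)
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef, X @ coef  # coefficients and in-sample fitted values

def rbf(a, b, length, var=1.0):
    """Squared-exponential covariance between two 1-D input arrays."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return var * np.exp(-0.5 * d2 / length**2)

def gp_mean(x_train, y_train, x_new, length, noise=1e-2):
    """Posterior mean of a zero-mean GP with fixed (assumed) hyperparameters."""
    K = rbf(x_train, x_train, length) + noise * np.eye(len(x_train))
    alpha = np.linalg.solve(K, y_train)
    return rbf(x_new, x_train, length) @ alpha

# Synthetic series standing in for closing prices (assumption, not real data).
rng = np.random.default_rng(0)
t = np.arange(300)
y = 10 + 0.01 * t + np.sin(t / 5.0) + 0.1 * rng.standard_normal(300)

# Stage 1: linear model and its residual sequence.
p = 2
coef, fitted = fit_ar(y, p)
resid = y[p:] - fitted

# Stage 2: GP maps the previous residual to the next residual (assumed design).
length = resid.std()
resid_next = gp_mean(resid[:-1], resid[1:], resid[-1:], length)[0]

# Stage 3: one-step-ahead hybrid forecast = linear forecast + residual correction.
linear_next = coef[0] + coef[1:] @ y[-1 : -p - 1 : -1]
hybrid_next = linear_next + resid_next
```

In practice the linear stage would be a proper ARIMA fit and the GPR hyperparameters would be optimized rather than fixed; the sketch only shows how the residual correction is added to the linear forecast.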
The rest of the paper is organized as follows. The stock price forecast model framework is developed in
Section 2. The methodology is described in
Section 3.
Section 4 presents the results and analysis.
Section 5 summarizes the paper with some concluding remarks.
2. Stock Price Forecast Model Framework
Figure 1 shows the framework of the stock price forecast model consisting of four steps.
In
Figure 1, step 1 is data acquisition. This paper selects the closing prices of Beijing Oriental Electronics Technology Group Co., Ltd. (BOE) (Beijing, China), Gree Electric Appliances (GREE) (Zhuhai, China), and Zhongxing Telecommunication Equipment Corporation (ZTE) (Shenzhen, China) on the Shanghai Stock Exchange (SSE) (Shanghai, China) from 4 January 2007 to 30 September 2017, a total of 3923 trading days for the stock price forecast, of which 3893 trading days from 4 January 2007 to 31 August 2017 are the training set, and the remaining 30 trading days in September 2017 are the test set.
Step 2 is modeling. Stock price time series contain both linear and nonlinear relationships. The ARIMA model, which is suitable for capturing linear features, is specified by preliminarily estimating the ranges of the autoregressive order p and the moving average order q from the plots of the autocorrelation function (ACF) and partial autocorrelation function (PACF). At the same time, two different covariance functions are combined to form a GPR model (GPRC) that focuses on nonlinear characteristics, and a mixed model of the ARIMA model and the GPRC model is then constructed for the stock price forecast.
Step 3 is the forecast. The forecast performance of the GPR model with a single covariance function (GPRS) and that of the GPR model with a combined covariance function (GPRC) are compared based on the three stocks.
Step 4 is evaluation. The mean absolute error (MAE), mean absolute percentage error (MAPE), mean square error (MSE), and root-mean-square error (RMSE) are common indicators of the forecast model. To verify the effectiveness of the proposed hybrid model, the mixed model (ARIMA + GPRC) is compared with the ARIMA, GPRC, and ANN models.
4. Results and Analysis
The Shanghai Stock Exchange (SSE) is one of the two stock exchanges in Mainland China, and it is one of the largest stock markets in the world. More and more investors from both China and abroad are attracted to the SSE for the potential to gain high returns. Thus, it is important to understand the stock price movements in the SSE, and stock price forecasting has been an important research topic. In this paper, we therefore use published stock price data from the SSE.
MATLAB R2018a software is used for testing and validation. The proposed hybrid model and the GPR model are implemented with the help of the GPML (Gaussian processes for machine learning) toolbox. The neural network toolbox in MATLAB R2018a is adopted for building the ANN model.
4.1. Data Set and Evaluating Indicators
In this paper, we use the historical daily close prices from the SSE, as the close prices reflect all the trading activities of the day. Three companies listed on the SSE are selected: BOE, GREE, and ZTE for this study. The sample period is from 4 January 2007 to 30 September 2017.
Figure 2,
Figure 3 and
Figure 4 display the time series of the closing prices for BOE, GREE, and ZTE, respectively. Each data set is partitioned into two parts: the data from 4 January 2007 to 31 August 2017, for training and the remaining samples (September 2017) for testing. The training data set is used to build the forecasting models, and the effectiveness of the forecasting models is assessed based on the test data set.
Table 1 shows the detailed information of the training and testing data set.
To evaluate the performance of the hybrid model quantitatively, four performance indicators which measure the deviation between the predicted and the observed values are selected: the mean absolute error (MAE), mean absolute percentage error (MAPE), mean square error (MSE), and root-mean-square error (RMSE). Smaller values of these measures usually indicate better predictive performance. They are given as
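\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|,
\]
\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}},
\]
where $y_i$ and $\hat{y}_i$ are the observed and predicted stock prices at time $i$, respectively, and $n$ denotes the number of predicted data points.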
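As a small illustration, the four indicators can be computed from their usual definitions as follows. The function name and the sample values are hypothetical, and expressing MAPE in percent is an assumption.

```python
import numpy as np

def forecast_errors(y_true, y_pred):
    """Return MAE, MAPE (in percent), MSE, and RMSE for two equal-length series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err**2)
    return {
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err / y_true)) * 100.0,  # observed values must be nonzero
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }

# Hypothetical observed and predicted prices, for illustration only.
scores = forecast_errors([10.0, 20.0, 40.0], [11.0, 18.0, 40.0])
```

MSE and RMSE penalize large deviations more heavily than MAE, while MAPE is scale-free, which makes it convenient for comparing stocks with different price levels.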
4.2. Comparison of GPR Model with Single vs. Combined Covariance Function
In order to select the more appropriate covariance function of the GPR model, we compare the forecasting performance of the GPR model with a single covariance function (GPRS) and that of the GPR model with a combined covariance function (GPRC).
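To make the idea of a combined covariance function concrete, the sketch below adds two standard kernels. The specific kernels the authors combine are not stated in this section, so the choice of a squared-exponential plus a linear kernel, along with all names and hyperparameter values, is an illustrative assumption. The sum of two valid covariance functions is itself a valid covariance function, which the eigenvalue check confirms numerically.

```python
import numpy as np

def rbf_kernel(a, b, length=1.0, var=1.0):
    """Squared-exponential covariance: captures smooth local variation."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return var * np.exp(-0.5 * d2 / length**2)

def linear_kernel(a, b, var=1.0):
    """Linear covariance: captures a global linear trend."""
    return var * np.outer(a, b)

def combined_kernel(a, b):
    """Sum of the two kernels above (assumed combination, for illustration)."""
    return rbf_kernel(a, b) + linear_kernel(a, b)

x = np.linspace(0.0, 1.0, 5)
K = combined_kernel(x, x)

# A valid covariance matrix is symmetric with nonnegative eigenvalues.
eigvals = np.linalg.eigvalsh(K)
```

A single-kernel model (GPRS in the paper's terminology) would use only one of the two component functions, whereas the combined model can represent both kinds of structure at once.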
Figure 5,
Figure 6 and
Figure 7 show the forecast results by a single and combined covariance function (GPRS and GPRC models) for the three selected stocks. From
Figure 5a,
Figure 6a and
Figure 7a, it can be seen that both GPRC and GPRS can obtain curves with similar trends to the actual curve, but the predicted curve obtained by GPRC is closer to the actual curve than GPRS. In
Figure 5b, the absolute errors obtained by GPRS and GPRC are both very small, with the maximum absolute error not exceeding 0.1. Among the 20 testing samples, 15 of the absolute error values obtained by GPRC are less than those of GPRS.
Figure 6b shows the absolute errors of GPRS and GPRC on the test set. Except for the error value greater than 1 in the 14th test sample, all others are less than 1. Although the error values of 11 samples obtained by GPRC are greater than those of GPRS, 8 of them are only slightly greater. In
Figure 7b, the absolute errors obtained by GPRS and GPRC are also very small, and the maximum absolute error does not exceed 0.1, and for the 20 samples tested, 14 of the absolute error values obtained by GPRC are less than GPRS. Therefore, it is clear from
Figure 5,
Figure 6 and
Figure 7 that the forecasting accuracy of GPRC is higher than that of GPRS.
Table 2,
Table 3 and
Table 4 present the evaluation indicators of the forecasting performance. In
Table 2,
Table 3 and
Table 4, for BOE and GREE, all indicators of GPRC are smaller than those for GPRS. For ZTE, the same holds for all indicators except MAE. Overall, the forecasting accuracy of the GPRC model is better than that of the GPRS model.
4.3. Forecasting Results of Various Models and Comparative Analysis
In order to validate the forecasting accuracy of the proposed model, we compare it with a few other models including ARIMA, GPRC, and an ANN. Because the GPRC model outperforms GPRS, as shown above, GPRS is not considered in this section.
For the ANN model, we employ a three-layer (one of which is a hidden layer) perceptron model, because it can approximate any continuous function in a reasonable way [
26]. The number of hidden nodes is the integer number closest to log(
n), where
n is the number of training observations [
26]. Five input variables are grouped into one vector as the input for day t − 1: the daily high price (H_{t−1}), daily low price (L_{t−1}), open price (O_{t−1}), daily close price (C_{t−1}), and trading volume (V_{t−1}). The output variable is the close price for day t. For the GPRC model, the input and output variables are the same as those for the ANN model.
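The input and output construction described above can be sketched as follows. The function names and the toy values are hypothetical, and the base of the logarithm in the hidden-node rule is not stated in the text, so the natural logarithm is assumed here.

```python
import math
import numpy as np

def make_lagged_dataset(high, low, open_, close, volume):
    """Pair the five variables of day t-1 with the close price of day t."""
    features = np.column_stack([high, low, open_, close, volume]).astype(float)
    X = features[:-1]                        # inputs: (H, L, O, C, V) on day t-1
    y = np.asarray(close, dtype=float)[1:]   # target: close price on day t
    return X, y

def hidden_nodes(n_train):
    """Integer closest to log(n); natural log is an assumption."""
    return max(1, round(math.log(n_train)))

# Toy four-day series, for illustration only.
high = [10.5, 10.8, 11.0, 10.9]
low = [10.0, 10.2, 10.6, 10.4]
open_ = [10.1, 10.4, 10.7, 10.8]
close = [10.4, 10.7, 10.8, 10.5]
volume = [1200, 1500, 1100, 1300]

X, y = make_lagged_dataset(high, low, open_, close, volume)
n_hidden = hidden_nodes(3893)  # 3893 training days, as stated in Section 2
```

Under the natural-log assumption, 3893 training observations give eight hidden nodes, since ln(3893) is approximately 8.27.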
Figure 8,
Figure 9 and
Figure 10 show the forecast values and absolute errors across different models (ARIMA, GPRC, ANN, ARIMA + GPRC) for the three selected stocks, respectively, and the evaluation indicator values of the four models are shown in
Figure 11,
Figure 12 and
Figure 13.
The forecasting curves and the absolute errors of four forecasting models on BOE are shown in
Figure 8. In
Figure 8a, all four models can obtain the same trend as the actual curve, and overall, the ARIMA + GPRC curve is closer to the actual curve. In
Figure 8b, the absolute error obtained by the four models is very small, with a maximum error of no more than 0.3. Among the 20 test samples, ARIMA + GPRC has 7 samples with less error than the other three models, ARIMA has 4 samples with less error than the other three models, and GPRC has 8 samples with less error than the other three models. A further analysis of the evaluation indicators in
Figure 11 shows that, except for the RMSE, which is slightly greater than that of GPRC, all indicators obtained by ARIMA + GPRC are smaller than those of the ARIMA, ANN, and GPRC models. Therefore, for the closing price forecast of BOE, the forecasting accuracy of the ARIMA + GPRC model is better than that of the three models compared.
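The per-sample comparison used in this discussion (how many test samples each model predicts with the smallest absolute error) can be tallied as in the sketch below. The function name and the error values are hypothetical and do not reproduce the paper's data; ties are resolved in favor of the model listed first.

```python
import numpy as np

def count_wins(abs_errors):
    """abs_errors maps model name -> absolute errors, one per test sample.

    Returns how often each model has the smallest error on a sample.
    """
    names = list(abs_errors)
    table = np.column_stack([abs_errors[name] for name in names])
    winners = np.argmin(table, axis=1)  # index of the best model per sample
    counts = np.bincount(winners, minlength=len(names))
    return dict(zip(names, counts.tolist()))

# Hypothetical absolute errors on three test samples.
wins = count_wins({
    "ARIMA": [0.30, 0.10, 0.20],
    "GPRC": [0.20, 0.20, 0.10],
    "ARIMA+GPRC": [0.10, 0.30, 0.05],
})
```

Win counts complement the aggregate indicators: a model can win on many samples yet still have a larger RMSE if its few losses are large.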
In
Figure 9a, four models can obtain the same trend as the actual curve of the closing price forecast for GREE; the forecasting curves of the first 12 samples obtained by ARIMA + GPRC are the closest to the actual curves among the four models. In
Figure 9b, the absolute error obtained by the four models is small; except for samples 16, 19, and 20, the absolute errors of all samples are less than 1. Among the 19 test samples, ARIMA + GPRC has 7 samples with less error than the other three models, ARIMA has 3 samples with less error than the other three models, and GPRC has 7 samples with less error than the other three models. In
Figure 12, all evaluation indicators obtained by ARIMA + GPRC are smaller than those of the other three models. Therefore, for the closing price forecast of GREE, the ARIMA + GPRC model is superior to the three models compared.
From
Figure 10a, it can be seen that all four models can obtain trends similar to the actual curve and that, except for samples 15, 19, and 20, ARIMA + GPRC obtains the curve closest to the actual curve. In
Figure 10b, the absolute error obtained by the four models is very small; except for samples 16, 18, and 20, the absolute errors of all samples are lower than 1. Among the test samples, ARIMA + GPRC, ARIMA, GPRC, and ANN have nine, two, two, and seven samples with less error than the other three models, respectively. It is also clear from
Figure 13 that the ARIMA + GPRC model obtains the smallest value of all evaluation indicators among the four models. For the closing price of BOE, the MAPE obtained by ARIMA + GPRC was reduced by 24.1%, 54.7%, and 4.9% compared with the ARIMA, ANN, and GPRC models, respectively. For the GREE closing price, the MAPE of ARIMA + GPRC was reduced by 31.2%, 22.2%, and 4.6% compared with the ARIMA, ANN, and GPRC models, respectively. For the ZTE closing price, the MAPE of the ARIMA + GPRC model was likewise reduced by 38.7%, 36.5%, and 23.9% compared with the ARIMA, ANN, and GPRC models, respectively. Therefore, the ARIMA + GPRC model has the best forecasting accuracy among the four models.
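One common way to express such comparisons is the relative change in MAPE with respect to each baseline model. The text does not state the formula behind the reported percentages, so the definition below is an assumption, and the MAPE values are hypothetical.

```python
def relative_reduction(baseline_mape, hybrid_mape):
    """Percentage by which the hybrid model lowers MAPE relative to a baseline."""
    return (baseline_mape - hybrid_mape) / baseline_mape * 100.0

# Hypothetical MAPE values (in percent), for illustration only.
improvement = relative_reduction(baseline_mape=2.0, hybrid_mape=1.5)
```

With these illustrative numbers, a baseline MAPE of 2.0 and a hybrid MAPE of 1.5 correspond to a 25% relative reduction; a positive value means the hybrid model has the smaller error.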
5. Conclusions
Over the years, a lot of research has been devoted to stock price forecasting which is critical for investment decisions. However, there are hardly any effective forecasting models. Thus, it is necessary to continue to study how to improve the effectiveness of forecasting models. In addition, it is usually difficult to predict stock prices accurately due to the complexity and volatility of the stock market.
This paper presents a new hybrid model which combines ARIMA with GPR to improve stock price forecasting performance. In order to select more suitable covariance functions, the GPR model with different types of covariance functions is evaluated. It is found that GPR with a combined covariance function outperforms GPR with a single covariance function. In the proposed hybrid model, the ARIMA model captures the linear structure of the stock price series, and the nonlinear structure is modeled by GPRC. The validity of the ARIMA + GPRC model was verified using three actual data sets of daily closing prices. The results indicated that, compared with ARIMA, ANN, and GPRC, the proposed hybrid model gave the best forecasting performance in most cases in terms of MAE, MAPE, MSE, and RMSE. In summary, it can be concluded that the proposed method is an effective way to improve forecasting performance, which is beneficial to investors for investment decisions and risk management.
Many factors affect stock prices, but in this study only the influence of historical closing prices was considered. Other factors influencing the stock market will be taken into account in future research to clarify their roles and to deepen the study of this area.