Stock Market Prediction Using Machine Learning Algorithms: A Classification Study
Abstract- Predicting the stock market has long been an area of interest not only for traders but also for computer engineers. Predictions can be made in two main ways: by using the historical data available for a stock, or by analysing social media information. Predictions based on historical data lack accuracy because the patterns in the stock market keep changing. Also, some fields may be missing because they are insignificant for some stocks or because the data is unavailable. For example, some models may require ‘return rate’ as a parameter for stock prediction, but the available data might not include it. On the other hand, a model that predicts only on the basis of the return rate may find the opening and closing prices to be insignificant parameters. The data therefore has to be cleansed before it can be used for prediction. This paper categorises the methods used for predictive analytics in different domains to date and discusses their shortcomings. Further, the authors suggest some improvements that could be incorporated to achieve better accuracy in these approaches.

Keywords- Data Analysis, Machine Learning, Predictive Analysis, Stock Prediction, Linear Regression.

I. INTRODUCTION

The stock market is one of the factors that symbolize a country’s economy. Few people excel at correctly understanding the changing trends of stocks, and thus many people are afraid of investing in them. The notion that the study of stocks belongs solely to economics and finance has been broken by data science (data analytics) through its scope for prediction. Data analytics involves interpreting a large volume of data and inferring results from it. The procedure of converting raw data into optimized information involves descriptive analytics, diagnostic analytics, predictive analytics and prescriptive analytics. Descriptive analytics gives patterns based on past performance, which offer insight into the data; a periodic profit and loss statement is one instance. Diagnostic analytics identifies the reason for the occurrence of a particular past event from the results obtained by descriptive analytics. Predictive analytics foretells the behaviour of a stock in the near future by extrapolating from historical data. Prescriptive analytics works as a feedback system after the previous analytics techniques have been applied. This paper takes account of only predictive analytics for Stock Market Prediction. Prediction of stock market behaviour helps financiers to act with greater certainty, to take into account the threat and volatility of an investment, and to know the right time to buy or sell stocks.

This paper elaborates the technical analysis [16] of prediction, which is performed on historical stock data by applying Machine Learning (ML) algorithms. However, predictive analysis can be performed in one more way, namely fundamental analysis, which is achieved by applying sentiment analysis to the information exchanged on social media.

Prediction is based on the existing stock data, which includes the previous opening price, closing price, highest price, lowest price, adjusted closing price and volume of the security traded [13].

The objective of the proposed research project is to use the existing data to build a model with machine learning algorithms. The model will be helpful in predicting future outcomes for a particular stock. The development is being done from the customer’s point of view, so that customers can invest in the stock market while avoiding risk as much as possible. Different mathematical models such as Neural Network (NN), Linear Regression (LR), Naïve Bayesian and Support Vector Machine (SVM) are used for getting the best results. The major contribution of this work is to help bring stock market predictions as close to reality as possible.

II. LITERATURE REVIEW

Hu Z. et al. (2013) [3] discussed the prediction of the stock market using SVM. SVM is a sound choice because it gives a unique solution and does not get trapped in local minima. The authors worked on a dataset from the Federal Reserve Bank of St. Louis covering data for 15 companies. There is no definitive way to define a good or poor investment, so they considered a stock a good investment if the company's stock price surges over a period of time. The study demonstrated that SVM produces results with good accuracy on data outside the training sample.
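The labelling rule described for [3] above (a stock counts as a good investment if its price surges over a period of time) can be sketched as follows. This is a minimal illustration rather than the procedure of Hu Z. et al.: the function name `label_good_investment`, the look-ahead `horizon` and the sample prices are all assumptions introduced here.

```python
from typing import List, Optional


def label_good_investment(closes: List[float], horizon: int = 3) -> List[Optional[int]]:
    """Label day t as 1 if the close `horizon` days later is higher, else 0.

    The last `horizon` days have no future price to compare against,
    so they are labelled None. `horizon` is an illustrative assumption,
    not a value taken from [3].
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    labels: List[Optional[int]] = []
    for t, price in enumerate(closes):
        future = t + horizon
        if future >= len(closes):
            labels.append(None)  # no future observation available
        else:
            labels.append(1 if closes[future] > price else 0)
    return labels


# Made-up closing prices, for illustration only.
closes = [100.0, 102.0, 101.0, 105.0, 103.0, 99.0]
labels = label_good_investment(closes, horizon=2)
print(labels)  # [1, 1, 1, 0, None, None]
```

Labels of this kind would serve as the target for a classifier such as SVM, with the unlabelled trailing days dropped before training.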
International Conference on Recent Innovations in Electrical, Electronics & Communication Engineering - (ICRIEECE)
Zhao L. and Wang L. (2015) [7] developed an outlier Data Mining algorithm for predicting fluctuations in the Stock Market. The paper examined whether a stock trend can be predicted from the anomalies observed in historical financial data (in this case, tick-by-tick data). The data was pre-processed to find the anomalies, and a clustering algorithm was then applied to predict the stock trends.

Attigeri G. V. et al. (2015) [12] introduced concepts of financial derivatives, such as the “no arbitrage” principle, and of the prediction model, such as random walk theory and the efficient market hypothesis (EMH). They used the semi-strong form of EMH because of the type of data available. News articles were analysed and the prediction was made on their basis. The steps involved in prediction were data preparation, analysis, aggregation and visualisation. The outputs were positive, negative or neutral. Data were collected on a daily basis and predictions were made using Logistic Regression. The accuracy obtained was 70%. A generalized linear model (LM) of the binomial family can be used as a logistic regression model.

Kavitha S. et al. (2016) [4] compared the LR model and the Support Vector Regression model for prediction based on historical data. They used the LeastMedSq function and the SMOReg function, respectively, for the two regression techniques. The metrics for comparison were mean absolute error (MAE) and root mean squared error (RMSE). When these models were applied to the regression data, the LeastMedSq function proved the better fit for predicting the values, although it took more time than SMOReg.

Kim H. and Han S. T. (2016) [14] proposed an enhanced form of Random Forest (RF) and compared its predictive capacity against the normal RF algorithm. In the enhanced RF, the change in closing price was used as a weight in the bootstrap methodology when drawing the training dataset. They then compared the performance of both types of RF on the basis of the average accuracy for various time periods. They also proposed an ensemble method for the classification data.

Kumar P. and Bala A. (2016) [9] discussed a binary classification problem using different machine learning algorithms, focusing especially on Decision Tree (DT), LM and RF. They applied all three approaches and compared them by plotting a curve of sensitivity against false positive rate (FPR) at different settings; they then calculated the area under the curve (AUC) and the accuracy. After evaluating all the cases, they found that RF gave the most accurate solution, with the AUC value nearest to 1, ahead of DT and LM.

Rajput V. and Bobde S. (2016) [10] used a hybrid approach for predicting trends. It was based on a set of relevant metrics, and the prediction used clustering algorithms. It was observed that noise creeps in easily when data for a very small period of time is taken as reference. That issue was solved by using the hybrid approach, which showed itself to be better than individual classification methods in predicting the stock market.

Verma R. et al. (2017) [6] discussed the prediction of the stock market using Artificial Neural Networks (ANN) and also described the theoretical idea of ANN and its salient features. The paper reported that ANN shows fairly accurate results unless there is variation in the actual data. In [12], the authors applied a feed-forward pass and then updated the values using backward propagation to get better results. These results show that weights should be assigned to the synapses in accordance with the respective errors in each pattern, and that values should be normalised according to the activation function.

Tiwari S. et al. (2017) [11] used different mathematical models and algorithms, such as the Polynomial Model, Radial Basis Function (RBF) NN, Multi-layer Perceptron NN and the LR method. They compared all of these models and found that Feed Forward NN gave the most accurate results for the opening price of the stock. They also discussed the strategy of “Buy low, Sell high”. To identify the most accurate model, various factors such as mean error (ME), RMSE, mean absolute percentage error (MAPE) and the first-order auto-correlation coefficient were calculated. They obtained the average opening price of the stock.

Bhuriya D. et al. (2017) [1] discussed prediction using regression techniques. They compared the LR, Polynomial Regression and RBF Regression models on the basis of their Confidence Values (CV). The analysis showed that LR works better than Polynomial Regression and RBF Regression, and that it gave the most appropriate closing price of the stock on a daily basis.

Sharma A. et al. (2017) [5] used regression modelling to make predictions. The model was redefined each time the degree of the problem changed. The experimentation was performed on Polynomial Regression, RBF Regression, Sigmoid Regression and LR. For prediction purposes, LR proved to be the most suitable, and the models were fitted using the least squares approach.

Guo P. et al. (2017) [2] discussed the high dimensionality of the data encountered when working on stock market data. LR was the base model, and Principal Component Analysis (PCA) [15] was applied along with it. PCA helped to reduce the dimensionality of the data by finding principal components. Hence, the number of points to be considered was reduced significantly. It was observed that accuracy increases when PCA is used with LR.
III. CATEGORIZATION OF MACHINE LEARNING ALGORITHMS APPLIED FOR PREDICTIVE ANALYSIS

TABLE 1: ANALYSIS OF ML ALGORITHMS

Algorithm Used          Model Applied              Metric     Value
LR [2]                  Classification             RMSE       Without PCA = 16.43; With PCA = 1.4
SVM [3]                 Non-Linear Classification  Accuracy   96.15 %
SVM [4]                 Regression                 RMSE       12.8725
LR                      Regression                 RMSE       12.54
LR [1]                  Regression                 CV         0.97
Polynomial              Regression                 CV         0.46
RBF                     Regression                 CV         0.56
DT [9]                  Binary Classification      Accuracy   51.87 %
LM                      Binary Classification      Accuracy   52.83 %
RF                      Binary Classification      Accuracy   54.12 %
NN [6]                  -                          Accuracy   88 %
Feed Forward NN [11]    -                          MAPE       1.81 %
LM                      -                          MAPE       6.848 %
ARIMA                   -                          MAPE       2.20 %
Auto ARIMA              -                          MAPE       2.07 %
Multilayer Perceptron   -                          MAPE       0.605 %

IV. LIMITATIONS OF EXISTING APPROACHES

The paper by Attigeri G. V. et al. (2015) [12], which introduced the concepts of “no arbitrage” and EMH, lacks a large amount of testing data. Testing was performed only on a small dataset, which produced lower accuracy than that obtained on the training dataset. To get more realistic results, testing must be performed on a larger dataset.

The work performed by Kavitha S. et al. (2016) [4] clearly presents the comparison between LR and Support Vector Regression on the basis of RMSE, but it does not focus on the time taken to build the model. Although LR has a lower error rate than Support Vector Regression, it takes more time to build. Also, the performance on the main evaluation parameter is not high; a model with a lower error rate would produce better results.

The predictive data mining techniques given by Kumar P. and Bala A. (2016) [9] compare three different models, namely DT, LM and RF, but all of them lack a high accuracy rate. An ensemble of these models may produce a better result, and it should be evaluated to obtain more realistic and accurate results.

The hybrid approach for Stock Market Prediction by Rajput V. and Bobde S. (2016) [10] does not take into consideration the effect of other stocks on a particular stock. In the prediction using regression techniques by Bhuriya D. et al. (2017) [1], the dataset used for testing is taken out of the initial data collected, and the sample size used is 10% of the population size. If the same analysis is done on large datasets, the results may vary.

The discussion about the reduction of dimensionality before applying predictive analysis by Guo P. et al. (2017) [2] takes into consideration three stock exchanges: the London, New York and Karachi Stock Exchanges. It is expected that the accuracy of prediction will increase after applying PCA, as PCA takes into account only those eigenvectors which are needed to describe almost 98% of the variance. However, the decrease in accuracy for the Karachi Stock Exchange after applying PCA clearly suggests that selecting the right parameters and tuning them to the correct values are important for accurate prediction. Also, while reducing the dimensionality, the important features must be kept intact for better results.

V. CONCLUSION

The main aim of the proposed research study is to extract, from the existing pool, the model having the highest accuracy and minimal error metrics for stock market prediction. Therefore, many algorithms have been compared for this purpose. The predictions made on classification data using LR demonstrate that accuracy is largely improved when PCA is applied to the data, which further assists the prediction task, as stock market data is known to have high dimensionality. SVM shows high accuracy on non-linear classification data, whereas LR is the preferred algorithm if the available model is that of regression, as it has a high confidence value. RF shows high accuracy on the binary classification model, and the multilayer perceptron offers the least error in prediction. Hence, it can be concluded that the choice of algorithm mainly depends on the type and volume of data on which predictions are to be made.

Acknowledgment

We, the students of Thapar Institute of Engineering and Technology, would like to present our gratitude to our mentor Ms. Harkiran Kaur, without whom this research paper could not have been possible. Her guidance has been of utmost importance in the completion of this research paper.

References

[1] D. Bhuriya, G. Kaushal, A. Sharma and U. Singh, “Stock Market Predication Using A Linear Regression,” in International Conference on Electronics, Communication and Aerospace Technology, 2017.
[2] P. Guo, M. Waqar, H. Dawood, M. B. Shahnawaz and M. A. Ghazanfar, “Prediction of Stock Market by Principle Component Analysis,” in 13th International Conference on Computational Intelligence and Security, 2017.
[3] Z. Hu, J. Zhu and K. Tse, “Stocks Market Prediction Using Support
Vector Machine,” in 6th International Conference on Information
Management, Innovation Management and Industrial Engineering,
2013.
[14] H. Kim and S. T. Han, “The Enhanced Classification for the Stock
Index Prediction,” Procedia Computer Science, pp. 284-286, 2016.