Data Analytics for Finance Using Python
Nitin Jaglal Untwal
Preface
Authors
Chapter 1 Stock Investments Portfolio Management by Applying K-Means Clustering
1.1 Introduction
1.1.1 Introduction to Cluster Analysis
1.1.2 Literature Review
1.2 Research Methodology
1.2.1 Data Source
1.2.2 Study Time Frame
1.2.3 Tool for Analysis
1.2.4 Model Applied
1.2.5 Limitations of the Study
1.2.6 Future Scope
1.3 Feature Extraction and Engineering
1.4 Data Extraction
1.5 Standardizing and Scaling
1.6 Identification of Clusters by the Elbow Method
1.7 Cluster Formation
1.8 Results and Analysis
1.8.1 Cluster One
1.8.2 Cluster Two
1.8.3 Clusters Three and Four
1.8.4 Cluster Five
1.8.5 Cluster Six
1.9 Conclusion
Chapter 2 Predicting Stock Price Using the ARIMA Model
2.1 Introduction
2.2 ARIMA Model
2.2.1 Literature Review
2.3 Research Methodology
2.3.1 Data Source
2.3.2 Period of Study
2.3.3 Software Used for Data Analysis
2.3.4 Model Applied
2.3.5 Limitations of the Study
2.3.6 Future Scope of the Study
2.3.7 Methodology
2.4 Finding Different Lags Autocorrelation
2.5 Creating the Different ARIMA Models
2.5.1 Comparing the AIC Values of Models
2.6 Selecting the Best Model Using Cross-Validation
2.7 Conclusion
Chapter 3 Stock Investment Strategy Using a Logistic Regression Model
Chapter 4 Predicting Stock Buying and Selling Decisions by Applying the Gaussian Naive Bayes Model Using Python Programming
Chapter 5 The Random Forest Technique Is a Tool for Stock Trading Decisions
5.1 Introduction
5.2 Random Forest Literature Review
5.3 Research Methodology
5.3.1 Data Source
5.3.2 Period of Study
5.3.3 Sample Size
5.3.4 Software Used for Data Analysis
5.3.5 Model Applied
5.3.6 Limitations of the Study
5.3.7 Future Scope of the Study
5.3.8 Methodology
5.4 Defining the Dependent and Independent Variables for the Random Forest Model
5.5 Training and Testing with Accuracy Statistics
5.6 Buying and Selling Strategy Return
5.7 Conclusion
Chapter 6 Applying Decision Tree Classifier for Buying and Selling Strategy with Special Reference to MRF Stock
6.1 Introduction
6.2 Decision Tree
6.3 Research Methodology
6.3.1 Data Source
6.3.2 Period of Study
6.3.3 Software Used for Data Analysis
6.3.4 Model Applied
6.3.5 Limitations of the Study
6.3.6 Methodology
6.4 Creating a Data Frame
Chapter 7 Descriptive Statistics for Stock Risk Assessment
Chapter 8 Stock Investment Strategy Using a Regression Model
8.1 Introduction to a Multiple Regression Model
8.2 Applied Research Methodology
8.2.1 Data Source
8.2.2 Sample Size
8.2.3 Software Used for Data Analysis
8.2.4 Model Applied
8.3 Fetching the Data into a Python Environment and Defining the Dependent and Independent Variables
Chapter 9 Comparing Stock Risk Using F-Test
9.1 Introduction
9.1.1 Review of Literature
9.2 Research Methodology
9.2.1 Data Source
9.2.2 Period of Study
9.2.3 Software Used for Data Analysis
9.2.4 Model Applied
9.2.5 Limitations of the Study
9.2.6 Future Scope of the Study
Chapter 10 Stock Risk Analysis Using t-Test
10.1 Introduction
10.2 Research Methodology
10.2.1 Data Source
10.2.2 Period of Study
10.2.3 Software Used for Data Analysis
10.2.4 Model Applied
10.2.5 Limitations of the Study
10.2.6 Future Scope of the Study
10.3 Conclusion
Chapter 11 Stock Investment Strategy Using a Z-Score
11.1 Introduction to Z-Score
11.2 Applied Research Methodology
11.2.1 Data Source
11.2.2 Sample Size
11.2.3 Software Used for Data Analysis
11.2.4 Model Applied
11.3 Fetching the Data into a Python Environment and Defining the Dependent and Independent Variables
11.4 Calculating the Z-Score for the Stock
11.5 Results Z-Score Analysis
11.6 Conclusion
Chapter 12 Applying a Support Vector Machine Model Using Python Programming
12.1 Introduction
12.1.1 Review of Literature
12.2 Research Methodology
12.2.1 Data Collection
12.2.2 Sample Size
Chapter 13 Data Visualization for Stock Risk Comparison and Analysis
13.1 Introduction to Data Visualization
13.1.1 Review of Past Studies
13.1.2 Applied Research Methodology
13.2 Fetching the Data into a Python Environment and Defining the Dependent and Independent Variables
13.2.1 Data Visualization Using Scatter Plot
13.3 Data Visualization Using Bar Chart
13.4 Data Visualization Using Line Chart
13.5 Data Visualization Using Bokeh
Chapter 14 Applying Natural Language Processing for Stock Investors' Sentiment Analysis
14.1 Introduction
14.2 Research Methodology
14.2.1 Data Source
14.2.2 Period of Study
14.2.3 Software Used for Data Analysis
14.2.4 Model Applied
14.2.5 Limitations of the Study
14.2.6 Future Scope of the Study
14.3 Fetching the Data into a Python Environment
14.4 Sentiments Count for Understanding Investors' Perceptions
14.5 Performing Data Cleaning in Python
14.6 Performing Vectorization in Python
14.7 Vector Transformation to Create Trial and Training Data Sets
14.8 Result Analysis Model Testing AUC
14.9 Conclusion
Chapter 15 Stock Prediction Applying LSTM
15.1 Introduction
15.1.1 Review of Literature
15.2 Research Methodology
15.2.1 Data Source
15.2.2 Period of Study
15.2.3 Software Used for Data Analysis
15.2.4 Model Applied
15.2.5 Limitations of the Study
15.2.6 Future Scope of the Study
15.3 Fetching the Data into a Python Environment
15.4 Performing Data Cleaning in Python
15.5 Vector Transformation to Create Trial and Training Data Sets
15.6 Result Analysis for the LSTM Model
15.7 Conclusion
1
Stock Investments Portfolio Management by Applying K-Means Clustering
1.1 Introduction
NSE Indices Limited is the owner of the Nifty 50; it was earlier known as Index Services and Products Limited. The Nifty 50 covers 12 sectors of the Indian economy and is a portfolio of companies from the financial industry, information technology (IT), oil and gas, consumer goods, and automobiles. The composition includes the financial sector with a 36.81 percent share, IT companies with a 14.70 percent share, oil and gas with a 12.17 percent share, consumer goods with a 9.02 percent share, and automobiles with a 5.84 percent share. These companies are considered to be the top performers. Clustering is a technique that classifies data sets into different groups based on their similarities; it is based on pattern recognition. The Nifty 50 is a group of top-performing companies listed on the National Stock Exchange. The researcher applied K-means clustering to Nifty 50 stocks to create clusters considering different parameters related to stock valuation. The study is conducted by considering seven parameters: last traded price, price-to-earnings ratio (P/E), debt-to-equity ratio, earnings per share (EPS), dividend per share (DPS), return on equity (ROE), and face value. The clustering of high-performing companies is very useful for getting insight into high-value stocks for investors.
The last traded price (LTP) is the price at which the share last traded at the end of the day; it differs from the closing price. The price-to-earnings ratio is the ratio of the current market price of the share to earnings per share, also called the price multiple. It is a handy tool for comparing the price and performance of different stocks, as it measures the proportion of a company's stock price to its earnings per share. A high P/E ratio indicates that a company's stock is overvalued or that investors expect high growth rates.
The debt-to-equity ratio is the ratio of total debt to total shareholders' equity, that is, of borrowed capital to owned capital. A higher debt-to-equity ratio means that a company relies more on borrowed capital for financing. A debt-to-equity ratio of 1 to 1.5 is considered standard, though the ratio may vary from industry to industry.
Earnings per share (EPS) is calculated by dividing a company's net earnings by the number of outstanding shares. The company's financial position is reflected in its EPS: a high EPS means that shareholders' value has increased, which is considered to be the main objective of financial management.
Dividend per share (DPS) is the reward to shareholders for investing and taking risk. It is the total dividend issued divided by the number of outstanding shares and is based on the amount of dividend paid out of the company's overall earnings. Retained earnings are kept aside for the company's future growth and expansion plans and play an important role in deciding the dividend policy.
Return on equity (ROE) is net income as a percentage of shareholders' equity. The higher the percentage, the more efficiently the company generates profit.
K-means clustering is known for its wide usage and application; it is simple to understand and apply, and it is one of the best partitioning techniques for data analysis. The technique is based on the concept of centroids, which makes its cluster formation unique.
K-means clustering groups and classifies the data sets into different categories based on the nearest distance from the cluster mean. The algorithm produces a pre-specified number of clusters (chosen a priori) with the greatest possible separation between them, minimizing the within-cluster variance.
It is represented by the equation

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\lVert x_i^{(j)} - c_j \right\rVert^2$$

where $x_i^{(j)}$ is data point $i$ assigned to cluster $j$ and $c_j$ is the centroid of cluster $j$.
The cluster technique has been applied to study the financial market. Bonanno et al. (2003) worked on network structures of equities and explored the relationships between them by applying complex-systems methods. Ester et al. (1996) analyzed large noisy data sets by applying cluster analysis. Jain (2010) and Nanda et al. (2010) applied cluster analysis for portfolio management (Coronnello et al., 2005). Madhavan (2000) applied clustering to study market microstructure. Onnela et al. (2003) and Bonanno et al. (2001) explored the correlation between different markets (Huang et al., 2011). Mantegna (1999) studied portfolio management strategies for financial forecasting and analysis. Song et al. (2011) applied random matrix theory to develop insights into the movement of the financial market (Kantar & Deviren, 2014; Kenett et al., 2011).
1.2 Research Methodology
1.2.1 Data Source
The Nifty 50 database. The data selected for analysis is the ratios of the different companies under the Nifty 50.
1.2.3 Tool for Analysis
Python Programming.
1.2.6 Future Scope
A similar kind of cluster analysis can be done for the different sectors of the Indian economy at the macro level.
1.4 Data Extraction
The process of fetching data from an external source into the Python environment and making it readable in the Jupyter environment for machine learning analysis is known as data extraction (Refer Figure 1.1). For this study, we need to fetch the Excel file containing the financial information about the Nifty 50 companies. We use different Python libraries for K-means clustering, such as pandas, Matplotlib, and scikit-learn.
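The book's own code appears in the chapter's figures; the following is a minimal sketch of the extraction, scaling, and elbow-method steps, assuming the Nifty 50 ratios sit in an Excel file with one column per parameter (the file and column names below are illustrative, not the book's actual ones):

```python
# A minimal sketch of data extraction, scaling, and the elbow method.
# File and column names are assumptions, not the book's actual data set.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_excel("nifty50_ratios.xlsx")            # hypothetical Nifty 50 ratios file
features = df[["LTP", "PE", "DebtToEquity", "EPS", "DPS", "ROE", "FaceValue"]]

scaled = StandardScaler().fit_transform(features)    # standardize so no ratio dominates

# Elbow method: within-cluster sum of squares (inertia) for k = 1..10
inertia = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled).inertia_
           for k in range(1, 11)]
plt.plot(range(1, 11), inertia, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.show()
```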
When we apply the Python code shown below (Refer Figure 1.4), we
get the results for clusters one to six according to their characteristics
and features.
Figure 1.5 Cluster formation results with classification for scaled data.
We categorized the data selected for analysis into six clusters in Python (Refer Figure 1.5 and Figure 1.6).
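A sketch of the cluster-formation step itself, continuing the previous sketch (the book's actual code is in Figures 1.4 and 1.5; the NAME column is an assumption about the data layout):

```python
# Fit K-means with the six clusters suggested by the elbow plot and attach labels.
km = KMeans(n_clusters=6, n_init=10, random_state=0)
df["Cluster"] = km.fit_predict(scaled)               # one cluster label per company

# Inspect which companies fall into each cluster
for label, members in df.groupby("Cluster"):
    print(label, members["NAME"].tolist())           # "NAME" column is an assumption
```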
1.8 Results and Analysis
After applying clustering, the result shows that there are six clusters.
1.8.1 Cluster One
Cluster one includes companies like Bajaj Auto, Britannia, Divis Laboratories, Dr. Reddy's Laboratories, Hero Motors, LTIMindtree, Maruti Suzuki, Tata Steel, TCS, and UltraTech Cement (Refer Table 1.1). The average LTP for cluster one is 5653, the maximum LTP is 10,209, and the minimum LTP is 139. Cluster one includes companies from the automobile, cement, and pharma sectors, and only one company from the food processing sector, Britannia.
The price-to-earnings ratio of Maruti Suzuki is the highest at 60, and the lowest, 4.84, is registered by Tata Steel. The debt-to-equity ratio of almost all companies is close to zero; only Britannia registered a debt-to-equity ratio of 0.91. The highest earnings per share, 270, is registered by Tata Steel, and the minimum, 66.56, by Britannia. A dividend per share of 140 is registered by Bajaj Auto, the highest after Hero Motors. In cluster one, the automobile companies Bajaj Auto, Hero Motors, and Maruti Suzuki have given good dividends. The pharma companies in cluster one have a good LTP, but their dividend payout ratio is considerably low in comparison. Within cluster one, Britannia dominated all financial parameters and showed a sound financial position.
1.8.3 Clusters Three and Four
Clusters three and four include companies like Sun Pharma and Nestle. The EPS for Sun Pharma is negative, while the EPS for Nestle is 222 with a dividend per share of 200 (Refer Table 1.3).
Table 1.3 Clusters Three and Four Classifications for Nifty 50 Companies

NAME         LTP        P/E         DEBT TO EQUITY   EPS (RS.)   DPS (RS.)   ROE %    FACE VALUE   CLUSTER
Sun Pharma   1,298.80   −2,286.88   0.2              −0.4        10          −0.4     1            3
Nestle       27,240.00  0           0.02             222.46      200         102.89   10           4
1.8.4 Cluster Five
In cluster five, the highest LTP, 7407, is registered by Bajaj Finance. The P/E ratio for Bajaj Finance is 78, and its debt-to-equity ratio is very high at 2.78 (Refer Table 1.4). The EPS for Bajaj Finance, 65, is the highest in cluster five. The maximum return on equity in cluster five, 22.4, is registered by Power Grid Corporation.
1.8.5 Cluster Six
Cluster six includes three service-sector banks and insurance companies (Refer Table 1.5). The highest LTP, 20,601, is registered by Reliance, and the lowest last traded price is registered by the public sector organization Coal India. The highest price-to-earnings ratio is registered by HDFC Life Insurance, and the highest debt-to-equity ratio by BPCL. The highest earnings per share is registered by IndusInd Bank. The highest dividend per share is registered by Coal India, which also has the highest return on equity at 68.47.
1.9 Conclusion
The study clustered the Nifty 50 stocks on seven parameters: last traded price (LTP), price-to-earnings ratio (P/E), debt-to-equity ratio, earnings per share (EPS), dividend per share (DPS), return on equity (ROE), and face value. The clustering of high-performing companies is very useful for getting insight into high-value stocks for investors.
References
Bonanno, G., Caldarelli, G., Lillo, F., Miccichè, S., Vandewalle, N., &
Mantegna, R. N. (2003). Networks of equities in financial markets. The
European Physical Journal B-Condensed Matter and Complex Systems,
38(2), 363–371.
Bonanno, G., Lillo, F., & Mantegna, R. N. (2001). High-frequency cross-
correlation in a set of stocks. Quantitative Finance, 1(1), 96–104.
Coelho, R., Gilmore, C. G., Lucey, B. M., Richmond, P., & Hutzler, S. (2007).
The evolution of interdependence in world equity markets—Evidence
from minimum spanning trees. Physica A: Statistical Mechanics and its
Applications, 376, 455–466.
Coronnello, C., Tumminello, M., Lillo, F., Miccichè, S., & Mantegna, R. N.
(2005). Sector identification in a set of stock return time series: A com-
parative study. Quantitative Finance, 5(4), 373–387.
Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD-96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (pp. 226–231).
Huang, Z., Cai, Y., & Xu, X. (2011). A data mining framework for investment opportunities identification. Expert Systems with Applications, 38(8), 9224–9233.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern
Recognition Letters, 31(8), 651–666.
Kantar, E., & Deviren, B. (2014). Hierarchical structure of stock markets.
Physica A: Statistical Mechanics and Its Applications, 404, 117–128.
Kenett, D. Y., Shapira, Y., & Ben-Jacob, E. (2011). RMT assessments of the market latent information embedded in the stocks' raw data. Journal of Probability and Statistics, 2009, 249370. https://doi.org/10.1155/2009/249370
Lillo, F., & Mantegna, R. N. (2003). Power-law relaxation in a complex sys-
tem: Omori law after a financial market crash. Physical Review E, 68(1),
016119.
Madhavan, A. (2000). Market microstructure: A survey. Journal of Financial
Markets, 3(3), 205–258.
Mantegna, R. N. (1999). Hierarchical structure in financial markets. The
European Physical Journal B, 11(1), 193–197.
Nanda, S., Mahanty, B., & Tiwari, M. K. (2010). Clustering Indian stock mar-
ket data for portfolio management. Expert Systems with Applications,
37(12), 8793–8798.
Onnela, J. P., Chakraborti, A., Kaski, K., Kertész, J., & Kanto, A. (2003).
Dynamics of market correlations: Taxonomy and portfolio analysis.
Physical Review E, 68(5), 056110.
Peralta, G., & Zareei, A. (2016). A network approach to portfolio selection.
Journal of Empirical Finance, 38, 157–180.
Pozzi, F., Di Matteo, T., & Aste, T. (2012). Exponential smoothing weighted
correlations. The European Physical Journal B, 85(6), 175.
Song, D. M., Tumminello, M., Zhou, W. X., & Mantegna, R. N. (2011).
Evolution of worldwide stock markets, correlation structure, and corre-
lation-based graphs. Physical Review E, 84(2), 026108.
2
Predicting Stock Price Using the ARIMA Model
2.1 Introduction
The ARIMA model has a wide area of application for estimating and predicting the future value of a variable in applied econometrics areas like management, finance, banking, health analytics, and weather forecasting, domains in which it is crucial to select an optimized model that can predict the precise value of a given variable from historical data. The ARIMA model is considered to be the most reliable model in such situations. Box and Jenkins are the researchers who developed the ARIMA model in 1970. The model is used in forecasting and has shown tremendous potential for generating short-term forecasts.
The ARIMA model forecasts a univariate time series observed at equally spaced points in time. In ARIMA, AR stands for autoregressive, which emphasizes the relationship between past values and future values; I stands for integrated; and MA stands for moving average.
It is represented by the equation

$$y_t = \phi_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t - \theta_1 \epsilon_{t-1} - \theta_2 \epsilon_{t-2} - \cdots - \theta_q \epsilon_{t-q} \tag{2.1}$$
where the actual data values are denoted by $y_t$, the coefficients by $\phi_i$ and $\theta_j$, and the random errors by $\epsilon_t$; the integers p and q represent the orders of the autoregressive and moving-average parts (Ayodele et al., 2014). The ARIMA model is thus a mixture of two components: the autoregressive part is based on past lags, and the moving-average part is based on past errors.
The study period was from 8 January 2023 to 5 January 2024. The selected data is the daily closing stock price of MRF.
In the future, the study can be carried out at the macro level by applying the model to different stocks at the same time.
2.3.7 Methodology
The ARIMA model is applied to understand the relationship between the past values of a stock in order to predict its future value, and it is widely applied in the field of stock price prediction. The model is implemented first by understanding the relationship between the past values of the stock and its future value. Autocorrelation plays an important role in model development: the check for autocorrelation defines the further steps of model evaluation and parameter estimation to select the best ARIMA model for stock prediction using Python. The research is carried out in three steps. First, we check the autocorrelation. Then, we evaluate different ARIMA models and compare their AIC values. Finally, the best model is selected as the one with the lowest AIC and mean square error given by the train-and-test analysis. The data set is divided into two parts: 70 percent training data and 30 percent test data.
The autocorrelations at different lags (lag 1, lag 2, lag 3, lag 4, lag 5, etc.) are examined using Matplotlib, and autocorrelation is detected by studying the projections in the autocorrelation charts. Autocorrelation is the relationship between successive values of the same variable. Here, we cross-check the autocorrelation in our time series data using Python and compare the autocorrelation at different lags (Figures 2.1–2.5 show the autocorrelation plots for lag = 1 to lag = 5 for the MRF stock).
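A minimal sketch of this lag-wise check follows; the CSV file name and the Close column are assumptions about the data layout, not the book's actual files:

```python
# A sketch of the lag-wise autocorrelation check described above.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import lag_plot

close = pd.read_csv("mrf_daily.csv")["Close"]        # hypothetical MRF closing-price file

# Numeric autocorrelation at lags 1..5
for lag in range(1, 6):
    print(f"lag={lag}: autocorrelation={close.autocorr(lag=lag):.3f}")

# Visual check corresponding to Figures 2.1-2.5
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for lag, ax in zip(range(1, 6), axes):
    lag_plot(close, lag=lag, ax=ax)                  # scatter of y_t vs y_{t-lag}
    ax.set_title(f"lag = {lag}")
plt.tight_layout()
plt.show()
```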
The plot in Figure 2.1 shows a very high degree of autocorrelation for lag = 1; hence, we further checked the autocorrelation for lag = 2.
The plot in Figure 2.2 does not show a substantial degree of autocorrelation for lag = 2; hence, we further checked the autocorrelation for lag = 3.
The plot in Figure 2.3 does not show a considerable degree of autocorrelation for lag = 3; hence, we further checked the autocorrelation for lag = 4.
Figure 2.1 An autocorrelation plot with lag = 1 for the MRF stock.
Figure 2.2 An autocorrelation plot with lag = 2 for the MRF stock.
Figure 2.3 An autocorrelation plot with lag = 3 for the MRF stock.
Figure 2.4 An autocorrelation plot with lag = 4 for the MRF stock.
Figure 2.5 An autocorrelation plot with lag = 5 for the MRF Stock.
Lag 1 has the highest autocorrelation among the lags examined (Figures 2.1–2.5).
The Akaike information criterion (AIC) is considered the best method for ARIMA model evaluation, and hence we compare the AIC of different ARIMA models. The candidate ARIMA models are compared on their AIC to select the best model, which determines the order of the ARIMA model. AIC is given by the following equation:

$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$

where k is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model.
Table 2.1 Akaike Information Criterion (AIC) and BIC Values for Different ARIMA Models

S. NO   ARIMA MODEL   AIC    BIC
1       (1,1,1)       7459   7476
2       (1,0,2)       7485   7505
3       (0,0,3)       8058   8078
Inference: The ARIMA (1,1,1) model has the lowest AIC in Table 2.1 and is hence the best ARIMA model.
After comparison of the AIC, BIC (Refer Table 2.1), and p-values
of the ARIMA (1,1,1) model, ARIMA (1,0,2) model, and ARIMA
(0,0,3) model, it was found that the AIC and BIC values of the
ARIMA (1,1,1) model were the lowest and that the p-values were
also significant; hence, we cross-validated the models to select the best
ARIMA model (Refer Figure 2.9).
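A sketch of this comparison using statsmodels, with the three candidate orders from Table 2.1 and the 70 percent training split described earlier (the file and column names are assumptions, and the printed values will depend on the actual data):

```python
# Fit the candidate ARIMA orders from Table 2.1 and compare AIC/BIC.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

close = pd.read_csv("mrf_daily.csv")["Close"]        # hypothetical MRF closing-price file
train = close[: int(len(close) * 0.7)]               # 70 percent training split, as in the text

for order in [(1, 1, 1), (1, 0, 2), (0, 0, 3)]:
    result = ARIMA(train, order=order).fit()
    print(f"ARIMA{order}: AIC={result.aic:.0f}, BIC={result.bic:.0f}")
```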
2.7 Conclusion
Figure 2.10 Results for the ARIMA (1,1,1) model with cross-validation.
References
Ayodele, A., et al. (2014). Comparison of ARIMA and artificial neural networks models for stock price prediction. Journal of Applied Mathematics, 2014(1), 1–12.
Burbidge, R., Trotter, M., Buxton, B., & Holden, S. (2001). Drug design
by machine learning: Support vector machines for pharmaceutical data
analysis. Computers & Chemistry, 26(1), 5–14. https://doi.org/10.1016/
S0097-8485(01)00094-8
Burges, C. J. (1998). A tutorial on support vector machines for pattern recog-
nition. Data Mining and Knowledge Discovery, 2(2), 121–167. https://
doi.org/10.1023/A:1009715923555
Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., & López, A.
(2023). A comprehensive survey on support vector machine classification:
Applications, challenges and trends. Journal of Building Engineering.
https://doi.org/10.1016/j.jobe.2023.104911
Deo, R. C. (2015). Machine learning in medicine. Circulation, 132(20), 1920–
1930. https://doi.org/10.1161/CIRCULATIONAHA.115.001593
Dhillon, A., & Verma, G. K. (2020). Convolutional neural network: A review of
models, methodologies and applications to object detection. Progress in
Artificial Intelligence, 9(2), 85–112. https://doi.org/10.1007/s13748-019-
00203-0
Ding, C., & Dubchak, I. (2001). Multi-class protein fold recognition using
support vector machines and neural networks. Bioinformatics, 17(4),
349–358. https://doi.org/10.1093/bioinformatics/17.4.349
Drucker, H., Wu, D., & Vapnik, V. N. (1999). Support vector machines for
spam categorization. IEEE Transactions on Neural Networks, 10(5),
1048–1054. https://doi.org/10.1109/72.788645
Garcia-Lamont, F., Cervantes, J., Rodríguez-Mazahua, L., & López, A.
(2023). Support vector machine in structural reliability analysis: A review.
Structural Safety. https://doi.org/10.1016/j.strusafe.2023.102211
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. r., Jaitly, N., . . . &
Sainath, T. N. (2012). Deep neural networks for acoustic modeling in
speech recognition: The shared views of four research groups. IEEE
Signal Processing Magazine, 29(6), 82–97. https://doi.org/10.1109/
MSP.2012.2205597
Huang, W., Nakamori, Y., & Wang, S. Y. (2005). Forecasting stock market move-
ment direction with support vector machine. Computers & Operations
Research, 32(10), 2513–2522. https://doi.org/10.1016/j.cor.2004.03.016
Joachims, T. (1998). Text categorization with support vector machines: Learning
with many relevant features. European Conference on Machine Learning,
137–142. https://doi.org/10.1007/BFb0026683
Kim, Y. (2014). Convolutional neural networks for sentence classification.
EMNLP 2014. https://doi.org/10.3115/v1/D14-1181
Maita, A. R. C., Martins, L. C., López Paz, C. R., Peres, S. M., & Fantinato,
M. (2015). Process mining through artificial neural networks and sup-
port vector machines: A systematic literature review. Business Process
Management Journal, 21(6), 1391–1415. https://doi.org/10.1108/BPMJ-
02-2015-0017
Mountrakis, G., Im, J., & Ogole, C. (2011). Support vector machines in remote
sensing: A review. ISPRS Journal of Photogrammetry and Remote
Sensing, 66(3), 247–259. https://doi.org/10.1016/j.isprsjprs.2010.11.001
Nguyen, H. Q., Nguyen, N. D., & Nahavandi, S. (2020). A review on deep
reinforcement learning for robotic manipulation. Computers & Electrical
Engineering, 88, 106838. https://doi.org/10.1016/j.compeleceng.2020.
106838
Pal, M., & Mather, P. M. (2003). An assessment of the effectiveness of decision
tree methods for land cover classification. Remote Sensing of Environment,
86(4), 554–565. https://doi.org/10.1016/S0034-4257(03)00132-9
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471. https://doi.org/10.1162/089976601750264965
Tay, F. E., & Cao, L. (2001). Application of support vector machines in
financial time series forecasting. Omega, 29(4), 309–317. https://doi.
org/10.1016/S0305-0483(01)00026-3
Toledo-Pérez, D. C., Rodríguez-Reséndiz, J., Gómez-Loenzo, R. A., &
Jauregui-Correa, J. C. (2019). Support vector machine-based EMG sig-
nal classification techniques: A review. Applied Sciences, 9(20), 4402.
https://doi.org/10.3390/app9204402
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive
Neuroscience, 3(1), 71–86. https://doi.org/10.1162/jocn.1991.3.1.71
3
Stock Investment Strategy Using a Logistic Regression Model
3.3 Data Description and Creating Trial and Testing Data Sets
The dependent variable is derived from the adjusted closing price by comparing yesterday's adjusted closing price with today's (Refer Table 3.1). The rule of thumb is: if today's adjusted closing price is higher than yesterday's, buy the stock; if it is lower, sell the stock. Buy is denoted by 1 and sell by 0 in the dependent variable. The independent variables are continuous in nature and are Open, Close, High, and Low.
Table 3.1 Presenting the Classes of the Variables Used in the Logistic Regression Model

VARIABLE                    CLASSES
Adjusted Close (Dependent)  Sell = 0; Buy = 1
Open                        Continuous
Close                       Continuous
High                        Continuous
Low                         Continuous
The statistical analysis shows four continuous independent variables: Open, Close, High, and Low (Refer Figure 3.3). Of the four, the variable High is insignificant, with a p-value of 0.418; a p-value above 0.05 is considered insignificant. The variable Low is significant, with a p-value of 0.013, and the variables Open and Close are also significant.
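A minimal sketch of this model, using statsmodels so that the per-variable p-values discussed above are reported (the file and column names are assumptions about the data layout):

```python
# Build the buy/sell label from the adjusted close and fit a logistic regression.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("stock_daily.csv")                  # hypothetical OHLC + Adj Close file
df["Signal"] = (df["Adj Close"] > df["Adj Close"].shift(1)).astype(int)  # buy=1, sell=0

X = sm.add_constant(df[["Open", "Close", "High", "Low"]].iloc[1:])
y = df["Signal"].iloc[1:]                            # drop the first row (no prior day)

model = sm.Logit(y, X).fit()
print(model.summary())                               # reports a p-value per variable
```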
3.6.1 Recall
3.6.2 Precision
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} = \frac{127}{127 + 14} \approx 0.90 = 90\ \text{percent}$$
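Continuing the earlier sketch, precision and recall can also be computed from a confusion matrix; the sketch below thresholds the Logit probabilities and computes the analogous values for whatever model is fitted (127 and 14 are the book's reported counts):

```python
# Turn predicted probabilities into 0/1 calls and compute precision and recall.
from sklearn.metrics import confusion_matrix

y_pred = (model.predict(X) > 0.5).astype(int)        # threshold the Logit probabilities
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print("precision:", tp / (tp + fp))                  # e.g. 127 / (127 + 14) = 0.90 in the text
print("recall:", tp / (tp + fn))
```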
3.7 Conclusion
4
Predicting Stock Buying
and Selling Decisions by
Applying the Gaussian
Naive Bayes Model Using
Python Programming
4.1 Introduction
The stock market is exposed to different kinds of risk. The risk cannot
be accurately predicted as the stock market is based on the principle of
random walk which is depicted in the review of literature. Models like
the efficient market hypothesis emphasize the random walk principle
on which the stock market usually acts. The different machine learning
algorithms like the logistic regression model, support vector machine
model, and decision tree model are applied for predicting the stock price
and they have given a good precision and accurate predictive models
that are almost near to the expected value. Different inferential statis-
tics like the t-test, F-test, and Z-test are also applied for predicting the
stock price for measurement and assessment of risk and uncertainties.
After analyzing different studies, we concluded that a predictive GNB
model needs to be applied for predicting the buying and selling deci-
sions for stock and thus the study titled predicting the stock buying and
selling decisions by applying the Gaussian Naive Bayes model using
Python Programming was conducted here. It works on Bayes’ theorem
of probability to predict the categorical output. It is fast compared to
other machine learning models. The algorithm works on some prior
model data sets. The model assumes that all independent variables are
independent in nature which is not true in real-world scenarios (Lee
et al., 2015). The model is extremely used in predictive analytics since
The daily stock price of MRF is considered for the study from 2 January 2023 to 5 January 2024.
Python Programming
For this study, we applied the Naive Bayes machine learning algorithm.
4.3 Methodology
The process of converting raw data into features that can be easily utilized to create a model, as required by the algorithm, is called feature engineering (Refer Figure 4.1). Creating a data frame is the first step in building a model: the data frame holds the different variables the model uses. Feature engineering prepares the data frame according to the needs of the algorithm; for example, variables may need to be converted to a nominal or ordinal scale so that the algorithm can read and utilize the data. This makes the raw data ready for the program to use in the best possible manner. The syntax used for creating a data frame in Python is presented in Figure 4.1.
VARIABLE              CLASSES
Buy/Sell (Dependent)  Buy = 1 if tomorrow's price > today's price; Sell = 0 if tomorrow's price < today's price
Open (Independent)    Continuous
Close (Independent)   Continuous
High (Independent)    Continuous
Low (Independent)     Continuous
For training and testing, the data is divided into two parts: 80 percent of the data is used for training and 20 percent is used for testing. The test results are then validated by creating a confusion matrix.
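A minimal sketch of this workflow (80/20 split, confusion matrix) follows; the file and column names are assumptions about the data layout:

```python
# Gaussian Naive Bayes on OHLC features with an 80/20 split and confusion matrix.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

df = pd.read_csv("mrf_daily.csv")                    # hypothetical MRF OHLC file
df["Signal"] = (df["Close"].shift(-1) > df["Close"]).astype(int)  # buy = 1, sell = 0
df = df.iloc[:-1]                                    # last row has no "tomorrow"

X = df[["Open", "Close", "High", "Low"]]
y = df["Signal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
```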
4.6.3.2 Recall
Recall is the ratio of true positive predictions to the total of true positive and false negative predictions. A higher recall implies more correct predictions (a small number of false negatives).

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} = \frac{22}{22 + 4} \approx 0.85$$
4.6.3.3 Precision
Precision measures how correctly we have predicted the true positives; it is a qualitative measure of correctly predicted values.

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} = \frac{22}{22 + 0} = 1.00$$
4.7 Conclusion
The Naive Bayes model predicted the MRF stock with a precision of
100 percent. The overall model accuracy is 93 percent.
References
Aggarwal, C. C., & others. (2015). Anomaly detection in stock market data
using Gaussian Naive Bayes. Journal of Intelligent Information Systems,
46(2), 241–263.
Anderson, T. W. (1962). An Introduction to Multivariate Statistical Analysis.
Wiley.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Chandola, V., & others. (2009). Anomaly detection in stock market data using
One-Class SVM. Journal of Intelligent Information Systems, 33(2), 147–163.
Chen, X., & others. (2011). Stock price prediction using Gaussian Naive Bayes.
Journal of Computational Information Systems, 7(10), 3565–3572.
Guyon, I., & others. (2002). Gene selection for cancer classification using sup-
port vector machines. Machine Learning, 46(1–3), 389–422.
Hastie, T., & others. (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer.
Huang, W., & others. (2012). Stock price prediction using Gaussian Naive Bayes
and SVM. Journal of Computational Information Systems, 8(10), 4321–4328.
Huber, P. J. (1964). Robust estimation of a location parameter. Annals of
Mathematical Statistics, 35(1), 73–101.
Jensen, M. C. (1969). Risk, the pricing of capital assets, and the evaluation of
investment portfolios. Journal of Business, 42(2), 167–247.
Kim, J., & others. (2013). Stock price prediction using Gaussian Naive Bayes
and feature selection. Journal of Intelligent Information Systems, 41(2),
241–263.
Lee, S., & others. (2015). Stock return prediction using Gaussian Naive Bayes
and technical indicators. Journal of Financial Markets, 23, 1–15.
Li, X., & others. (2018). Feature selection for stock price prediction using
Gaussian Naive Bayes. Journal of Intelligent Information Systems, 51(2),
241–263.
Mandelbrot, B. (1963). The variation of certain speculative prices. Journal of
Business, 36(4), 392.
5
The Random Forest Technique Is a Tool for Stock Trading Decisions
5.1 Introduction
The random forest technique has very wide application, and improvements to its accuracy have been carried out in various fields by researchers such as Breiman (2001), Liaw and Wiener (2002a, b), Ishwaran and Kogalur (2007), and Geurts et al. (2006). Strobl et al. (2008) improved the interpretation of the random forest technique. Wright and Ziegler (2017) implemented random forests in the C++ and R programming languages. Deng and Runger (2012) applied the random forest technique to feature engineering, to select proper features for a model. Lopes and Rossi (2015) applied a random forest model to global sensitivity analysis. Prasad et al. (2006) and Cutler and Cutler (2009) applied random forests for classification and regression in ecological prediction.
The sample includes 250 daily closing prices of the MRF stock. The data is partitioned as follows: 75 percent of the data (183 samples) is used for training and the remaining 25 percent (62 samples) is used for testing purposes.
Python Programming
The study is restricted to the buying and selling decision of MRF only.
In the future, the study can be conducted on the macro level by apply-
ing it to a group of companies.
5.3.8 Methodology
The dependent variable Buy/Sell(Y) is binary 1 for Buy and −1 for Sell
(Refer Table 5.1). The four independent variables are Open-Close,
High-Low, Std-5, and Ret-5.
Table 5.1 The Classes of Variables Used in the Random Forest Model

VARIABLE                 CLASSES
Buy/Sell (Dependent)     Buy = 1 if tomorrow's price > today's price; Sell = −1 if tomorrow's price < today's price
Open-Close (Continuous)  (Open − Close) / Open
High-Low (Continuous)    (High − Low) / Low
Std-5 (Continuous)       Standard deviation of 5 days
Ret-5 (Continuous)       The mean of 5 days
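A sketch of constructing these features with pandas, following the formulas in Table 5.1 (the raw OHLC file and column names are assumptions about the data layout):

```python
# Build the Table 5.1 features and the +1/-1 target from raw OHLC data.
import pandas as pd
import numpy as np

df = pd.read_csv("mrf_daily.csv")                    # hypothetical MRF OHLC file

df["Open-Close"] = (df["Open"] - df["Close"]) / df["Open"]
df["High-Low"] = (df["High"] - df["Low"]) / df["Low"]
ret = df["Close"].pct_change()                       # daily return
df["Std-5"] = ret.rolling(5).std()                   # 5-day standard deviation
df["Ret-5"] = ret.rolling(5).mean()                  # 5-day mean return

# Target: +1 if tomorrow's close is higher than today's, else -1
df["Signal"] = np.where(df["Close"].shift(-1) > df["Close"], 1, -1)
df = df.iloc[:-1].dropna()                           # drop rows without a full window or a "tomorrow"
```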
Here we need to split the data into training and testing data sets to evaluate data mining models. When we separate the data into a training data set and a testing data set, most of the data is used for training and a small amount is used for testing (Refer Figure 5.2). We randomly sample the data to ensure that the training and testing data sets are similar for analysis. By using similar data for training and testing, we can minimize data errors and achieve a better understanding of the model. The data is partitioned as follows: 75 percent of the data is used for training and the remaining 25 percent is used for testing purposes.
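Continuing the sketch above, a 75/25 split and random forest fit might look like the following; scikit-learn's RandomForestClassifier stands in for whatever implementation the book's figures use:

```python
# 75/25 split and random forest fit, mirroring the text.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X = df[["Open-Close", "High-Low", "Std-5", "Ret-5"]]
y = df["Signal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```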
The plot (Figure 5.3) shows the distribution of the percentage MRF stock return. The strategy helps extract the required information and understand the density of the MRF stock return in percentage (Refer Figure 5.3). The maximum density is seen for stock returns between −1 percent and 1 percent, and the spread ranges from −3 percent to 4 percent.
5.7 Conclusion
The study has an overall model accuracy of 56 percent; the precision for buying is 68 percent and for selling 45 percent. The data set is split into two parts, train and test: 75 percent of the data is used for training and 25 percent for testing. The maximum density is seen for stock returns between −1 percent and 1 percent, and the overall movement of the buying and selling strategy ranges from a −3 percent decline to a 4 percent rise.
References
Adebiyi, A. A., Marwala, T., & Sowunmi, T. O. (2010). Bankruptcy pre-
diction using artificial neural networks and multivariate statistical
techniques: A review. African Journal of Business Management, 4(6),
942–947.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). ACM.
Cutler, D. R., & Cutler, A. (2009). Random Forest: Breiman and Cutler’s
random forests for classification and regression. R package version
4.6–10.
Cutler, D. R., Edwards Jr, T. C., Beard, K. H., Cutler, A., Hess, K. T.,
Gibson, J., & Lawler, J. J. (2007). Random forests for classification in
ecology. Ecology, 88(11), 2783–2792.
Deng, H., & Runger, G. (2012). Feature selection via regularized trees.
IEEE Transactions on Knowledge and Data Engineering, 24(6),
1057–1069.
Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees.
Machine Learning, 63(1), 3–42.
Ishwaran, H., & Kogalur, U. B. (2007). Random forests for survival, regres-
sion and classification (RF-SRC). R News, 7(2), 25–31.
Ishwaran, H., & Malley, J. D. (2008). An iterative random forest algorithm
for variable selection in high-dimensional data. Bioinformatics, 26(4),
1182–1187.
Ishwaran, H., & Malley, J. D. (2014). Forest floor: Visualizes random forests
with feature contributions. R package version 0.9.4.
Lall, U., Sharma, A., & Tarhule, A. (1996). Streamflow forecasting in the
Sahel using climate indices. Journal of Applied Meteorology, 35(10),
274–287.
Liaw, A., & Wiener, M. (2002a). Breiman and Cutler’s random forests for
classification and regression. R News, 2(3), 22–24.
Liaw, A., & Wiener, M. (2002b). Classification and regression by randomFor-
est. R News, 2(3), 18–22.
Lopes, F. M., & Rossi, A. L. (2015). Using random forests for global sensitivity analysis of the CLM4.5-FATES land surface model. Geoscientific Model Development, 8(4), 1059–1075.
Louppe, G. (2014). Understanding random forests: From theory to practice.
PhD Thesis, University of Liège.
Prasad, A. M., Iverson, L. R., & Liaw, A. (2006). Newer classification and
regression tree techniques: Bagging and random forests for ecological
prediction. Ecosystems, 9(2), 181–199.
Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008).
Conditional variable importance for random forests. BMC Bioinformatics,
9(1), 307.
Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in ran-
dom forest variable importance measures: Illustrations, sources and a
solution. BMC Bioinformatics, 8(1), 1–15.
Wright, M. N., & Ziegler, A. (2017). Ranger: A fast implementation of ran-
dom forests for high dimensional data in C++ and R. Journal of Statistical
Software, 77(1), 1–17.
Zhou, Z. H. (2012). Ensemble Methods: Foundations and Algorithms.
Taylor & Francis Group.
6
Applying Decision Tree Classifier for Buying and Selling Strategy with Special Reference to MRF Stock
6.1 Introduction
Decision trees have been applied to predict market trends (Huang & Zhao, 2023; Kim & Park, 2023). The root node is the starting node of a decision tree and is also known as the mother node. The leaf node is the end node of a decision tree with zero Gini value.
The decision tree can be an effective tool for stock price predictive analytics (Du et al., 2023; Olorunnimbe & Viktor, 2023). The decision tree is highly accurate in stock price prediction, since it can predict the volatility and risk of the stock market (Zhou et al., 2023). The efficiency of a decision tree model is enhanced by making a hybrid model with a related machine learning model such as LSTM (Feng & Zhang, 2023; Liu et al., 2023). The decision tree is used for portfolio management and volatility assessment of the stock market (Chen & Lin, 2023; Kumar & Das, 2023; Wang & Zhang, 2023). The use of decision tree trading algorithms in market sentiment analysis has shown the importance of decision trees in the finance field (Lee & Kim, 2023; Rodriguez & Lopez, 2023; Patel et al., 2023; Wang, 2023). The combination of decision trees with other machine learning techniques and artificial intelligence has a huge impact on financial data analysis decisions (Singh & Gupta, 2023; Patel & Roy, 2023; Yang & Liu, 2023).
Python Programming
6.3.6 Methodology
Before we start the analysis, it is very important to convert the data into a form that can be accessed in a Python environment (Refer Figure 6.1). A data frame is a representation of structured data that will be used for analysis. The raw data is cleaned by removing unwanted data from the data frame so that the data is ready for further analysis. The process of preparing data as per the requirements of the algorithm is called feature engineering. It also makes the various variables used in the machine learning model easy to understand, since the data is structured and easy for the algorithm to utilize. It is the first step in building a machine learning model. The syntax used for creating a data frame in Python is presented in Figure 6.1.
The dependent variable Buy/Sell(Y) is binary 1 for Buy and -1 for Sell
(Refer Table 6.1). The four independent variables are Open-Close,
High-Low, Std-5, and Ret-5.
Table 6.1 Presenting the Variables Used in the Decision Tree Model

VARIABLE                 CLASSES
Buy/Sell (Dependent)     Buy = 1 if tomorrow's price > today's price; Sell = −1 if tomorrow's price < today's price
Open-Close (Continuous)  (Open − Close) / Open
High-Low (Continuous)    (High − Low) / Low
Std-5 (Continuous)       Standard deviation of 5 days
Ret-5 (Continuous)       The mean of 5 days
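A sketch of fitting a classifier on these features follows, assuming a data frame df engineered as in the Chapter 5 sketch (this is illustrative, not the book's actual figure code):

```python
# Decision tree on the Table 6.1 features with per-class precision.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score

X = df[["Open-Close", "High-Low", "Std-5", "Ret-5"]]
y = df["Signal"]                                     # +1 buy / -1 sell
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("buy precision:", precision_score(y_test, y_pred, pos_label=1))
print("sell precision:", precision_score(y_test, y_pred, pos_label=-1))
```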
For accuracy statistics, we need to split the data into two parts: a training data set and a testing data set (Refer Figure 6.3). The major part of the data is used for training, since we build the decision tree model on the training data set; model evaluation is done on the testing data set. We select random samples of data to ensure that the training and testing data sets are similar for analysis, so that bias is minimized. By using similar data for training and testing, we can minimize data errors and achieve a better understanding of the model.
Results show an accuracy of 42.85 percent for the decision tree
model. A precision of 48 percent is recorded for buying and 40 percent
is registered for selling MRF Stocks.
The root node is the mother node and the starting node of a decision tree; it has no backward step, since it is the topmost node. The largest information gain is given by Std-5, with a Gini value of 0.495 and a sample size of 49 in class 1 (Buy). The root node splits into Open-Close and High-Low with Gini values of 0.472 and 0.245. Decision nodes are the nodes next to the root node, which generate further decision nodes and leaf nodes (end nodes) with maximum purity.
Leaf nodes are the end nodes with maximum purity; they have zero Gini values and classify the data with the highest purity. The outcome is predicted with colored nodes. The highest predicted leaf class was Class −1 (Sell), with nine final leaf nodes, and Class 1 (Buy) with seven leaf nodes (Refer Figure 6.6).
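A sketch of rendering such a tree, with its Gini values and class colors, using scikit-learn's plot_tree; the book's own rendering appears in Figure 6.6, and the sketch assumes the fitted tree from the previous sketch:

```python
# Visualize the fitted tree's nodes, Gini values, and predicted classes.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(16, 8))
plot_tree(tree,
          feature_names=["Open-Close", "High-Low", "Std-5", "Ret-5"],
          class_names=["Sell (-1)", "Buy (1)"],      # classes in sorted order: -1, 1
          filled=True)                               # color nodes by predicted class
plt.show()
```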
6.9 Conclusion
Results: The highest predicted leaf class was Class −1 (Sell), with nine final leaf nodes, while Class 1 (Buy) was predicted with seven leaf nodes. The predicted movement of return in percentage, as indicated by the decision tree algorithm, showed maximum density in the range from −1 percent to 1.80 percent.
References
Chen, X., & Lin, M. (2023). Decision trees in high-frequency trad-
ing. International Journal of Financial Studies, 11(3), 94. https://doi.
org/10.3390/ijfs11030094
Du, S., Li, X., & Yang, D. (2023). Research on prediction of decision tree
algorithm on different types of stocks. In Proceedings of the 2nd
International Seminar on Artificial Intelligence, Networking and
Information Technology—Volume 1: ANIT (pp. 178–181). SciTePress.
https://doi.org/10.5220/0012277000003807
Feng, S., & Zhang, T. (2023). Improving stock market predictions using
LSTM and decision tree models. AIP Conference Proceedings, 3072,
020023. https://pubs.aip.org/aip/acp/article/3072/1/020023/3277787
García, J., & Martínez, A. (2023). Financial market forecasting using decision
trees and machine learning. International Journal of Financial Studies,
11(3), 94. https://doi.org/10.3390/ijfs11030094
Huang, Z., & Zhao, Y. (2023). Predicting stock market trends using decision
tree algorithms. International Journal of Financial Studies, 11(3), 94.
https://doi.org/10.3390/ijfs11030094
Kim, J., & Park, H. (2023). Application of decision tree algorithms for mar-
ket trend analysis. International Journal of Financial Studies, 11(3), 94.
https://doi.org/10.3390/ijfs11030094
Kumar, V., & Das, S. (2023). Decision tree-based risk assessment in stock
investments. International Journal of Financial Studies, 11(3), 94.
https://doi.org/10.3390/ijfs11030094
Lee, H., & Kim, S. (2023). Decision trees and their role in automated trading
systems. International Journal of Financial Studies, 11(3), 94. https://
doi.org/10.3390/ijfs11030094
Li, J., & Cheng, S. (2023). A hybrid approach combining decision trees and
neural networks for stock prediction. International Journal of Financial
Studies, 11(3), 94. https://doi.org/10.3390/ijfs11030094
Liu, Q., et al. (2023). Enhancing stock market predictions with ensemble
learning. International Journal of Financial Studies, 11(3), 94. https://
doi.org/10.3390/ijfs11030094
Olorunnimbe, R., & Viktor, H. (2023). Stock market prediction with time
series data and news. International Journal of Financial Studies, 11(3),
94. https://doi.org/10.3390/ijfs11030094
Patel, A., & Roy, B. (2023). Decision trees in predictive analytics for stock
markets. International Journal of Financial Studies, 11(3), 94. https://
doi.org/10.3390/ijfs11030094
Patel, J., Shah, S., & Thakkar, P. (2023). A review on decision tree algorithms
in financial forecasting. International Journal of Financial Studies, 11(3),
94. https://doi.org/10.3390/ijfs11030094
Rodriguez, P., & Lopez, F. (2023). Using decision trees to analyze market
sentiments and stock prices. International Journal of Financial Studies,
11(3), 94. https://doi.org/10.3390/ijfs11030094
Shi, Y., & Chen, L. (2023). Decision trees in financial markets: Construction
and applications. International Journal of Financial Studies, 11(3), 94.
https://doi.org/10.3390/ijfs11030094
Singh, R., & Gupta, M. (2023). Decision trees for predictive modeling in
finance. International Journal of Financial Studies, 11(3), 94. https://doi.
org/10.3390/ijfs11030094
Wang, T., & Zhang, L. (2023). Enhancing portfolio management with deci-
sion trees. International Journal of Financial Studies, 11(3), 94. https://
doi.org/10.3390/ijfs11030094
Wang, Y., & Sun, L. (2023). Comparative study of decision tree models in
stock price prediction. International Journal of Financial Studies, 11(3),
94. https://doi.org/10.3390/ijfs11030094
Yang, M., & Liu, H. (2023). Stock market prediction using decision trees
and support vector machines. International Journal of Financial Studies,
11(3), 94. https://doi.org/10.3390/ijfs11030094
Zhou, X., et al. (2023). Machine learning techniques for stock price prediction
and graphic processing. International Journal of Financial Studies, 11(3),
94. https://doi.org/10.3390/ijfs11030094
7
Descriptive Statistics for Stock Risk Assessment
7.1 Introduction
Data wrangling was done with pandas and NumPy (McKinney, 2017). VanderPlas (2016) worked on different descriptive analysis tools. Das (2018) emphasized the practical application of descriptive statistics, as did Shaikh and Prakash (2020). Saxena and Gupta (2019) performed descriptive statistics on COVID-19 data for academic research. McKinney et al. (2010) focused on computational analysis. Das Gupta and Ghosh (2019) carried out an empirical analysis. Pedregosa et al. (2011) applied scikit-learn and performed descriptive analysis. DePoy and Gitlin (2015), Géron (2019), VanderPlas (2016), and Wickham and Grolemund (2017) contributed hands-on approaches to building machine learning models.
In the future, the study can be done on different stocks at the same time.
Figure 7.1 Python libraries for fetching the data sets into a Python environment and performing
descriptive statistics.
The mean is the sum of all values of the stock returns divided by the number of days (Refer Figure 7.3). The average return of the stock is useful for understanding risk when combined with the standard deviation and variance.
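A minimal sketch of computing these descriptive statistics with pandas (the file and column names are assumptions about the data layout):

```python
# Descriptive statistics for the price series discussed in this chapter.
import pandas as pd

price = pd.read_csv("mrf_daily.csv")["Close"]        # hypothetical MRF price series

print("mean:", price.mean())
print("median:", price.median())
print("mode:", price.mode().iloc[0])                 # most frequently occurring value
print("min:", price.min(), "max:", price.max())
print("range:", price.max() - price.min())
print("std:", price.std(), "variance:", price.var())
print(price.describe())                              # summary table in one call
```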
The mode is the most frequently occurring value in the data set (Refer Figure 7.5). The mode is 108,500, which is higher than the mean of 100,703 and the median of 100,773, indicating that the stock gives a higher return than the average return most often.
The range is the difference between the lowest and highest values; here the minimum is 81,900 and the maximum is 131,600 (Refer Figure 7.6). The minimum is below the mode of 108,500 and also below the mean of 100,703 and the median of 100,773, indicating that the stock gives higher returns than the minimum most of the time. The maximum of 131,600 is above the mean, median, and mode, indicating a good return. The range is 49,700, while the difference between the mean and the maximum is 30,897, which is lower than the range. Hence, we can interpret that the returns are above 100,773 or close to that value in most cases.
Quartiles divide the data into quarters: the first quartile is the 25th percentile, the second quartile is the 50th percentile, and the third quartile is the 75th percentile (Refer Figure 7.9). A Q1 of 89,967 says that 25 percent of the data is less than or equal to 89,967, which is below the mean of 100,703. The second quartile is 100,773, just above the mean; the third quartile is 108,879, above the mean and near the mode of 108,500, indicating less deviation from the mean and mode, and hence less risk.
Figure 7.9 Performing descriptive statistics in Python for different quantiles with IQR.
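A sketch of the quantile computation the figure refers to, under the same assumptions:

    q1 = close.quantile(0.25)    # about 89967
    q2 = close.quantile(0.50)    # about 100773 (the median)
    q3 = close.quantile(0.75)    # about 108879
    print(q1, q2, q3, q3 - q1)   # the last value is the interquartile range (IQR)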
7.12 Conclusion
The recommendation for investors is to put the stock on hold, as the stock may give small losses in the short term but a good return in the long term (Müller & Guido, 2016).
References
Das, A. (2018). Descriptive Statistics with Python. Packt Publishing Ltd.
Das Gupta, A., & Ghosh, S. (2019). An empirical study on descriptive
data analysis using Python. International Journal of Engineering and
Advanced Technology, 8(6), 988–992.
DePoy, E., & Gitlin, L. N. (2015). Introduction to Research: Understanding
and Applying Multiple Strategies. Elsevier Health Sciences.
Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras,
and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent
Systems. O’Reilly Media, Inc.
8
Stock Investment Strategy Using a Regression Model
The daily price of the MRF stock is considered for the study from 2 January 2023 to 27 February 2024.
Figure 8.2 Correlation matrix for selecting variables for the regression model.
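A sketch of the correlation step shown in Figure 8.2, assuming the MRF data frame df with standard OHLCV column names:

    # Correlation matrix across the price and volume columns (column names assumed)
    corr = df[["Open", "High", "Low", "Close", "Volume"]].corr()
    print(corr)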
8.5.1 R-Square
8.6 Conclusion
9
Comparing Stock Risk Using F-Test
9.1 Introduction
The study period was from 1 January 2023 to 12 January 2023 (Refer Figure 9.2). The data consist of the daily closing stock prices of two companies, giving a sample size of 12 days. In the future, the study can be extended to different stocks at the same time.
Hypothesis
Null hypothesis (H0): The variances of the returns of the two stocks are equal.
Alternative hypothesis (H1): The variances of the returns of the two stocks are not equal.
Figure 9.2 Fetching the data sets into a Python environment for performing an F-test.
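SciPy does not provide a two-sample variance F-test directly, so one common approach is to form the variance ratio and evaluate it against the F distribution; the following is a sketch, with file and column names as illustrative assumptions:

    import pandas as pd
    from scipy import stats

    df = pd.read_csv("two_stocks.csv")        # file and column names are illustrative
    a, b = df["stock_a"], df["stock_b"]

    f_stat = a.var(ddof=1) / b.var(ddof=1)    # ratio of the two sample variances
    dfn, dfd = len(a) - 1, len(b) - 1         # numerator and denominator degrees of freedom
    p_value = 2 * min(stats.f.sf(f_stat, dfn, dfd),
                      stats.f.cdf(f_stat, dfn, dfd))  # two-tailed p-value
    print(f_stat, p_value)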
Conclusion
The p-value of the test is 0.13, which is more than the alpha value of 0.05 (Refer Figure 9.3). Hence, we cannot reject the null hypothesis of the test. Based on the above analysis, we conclude that the variances of the returns of the two stocks are not different, and the alternative hypothesis is therefore rejected.
References
Allen, F., & Powell, M. (2012). Market Liquidity: A Primer. Oxford
University Press.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin,
D. B. (2013). Bayesian Data Analysis (Vol. 2). CRC Press.
Johnson, R. A., & Wichern, D. W. (2007). Applied Multivariate Statistical
Analysis. Prentice Hall.
Montgomery, D. C., Jennings, C. L., & Kulahci, M. (2017). Introduction to
Time Series Analysis and Forecasting. John Wiley & Sons.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel,
O., . . . & Vanderplas, J. (2011). Scikit-learn: Machine learning in
Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statisti-
cal modeling with Python. Proceedings of the 9th Python in Science
Conference (pp. 92–96).
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T.,
Cournapeau, D., . . . & van der Walt, S. J. (2020). SciPy 1.0: Fundamental
algorithms for scientific computing in Python. Nature Methods, 17(3),
261–272.
Zou, H., Hastie, T., & Tibshirani, R. (2020). Sparse principal component
analysis. Journal of Computational and Graphical Statistics, 27(2),
316–324.
10
Stock Risk Analysis Using t-Test
10.1 Introduction
The study period was from 1 January 2023 to 12 January 2023 (Refer Figure 10.2). The data consist of the daily closing stock prices of two companies, giving a sample size of 12 days.
In the future, the study can be extended to different stocks at the same time.
Hypothesis
Null hypothesis (H0): The mean returns of the two stocks are equal.
Alternative hypothesis (H1): The mean returns of the two stocks are not equal.
Figure 10.2 Fetching the data sets into a Python environment for performing a t-test.
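A sketch of the t-test step the figure refers to, using scipy.stats with illustrative file and column names:

    import pandas as pd
    from scipy import stats

    df = pd.read_csv("two_stocks.csv")   # file and column names are illustrative
    t_stat, p_value = stats.ttest_ind(df["stock_a"], df["stock_b"])  # two-sample t-test
    print(t_stat, p_value)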
10.3 Conclusion
The t-test is performed on a small sample, and the average returns of the two stock variables are calculated by applying Python libraries such as scipy.stats. The study period was from 1 January 2023 to 12 January 2023, using the daily closing stock prices of the two companies. The p-value of the t-test is 0.23, which is more than the alpha value of 0.05. Hence, we cannot reject the null hypothesis of the test. From the above analysis, we conclude that the mean returns of the two stocks are not significantly different, which ultimately results in the rejection of the alternative hypothesis.
11
Stock Investment Strategy Using a Z-Score
Z = (X − µ) / σ

where
X is the variable,
µ is the mean, given by the sum of the values of the variable divided by the number of items, and
σ is the standard deviation of variable X.
The daily price of the MRF stock is considered for the study from 2 January 2023 to 27 February 2024 (Refer Figure 11.1). The original format of the data file was not readable in Python, so a data frame is created by fetching the comma-separated values (CSV) file, making the data readable in Python and available for further processing. The data frame is structured as per the requirement of the model: the first step is to arrange the data so that the program can read and work on it. Once the data frame is created, it is ready to be used by the algorithm. The syntax used for creating a data frame in Python is presented in Figure 11.1.
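A minimal sketch of the data frame creation described above (file and column names are illustrative):

    import pandas as pd

    # Fetch the CSV file into a data frame readable by Python
    df = pd.read_csv("mrf.csv", parse_dates=["Date"])
    print(df.head())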
The Z-score analysis for the stock is done by calculating the Z-score for the opening price, closing price, day high, day low, and stock volume (Refer Figure 11.2), which helps in determining risk.
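The Z-scores themselves follow directly from the definition given earlier; a sketch, assuming the data frame df of Figure 11.1:

    cols = ["Open", "High", "Low", "Close", "Volume"]
    z = (df[cols] - df[cols].mean()) / df[cols].std()   # Z = (X - mean) / standard deviation
    print(z.describe())                                 # mean, min, and max of each Z-score series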
The mean Z-score of the opening price is 7.86 (Refer Figure 11.3). We can interpret this as a good time to invest, since the opening price has given a good return and its Z-score is well above the mean return. The variables High, Low, and Close have given lower returns, with a standard deviation of one. The minimum Z-scores range from −1.63 to −1.61, which indicates low risk and a high return at the opening price, as its mean Z-score is high. The maximum Z-scores range from 2.08 for the opening price to 2.02 for the other variables, which shows high return and high risk in the opening price.
11.6 Conclusion
Of the four continuous variables Open, Close, High, and Low, most have poor Z-scores. Only the opening price is exceptional, with a higher return and higher risk: the mean of its Z-score is 7.86 and its standard deviation is low, which indicates that the best investment opportunity lies in the opening price.
12
Support Vector Machine Model
12.1 Introduction
Figure 12.2 Different bifurcation of the data with maximum margin hyperplane. (Source: https://www.javatpoint.com)
Deo (2015) applied the support vector machine model in health analytics to identify complex medical patterns in predictive health analytics (Cervantes et al., 2023). Jardine et al. (2006) applied a support vector machine learning model in manufacturing automation and predicted machine failure by analyzing past data, which reduces maintenance costs. Garcia-Lamont et al. (2023) applied the support vector machine learning model to structural safety in the area of infrastructure (Hinton et al., 2012; Kim, 2014; Nguyen et al., 2020; Schölkopf et al., 2001). Toledo-Pérez et al. (2019) applied the SVM algorithm and improved signal classification accuracy. Huang et al. (2005) applied SVM models to stock market prediction in order to help investors make investment decisions. Joachims (1998) applied SVM models to natural language processing (NLP) and improved the accuracy of a document classification system. Ding and Dubchak (2001) applied SVM models for understanding and classifying protein structure. Mountrakis et al. (2011) applied SVM models in the remote sensing field and achieved high data classification accuracy (Burges, 1998).
The daily stock price of MRF is considered for the study from 2/1/2023 to 5/1/2024.
Python Programming
12.3 Methodology
The process of converting raw data into features that can be used to create a model, as required by the algorithm, is called feature engineering (Refer Figure 12.3). The creation of a data frame is the first step in building a model, and the data frame is prepared to match the structure the model expects, with its different variables. Feature engineering prepares the data according to need, converting it into a nominal scale, ordinal scale, and so on, so that it can be read and utilized by the algorithm in the best possible manner. The syntax used for creating a data frame in Python is presented in Figure 12.3.
94 DATA A N A LY TI C S F O R FIN A N C E USIN G P Y T H O N
VARIABLE               CLASSES
Buy/Sell (Dependent)   Buy = 1 if tomorrow's price > today's price;
                       Sell = 0 if tomorrow's price < today's price
Open (Independent)     Continuous
Close (Independent)    Continuous
High (Independent)     Continuous
Low (Independent)      Continuous
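The dependent variable in the table can be constructed from consecutive closing prices; a sketch, assuming the MRF data frame df:

    import numpy as np

    # Buy = 1 when tomorrow's closing price is above today's, Sell = 0 otherwise
    df["Signal"] = np.where(df["Close"].shift(-1) > df["Close"], 1, 0)
    X = df[["Open", "Close", "High", "Low"]][:-1]   # drop the last row (no "tomorrow" price)
    y = df["Signal"][:-1]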
For training and testing, the data is divided into two parts (Refer Figure 12.5): 80 percent of the data is used for training and 20 percent is used for testing and evaluation. The test results are then validated by creating a confusion matrix.
Figure 12.6 The Python code for confusion matrix and classification report.
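A sketch of the training and evaluation steps of Figures 12.5 and 12.6, assuming the X and y built above:

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix, classification_report

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = SVC()                    # support vector classifier, default RBF kernel
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))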
12.7.1.2 Recall
Recall is the ratio of true positive predictions to the total of true positive and false negative predictions. Higher recall implies more correct positive predictions (a small number of false negatives).

Recall = True Positive / (True Positive + False Negative) = 22 / (22 + 4) = 0.85

The recall for the overall model is 0.85.
12.7.1.3 Precision
Precision measures how correctly we have predicted the true positives. It is the qualitative analysis of correctly predicted values.

Precision = True Positive / (True Positive + False Positive) = 22 / (22 + 0) = 1.00

The precision for the overall model is 1.00.
12.8 Conclusion
The support vector machine model predicted the MRF stock buy/sell signal with a precision of 100 percent. The overall model accuracy is 93 percent.
References
Burbidge, R., Trotter, M., Buxton, B., & Holden, S. (2001). Drug design
by machine learning: Support vector machines for pharmaceutical data
analysis. Computers & Chemistry, 26(1), 5–14. https://doi.org/10.1016/
S0097-8485(01)00094-8
Burges, C. J. (1998). A tutorial on support vector machines for pattern recog-
nition. Data Mining and Knowledge Discovery, 2(2), 121–167. https://
doi.org/10.1023/A:1009715923555
Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., & López, A.
(2023). A comprehensive survey on support vector machine classification:
Applications, challenges and trends. Journal of Building Engineering.
https://doi.org/10.1016/j.jobe.2023.104911
Deo, R. C. (2015). Machine learning in medicine. Circulation, 132(20), 1920–
1930. https://doi.org/10.1161/CIRCULATIONAHA.115.001593
13
Data Visualization
Figure 13.2 Creating a scatter plot for understanding the degree of association between the
closing price and the opening price.
A bar chart represents data with rectangular bars whose lengths and heights are proportional to the variable's value; it is created using the bar method. Here we have applied a histogram to analyze the daily movement of the opening price of the stock (Refer Figure 13.4a).
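The plots discussed in this chapter can be reproduced with Matplotlib along these lines (a sketch, assuming the MRF data frame df with Open and Close columns):

    import matplotlib.pyplot as plt

    plt.scatter(df["Open"], df["Close"])   # scatter plot of opening versus closing price
    plt.xlabel("Open")
    plt.ylabel("Close")
    plt.show()

    plt.hist(df["Open"], bins=30)          # histogram of the daily opening price
    plt.show()

    plt.plot(df["Open"])                   # line plot of both series over time
    plt.plot(df["Close"])
    plt.show()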
Figure 13.3 Creating a scatter plot for understanding the degree of association between the closing price and the opening price.
Figure 13.4a Creating a histogram for understanding the movement of the opening price.
Figure 13.4b Creating a line plot for understanding the degree of association between the
closing price and the opening price.
Figure 13.5 Creating a scatter graph for understanding the degree of association between the closing price and the opening price.
The scatter plot in Figure 13.5 shows the relationship between the two variables and a close degree of association between them; hence, we can conclude that the opening price and the closing price of the MRF stock remain close to each other.
References
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in
Science & Engineering, 9(3), 90–95.
Jones, E., Oliphant, T., & Peterson, P. (2019). SciPy: Open source scientific
tools for Python.
McKinney, W. (2017). Python for Data Analysis: Data Wrangling with
Pandas, NumPy, and IPython. O’Reilly Media, Inc.
Smith, J. (2018). Python Data Visualization Cookbook. Packt Publishing Ltd.
VanderPlas, J. T. (2016). Python Data Science Handbook: Essential Tools for
Working with Data. O’Reilly Media, Inc.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T.,
Cournapeau, D., . . . & van der Walt, S. J. (2020). SciPy 1.0: Fundamental
algorithms for scientific computing in Python. Nature Methods, 17(3),
261–272.
Wang, J., & Liu, S. (2020). Python for Finance Cookbook. Packt Publishing
Ltd.
Waskom, M., Botvinnik, O., O'Kane, D., Hobson, P., Ostblom, J., Lukauskas,
S., . . . & Halchenko, Y. (2020). Mwaskom/Seaborn: v0.11.1 (December
2020). Zenodo.
Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data. O’Reilly Media, Inc.
14
Applying Natural Language Processing for Stock Investors' Sentiment Analysis
14.1 Introduction
Figure 14.1 Python libraries for performing NLP and fetching data sets into a Python environment.
The data frame is created as per the requirement of the model. The first step in creating a data frame is to structure the data so that the program can read and work on it. Once the data frame is created, it is ready to be used by the algorithm. The syntax used for creating a data frame in Python is presented in Figure 14.1.
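A minimal sketch of the loading step Figure 14.1 describes (the file name stock_news.csv and the headline and sentiment columns are illustrative assumptions):

    import pandas as pd

    # Load the news data into a data frame
    news = pd.read_csv("stock_news.csv")
    print(news[["headline", "sentiment"]].head())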
After cleaning the data for sentiment analysis, we need to create training and testing data sets (Refer Figure 14.5). To test the accuracy of the model applied, we need to compare its results with the original data; for this, the data set is divided into training and test parts.
Figure 14.5 Vector transformation to create trial and training data sets.
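A sketch of the vector transformation and split shown in Figure 14.5, using scikit-learn's TF-IDF vectorizer as one common choice (the column names follow the assumptions above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(news["headline"])   # vector transformation of the cleaned text
    y = news["sentiment"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)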
AUC stands for the area under the ROC curve (Refer Figure 14.6). The ROC curve plots the true positive rate against the false positive rate across all possible classification thresholds, and the AUC summarizes the accuracy of the classifier over those thresholds. The value of the AUC ranges from 0 to 1, and an AUC of 1 is considered the best, corresponding to a perfect classifier. In the present study, we obtained an AUC value of 0.20, which is considered poor.
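For reference, the AUC can be computed with scikit-learn as follows (clf stands for any fitted classifier with probability outputs; it is an illustrative name):

    from sklearn.metrics import roc_auc_score

    # Probability scores for the positive class from the fitted classifier
    scores = clf.predict_proba(X_test)[:, 1]
    print(roc_auc_score(y_test, scores))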
14.9 Conclusion
The sentiment analysis model built in this chapter achieved an AUC of 0.20 on the stock news data, which indicates poor classification performance.
15
Stock Prediction Applying LSTM
15.1 Introduction
Table 15.1 The Architecture and Process of the Long Short-Term Memory Model

LSTM PROCESS                      SENTENCE
Long-term memory                  Dr. Nitin stays at Aurangabad. Dr. Nitin has an area of specialization in Financial Analytics. He is a good teacher in the area of ______
Short-term memory                 Dr. Nitin has an area of specialization in Financial Analytics. He is a good teacher in the area of ______
Input information                 Dr. Nitin stays at Aurangabad. Dr. Nitin has an area of specialization in Financial Analytics. He is a good teacher in the area of ______
Irrelevant (forget) information   Dr. Nitin stays at Aurangabad.
Relevant information              Area of specialization in Financial Analytics.
Output                            He is a good teacher in the area of Financial Analytics.
Figure 15.1 Python libraries for performing LSTM and fetching data sets into a Python environment.
The data frame is prepared to match the structure of the model, which has different variables (Refer Figure 15.2). Feature engineering is the process of preparing the data by converting it into a nominal scale, ordinal scale, and so on; it produces data that can be read and utilized by the algorithm, so that the raw data is ready for the program to use. The syntax used for creating a data frame in Python is presented in Figure 15.2.
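A sketch of the feature preparation described above: the prices are scaled and cut into sliding windows so the LSTM can learn from past days (the 60-day lookback is an illustrative assumption):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    values = scaler.fit_transform(df[["Open", "Close"]])   # scale the two target columns to [0, 1]

    lookback = 60                  # number of past days in each input window (assumed)
    X, y = [], []
    for i in range(lookback, len(values)):
        X.append(values[i - lookback:i])   # window of past values
        y.append(values[i])                # next day's Open and Close
    X, y = np.array(X), np.array(y)

    split = int(0.8 * len(X))      # 80 percent training, 20 percent testing
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]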
After cleaning the data for the LSTM model, we need to create training and testing data sets (Refer Figure 15.3). To test the accuracy of the model applied, we need to compare its results with the original data. For this comparison, the data set is divided into test and train parts: 80 percent training data and 20 percent test data, created by vector transformation.
Figure 15.3 Vector transformation to create trial and training data sets.
The results of the LSTM model show the different layers in the output and the parameters generated for each layer (Refer Figure 15.4). We applied an LSTM layer of 50 units. The parameters of the LSTM units are the weights involved in the gate calculations, and their number is given by the formula below:

Number of parameters = 4 * (N + M + 1) * M

where
N is the number of dimensions in the input variable,
M is the number of units in the LSTM layer, and
1 accounts for the bias parameter.

Substituting the values in the above equation, we get
Number of LSTM parameters = 4 * (50 + 50 + 1) * 50 = 20,200
We applied the LSTM model and generated different layers for predicting the opening price and the closing price. The first LSTM layer is generated with 50 units or neurons, whose outputs are used as input to the next LSTM layer. The dropout layer is the regulator of the model, which keeps irrelevant information (the so-called biases) away from the LSTM model. A second LSTM layer with 50 neurons is followed by the final dense layer with 2 neurons.
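A sketch of this architecture in Keras; the input dimension of 50 is taken from the parameter calculation above (with the two-column windows prepared earlier it would be 2 instead):

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Input, LSTM, Dropout, Dense

    n_features = 50   # input dimension assumed in the text's parameter calculation
    model = Sequential([
        Input(shape=(60, n_features)),     # 60 time steps of n_features inputs (assumed)
        LSTM(50, return_sequences=True),   # first LSTM layer: 4 * (50 + 50 + 1) * 50 = 20,200 parameters
        Dropout(0.2),                      # regulator that drops information to reduce overfitting
        LSTM(50),                          # second LSTM layer with 50 units
        Dense(2),                          # final dense layer: predicted opening and closing price
    ])
    model.compile(optimizer="adam", loss="mse")
    model.summary()                        # lists each layer and its generated parameters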
15.7 Conclusion
The long short-term memory model is created for predicting the opening price and the closing price. The total number of parameters generated for the LSTM layer is 20,200. By applying past knowledge through the LSTM gates, irrelevant information is forgotten, and the resulting LSTM model is used for predicting the stock price.