
Data Analytics for

Finance Using Python


Unlock the power of data analytics in finance with this comprehensive
guide. Data Analytics for Finance Using Python is your key to the
secrets of the financial markets.
In this book, you’ll discover how to harness the latest data analytics
techniques, including machine learning and inferential statistics,
to make informed investment decisions and drive business success.
With a focus on practical application, this book takes you on a journey
from the basics of data preprocessing and visualization to advanced
modeling techniques for stock price prediction.
Through real-world case studies and examples, you’ll learn how to:
• Uncover hidden patterns and trends in financial data
• Build predictive models that drive investment decisions
• Optimize portfolio performance using data-driven insights
• Stay ahead of the competition with cutting-edge data analytics
techniques
Whether you’re a finance professional seeking to enhance your data
analytics skills or a researcher looking to advance the field of finance
through data-driven insights, this book is an essential resource. Dive
into the world of data analytics in finance and discover the power to
make informed decisions, drive business success, and stay ahead of the
curve.
This book will be helpful for students, researchers, and users of
machine learning and financial tools in the disciplines of commerce,
management, and economics.
Advances in Digital Technologies for Smart Applications
Series Editor: Saad Motahhir

The Advances in Digital Technologies for Smart Applications series publishes


leading-edge research on innovative digital technologies and their application
in smart systems. Key topics include AI, IoT, blockchain, and their integration
into various sectors, including finance, healthcare, and public governance.

Data Analytics for Finance Using Python


Nitin Jaglal Untwal, Utku Kose

Big Data and Blockchain Technology for Secure IoT Applications


Shitharth Selvarajan, Gouse Baig Mohammad, Sadda Bharath Reddy,
Praveen Kumar Balachandran

Technology-Based Teaching and Learning in Pakistani English Language


Classrooms
Muhammad Mooneeb Ali

Medical Knowledge Paradigms for Enabling the Digital Health Ecosystem


Usha Desai, Vivek P Chavda, Ankit Vijayvargiya, Ravichander Janapati

Soft Computing in Renewable Energy Technologies


Najib El Ouanjli, Mahmoud A. Mossa, Mariya Ouaissa, Sanjeevikumar
Padmanaban, Said Mahfoud

Leveraging the Potential of Artificial Intelligence in the Real World: Smart Cities
and Healthcare
Tien Anh Tran, Edeh Michael Onyema, Arij Naser Abougreen

eGovernment Whole-of-Government Approach for Good Governance: The Back-


Office Integrated Management IT Systems
Said Azelmad

Advances in Digital Marketing in the Era of Artificial Intelligence: Case Studies


and Data Analysis for Business Problem Solving
Moez Ltifi

For more information about this series, please visit: https://www.routledge.com/Advances-in-Digital-Technologies-for-Smart-Applications/book-series/ADT
Data Analytics for
Finance Using Python

Nitin Jaglal Untwal and Utku Kose

Boca Raton London New York

CRC Press is an imprint of the


Taylor & Francis Group, an informa business
First edition published 2025
by CRC Press
2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2025 Nitin Jaglal Untwal and Utku Kose
Reasonable efforts have been made to publish reliable data and information, but the
author and publisher cannot assume responsibility for the validity of all materials or the
consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if
permission to publish in this form has not been obtained. If any copyright material has not
been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted,
reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other
means, now known or hereafter invented, including photocopying, microfilming, and
recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.
copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978–750–8400. For works that are not available on CCC please
contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks
and are used only for identification and explanation without intent to infringe.
ISBN: 978-1-032-61821-0 (hbk)
ISBN: 978-1-032-61823-4 (pbk)
ISBN: 978-1-032-61824-1 (ebk)
DOI: 10.1201/9781032618241
Typeset in Caslon
by Apex CoVantage, LLC
Contents

Preface xiii
Authors xv

Chapter 1 Stock Investments Portfolio Management by Applying K-Means Clustering 1
1.1 Introduction 1
1.1.1 Introduction to Cluster Analysis 2
1.1.2 Literature Review 3
1.2 Research Methodology 3
1.2.1 Data Source 3
1.2.2 Study Time Frame 4
1.2.3 Tool for Analysis 4
1.2.4 Model Applied 4
1.2.5 Limitations of the Study 4
1.2.6 Future Scope 4
1.3 Feature Extraction and Engineering 4
1.4 Data Extraction 5
1.5 Standardizing and Scaling 6
1.6 Identification of Clusters by the Elbow Method 6
1.7 Cluster Formation 7
1.8 Results and Analysis 8
1.8.1 Cluster One 9
1.8.2 Cluster Two 10
1.8.3 Clusters Three and Four 11
1.8.4 Cluster Five 11
1.8.5 Cluster Six 11
1.9 Conclusion 12


Chapter 2 Predicting Stock Price Using the ARIMA Model 15
2.1 Introduction 15
2.2 ARIMA Model 15
2.2.1 Literature Review 16
2.3 Research Methodology 17
2.3.1 Data Source 17
2.3.2 Period of Study 17
2.3.3 Software Used for Data Analysis 17
2.3.4 Model Applied 17
2.3.5 Limitations of the Study 17
2.3.6 Future Scope of the Study 17
2.3.7 Methodology 17
2.4 Finding Different Lags Autocorrelation 18
2.5 Creating the Different ARIMA Models 21
2.5.1 Comparing the AIC Values of Models 23
2.6 Selecting the Best Model Using Cross-Validation 24
2.7 Conclusion 24

Chapter 3 Stock Investment Strategy Using a Logistic Regression Model 27


3.1 Introduction to the Logistic Regression Model 27
3.1.1 Introduction to a Multinomial Logistic
Regression Model 27
3.1.2 Literature Review 28
3.1.3 Applied Research Methodology 29
3.2 Fetching the Data into a Python Environment and
Defining the Dependent and Independent Variables 30
3.3 Data Description and Creating Trial and Testing
Data Sets 31
3.4 Results Analysis for the Logistic Regression Model 31
3.4.1 The Stats Models Analysis in Python 32
3.5 Model Evaluation Using Confusion Matrix and
Accuracy Statistics 32
3.5.1 Calculating False Negative, False Positive,
True Negative, and True Positive 32
3.6 Accuracy Statistics 33
3.6.1 Recall 33
3.6.2 Precision 34
3.7 Conclusion 34

Chapter 4 Predicting Stock Buying and Selling Decisions by Applying the Gaussian Naive Bayes Model Using Python Programming 36
4.1 Introduction 36
4.1.1 Literature Review 37
4.2 Research Methodology 37
4.2.1 Data Collection 37

4.2.2 Sample Size 37


4.2.3 Software Used for Data Analysis 38
4.2.4 Model Applied 38
4.2.5 Limitations of the Study 38
4.2.6 Future Scope of the Study 38
4.3 Methodology 38
4.4 Feature Engineering and Data Processing 38
4.5 Training and Testing 39
4.6 Predicting Naive Bayes Model with Confusion
Matrix 40
4.6.1 Creating Confusion Matrix 40
4.6.2 Calculating False Negative, False Positive,
True Negative, and True Positive 40
4.6.3 Result Analysis 41
4.7 Conclusion 41

Chapter 5 The Random Forest Technique Is a Tool for Stock Trading Decisions 43
5.1 Introduction 43
5.2 Random Forest Literature Review 43
5.3 Research Methodology 44
5.3.1 Data Source 44
5.3.2 Period of Study 44
5.3.3 Sample Size 44
5.3.4 Software Used for Data Analysis 44
5.3.5 Model Applied 44
5.3.6 Limitations of the Study 44
5.3.7 Future Scope of the Study 44
5.3.8 Methodology 45
5.4 Defining the Dependent and Independent
Variables for the Random Forest Model 45
5.5 Training and Testing with Accuracy Statistics 46
5.6 Buying and Selling Strategy Return 46
5.7 Conclusion 48

Chapter 6 Applying Decision Tree Classifier for Buying and Selling Strategy with Special Reference to MRF Stock 51
6.1 Introduction 51
6.2 Decision Tree 51
6.3 Research Methodology 52
6.3.1 Data Source 52
6.3.2 Period of Study 52
6.3.3 Software Used for Data Analysis 52
6.3.4 Model Applied 53
6.3.5 Limitations of the Study 53
6.3.6 Methodology 53
6.4 Creating a Data Frame 53

6.5 Feature Construction and Defining the Dependent


and Independent Variables 54
6.6 Training and Testing of Data for Accuracy Statistics 55
6.7 Buying and Selling Strategy Return 56
6.8 Decision Tree Analysis 56
6.9 Conclusion 58

Chapter 7 Descriptive Statistics for Stock Risk Assessment 61
7.1 Introduction 61
7.1.1 Related Work 61
7.2 Research Methodology 61
7.2.1 Data Source 61
7.2.2 Period of Study 62
7.2.3 Software Used for Data Analysis 62
7.2.4 Model Applied 62
7.2.5 Limitations of the Study 62
7.2.6 Future Scope of the Study 62
7.3 Performing Descriptive Statistics in Python
for Mean 63
7.4 Performing Descriptive Statistics in Python
for Median 63
7.5 Performing Descriptive Statistics in Python
for Mode 64
7.6 Performing Descriptive Statistics in Python
for Range 64
7.7 Performing Descriptive Statistics in Python for
Variance 65
7.8 Performing Descriptive Statistics in Python for
Standard Deviation 65
7.9 Performing Descriptive Statistics in Python for
Quantile 66
7.10 Performing Descriptive Statistics in Python for
Skewness 66
7.11 Performing Descriptive Statistics in Python
for Kurtosis 67
7.12 Conclusion 67

Chapter 8 Stock Investment Strategy Using a Regression Model 69
8.1 Introduction to a Multiple Regression Model 69
8.2 Applied Research Methodology 70
8.2.1 Data Source 70
8.2.2 Sample Size 70
8.2.3 Software Used for Data Analysis 70
8.2.4 Model Applied 70
8.3 Fetching the Data into a Python Environment and
Defining the Dependent and Independent Variables 71

8.4 Correlation Matrix 72


8.5 Result Analysis for the Multiple Regression Model 73
8.5.1 R-Square 73
8.6 Conclusion 74

Chapter 9 Comparing Stock Risk Using F-Test 76
9.1 Introduction 76
9.1.1 Review of Literature 76
9.2 Research Methodology 77
9.2.1 Data Source 77
9.2.2 Period of Study 77
9.2.3 Software Used for Data Analysis 77
9.2.4 Model Applied 77
9.2.5 Limitations of the Study 77
9.2.6 Future Scope of the Study 77

Chapter 10 Stock Risk Analysis Using t-Test 80
10.1 Introduction 80
10.2 Research Methodology 80
10.2.1 Data Source 80
10.2.2 Period of Study 80
10.2.3 Software Used for Data Analysis 81
10.2.4 Model Applied 81
10.2.5 Limitations of the Study 81
10.2.6 Future Scope of the Study 81
10.3 Conclusion 83

Chapter 11 Stock Investment Strategy Using a Z-Score 84
11.1 Introduction to Z-Score 84
11.2 Applied Research Methodology 85
11.2.1 Data Source 85
11.2.2 Sample Size 85
11.2.3 Software Used for Data Analysis 85
11.2.4 Model Applied 85
11.3 Fetching the Data into a Python Environment
and Defining the Dependent and Independent
Variables 85
11.4 Calculating the Z-Score for the Stock 86
11.5 Results Z-Score Analysis 88
11.6 Conclusion 88

Chapter 12 Applying a Support Vector Machine Model Using Python Programming 90
12.1 Introduction 90
12.1.1 Review of Literature 91
12.2 Research Methodology 92
12.2.1 Data Collection 92
12.2.2 Sample Size 92

12.2.3 Software Used for Data Analysis 92


12.2.4 Model Applied 92
12.2.5 Limitations of the Study 93
12.2.6 Future Scope of the Study 93
12.3 Methodology 93
12.4 Feature Engineering and Data Processing 93
12.5 Training and Testing 94
12.6 Predicting a Support Vector Machine Model
with a Confusion Matrix 95
12.6.1 Creating a Confusion Matrix 95
12.7 Calculating False Negative, False Positive,
True Negative, and True Positive 95
12.7.1 Result Analysis 96
12.8 Conclusion 97

Chapter 13 Data Visualization for Stock Risk Comparison and Analysis 99
13.1 Introduction to Data Visualization 99
13.1.1 Review of Past Studies 99
13.1.2 Applied Research Methodology 100
13.2 Fetching the Data into a Python Environment
and Defining the Dependent and Independent
Variables 100
13.2.1 Data Visualization Using Scatter Plot 101
13.3 Data Visualization Using Bar Chart 102
13.4 Data Visualization Using Line Chart 104
13.5 Data Visualization Using Bokeh 104

Chapter 14 Applying Natural Language Processing for Stock Investors Sentiment Analysis 107
14.1 Introduction 107
14.2 Research Methodology 108
14.2.1 Data Source 108
14.2.2 Period of Study 108
14.2.3 Software Used for Data Analysis 108
14.2.4 Model Applied 108
14.2.5 Limitations of the Study 108
14.2.6 Future Scope of the Study 108
14.3 Fetching the Data into a Python Environment 108
14.4 Sentiments Count for Understanding Investors’
Perceptions 109
14.5 Performing Data Cleaning in Python 110
14.6 Performing Vectorization in Python 111
14.7 Vector Transformation to Create Trial and
Training Data Sets 111
14.8 Result Analysis Model Testing AUC 113
14.9 Conclusion 113

Chapter 15 Stock Prediction Applying LSTM 115
15.1 Introduction 115
15.1.1 Review of Literature 116
15.2 Research Methodology 117
15.2.1 Data Source 117
15.2.2 Period of Study 117
15.2.3 Software Used for Data Analysis 117
15.2.4 Model Applied 117
15.2.5 Limitations of the Study 117
15.2.6 Future Scope of the Study 117
15.3 Fetching the Data into a Python Environment 117
15.4 Performing Data Cleaning in Python 118
15.5 Vector Transformation to Create Trial and Training
Data Sets 119
15.6 Result Analysis for the LSTM Model 120
15.7 Conclusion 121
Preface

In today’s fast-paced and rapidly changing financial landscape, data


analytics has become an essential tool for making informed investment
decisions and driving business success. With the increasing availability
of financial data, professionals in the finance industry are turning to
data analytics to gain a competitive edge and stay ahead of the curve.
This book is designed to provide finance professionals, researchers,
and students with a comprehensive guide to the application of data ana-
lytics in finance. The book covers a wide range of topics, from the basics
of data preprocessing and visualization to advanced machine learning
models and inferential statistical techniques for stock price prediction.
This book provides a strong basic foundation for machine learning
and its application in finance. It emphasizes various machine learn-
ing algorithms and their application to the finance discipline. The
advanced machine learning concepts are discussed lucidly, covering both
practical application and theoretical understanding. The topics cov-
ered in the book act as a step-by-step guide for the application of
machine learning tools in finance. The advanced machine learning
topics which are covered are as follows:
Chapter 1: Stock investments portfolio management by applying
K-means clustering
Chapter 2: Predicting stock price using the ARIMA model


Chapter 3: Stock investment strategy using a logistic regression


model
Chapter 4: Predicting stock buying and selling decisions by
applying the Gaussian naive Bayes model using Python
programming
Chapter 5: The random forest technique as a tool for stock trad-
ing decisions
Chapter 6: Stock management and decision tree technique for
proper investment
Chapter 7: Descriptive statistics for stock risk analysis and its
management
Chapter 8: Stock prediction using a multiple regression model
Chapter 9: F-test for stock risk assessment
Chapter 10: T-test for stock risk assessment
Chapter 11: Z-test for stock risk assessment
Chapter 12: Support vector machine learning model for stock
prediction
Chapter 13: Stock risk analysis by visualization
Chapter 14: NLP for sentiment analysis for stock
Chapter 15: LSTM for stock price prediction
This book highlights the use and application of machine learn-
ing tools and techniques in the finance area to further improve the
performance of researchers and analysts. The themes will be helpful
for students, researchers, and many other users of these technologies.
This book is a catalyst for bringing together machine learning and the
financial discipline to develop a deep understanding of the subject.
In addition, we offer our collective expertise and knowledge in this
book, providing readers a unique and insightful perspective on finance
and computer science and engineering.
Authors

Nitin Jaglal Untwal, PhD, is a distinguished scholar and educator in the


field of finance, with a remarkable academic background and research
expertise. Holding a doctorate in finance and master’s degrees in related
fields like commerce, management, and econometrics, he has established
himself as a prominent authority in financial data analytics, technology
management, and econometrics modeling. With over 11 years of experi-
ence in teaching and research, Dr. Untwal has published numerous papers
in esteemed databases like Scopus and Web of Science, solidifying his
reputation as a leading researcher in his field. Recognized as a postgradu-
ate faculty member by the S.P. University of Pune since 2008, he has also
achieved success in prestigious eligibility tests, including UGC-SET in
Management and State Eligibility Test Commerce. Additionally, he has
completed a Faculty Development Program from the Indian Institute of
Management, Kozhikode (IIM-K). Dr. Untwal’s wealth of knowledge
and experience make him an invaluable contributor to this book.

Utku Kose, PhD, a distinguished scholar in computer science and


engineering, joins Dr. Untwal in this literary endeavor. With over 200
publications to his name, Dr. Kose has demonstrated his expertise
in artificial intelligence, machine ethics, biomedical applications, and
more. His impressive academic background and extensive research
experience make him a significant contributor to this book.
1
Stock Investments Portfolio Management by Applying K-Means Clustering

1.1 Introduction

NSE Indices Limited, earlier known as India Index Services and
Products Limited, owns the Nifty 50. The Nifty 50 covers 12 sectors of
the Indian economy and is a portfolio of companies from the financial
industry, information technology (IT), oil and gas, consumer goods,
and automobiles. The composition includes the financial sector with a
36.81 percent share, IT companies with 14.70 percent, oil and gas with
12.17 percent, consumer goods with 9.02 percent, and automobiles
with 5.84 percent. These
companies are considered to be the top performers. Clustering is a tech-
nique that classifies the data sets into different groups based on their simi-
larities. The technique of clustering is based on pattern recognition. The
Nifty 50 is a group of top-performing companies listed on the National
Stock Exchange. The researcher applied K-mean clustering to Nifty 50
stocks to create clusters considering different parameters related to stock
valuation. The study is conducted by considering seven parameters: last
traded price, price-to-earnings ratio (P/E), debt-to-equity ratio, earning
per share (EPS), dividend per share (DPS), return on equity (ROE), and
face value. The clustering of high-performing companies is very useful for
getting insight into high-value stocks for investors.
The last traded price is the price at which the final trade of the day
occurred; it differs from the closing price. The price-to-earnings ratio
is the ratio of the current market price of the share to earnings per
share, also called the price multiple ratio. It is a handy tool for
comparing the price and performance of different stocks: it measures
the proportion of a company's stock price to its earnings per share. A
high P/E ratio indicates that a company's stock is overvalued or that
investors expect high growth rates.
The debt-to-equity ratio is the ratio of total debt to total shareholders'
equity, that is, of borrowed capital to owned capital. A higher
debt-to-equity ratio means that a company relies more on borrowed
capital for financing. A debt-to-equity ratio of 1 to 1.5 is considered
standard, though the ratio may vary from industry to industry.
Earnings per share (EPS) is calculated by dividing a company's net
earnings by the number of outstanding shares. The company's financial
position is reflected in its EPS: a high EPS means that shareholders'
value has increased, which is considered the main objective of
financial management.
Dividend per share: the dividend is the reward to shareholders for
investing and taking risk, and dividend per share is the total dividend
issued divided by the number of outstanding shares. It depends on the
amount of dividend paid out of the company's overall earnings.
Retained earnings are kept aside for the company's future growth and
expansion plans and therefore play an important role in deciding the
dividend policy.
Return on equity is the percentage of net income to the value of
shareholders' equity. The higher the percentage, the more efficiently
the company generates profit.

1.1.1 Introduction to Cluster Analysis

In machine learning, unsupervised models are trained on unlabeled
data sets and are allowed to act without any supervision. Grouping the
data into different clusters helps generate meaningful patterns for
analyzing unlabeled data sets. The clustering machine learning model
is an unsupervised learning model that helps in determining the
patterns of unlabeled data sets: the clustering technique is applied to
find similar or dissimilar patterns in a data set. Every data set has
some common features based on which we can identify similar patterns
and categorize and group the data into different clusters. The K-means
technique is one of the most popular clustering techniques for
analysis, which is the reason for its wide usage and application.
K-means clustering is simple to understand and apply, and it is one of
the best partitioning techniques for data analysis. The technique is
based on the concept of the centroid, which makes its cluster
formation unique.
K-means clustering groups and classifies the data sets into different
categories based on the nearest distance from the mean. The clustering
produces an exact number of clusters of the greatest possible
difference, which is specified a priori. It also controls the total
number of clusters and the total within-cluster variance.
It is represented by the equation

J = ∑_{j=1}^{k} ∑_{i=1}^{n} ‖x_i^{(j)} − c_j‖^2

Here, J is the objective function, k is the number of clusters, n is
the number of cases, x_i^{(j)} is case i assigned to cluster j, and
c_j is the centroid of cluster j.

1.1.2 Literature Review

The cluster technique has been applied to study the financial market.
Bonanno et al. (2003) worked on network structures of equities and found
out the relationship between them further by applying a complex system.
Ester et al. (1996) analyzed large noisy data sets by applying cluster
analysis. Jain (2010) and Nanda (2010) applied cluster analysis for port-
folio management (Coronnello et al., 2005). Madhavan (2000) applied
clustering to study the market microstructure. Onnela et al. (2003) and
Bonanno et al. (2001) explored the correlation between different mar-
kets (Huang et al., 2011). Mantegna (1999) studied portfolio manage-
ment strategies for financial forecasting and analysis. Song et al. (2011)
applied random matrix theory to develop insights into the movement of
the financial market (Kantar & Deviren, 2014; Kenett et al., 2011).

1.2 Research Methodology


1.2.1 Data Source

Nifty 50 database

1.2.2 Study Time Frame

The data selected for analysis is the ratio of different companies under
Nifty 50.

1.2.3 Tool for Analysis

Python Programming

1.2.4 Model Applied

For this study, we applied K-means clustering.

1.2.5 Limitations of the Study

The study is restricted to cluster analysis for only Nifty 50 companies.

1.2.6 Future Scope

A similar kind of cluster analysis can be done for the different sectors
of the Indian economy at the macro level.

Research Is Carried Out in Five Steps

1.3 Feature Extraction and Engineering


1.4 Data Extraction
1.5 Standardizing and Scaling
1.6 Identification of Clusters
1.7 Cluster Formation

1.3 Feature Extraction and Engineering

Feature engineering is an important element of the machine learning
pipeline. Python cannot directly read an arbitrary file, so we first
need to fetch the data into the Python environment. Feature
engineering is the process that turns raw data into data a machine
learning algorithm can use for analysis. Raw data cannot be used as-is
because they contain many errors; they need to be transformed based on
domain knowledge and on the requirements of the machine learning
model. We cannot build a good machine learning model without feature
engineering, which is the process of cleaning the data and defining it
into different parameters to make it algorithm-friendly.

1.4 Data Extraction

The process of fetching data from an external source into the Python
environment, and making it readable in the Jupyter environment so that
machine learning analysis can be carried out, is known as data
extraction (Refer Figure 1.1). For this study, we need to fetch the
Excel file which contains the financial information about the Nifty 50
companies. We use different libraries for K-means clustering in
Python, such as pandas, matplotlib, and sklearn.

Figure 1.1 Data fetching in the Python environment.
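The code in Figure 1.1 is not reproduced in this extraction, so the following is a minimal sketch of the fetching step; the file name nifty50_ratios.xlsx is an assumption, not the book's actual file.

```python
import pandas as pd

# Fetch the Excel file with the Nifty 50 financial ratios into a
# pandas data frame (file name assumed for illustration).
df = pd.read_excel("nifty50_ratios.xlsx")
print(df.head())
```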

Data is cleaned by removing the unwanted column from the data


frame by applying the Python code below (Refer Figure 1.2):

Figure 1.2 Data frame for Nifty 50 in the Python environment.
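A hedged sketch of the cleaning step shown in Figure 1.2, continuing from the data frame above; the dropped column and the feature names are assumptions based on the seven parameters listed in Section 1.1.

```python
# Drop a column that is not needed for clustering (name assumed).
df = df.drop(columns=["Sr. No"])

# Keep the seven valuation parameters used in the study
# (column names assumed to match the book's data frame).
features = ["LTP", "P/E", "Debt to Equity", "EPS", "DPS", "ROE", "Face Value"]
X = df[features]
```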



1.5 Standardizing and Scaling

Before applying a machine learning algorithm to a data set, we need
to standardize and scale the data, because a large amount of variation
is present in it. Scaling transforms this variation to remove
differences in the magnitude of the features (Refer Figure 1.3), which
matters because the K-means clustering algorithm is distance based
(Pozzi et al., 2012). Standardizing brings the standard deviation and
mean of each feature to one and zero, respectively (Lillo & Mantegna,
2003; Peralta & Zareei, 2016).
Python code is applied as follows:

Figure 1.3 Descriptive statistics for scaled data.
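A minimal sketch of the scaling step, continuing from the feature matrix X in the sketch above:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardize each feature to mean 0 and standard deviation 1,
# since K-means is distance based and sensitive to magnitudes.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Descriptive statistics of the scaled data, as reported in Figure 1.3.
print(pd.DataFrame(X_scaled, columns=features).describe())
```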

1.6 Identification of Clusters by the Elbow Method

The elbow technique is applied in Python Programming to arrive at a
defined number of clusters. It is a diagrammatic representation of
cluster formation from a given data set and works on the concept of
the centroid (Song et al., 2011). It explains the variation across
different numbers of clusters and finds the appropriate number of
clusters; choosing the number of clusters that explains the variation
also prevents overfitting and removes various constraints in cluster
identification and formation. The technique optimizes the number of
clusters by applying inertia, which acts as a tool for obtaining
well-defined clusters: it trades off lower inertia against a smaller
number of clusters (Peralta & Zareei, 2016).
The within-cluster inertia is calculated as

Inertia(k) = ∑_{i ∈ C_k} (y_{ik} − μ_k)^2

where μ_k is the mean of cluster k and C_k is the set of indices of
the observations assigned to cluster k.
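A hedged sketch of this elbow computation, continuing from X_scaled above; the range of candidate k values is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-means for k = 1..10 and record the within-cluster inertia.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

# The "elbow" where inertia stops falling sharply suggests k.
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.show()
```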

1.7 Cluster Formation

When we apply the Python code shown below (Refer Figure 1.4), we
get the results for clusters one to six according to their characteristics
and features.

Figure 1.4 Cluster formation Elbow technique.
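Since the code in Figure 1.4 appears only as an image, here is a minimal sketch of the final fit with six clusters, continuing the earlier sketch; the NAME column is an assumption.

```python
from sklearn.cluster import KMeans

# Fit the final model with six clusters, as indicated by the elbow plot.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(X_scaled)

# Inspect the companies assigned to each cluster ("NAME" column assumed).
print(df.sort_values("Cluster")[["NAME", "Cluster"]])
```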



Figure 1.5 Cluster formation results with classification for scaled data.

1.8 Results and Analysis

We categorized the data selected for analysis into six clusters in Python
(Refer Figure 1.5 and Figure 1.6):

Figure 1.6 Cluster analysis.



1.8.1 Cluster One

After applying clustering, the result shows that there are six
clusters. Cluster one includes companies like Bajaj Auto, Britannia,
Divis Laboratories, Dr. Reddy's Labs, Hero MotoCorp, LTIMindtree,
Maruti Suzuki, Tata Steel, TCS, and UltraTech Cement (Refer Table
1.1). The average LTP for cluster one is 5,653, the maximum LTP is
10,209, and the minimum LTP is 139. Cluster one includes companies
from the automobile, cement, and pharma sectors, and only one company
from the food processing sector, which is Britannia.
The price-to-earnings ratio of Maruti Suzuki is the highest at 60.65.
The lowest price-to-earnings ratio is registered by Tata Steel with a
value of 4.84. The debt-to-equity ratio of almost all companies is
close to zero; only Britannia registered a debt-to-equity ratio of
0.91. The highest earnings per share is registered by Tata Steel with
a value of 270. The minimum earnings per share, 66.56, is registered
by Britannia. The highest dividend per share, 140, is registered by
Bajaj Auto, followed by Hero MotoCorp. In cluster one, the automobile
companies Bajaj Auto, Hero MotoCorp, and Maruti Suzuki paid a good
dividend. The pharma companies in cluster one had a good LTP, but
their dividend payout was considerably low in comparison to the
automobile companies. In cluster one, Britannia dominated all
financial parameters and showed a sound financial position.

Table 1.1 Cluster One Classifications for Nifty 50 Companies


NAME  LTP  P/E  DEBT TO EQUITY  EPS (RS.)  DPS (RS.)  ROE %  FACE VALUE  CLUSTER
Bajaj Auto 6,675.00 21.04 0 173.6 140 18.81 10 0
Britannia 5,287.05 48.17 0.91 66.56 56.5 66.72 1 0
Divis Labs 4,014.65 39.63 0 111.07 30 25.21 2 0
Dr Reddys Labs 5,944.40 43.9 0.12 97.85 30 8.85 5 0
Hero Motocorp 4,115.15 18.53 0 123.78 95 15.66 2 0
LTIMindtree 6,153.05 47.66 0 129.14 55 26.9 1 0
Maruti Suzuki 10,209.50 60.65 0.01 124.68 60 6.96 5 0
Tata Steel 139.9 4.84 0.29 270.33 51 26.31 10 0
TCS 3,790.40 35.84 0 104.34 43 49.48 1 0
UltraTechCement 10,205.00 26.95 0.2 245 38 14.34 10 0
Avg 5,653.41 34.721 0.153 144.635 59.85 25.924 4.7
Max 10,209.50 60.65 0.91 270.33 140 66.72 10
Min 139.90 4.84 0 66.56 30 6.96 1

1.8.2 Cluster Two

In cluster two, the average LTP is 1,822, with a maximum of 5,749 for
Apollo Hospital. The service sector banks, which include Axis Bank,
HDFC Bank, SBI, and ICICI Bank, are the key players in cluster two,
representing the banking sector (Refer Table 1.2). Apollo Hospital has
a good LTP, debt-to-equity ratio, and earnings per share, and leads
cluster two on all financial parameters. Cluster two also includes
information technology companies like Wipro, Tech Mahindra, and
Infosys, which rank below Apollo Hospital, a service sector company.

Table 1.2 Cluster Two Classifications for Nifty 50 Companies


NAME  LTP  P/E  DEBT TO EQUITY  EPS (RS.)  DPS (RS.)  ROE %  F.V
Adani Enterprise 2,930 307.6 0.81 6.55 1 13.75 1
Apollo Hospital 5,749 97.65 0.33 46.25 11.75 10.88 5
Asian Paints 3,391 94.25 0 32.68 19.15 23.48 1
Axis Bank 1,091 17.92 0 42.48 1 11.32 2
Bajaj Finserv 1,685 863.27 0 11.2 3 4.7 5
Cipla 1,282 27.8 0 36.62 0 13.13 2
Eicher Motors 3,897 42.35 0 58.02 21 14.69 1
Grasim 2,109 35.81 0.08 46.47 10 6.27 2
HCL Tech 1,473 29.02 0.01 40.1 42 25.53 2
HDFC Bank 1,696 22.01 0 66.8 15.5 15.39 1
Hindalco 618 16.38 0.35 34.76 4 10.11 1
HUL 2,615 54.59 0 37.53 34 18.08 1
ICICI Bank 985 21.7 0 33.66 5 13.68 2
Infosys 1,533 37.77 0 50.49 31 30.63 5
ITC 470 20.51 0 12.22 11.5 24.52 1
JSW Steel 870 10.54 0.79 69.48 17.35 26.3 1
Kotak Mahindra 1,874 49.84 0 35.17 0.9 11.01 5
Larsen 3,447 31.51 0.3 56.09 22 11.74 2
M&M 1,656 19.54 0.17 41.28 11.55 12.66 5
ONGC 208 5.12 0.03 32.04 10.5 16.99 5
SBI 642.8 13.91 0 35.49 7.1 11.3 1
TATA Cons. Prod 1,098 80.89 0 9.61 6.05 7.53 1
Tech Mahindra 1,279 29.7 0 50.48 45 19 5
Titan Company 3,707 103.26 0.02 24.56 7.5 23.25 1
UPL 595 50.01 0.2 15.39 10 14.33 2
Wipro 470 26.66 0.14 22.2 6 22.32 2

Table 1.2 (Continued)


NAME  LTP  P/E  DEBT TO EQUITY  EPS (RS.)  DPS (RS.)  ROE %  F.V
Avg 1,822 81.13 0.12 36.44 13.60 15.86 2.38
Max 5,749 863 0.81 69.48 45 30.63 5
Min 208.07 5.12 0 6.55 0 4.7 1

1.8.3 Clusters Three and Four

Clusters three and four include companies like Sun Pharma and Nes-
tle. Further, the EPS for Sun Pharma is negative and EPS for Nestle
is 222 with a dividend per share of 200 (Refer Table 1.3).

Table 1.3 Clusters Three and Four Classifications for Nifty 50 Companies
NAME  LTP  P/E  DEBT TO EQUITY  EPS (RS.)  DPS (RS.)  ROE %  FACE VALUE  CLUSTER
Sun Pharma 1,298.80 −2,286.88 0.2 −0.4 10 −0.4 1 3
Nestle 27,240.00 0 0.02 222.46 200 102.89 10 4

1.8.4 Cluster Five

In cluster five, the highest LTP is registered by Bajaj Finance at
7,407. The P/E ratio of Bajaj Finance is 78, with a very high
debt-to-equity ratio of 2.78 (Refer Table 1.4). The EPS of Bajaj
Finance, 65, is the highest in cluster five. The maximum return on
equity in cluster five, 22.44, is registered by Power Grid Corporation.

1.8.5 Cluster Six

Cluster six includes service sector banks and insurance companies
(Refer Table 1.5). The highest LTP is registered by Reliance with a
value of 2,601, and the lowest LTP is registered by the public sector
organization Coal India. The highest price-to-earnings ratio is
registered by HDFC Life Insurance, the highest debt-to-equity ratio by
BPCL, and the highest earnings per share by IndusInd Bank. The

Table 1.4 Cluster Five Classifications for Nifty 50 Companies


NAME  LTP  P/E  DEBT TO EQUITY  EPS (RS.)  DPS (RS.)  ROE %  FACE VALUE  CLUSTER
Adani Ports  1,071.25  549.08  1.68  1.41  5  1.11  2  5
Bajaj Finance  7,407.50  78.21  2.78  65.85  10  11  2  5
Bharti Airtel  1,024.95  −115.61  1.31  −6.53  3  −4.59  5  5
NTPC  306.65  8.12  1.33  16.62  7  12.58  10  5
Power Grid Corp  239  8.85  1.77  24.51  14.75  22.44  10  5
Tata Motors  786.3  −119.49  1.16  −3.63  0  −6.97  2  5
Avg  1,805.94  68.19  1.67  16.37  6.63  5.93  5.17
Max  7,407.50  549.08  2.78  65.85  14.75  22.44  10
Min  239.00  −119.49  1.16  −6.53  0  −6.97  2

Table 1.5 Cluster Six Classifications for Nifty 50 Companies


NAME  LTP  P/E  DEBT TO EQUITY  EPS (RS.)  DPS (RS.)  ROE %  FACE VALUE  CLUSTER
BPCL  456.45  8.7  0.49  41.31  16  17.69  10  6
Coal India  393.8  10.07  0  18.18  17  68.47  10  6
HDFC Life  645.4  103.45  0.05  6.73  2.02  12.15  10  6
IndusInd Bank  1,585.80  15.7  0  59.57  8.5  9.66  10  6
Reliance  2,601.15  44.48  0.41  59.24  8  8.28  10  6
SBI Life Insurance  1,434.30  0  0  14.56  2.5  11.09  10  6
Avg  1,186.15  30.4  0.16  33.27  9.00  21.22  10
Max  2,601.15  103.45  0.49  59.57  17  68.47  10
Min  393.8  0  0  6.73  2.02  8.28  10

highest dividend per share is registered by Coal India, and the
highest return on equity, 68.47, is also given by Coal India.

1.9 Conclusion

The overall cluster formation is classified into six clusters based on


different parameters like Last Traded Price, Price earnings ratio

(P/E), Debt to Equity, Earning per share (EPS), Dividend per share
(DPS), return on equity (ROE), and Face Value. The clustering of
high-performing companies is very useful for getting insight into
high-value stocks for investors.

References
Bonanno, G., Caldarelli, G., Lillo, F., Miccichè, S., Vandewalle, N., &
Mantegna, R. N. (2003). Networks of equities in financial markets. The
European Physical Journal B-Condensed Matter and Complex Systems,
38(2), 363–371.
Bonanno, G., Lillo, F., & Mantegna, R. N. (2001). High-frequency cross-
correlation in a set of stocks. Quantitative Finance, 1(1), 96–104.
Coelho, R., Gilmore, C. G., Lucey, B. M., Richmond, P., & Hutzler, S. (2007).
The evolution of interdependence in world equity markets—Evidence
from minimum spanning trees. Physica A: Statistical Mechanics and its
Applications, 376, 455–466.
Coronnello, C., Tumminello, M., Lillo, F., Miccichè, S., & Mantegna, R. N.
(2005). Sector identification in a set of stock return time series: A com-
parative study. Quantitative Finance, 5(4), 373–387.
Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based
algorithm for discovering clusters in large spatial databases with
noise. In KDD-96: The Second International Conference on Knowledge
Discovery and Data Mining (pp. 226–231).
Huang, Z., Cai, Y., & Xu, X. (2011). A data mining framework for
investment opportunities identification. Expert Systems with
Applications, 38(8), 9224–9233.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern
Recognition Letters, 31(8), 651–666.
Kantar, E., & Deviren, B. (2014). Hierarchical structure of stock markets.
Physica A: Statistical Mechanics and Its Applications, 404, 117–128.
Kenett, D. Y., Shapira, Y., & Ben-Jacob, E. (2011). RMT assessments of the
market latent information embedded in the stocks’ raw data. Journal of
Probability and Statistics. DOI:10.1155/2009/249370 (2009)
Lillo, F., & Mantegna, R. N. (2003). Power-law relaxation in a complex sys-
tem: Omori law after a financial market crash. Physical Review E, 68(1),
016119.
Madhavan, A. (2000). Market microstructure: A survey. Journal of Financial
Markets, 3(3), 205–258.
Mantegna, R. N. (1999). Hierarchical structure in financial markets. The
European Physical Journal B, 11(1), 193–197.
Nanda, S., Mahanty, B., & Tiwari, M. K. (2010). Clustering Indian stock mar-
ket data for portfolio management. Expert Systems with Applications,
37(12), 8793–8798.

Onnela, J. P., Chakraborti, A., Kaski, K., Kertész, J., & Kanto, A. (2003).
Dynamics of market correlations: Taxonomy and portfolio analysis.
Physical Review E, 68(5), 056110.
Peralta, G., & Zareei, A. (2016). A network approach to portfolio selection.
Journal of Empirical Finance, 38, 157–180.
Pozzi, F., Di Matteo, T., & Aste, T. (2012). Exponential smoothing weighted
correlations. The European Physical Journal B, 85(6), 175.
Song, D. M., Tumminello, M., Zhou, W. X., & Mantegna, R. N. (2011).
Evolution of worldwide stock markets, correlation structure, and corre-
lation-based graphs. Physical Review E, 84(2), 026108.
2
Predicting Stock Price Using the ARIMA Model

2.1 Introduction

The stock is exposed to different types of risk and uncertainties which


have an impact on the price of the stock, making the stock price
difficult to predict. The stock price is influenced by various factors
related to demand and supply. Predicting the price of a stock
ordinarily requires explanatory variables like a stock market index, a
similar or identical company, sales, profits, earnings per share, etc.
When no such variable has an impact and it is impossible to predict
the stock price from them, the stock price is predicted from its own
past values over different time horizons (days, weeks, months,
quarters, or years) by applying the autoregressive integrated moving
average method, called ARIMA.
Stock prediction using time series analysis is an emerging area in
predictive analytics. It has attracted many researchers because of the
utility and accuracy of the model. The main objective of the ARIMA
model is to study past observations based on which future models are
generated to forecast a given variable. The success of the ARIMA
model depends on appropriate model identification and evaluation.
It is essential to understand that the ARIMA model is applied
under what circumstances.
• No dependent variable is available.
• Good sufficient historical data is available.
• Autocorrelation.

2.2 ARIMA Model

The ARIMA model has a wide area of application for estimating and
predicting the future value of a variable in applied econometrics areas like
management, finance, banking, health analytics, and weather forecasting

which are crucial areas for selecting an optimized model that can
predict the precise value of a given variable from historical data.
The ARIMA model is considered the most reliable model in such
situations; Box and Jenkins are the researchers who developed the
ARIMA model in 1970. The model has been used in forecasting and has
shown tremendous potential for generating short-term forecasts.
The ARIMA model forecasts univariate time series observed at equally
spaced time lags. In ARIMA, AR stands for autoregressive, which
emphasizes the relationship between past values and future values, I
stands for integrated, and MA stands for moving average.
It is represented by the equation

y_t = φ_0 + φ_1 y_{t−1} + φ_2 y_{t−2} + … + φ_p y_{t−p} + ε_t − θ_1 ε_{t−1} − θ_2 ε_{t−2} − … − θ_q ε_{t−q}  (2.1)

where the actual data values are denoted by y_t, the coefficients by
φ_i and θ_j, the random errors by ε_t, and the integers p and q
represent the degrees of the autoregressive and moving average parts
(Ayodele et al., 2014). The ARIMA model is a mixture of two equations:
the autoregressive equation is based on past lags, and the moving
average equation is based on past errors.

2.2.1 Literature Review

Burbidge et al. (2001) explain various predictive models, including
ARIMA, with their practical application and use. Burges (1998)
develops and lucidly explains time series analysis and its application
to forecasting. Cervantes et al. (2023) focus mainly on forecasting
financial data and the risk analysis associated with it. Deo (2015)
applied a hybrid of the autoregressive integrated moving average
method and a neural network model, explaining how combining the two
improves the predictive forecasting model. Dhillon and Verma (2020)
developed a hybrid model that includes long short-term memory, the
autoregressive integrated moving average, and Bayes optimization for
forecasting stock prices.
Furthermore, various research studies focused on hybrid models
of LSTM and machine learning with time series analysis (Ding &

Dubchak, 2001; Drucker et al., 1999; Garcia-Lamont et al., 2023;


Hinton et al., 2012; Huang et al., 2005; Joachims, 1998; Kim, 2014;
Maita et al., 2015; Mountrakis et al., 2011; Nguyen et al., 2020; Pal &
Mather, 2003; Schölkopf et al., 2001; Tay & Cao, 2001; Toledo-Pérez
et al., 2019; Turk & Pentland, 1991).

2.3 Research Methodology


2.3.1 Data Source

Yahoo Finance financial database is used to create the ARIMA model.

2.3.2 Period of Study

The study period was from 8 January 2023 to 5 January 2024. The
interval for the selected data is the daily closing stock price of MRF
for analysis.

2.3.3 Software Used for Data Analysis

Python Programming, Anaconda

2.3.4 Model Applied

For this study, we applied the ARIMA model.

2.3.5 Limitations of the Study

The study is restricted to the Stock price Index of MRF only.

2.3.6 Future Scope of the Study

In the future, the study can be done on the macro level by applying it
to a different stock at the same time.

2.3.7 Methodology

The autoregressive integrated moving average (ARIMA) model predicts
the stock price by using past values: it models the relationship
between the past values of a stock and its future value, and it is
widely applied in the field of stock price prediction. Autocorrelation
plays an important role in model development; the check for
autocorrelation defines the further steps of model evaluation and
parameter estimation needed to select the best ARIMA model for stock
prediction using Python. The research is carried out in three steps.
First, we check the autocorrelation. Then, we evaluate different ARIMA
models and compare their AIC values. The best model is the one with
the lowest AIC and the lowest mean square error given by the train and
test data analysis results. The data set is divided into two parts:
70 percent training data and 30 percent test data.

Research is carried out in three steps:

2.4 Finding Different Lags Autocorrelation


2.5 Creating the Different ARIMA Models
2.6 Selecting the Best Model Using Cross-Validation

2.4 Finding Different Lags Autocorrelation

The autocorrelation at different lags (lag 1, lag 2, lag 3, lag 4,
lag 5, etc.) is examined using matplotlib, and autocorrelation is
detected by studying the projections in the autocorrelation charts.
Autocorrelation is the relationship between successive values of the
same variable. Here, we cross-check the autocorrelation in our time
series data using Python Programming, comparing the autocorrelation at
different lags (Figures 2.1–2.5 show the autocorrelation plots for
lag = 1 to lag = 5 for the MRF stock).
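Plots like Figures 2.1–2.5 can be produced with pandas lag plots; a minimal sketch, assuming the MRF closing prices sit in a CSV with Date and Close columns (file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import lag_plot

# Daily closing prices of MRF (file and column names assumed).
prices = pd.read_csv("mrf_daily_close.csv", index_col="Date",
                     parse_dates=True)["Close"]

# One scatter plot per lag: a tight diagonal band indicates
# strong autocorrelation at that lag.
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for lag, ax in zip(range(1, 6), axes):
    lag_plot(prices, lag=lag, ax=ax)
    ax.set_title(f"lag = {lag}")
plt.show()
```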
The plot in Figure 2.1 shows a very high degree of autocorrelation
for lag = 1; hence, we further checked the autocorrelation for lag = 2.
The plot in Figure 2.2 does not show the substantial degree of auto-
correlation for lag = 2; hence, we further checked the autocorrelation
for lag = 3.
The plot in Figure 2.3 does not show a considerable degree of auto-
correlation for lag = 3; hence, we further checked the autocorrelation
for lag = 4.

Figure 2.1 An autocorrelation plot with lag = 1 for the MRF stock.

Figure 2.2 An autocorrelation plot with lag = 2 for the MRF stock.

Figure 2.3 An autocorrelation plot with lag = 3 for the MRF stock.

Figure 2.4 An autocorrelation plot with lag = 4 for the MRF stock.

The plot in Figure 2.4 shows a lesser degree of autocorrelation for


lag = 4; hence, we further checked the autocorrelation for lag = 5.
The plot in Figure 2.5 does not show a substantial degree of auto-
correlation for lag = 5. The above analysis from lag1 to lag5 shows that

Figure 2.5 An autocorrelation plot with lag = 5 for the MRF Stock.

lag1 has the highest autocorrelation (Figures 2.1–2.5 show the auto-
correlation plot for lag = 1 to lag = 5 for the MRF stock).
The Akaike information criterion (AIC) is considered the best method
for ARIMA model evaluation; hence, we compare the AIC values of
different ARIMA models.

2.5 Creating the Different ARIMA Models

We first find out the Akaike Information Criterion (AIC) for model
evaluation. The ARIMA models are compared to check the AIC to
select the best model. It determines the order of an ARIMA model.
AIC is given by the following equation:

AIC = −2 log(L) + 2(p + q + k + 1)  (2.2)

where L is the likelihood of the data; k=1 if c ≠ 0; and k = 0 if c = 0. We


tested four ARIMA models, and Figures 2.6–2.10 show the details of the
different ARIMA models as Python Programming output.
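Since the figures show only model output, the following is a hedged sketch of how the candidate models can be fitted with statsmodels, reusing the prices series from the earlier sketch:

```python
from statsmodels.tsa.arima.model import ARIMA

# Fit each candidate order and report its AIC and BIC for comparison.
for order in [(1, 1, 1), (1, 0, 2), (0, 0, 3)]:
    result = ARIMA(prices, order=order).fit()
    print(order, "AIC:", round(result.aic, 1), "BIC:", round(result.bic, 1))
```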

Inference—The ARIMA (1,1,1) model has an AIC value of 7459


(Refer Figure 2.6).

Figure 2.6 Results for the ARIMA (1,1,1) model.

Inference—The ARIMA (1,0,2) model has an AIC value of 7485


(Refer Figure 2.7).

Figure 2.7 Results for the ARIMA (1,0,2) model.



Inference—The ARIMA (0,0,3) model has an AIC value of 8058


(Refer Figure 2.8).

Figure 2.8 Results for the ARIMA (0,0,3) model.

2.5.1 Comparing the AIC Values of Models

The Akaike information criterion (AIC) scores of the different ARIMA
models are compared (refer to Figures 2.6–2.8). After comparing the
AIC, BIC, and likelihood scores of the different ARIMA models, the
model with the lowest scores is considered the best fit (Refer Table
2.1). The ARIMA (1,1,1) model registered the lowest AIC and BIC
scores, with significant p-values.

Table 2.1 Akaike Information Criterion (AIC) and BIC Values for Different ARIMA Models
S. NO ARIMA MODEL AIC BIC
1 (1,1,1) 7459 7476
2 (1,0,2) 7485 7505
3 (0,0,3) 8058 8078

Inference—The ARIMA (1,1,1) model has the lowest AIC and hence is the
best ARIMA model (Refer Table 2.1).

Figure 2.9 Results for the ARIMA (1,1,1) model.

2.6 Selecting the Best Model Using Cross-Validation

After comparison of the AIC, BIC (Refer Table 2.1), and p-values
of the ARIMA (1,1,1) model, ARIMA (1,0,2) model, and ARIMA
(0,0,3) model, it was found that the AIC and BIC values of the
ARIMA (1,1,1) model were the lowest and that the p-values were
also significant; hence, we cross-validated the models to select the best
ARIMA model (Refer Figure 2.9).
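A minimal sketch of this cross-validation under the 70/30 chronological split described in Section 2.3.7, again reusing the prices series from the earlier sketch:

```python
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA

# 70 percent training data and 30 percent test data, kept in time order.
split = int(len(prices) * 0.7)
train, test = prices[:split], prices[split:]

# Fit ARIMA(1,1,1) on the training data and forecast the test horizon.
result = ARIMA(train, order=(1, 1, 1)).fit()
forecast = result.forecast(steps=len(test))

print("Test MSE:", mean_squared_error(test, forecast))
```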

2.7 Conclusion

After cross-validation of different ARIMA models, the ARIMA


(1,1,1) model was considered the best fit for predicting the stock value
of MRF with a mean square error of 3504375. We selected the best
ARIMA model depending on the Akaike information criterion and
testing mean square error. After comparison of the AIC, BIC, and
likelihood scores of different ARIMA models, the ARIMA model
with the lowest scores of AIC and testing mean square error (cross-
validation) (Refer Figure 2.10) and p-value less than 0.05 was selected
as the best model.

Figure 2.10 Results for the ARIMA (1,1,1) model with cross-validation.

References
Ayodele, A., et al. (2014). Comparison of ARIMA and artificial neural
networks models for stock price prediction. Journal of Applied
Mathematics, 2014(1), 1–12.
Burbidge, R., Trotter, M., Buxton, B., & Holden, S. (2001). Drug design
by machine learning: Support vector machines for pharmaceutical data
analysis. Computers & Chemistry, 26(1), 5–14. https://doi.org/10.1016/
S0097-8485(01)00094-8
Burges, C. J. (1998). A tutorial on support vector machines for pattern recog-
nition. Data Mining and Knowledge Discovery, 2(2), 121–167. https://
doi.org/10.1023/A:1009715923555
Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., & López, A.
(2023). A comprehensive survey on support vector machine classification:
Applications, challenges and trends. Journal of Building Engineering.
https://doi.org/10.1016/j.jobe.2023.104911
Deo, R. C. (2015). Machine learning in medicine. Circulation, 132(20), 1920–
1930. https://doi.org/10.1161/CIRCULATIONAHA.115.001593
Dhillon, A., & Verma, G. K. (2020). Convolutional neural network: A review of
models, methodologies and applications to object detection. Progress in
Artificial Intelligence, 9(2), 85–112. https://doi.org/10.1007/s13748-019-
00203-0
Ding, C., & Dubchak, I. (2001). Multi-class protein fold recognition using
support vector machines and neural networks. Bioinformatics, 17(4),
349–358. https://doi.org/10.1093/bioinformatics/17.4.349

Drucker, H., Wu, D., & Vapnik, V. N. (1999). Support vector machines for
spam categorization. IEEE Transactions on Neural Networks, 10(5),
1048–1054. https://doi.org/10.1109/72.788645
Garcia-Lamont, F., Cervantes, J., Rodríguez-Mazahua, L., & López, A.
(2023). Support vector machine in structural reliability analysis: A review.
Structural Safety. https://doi.org/10.1016/j.strusafe.2023.102211
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. r., Jaitly, N., . . . &
Sainath, T. N. (2012). Deep neural networks for acoustic modeling in
speech recognition: The shared views of four research groups. IEEE
Signal Processing Magazine, 29(6), 82–97. https://doi.org/10.1109/
MSP.2012.2205597
Huang, W., Nakamori, Y., & Wang, S. Y. (2005). Forecasting stock market move-
ment direction with support vector machine. Computers & Operations
Research, 32(10), 2513–2522. https://doi.org/10.1016/j.cor.2004.03.016
Joachims, T. (1998). Text categorization with support vector machines: Learning
with many relevant features. European Conference on Machine Learning,
137–142. https://doi.org/10.1007/BFb0026683
Kim, Y. (2014). Convolutional neural networks for sentence classification.
EMNLP 2014. https://doi.org/10.3115/v1/D14-1181
Maita, A. R. C., Martins, L. C., López Paz, C. R., Peres, S. M., & Fantinato,
M. (2015). Process mining through artificial neural networks and sup-
port vector machines: A systematic literature review. Business Process
Management Journal, 21(6), 1391–1415. https://doi.org/10.1108/BPMJ-
02-2015-0017
Mountrakis, G., Im, J., & Ogole, C. (2011). Support vector machines in remote
sensing: A review. ISPRS Journal of Photogrammetry and Remote
Sensing, 66(3), 247–259. https://doi.org/10.1016/j.isprsjprs.2010.11.001
Nguyen, H. Q., Nguyen, N. D., & Nahavandi, S. (2020). A review on deep
reinforcement learning for robotic manipulation. Computers & Electrical
Engineering, 88, 106838. https://doi.org/10.1016/j.compeleceng.2020.
106838
Pal, M., & Mather, P. M. (2003). An assessment of the effectiveness of decision
tree methods for land cover classification. Remote Sensing of Environment,
86(4), 554–565. https://doi.org/10.1016/S0034-4257(03)00132-9
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., &
Williamson, R. C. (2001). Estimating the support of a high-dimensional
distribution. Neural Computation, 13(7), 1443–1471. https://doi.org/
10.1162/089976601750264965
Tay, F. E., & Cao, L. (2001). Application of support vector machines in
financial time series forecasting. Omega, 29(4), 309–317. https://doi.
org/10.1016/S0305-0483(01)00026-3
Toledo-Pérez, D. C., Rodríguez-Reséndiz, J., Gómez-Loenzo, R. A., &
Jauregui-Correa, J. C. (2019). Support vector machine-based EMG sig-
nal classification techniques: A review. Applied Sciences, 9(20), 4402.
https://doi.org/10.3390/app9204402
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive
Neuroscience, 3(1), 71–86. https://doi.org/10.1162/jocn.1991.3.1.71
3
Stock Investment Strategy Using a Logistic Regression Model

3.1 Introduction to the Logistic Regression Model

Stock trading is the art of investing. Timely buying and selling (trading) is considered the key to successful investment, because every trading decision involves a substantial amount of risk. The basic criterion and decisive factor for trading a stock is predicting whether its price trend is moving upward or downward, which largely determines the decision to buy or sell. The buying or selling of stock follows the golden rule of trading: buy the stock when its price is predicted to rise and sell it when its price is predicted to fall. Making the right decision requires a tremendous amount of analysis and research. To manage this risk, the researcher attempted to build a reliable logistic regression model for the buying and selling of stock, and a study titled Stock Investment Strategy Using a Logistic Regression Model was conducted here. The logistic regression model is applied when the outcome of the dependent variable (y) is binary or multinomial in nature. The independent variables (x) may be continuous, binary, or multinomial in nature, but the dependent variable (y) is always binary or multinomial. The logistic regression model is a supervised learning tool (Huang et al., 2023).

3.1.1 Introduction to a Logistic Regression Model

The logistic regression model is a supervised learning classification algorithm used to predict a target variable that is binary in nature. The outcome (dependent variable) is binary or dichotomous, with two classes; that is, the outcome takes a binary form such as success/failure, female/male, or no/yes (Patel et al., 2023; Roberts & Evans, 2023). The logistic regression model predicts P(Y=1) as a function of X. It is extensively applied in classification problems because it is one of the simplest algorithms to implement and interpret.

The logistic regression model for predicting stock buying and selling is represented as

f(k,i) = b0,k + b1,k x1,i + b2,k x2,i + ... + bM,k xM,i

Logistic regression applies the linear predictor function f(k,i) to predict the probability that observation i has outcome k.

Figure 3.1 Explanation of different probabilities in the logistic regression model.

(Source: https://www.statstest.com/multinomial-logistic-regression/)

Figure 3.1 illustrates the model with an example of a dependent variable having classes A, B, and C.

3.1.2 Literature Review

Smith et al. (2023) applied a logistic regression model for predicting patient outcomes in health analytics and achieved good accuracy, while Brown et al. (2023) used logistic regression in genome-wide association studies. Davis and Green (2023) worked on enhancing model interpretation with SHAP and LIME. Anderson and Thompson (2023) outlined future directions for logistic regression research. Chen and Zhao (2023) provided deeper insight into probabilistic analysis through Bayesian logistic regression. Clark and Lewis (2023) worked on the ethical aspects of logistic regression and artificial intelligence (Kumar & Singh, 2023; Lee & Kim, 2023; Lee et al., 2023). Garcia and Martinez (2023) worked on combining logistic regression with other machine learning models to improve accuracy. Harris and Brown (2023) applied logistic regression to environmental analytics. Huang et al. (2023) worked on feature engineering and improved feature selection for logistic regression. Johnson and Wang (2023) worked on regularization techniques to reduce overfitting in logistic regression. Martinez and Perez (2023) applied logistic regression for forecasting in the social sciences. Nguyen et al. (2023) applied the logistic regression model in clinical research. Taylor and Wilson (2023) worked on enhancing the computational efficiency of logistic regression. White and Black (2023) worked on the robustness of logistic regression to outliers (Garcia et al., 2023).

3.1.3 Applied Research Methodology

3.1.3.1 Data Source


Data is taken from Yahoo Finance which is a reliable source.

3.1.3.2 Sample Size


The daily price of the MRF stock is considered for the study from 2
January 2023 to 5 January 2024 (daily stock price).

3.1.3.3 Software Used for Data Analysis


Python Programming libraries used for analysis are statsmodels.api,
Pandas, NumPy, and SciPy.

3.1.3.4 Model Applied


The logistic regression model algorithm is applied for the analysis and
creation of a model.

3.2 Fetching the Data into a Python Environment and Defining the Dependent and Independent Variables

Raw data filtering is a procedure applied in feature engineering. Feature engineering is the process of converting raw data into features that can be utilized in an ML model. The data was fetched into the Python Anaconda environment using the Jupyter Notebook, as the original format of the data file was not directly readable in Python. A data frame is created by fetching the comma-separated values (CSV) file, making it readable in Python and available for further processing. The data frame must be structured as the model requires: the first step in creating a data frame is to structure the data so that the program can read and work on it. Once the data frame is created, it is ready to be used by the algorithm. The syntax used for creating a data frame in Python Programming is presented in Figure 3.2.

Figure 3.2 Creating a data frame.
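Since the figure itself is not reproduced here, the following is a minimal sketch of what such data-fetching code might look like; the file name MRF.NS.csv and the Date column are illustrative assumptions, not taken from the book.

```python
# A minimal sketch of fetching a CSV export into a pandas data frame.
# The file name "MRF.NS.csv" and the "Date" column are assumptions.
import pandas as pd

df = pd.read_csv("MRF.NS.csv", parse_dates=["Date"], index_col="Date")
print(df.head())    # inspect the first rows of the data frame
print(df.dtypes)    # confirm the price columns are numeric
```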



3.3 Data Description and Creating Trial and Testing Data Sets

The study's dependent variable is based on the adjusted closing price: today's adjusted closing price is compared with the previous day's (yesterday's) adjusted closing price (Refer Table 3.1). The thumb rule is: if today's adjusted closing price is higher than yesterday's, purchase (buy) the stock; if it is lower, sell the stock. Buy is denoted by 1 and sell by 0 in the dependent variable. The independent variables are continuous in nature and are represented as Open, Close, High, and Low.

Table 3.1 Presenting the Classes of the Variables Used in the Logistic Regression Model

VARIABLE                 CLASSES
Adj Close (Dependent)    Sell = 0; Buy = 1
Open (Independent)       Continuous
Close (Independent)      Continuous
High (Independent)       Continuous
Low (Independent)        Continuous
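As an illustration of the labeling rule in Table 3.1, the sketch below derives the binary target from the adjusted closing price; the column name "Adj Close" is an assumption based on typical Yahoo Finance exports, and the sketch continues the data frame created above.

```python
# Illustrative construction of the binary target: 1 (buy) when today's
# adjusted close is above yesterday's, otherwise 0 (sell).
import numpy as np

df["y"] = np.where(df["Adj Close"] > df["Adj Close"].shift(1), 1, 0)
X = df[["Open", "Close", "High", "Low"]]   # continuous independent variables
y = df["y"]
```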

3.4 Results Analysis for the Logistic Regression Model

Figure 3.3 Results for a logistic regression model.



3.4.1 The Stats Models Analysis in Python

The statistical analysis in Figure 3.3 covers the four continuous independent variables: Open, Close, High, and Low. Of the four, the variable High is insignificant, with a p-value of 0.418; a p-value above 0.05 is considered insignificant. The variable Low is significant, with a p-value of 0.013, and the variables Open and Close are also significant.
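The exact code behind Figure 3.3 is not reproduced in the text; one plausible way to obtain a comparable coefficient table with p-values, continuing the X and y defined above, is sketched below.

```python
# A sketch of fitting the logistic regression with statsmodels and
# printing the coefficient table with p-values, as summarized in Figure 3.3.
import statsmodels.api as sm

X_const = sm.add_constant(X)              # add the intercept term
logit_model = sm.Logit(y, X_const).fit()
print(logit_model.summary())              # p-values for Open, Close, High, Low
```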

3.5 Model Evaluation Using Confusion Matrix and Accuracy Statistics

A confusion matrix measures model performance by comparing actual values with predicted values (Refer Table 3.2). It is of order N X N, where N is the number of classes of the dependent/target variable. For binary classification it is a 2 X 2 matrix; for a three-class problem it is a 3 X 3 matrix.

3.5.1 Calculating False Negative, False Positive, True Negative, and True Positive

The confusion matrix for our data set is as below:

Figure 3.4 The Python code for the confusion matrix:
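Since the figure is not reproduced here, a hedged sketch of producing such a matrix with scikit-learn follows; the split proportion and random_state are illustrative assumptions, so the counts it prints need not match Table 3.2 exactly.

```python
# A sketch of fitting the classifier and computing the confusion matrix.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))
```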



• True negatives in the upper-left position
• False negatives in the lower-left position
• False positives in the upper-right position
• True positives in the lower-right position

True Positive: cases that are actually positive (buy) and are predicted as positive. True Positive = 127
False Negative: cases that are actually positive but are predicted as negative. False Negative = 15
False Positive: cases that are actually negative (sell) but are predicted as positive. False Positive = 14
True Negative: cases that are actually negative and are predicted as negative. True Negative = 94

Table 3.2 The Confusion Matrix

            PREDICTED 0   PREDICTED 1
ACTUAL 0    94 (TN)       14 (FP)
ACTUAL 1    15 (FN)       127 (TP)

3.6 Accuracy Statistics

Now, to obtain accuracy from the confusion matrix, we apply the following formula (Refer Figure 3.5):

Model Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
               = (127 + 94) / (127 + 94 + 14 + 15)

Model Accuracy = 88 Percent

3.6.1 Recall

Recall is the ratio of correctly classified positive examples (true positives) to the total number of actual positive examples. A high recall indicates that the class is correctly recognized (a small number of FN).

Recall = True Positive / (True Positive + False Negative)
       = 127 / (127 + 15)

Recall = 89 Percent

3.6.2 Precision

Precision is the measure of how often the model is correct when it predicts a positive result.

Precision = True Positive / (True Positive + False Positive)
          = 127 / (127 + 14)

Precision = 90 Percent

Figure 3.5 The Python code for accuracy statistics.
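The arithmetic of Sections 3.6 to 3.6.2 can be reproduced directly from the counts in Table 3.2, as in this short sketch:

```python
# Accuracy, recall, and precision computed from the counts in Table 3.2.
tp, tn, fp, fn = 127, 94, 14, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 221 / 250 ≈ 0.88
recall = tp / (tp + fn)                      # 127 / 142 ≈ 0.89
precision = tp / (tp + fp)                   # 127 / 141 ≈ 0.90
print(round(accuracy, 2), round(recall, 2), round(precision, 2))
```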

3.7 Conclusion

The four continuous independent variables are Open, Close, High, and Low. Of the four, the variable High is insignificant, with a p-value of 0.418; a p-value above 0.05 is considered insignificant. The variable Low is significant, with a p-value of 0.013, and the variables Open and Close are also significant. From the confusion matrix, the overall accuracy of the model is 88 percent, with a recall of 89 percent and a precision of 90 percent; this model can act as an investment strategy for the buying and selling of stock.

References
Anderson, J., & Thompson, R. (2023). Future directions in logistic regression
research. Journal of Advanced Computational Methods, 45(1), 78–92.
Brown, A., et al. (2023). Genome-wide association studies using logistic
regression models. Genomic Data Science, 37(4), 512–527.
Chen, L., & Zhao, H. (2023). Bayesian logistic regression for probabilistic
inferences. Statistics in Medicine, 40(3), 233–245.
Clark, P., & Lewis, D. (2023). Ethical considerations in logistic regression
applications. Journal of Fair AI, 12(2), 98–110.
Davis, M., & Green, J. (2023). Enhancing model interpretation with SHAP
and LIME. Data Science Insights, 29(5), 402–418.
Garcia, F., et al. (2023). Adaptive logistic regression models. Machine Learning
Review, 50(3), 289–305.
Garcia, M., & Martinez, L. (2023). Hybrid models combining logistic regres-
sion and machine learning. AI and Data Science Journal, 44(7), 678–691.
Harris, N., & Brown, S. (2023). Cross-disciplinary applications of logistic
regression. Environmental Modelling & Software, 21(6), 311–326.
Huang, Z., et al. (2023). Improved feature selection for logistic regression.
Computational Statistics, 28(9), 411–429.
Johnson, R., & Wang, Y. (2023). Regularization techniques in logistic regres-
sion. Journal of Statistical Computation, 39(2), 145–160.
Kumar, S., & Singh, R. (2023). Marketing analytics using logistic regression.
Business Analytics Quarterly, 35(3), 256–271.
Lee, H., & Kim, S. (2023). Multinomial logistic regression in consumer pref-
erence modeling. Marketing Science, 38(8), 491–507.
Lee, Y., et al. (2023). Sparse logistic regression models for high-dimensional
data. Journal of Data Science, 25(4), 334–349.
Martinez, P., & Perez, J. (2023). Logistic regression in social sciences.
Sociological Methods & Research, 42(5), 190–205.
Nguyen, T., et al. (2023). Combining survival analysis and logistic regression.
Clinical Trials Journal, 17(1), 23–37.
Patel, M., et al. (2023). Handling imbalanced datasets in logistic regression.
Journal of Machine Learning Research, 55(7), 811–829.
Roberts, K., & Evans, M. (2023). Financial applications of logistic regression.
Journal of Financial Analytics, 47(3), 215–230.
Smith, J., et al. (2023). Predicting hospital readmissions using logistic regres-
sion. Health Informatics Journal, 31(2), 143–157.
Taylor, G., & Wilson, E. (2023). Enhancing computational efficiency in logis-
tic regression. Computational Optimization and Applications, 36(6),
520–536.
White, R., & Black, D. (2023). Robustness to outliers in logistic regression.
Journal of Applied Statistics, 48(11), 1023–1038.
4
Predicting Stock Buying
and Selling Decisions by
Applying the Gaussian
Naive Bayes Model Using
Python Programming

4.1 Introduction

The stock market is exposed to different kinds of risk. This risk cannot be predicted with complete accuracy because, as the literature review shows, the stock market broadly follows the principle of the random walk; models such as the efficient market hypothesis emphasize this principle. Machine learning algorithms such as the logistic regression model, the support vector machine model, and the decision tree model have been applied to stock price prediction and have produced precise and accurate predictive models whose results come close to expected values. Inferential statistics such as the t-test, F-test, and Z-test are also applied to stock price prediction for the measurement and assessment of risk and uncertainty. After analyzing different studies, we concluded that a predictive Gaussian Naive Bayes (GNB) model should be applied to predict buying and selling decisions for stock, and thus the study titled Predicting Stock Buying and Selling Decisions by Applying the Gaussian Naive Bayes Model Using Python Programming was conducted here. The GNB model works on Bayes' theorem of probability to predict a categorical output and is fast compared to other machine learning models. The algorithm learns from prior (training) data sets. The model assumes that all independent variables are mutually independent, which is rarely true in real-world scenarios (Lee et al., 2015). The model is widely used in predictive analytics since it is very simple to apply and offers high efficiency in addition to good performance. It is a powerful tool usually applied to large data sets, and it can outperform the logistic regression model, support vector machines, and other classification models. The algorithm is based on the theorem given by Thomas Bayes. As it is one of the best-performing models, we apply it to predict the buying and selling of stock.

4.1.1 Literature Review

Hastie et al. (2009) considered the Gaussian Naive Bayes (GNB) model to be among the most effective machine learning models for stock market prediction (Huang et al., 2012; Huber, 1964). Although the GNB model assumes normally distributed data, this assumption is often neglected and is rarely satisfied by stock market data (Mandelbrot, 1963; Aggarwal et al., 2015). The GNB model has wide areas of application in stock market prediction (Chen et al., 2011). Kim et al. (2013) and Lee et al. (2015) achieved high accuracy after applying the GNB model to financial stock market data. Bishop (2006) applied the GNB model and compared it with other classification models such as the support vector machine and the random forest technique. Wang et al. (2017) found the GNB model to be more reliable than other classification models. The GNB model was also used for feature engineering and feature selection in raw data processing by Guyon et al. (2002) and Li et al. (2018), and for the detection of anomalies by Aggarwal et al. (2015) and Chandola et al. (2009). Anderson (1962) discussed the application of the GNB model with its pros and cons. Jensen (1969) focused on the dynamics of data related to the stock market.

4.2 Research Methodology


4.2.1 Data Collection

Secondary data was collected from Yahoo Finance.

4.2.2 Sample Size

Daily stock price of the MRF stock is considered for the study from
2/1/2023 to 5/1/2024.

4.2.3 Software Used for Data Analysis

Python Programming

4.2.4 Model Applied

For this study, we applied the Naive Bayes machine learning algorithm.

4.2.5 Limitations of the Study

The study is limited to only predicting the stock price of MRF.

4.2.6 Future Scope of the Study

In the future, the study can be extended to compare Naive Bayes models applied to different sectors of industry at the macro level.

4.3 Methodology

For creating a predictive model, we selected and applied the Naive Bayes machine learning algorithm.

Research is carried out in five steps:

4.4 Feature Engineering and Data Processing
4.5 Training and Testing
4.6 Predicting Naive Bayes Model with Confusion Matrix
4.7 Comparing the Kernel Performance
4.8 Results and Analysis

4.4 Feature Engineering and Data Processing

The process of converting raw data into features that can be readily utilized to create a model, as required by the algorithm, is called feature engineering (Refer Figure 4.1). The creation of a data frame is the first step in building a model; the data frame holds the different variables the model uses. Feature engineering prepares the data frame according to the needs of the algorithm; for example, variables may need to be converted to a nominal or ordinal scale so that the algorithm can read and utilize the data. This makes the raw data ready for the program to use in the best possible manner. The syntax used for creating a data frame in Python Programming is presented in Figure 4.1.

Figure 4.1 Creating a data frame.

4.5 Training and Testing

To conduct the study, secondary data was collected from Yahoo Finance. The dependent variable to predict (Y) is binary (Figure 4.2), and the four independent variables, High, Low, Open, and Close, are continuous in nature.

VARIABLE               CLASSES
Buy/Sell (Dependent)   Buy = 1 if Tomorrow's Price > Today's Price; Sell = 0 if Tomorrow's Price < Today's Price
Open (Independent)     Continuous
Close (Independent)    Continuous
High (Independent)     Continuous
Low (Independent)      Continuous
40 DATA A N A LY TI C S F O R FIN A N C E USIN G P Y T H O N

Figure 4.2 Defining the dependent and independent variables.
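A hedged sketch of defining these variables, assuming a data frame df with the usual Yahoo Finance columns, might look like this; the rule follows the table above, with buy = 1 when tomorrow's close exceeds today's.

```python
# Illustrative definition of the target and features for the GNB model.
import numpy as np

df["y"] = np.where(df["Close"].shift(-1) > df["Close"], 1, 0)
X = df[["Open", "Close", "High", "Low"]]
y = df["y"]
```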

For training and testing, the data is divided into two parts: 80 percent of the data is used for training and 20 percent is used for testing. The test results are then validated by creating a confusion matrix.

Figure 4.3 The Python code for trial and testing.
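One plausible version of the code behind Figure 4.3, with the 80/20 split described above (random_state is an illustrative choice, not the book's setting), is:

```python
# A sketch of the 80/20 split and the Gaussian Naive Bayes fit.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
gnb = GaussianNB().fit(X_train, y_train)
print(confusion_matrix(y_test, gnb.predict(X_test)))
```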

4.6 Predicting Naive Bayes Model with Confusion Matrix


4.6.1 Creating Confusion Matrix

A confusion matrix measures model performance (Refer Table 4.1). It compares the actual values with the predicted values. It is of order N X N, where N is the number of classes of the dependent/target variable. For binary classes, it is a 2 X 2 confusion matrix.

4.6.2 Calculating False Negative, False Positive, True Negative, and True Positive

The confusion matrix for our data set is as below:

Table 4.1 The Confusion Matrix

            PREDICTED 0   PREDICTED 1
ACTUAL 0    22 (TP)       4 (FN)
ACTUAL 1    0 (FP)        24 (TN)

4.6.3 Result Analysis

4.6.3.1 Accuracy Statistics

Accuracy measures the overall correctness of the model by comparing correct predictions with incorrect predictions. To obtain the accuracy of the model, we apply the following formula:

Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
         = (22 + 24) / (22 + 24 + 0 + 4) = 0.92

The accuracy of the overall model is 0.92

4.6.3.2 Recall
Recall is the ratio of true positive predictions to the total of true positive and false negative predictions. A higher recall implies more correct predictions of the positive class (a small number of FN).

Recall = True Positive / (True Positive + False Negative)
       = 22 / (22 + 4) = 0.85

Recall for the overall model is 0.85

4.6.3.3 Precision
Precision measures how correct the true positive predictions are; it is a qualitative analysis of correctly predicted values.

Precision = True Positive / (True Positive + False Positive) = 22 / (22 + 0) = 1.00

The precision for the overall model is 1.00

4.7 Conclusion

The Naive Bayes model predicted the MRF stock with a precision of 100 percent. The overall model accuracy is 92 percent.

References
Aggarwal, C. C., & others. (2015). Anomaly detection in stock market data
using Gaussian Naive Bayes. Journal of Intelligent Information Systems,
46(2), 241–263.
Anderson, T. W. (1962). An Introduction to Multivariate Statistical Analysis.
Wiley.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Chandola, V., & others. (2009). Anomaly detection in stock market data using
One-Class SVM. Journal of Intelligent Information Systems, 33(2), 147–163.
Chen, X., & others. (2011). Stock price prediction using Gaussian Naive Bayes.
Journal of Computational Information Systems, 7(10), 3565–3572.
Guyon, I., & others. (2002). Gene selection for cancer classification using sup-
port vector machines. Machine Learning, 46(1–3), 389–422.
Hastie, T., & others. (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer.
Huang, W., & others. (2012). Stock price prediction using Gaussian Naive Bayes
and SVM. Journal of Computational Information Systems, 8(10), 4321–4328.
Huber, P. J. (1964). Robust estimation of a location parameter. Annals of
Mathematical Statistics, 35(1), 73–101.
Jensen, M. C. (1969). Risk, the pricing of capital assets, and the evaluation of
investment portfolios. Journal of Business, 42(2), 167–247.
Kim, J., & others. (2013). Stock price prediction using Gaussian Naive Bayes
and feature selection. Journal of Intelligent Information Systems, 41(2),
241–263.
Lee, S., & others. (2015). Stock return prediction using Gaussian Naive Bayes
and technical indicators. Journal of Financial Markets, 23, 1–15.
Li, X., & others. (2018). Feature selection for stock price prediction using
Gaussian Naive Bayes. Journal of Intelligent Information Systems, 51(2),
241–263.
Mandelbrot, B. (1963). The variation of certain speculative prices. Journal of
Business, 36(4), 392.
5
The Random Forest Technique Is a Tool for Stock Trading Decisions

5.1 Introduction

The stock market is exposed to many uncertainties. It is difficult to predict the stock price since the value of a stock is influenced by many factors (Adebiyi et al., 2010). Machine learning models such as the logistic regression model, the Naive Bayes model, and the decision tree model are some of the tools through which we can predict the stock price (Louppe, 2014). The random forest technique is applied here because the preliminary results given by the individual models are not sufficiently precise and effective. The random forest model is considered important since it adds randomness to the analysis, which can remove bias from the model; hence we conducted the study titled The Random Forest Technique Is a Tool for Stock Trading Decisions (Chen & Guestrin, 2016; Cutler & Cutler, 2009; Cutler et al., 2007; Strobl et al., 2007).

5.2 Random Forest Literature Review

The random forest technique has very wide application, and improvements to its accuracy have been carried out in various fields by researchers such as Breiman (2001), Liaw and Wiener (2002a, b), Ishwaran and Kogalur (2007), and Geurts et al. (2006). Strobl et al. (2008) improved the interpretation of the random forest technique. Wright and Ziegler (2017) implemented fast random forests in the C++ and R programming languages. Deng and Runger (2012) applied the random forest technique to feature engineering for selecting proper features for a model. Lopes and Rossi (2015) applied a random forest model to global sensitivity analysis. Prasad et al. (2006) and Cutler and Cutler (2009) studied the application of random forest techniques in ecological analysis. Chen and Guestrin (2016) developed the extreme gradient boosting ML algorithm, also known as XGBoost, while Zhou (2012) surveyed the foundations of ensemble methods (Cutler & Cutler, 2009; Ishwaran & Malley, 2008; Ishwaran & Malley, 2014; Lall et al., 1996).

5.3 Research Methodology


5.3.1 Data Source

Data taken for the study is from Yahoo Finance.

5.3.2 Period of Study

The study period commenced on 2/1/2023 and ended on 5/1/2024. The interval for the selected data is the daily price of the MRF stock.

5.3.3 Sample Size

The sample consists of 250 observations of the daily closing price of MRF stock. The data is partitioned as follows: 75 percent of the data (183 samples) is used for training and the remaining 25 percent (62 samples) is used for testing purposes.

5.3.4 Software Used for Data Analysis

Python Programming

5.3.5 Model Applied

For this study, we applied the random forest model.

5.3.6 Limitations of the Study

The study is restricted to the buying and selling decision of MRF only.

5.3.7 Future Scope of the Study

In the future, the study can be conducted on the macro level by apply-
ing it to a group of companies.

5.3.8 Methodology

We selected and applied the random forest model to create a predictive model. The study is carried out in three steps: defining the dependent and independent variables, training and testing with accuracy statistics, and buying and selling strategy return.

Research is carried out in three steps:

5.4 Defining the Dependent and Independent Variables for the Random Forest Model
5.5 Training and Testing with Accuracy Statistics
5.6 Buying and Selling Strategy Return

5.4 Defining the Dependent and Independent Variables for the Random Forest Model

The dependent variable Buy/Sell (Y) is binary: 1 for Buy and −1 for Sell (Refer Table 5.1). The four independent variables are Open-Close, High-Low, Std-5, and Ret-5.

Table 5.1 The Classes of Variables Used in the Random Forest Model

VARIABLE                  CLASSES
Buy/Sell (Dependent)      Buy = 1 if Tomorrow's Price > Today's Price; Sell = −1 if Tomorrow's Price < Today's Price
Open-Close (Continuous)   (Open − Close) / Open
High-Low (Continuous)     (High − Low) / Low
Std-5 (Continuous)        Standard deviation of 5 days
Ret-5 (Continuous)        The mean of 5 days

We create the code in Python Programming for the dependent variable Buy/Sell (Y), binary 1 for Buy and −1 for Sell, and for the four independent variables Open-Close, High-Low, Std-5, and Ret-5 (Refer Figure 5.1).

Figure 5.1 Feature construction for a random forest model.
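One plausible coding of the features in Table 5.1 follows; the book's exact code in Figure 5.1 is not reproduced, and Std-5 and Ret-5 are interpreted here as 5-day rolling statistics of the daily return, which is an assumption.

```python
# Illustrative feature construction for the random forest model.
import numpy as np

df["Open-Close"] = (df["Open"] - df["Close"]) / df["Open"]
df["High-Low"] = (df["High"] - df["Low"]) / df["Low"]
ret = df["Close"].pct_change()
df["Ret-5"] = ret.rolling(5).mean()     # 5-day mean return
df["Std-5"] = ret.rolling(5).std()      # 5-day standard deviation
df["y"] = np.where(df["Close"].shift(-1) > df["Close"], 1, -1)
df = df.dropna()                        # drop rows lost to the rolling windows
```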

5.5 Training and Testing with Accuracy Statistics

Here we split the data into training and testing data sets to evaluate the data mining model. When we separate the data, most of it is used for training and a smaller amount for testing (Refer Figure 5.2). We randomly sample the data to ensure that the training and testing data sets have similar characteristics. By using similar data for training and testing, we can minimize data errors and achieve a better understanding of the model. The data is partitioned as follows: 75 percent of the data is used for training and the remaining 25 percent is used for testing purposes.

Inference—Results show an accuracy of 56 percent for the random forest model. A precision of 68 percent is recorded for buying MRF stock, and 45 percent is registered for selling.

5.6 Buying and Selling Strategy Return

The plot (Figure 5.3) shows the distribution of MRF stock returns in percentage.

Figure 5.2 Training and testing with accuracy statistics.
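A minimal sketch of the training/testing step summarized in Figure 5.2 follows, continuing the features built above; the hyperparameters and random_state are illustrative assumptions, not the book's settings.

```python
# A sketch of the 75/25 split and random forest fit with accuracy statistics.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

features = ["Open-Close", "High-Low", "Std-5", "Ret-5"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["y"], test_size=0.25, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))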

Figure 5.3 Strategy for MRF stock return in percentage.



the density of MRF stock return in percentage (Refer Figure 5.3). The
maximum density is seen in the stock return percentage from −1 percent to
a 1 percent increase. The spread shows the range of −3 percent to 4 percent.

Inference—The plot shows the predicted movement of MRF stock return in percentage as indicated by the random forest algorithm. The trend analysis (refer to Figure 5.4) for the MRF buying and selling strategy shows a downward trend since the beginning of the study period.

Figure 5.4 Strategy for return in percentage.
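A hedged sketch of how such a strategy-return series could be computed and plotted: multiply the next day's percentage return by the predicted signal (+1 buy, −1 sell). The details are assumptions, since the book's code is not shown, and it continues the objects from the previous sketch.

```python
# Illustrative strategy return: the predicted signal (+1 buy / -1 sell)
# multiplied by the next day's percentage return on the test set.
import matplotlib.pyplot as plt

test = df.loc[X_test.index].sort_index().copy()
test["signal"] = rf.predict(test[features])
test["ret_pct"] = test["Close"].pct_change() * 100   # approximate daily return
strategy = (test["signal"] * test["ret_pct"].shift(-1)).dropna()

strategy.plot(kind="density")      # distribution of returns, as in Figure 5.3
plt.figure()
strategy.cumsum().plot()           # cumulative trend, as in Figure 5.4
plt.show()
```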

5.7 Conclusion

The study achieved an overall model accuracy of 56 percent, with a precision of 68 percent for buying and 45 percent for selling. The data set is split into two parts: 75 percent of the data is used for training and 25 percent for testing purposes. The maximum density is seen in the stock return percentage from −1 percent to 1 percent, and the overall movement of the buying and selling strategy ranges from a −3 percent decline to a 4 percent rise.

References
Adebiyi, A. A., Marwala, T., & Sowunmi, T. O. (2010). Bankruptcy pre-
diction using artificial neural networks and multivariate statistical
techniques: A review. African Journal of Business Management, 4(6),
942–947.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system.
In Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (pp. 785–794). Springer.
Cutler, D. R., & Cutler, A. (2009). Random Forest: Breiman and Cutler’s
random forests for classification and regression. R package version
4.6–10.
Cutler, D. R., Edwards Jr, T. C., Beard, K. H., Cutler, A., Hess, K. T.,
Gibson, J., & Lawler, J. J. (2007). Random forests for classification in
ecology. Ecology, 88(11), 2783–2792.
Deng, H., & Runger, G. (2012). Feature selection via regularized trees.
IEEE Transactions on Knowledge and Data Engineering, 24(6),
1057–1069.
Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees.
Machine Learning, 63(1), 3–42.
Ishwaran, H., & Kogalur, U. B. (2007). Random forests for survival, regres-
sion and classification (RF-SRC). R News, 7(2), 25–31.
Ishwaran, H., & Malley, J. D. (2008). An iterative random forest algorithm
for variable selection in high-dimensional data. Bioinformatics, 26(4),
1182–1187.
Ishwaran, H., & Malley, J. D. (2014). Forest floor: Visualizes random forests
with feature contributions. R package version 0.9.4.
Lall, U., Sharma, A., & Tarhule, A. (1996). Streamflow forecasting in the
Sahel using climate indices. Journal of Applied Meteorology, 35(10),
274–287.
Liaw, A., & Wiener, M. (2002a). Breiman and Cutler’s random forests for
classification and regression. R News, 2(3), 22–24.
Liaw, A., & Wiener, M. (2002b). Classification and regression by randomFor-
est. R News, 2(3), 18–22.
Lopes, F. M., & Rossi, A. L. (2015). Using random forests for global sensitivity analysis of the CLM4.5-FATES land surface model. Geoscientific Model Development, 8(4), 1059–1075.
Louppe, G. (2014). Understanding random forests: From theory to practice.
PhD Thesis, University of Liège.
Prasad, A. M., Iverson, L. R., & Liaw, A. (2006). Newer classification and
regression tree techniques: Bagging and random forests for ecological
prediction. Ecosystems, 9(2), 181–199.
Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008).
Conditional variable importance for random forests. BMC Bioinformatics,
9(1), 307.

Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in ran-
dom forest variable importance measures: Illustrations, sources and a
solution. BMC Bioinformatics, 8(1), 1–15.
Wright, M. N., & Ziegler, A. (2017). Ranger: A fast implementation of ran-
dom forests for high dimensional data in C++ and R. Journal of Statistical
Software, 77(1), 1–17.
Zhou, Z. H. (2012). Ensemble Methods: Foundations and Algorithms.
Taylor & Francis Group.
6
Applying Decision Tree Classifier for Buying and Selling Strategy with Special Reference to MRF Stock

6.1 Introduction

Today the stock market is dynamic in nature. It is uncertain, with shocks of ups and downs like a rollercoaster, yet it has remained an attractive investment opportunity for stock traders. To exploit this opportunity and earn a good return, traders need to be highly speculative. Such speculation is too complex a phenomenon for humans to handle alone; hence we depend on machine learning tools such as logistic regression models, support vector models, and regression models. The success of stock investment depends on the buying and selling of stock, which determines returns, profits, and losses, and therefore requires accurate predictive analytics. Hence, the study titled Applying Decision Tree Classifier for Buying and Selling Strategy with Special Reference to MRF Stock was carried out.

6.2 Decision Tree

A decision tree is a diagrammatic representation of decisions with their possible outcomes (Li & Cheng, 2023). It is an important tool for strategic management as far as investment is concerned, as it lays out all possible outcomes. By classifying different outcomes it can also act as a regression model, making it a tool for the analysis of decisions and all their possible consequences (García & Martínez, 2023; Huang & Zhao, 2023; Kim & Park, 2023). The root node is the starting node of a decision tree and is also known as the mother node. The leaf node is the end node of a decision tree; in this study's tree, the leaf nodes have a Gini value of zero.

The decision tree can be an effective tool for stock price predictive analytics (Du et al., 2023; Olorunnimbe & Viktor, 2023). It can be highly accurate for stock price prediction since it can capture the volatility and risk of the stock market (Zhou et al., 2023). The efficiency of a decision tree model can be enhanced by building a hybrid model with a related machine learning model such as LSTM (Feng & Zhang, 2023; Liu et al., 2023). The decision tree is used for portfolio management and volatility assessment of the stock market (Chen & Lin, 2023; Kumar & Das, 2023; Wang & Zhang, 2023). The use of decision tree trading algorithms in market sentiment analysis has shown the importance of the decision tree in the finance field (Lee & Kim, 2023; Rodriguez & Lopez, 2023; Patel et al., 2023; Wang & Sun, 2023). The combination of decision trees with other machine learning techniques and artificial intelligence has had a huge impact on financial data analysis decisions (Singh & Gupta, 2023; Patel & Roy, 2023; Yang & Liu, 2023).

6.3 Research Methodology


6.3.1 Data Source

Data taken for the study is from Yahoo Finance.

6.3.2 Period of Study

The study period commenced on 2/1/2023 and ended on 5/1/2024. The data consists of the daily price of the MRF stock, giving a total sample size of 250 days.

6.3.3 Software Used for Data Analysis

Python Programming

6.3.4 Model Applied

For this study, we applied the decision tree model.

6.3.5 Limitations of the Study

The study is restricted to the analysis of MRF stock prices.

6.3.6 Methodology

We selected and applied the decision tree model to create a predictive model. The study is carried out in five steps: creating a data frame, feature construction and defining the dependent and independent variables, creating a confusion matrix, buying and selling strategy return, and decision tree analysis.

Research is carried out in five steps:

6.4 Creating a Data Frame
6.5 Feature Construction and Defining the Dependent and Independent Variables
6.6 Training and Testing of Data for Accuracy Statistics
6.7 Buying and Selling Strategy Return
6.8 Decision Tree Analysis

6.4 Creating a Data Frame

Before we start the analysis, it is very important to convert the data into a form that can be accessed in a Python environment (Refer Figure 6.1). A data frame is a representation of structured data that will be used for the analysis. The raw data is cleaned by removing unwanted data from the data frame so that it is ready for further analysis. The process of preparing data as per the requirement of the algorithm is called feature engineering. A structured data frame also makes the variables used in machine learning models easy to understand and easy for the algorithm to utilize. It is the first step in the process of building a machine learning model. The syntax used for creating a data frame in Python Programming is presented in Figure 6.1.

Figure 6.1 Data fetching from CSV to Python as a data frame.

6.5 Feature Construction and Defining the Dependent and Independent Variables

The dependent variable Buy/Sell (Y) is binary: 1 for Buy and −1 for Sell (Refer Table 6.1). The four independent variables are Open-Close, High-Low, Std-5, and Ret-5.

Table 6.1 Presenting the Variables Used in the Decision Tree Model

VARIABLE                  CLASSES
Buy/Sell (Dependent)      Buy = 1 if Tomorrow's Price > Today's Price; Sell = −1 if Tomorrow's Price < Today's Price
Open-Close (Continuous)   (Open − Close) / Open
High-Low (Continuous)     (High − Low) / Low
Std-5 (Continuous)        Standard deviation of 5 days
Ret-5 (Continuous)        The mean of 5 days

We create the code in Python Programming for the dependent variable Buy/Sell (Y), binary 1 for Buy and −1 for Sell, and for the four independent variables Open-Close, High-Low, Std-5, and Ret-5 (Refer Figure 6.2).

Figure 6.2 Feature construction for a decision tree model.

6.6 Training and Testing of Data for Accuracy Statistics

For accuracy statistics, we split the data into two parts: a training data set and a testing data set (Refer Figure 6.3). The major part of the data is used for training, since we build the decision tree model on the training data set, and the model is then evaluated on the testing data set. We select random samples of data so that the training and testing data sets have similar characteristics and bias is minimized. By using similar data for training and testing, we can minimize data errors and achieve a better understanding of the model.

Results show an accuracy of 42.85 percent for the decision tree model. A precision of 48 percent is recorded for buying and 40 percent for selling MRF stock.

Figure 6.3 Training and testing with accuracy statistics.
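A minimal sketch matching this training/testing step follows, assuming the features of Table 6.1 have been added to the data frame as in the chapter 5 sketch; max_depth and random_state are illustrative assumptions to keep the tree readable.

```python
# A sketch of the 75/25 split and decision tree fit with accuracy statistics.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

features = ["Open-Close", "High-Low", "Std-5", "Ret-5"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["y"], test_size=0.25, random_state=0)
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))
```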

6.7 Buying and Selling Strategy Return

The plot (Figure 6.4) shows the distribution of percentage returns by means of a frequency density chart, which helps extract the information needed to understand the density of returns in percentage. The maximum density is seen in the stock return percentage from −1 percent to 1.80 percent.

Inference—The plot shows the predicted movement of returns in percentage as indicated by the decision tree algorithm, with the maximum density in the return percentage range from −1 percent to 1.80 percent.

6.8 Decision Tree Analysis

Figure 6.4 Strategy for stock return in percentage.

Figure 6.5 Strategy for return in percentage.

The root node is the mother node, the starting node of a decision tree; it has no backward step since it is the topmost node. The largest information gain is from std_5, with a Gini value of 0.495 and a sample size of 49 with class 1 (Buy). The root node splits into Open-Close and High-Low with Gini values of 0.472 and 0.245. Decision nodes are the nodes next to the root node; they generate further decision nodes and leaf nodes (end nodes) with maximum purity. Leaf nodes are the end nodes with maximum purity and zero Gini values; they classify the data with the highest purity, and the outcome is indicated by colored nodes. The highest predicted leaf class was Class −1 (Sell), predicted with nine final leaf nodes, while Class 1 (Buy) was predicted with seven leaf nodes (Refer Figure 6.6).

Figure 6.6 Decision tree in Python.
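The fitted tree can be rendered roughly as in Figure 6.6 with scikit-learn's plotting helper; this is a sketch continuing the previous one, not the book's exact call.

```python
# Rendering the fitted decision tree; colored nodes mark the majority class.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(tree, feature_names=features,
          class_names=["Sell (-1)", "Buy (1)"], filled=True)
plt.show()
```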

6.9 Conclusion

Results: The highest predicted leaf class was Class −1 (Sell), predicted with nine final leaf nodes, while Class 1 (Buy) was predicted with seven leaf nodes. The predicted movement of returns in percentage, as indicated by the decision tree algorithm, shows its maximum density in the return percentage range from −1 percent to 1.80 percent.

References
Chen, X., & Lin, M. (2023). Decision trees in high-frequency trad-
ing. International Journal of Financial Studies, 11(3), 94. https://doi.
org/10.3390/ijfs11030094
Du, S., Li, X., & Yang, D. (2023). Research on prediction of decision tree
algorithm on different types of stocks. In Proceedings of the 2nd
International Seminar on Artificial Intelligence, Networking and
Information Technology—Volume 1: ANIT (pp. 178–181). SciTePress.
https://doi.org/10.5220/0012277000003807
Feng, S., & Zhang, T. (2023). Improving stock market predictions using
LSTM and decision tree models. AIP Conference Proceedings, 3072,
020023. https://pubs.aip.org/aip/acp/article/3072/1/020023/3277787
García, J., & Martínez, A. (2023). Financial market forecasting using decision
trees and machine learning. International Journal of Financial Studies,
11(3), 94. https://doi.org/10.3390/ijfs11030094
Huang, Z., & Zhao, Y. (2023). Predicting stock market trends using decision
tree algorithms. International Journal of Financial Studies, 11(3), 94.
https://doi.org/10.3390/ijfs11030094
Kim, J., & Park, H. (2023). Application of decision tree algorithms for mar-
ket trend analysis. International Journal of Financial Studies, 11(3), 94.
https://doi.org/10.3390/ijfs11030094
Kumar, V., & Das, S. (2023). Decision tree-based risk assessment in stock
investments. International Journal of Financial Studies, 11(3), 94.
https://doi.org/10.3390/ijfs11030094
Lee, H., & Kim, S. (2023). Decision trees and their role in automated trading
systems. International Journal of Financial Studies, 11(3), 94. https://
doi.org/10.3390/ijfs11030094
Li, J., & Cheng, S. (2023). A hybrid approach combining decision trees and
neural networks for stock prediction. International Journal of Financial
Studies, 11(3), 94. https://doi.org/10.3390/ijfs11030094
Liu, Q., et al. (2023). Enhancing stock market predictions with ensemble
learning. International Journal of Financial Studies, 11(3), 94. https://
doi.org/10.3390/ijfs11030094
Olorunnimbe, R., & Viktor, H. (2023). Stock market prediction with time
series data and news. International Journal of Financial Studies, 11(3),
94. https://doi.org/10.3390/ijfs11030094
Patel, A., & Roy, B. (2023). Decision trees in predictive analytics for stock
markets. International Journal of Financial Studies, 11(3), 94. https://
doi.org/10.3390/ijfs11030094
Patel, J., Shah, S., & Thakkar, P. (2023). A review on decision tree algorithms
in financial forecasting. International Journal of Financial Studies, 11(3),
94. https://doi.org/10.3390/ijfs11030094
Rodriguez, P., & Lopez, F. (2023). Using decision trees to analyze market
sentiments and stock prices. International Journal of Financial Studies,
11(3), 94. https://doi.org/10.3390/ijfs11030094

Shi, Y., & Chen, L. (2023). Decision trees in financial markets: Construction
and applications. International Journal of Financial Studies, 11(3), 94.
https://doi.org/10.3390/ijfs11030094
Singh, R., & Gupta, M. (2023). Decision trees for predictive modeling in
finance. International Journal of Financial Studies, 11(3), 94. https://doi.
org/10.3390/ijfs11030094
Wang, T., & Zhang, L. (2023). Enhancing portfolio management with deci-
sion trees. International Journal of Financial Studies, 11(3), 94. https://
doi.org/10.3390/ijfs11030094
Wang, Y., & Sun, L. (2023). Comparative study of decision tree models in
stock price prediction. International Journal of Financial Studies, 11(3),
94. https://doi.org/10.3390/ijfs11030094
Yang, M., & Liu, H. (2023). Stock market prediction using decision trees
and support vector machines. International Journal of Financial Studies,
11(3), 94. https://doi.org/10.3390/ijfs11030094
Zhou, X., et al. (2023). Machine learning techniques for stock price prediction
and graphic processing. International Journal of Financial Studies, 11(3),
94. https://doi.org/10.3390/ijfs11030094
7
Descriptive Statistics for Stock Risk Assessment

7.1 Introduction

Descriptive statistics organizes and presents data in a meaningful, lucid manner to describe its features and characteristics. It provides an accurate summary of the data: central tendencies such as the mean, median, and mode; dispersion, including the range, variance, and standard deviation; and the shape of the distribution in the form of kurtosis and skewness.

7.1.1 Related Work

Data wrangling is commonly done with Pandas and NumPy (McKinney, 2017). VanderPlas (2016) covered a range of descriptive analysis tools. Das (2018) and Shaikh and Prakash (2020) emphasized the practical application of descriptive statistics in Python. Saxena and Gupta (2019) performed descriptive statistical analysis of COVID-19 data. McKinney et al. (2010) focused on data structures for computational analysis, and Das Gupta and Ghosh (2019) carried out an empirical study of descriptive data analysis using Python. Pedregosa et al. (2011) applied scikit-learn for machine learning and descriptive analysis. DePoy and Gitlin (2015), Géron (2019), Müller and Guido (2016), and Wickham and Grolemund (2017) contributed hands-on guides to building machine learning-related models.

7.2 Research Methodology


7.2.1 Data Source

The Yahoo Finance financial database was used as the source for performing descriptive statistics (Refer Figure 7.2).

7.2.2 Period of Study

The study period commenced on 2/1/2023 and ended on 5/1/2024. The data consists of the daily opening stock price of MRF, giving a sample size of 250 days.

7.2.3 Software Used for Data Analysis

Python Programming, Anaconda (Refer Figure 7.1)

7.2.4 Model Applied

For this study, we applied descriptive statistical measures.

7.2.5 Limitations of the Study

The study is restricted to the descriptive analysis of a single stock, MRF.

7.2.6 Future Scope of the Study

In the future, the study can be done on different stocks at the same time.

Figure 7.1 Python libraries for fetching the data sets into a Python environment and performing
descriptive statistics.
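A consolidated sketch of the computations in Sections 7.3 to 7.8 follows, assuming the daily opening price sits in a column named "Open" of a locally saved CSV file; the file name is an illustrative assumption.

```python
# Descriptive statistics for the MRF daily opening price (a sketch).
import pandas as pd

df = pd.read_csv("MRF.NS.csv")
open_price = df["Open"]

print(open_price.isnull().sum())                # check for null values
print(open_price.mean())                        # mean (Section 7.3)
print(open_price.median())                      # median (Section 7.4)
print(open_price.mode()[0])                     # mode (Section 7.5)
print(open_price.max() - open_price.min())      # range (Section 7.6)
print(open_price.var())                         # variance (Section 7.7)
print(open_price.std())                         # standard deviation (7.8)
```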

Figure 7.2 Results of checking for null values.

7.3 Performing Descriptive Statistics in Python for Mean

The mean is the sum of all values divided by the number of observations (days) (Refer Figure 7.3). Together with the standard deviation and variance, the average value of the stock is useful for understanding its risk.

Figure 7.3 Performing descriptive statistics in Python for mean.

7.4 Performing Descriptive Statistics in Python for Median

The median represents the middle value of a variable (Refer Figure 7.4). Here the median is higher than the mean, which means the stock value in the middle of the data set is higher than the average.

Figure 7.4 Performing descriptive statistics in Python for median.

7.5 Performing Descriptive Statistics in Python for Mode

The mode is the most frequently occurring value in the data set (Refer Figure 7.5). The mode is 108500, which is higher than the mean of 100703 and the median of 100773, indicating that the stock's value is most often above its average.

Figure 7.5 Performing descriptive statistics in Python for mode.

7.6 Performing Descriptive Statistics in Python for Range

The range is the difference between the lowest and highest values; here the minimum is 81900 and the maximum is 131600 (Refer Figure 7.6). The minimum is below the most frequent value (mode) of 108500, the mean of 100703, and the median of 100773, indicating that the stock trades above the minimum most of the time. The maximum of 131600 is above the mean, median, and mode, indicating a good return. The range is 49700, and the difference between the mean and the maximum is 30897, which is smaller than the range. Hence, we can interpret that values are above 100773, or close to it, in most cases.

Figure 7.6 Performing descriptive statistics in Python for range.

7.7 Performing Descriptive Statistics in Python for Variance

Variance is calculated as the average of the squared differences from the mean (Refer Figure 7.7).

Figure 7.7 Performing descriptive statistics in Python for variance.

7.8 Performing Descriptive Statistics in Python for Standard Deviation

The standard deviation is the square root of the variance (Refer Figure 7.8). It measures deviation from the mean value and is a measure of consistency: the smaller the standard deviation, the more consistent the data. The standard deviation of 11169 is small relative to the range of 49700, so we can interpret the data, and hence the stock, as fairly consistent in nature.

Figure 7.8 Performing descriptive statistics in Python for standard deviation.



7.9 Performing Descriptive Statistics in Python for Quantile

Quantiles divide the data into quarters: the first quartile is the 25th percentile, the second quartile the 50th percentile, and the third quartile the 75th percentile (Refer Figure 7.9). Q1 of 89967 says that 25 percent of the data is less than or equal to 89967, which is below the mean of 100703. The second quartile is 100773, just above the mean, and the third quartile is 108879, above the mean and near the mode of 108500, indicating little deviation from the average and mode and hence less risk.

Figure 7.9 Performing descriptive statistics in Python for different quantiles with IQR.

7.10 Performing Descriptive Statistics in Python for Skewness

Skewness measures the symmetry or asymmetry of a distribution relative to the normal distribution (Refer Figure 7.10). Asymmetry means the curve is skewed towards the right (positive skewness) or the left (negative skewness); a value of zero indicates a symmetric, normally distributed data set, and values between −1 and 1 are generally treated as close to symmetric. With a skewness of 0.24 (slightly positive), the strategy for investors is to hold the stock, as it may give small losses in the short term but a good return in the long term (Müller & Guido, 2016).

Figure 7.10 Performing descriptive statistics in Python for skewness.

7.11 Performing Descriptive Statistics in Python for Kurtosis

Kurtosis determines whether a data distribution has heavy or light tails in comparison to a normal distribution (Refer Figure 7.11). The kurtosis of −0.50 means that the distribution has thinner tails than the normal distribution. Such distributions generally do not produce extreme values, which suits investors who do not want to take on much risk.

Figure 7.11 Performing descriptive statistics in Python for kurtosis.
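The quantile, skewness, and kurtosis figures of Sections 7.9 to 7.11 can be computed along these lines, continuing the open_price series from the earlier sketch:

```python
# Quantiles with IQR, skewness, and kurtosis for the opening price.
q1, q2, q3 = open_price.quantile([0.25, 0.50, 0.75])
print(q1, q2, q3, q3 - q1)      # quartiles and interquartile range (Fig. 7.9)
print(open_price.skew())        # skewness (Section 7.10)
print(open_price.kurt())        # excess kurtosis (Section 7.11)
```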

7.12 Conclusion

Descriptive statistics is a good tool for analyzing a stock and developing a sound investment strategy.

References
Das, A. (2018). Descriptive Statistics with Python. Packt Publishing Ltd.
Das Gupta, A., & Ghosh, S. (2019). An empirical study on descriptive
data analysis using Python. International Journal of Engineering and
Advanced Technology, 8(6), 988–992.
DePoy, E., & Gitlin, L. N. (2015). Introduction to Research: Understanding
and Applying Multiple Strategies. Elsevier Health Sciences.
Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras,
and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent
Systems. O’Reilly Media, Inc.

McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media.
McKinney, W., & others. (2010). Data Structures for Statistical Computing
in Python. Proceedings of the 9th Python in Science Conference, 445,
51–56.
Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with
Python: A Guide for Data Scientists. O’Reilly Media, Inc.
Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. Journal
of Machine Learning Research, 12, 2825–2830.
Saxena, M., & Gupta, A. (2019). Descriptive statistical analysis of COVID-
19 data using python. SSRN Electronic Journal.
Shaikh, F. B., & Prakash, S. R. (2020). A comprehensive review of descriptive
data analysis techniques using Python libraries. Journal of Open Source
Software, 5(50), 2284.
VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for
Working with Data. O’Reilly Media.
Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data. O’Reilly Media.
8
Stock Investment Strategy Using a Regression Model

8.1 Introduction to a Multiple Regression Model

The multiple regression model is a statistical tool for predicting a dependent variable from multiple independent variables. With this method, we identify the independent variables that have a major impact on the dependent variable. In short, we build a model that studies the impact of the independent variables on the dependent variable based on the relationship that exists between them, using all available information about the independent variables to build a powerful predictive analytics model. The multiple regression model is represented by the equation below:

Y = a + b1X1 + b2X2

where
Y is the dependent variable.
a is the Y-intercept.
b1 is the change in the value of Y for each one-unit change in X1, holding X2 constant.
b2 is the change in the value of Y for each one-unit change in X2, holding X1 constant.
X1 and X2 are the independent variables.

The multiple regression model is developed, implemented, and evaluated in five steps:

1. Introduction to a multiple regression model
2. Fetching the data into a Python environment and defining the dependent and independent variables
3. Correlation matrix for selecting variables for the regression model
4. Result analysis for the multiple regression model
5. Conclusion

For stock market prediction, the multiple regression model is considered to be a highly reliable model (Anderson, 1962). The multiple regression model assumes a linear relationship between the dependent variable and the independent variables (Huber, 1964). By applying the multiple regression model, we can understand the relationships between the factors that influence the stock price (Fama, 1965; Theil, 1950). Portfolio returns are predicted by applying multiple regression analysis and identifying the factors that influence them (Markowitz, 1952; Sharpe, 1964; Mandelbrot, 1963). Before applying the multiple regression model (Huber, 1964), we need to test for normality, as stock market data is often not normally distributed (Campbell and Lo, 1997). The basic multiple regression model does not address the problem of multicollinearity (Brown & Forsythe, 1974), and multicollinearity can lead to unstable regression coefficients (Hampel, 1974). The regression model was further developed by Rousseeuw (1984), McCullagh and Nelder (1989), and Breusch and Pagan (1979).

8.2 Applied Research Methodology


8.2.1 Data Source

Data is taken from Yahoo Finance, which is a reliable source.

8.2.2 Sample Size

The daily price of the MRF stock is considered for the study from 2
January 2023 to 27 February 2024 (daily stock price).

8.2.3 Software Used for Data Analysis

Python Programming libraries used for analysis are statsmodels.api,


Pandas, NumPy, and SciPy.

8.2.4 Model Applied

The multiple regression model algorithm was applied for analyzing the data and creating the model.

8.3 Fetching the Data into a Python Environment and


Defining the Dependent and Independent Variables

Raw data filtering is a procedure applied in feature engineering (Refer Figure 8.1). Feature engineering is the process of converting raw data into features that can be utilized in an ML model. The data was fetched into a Python Anaconda environment using the Jupyter Notebook, as the format of the data file was not directly readable in Python. A data frame is created by fetching the comma-separated values (CSV) file, making the data readable in Python and available for further processing in the form of a data frame. The data frame needs to be structured as per the requirements of the model. The first step in creating a data frame is to structure the data so that the program can read and work on it. Once the data frame is created, it is ready to be used by the algorithm. The syntax used for creating a data frame in Python Programming is presented in Figure 8.1.

Figure 8.1 Creating a data frame.
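A minimal sketch of this step is shown below, assuming the stock data has been exported to a CSV file (the file name MRF.csv is hypothetical):

```python
import pandas as pd

# Fetch the CSV file into a pandas data frame.
df = pd.read_csv("MRF.csv")

print(df.head())   # first rows: Open, High, Low, Close, Volume
print(df.dtypes)   # confirm the price columns are numeric
```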



8.4 Correlation Matrix

To create a regression model, we first need to select variables based on their correlation. The selection of variables for a regression model depends upon the relationships between the different variables. To understand the underlying relationships, we create a correlation matrix (Refer Figure 8.2). A correlation matrix is an important tool for understanding the underlying relationships between variables. It shows the correlation between pairs of variables in the range from −1 to 1. A value of −1 indicates a high degree of negative correlation: the two variables move in opposite directions, one increasing as the other decreases, and vice versa. A positive correlation of 1 is detected when both variables move in the same direction; both increase or decrease together. If the correlation value is zero, then there exists no correlation between the two variables.

Figure 8.2 Correlation matrix for selecting variables for the regression model.

We created a correlation matrix for selecting variables to include in the regression model.
The study has one dependent variable, the closing price. The rule of thumb is that if the degree of correlation with the dependent variable is high, the variable can be included in the regression model. The independent variables, Open, High, and Low, are continuous in nature. In the analysis of the correlation matrix, these variables show a high degree of positive correlation of more than 0.98, which indicates that they are fit for inclusion in the regression model (Theil, 1950; White, 1980).
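A short sketch of this selection step, assuming the data frame df created earlier with Open, High, Low, and Close columns:

```python
# Correlation matrix between the candidate variables.
corr = df[["Open", "High", "Low", "Close"]].corr()
print(corr)

# Flag independent variables whose correlation with Close exceeds 0.98,
# the threshold reported in this study.
print(corr["Close"].drop("Close") > 0.98)
```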

8.5 Result Analysis for the Multiple Regression Model

Analysis using p-values: the significance of each variable is judged from the p-values of the output results (Refer Figure 8.3). If the p-value is less than 0.05, the variable is considered significant; if the p-value is greater than 0.05, the variable is considered not significant in the regression model. In the present regression model, the independent variables Open, High, and Low are highly significant.

8.5.1 R-Square

R-square measures the proportion of variance in the dependent variable that is explained by the independent variables in a predictive multiple regression model (Refer Figure 8.3). If the R-square is 0.50, it means that 50 percent of the variance in the dependent variable is explained by the independent variables. The result analysis shows an R-square value of 0.99, which means that 99 percent of the variance in the dependent variable is explained by the independent variables, implying a high accuracy of the regression model.

Analysis using Durbin-Watson measures the autocorrelation of the regression residuals (Refer Figure 8.3). A Durbin-Watson value of 2.0 indicates no autocorrelation; values below 2.0 indicate positive autocorrelation and values above 2.0 indicate negative autocorrelation. The regression results show a Durbin-Watson value of 1.80, which indicates mild positive autocorrelation.

Figure 8.3 Result analysis for a multiple regression model.

8.6 Conclusion

The three independent variables, Open, High, and Low, are continuous in nature and are highly significant, with p-values of less than 0.05, in explaining the dependent variable Close. The regression results show a Durbin-Watson value of 1.80, which indicates mild positive autocorrelation. The result analysis shows an R-square value of 0.99, which means that 99 percent of the variance in the dependent variable is explained by the independent variables, implying a high accuracy of the regression model.

References
Anderson, T. W. (1962). An Introduction to Multivariate Statistical Analysis.
Wiley.
Breusch, T. S., & Pagan, A. R. (1979). A simple test for heteroscedasticity and
random coefficient variation. Econometrica, 47(5), 1287–1294.
Brown, M. B., & Forsythe, A. B. (1974). Robust tests for equality of variances.
Journal of the American Statistical Association, 69(346), 364–367.
Campbell, J. Y., & Lo, A. W. (1997). The Econometrics of Financial Markets.
Princeton University Press.

Fama, E. F. (1965). The behavior of stock prices. Journal of Business, 38(1), 34–105.
Hampel, F. R. (1974). The influence curve and its role in robust estimation.
Journal of the American Statistical Association, 69(346), 383–393.
Huber, P. J. (1964). Robust estimation of a location parameter. Annals of
Mathematical Statistics, 35(1), 73–101.
Mandelbrot, B. (1963). The variation of certain speculative prices. Journal of
Business, 36(4), 392–417.
Markowitz, H. (1952). Portfolio selection. Journal of Finance, 7(1), 77–91.
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models. Chapman
and Hall.
Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the
American Statistical Association, 79(388), 871–880.
Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium
under conditions of risk. Journal of Finance, 19(3), 425–442.
Theil, H. (1950). A rank-invariant method of linear and polynomial regres-
sion analysis. Proceedings of the Koninklijke Nederlandse Akademie
van Wetenschappen, 53, 386–392.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
9
Comparing Stock Risk Using F-Test

9.1 Introduction

An F-test is a method of inferential statistics that determines the statistical difference between the variances of two variables. The F-test is applied when we want to compare and check whether the variances of two samples are equal or not. We apply the F-test on data that is normally distributed and on samples that are independent. The F-test is based on the F-distribution, and the rejection zone is decided by analyzing and comparing critical values. Here, the F-test is performed to compare the risk of two stocks.

The degrees of freedom represent the number of independent observations used in calculating each of the two chi-square variables whose ratio forms the F-statistic. As the degrees of freedom increase, the F-distribution becomes more symmetrical and approaches the bell-shaped normal distribution.

9.1.1 Review of Literature

The F-test is applied for comparing the variances of two variables (Montgomery et al., 2017; Allen & Powell, 2012). The SciPy library is a tool for performing the F-test (Virtanen et al., 2020; Montgomery et al., 2017; Pedregosa et al., 2011; Seabold & Perktold, 2010). The F-test is applied in the financial analysis of stock markets (Zou et al., 2020; Virtanen et al., 2020). The F-test is also used for feature engineering on high-dimensional data and is integrated with machine learning models in predictive analytics workflows (Pedregosa et al., 2011). Johnson and Wichern (2007) conducted conceptual research on its applications and constraints. The statsmodels library provides workflow documentation for applying the F-test (Seabold & Perktold, 2010; Gelman et al., 2013). The F-test should be applied only after checking the normality of the data.

9.2 Research Methodology


9.2.1 Data Source

The Yahoo Finance financial database is used to perform the F-test.

9.2.2 Period of Study

The study period was from 1 January 2023 to 12 January 2023 (Refer Figure 9.2). The data consists of the daily closing stock prices of two companies, giving a sample size of 12 days.

9.2.3 Software Used for Data Analysis

Python Programming, Anaconda (Refer Figure 9.1)

9.2.4 Model Applied

For this study, we applied the F-test.

9.2.5 Limitations of the Study

The study is restricted to the F-test only.

9.2.6 Future Scope of the Study

In the future, the study can be conducted for different stocks at the
same time.

Hypothesis

Null Hypothesis—The variance of group 1 stock is equal to the


variance of group 2 stock (same risk).
Alternative Hypothesis—The variance of group 1 stock is not
equal to the variance of group 2 stock.

Figure 9.1 Python libraries for performing an F-test.



Figure 9.2 Fetching the data sets into a Python environment for performing an F-test.

Figure 9.3 Performing an F-test in Python using scipy.stats library.
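SciPy does not ship a one-call two-sample variance F-test, so a common approach, sketched below with hypothetical price arrays, is to form the variance ratio and read the p-value from scipy.stats.f:

```python
import numpy as np
from scipy import stats

# Hypothetical daily closing prices of the two stocks.
stock1 = np.array([101.0, 102.5, 100.8, 103.1, 102.0, 104.2])
stock2 = np.array([98.0, 99.5, 101.2, 97.8, 100.4, 99.1])

# F-statistic: ratio of the two sample variances.
f_stat = np.var(stock1, ddof=1) / np.var(stock2, ddof=1)
df1, df2 = len(stock1) - 1, len(stock2) - 1

# Two-tailed p-value from the F-distribution.
p_value = 2 * min(stats.f.cdf(f_stat, df1, df2),
                  stats.f.sf(f_stat, df1, df2))
print(f_stat, p_value)  # reject H0 (equal variances) if p_value < 0.05
```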

Conclusion

The p-value of the test is 0.13, which is more than the alpha value of 0.05 (Refer Figure 9.3). Hence, we cannot reject the null hypothesis of the test. Based on the above analysis, we conclude that the variances of the returns of the two stocks are not significantly different, thus rejecting the alternative hypothesis.

References
Allen, F., & Powell, M. (2012). Market Liquidity: A Primer. Oxford
University Press.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin,
D. B. (2013). Bayesian Data Analysis (Vol. 2). CRC Press.
Johnson, R. A., & Wichern, D. W. (2007). Applied Multivariate Statistical
Analysis. Prentice Hall.
Montgomery, D. C., Jennings, C. L., & Kulahci, M. (2017). Introduction to
Time Series Analysis and Forecasting. John Wiley & Sons.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel,
O., . . . & Vanderplas, J. (2011). Scikit-learn: Machine learning in
Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statisti-
cal modeling with Python. Proceedings of the 9th Python in Science
Conference (pp. 92–96).
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T.,
Cournapeau, D., . . . & van der Walt, S. J. (2020). SciPy 1.0: Fundamental
algorithms for scientific computing in Python. Nature Methods, 17(3),
261–272.
Zou, H., Hastie, T., & Tibshirani, R. (2020). Sparse principal component
analysis. Journal of Computational and Graphical Statistics, 27(2),
316–324.
10
Stock Risk Analysis Using t-Test

10.1 Introduction

A t-test is a method of inferential statistics that determines the statistical difference between the means of two variables (Lumley et al., 2002). The t-test, developed by Gosset (1908), is an important statistical tool for testing whether the means of two variables are significantly different. The t-test is applied in healthcare, the social sciences, and other fields (Field, 2018; Howell, 2013; Tabachnick & Fidell, 2019). The t-test is applied in different forms, such as one-tailed, two-tailed, and paired tests, for different experiments in research (Cohen, 1988; Hinkle et al., 2003; Mendenhall & Sincich, 2016). The validity of the t-test depends upon the assumptions that the data is normally distributed and independent (Urdan, 2016). When these assumptions are violated, we apply the Mann-Whitney U-test instead (McDonald, 2014). Despite its limitations, the t-test is still widely applied because of its simplicity (Sheskin, 2003; Tabachnick & Fidell, 2019). Studies of the t-test continue to refine its application and improve its accuracy (Neter et al., 1996; Rosenthal & Rosnow, 2008). It is considered to be the basic test for hypothesis testing (Trochim et al., 2016; Zar, 2010; Mendenhall & Sincich, 2016).

10.2 Research Methodology


10.2.1 Data Source

The Yahoo Finance financial database is used to perform a t-test.

10.2.2 Period of Study

The study period was from 1 January 2023 to 12 January 2023 (Refer Figure 10.2). The data consists of the daily closing stock prices of two companies, giving a sample size of 12 days.

10.2.3 Software Used for Data Analysis

Python Programming, Anaconda (Refer Figure 10.1)

10.2.4 Model Applied

For this study, we applied the t-test.

10.2.5 Limitations of the Study

The study is restricted to t-tests only.

10.2.6 Future Scope of the Study

In the future, the study can be conducted for different stocks at the
same time.

Hypothesis

Null Hypothesis—The mean of group 1 stock is equal to the mean of group 2 stock (same mean return).
Alternative Hypothesis—The mean of group 1 stock is not equal to the mean of group 2 stock.
The p-value of the test is 0.23 (Refer Figure 10.3), which is more than the alpha value of 0.05. Hence, we cannot reject the null hypothesis of the test. From the above analysis, we conclude that the mean returns of the two stocks are not significantly different, thus rejecting the alternative hypothesis (Refer Figure 10.4).

Figure 10.1 Python libraries for performing a t-test.



Figure 10.2 Fetching the data sets into a Python environment for performing a t-test.

Figure 10.3 Calculating the variance.

Figure 10.4 Performing a t-test in Python using scipy.stats library.
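A minimal sketch of the test, assuming two arrays of daily closing prices (hypothetical values, not the study's data):

```python
import numpy as np
from scipy import stats

# Hypothetical daily closing prices of the two stocks.
stock1 = np.array([101.0, 102.5, 100.8, 103.1, 102.0, 104.2])
stock2 = np.array([98.0, 99.5, 101.2, 97.8, 100.4, 99.1])

# Independent two-sample t-test on the means.
t_stat, p_value = stats.ttest_ind(stock1, stock2)
print(t_stat, p_value)  # cannot reject H0 (equal means) if p_value >= 0.05
```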



10.3 Conclusion

The t-test was performed on a small sample, and the average returns of the two stock variables were compared by applying Python libraries such as scipy.stats. The study period was from 1 January 2023 to 12 January 2023, using the daily closing stock prices of two companies. The p-value of the t-test is 0.23, which is more than the alpha value of 0.05. Hence, we cannot reject the null hypothesis of the test. From the above analysis, we conclude that the mean returns of the two stocks are not significantly different, which ultimately results in the rejection of the alternative hypothesis.

References
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Erlbaum.
Field, A. (2018). Discovering Statistics using IBM SPSS Statistics (5th ed.).
Sage.
Gosset, W. S. (1908). The probable error of a mean. Biometrika, 6(1), 1–25.
Hinkle, D. E., Wiersma, W., & Jurs, S. G. (2003). Applied Statistics for the
Behavioral Sciences (5th ed.). Houghton Mifflin.
Howell, D. C. (2013). Statistical Methods for Psychology (8th ed.). Cengage
Learning.
Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of the
normality assumption in large public health data sets. Annual Review of
Public Health, 23, 151–169.
McDonald, J. H. (2014). Handbook of Biological Statistics (3rd ed.). Sparky
House Publishing.
Mendenhall, W., & Sincich, T. (2016). Statistics for Engineering and the
Sciences (6th ed.). Pearson.
Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996).
Applied Linear Statistical Models (4th ed.). McGraw-Hill.
Rosenthal, R., & Rosnow, R. L. (2008). Essentials of Behavioral Research:
Methods and Data Analysis (3rd ed.). McGraw-Hill.
Sheskin, D. J. (2003). Handbook of Parametric and Nonparametric Statistical
Procedures (3rd ed.). CRC Press.
Tabachnick, B. G., & Fidell, L. S. (2019). Using Multivariate Statistics (7th
ed.). Pearson.
Trochim, W. M. K., Donnelly, J. P., & Arora, K. (2016). Research Methods:
The Essential Knowledge Base (2nd ed.). Cengage Learning.
Urdan, T. C. (2016). Statistics in Plain English (4th ed.). Routledge.
Zar, J. H. (2010). Biostatistical Analysis (5th ed.). Pearson Prentice Hall.
11
Stock Investment Strategy Using a Z-Score

11.1 Introduction to Z-Score

The Z-score tells us how many standard deviations the value of a variable lies from the average or mean; it is measured in units of standard deviation from the mean. A Z-score of 2.0 indicates that the value is two standard deviations above the mean. A positive Z-score indicates that the value is above the mean and a negative Z-score indicates that the value is below the mean. If the Z-score is zero, the value is equal to the mean. The Z-score is represented by the formula below:

Z-score = (Value (X) − Mean (µ)) / Standard Deviation (σ)

where
X is the variable.
µ is the mean, given by the sum of the values of the variable divided by the number of items.
σ is the standard deviation of variable X.

The Z-score model is developed and implemented with


its evaluation in five steps:

1. Introduction to the Z-score


2. Fetching the data into a Python environment
3. Calculating the Z-score for the stock
4. Result analysis by evaluating the Z-score
5. Conclusion
Anderson (1962) noted that before applying the Z-test we need to test the normality of the data. Normality is often a violated assumption as far as stock market data is concerned (Mandelbrot, 1963; Fama, 1965; Campbell et al., 1997; Hampel, 1974; Huber, 1964; Jensen, 1969; Greene, 2003). The significance of returns is tested by applying the Z-test in stock market analysis, and the Z-test is applied to optimize portfolio returns (Markowitz, 1952; Rousseeuw, 1984; Theil, 1950; White, 1980). Because the Z-test does not accurately capture the volatility of the market, non-parametric tests can be used as an alternative method. The Z-test's assumption of equal variance (Brown & Forsythe, 1974) cannot be held in the case of stock market data; hence, it is important to check homoscedasticity before applying the Z-test (Anderson, 1962; Brown & Forsythe, 1974).

11.2 Applied Research Methodology


11.2.1 Data Source

Data is taken from Yahoo Finance, which is a reliable source.

11.2.2 Sample Size

The daily price of the MRF stock is considered for the study from 2 Jan-
uary 2023 to 27 February 2024 (daily stock price) (Refer Figure 11.1).

11.2.3 Software Used for Data Analysis

Python Programming libraries used for analysis are Pandas, NumPy,


and SciPy.

11.2.4 Model Applied

The Z-score model is applied for analysis.

11.3 Fetching the Data into a Python Environment and


Defining the Dependent and Independent Variables

Raw data filtering is a procedure applied in feature engineering (Refer


Figure 11.1). Feature engineering is a process of converting raw data
into features that can be utilized in an ML model. The data was fetched
in the Python Anaconda environment using the Jupyter Notebook, as

the format of the data file was not readable in Python. A data frame is
created by fetching the comma-separated values (CSV) file, making it
readable in Python and utilizing it for further processing in the form
of a data frame. The data frame needs to be structured as per the requirements of the model. The first step in creating a data frame is to structure the
data so that the program can read and work on the data. Once the data
frame is created, it is ready to be used by the algorithm. The syntax
used for creating a data frame in Python Programming is presented
in Figure 11.1.

Figure 11.1 Creating a data frame.

11.4 Calculating the Z-Score for the Stock

The Z-score analysis for the stock is done by calculating the Z-score for the opening price, closing price, day high, day low, and stock volume (Refer Figure 11.2). Calculating the Z-score and determining risk is done by comparing the values of the stock on parameters such as the opening price, closing price, day high, day low, and stock volume. A positive Z-score means the value is above the mean, and a negative Z-score means it is below the mean; a Z-score of zero indicates the value is equal to the mean. If the value is above zero, it signals a good investment opportunity, and if the Z-score is below zero, it means the value is going down.

Figure 11.2 Calculating the Z-score for the stock.
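A short sketch of the computation, assuming the data frame df with Open, High, Low, Close, and Volume columns:

```python
# Z-score: (value - column mean) / column standard deviation.
cols = ["Open", "High", "Low", "Close", "Volume"]
zscores = (df[cols] - df[cols].mean()) / df[cols].std()

# Positive values lie above the column mean, negative values below it.
print(zscores.describe())  # mean, std, min, and max of each Z-score column
```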



11.5 Results Z-Score Analysis

The mean Z-score of the opening price is 7.86 (Refer Figure 11.3). We can interpret this as a good time to invest, since the opening price has given a good return and its mean Z-score is more than seven standard deviations above zero. The variables High, Low, and Close have given lower returns, each with a standard deviation of one. The minimum Z-scores range from −1.63 to −1.61, which indicates low risk and high return at the opening price, as its mean Z-score is high. The maximum Z-score is 2.08 for the opening price and 2.02 for the other variables, which shows high return and high risk in the opening price.

Figure 11.3 Z-scores with descriptive statistics for risk analysis.

11.6 Conclusion

The four continuous variables are Open, Close, High, and Low, of which most have poor Z-scores. Only the opening stock price is exceptional, with a higher return and high risk: the mean of its Z-score is 7.86 and its standard deviation is low, which indicates that the opening price offers the best investment opportunity.

References
Anderson, T. W. (1962). An Introduction to Multivariate Statistical Analysis. Wiley.
Brown, M. B., & Forsythe, A. B. (1974). Robust tests for equality of variances.
Journal of the American Statistical Association, 69(346), 364–367.

Campbell, J. Y., Lo, A. W., & MacKinlay, A. C. (1997). The Econometrics of


Financial Markets. Princeton University Press.
Fama, E. F. (1965). The behavior of stock prices. Journal of Business, 38(1),
34–105.
Greene, W. H. (2003). Econometric Analysis. Prentice Hall.
Hampel, F. R. (1974). The influence curve and its role in robust estimation.
Journal of the American Statistical Association, 69(346), 383–393.
Huber, P. J. (1964). Robust estimation of a location parameter. Annals of
Mathematical Statistics, 35(1), 73–101.
Jensen, M. C. (1969). Risk, the pricing of capital assets, and the evaluation of
investment portfolios. Journal of Business, 42(2), 167–247.
Mandelbrot, B. (1963). The variation of certain speculative prices. Journal of
Business, 36(4), 392–417.
Markowitz, H. (1952). Portfolio selection. Journal of Finance, 7(1), 77–91.
Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the
American Statistical Association, 79(388), 871–880.
Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium
under conditions of risk. Journal of Finance, 19(3), 425–442.
Theil, H. (1950). A rank-invariant method of linear and polynomial regres-
sion analysis. Proceedings of the Koninklijke Nederlandse Akademie
van Wetenschappen, 53, 386–392.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator
and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
12
Applying a Support Vector Machine Model Using Python Programming

12.1 Introduction

Supervised machine learning models include the support vector machine (SVM), one of the most reliable models in predictive analytics. The model was developed in the 1960s, and its application and accuracy were substantially improved by the 1990s. The model is known for achieving good results with precise accuracy, and it stands out among machine learning models as it has minimal classification errors (Refer Figure 12.2). The SVM is one of the most widely used machine learning classification algorithms, since it often outperforms other classification models such as logistic regression and Naive Bayes and tends to give more optimal solutions. The main utility of the SVM algorithm is to find the (N-dimensional) hyperplane that best separates the data points (refer to Figure 12.1). Many hyperplanes can be drawn to bifurcate the data into classes; the optimal hyperplane is the one with the maximum margin (refer to Figure 12.2). These decision boundaries are the hyperplanes that help classify the data points.


Figure 12.1 Different bifurcation of the data as per classes.

(Source: https://www.javatpoint.com)

Figure 12.2 Different bifurcation of the data with maximum margin hyperplane.

(Source: https://www.javatpoint.com)

12.1.1 Review of Literature

In the area of machine learning, the support vector machine is considered to be a powerful algorithm for classification and regression analysis (Burbidge et al., 2001). The support vector machine has a wide range of applications (Dhillon & Verma, 2020); hence, a literature review was conducted to gather related studies and their role in the development of support vector machines. The SVM is applied in image processing to improve the clarity and quality of images; by applying high-quality image processing, image recognition systems are created.

Deo (2015) applied the support vector machine model in health analytics to identify complex medical patterns in predictive health analytics (Cervantes et al., 2023). Jardine et al. (2006) applied a support vector machine learning model in manufacturing automation and predicted machine failure by analyzing past data, which reduces maintenance costs. Garcia-Lamont et al. (2023) applied the support vector machine learning model to structural safety in the area of infrastructure (Hinton et al., 2012; Kim, 2014; Nguyen et al., 2020; Schölkopf et al., 2001). Toledo-Pérez et al. (2019) applied the SVM algorithm and improved signal classification accuracy. Huang et al. (2005) applied SVM models in stock market prediction to help investors make investment decisions. Joachims (1998) applied SVM models to natural language processing (NLP) and improved the accuracy of a document classification system. Ding and Dubchak (2001) applied SVM models for understanding and classifying protein structures. Mountrakis et al. (2011) applied SVM models in the remote sensing field and achieved high data classification accuracy (Burges, 1998).

12.2 Research Methodology


12.2.1 Data Collection

Secondary data was collected from Yahoo Finance.

12.2.2 Sample Size

The daily stock price of the MRF stock is considered for the study from 2 January 2023 to 5 January 2024.

12.2.3 Software Used for Data Analysis

Python Programming

12.2.4 Model Applied

For this study, we applied the support vector machine algorithm.



12.2.5 Limitations of the Study

The study is limited to only predicting the stock price of MRF.

12.2.6 Future Scope of the Study

In the future, the study can be extended to compare SVM models


applied to different sectors of industry at the macro level.

12.3 Methodology

For creating a predictive model, we selected and applied the support


vector machine algorithm.

Research is carried out in five steps:

12.4 Feature Engineering and Data Processing


12.5 Training and Testing
12.6 Predicting a Support Vector Machine Model with a
Confusion Matrix
12.7 Calculating False Negative, False Positive, True Negative,
and True Positive
12.8 Results and analysis

12.4 Feature Engineering and Data Processing

The process of converting raw data into features that can be easily utilized to create a model, as required by the algorithm, is called feature engineering (Refer Figure 12.3). The creation of a data frame is the first step in creating a model. The data frame is prepared to match the structure of the model, which has different variables. Feature engineering prepares the data according to need; variables may have to be converted into a nominal scale, an ordinal scale, and so on. Feature engineering thus turns raw data into a form that the program can read and utilize in the best possible manner. The syntax used for creating a data frame in Python Programming is presented in Figure 12.3.

Figure 12.3 Creating a data frame.

12.5 Training and Testing

To conduct the study, secondary data was collected from Yahoo


Finance (Refer Figure 12.4). The dependent variable to be predicted (Y) is a binary class (Figure 12.4), and the four independent variables 'High', 'Low', 'Open', and 'Close' are continuous in nature.

VARIABLE                 CLASSES
Buy/Sell (Dependent)     Buy = 1 if Tomorrow's Price > Today's Price;
                         Sell = 0 if Tomorrow's Price < Today's Price
Open (Independent)       Continuous
Close (Independent)      Continuous
High (Independent)       Continuous
Low (Independent)        Continuous

Figure 12.4 Defining the dependent and independent variables.
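A sketch of deriving this binary target, assuming df holds the daily MRF prices with a Close column (names hypothetical):

```python
# Buy = 1 if tomorrow's close is higher than today's, else Sell = 0.
df["Target"] = (df["Close"].shift(-1) > df["Close"]).astype(int)
df = df.iloc[:-1]  # the last day has no "tomorrow", so drop it

print(df[["Close", "Target"]].head())
```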

For training and testing, the data is divided into two parts (Refer Figure 12.5): 80 percent of the data is used for training the model and 20 percent is used for testing and evaluation. After training and testing, the test results are validated by creating a confusion matrix.

Figure 12.5 The Python code for training and testing.
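A minimal sketch of the 80/20 split and SVM fit with scikit-learn, assuming df contains the features and the Target column defined above:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

X = df[["Open", "High", "Low", "Close"]]
y = df["Target"]

# 80 percent for training, 20 percent for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

model = SVC()  # default RBF kernel; a linear kernel is another option
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, accuracy
```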

12.6 Predicting a Support Vector Machine Model


with a Confusion Matrix
12.6.1 Creating a Confusion Matrix

A confusion matrix measures model performance (Refer Table 12.1) by comparing the actual values with the predicted values. It is of order N × N, where N is the number of classes of the dependent/target variable. For binary classes, it is a 2 × 2 confusion matrix.

12.7 Calculating False Negative, False Positive, True Negative,


and True Positive

The confusion matrix for our data set is as follows:



Table 12.1 The Confusion Matrix

             PREDICTED 0    PREDICTED 1
ACTUAL 0     22 (TP)        4 (FN)
ACTUAL 1     0 (FP)         24 (TN)

Figure 12.6 The Python code for confusion matrix and classification report.

12.7.1 Result Analysis

12.7.1.1 Accuracy Statistics


It measures the overall accuracy of the model by analyzing the output
predicted about incorrect predictions (Refer Figure 12.6).
To obtain the accuracy of the model, we apply the following
formula:
Accuracy = True Positive + True Negative/True Positive + True
Negative + False Positive + False Negative
22 + 4
Accuracy = = 0.93
22 + 4 + 0 + 24
The accuracy for the overaall model is 0.93

12.7.1.2 Recall
Recall is the ratio of true positive predictions to the total of true positive and false negative predictions. Higher recall implies more correct positive predictions (a small number of false negatives).

Recall = True Positive / (True Positive + False Negative)

Recall = 22 / (22 + 4) = 0.85

Recall for the overall model is 0.85.

12.7.1.3 Precision
Precision measures how correctly we have predicted the true positives. It is a qualitative analysis of the correctly predicted values.

Precision = True Positive / (True Positive + False Positive)

Precision = 22 / (22 + 0) = 1.00

The precision for the overall model is 1.00.

12.8 Conclusion

The support vector machine model predicted the MRF stock with a precision of 100 percent. The overall model accuracy is 92 percent.

References
Burbidge, R., Trotter, M., Buxton, B., & Holden, S. (2001). Drug design
by machine learning: Support vector machines for pharmaceutical data
analysis. Computers & Chemistry, 26(1), 5–14. https://doi.org/10.1016/
S0097-8485(01)00094-8
Burges, C. J. (1998). A tutorial on support vector machines for pattern recog-
nition. Data Mining and Knowledge Discovery, 2(2), 121–167. https://
doi.org/10.1023/A:1009715923555
Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., & López, A.
(2023). A comprehensive survey on support vector machine classification:
Applications, challenges and trends. Journal of Building Engineering.
https://doi.org/10.1016/j.jobe.2023.104911
Deo, R. C. (2015). Machine learning in medicine. Circulation, 132(20), 1920–
1930. https://doi.org/10.1161/CIRCULATIONAHA.115.001593

Dhillon, A., & Verma, G. K. (2020). Convolutional neural network: A review


of models, methodologies and applications to object detection. Progress
in Artificial Intelligence, 9(2), 85–112. https://doi.org/10.1007/s13748-
019-00203-0
Ding, C., & Dubchak, I. (2001). Multi-class protein fold recognition using
support vector machines and neural networks. Bioinformatics, 17(4),
349–358. https://doi.org/10.1093/bioinformatics/17.4.349
Garcia-Lamont, F., Cervantes, J., Rodríguez-Mazahua, L., & López, A.
(2023). Support vector machine in structural reliability analysis: A review.
Structural Safety. https://doi.org/10.1016/j.strusafe.2023.102211
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. r., Jaitly, N., . . . &
Sainath, T. N. (2012). Deep neural networks for acoustic modeling in
speech recognition: The shared views of four research groups. IEEE
Signal Processing Magazine, 29(6), 82–97. https://doi.org/10.1109/MSP.
2012.2205597
Huang, W., Nakamori, Y., & Wang, S. Y. (2005). Forecasting stock mar-
ket movement direction with support vector machine. Computers &
Operations Research, 32(10), 2513–2522. https://doi.org/10.1016/j.cor.
2004.03.016
Jardine, A. K. S., Lin, D., & Banjevic, D. (2006). A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mechanical Systems and Signal Processing, 20(7), 1483–1510. https://doi.org/10.1016/j.ymssp.2005.09.012
Joachims, T. (1998). Text categorization with support vector machines:
Learning with many relevant features. European Conference on Machine
Learning (pp. 137–142). https://doi.org/10.1007/BFb0026683
Kim, Y. (2014). Convolutional neural networks for sentence classification.
EMNLP 2014. https://doi.org/10.3115/v1/D14-1181
Mountrakis, G., Im, J., & Ogole, C. (2011). Support vector machines in
remote sensing: A review. ISPRS Journal of Photogrammetry and
Remote Sensing, 66(3), 247–259. https://doi.org/10.1016/j.isprsjprs.2010.
11.001
Nguyen, H. Q., Nguyen, N. D., & Nahavandi, S. (2020). A review on deep
reinforcement learning.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C.
(2001). Estimating the support of a high-dimensional distribution. Neural
Computation, 13(7), 1443–1471. https://doi.org/10.1162/089976601750
264965
Toledo-Pérez, D. C., Rodríguez-Reséndiz, J., Gómez-Loenzo, R. A., &
Jauregui-Correa, J. C. (2019). Support vector machine-based EMG sig-
nal classification techniques: A review. Applied Sciences, 9(20), 4402.
https://doi.org/10.3390/app9204402
13
Data Visualization for Stock Risk Comparison and Analysis

13.1 Introduction to Data Visualization

Today, due to the development of information technology and e-commerce around the world, data is being generated on an hourly basis. Financial data, such as stock purchases and market movements, can be understood through data visualization. Financial data contains trends and patterns that are very difficult to discern while the data remains in raw format. Hence, to overcome this problem, an attempt has been made by means of this study, titled Data Visualization for Stock Risk Comparison and Analysis.

It is easy to analyze, observe, understand, and interpret data through visualization in Python Programming. Large amounts of financial data can be analyzed, and investment strategies can be designed, based on data visualization. The most commonly used Python libraries are as follows:
1. Matplotlib
2. Seaborn
3. Bokeh
4. Plotly
Here we apply different Python libraries to create a scatter plot, line chart, bar chart, histogram, and an interactive Bokeh chart.

13.1.1 Review of Past Studies

Visualizing data in Python is easy and simple, which is a major reason why it has gained significant attention and popularity (Smith, 2018; Hunter, 2007; Wang & Liu, 2020; Waskom et al., 2020; Wickham & Grolemund, 2017). Python libraries like Matplotlib have been used widely due to their many applications (Waskom et al., 2020), and the existence of Plotly and Seaborn has extended their use to various areas. Jones et al. (2019) used Python for data analysis in emerging areas of biology. Wang and Liu (2020) utilized data visualization for financial data analytics and stock market analysis. McKinney (2017) applied a hybrid approach in Python using Pandas (VanderPlas, 2016; Virtanen et al., 2020). Wickham and Grolemund (2017) compared the use of data visualization in R programming and Python. Python's extensive documentation and its interactive, user-friendly, and flexible nature have made data visualization with Python an important tool for researchers and academicians (Hunter, 2007; Jones et al., 2019; McKinney, 2017; Smith, 2018).

13.1.2 Applied Research Methodology

13.1.2.1 Data Source


Data is taken from Yahoo Finance, which is a reliable source.

13.1.2.2 Sample Size


The daily price of the MRF stock is considered for the study from 2
January 2023 to 5 January 2024 (daily stock price).

13.1.2.3 Software Used for Data Analysis


Python Programming libraries used for analysis are statsmodels.api,
Pandas, NumPy, and SciPy.

13.2 Fetching the Data into a Python Environment and Defining the


Dependent and Independent Variables
Raw data filtering is a procedure applied in feature engineering
(Refer Figure 13.1). Feature engineering is a process of convert-
ing raw data into features that can be utilized in an ML model.
The data was fetched in the Python Anaconda environment using
the Jupyter Notebook as the format of the data file was not read-
able in Python. A data frame is created by fetching the comma-
separated values (CSV) file, making it readable in Python and

utilizing it for further processing in the form of a data frame.


The data frame needs to be structured as per the requirements of the model. The first step in creating a data frame is to structure the data so that the program can read and work on it. Once the data frame is created, it is ready to be used by the algorithm. The syntax used for creating a data frame in Python Programming is presented in Figure 13.1.

Figure 13.1 Creating a data frame.

13.2.1 Data Visualization Using Scatter Plot

It is applied to find out the relationship between two variables (Refer


Figure 13.2). It is used to find out correlation and autocorrelation lag
in regression and time series panel data analysis. We have taken the
opening price and closing price of the MRF stock to find out the
relationship between the two variables. The scatter plot shows a close degree of association between them, and hence we can conclude that the opening price and closing price of the MRF stock move closely together.

Figure 13.2 Creating a scatter plot for understanding the degree of association between the closing price and the opening price.
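A minimal Matplotlib sketch of the plot, assuming the data frame df created above:

```python
import matplotlib.pyplot as plt

# Scatter plot of opening versus closing price.
plt.scatter(df["Open"], df["Close"], s=10)
plt.xlabel("Opening price")
plt.ylabel("Closing price")
plt.title("MRF stock: opening vs closing price")
plt.show()
```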

13.3 Data Visualization Using Bar Chart

A bar chart represents data with rectangular bars whose lengths and heights are proportional to the values they represent (Refer Figure 13.4a); it is created using the bar method. Here, we have applied a histogram, a close relative of the bar chart, to analyze the daily movement of the opening price of the stock (Refer Figure 13.3).
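A short sketch of the histogram, again assuming df:

```python
import matplotlib.pyplot as plt

# Histogram of the daily opening prices.
plt.hist(df["Open"], bins=30)
plt.xlabel("Opening price")
plt.ylabel("Frequency")
plt.title("Distribution of MRF opening prices")
plt.show()
```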

Figure 13.3 Creating Scatter Plot for understanding degree of association between closing price
and opening price.

Figure 13.4a Creating a histogram for understanding the movement of the opening price.

13.4 Data Visualization Using Line Chart

A line chart connects data points with straight lines and is applied to observe how two variables move together over time (Refer Figure 13.4b). It is also used to inspect correlation and autocorrelation lag in regression and time series panel data analysis. We have plotted the opening price and closing price of the MRF stock to examine the relationship between the two variables. The line chart shows a close degree of association between them, and hence we can conclude that the opening price and closing price of the MRF stock move closely together.

Figure 13.4b Creating a line plot for understanding the degree of association between the
closing price and the opening price.
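A minimal sketch of the line chart, assuming df is ordered by trading day:

```python
import matplotlib.pyplot as plt

# Line chart of opening and closing prices over time.
plt.plot(df.index, df["Open"], label="Open")
plt.plot(df.index, df["Close"], label="Close")
plt.xlabel("Trading day")
plt.ylabel("Price")
plt.legend()
plt.show()
```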

13.5 Data Visualization Using Bokeh

Bokeh generates interactive charts by producing HTML and JavaScript rendered in web browsers (Refer Figure 13.5), and it offers a very high level of interactivity. It is applied to find out the relationship between two variables and to inspect correlation and autocorrelation lag in regression and time series panel data analysis. We have again taken the opening price and closing price of the MRF stock to examine the relationship between the two variables; the scatter graph shows a close degree of association between them, and hence we can conclude that the opening price and closing price of the MRF stock move closely together.

Figure 13.5 Creating a scatter graph for understanding the degree of association between the closing price and the opening price.
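A minimal Bokeh sketch, assuming df as above; the chart is written to an HTML file (the file name is hypothetical) and opened in the browser:

```python
from bokeh.plotting import figure, output_file, show

output_file("mrf_scatter.html")  # interactive chart rendered as HTML/JavaScript

p = figure(title="MRF stock: opening vs closing price",
           x_axis_label="Opening price", y_axis_label="Closing price")
p.scatter(df["Open"], df["Close"], size=5)

show(p)
```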

References
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in
Science & Engineering, 9(3), 90–95.
Jones, E., Oliphant, T., & Peterson, P. (2019). SciPy: Open source scientific
tools for Python.
McKinney, W. (2017). Python for Data Analysis: Data Wrangling with
Pandas, NumPy, and IPython. O’Reilly Media, Inc.
Smith, J. (2018). Python Data Visualization Cookbook. Packt Publishing Ltd.
VanderPlas, J. T. (2016). Python Data Science Handbook: Essential Tools for
Working with Data. O’Reilly Media, Inc.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T.,
Cournapeau, D., . . . & van der Walt, S. J. (2020). SciPy 1.0: Fundamental
algorithms for scientific computing in Python. Nature Methods, 17(3),
261–272.
Wang, J., & Liu, S. (2020). Python for Finance Cookbook. Packt Publishing
Ltd.
Waskom, M., Botvinnik, O., O'Kane, D., Hobson, P., Ostblom, J., Lukauskas,
S., . . . & Halchenko, Y. (2020). Mwaskom/Seaborn: v0.11.1 (December
2020). Zenodo.
Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data. O’Reilly Media, Inc.
14
Applying Natural Language Processing for Stock Investors Sentiment Analysis

14.1 Introduction

The natural language processing (NLP) technique is used to understand the sentiments related to buying a particular stock. NLP is a technique through which we can understand sentiments by analyzing the text or comments posted on social media platforms like Twitter, Facebook, etc. NLP classifies text into positive and negative sentiments, which can measure the sentiments of stock investors. Hence, to demonstrate the importance of NLP, we conducted the study titled Applying Natural Language Processing for Stock Investors Sentiment Analysis (Al-Rfou & Perozzi, 2019; Bengfort et al., 2018; Bird et al., 2009).
Bird et al. (2009) presented NLP in a simple and lucid manner, and their work is a rich resource for researchers. Jurafsky and Martin (2019) developed the conceptual literature on NLP. Manning and Schütze (1999) developed statistical NLP methods in a landmark work in the area. Raschka and Mirjalili (2019) give details on implementing NLP with machine learning models. Perkins (2016) applied the NLTK library for text mining, and Loper and Bird (2002) developed the NLTK toolkit itself. Chollet (2018) applied deep learning models with AI. Al-Rfou and Perozzi (2019) analyzed multilingual problems in NLP. Vaswani et al. (2017) developed attention-based sequence-to-sequence learning (Chollet, 2018).


14.2 Research Methodology


14.2.1 Data Source

Data was collected from a WhatsApp investors chat group.

14.2.2 Period of Study

The study period was from 1 January 2023 to 1 February 2023.

14.2.3 Software Used for Data Analysis

Python Programming, Anaconda

14.2.4 Model Applied

For this study, we applied the natural language processing technique.

14.2.5 Limitations of the Study

The study is restricted to natural language processing.

14.2.6 Future Scope of the Study

In the future, the study can be conducted on different stocks at the


same time.

14.3 Fetching the Data into a Python Environment

Raw data filtering is a procedure applied in feature engineering (Refer


Figure 14.1). Feature engineering is a process of converting raw data
into features that can be utilized in an ML model. The data was fetched
in the Python Anaconda environment using the Jupyter Notebook as
the format of the data file was not readable in Python. A data frame is
created by fetching the comma-separated values (CSV) file, making it
readable in Python and utilizing it for further processing in the form
of a data frame. The data frame needs to be structured as per the requirements of the model. The first step in creating a data frame is to structure the data so that the program can read and work on it. Once the data frame is created, it is ready to be used by the algorithm. The syntax used for creating a data frame in Python Programming is presented in Figure 14.1.

Figure 14.1 Python libraries for performing NLP and fetching data sets into a Python environment.

14.4 Sentiments Count for Understanding Investors’ Perceptions

Sentiments are nothing but the perceptions of investors regarding an investment (Refer Figure 14.2). Investors' sentiments may be positive or negative depending upon their views. We need to consider and understand investors' perceptions, as they play an important role in deciding future sales and marketing plans for a financial product. To gauge the weight of investors' sentiments, we count the number of positive and negative sentiments and analyze them (Refer Figure 14.2).

Figure 14.2 Count of sentiments.

14.5 Performing Data Cleaning in Python

The first step in natural language processing is the removal of


unnecessary words (Refer Figure 14.3). The process of cleaning
the text data is known as normalization. Normalization is the first
step in natural language processing. Natural language processing
cleans the text data by punctuation removal, stop word removal,
stemming, and lemmatization. The process of normalization of
data in the form of text starts with case normalization. Under case
normalization, the uppercase word is converted into lowercase to
standardize the overall text available for creating an NLP model.
After case normalization, we step into punctuation removal. At
this stage, we remove special characters and punctuation marks
from the text data, making it easier to analyze. After punctuation removal, we apply the stop word removal technique to drop common words that carry little meaning. Then, we apply stemming, a process through which the suffixes and prefixes of particular words are removed to normalize the given text data.

Figure 14.3 Performing data cleaning in Python.
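A sketch of this normalization pipeline with NLTK (it assumes the stopword list has been downloaded once with nltk.download("stopwords")):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower()                    # case normalization
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation/special-character removal
    tokens = [w for w in text.split() if w not in stop_words]  # stop word removal
    return " ".join(stemmer.stem(w) for w in tokens)           # stemming

print(clean_text("The stock looks GREAT, buying more tomorrow!!"))
```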

14.6 Performing Vectorization in Python

In natural language processing, we convert the textual data into numerical values that can be easily understood by a machine learning algorithm (Refer Figure 14.4). Natural language processing allows computers to process human language by capturing the meaning and context of a given sentiment, from which useful insights can be drawn. NLP combines computational linguistics with statistical machine learning algorithms, through which a model is created for analysis purposes. Vectorization is the classic approach of converting linguistic text into real numbers in the form of vectors, which are then used to support and create a machine learning model. Vectorization is the first step in feature extraction; the basic idea behind it is to derive distinct features from the linguistic text available for analysis. The last step is to create training and test sets for the model.
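A short sketch of vectorization with scikit-learn's CountVectorizer, using hypothetical cleaned messages:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned investor messages.
messages = ["stock look great buy tomorrow",
            "market fall sell everything",
            "good return buy more",
            "bad result sell now",
            "strong momentum buy today",
            "heavy loss avoid stock"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # document-term matrix (vectors)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```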

14.7 Vector Transformation to Create Training and Test Data Sets

After cleaning the data for sentiment analysis, we need to create training and test data sets (Refer Figure 14.5). To test the accuracy of the applied model, we compare its predictions with the original data (Refer Figure 14.4). For this comparison, we divide the data set into test and training portions. A label encoder is used to transform the sentiment labels into numerical values for creating a multinomial Naive Bayes model. The process of transforming the sentiments from text to numerical values is known as vector transformation.

Figure 14.4 Performing vectorization in Python.

Figure 14.5 Vector transformation to create training and test data sets.
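A minimal sketch of the split, label encoding, Naive Bayes fit, and AUC check, assuming X is the vectorized text from the previous sketch and sentiments holds hypothetical labels:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Hypothetical sentiment labels for the six messages above.
sentiments = ["positive", "negative", "positive",
              "negative", "positive", "negative"]
y = LabelEncoder().fit_transform(sentiments)  # text labels -> 0/1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

model = MultinomialNB()
model.fit(X_train, y_train)

# AUC from the predicted probabilities of the positive class.
probs = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, probs))
```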

14.8 Result Analysis Model Testing AUC

The AUC stands for the area under the ROC curve (Refer Figure 14.6). It summarizes model performance across all possible classification thresholds. The ROC curve plots the true positive rate against the false positive rate, and the AUC is the area beneath this curve. An AUC value of 1 indicates a perfect classifier, and the value of the AUC ranges from 0 to 1. In the present study, we have an AUC value of 0.20, which is considered poor.

Figure 14.6 Result analysis model testing AUC.

14.9 Conclusion

The implementation of natural language processing requires a large amount of critical analysis. After implementing the natural language processing pipeline, we fitted the Naive Bayes model, which performed poorly, with an AUC of only 0.20 (20 percent).

References
Al-Rfou, R., & Perozzi, B. (2019). Polyglot: Distributed Word Representations
for Multilingual NLP. Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics.

Bengfort, B., Bilbro, R., & Ojeda, T. (2018). Applied Text Analysis with
Python: Enabling Language-Aware Data Products with Machine
Learning. O’Reilly Media.
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with
Python. O’Reilly Media.
Chollet, F. (2018). Deep Learning with Python. Manning Publications.
Jurafsky, D., & Martin, J. H. (2019). Speech and Language Processing
(3rd ed.). Pearson.
Loper, E., & Bird, S. (2002). NLTK: The Natural Language Toolkit. CoRR.
cs.CL/0205028. 10.3115/1118108.1118117.
Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural
Language Processing. The MIT Press.
Perkins, J. (2016). Python Text Processing with NLTK 2.0 Cookbook. Packt
Publishing.
Raschka, S., & Mirjalili, V. (2019). Python Machine Learning: Machine
Learning and Deep Learning with Python, scikit-learn, and TensorFlow
(3rd ed.). Packt Publishing.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17) (pp. 6000–6010). Curran Associates Inc.
15
Stock Prediction Applying LSTM

15.1 Introduction

Stock prediction and modeling involve a huge amount of analysis using different models, including the logistic regression model, regression analysis, and support vector machines; however, all these models analyze data that is not panel data. For panel (time series) data analysis, we use models like the autoregressive integrated moving average (ARIMA) and LSTM. In recent times, the LSTM model, known in full as long short-term memory, has had a huge impact on stock prediction as applied to time series data. LSTM is a deep learning model based on the principle of neural networks. The LSTM model consists of three layers: an input layer, a hidden layer, and an output layer. Its capability of memorizing sequences of data makes the LSTM a specialized type of recurrent neural network, and hence the study titled Stock Prediction Applying LSTM was conducted in four steps: fetching the data, cleaning the data, data transformation and normalization, and model analysis.
The LSTM model builds predictive models that combine long- and short-term memory with the removal of irrelevant information; irrelevant information is deleted from the model through the forget gate. For example, we explain this with the sentence Dr Nitin Stays at Aurangabad (Refer Table 15.1).
Dr Nitin has an area of specialization in Financial Analytics. He is a
good teacher in the area of ______. From the analysis of the above
sentence, we need to fill the blank space with relevant information. The sentence includes information about Dr Nitin, and the area of specialization in which he works can be used to fill the blank. To fill the blank space, we must separate the relevant information from the irrelevant.

Table 15.1 The Architecture and Process of the Long Short-Term Model
LSTM PROCESS SENTENCE
Long-term memory Dr Nitin stays at Aurangabad. Dr Nitin has an area of specialization
in Financial Analytics. He is a good teacher in the area of______
Short-term memory Dr. Nitin has an area of specialization in Financial Analytics. He is a
good teacher in the area of______
Input information Dr Nitin stays at Aurangabad. Dr. Nitin has an area of specialization
in Financial Analytics. He is a good teacher in the area of______
Irrelevant information or Dr. Nitin stays at Aurangabad
Forget information
Relevant information Area of specialization in Financial Analytics.
Output He is a good teacher in the area of Financial Analytics.

The information produced by this analysis is separated into relevant and irrelevant parts, and the irrelevant part is removed. Here the irrelevant information concerns the residence of Dr Nitin (refer Table 15.1). The LSTM model applied here holds the overall passage as long-term memory; after the irrelevant information is removed, what remains is the short-term memory. We forget the information about Dr Nitin's residence and analyze the information about his area of specialization. Applying this past knowledge through long short-term memory, we delete the irrelevant information and fill the blank space with the correct answer: Financial Analytics. Table 15.1 explains the long short-term memory process step by step.
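For readers who want the mechanics behind this forgetting behavior, the standard LSTM cell (following Hochreiter and Schmidhuber, 1997; the forget gate is due to Gers et al., 1999) computes, at each time step t:

  f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            (forget gate)
  i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            (input gate)
  \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)     (candidate cell state)
  c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (cell-state update)
  o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            (output gate)
  h_t = o_t \odot \tanh(c_t)                        (hidden state)

The forget gate f_t is the component that discards irrelevant information, playing the role of "forgetting" Dr Nitin's residence in the example above.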

15.1.1 Review of Literature

The LSTM model was developed by Hochreiter and Schmidhuber (1997). The model is a recurrent neural network designed to overcome the vanishing gradient problem (Gers et al., 1999; Graves et al., 2013, 2014). Such models operate on sequential data, which has enabled applications such as speech recognition (Graves et al., 2013). LSTM networks contain gated units through which information moves in a regulated manner (Greff et al., 2016; Sundermeyer et al., 2012). Because of gradient problems, recurrent models remain difficult to train (Pascanu et al., 2013). The bidirectional recurrent network architecture was proposed by Schuster and Paliwal (1997), and neural Turing machines were developed by Graves et al. (2014); recurrent networks for sequence learning are critically reviewed by Lipton et al. (2015), and related representation-learning methods are explored by Chen et al. (2016).

15.2 Research Methodology


15.2.1 Data Source

Data was collected from Yahoo Finance.

15.2.2 Period of Study

The study period was from 1 January 2023 to 1 February 2023.

15.2.3 Software Used for Data Analysis

Python Programming, Anaconda

15.2.4 Model Applied

For this study, we applied the long short-term memory (LSTM) model.

15.2.5 Limitations of the Study

The study is restricted to the long short-term memory (LSTM) model.

15.2.6 Future Scope of the Study

In the future, the study can be conducted on different stocks at the same time.

15.3 Fetching the Data into a Python Environment

Raw data filtering is a procedure applied in feature engineering (refer Figure 15.1). Feature engineering is the process of converting raw data into features that can be utilized by an ML model. The data was fetched into the Python Anaconda environment using the Jupyter Notebook; because the format of the raw data file is not directly workable in Python, it must first be loaded into a data frame.

Figure 15.1 Python libraries for performing LSTM and fetching data sets into a Python environment.

The data frame is created by reading the comma-separated values (CSV) file, which makes the data readable in Python and available for further processing. The data frame is created and structured as per the requirements of the model, so that the program can read and work on the data. Once the data frame is created, it is ready to be used by the algorithm. The syntax used for creating a data frame in Python is presented in Figure 15.1.
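Because Figure 15.1 is reproduced as an image, a minimal sketch of this step is given below. The file name stock_data.csv is an assumed placeholder for a CSV file downloaded from Yahoo Finance; the columns follow the standard Yahoo Finance export.

# Minimal sketch: load a Yahoo Finance CSV export into a pandas DataFrame.
# "stock_data.csv" is an assumed placeholder file name.
import pandas as pd

df = pd.read_csv("stock_data.csv", parse_dates=["Date"], index_col="Date")
print(df.head())  # typically shows Open, High, Low, Close, Adj Close, Volume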

15.4 Performing Data Cleaning in Python

The data frame is prepared to match the requirements of the model, which works with different variables (refer Figure 15.2). Feature engineering here means preparing the data, for example by converting variables to a nominal or ordinal scale, so that it can be read and utilized by the algorithm. The syntax used for cleaning the data in Python is presented in Figure 15.2.

Figure 15.2 Performing data cleaning in Python.
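As a sketch of what this cleaning step typically looks like (continuing from the DataFrame loaded above; the choice of the Open and Close columns follows the prediction targets used later in this chapter):

# Sketch: keep the columns the model needs and drop incomplete rows.
data = df[["Open", "Close"]].dropna()
data = data.sort_index()  # ensure chronological order for time-series modeling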

15.5 Vector Transformation to Create Training and Test Data Sets

After cleaning the data for the LSTM ML model, we need to create training and test data sets (refer Figure 15.3). To assess the accuracy of the applied model, we compare its predictions with the original data, and for this comparison the data set must be divided into training and test portions. For model evaluation, we split the data by vector transformation into 80 percent training data and 20 percent test data.

Figure 15.3 Vector transformation to create training and test data sets.
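A minimal sketch of this step follows; the 10-day look-back window and the MinMax scaling are assumptions for illustration, not values taken from the figure.

# Sketch: scale the series, build sliding windows, and split 80/20 chronologically.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled = scaler.fit_transform(data[["Open", "Close"]].values)

window = 10  # assumed look-back length in trading days
X, y = [], []
for i in range(window, len(scaled)):
    X.append(scaled[i - window:i])  # the previous `window` days of [open, close]
    y.append(scaled[i])             # the next day's [open, close]
X, y = np.array(X), np.array(y)

split = int(len(X) * 0.8)  # 80 percent training, 20 percent test, no shuffling
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]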

15.6 Result Analysis for the LSTM Model

The results of the LSTM model show the different layers in the output and the number of parameters generated for each layer (refer Figure 15.4). We applied LSTM layers of 50 units each. The parameters of an LSTM layer are the weights involved in its gate calculations, and their number is given by the formula

Number of parameters = 4 × (n + m + 1) × m

where
n is the number of input dimensions fed to the layer,
m is the number of units in the LSTM layer, and
1 accounts for the bias term.

Substituting the values for a 50-unit layer whose input is the 50 outputs of the preceding layer, we get

Number of LSTM parameters = 4 × (50 + 50 + 1) × 50 = 20,200

We applied the LSTM model and generated different layers for predicting the opening price and the closing price. The LSTM model is built with 50 units or neurons.

Figure 15.4 Result analysis for the LSTM model.



The outputs of these units are used as input to the next LSTM layer. The dropout layer in between acts as the regulator of the model: it randomly drops connections during training so that noise and irrelevant information do not dominate the LSTM model. A second LSTM layer with 50 neurons or units follows, and the network ends with a final dense layer of 2 neurons or units.
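A hedged sketch of this architecture in Keras is shown below; the layer sizes follow the text, while the 10-day input window, dropout rate, and training settings are illustrative assumptions. The model.summary() output can be used to verify the 20,200-parameter count of the second LSTM layer.

# Sketch: stacked LSTM predicting [open, close]; layer sizes follow the text,
# window length, dropout rate, and training settings are assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dropout, Dense

model = Sequential([
    Input(shape=(10, 2)),              # 10-day windows of [open, close]
    LSTM(50, return_sequences=True),   # first LSTM layer, 50 units
    Dropout(0.2),                      # regulator: limits overfitting
    LSTM(50),                          # second LSTM layer: 4*(50+50+1)*50 = 20,200 parameters
    Dense(2),                          # output layer: predicted opening and closing prices
])
model.compile(optimizer="adam", loss="mean_squared_error")
model.summary()  # prints per-layer output shapes and parameter counts

# Training on the chronological split created earlier:
model.fit(X_train, y_train, epochs=50, batch_size=16,
          validation_data=(X_test, y_test))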

15.7 Conclusion

The long short-term memory model was created for predicting the opening price and the closing price. The total number of parameters generated for the LSTM layer is 20,200. By applying past knowledge and the LSTM's ability to delete irrelevant information, we created an LSTM model for predicting the stock price.

References
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P.
(2016). InfoGAN: Interpretable representation learning by information
maximizing generative adversarial nets. In Advances in Neural Information
Processing Systems (pp. 2172–2180). https://api.semanticscholar.org/
CorpusID:5002792
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999). Learning to forget: Continual
prediction with LSTM. Neural Computation, 12(10), 2451–2471.
Graves, A., Mohamed, A., & Hinton, G.E. (2013). Speech recognition with
deep recurrent neural networks. 2013 IEEE International Conference
on Acoustics, Speech and Signal Processing, 6645–6649.
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural turing machines.
arXiv preprint arXiv:1410.5401.
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber,
J. (2016). LSTM: A search space odyssey. IEEE Transactions on Neural
Networks and Learning Systems, 28(10), 2222–2232.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural
Computation, 9(8), 1735–1780.
Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent
neural networks for sequence learning. arXiv preprint arXiv:1506.00019.
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (pp. 1310–1318).
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.
Sundermeyer, M., Schlüter, R., & Ney, H. (2012). LSTM neural networks
for language modeling. In Thirteenth Annual Conference of the
International Speech Communication Association.
