Kumar Sunil - Python For Accounting and Finance. An Integrative Approach To Using Python For Research
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Palgrave Macmillan imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Completing this book was made possible through the collective support
and wisdom of many outstanding individuals. I would like to express my
heartfelt gratitude to the University of Massachusetts Boston, its Accounting
and Finance Department, and its PhD program, all of which provided an
enriching environment that greatly contributed to my research and writing.
Special thanks are due to Professor Atreya Chakraborty, whose guidance
and insights have been invaluable. I am also profoundly grateful to the faculty
of the Accounting and Finance Department at UMass Boston, especially Arindam Bandopadhyaya, Robert Kim, Sangwan Kim, Surit Tinaikar, and Lucia Silva Gao, for their unwavering support and encouragement throughout this journey.
I also owe a deep sense of appreciation to Dr. Robert Taggart, Professor of
Finance (Retired) at Boston College, whose teachings and mentorship have
left an indelible mark on my professional life and academic pursuits. I also
extend a special thanks to my friend Ankush Mohekar, who was always there
when I needed him throughout this journey.
To all of you who supported me directly or indirectly during the writing
of this book, thank you. Your collective wisdom has not only enlightened my
path but has also enriched the pages of this work.
Contents
Character Classes
Escape Sequences
Groups in Regex
Substitution or Replacement Metacharacters
Assertions
Regular Expressions Cheat Sheet
4 Important Python Libraries
Library
Data Access Libraries
BeautifulSoup
Requests
Scrapy
Data Manipulation Libraries
Pandas
NumPy
Dask
Data Visualization Libraries
Matplotlib
Statistical Analysis Libraries
SciPy
Statsmodels
PyMC3
Machine Learning Libraries
Scikit-Learn
TensorFlow
PyTorch
Keras
The disciplines of accounting and finance are inherently driven by data and
heavily rely on analytical methodologies. As technological progress continues,
the availability of data for analysis expands accordingly. To maintain compet-
itiveness and ensure informed decision-making, researchers and practitioners
in these fields must possess the ability to efficiently analyze large datasets.
Python, a versatile programming language, has garnered significant attention
in the realm of accounting and finance research. Its user-friendly syntax and
robust libraries make it an optimal tool for performing tasks such as data
analysis, machine learning, and data visualization. Acquiring proficiency in
Python can yield substantial advantages in professional endeavors for doctoral
candidates, researchers, and accounting professionals alike.
This chapter aims to introduce Python and elucidate its merits specifically
within the context of accounting research. It commences with an overview of
the language, encompassing its historical evolution, distinctive features, and
wide-ranging applications. The discussion subsequently transitions to a step-
by-step guide for installing and configuring a Python environment on the
reader’s personal computer, enabling a prompt initiation of coding activities.
By the conclusion of this chapter, readers will have acquired a foundational
understanding of Python and be adequately prepared to explore its extensive
capabilities within the domains of accounting and finance research.
For example, the statement x = 1 creates a variable named x and assigns it the value 1. Once a variable has been created, it can be used in expressions and calculations.
Python also provides several operators that can be used to perform calculations and manipulate data. These include arithmetic operators (+, -, *, /, //, %, **), comparison operators (==, !=, <, >, <=, >=), and logical operators (and, or, not).
Here's an example of how you can use variables and operators in Python:
x = 10
y = 5
z = x + y
print(z)
In this example, we have created two variables named x and y and assigned
them the values of 10 and 5, respectively. We have then created a third vari-
able named z and assigned it the value of x + y, which is 15. Finally, we have
printed the value of z to the console using the print() function.
If, else, and elif statements are conditional statements that are used to execute
certain code based on specific conditions. These statements allow the program
to make decisions based on whether a certain condition is true or false.
• If Statements
Here’s an example:
x = 10
if x > 5:
    print("x is greater than 5")
In this example, the condition is x > 5. Since x is equal to 10, which is greater
than 5, the code inside the if statement will be executed, and the output will
be x is greater than 5.
• Else Statements
The else statement is used to execute code if the condition in the if statement
is false. The syntax of an else statement is as follows:
if condition:
    # code to execute if condition is true
else:
    # code to execute if condition is false
Here’s an example:
x = 2
if x > 5:
    print("x is greater than 5")
else:
    print("x is less than or equal to 5")
• Elif Statements
The elif statement is used to test multiple conditions and execute different
code based on which condition is true. The syntax of an elif statement is as
follows:
if condition1:
    # code to execute if condition1 is true
elif condition2:
    # code to execute if condition2 is true
else:
    # code to execute if all conditions are false
Here’s an example:
x = 2
if x > 5:
    print("x is greater than 5")
elif x == 5:
    print("x is equal to 5")
else:
    print("x is less than 5")
In this example, the first condition (x > 5) is false, so the program moves
on to the next condition (x == 5). Since x is not equal to 5, the program
executes the code inside the else statement, and the output is x is less than
5.
• For Loops
A for loop is used to iterate over a sequence of elements. This sequence can
be a list, tuple, string, or any other iterable object. The basic syntax for a for
loop is:
for variable in sequence:
    # code to execute
The variable is assigned to each element in the sequence, and the code inside
the loop is executed for each element.
For example, let’s say we want to print each number in a list:
numbers = [1, 2, 3, 4, 5]
for number in numbers:
    print(number)
Output:
1
2
3
4
5
We can also iterate over the characters of a string:
for character in "Hello, World!":
    print(character)
Output:
H
e
l
l
o
,
 
W
o
r
l
d
!
We can also use the range function to create a sequence of numbers to iterate over. The range function can take up to three arguments: start, stop, and step. The start argument is the first number in the sequence (inclusive), the stop argument marks where the sequence ends (exclusive), and the step argument is the amount by which to increment each number; start and step are optional and default to 0 and 1, respectively.
for i in range(1, 6):
    print(i)
Output:
1
2
3
4
5
• While Loops
A while loop repeatedly executes a block of code as long as its condition remains true. For example:
i = 1
while i <= 5:
    print(i)
    i += 1
Output:
1
2
3
4
5
We can also use the break and continue statements in for and while loops, just
like in if statements. The break statement is used to exit the loop completely,
while the continue statement is used to skip the current iteration and move
on to the next one.
# Example of using break and continue in a for loop
for i in range(1, 6):
    if i == 3:
        break
    elif i == 2:
        continue
    print(i)
Output:
1
# Example of using break and continue in a while loop
i = 1
while i <= 5:
    if i == 3:
        break
    elif i == 2:
        i += 1
        continue
    print(i)
    i += 1
Output:
1
Python also provides two other important control flow statements: break and
continue. These statements allow you to alter the flow of a loop based on
certain conditions.
• Break Statement
The break statement is used to terminate a loop prematurely. When the inter-
preter encounters a break statement within a loop, it immediately exits the
loop and resumes execution at the next statement after the loop. This is useful
when you need to exit a loop early based on some condition.
Here’s an example of using the break statement in a for loop:
for i in range(1, 11):
    if i == 5:
        break
    print(i)
In this example, the loop will iterate over the numbers 1 through 10.
However, when the loop variable i is equal to 5, the break statement is
executed, and the loop is terminated prematurely. As a result, only the
numbers 1 through 4 are printed.
• Continue Statement
The continue statement is used to skip the current iteration of a loop and
move on to the next iteration. When the interpreter encounters a continue
statement within a loop, it immediately skips to the next iteration of the
loop without executing any further statements in the current iteration. This
is useful when you need to skip over certain iterations of a loop based on
some condition.
Here’s an example of using the continue statement in a while loop:
i = 0
while i < 10:
    i += 1
    if i % 2 == 0:
        continue
    print(i)
In this example, the loop will iterate over the numbers 1 through 10.
However, when the loop variable i is even, the continue statement is
executed, and the current iteration of the loop is skipped. As a result, only
the odd numbers are printed.
• Pass Statement
In Python, the pass statement is used as a placeholder for code that is not
yet implemented. It is an empty statement that does nothing and is used to
avoid syntax errors when you have an empty block of code.
The pass statement is particularly useful when writing code that requires a
function or loop structure to be defined, but you haven’t yet written the code
that goes inside. Rather than leaving the block empty, which will generate an
error, you can use the pass statement to fill the block with something that
does nothing.
Here is an example of the pass statement in a loop:
for i in range(10):
    if i % 2 == 0:
        pass
    else:
        print(i)
In this code, the pass statement is used to fill the block of code that runs
when i is an even number. Since we don’t want to do anything in this case,
we simply use pass to fill the block.
Without the pass statement, the code would generate a syntax error
because the if statement requires a block of code to be executed when the
condition is true.
The pass statement can also be used in functions, classes, and other code
blocks where a block of code is required but you don’t want to execute any
code.
For example, in a function that you are still designing, you can use the
pass statement to fill the function body until you’re ready to write the actual
code:
def my_function():
    pass
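Functions can also take arguments and return a value. For example:
def add(a, b):   # the function name here is illustrative
    return a + b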
This function takes two arguments, a and b, and returns their sum using
the return keyword. The return statement is used to return a value from the
function to the caller.
Functions in Python can also have default arguments, which are used
when the caller does not provide a value for a particular argument. Here’s
an example of a function with default arguments:
def greet(name, greeting="Hello"):
    print(greeting + ", " + name + "!")
This function takes two arguments, name and greeting, with greeting
having a default value of "Hello". If the caller does not provide a value for
greeting, the function will use the default value.
A module is a file that contains Python code, such as functions, variables,
and classes. Modules allow code to be organized into separate files, making
it easier to manage and reuse code across different projects. In Python,
modules are imported using the import keyword, followed by the name of
the module.
Here’s an example of how to import the math module in Python:
import math
print(math.pi)
This code imports the math module and prints the value of pi, which is
defined in the math module.
from my_module import greet

greet("John")
This code imports the greet function from the my_module module and calls
it with the argument "John", which will print the greeting "Hello, John!".
b. Tuples: Tuples are similar to lists, but they are immutable, which means
that they cannot be modified after they are created. Tuples are created
using parentheses, and items in the tuple are separated by commas. Here’s
an example:
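# a simple tuple; the values are illustrative
my_tuple = (1, 2, 3)
print(my_tuple[0])   # items can be read by index, but not reassigned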
These data structures provide a way to store and manipulate data in Python.
They can be used in various ways, such as sorting, filtering, and searching
data. By understanding the different data structures available in Python, you
can choose the best one for your specific needs and optimize your code for
performance.
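Python can also read keyboard input with the built-in input() function; a minimal version of the example described next (the prompt text is assumed) is:
name = input("What is your name? ")
print("Hello, " + name + "!")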
In this example, the input() function prompts the user to enter their name,
and the response is stored in the name variable. The program then uses the
print() function to output a personalized greeting.
Non-standard input methods, such as reading from a file, are also common
in Python.
a. Opening a File
To open a file in Python, we use the open() function. The basic syntax for
opening a file in Python is:
file_object = open("filename", "mode")
• filename: The name of the file (including the path, if necessary) that we
want to open.
• mode: The mode in which we want to open the file. The mode can
be either read mode (’r’), write mode (’w’), append mode (’a’), or a
combination of these modes.
• ’r’: This mode is used to read data from an existing file. It is the default
mode when we open a file. If the file does not exist, we will get an error.
• ’w’: This mode is used to write data to a file. If the file exists, it will be
overwritten. If the file does not exist, a new file will be created.
• ’a’: This mode is used to append data to an existing file. If the file does not
exist, a new file will be created.
• ’x’: This mode is used to create a new file. If the file already exists, we will
get an error.
b. Closing a File
When you are finished with a file, call the close() method of the file object to release its resources:
file_object.close()
c. Reading from a File
To read from a file in Python, we use the read() method of the file object.
Here is an example:
file_object = open("filename", "r")
data = file_object.read()
print(data)
file_object.close()
This will read the entire contents of the file into the data variable.
d. Writing to a File
To write to a file in Python, we use the write() method of the file object.
Here is an example:
file_object = open("filename", "w")
file_object.write("This is some text that we want to write to the file.")
file_object.close()
file_object.close()
This will write the specified text to the file. If the file already exists, it will be
overwritten.
e. Appending to a File
To append to a file in Python, we use the write() method of the file object
with the mode ’a’. Here is an example:
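file_object = open("example.txt", "a")
file_object.write("This is some additional text.")
file_object.close()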
In this example, the open() function is used to open the example.txt file in
append mode. The write() method is then used to append the string "This
is some additional text." to the end of the file. Finally, the close() method is
used to close the file.
Python provides the csv module for working with CSV files. You can use the csv.reader() function to read data from a CSV file and the csv.writer() function to write data to a CSV file. For example:
import csv
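# read rows from a CSV file (the file names here are placeholders)
with open("data.csv", "r", newline="") as infile:
    reader = csv.reader(infile)
    for row in reader:
        print(row)

# write rows to a new CSV file
with open("output.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["name", "value"])
    writer.writerow(["alpha", 1])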
The os Module
The os module in Python provides a way to interact with the file system.
This includes creating, deleting, and renaming files and directories, as well as
navigating the file system and changing file permissions.
To use the os module, it must be imported at the beginning of the Python
script using the following command:
import os
Once the os module has been imported, there are several functions that can be used for file operations. One of these is os.path.exists(), which can be used to check whether a file or directory exists. Its syntax is as follows:
os.path.exists(path)
The path argument is a string representing the path to the file or directory
that you want to check. The os.path module provides a way to construct
file paths that is platform-independent, so you can use the same code on
Windows, Mac, and Linux.
Here is an example of how to use os.path.exists() to check whether a file exists:
import os
if os.path.exists("example.txt"):
    print("The file exists.")
else:
    print("The file does not exist.")
If the file "example.txt" exists in the current directory, the output will be "The
file exists." Otherwise, the output will be "The file does not exist."
Another useful function in the os module is os.listdir(), which returns a
list of all the files and directories in a given directory. The syntax for using
os.listdir() is as follows:
os.listdir(path)
The path argument is a string representing the path to the directory you want
to list. Here is an example of how to use os.listdir() to list all the files in the
current directory:
import os
files = os.listdir(".")
for file in files:
    print(file)
This code will print the names of all the files in the current directory.
In addition to these functions, the os module provides many other func-
tions for file operations, such as creating directories, renaming files, and
changing file permissions. With the os module, you can perform all the file
operations you need to manage files and directories in your Python scripts.
In Python, you can define a class using the class keyword followed by the
name of the class. Here’s an example of a simple class definition:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def say_hello(self):
        print(f"Hello, my name is {self.name} and I'm {self.age} years old.")
This class defines a Person object with two attributes (name and age) and
one method (say_hello). The __init__ method is a special method that gets
called when an object of the class is created.
To create an object of this class, you can use the following code:
person = Person("Alice", 30)
This creates a Person object with the name "Alice" and age 30. You can access
the object’s attributes and methods using dot notation:
print(person.name)   # Output: Alice
person.say_hello()   # Output: Hello, my name is Alice and I'm 30 years old.
Classes can also inherit from other classes. For example:
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        raise NotImplementedError("Subclass must implement abstract method")

class Dog(Animal):
    def speak(self):
        return "Woof"
In this example, we define the Animal class with an __init__ method that
initializes the name attribute, and a speak method that raises a NotImple-
mentedError because it is an abstract method that must be implemented by
any subclass.
We then define the Dog class as a subclass of Animal and override the
speak method to return "Woof". Since the Dog class inherits from the
Animal class, it also inherits the __init__ method and the name attribute.
Polymorphism is another important feature of object-oriented program-
ming that allows you to use objects of different classes interchangeably,
as long as they have the same interface. In other words, if two classes
have the same methods and attributes, you can use objects of either class
interchangeably.
For example, let's say we have two classes called Circle and Rectangle, both of which have an area method that calculates the area of the shape:
class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.14 * self.radius ** 2

class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height
In this example, we define two classes, Circle and Rectangle, both of which
have an __init__ method that initializes the attributes of the shape, and an
area method that calculates the area of the shape.
Because both classes have the same interface (i.e., the same methods and
attributes), we can use objects of either class interchangeably in a function
that expects a shape object:
def print_area(shape):
    print(f"The area of the shape is {shape.area()}")

c = Circle(5)
r = Rectangle(4, 6)
print_area(c)
print_area(r)
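The output of this code is:
The area of the shape is 78.5
The area of the shape is 24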
Regular expressions (regex) are a powerful tool used for pattern matching and
text manipulation. A regular expression is a sequence of characters that define
a search pattern. It is widely used in programming languages such as Python
to search and manipulate strings. In other words, regex allows us to specify a
pattern of text that we want to match against, and then search for that pattern
within a larger body of text.
Regex finds its application in many fields, including computer science,
linguistics, and natural language processing. In the context of Python, regular
expressions are used extensively for text processing and data cleaning. For
example, if you have a large amount of text data that needs to be analyzed
or processed, regular expressions can be used to search for specific patterns
or structures in the text, such as finding all occurrences of a particular word
or phrase, or identifying patterns of text that indicate the presence of certain
types of data.
Understanding regular expressions is an important skill for any data scien-
tist or analyst. It is a valuable tool that can be used to clean and preprocess text
data, extract relevant information from large datasets, and perform advanced
natural language processing tasks. In the context of academic accounting and
finance research, regular expressions can be a powerful tool for text prepro-
cessing and data analysis. With the increasing availability of unstructured
data, such as financial reports, news articles, and social media posts, regular
expressions can help extract relevant information and identify patterns in the
data. This can lead to a deeper understanding of market trends, sentiment
analysis, and risk management. Moreover, regular expressions can be used in
conjunction with other data analysis techniques, such as machine learning, to prepare and analyze large volumes of textual data.
re Functions
In Python, the re module provides support for regular expressions. This
module allows you to use regex functions and methods to search, find, and
manipulate text data. The re module is a standard library in Python, so you
don’t need to install any external libraries to use regular expressions.
The re module provides several functions to work with regular expres-
sions. These functions can be used to perform a wide range of tasks such
as searching for a pattern in a string, replacing a pattern with another string,
and splitting a string based on a pattern. Some of the commonly used re
functions include:
• re.search(): Searches for a pattern in a string and returns the first match.
• re.findall(): Finds all the occurrences of a pattern in a string and returns
them as a list.
• re.sub(): Replaces all the occurrences of a pattern in a string with another
string.
• re.split(): Splits a string into a list of substrings based on a pattern.
These functions can be used to perform various text processing tasks, such as data cleaning, text classification, and sentiment analysis, among others.
These functions will be used in Python as follows:
• re.search(pattern, string, flags=0): searches for the first occurrence of the
pattern in the string and returns a match object.
• re.match(pattern, string, flags=0): searches for the pattern at the begin-
ning of the string and returns a match object.
• re.findall(pattern, string, flags=0): returns all non-overlapping occur-
rences of the pattern in the string as a list of strings.
• re.finditer(pattern, string, flags=0): returns an iterator of all non-
overlapping occurrences of the pattern in the string as match objects.
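As a brief illustration (the sample text and patterns here are chosen only for demonstration):

import re

text = "Sales rose 5% in 2022 and 7% in 2023."
print(re.search(r"\d+%", text).group())   # first match: '5%'
print(re.findall(r"\d+%", text))          # all matches: ['5%', '7%']
print(re.sub(r"\d+%", "N%", text))        # replace every percentage
print(re.split(r" and ", text))           # split the string on ' and '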
Literals
Literals are the simplest form of regular expressions. They are characters that
match themselves, with no special meaning. For example, the regular expres-
sion "hello" would match the string "hello" and only "hello". Literals can be
any alphanumeric character, as well as special characters like punctuation and
whitespace.
To use a literal in a regular expression, simply include the character or
sequence of characters that you want to match. For example, to match the
string "cat", you would use the regular expression "cat". This will match only
the exact sequence of characters "cat".
It is important to note that literals are case-sensitive by default. This means
that the regular expression "cat" will not match the string "Cat". To match
both upper and lower case versions of a letter, you can use a character class,
which we will cover in a later subsection. An example of using re using only
literals is as follows:
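import re

text = "The quick brown fox jumps over the lazy dog."
matches = re.findall("fox", text)
print(matches)   # ['fox']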
This code searches for the literal string "fox" in the larger string "The quick
brown fox jumps over the lazy dog." using the re.findall() function. The
output of the code will be [’fox’], indicating that the search found one match
for the literal "fox" in the string.
Metacharacters
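Metacharacters are characters that carry a special meaning in a regular expression, such as . ^ $ * + ? { } [ ] \ | ( ). Before using them, here is a basic re.search() test with a plain literal pattern:

import re

text = 'The quick brown fox jumps over the lazy dog'
if re.search('fox', text):
    print('Pattern found')
else:
    print('Pattern not found')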
This code searches for the pattern fox within the string ’The quick brown
fox jumps over the lazy dog’ using the re.search() function. If the pattern is
found, it prints Pattern found. If the pattern is not found, it prints Pattern
not found.
Other examples of using a combination of literals and quantifiers are as
follows:
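text = "The quick brown fox jumps over the lazy dog"
if re.search("brown", text):
    print("Match found!")
else:
    print("No match found.")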
This code searches for the string "brown" within the given text "The quick
brown fox jumps over the lazy dog" using the regular expression module in
Python. It uses the re.search() function to look for a match and if a match is
found, it prints "Match found!". If no match is found, it prints "No match
found."
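text = "The quick brown fox jumps over the lazy dog"
pattern = "[aeiou]"
if re.search(pattern, text):
    print("Match found!")
else:
    print("No match found.")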
This code uses the re module in Python to search for a match of any
vowel character in a given text string. The regular expression pattern [aeiou]
matches any single character that is either a, e, i, o, or u. The search() func-
tion of the re module is used to search the text string for a match of this
pattern. If a match is found, the code prints "Match found!", otherwise it
prints "No match found."
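text = "The quick brown fox jumps over the lazy dog"   # example text
pattern = r"\S"
if re.search(pattern, text):
    print("Match found!")
else:
    print("No match found.")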
This code searches for a non-whitespace character in the given text string. The
regular expression pattern \S matches any character that is not a whitespace
character. The re.search() function searches for the first occurrence of the
pattern in the text string. If a match is found, the code prints "Match found!"
to the console, otherwise it prints "No match found."
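text = "The year is 2022"
pattern = r"\d"
match = re.search(pattern, text)
if match:
    print("Match found!")
else:
    print("No match found.")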
This code uses regular expressions in Python to search for a digit in a string.
The string "The year is 2022" is assigned to the variable text, and the regular
expression pattern \d is assigned to the variable pattern. The \d pattern
matches any digit. The re.search() function searches for a match between
the pattern and the text, and returns a match object if it finds a match. If
a match is found, the code prints "Match found!". Otherwise, it prints "No
match found."
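text = "The year is 2022"   # example text containing whitespace
pattern = r"\s"
if re.search(pattern, text):
    print("Match found!")
else:
    print("No match found.")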
The code uses the re module to search for whitespace characters in the given text string. It defines the pattern using a backslash followed by the lowercase letter 's' (\s), which represents a whitespace character. It then uses the search function to look for a match with the pattern in the text string. If a match is found, it prints "Match found!" otherwise it prints "No match found."
Quantifiers
Quantifiers in regex are used to specify how many times a character or group of characters can occur in a pattern. They help in making the regex more flexible and powerful. Some commonly used quantifiers include:
• * matches zero or more occurrences of the preceding character or group
• + matches one or more occurrences
• ? matches zero or one occurrence
• {n} matches exactly n occurrences, and {m,n} matches between m and n occurrences
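import re

text = "banana"   # example string containing runs of the letter 'a'
match = re.search("a+", text)
if match:
    print(match.group())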
This Python code uses the re module to search for a pattern of one or more ’a’
characters in the string text. The pattern is specified using the regular expres-
sion syntax and the + quantifier, which matches one or more occurrences of
the preceding character or group. If a match is found, the search() function
returns a match object that contains information about the match, including
the matching substring. The code then prints the matched substring.
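text = "The colour of money"   # example string
match = re.search(r"colou?r", text)
if match:
    print(match.group())
else:
    print("No match found.")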
This Python code uses the re module to search for a pattern in a string. The pattern is defined as colou?r, which uses the ? quantifier to indicate that the u character is optional. This means that the pattern will match both "color" and "colour". The search() method is used to search for the pattern in the text string. If a match is found, the group() method is used to return the matched text. In this case, the output would be "colour". If no match is found, the output would be "No match found."
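text = "The quick brown fox jumps over the lazy dog"
match = re.search(r"\b\w{4}\b", text)
if match:
    print("Match found:", match.group())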
In this Python code, the regular expression pattern \b\w{4}\b is used to find a four-letter word that occurs at the boundary of a word in the input text string. \b matches a word boundary, and \w matches a word character (letters, digits, or underscore). The {4} quantifier specifies that the preceding \w should match exactly four times. When run, the program searches for a match to this pattern in the input string "The quick brown fox jumps over the lazy dog" and prints "Match found: over", since "over" is the first four-letter word at a word boundary in the string.
Character Classes
Character classes in regex are a powerful tool for specifying the exact set of
characters you want to match in a search pattern. They can be used to match
any type of character, such as digits, letters, whitespace, or punctuation.
For example, the character class [0-9] matches any single digit from 0 to
9, while the character class [a-z] matches any single lowercase letter from a to
z. You can also use negation to specify a set of characters that should not be
matched, such as [^aeiou] to match any consonant.
Character classes can also be used to match specific types of characters,
such as digits, letters, or whitespace. For example, \d matches any single
digit, \w matches any alphanumeric character, and \s matches any whitespace
character.
Here is a list of some common character classes in regex:
• \d Matches any digit
• \D Matches any non-digit
• \w Matches any word character (letters, digits, or underscore)
• \W Matches any non-word character
• \s Matches any whitespace character
• \S Matches any non-whitespace character
• [abc] Matches any one of the characters a, b, or c
• [^abc] Matches any character other than a, b, or c
• [a-z] Matches any character in the range a to z
There are also some other character classes that can be used, depending on the regex flavor being used, such as POSIX-style classes (for example, [:alpha:] and [:digit:]). The following examples show character classes in practice.
• Matching an Email Address:
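import re

text = "Please contact us at support@example.com for assistance."   # example address
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
match = re.search(pattern, text)
if match:
    print("Match found!")
else:
    print("No match found.")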
This code is searching for an email address using regular expression in Python.
The regex pattern used here is r’\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-
Z|a-z]{2,}\b’. It starts with the \b anchor which matches at a word boundary.
Then it matches one or more occurrences of characters that can be letters
(both uppercase and lowercase), digits, period, underscore, percent, plus, or
hyphen. After that, it matches the @ symbol followed by one or more occur-
rences of letters (both uppercase and lowercase), digits, period, or hyphen.
Finally, it matches a period followed by two or more letters (uppercase or
lowercase) to match the top-level domain. The if-else statement checks if a
match is found or not and prints the appropriate message.
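• Finding All Words in a String:
text = 'This is a sample string.'
words = re.findall(r'\b\w+\b', text)
print(words)   # ['This', 'is', 'a', 'sample', 'string']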
The code uses the regular expression module re to find all words in a given
string. The regular expression pattern used is r’\b\w+\b’, which matches
one or more word characters (\w+) that are surrounded by word bound-
aries (\b). Word characters include alphanumeric characters and underscores.
The findall() function of the re module is used to find all non-overlapping
matches of the pattern in the input string. In this case, the input string is
’This is a sample string.’, and the output will be a list of words found in the
string, which will be [’This’, ’is’, ’a’, ’sample’, ’string’].
• Matching a URL:
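# the URL and the exact pattern here are illustrative
url = "Visit https://www.example.com/reports for details."
pattern = r'https?://[A-Za-z0-9./_-]+'
match = re.search(pattern, url)
if match:
    print("URL found:", match.group())
else:
    print("URL not found")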
This Python code uses the re module to search for a URL in a given string.
The regular expression used to find a URL is stored in the variable pattern,
which matches a string that starts with either "http://" or "https://" and is
followed by one or more characters that can be a combination of letters,
digits, and special characters like dots and hyphens. The re.search() function
is used to search for a match for the pattern in the given URL. If a match
is found, the code prints the message "URL found" along with the matched
URL, otherwise, it prints "URL not found".
• Matching a Date:
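date = "The report was filed on 2023-05-15."   # example input string
match = re.search(r'\d{4}-\d{2}-\d{2}', date)
if match:
    print("Date found:", match.group())
else:
    print("Date not found")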
This code uses the re module to search for a date string in YYYY-MM-DD
format in a given input string. The regex pattern used is r’\d{4}-\d{2}-\d{2}’,
which matches four digits for the year, two digits for the month, and two
digits for the day, separated by hyphens. The input string date is searched
for a match using the re.search() method, which returns a match object if a
match is found. The if condition checks if a match was found, and if so, it
prints a message with the matched date using the match.group() method. If
no match is found, it prints a message indicating the date was not found.
This code uses the Python re module to search for a date in a specific format
(YYYY-MM-DD) within a string. The regular expression pattern used is
\d{4}-\d{2}-\d{2}, which matches any sequence of four digits, followed by a
hyphen, followed by two digits, another hyphen, and finally two more digits.
The search function is used to search for a match of this pattern within the
date string. If a match is found, the code prints a message saying that the
date was found and prints the matched string. Otherwise, it prints a message
saying that the date was not found.
• Matching a Social Security Number:
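ssn = "My SSN is 123-45-6789."   # dummy value used only for illustration
pattern = r'\b\d{3}-\d{2}-\d{4}\b'
match = re.search(pattern, ssn)
if match:
    print("Social security number found:", match.group())
else:
    print("Social security number not found")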
This Python code demonstrates how to use regular expressions to search for
a social security number in a given string. The regular expression pattern
used is \b\d{3}-\d{2}-\d{4}\b, which matches a sequence of three digits, a
hyphen, two digits, another hyphen, and four digits. In this code, the social
security number to be searched is provided in the ssn variable, and the
re.search() function is used to find a match for the regular expression pattern.
If a match is found, the message "Social security number found" along with
the matched string is printed to the console, otherwise the message "Social
security number not found" is printed.
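card = "Payment made with card 1234-5678-9012-3456."   # dummy card number
pattern = r'\b(?:\d{4}[- ]){3}\d{4}\b'
if re.search(pattern, card):
    print("Credit card number found")
else:
    print("Credit card number not found")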
The code uses the re module to find and validate a credit card number. The regex pattern used is r'\b(?:\d{4}[- ]){3}\d{4}\b', which matches a credit card number with a format of four-digit groups separated by hyphens or spaces. The \b at the beginning and end of the pattern indicates word boundaries. The (?:…) syntax is used for a non-capturing group, and the \d{4}[- ] matches four digits followed by a hyphen or a space. The {3} quantifier indicates that this group should be repeated three times, and the final \d{4} matches the last four digits of the credit card number. If the credit card number is found, the program prints a message saying so, otherwise it prints a message saying it was not found.
Escape Sequences
Escape sequences are special characters that are used to match specific characters or classes of characters in regular expressions. They are usually represented by a backslash (\) followed by a letter or sequence of letters.
Some common escape sequences in regular expressions include \n (newline), \t (tab), \r (carriage return), \d (digit), \w (word character), and \s (whitespace).
Escape sequences can also be used to match specific characters that have special meaning in regular expressions, such as the period (.) or the backslash (\) itself. To match these characters literally, you need to use an escape sequence.
For example, if you want to match a period character in a regular expression, you need to use the escape sequence \. because the period has a special meaning in regular expressions (it matches any character). Similarly, if you want to match a backslash character, you need to use the escape sequence \\. Here are some examples of using escape sequences in Python:
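import re

text = "Name:\tJohn\tDoe"   # example string containing two tab characters
matches = re.findall(r"\t", text)
print("Number of matches:", len(matches))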
This code uses the regex pattern r"\t" to match a tab character (\t) in the
text string. It then uses the re.findall() function to find all matches of the
pattern in the text, and returns the number of matches found. In this case,
the output will be "Number of matches: 2", as there are two tab characters
in the text.
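text = "First line\nSecond line"   # example string containing one newline
matches = re.findall(r"\n", text)
print("Number of matches:", len(matches))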
This code uses the regex pattern r"\n" to match a newline character (\n) in
the text string. It then uses the re.findall() function to find all matches of the
pattern in the text, and returns the number of matches found. In this case,
the output will be "Number of matches: 1", as there is one newline character
in the text.
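text = "abc\x00def"   # example string containing one null character
matches = re.findall(r"\x00", text)
print("Number of matches:", len(matches))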
This code uses the regex pattern r"\x00" to match a null character (\0) in
the text string. It then uses the re.findall() function to find all matches of
the pattern in the text, and returns the number of matches found. In this case,
the output will be "Number of matches: 1", as there is one null character in
the text.
Groups in Regex
Groups are created by enclosing part of a pattern in parentheses. A group captures the text it matches so that the captured text can be referred to again, either later in the pattern or in a replacement string.
Substitution or Replacement Metacharacters
The following metacharacters can be used in replacement strings:
$n: Inserts the nth captured group from the pattern. For example, $1 will insert the first captured group, $2 will insert the second captured group, and so on.
$': Inserts the part of the string after the matched substring.
$`: Inserts the part of the string before the matched substring.
$+: Inserts the last captured group.
\g<n>: This is used to specify the group number to be replaced. For example, \g<1> represents the first group in the pattern.
\g<name>: This is used to specify the group name to be replaced. For example, \g<name> represents the group named 'name' in the pattern.
In Python's re.sub(), captured groups are referred to with \1, \2, and so on, or with \g<n>.
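import re

text = "The quick brown fox jumps over the lazy dog"
result = re.sub(r"(brown)\s(fox)", r"\2 \1", text)   # capture the two words and swap them
print(result)   # The quick fox brown jumps over the lazy dog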
In this example, we’re using the sub function to reverse the order of two
words ("brown" and "fox") in the given text. The regular expression used here
matches two words separated by whitespace and captures them in separate
groups. The replacement string \2 \1 swaps the order of the captured groups.
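date = "2023-05-15"   # example date string
result = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", date)
print(result)   # 05/15/2023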
In this example, we’re using the sub function to replace a date in the format
"YYYY-MM-DD" with a date in the format "MM/DD/YYYY". The regular
expression used here matches the date in the first format and captures the
year, month, and day in separate groups. The replacement string \2/\3/\1
rearranges the captured groups to match the desired format.
Assertions
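Assertions match positions in the text, based on what precedes or follows that position, without consuming any characters. Python's re module supports several kinds of assertions.
i. Positive Lookahead: This assertion matches a pattern only if it is followed by another pattern. It is represented by the syntax (?=pattern). An example of a positive lookahead assertion would be:

import re

text = "The price is 100 dollars"   # example string
match = re.search(r"\d+(?= dollars)", text)
if match:
    print("Match found:", match.group())
else:
    print("No match found")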
In this example, the pattern \d+ matches one or more digits, and (?=
dollars) is a positive lookahead assertion that matches only if the pattern
is followed by the string " dollars". Therefore, the code should output
Match found: 100.
ii. Negative Lookahead: This assertion matches a pattern only if it is not
followed by another pattern. It is represented by the syntax (?!pattern).
For example, the regex pattern foo(?!bar) would match the substring
foo only if it is not followed by bar. An example of negative lookahead
assertion would be:
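# the example string is chosen so that the digits are not followed by " dollars"
text = "The price is 100 euros"
match = re.search(r"\d+(?! dollars)", text)
if match:
    print("Match found:", match.group())
else:
    print("No match found")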
In this example, the pattern \d+ matches one or more digits, and (?!
dollars) is a negative lookahead assertion that matches only if the pattern
is not followed by the string " dollars". Therefore, the code should output
Match found: 100.
iii. Positive Lookbehind: This assertion matches a pattern only if it is
preceded by another pattern. It is represented by the syntax (?<=pattern).
For example, the regex pattern (?<=foo)bar would match the substring bar
only if it is preceded by foo. An example of positive lookbehind assertion
would be:
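text = "foobar"   # example string
match = re.search(r"(?<=foo)bar", text)
if match:
    print("Match found:", match.group())
else:
    print("No match found")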
In this example, the re.search() function looks for the pattern (?<=foo)bar in the text and returns a match object if it is found. The if statement checks if a match was found and prints the result accordingly.
The negative word boundary (\B) is used only rarely, but an example here is instructive.
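# example text; only the 'cat' inside 'bobcat' satisfies \Bcat\b
text = "The bobcat saw a cat."
matches = re.findall(r"\Bcat\b", text)
print(matches)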
In this example, the regex pattern \Bcat\b uses the negative word boundary \B to match 'cat' only when it does not start a word, that is, when it is immediately preceded by another word character, while the trailing \b still requires a word boundary after it. The re.findall() function is used to find all occurrences of the pattern in the given text. When this code is executed, it will output the following result:
['cat']
Since there is only one occurrence of 'cat' in the text that matches the pattern, the output is a list with only one element.
vi. Comment in regex is denoted by the (?#comment) syntax. It allows
adding comments to the regex pattern without affecting the pattern
matching. The text inside the (?#) will be ignored by the regex engine.
In Python, the triple-quoted strings (''' or """) are often used to create
multiline string literals. While these can be used to include comments in
code, they are not considered as an official commenting syntax in regular
expressions. However, they can be used to add comments in regular
expressions if the regex is compiled with the re.VERBOSE flag. In this
case, the comments are ignored by the regex engine. An example would
be:
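# a pattern compiled with re.VERBOSE can contain whitespace and # comments;
# (?#...) comments also work in ordinary patterns
pattern = re.compile(r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
    -
    \d{2}   # day
    """, re.VERBOSE)
match = pattern.search("Report dated 2023-05-15")
if match:
    print("Match found:", match.group())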
REGEX DESCRIPTION
Literals and Metacharacters
^ The start of a string
$ The end of a string
. Wildcard which matches any character, except newline (\n)
| Matches a specific character or group of characters on
either side (e.g., a|b corresponds to a or b)
\ Used to escape a special character
a The character "a"
ab The string "ab"
Quantifiers
* Used to match 0 or more of the previous (e.g., xy*z could
correspond to "xz", "xyz", "xyyz", etc.)
? Matches 0 or 1 of the previous
+ Matches 1 or more of the previous
{5} Matches exactly 5
{5,10} Matches between 5 and 10 of the previous
Character Classes
\s Matches a whitespace character
\S Matches a non-whitespace character
\w Matches a word character
\W Matches a non-word character
\d Matches one digit
\D Matches one non-digit
[\b] A backspace character
\c A control character
Escape Sequences
\n Matches a newline
\t Matches a tab
\r Matches a carriage return
\ZZZ Matches octal character ZZZ
\xZZ Matches hex character ZZ
\0 A null character
\v A vertical tab
Groups
(xyz) Grouping of characters
(?:xyz) Non-capturing group of characters
[xyz] Matches a range of characters (e.g., x or y or z)
[^xyz] Matches a character other than x or y or z
[a-q] Matches a character from within a specified range
[0-7] Matches a digit from within a specified range
Substitution or Replacement Metacharacters
$` Insert before matched string
$' Insert after matched string
$+ Insert last matched
$& Insert entire match
$n Insert nth captured group
Assertions
(?=xyz) Positive lookahead
(?!xyz) Negative lookahead
(?<=xyz) Positive lookbehind
(?<!xyz) Negative lookbehind
\b Word Boundary (usually a position between \w and \W)
(?#xyz) Comment
4
Important Python Libraries
Library
In programming, a library is a pre-written collection of utility methods,
classes, and modules that can be used to perform specific tasks without having
to create functionalities from scratch. The APIs of libraries have a narrow
scope, such as Strings, Input/Output, and Sockets, and require fewer depen-
dencies, making them highly reusable. Code reusability is the primary reason
for using libraries, as they allow developers to avoid reinventing the wheel and
instead focus on solving the actual problem at hand. Libraries are like soft-
ware installed on a computer, and their utilities can be utilized in multiple
applications, enabling developers to achieve more in less time and with fewer
dependencies.
Installing a library in Python is a simple process that can be done using the pip command, which is a package installer for Python. Here are the steps to install a library in Python:
1. Open a command prompt or terminal window.
2. Type the following command:
pip install library_name
Replace 'library_name' with the name of the library you want to install.
3. Press Enter to execute the command. This will download and install the library along with its dependencies.
Once the installation is complete, you can start using the library in your
Python code. To import the library, simply add the following line at the
beginning of your Python script:
import library_name
Replace 'library_name' with the name of the library you have installed.
Note that some libraries may require additional dependencies or installa-
tion steps. You can find more information about a specific library’s installation
process in its documentation. If you encounter any issues installing a library
using pip, you can try running the command prompt as an administrator as a
troubleshooting step. If the issue persists, you can seek help from the library’s
documentation or online forums to resolve the issue.
For the purpose of this discussion, I divide Python libraries that are useful for academic researchers in accounting and finance into five functional categories: data access, data manipulation, data visualization, statistical analysis, and machine learning. It is pertinent to mention that this categorization is only for the purpose of discussion. More often than not, a library performs more than one function and, therefore, may fall into more than one category.
BeautifulSoup
To begin using BeautifulSoup, import the necessary libraries and create a new
BeautifulSoup object:
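from bs4 import BeautifulSoup

# a small HTML snippet used here for illustration
html = "<html><body><h1>Quarterly Report</h1><p>Revenue grew 10%.</p><p>Costs fell 2%.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")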
• Searching for tags: The find() and find_all() methods allow you to search
for tags based on their name, attributes, or content:
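heading = soup.find("h1")         # first matching tag by name
paragraphs = soup.find_all("p")   # all matching tags, returned as a list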
Once you’ve located the desired tags, you can extract their text content using
the .text or .get_text() methods:
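print(heading.text)               # Quarterly Report
print(paragraphs[0].get_text())   # Revenue grew 10%.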
Requests
Requests is not part of the Python standard library, but it can be easily installed using pip, the Python package installer. To install Requests, simply run the following command:
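pip install requests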
Once installed, Requests enables you to send various types of HTTP requests,
such as GET, POST, PUT, DELETE, and others. The most common request,
the GET request, can be performed using the get() function:
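import requests

# the URL here is a placeholder for any web page or API endpoint
response = requests.get("https://www.example.com")
print(response.status_code)
print(response.text[:200])   # first 200 characters of the response body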
The Requests library makes it easy to send data along with your requests,
specify custom headers, and control timeouts. For example, sending a JSON
payload with a POST request:
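# the endpoint URL and payload fields are placeholders
payload = {"name": "report", "year": 2023}
headers = {"User-Agent": "research-script"}
response = requests.post("https://api.example.com/items", json=payload, headers=headers, timeout=10)
print(response.status_code)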
Requests raises exceptions for certain types of network errors, such as time-
outs, DNS resolution failures, or SSL certificate issues. To handle these
exceptions gracefully, you can use a try-except block:
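try:
    response = requests.get("https://www.example.com", timeout=5)
    response.raise_for_status()   # raises an HTTPError for 4xx/5xx responses
except requests.exceptions.RequestException as error:
    print("Request failed:", error)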
The Requests library has several advantages that make it a popular choice
among Python developers. Its simplicity and ease of use are arguably its
most significant benefits. The intuitive interface abstracts the complexities
of handling HTTP requests, allowing developers to quickly send and receive
data from web services. This user-friendly design makes it easy for begin-
ners to get started with web development in Python. Another advantage is its
reliability, as the library is well-maintained and regularly updated, ensuring it
remains compatible with current web standards. Moreover, Requests is highly
extensible, with a wide range of built-in features and support for custom
middleware, making it suitable for a variety of web development tasks.
Despite its many benefits, there are a few disadvantages to using the
Requests library. One potential drawback is that Requests only supports
synchronous operations, meaning that it blocks the execution of the program
while waiting for a response from the server. This can lead to performance
issues, particularly in applications that require high levels of concurrency or
deal with numerous simultaneous requests. In such scenarios, using an asyn-
chronous library like aiohttp or httpx might be a better choice. Another
disadvantage is that Requests, being an external library, adds an additional
dependency to a project. While this is generally not a significant concern,
it can be problematic in certain contexts where minimizing dependencies is
crucial, such as in a resource-constrained environment or when deploying
a serverless application. Also, the Requests library is limited to the HTTP/
1.1 protocol, meaning it does not support newer protocols like HTTP/2 or
HTTP/3, which offer improved performance and features.
Scrapy
Scrapy is an open-source web crawling and scraping framework for Python. It was first released in 2008 and has since become one of the most popular web scraping frameworks among data scientists and developers alike. Scrapy offers a robust and
scalable solution to web scraping, allowing users to navigate and collect data
from even the most complex websites.
Before diving into Scrapy, it is essential to install the library and its depen-
dencies. The recommended way to do this is through the Python Package
Installer (pip):
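pip install scrapy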
Once installed, you can start building your first Scrapy project. To create a
new Scrapy project, run the following command:
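scrapy startproject myproject
Here, myproject is a placeholder for the name of your project.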
This command will generate a project structure with essential files and
directories for your web scraping project.
Scrapy's Components:
A Scrapy project is organized around a few core components: spiders define how a site is crawled and how data is extracted from its pages, items define the structured fields for the scraped data, item pipelines post-process and store extracted items, and middlewares hook into the request/response cycle to customize how pages are fetched.
One advantage of this modular design is that it promotes code reusability, making it easier to maintain and share code across multiple
projects or within a team. Another advantage of Scrapy is its extensibility.
Users can extend the library’s functionality by creating custom middlewares,
pipelines, and extensions to meet the specific requirements of various web
scraping scenarios. Furthermore, Scrapy comes with built-in CSS and XPath
selectors, which simplifies the process of targeting and extracting specific data
elements from web pages with precision.
Despite its many advantages, Scrapy also has some drawbacks. One poten-
tial issue is its relatively steep learning curve for those new to web scraping
or Python. Users may need to invest time in understanding Scrapy’s compo-
nents, such as spiders, items, middlewares, and pipelines, before being able
to efficiently build their projects. Another disadvantage is that Scrapy may
be considered overkill for small-scale or straightforward web scraping tasks.
In such cases, simpler libraries like BeautifulSoup or Requests might be
more suitable and easier to use. Moreover, Scrapy’s reliance on the Twisted
networking library might also pose a challenge for users unfamiliar with
asynchronous programming, potentially increasing the complexity of certain
projects. Moreover, while Scrapy is excellent for structured data extraction,
it may not be the best option for dealing with JavaScript-heavy websites or
those that require rendering. In these cases, using a headless browser such as
Selenium or integrating Scrapy with Splash can be necessary, adding an extra
layer of complexity to the web scraping process.
Pandas
Pandas is an open-source data manipulation and analysis library that has become an indispensable tool for data scientists, analysts, and programmers who work
with data in Python. Pandas provides an easy-to-use interface for handling
data in various formats, such as CSV, Excel, JSON, and SQL databases. With
its powerful data structures and functions, the library greatly simplifies data
processing and analysis tasks.
Key features of Pandas are:
• Powerful data structures: the one-dimensional Series and the two-dimensional DataFrame.
• Readers and writers for common formats such as CSV, Excel, JSON, and SQL databases.
• Tools for cleaning, filtering, sorting, grouping, merging, and aggregating data.
• Built-in handling of missing values and rich support for time series data.
To start using Pandas, import the library and create a DataFrame or Series:
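import pandas as pd

# column names and values are purely illustrative
df = pd.DataFrame({"ticker": ["AAPL", "MSFT", "GOOG"], "price": [190.5, 410.2, 140.3]})
s = pd.Series([1, 2, 3, 4, 5])
print(df.head())
print(s)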
Pandas provide for easy handling of missing values. The missing data can be
filled using interpolation. The relevant code would be:
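# a small example with missing values; method can be any of the options listed below
import numpy as np
prices = pd.Series([100.0, np.nan, 104.0, np.nan, 110.0])
filled = prices.interpolate(method="linear")
print(filled)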
• linear: Fill missing values using linear interpolation, which connects two
points with a straight line. This is the default method and is suitable for
data with a linear trend.
• time: Fill missing values based on the time index, taking into account the
time distance between points. This method is appropriate for time series
data with irregular intervals.
• index, values: Use the index or values, respectively, to perform interpola-
tion.
• nearest: Fill missing values with the nearest available data point.
• polynomial: Fill missing values using a polynomial function of a specified
order.
• spline: Fill missing values using a spline interpolation of a specified order.
• pad, ffill: Fill missing values using the previous value in the time series
(forward fill).
• bfill, backfill: Fill missing values using the next value in the time series
(backward fill).
• from_derivatives: Interpolate based on the first derivative of the data
points.
• piecewise_polynomial: Fill missing values using a piecewise polynomial
of a specified order.
• cubicspline: Fill missing values using a cubic spline interpolation.
• akima: Fill missing values using the Akima interpolator.
• pchip: Fill missing values using the PCHIP (Piecewise Cubic Hermite
Interpolating Polynomial) method.
• quadratic: Fill missing values using a quadratic function.
Pandas offers several advantages that make it a popular choice for data manip-
ulation and analysis in Python. One of its key strengths is the provision of
powerful data structures like Series and DataFrame, which simplify handling
complex data types and operations. These data structures come with a wide
range of built-in functions for data cleaning, filtering, sorting, and aggre-
gation, resulting in more efficient and readable code. Another advantage of
Pandas is its ability to seamlessly handle missing data. It represents missing
values as NaN and offers multiple functions to detect, drop, or fill them,
making data cleaning and preprocessing easier. Furthermore, Pandas boasts
strong support for time series data, which is crucial for various applications
in finance, economics, and other fields. The library’s time-based indexing,
resampling, and frequency conversion features make it an ideal choice for
time series analysis.
On the other hand, there are some disadvantages to using Pandas. One
primary concern is its memory consumption, as Pandas DataFrames can
consume a significant amount of memory, especially when working with
large datasets. This can lead to performance issues and may require users to
resort to other libraries or techniques for handling big data, such as Dask
or PySpark. Another drawback is that, while Pandas is highly versatile, it
might not be the best choice for certain specialized tasks. For instance, when
working with multidimensional arrays or tensors, users may find NumPy
or TensorFlow more suitable. Additionally, for highly performance-sensitive
applications, Pandas may not be the optimal choice due to its inherent over-
head compared to lower-level libraries like NumPy. In such cases, users might
need to resort to Cython or Numba for further performance optimization.
NumPy
After installing NumPy, import it into your Python script or notebook with:
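import numpy as np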
NumPy offers several key advantages that make it an indispensable tool for
numerical computing in Python. First, its high-performance N-dimensional
array object, ndarray, enables efficient array manipulation and storage of large
volumes of data. This is particularly beneficial for users working with big
datasets or complex mathematical operations. Second, NumPy’s comprehen-
sive set of mathematical functions and broadcasting capabilities allow for fast,
vectorized computations that significantly reduce the need for explicit loops.
This results in cleaner, more readable code and improved performance. Third,
NumPy provides excellent interoperability with other scientific libraries, such
as SciPy, Matplotlib, and Pandas. This seamless integration fosters a cohesive
and powerful ecosystem for scientific computing and data analysis in Python.
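For instance, arithmetic on an ndarray is applied element-wise without explicit loops (the numbers here are illustrative):
returns = np.array([0.01, -0.02, 0.03, 0.015])
scaled = returns * 252    # broadcasting a scalar across the whole array
print(scaled.mean())      # vectorized reduction, no Python loop required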
Despite its numerous advantages, NumPy has certain limitations. One key
drawback is that it requires the elements within an array to be of the same
data type, restricting the flexibility of the data structure. This constraint may
lead to increased memory usage when working with mixed data types, as users
may need to create separate arrays for each type. Additionally, while NumPy
offers a vast range of mathematical functions, it does not provide higher-
level data manipulation or visualization tools, often necessitating the use of
other libraries, such as Pandas or Matplotlib. Finally, NumPy’s performance
benefits primarily apply to numerical operations, making it less suitable
for tasks involving non-numerical data or those that require extensive text
processing. For these tasks, users may need to consider alternative libraries or
data structures that better align with their specific requirements.
Dask
Dask is a powerful library that offers several key features that make it a
valuable tool for academic researchers in various fields. Dask’s scalability is
a major advantage, as it can scale computations from a single core on a
laptop to thousands of cores on a distributed cluster. This flexibility allows
users to scale their computations horizontally or vertically, depending on their
specific needs. Additionally, Dask provides a familiar interface for working
with large datasets, which closely mirrors popular libraries like NumPy and
pandas. This makes it easy for users to adopt Dask without the need to learn
a new API. Dask is also highly customizable, enabling users to define their
own task graphs, schedulers, and parallel algorithms. Finally, Dask integrates
seamlessly with other popular Python libraries, such as NumPy, pandas, and
scikit-learn, making it easy to use alongside existing workflows. Together, these features make Dask
a valuable tool for academic researchers in accounting and finance, enabling them to handle large,
complex datasets efficiently and to run computationally intensive analyses.
Dask provides three main schedulers for executing tasks concurrently:
Local Threads, Local Processes, and Distributed. The Local Threads scheduler
uses Python’s built-in threading module and is lightweight, making it suit-
able for smaller-scale parallelism. The Local Processes scheduler, on the other
hand, uses Python’s multiprocessing module and takes advantage of multiple
cores, providing better isolation between tasks but with potentially higher
overhead. The Distributed scheduler uses the dask.distributed module and
is designed to execute tasks concurrently on a cluster of machines. It offers
advanced features such as real-time monitoring, data locality, and resilience
to node failures, making it a powerful tool for large-scale parallelism.
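As a rough sketch of how the scheduler is chosen in practice (the file path and column names below are hypothetical), the same computation can be dispatched to any of the three schedulers at compute time:

import dask.dataframe as dd
from dask.distributed import Client

# Lazily read a set of CSV files into a Dask DataFrame (path and columns are assumptions)
df = dd.read_csv('D:/Data/trades_*.csv')
task = df.groupby('ticker')['price'].mean()

# Local Threads scheduler
result = task.compute(scheduler='threads')

# Local Processes scheduler
result = task.compute(scheduler='processes')

# Distributed scheduler: start (or connect to) a dask.distributed cluster
client = Client()          # pass an address here to connect to a remote cluster
result = task.compute()    # with a Client active, the distributed scheduler is used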
Dask is a versatile Python library that is ideal for working with large
and complex datasets. Its ability to parallelize tasks and efficiently handle
intermediate results makes it a powerful tool for various data processing
tasks. Researchers in accounting and finance can use Dask for data prepro-
cessing and cleaning, machine learning and model training, image and signal
processing, as well as running large-scale simulations and models. With its
range of functionalities, Dask can help researchers save time and effort when
working with large datasets and enable them to draw meaningful insights and
conclusions from their research.
Matplotlib
To create a simple line plot, use the plot() function and specify the x and y
data points:
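A minimal example (the data points are illustrative):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.show()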
This will display a simple line plot of the specified data points. You can easily
customize your plots by modifying various parameters:
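One possible version, reusing the x and y lists from above:

plt.plot(x, y, color='red', linestyle='--', marker='o')
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Customized Line Plot')
plt.show()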
This example demonstrates how to change the line color, style, and marker,
as well as add axis labels and a title.
Despite its numerous advantages, Matplotlib has some drawbacks. The
library’s syntax can be complex and unintuitive, especially for beginners,
leading to a steep learning curve. Moreover, the need for extensive customiza-
tion to achieve desired visualizations can result in verbose code, which
might hinder readability and maintainability. While Matplotlib is capable of
producing 3D visualizations, its primary focus is on 2D graphics, limiting its
applicability in certain scenarios. Furthermore, in comparison with modern
web-based visualization libraries, such as Plotly and Bokeh, Matplotlib’s inter-
activity features may be lacking, potentially making it less suitable for creating
interactive dashboards and web applications.
SciPy
To get started with SciPy, you first need to install the library using the
following command:
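pip install scipy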
Once installed, you can import the library in your Python script using:
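import scipy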
StatModels
Once installed, you can import the library in your Python script as follows:
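import statsmodels.api as sm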
The StatModels library offers a variety of tools for time series analysis, such
as autoregression (AR), moving average (MA), and seasonal decomposition
of time series (STL). These tools can help forecast future values, detect
seasonality, and identify trends in your data.
For instance, to perform seasonal decomposition on a time series dataset,
you can use the following code:
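A sketch of what this might look like; the CSV file, column name, and monthly frequency are assumptions:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Load a time series indexed by date (file and column names are hypothetical)
data = pd.read_csv('revenue.csv', index_col='date', parse_dates=True)

# Decompose the series into trend, seasonal, and residual components
decomposition = sm.tsa.seasonal_decompose(data['revenue'], model='additive', period=12)
decomposition.plot()
plt.show()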
PyMC3
PyMC3's features make it a powerful tool for Bayesian statistical modeling and
probabilistic machine learning in accounting and finance research.
To start using PyMC3, install the library using pip or conda:
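pip install pymc3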
or
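conda install -c conda-forge pymc3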
PyMC3 offers several advantages for users working with Bayesian modeling
and probabilistic programming. The library’s intuitive context-based syntax
simplifies the process of specifying complex models, making it accessible
to beginners and experts alike. With its extensive collection of probability
distributions and automatic differentiation capabilities, PyMC3 streamlines
model implementation by reducing the need for manual gradient calcula-
tions. Moreover, the library’s support for advanced sampling methods, such
as NUTS, and variational inference techniques, such as ADVI, allows for
efficient and versatile handling of complex posterior distributions.
One of the primary drawbacks of PyMC3 is its reliance on Theano, a discontinued
library, as its computational backend. While later releases of the library address
this by moving to an actively maintained successor of Theano, it may still present
challenges for users working with legacy code. Further-
more, as PyMC3 is a specialized library for Bayesian modeling, it may not
be the best choice for users seeking to perform frequentist statistical analysis
or those requiring a broader range of machine learning tools. In such cases,
other libraries like SciPy or scikit-learn might be more appropriate.
Scikit-Learn
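A sketch of the workflow described below; the dataset, scaler, and classifier follow the text, while the split ratio, number of neighbors, and random seed are assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the iris dataset and split it into training and testing sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the input features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a k-nearest neighbors classifier on the scaled training data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict on the test set and evaluate accuracy
y_pred = knn.predict(X_test_scaled)
print('Accuracy:', accuracy_score(y_test, y_pred))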
This code snippet highlights the essential steps in a typical machine learning
workflow with scikit-learn. It starts by loading the iris dataset and splitting
it into training and testing sets. The data is then preprocessed using feature
scaling to standardize the input features, improving the model’s performance.
The k-nearest neighbors algorithm is chosen as the classifier and trained on
the scaled training data. Finally, the trained model is used to make predictions
on the test set, and its accuracy is evaluated, providing an estimate of the
classifier’s performance on unseen data.
Scikit-learn is a widely used machine learning library in Python that
offers a broad range of machine learning algorithms, preprocessing tools,
and model evaluation techniques. Its simple and intuitive interface makes it
easy for beginners to use machine learning algorithms without requiring in-
depth knowledge of the underlying mathematics. Scikit-learn has an extensive
documentation library that provides detailed explanations of the algorithms,
examples of their usage, and references to the underlying research papers.
Moreover, scikit-learn has a vibrant community that is always ready to help
with any issues or questions users may have.
However, scikit-learn does have some limitations. For example, it is not
designed for deep learning, which requires more advanced algorithms and
computational power than traditional machine learning. While scikit-learn does support
some neural network models, such as multilayer perceptrons, users may find dedicated
deep learning frameworks, such as TensorFlow or PyTorch,
more suitable for their needs. Additionally, scikit-learn is not designed for
big data analysis and may not be suitable for datasets with millions of rows or
high-dimensional data. In such cases, users may need to consider distributed
computing frameworks, such as Apache Spark or Dask, that are specifically
designed for big data processing.
TensorFlow
After installation, TensorFlow can be imported into the Python script as:
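import tensorflow as tf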
PyTorch
Keras
However, there are also several drawbacks associated with using WRDS
data. One significant limitation is the cost of access. Subscribing to WRDS
can be financially burdensome for smaller academic institutions or indepen-
dent researchers due to its high subscription fees. Moreover, some researchers
may encounter difficulties with the platform’s interface and user experience,
finding it less intuitive compared to other data platforms, which can impede
their work.
Another potential disadvantage of utilizing WRDS is the restricted avail-
ability of certain datasets. Although WRDS offers access to an extensive
range of data, certain datasets may be inaccessible due to restrictions imposed
by data providers or regulatory agencies. Consequently, this limitation may
restrict the scope of research projects and necessitate researchers to comple-
ment their analysis with data obtained from alternative sources.
Despite these disadvantages, the comprehensive nature and data quality
offered by WRDS establish it as an invaluable resource for academic research.
Armed with the appropriate tools and expertise, researchers can utilize
WRDS data to extract valuable insights and make significant contributions to
their respective fields. Furthermore, it is worth noting that WRDS is acces-
sible to a majority of universities. In order to employ Python for WRDS,
researchers need to install the WRDS library through the pip install method.
The code to download data from Compustat Annual in CSV format would be:
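A sketch consistent with the walkthrough that follows; the output path and the exact filter values in the condition are assumptions:

import wrds

# Establish a connection; you will be prompted for your WRDS username and password
db = wrds.Connection()

# Variables to download and additional query conditions (format, consolidation, currency)
vars = ['gvkey', 'datadate', 'fyr', 'sale', 'at']
condition = "datafmt='STD' and consol='C' and curcd='USD'"

# Build and run the SQL query against Compustat North America Annual (comp.funda)
query = f"select {', '.join(vars)} from comp.funda where {condition}"
data = db.raw_sql(query)

# Save the result as a CSV file without the DataFrame index (output path is an assumption)
data.to_csv('D:/Data/compustat_annual.csv', index=False)

db.close()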
This code facilitates the retrieval of data from the Compustat North America
Annual database on WRDS, subsequently saving it in CSV format. To estab-
lish a connection with the WRDS database, users are prompted to input their
WRDS username and password.
In this code, specific variables, namely gvkey, datadate, fyr, sale, and at,
are designated for download. It is possible to specify any variables available
in the Compustat Annual dataset. These variables are incorporated into the
SQL query through the vars list. Additionally, the condition variable permits
the inclusion of supplementary query conditions, such as data format, source,
and currency.
The raw_sql method is employed to extract data from the database. The
query is formulated based on the predetermined variables and conditions,
and the resulting data is stored in the data variable. Subsequently, the data is
saved in CSV format within the specified directory using the to_csv method.
Setting the index parameter to False keeps the DataFrame index out of
the saved file. To conclude the process, the connection with the
WRDS database is terminated using the close method. The generated CSV
file can be imported into various software applications such as SAS, Stata,
Excel, or any other relevant software for further analysis.
In situations where data for a specific time period is required, the code
can be enhanced by incorporating start and end dates as illustrated in the
following example:
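For example (dates are illustrative):

start_date = '2015-01-01'
end_date = '2020-12-31'
condition = ("datafmt='STD' and consol='C' and curcd='USD' "
             f"and datadate between '{start_date}' and '{end_date}'")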
If selection of a specific period based on the fiscal year (fyear) is required instead,
only a minor modification to the condition is necessary, for example:
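# Select on fiscal year instead of datadate (years are illustrative)
condition = "datafmt='STD' and consol='C' and curcd='USD' and fyear between 2015 and 2020"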
The query does not need to include the variable used for selection in the list of
downloaded variables (here, selection is based on fyear, which is not downloaded),
but that variable must exist in the WRDS table being queried (comp.funda in this
case). Similarly, share price data can be downloaded from the CRSP Daily Stock
File using the following code:
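A sketch along the lines described below; the variable list, date range, and output path are assumptions:

import wrds
import pandas as pd
from datetime import datetime

db = wrds.Connection()

# Variables to download from the CRSP Daily Stock File and a date-range condition
vars = ['permno', 'date', 'prc', 'ret', 'vol', 'shrout']
condition = "date between '2015-01-01' and '2020-12-31'"

query = f"select {', '.join(vars)} from crsp.dsf where {condition}"
data = db.raw_sql(query)

data.to_csv('D:/Data/crsp_daily.csv', index=False)
db.close()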
The presented code facilitates the retrieval of data from the CRSP Daily
Stock File available on the Wharton Research Data Services (WRDS) plat-
form. Prior to utilizing the script, a WRDS account is required, and users
are prompted to input their username and password for authentication. The
script commences by importing essential libraries, including "wrds" for estab-
lishing a connection with WRDS, "pandas" for data manipulation, and
"datetime" for handling date and time information. Following the establish-
ment of a connection using the user-provided credentials, the script specifies
the variables to be downloaded and sets a condition for the query, limiting
the data to a specific date range.
Subsequently, the script employs the "raw_sql" method from the "wrds" library to
download the data and stores it in a CSV file within the designated directory.
Finally, the connection with WRDS is closed using the "close" method.
In the event that data from IBES is required, an alternative code would be:
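A sketch of such a script; the IBES table and column names used here (ibes.actu_epsus with ticker, cusip, anndats, pends, and value) are assumptions and should be checked against the WRDS documentation:

import wrds
import pandas as pd

db = wrds.Connection()

# Date range for the query (dates are illustrative)
start_date = '2015-01-01'
end_date = '2020-12-31'

# Actual EPS observations from the IBES actuals file (table and column names are assumptions)
query = f"""
    select ticker, cusip, anndats, pends, value
    from ibes.actu_epsus
    where measure = 'EPS'
      and anndats between '{start_date}' and '{end_date}'
"""
data = db.raw_sql(query)

data.to_csv('D:/Data/ibes_eps.csv', index=False)
db.close()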
This Python code uses the WRDS (Wharton Research Data Services) module
and Pandas library to download financial data from the Institutional Brokers
Estimate System (IBES) database. The first step is to import the necessary
libraries—wrds and pandas. Then, a connection is established to WRDS
using the user’s credentials. Next, the desired start and end dates are defined
for the data. The SQL query to retrieve the data from the IBES database
is defined using the start and end dates as conditions. In this example, the
query is retrieving actual earnings per share (EPS) data for a specified date
range, but the query can be customized as needed. After the query is defined,
it is executed and the data is retrieved into a pandas dataframe. Finally, the
dataframe is written to a CSV file.
To download data from Thomson Reuters Mutual funds the following
code would be useful:
This code downloads data from the Thomson Reuters Mutual Funds database
using the WRDS (Wharton Research Data Services) platform. First, the
required libraries are imported: the wrds library for establishing the connec-
tion to the WRDS server, and the pandas library for data manipulation. The
code then prompts the user to enter their WRDS username and password,
and establishes a connection with the WRDS server using these creden-
tials. The code then specifies the variables to download from the ’mfnav’
table of the ’finfund’ library, which include the ’crid’, ’cusip’, ’sedol’, ’fdate’,
’nav’, ’pnav’, and ’tret’. Additionally, the code specifies a condition for the
query, which limits the data to the date range between January 1, 2010, and
December 31, 2020. Using the ’get_table’ function, the code downloads the
data from the Thomson Reuters Mutual Funds database, limiting the results
to the first 100 observations. Finally, the data is saved in a CSV file in the
specified directory, and the connection to the WRDS server is closed using
the ’close’ function.
This code can be modified by changing the specified variables and condi-
tions to download different data from the Thomson Reuters Mutual Funds
database.
This chapter elucidates the process of accessing structured data through
the utilization of Python and the WRDS platform. The fundamental aspects
encompassed in this chapter involve the acquisition of data from diverse
databases offered by WRDS, namely CRSP, Compustat, IBES, and Thomson Reuters.
# Loop through each ticker symbol and download its 10-K report using the downloader object
for t in for_ticker:
    downloader.get("10-K", t)
    print(str(t) + ' is downloaded')
This Python code downloads the 10-K filings for a list of companies from
the SEC’s EDGAR database. It uses the sec_edgar_downloader module to
download the filings and creates a Downloader object with a directory (dir1)
to store the downloaded files. An empty DataFrame is created using pandas.
The code then reads a list of ticker symbols and their associated date ranges
from an Excel file named Tickers.xlsx in directory dir2 using the pandas
module. The Excel file Tickers.xlsx must have a column named Ticker that lists the
ticker symbols of the companies whose filings are to be downloaded from SEC
EDGAR. The Ticker column from the DataFrame
is extracted as a pandas series. The for loop then iterates over each ticker
symbol in the series and downloads its 10-K report using the downloader
object. For each company, the get() method of the Downloader object is
called with the "10-K" parameter and the ticker symbol. The print() function
is called within the loop to indicate that the filing for that company has been
downloaded.
After all the tickers have been processed, the loop ends, and the code prints
a message to indicate that the download is complete. However, it is important
to note that this code does not download filings within a specific date range,
unlike the previous code we discussed.
The sec_edgar_downloader library can be installed using:
pip install sec_edgar_downloader
Useful Modifications
Limiting the Period
The previous code downloads all available filings. Most of the time in accounting
and finance research, we need data for a specific period or specific financial years.
Unfortunately, SEC EDGAR does not allow sorting of filings by financial year, but
you can download the filings submitted between two dates, which can serve as an
approximate, though imperfect, proxy for a financial year. To do this you need the
datetime module, which ships with Python's standard library and requires no
installation. Also, in the Tickers.xlsx file you need to create two additional
columns, Before and After, that hold the relevant dates for the period in date
format. The code would be:
import sec_edgar_downloader
import pandas as pd
from datetime import datetime

# Iterate over the ticker list and download the 10-K reports with the specified date range
for t, a, b in zip(for_ticker, after_date, before_date):
    downloader.get("10-K", t, after=a, before=b)
    print(f"{t} is downloaded")
This Python code downloads 10-K filings for a list of companies from
the SEC’s EDGAR database. It uses the sec_edgar_downloader module to
download the filings, the pandas module for data manipulation, and the date-
time module to convert dates. The code reads a list of ticker symbols and their
associated date ranges from an Excel file named Tickers.xlsx, extracts the
ticker symbols, after dates, and before dates from the file, and converts them
to lists. The for loop then iterates over the ticker symbols, after dates, and
before dates simultaneously using the zip() function. For each ticker symbol,
the get() method of the Downloader object is called with the relevant param-
eters to download the filings for that company within the specified date range.
After all the companies have been processed, the loop ends and the code
prints a message to indicate that the download is complete.
The files downloaded by the previous code contain numerous HTML tags and are not
usable in this format. We need to remove those tags to make the files readable.
The code to clean these files would be:
import pandas as pd
import os
from bs4 import BeautifulSoup
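Continuing from these imports, a sketch of the cleaning loop described below (the root directory is an assumption):

# Root directory containing the downloaded .txt filings (path is an assumption)
root_dir = 'D:/Data'

# Traverse the root directory and all of its subdirectories
for root, dirs, files in os.walk(root_dir):
    for filename in files:
        if filename.endswith('.txt'):
            file_path = os.path.join(root, filename)
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            # Parse the contents and strip all HTML tags
            cleaned_content = BeautifulSoup(content, 'html.parser').get_text()
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(cleaned_content)
            print(f'{filename} has been cleaned')

print('Cleaning complete')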
This code is written to clean all the HTML tags from text files with .txt exten-
sion in a specified directory and all its subdirectories. The os.walk() function
is used to traverse through all the directories and subdirectories within the
specified root directory. Then, a for loop is used to iterate over all the files in
each directory. The if statement checks if the file has a .txt extension.
If the file is a text file, it is opened using the open() function, and its
content is read using the read() method. The BeautifulSoup() function from
the bs4 library is then used to parse the content of the file and remove all
the HTML tags, and the cleaned content is saved to the variable cleaned_
content. Finally, the cleaned content is written back to the original file using
the write() method, and a message is printed to indicate that the file has been
cleaned. Once all the files have been cleaned, a message is printed to indicate
that the cleaning process is complete.
A single-stage code that downloads and cleans all the files would be more
efficient. The code is:
import sec_edgar_downloader
import pandas as pd

# Iterate over the ticker list and download the 10-K reports with the specified date range
for t, a, b in zip(for_ticker, after_date, before_date):
    # Download the 10-K reports with the specified date range
    downloader.get("10-K", t, after=a, before=b)
    # Print a message for each ticker that has been downloaded
    print(f"{t} is downloaded")
With the right techniques and tools, researchers can overcome these challenges and gain
fresh insights into the field of accounting. The techniques presented in this
chapter are designed to assist researchers, both novice and experienced, in
obtaining and processing data more efficiently, enabling them to conduct
comprehensive and insightful analyses. Through the application of these tech-
niques, researchers can harness the vast amount of information available on
the internet and obtain novel perspectives on the accounting profession.
The subsequent sections will provide a step-by-step guide on website
scraping, HTML tag cleansing, table handling, and management of login
credentials. Additionally, best practices for data processing and analysis will
be discussed. By adhering to these instructions and best practices, researchers
can overcome common challenges and effectively utilize data from websites
that necessitate scraping and cleaning procedures.
import requests
from bs4 import BeautifulSoup
import os
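A sketch of the script described below; the URL pattern and the output directory are assumptions and must be adapted to the actual site:

# Assumed URL pattern for the Stanford Securities Class Action Database; verify before use
base_url = 'https://securities.stanford.edu/filings-case.html?id='

# Directory where the text files will be saved (an assumption)
output_dir = 'D:/Data/SCA'
os.makedirs(output_dir, exist_ok=True)

# Loop over the range of IDs, fetch each page, and save its text content
for case_id in range(174001, 300001):
    url = base_url + str(case_id)
    response = requests.get(url)
    # Extract the text and remove the HTML tags
    text = BeautifulSoup(response.content, 'html.parser').get_text()
    filename = os.path.join(output_dir, f'{case_id}.txt')
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(text)
    print(f'Saved {filename}')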
This Python code scrapes data from a website, processes it, and saves it
to text files. The script loops over a range of IDs from 174001 to 300000,
constructs the URL for each ID, sends a GET request to the website and
retrieves the HTML content. It then uses the BeautifulSoup module of bs4
library to extract the text from the HTML content and remove HTML tags.
The script creates a directory where the text files will be saved and constructs
the filename for each ID using the directory path and the ID number. It then
writes the text data to a file and prints a message to confirm that the file was
saved.
The purpose of this script is to collect data on the Stanford Securities
Class Action Database, which contains information on securities class action
lawsuits. The data is stored in separate text files, each corresponding to a
specific ID number. The script can be modified to adjust the range of ID
numbers to scrape, the directory where the text files are saved, and the
filename format for the text files.
Here we have used 300,000 as an upper bound larger than the expected last page
number in place of the actual last page number. Many websites return an error
response when a non-existent page number is requested, and this code handles that
automatically. Some websites, including the Stanford Securities Class Action
Database, instead generate a page stating that the record does not exist; the code
will download that page as a separate text file, which later needs to be deleted
based on file size or some other criterion. In either case, looping far beyond the
actual last page number takes more time, so it is always better to check the
actual last page number wherever possible.
import pandas as pd
import yfinance as yf
import pytz
import os
# Set the start and end times in Eastern Standard Time (EST)
start_time = pd.Timestamp('2023-03-13 10:00:00',
tz='US/Eastern')
end_time = pd.Timestamp('2023-03-13 10:15:00', tz='US/Eastern')
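A sketch of the remainder of the script as described below; the Excel path, output directory, and the one-minute interval are assumptions:

# Read the ticker symbols from the Excel file in dir1 (path is a placeholder)
tickers = pd.read_excel('<dir1>/Tickers.xlsx')['Ticker'].tolist()

# Directory where the CSV files will be saved (an assumption)
output_dir = 'D:/Data/Intraday'
os.makedirs(output_dir, exist_ok=True)

for ticker in tickers:
    # Download price data between the two timestamps (interval choice is an assumption)
    data = yf.download(ticker, start=start_time, end=end_time, interval='1m')
    filename = os.path.join(
        output_dir,
        f"{ticker}_{start_time:%Y%m%d%H%M}_{end_time:%Y%m%d%H%M}.csv")
    data.to_csv(filename)
    print(f'Saved data for {ticker} to {filename}')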
This code downloads financial data for a list of tickers from Yahoo Finance
and saves it to a CSV file. The tickers are read from an Excel file that has
a column Ticker for all the tickers, in dir1 directory, and the start and end
times are specified in Eastern Standard Time (EST). The code uses the pandas
library to read the tickers from the Excel file, and the yfinance library to
download the financial data. The pytz library is used to specify the timezone
for the start and end times, and the os library is used to create directories and
file paths.
The code loops over each ticker in the Excel file, and downloads the data
using the yf.download() function. The data is then saved to a CSV file using
the pandas DataFrame.to_csv() function. The CSV file is saved in a directory
specified by the user, with a filename that includes the ticker symbol and the
start and end times. After each file is saved, the code prints a message to
confirm that the data was saved. For different times and timezones, start_
time, end_time, and tz can be modified.
One of the key challenges with working with financial data is that it is
often timestamped in a particular timezone. This can create issues when
working across multiple timezones, as the time of day will be different for
different parts of the world. In the case of the yfinance library, it is important
to note that the data is timestamped in the timezone of the exchange where
the stock is traded. This means that when downloading data for a particular
stock, it is important to set the start and end times in the timezone of that
exchange. Furthermore, it is important to consider whether the exchange is
open or closed at the time you are requesting data, as this will impact the
availability and accuracy of the data.
Data on Cryptocurrency
Cryptocurrency research is a rapidly emerging field in accounting and
finance. However, one major challenge researchers face is the lack of reliable
data. While many exchanges trade cryptocurrencies 24 hours a day, obtaining
and interpreting daily open and close price data can be difficult. In the world
of crypto, real-time data is often more relevant due to the high levels of
volatility.
Fortunately, researchers can leverage tools like Python and the Pycoingecko
library to access real-time cryptocurrency data. With this library, researchers
can obtain data on prices, market capitalizations, trading volumes, and other
metrics for a variety of cryptocurrencies. In this section, we provide Python code
to obtain crypto data using the Pycoingecko library. The code allows you to
retrieve data between two specified dates, making it
easier to analyze how prices, volumes, and other metrics have changed over
time. By combining real-time data with other analytical tools, researchers can
gain valuable insights into the behavior of cryptocurrency markets.
import csv
import datetime
from pycoingecko import CoinGeckoAPI
import os
import pandas as pd
# Set the start and end dates for the historical data
start_date = datetime.datetime(2023, 1, 1)
end_date = datetime.datetime(2023, 1, 15)
# Read the crypto_id column from the Excel file and store as a
list
df = pd.read_excel("<dir1>/crypto.xlsx")
crypto_ids = df["crypto_id"].tolist()
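A sketch of the remainder of the script; this version uses a simple dictionary lookup rather than the nested list comprehension mentioned below, and the output folder is a placeholder:

cg = CoinGeckoAPI()

# Convert the start and end dates to Unix timestamps
from_ts = int(start_date.timestamp())
to_ts = int(end_date.timestamp())

# Output folder for the CSV files (placeholder)
output_dir = '<dir2>'
os.makedirs(output_dir, exist_ok=True)

for crypto_id in crypto_ids:
    # Historical prices, market caps, and volumes in USD for the date range
    history = cg.get_coin_market_chart_range_by_id(
        id=crypto_id, vs_currency='usd',
        from_timestamp=from_ts, to_timestamp=to_ts)
    market_caps = dict(history['market_caps'])
    total_volumes = dict(history['total_volumes'])

    with open(os.path.join(output_dir, f'{crypto_id}.csv'), 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Date', 'Price', 'Market Cap', 'Total Volume'])
        for ts, price in history['prices']:
            date = datetime.datetime.fromtimestamp(ts / 1000).strftime('%Y-%m-%d %H:%M')
            writer.writerow([date, price, market_caps.get(ts), total_volumes.get(ts)])
    print(f'Saved data for {crypto_id}')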
This code retrieves historical price, market cap, and volume data for cryp-
tocurrencies and saves it as CSV files. It uses the CoinGecko API to retrieve
the data and the pandas library to read the crypto_id column from an Excel
file in <dir1> directory. First, the start and end dates for the historical data
are set using datetime. These dates are then converted to Unix timestamps
using the int() function. Next, the crypto_id column is read from the Excel
file using the pandas library and stored as a list. This list is used in a for loop
to retrieve the historical data for each cryptocurrency. For each cryptocur-
rency, the CoinGeckoAPI is used to retrieve the historical data in USD for
the specified date range. The data is then saved as a CSV file in a specified
folder <dir2> using the csv library.
The CSV file contains four columns: Date, Price, Market Cap, and Total
Volume. The data for the Date and Price columns are retrieved directly
from the historical price data returned by the CoinGeckoAPI. The data
for the Market Cap and Total Volume columns is retrieved using a nested
list comprehension to search for the corresponding data in the market_
caps and total_volumes fields returned by the API. Finally, a message is
printed to the console indicating that the data has been saved for the current
cryptocurrency.
import urllib.request

url = 'https://www1.ncdc.noaa.gov/pub/data/cdo/samples/PRECIP_HLY_sample_csv.csv'
filename = 'D:/NOAA/PRECIP_HLY_sample_csv.csv'
urllib.request.urlretrieve(url, filename)
This code downloads daily summary data from the National Oceanic and
Atmospheric Administration (NOAA) using Python. First, it downloads a
sample CSV file from NOAA using the urllib library. It then defines the FTP
server details, data directory, file names, and date range for the data. The
pandas library is used to loop through each date in the specified range.
For each date, the code creates the file name for the current date, creates
the directories if they don’t exist, and downloads the corresponding file from
the NOAA FTP server using the ftplib library. The file is saved with the
appropriate name and in the appropriate directory. The FTP connection is
closed at the end of the loop. The code downloads the desired data into the
"D:/NOAA" folder.
This code demonstrates how to automate the process of downloading large
amounts of daily summary data from NOAA for research purposes. It is
important to note that the code can be modified to download other types
of data from NOAA by changing the file names and directory structure.
Twitter Data
Twitter data has gained increasing attention among accounting and finance
academic researchers due to its real-time nature and the vast amount of
information that can be extracted from it. It provides a unique opportu-
nity to study the opinions and sentiments of individuals, organizations, and
the broader market in real time. Researchers have used Twitter data to study
various accounting and finance-related topics, such as financial market reac-
tions to corporate announcements, stock price prediction, sentiment analysis
of earnings calls, and disclosure behavior of firms. Twitter data can also
provide valuable insights into consumer behavior and trends, as well as the
impact of news events on the stock market. The use of Twitter data in
accounting and finance research is expected to continue to grow as new tools
and techniques are developed to extract meaningful insights from the vast
amount of data available. Twitter data can be downloaded based on various
parameters such as:
• Locations (geotags)
• Language of the tweets
• Dates or time period
• Number of tweets to be downloaded
• Type of tweets (original tweets, retweets, replies)
• Filter by media types (photos, videos, etc.)
To download data from Twitter, you need to obtain API credentials from Twitter,
which are free for academic research. Here's a sample code that downloads all the
text tweets containing a company name during a particular time frame:
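A sketch of such a script; the credentials are placeholders, the dates are illustrative, and the exact query operators and fields are assumptions that depend on your level of API access:

import tweepy
import pandas as pd

# Authenticate with placeholder credentials (replace with your own keys)
auth = tweepy.OAuth1UserHandler('API_KEY', 'API_SECRET', 'ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# Company names from the Excel file described below
companies = pd.read_excel('Companies.xlsx')['Company_name'].tolist()

# Search window (dates are illustrative)
since_date = '2023-03-01'
until_date = '2023-03-15'

tweets = []
for name in companies:
    query = f'"{name}" since:{since_date} until:{until_date}'
    for page in tweepy.Cursor(api.search_tweets, q=query, lang='en',
                              tweet_mode='extended', count=100).pages():
        for tweet in page:
            tweets.append([tweet.id, tweet.full_text, tweet.created_at,
                           tweet.user.screen_name, tweet.user.followers_count])

df = pd.DataFrame(tweets, columns=['id', 'text', 'created_at',
                                   'screen_name', 'followers_count'])
df.to_csv('tweets.csv', index=False)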
The above code demonstrates how to use the Twitter API and Python’s
Tweepy library to download tweets from Twitter based on specific search
query parameters. The code first authenticates the API using the provided
API credentials. It then reads a list of company names from an Excel file named
Companies, which has a Company_name column, and defines the search query parameters
based on those names, as well as a start and end date for the search. The code creates an
empty list to store the retrieved tweets and loops through the pages of search
results, using the Tweepy Cursor object. For each page, it loops through the
individual tweets and appends them to the list. The code finally creates a
pandas dataframe from the retrieved tweet information, including the tweet
ID, text, creation date, user screen name, and user follower count. It then
writes this data to a CSV file for further analysis. This code provides a simple
example of how to use Python and Tweepy to access Twitter data for research
purposes.
Replies and retweets can be included in the downloaded data by adjusting the
search query parameters.
Keep in mind that including replies and retweets may increase the amount
of data downloaded and may require additional processing steps to filter and
clean the data.
Researchers can analyze the search patterns related to major news events such as mergers
and acquisitions, product recalls, and economic policy changes. This can
provide valuable insights into how consumers and investors react to these
events and can inform business strategies and public policy decisions. Thus,
Google Trends data is a powerful tool for academic accounting and finance
research, offering unique insights into consumer behavior, market trends, and
the performance of financial markets.
Google Trends data can be downloaded based on many parameters, such as keywords, time range, geographic region, category, and search type (web, news, images, or YouTube).
import pandas as pd
from pytrends.request import TrendReq
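A sketch of the script described below; the Excel path, keyword column name, time range, and output directory are assumptions:

import os

# Keywords from an Excel file and the output directory (paths and column name are assumptions)
keywords = pd.read_excel('D:/Data/Keywords.xlsx')['Keyword'].tolist()
output_dir = 'D:/Data/Trends'

# Connect to Google Trends and define the time range (dates are illustrative)
pytrends = TrendReq(hl='en-US', tz=360)
timeframe = '2023-01-01 2023-03-31'

trends_data = pd.DataFrame()
for kw in keywords:
    # Build the query for one keyword and download its interest-over-time series
    pytrends.build_payload([kw], timeframe=timeframe)
    data = pytrends.interest_over_time()
    data['keyword'] = kw
    trends_data = pd.concat([trends_data, data])

# Reset the index and save everything to a single CSV file
trends_data = trends_data.reset_index()
os.makedirs(output_dir, exist_ok=True)
trends_data.to_csv(os.path.join(output_dir, 'google_trends.csv'), index=False)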
This Python code downloads Google Trends daily data for a list of keywords
in an Excel file. The code uses the pandas and pytrends libraries to download
and process the data.
First, the keywords are read from an Excel file, and the output directory
is specified. A TrendReq object is created to establish a connection with the
Google Trends API, and the time range for the data is defined. The code then
loops through each keyword in the list, builds the keyword query list, and
downloads the interest over time data using the interest_over_time method
of the pytrends object. The keyword column is added to the data and the data
is appended to a trends_data dataframe. Finally, the index of the trends_
data dataframe is reset and the data is saved to a CSV file in the specified
output directory. If the output directory does not exist, it is created using
os.makedirs(). The output file contains the daily search interest for each
keyword in the time range specified, which can be used for further analysis
in accounting and finance academic research.
In this chapter, we have learned how to access structured and unstructured
data from various sources using Python. We have covered different types of
data sources such as financial databases, social media platforms, weather and
climate data, and Google Trends data. We have also seen how to download
data from these sources using different Python libraries and tools, such as wrds,
Tweepy, and pytrends, as well as NOAA's public data services.
Accessing data is a crucial step in any academic research, and Python
provides a powerful set of tools to help researchers access data from different
sources. With the knowledge gained in this chapter, academic researchers
can now access and analyze data from different sources to gain insights and
conduct research in various fields such as finance, accounting, and sustain-
ability. The next chapter will build on this knowledge by showing how to
clean, preprocess, and analyze data using Python.
8 Text Extraction and Cleaning
In the field of academic accounting and finance research, the analysis of vast
amounts of unstructured textual data is growing in significance. However,
working with such data can be challenging due to the presence of extra-
neous information, including HTML tags and formatting elements, which
can hinder text comprehension and analysis.
This chapter will primarily focus on techniques for extracting and cleaning
text, which are crucial for retrieving relevant information from extensive
volumes of unstructured textual data. These techniques empower researchers
to identify and extract specific sections of text, such as financial statements
or earnings announcements, while eliminating irrelevant or redundant infor-
mation. For instance, in the case of researching Managerial Discussion and
Analysis, researchers may choose to exclude all text in 10-K statements except
item 7.
Text extraction and cleaning techniques are essential for ensuring that
researchers work with pertinent data and can reveal valuable insights that
may not be discernible through traditional data analysis methods. These
techniques enable researchers to identify patterns in financial statements and
analyze market sentiment based on news articles and other textual data.
The following sections will delve into comprehensive explanations of text
extraction and cleaning techniques and explore their implementation using
Python. By the conclusion of this chapter, readers will have a comprehensive
understanding of how to extract and cleanse text data for utilization in various
natural language processing (NLP) applications in academic accounting and
finance research.
import re
# The pattern and the test string below are illustrative assumptions
pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'
text = 'Contact us at researcher@example.com for details.'
match = re.search(pattern, text)
if match:
    print('Valid email address found:', match.group())
else:
    print('Invalid email address')
import os
import re
# Loop through each file in the directory and extract the text between the markers
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        with open(os.path.join(directory, filename), "r") as f:
            # Read the file contents
            file_contents = f.read()
We define the directory path where our text files are located as dir_path.
m1 and m2 are the regex markers that we will use to extract the relevant
text. These markers can be any string or regular expression pattern that can
uniquely identify the start and end of the text we want to extract.
for root, dirs, files in os.walk(dir_path):
    for file in files:
        if file.endswith('.txt'):
            with open(os.path.join(root, file), 'r') as f:
                text = f.read()
Here, we are using the os.walk function to traverse the directory tree rooted
at dir_path and iterate over each file in the directory tree. We check if the
file has a .txt extension and if so, we open it in read mode using the with
statement. The next part of the code is:
            pattern = re.compile(f'{m1}(.*?){m2}', re.DOTALL)
            match = re.search(pattern, text)
            if match:
                extracted_text = match.group(1)
                with open(os.path.join(root, 'extracted_' + file), 'w') as outfile:
                    outfile.write(extracted_text)
This code should work on all text files. If there is a problem, check your regex markers.
import os
from bs4 import BeautifulSoup
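# Directory containing the source text files (the exact path is an assumption)
dir_path = 'D:/Data'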
This code sets the directory path where the source text files are located.
for filename in os.listdir(dir_path):
    if filename.endswith('.txt'):
        file_path = os.path.join(dir_path, filename)
This code loops through all files in the directory path and selects only the
ones that end with .txt. It then sets the file path for each selected file.
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
This code reads the contents of the selected file into a variable called text.
        soup = BeautifulSoup(text, 'html.parser')
        cleaned_text = soup.get_text()
This code uses BeautifulSoup to parse the HTML tags in text and extract
only the text content. The cleaned text is stored in a variable called cleaned_
text.
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(cleaned_text)
This code writes the cleaned text back to the original file, replacing the
previous contents.
It may be necessary to run the cleaning process more than once to refine the files
fully. This approach works well when the text files contain only HTML. In some
cases, however, the files contain both HTML and XML tags, which calls for the
lxml library.
import os
from lxml import etree
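A sketch of the lxml-based variant described below; the directory path is an assumption:

# Directory containing the text files (path is an assumption)
dir_path = 'D:/Data'

for filename in os.listdir(dir_path):
    if filename.endswith('.txt'):
        file_path = os.path.join(dir_path, filename)
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
        # Parse the (possibly malformed) HTML/XML and keep only the text content
        parser = etree.HTMLParser()
        tree = etree.fromstring(text, parser)
        cleaned_text = etree.tostring(tree, method='text', encoding='unicode')
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(cleaned_text)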
This code is similar to the previous one, with the main difference being
the use of the lxml library for parsing XML/HTML documents. The
etree.HTMLParser() method is used to create an HTML parser that can
handle malformed HTML, while the etree.fromstring() method is used to
parse the input text and create an element tree. The etree.tostring() method
is then used to extract the text content from the element tree, and the
resulting string is written back to the file.
Text extraction and cleaning is a crucial step in academic accounting and finance
research, enabling researchers to work with large volumes of unstructured textual
data and gain valuable insights that may not be apparent through traditional data
analysis methods. This process makes the data readable, reduces the amount of data
to be handled, and enables researchers to focus their analysis on the most
meaningful parts of the data. It is a necessary first step for any further
analysis. By accurately extracting relevant information from unstructured text,
researchers can focus on what is important to their research and apply more
advanced techniques such as sentiment analysis and topic modeling.
Lowercasing
Lowercasing is a fundamental textual operation employed in text normaliza-
tion, whereby all alphabetical characters within a given text are converted
to their lowercase equivalents. The purpose of this procedure is to facili-
tate the uniform treatment of various word forms as identical entities during
subsequent processing stages. Without lowercasing, words such as "Income" and
"income" are treated as separate entities, introducing potential errors or
inaccuracies in text analysis tasks such as text classification or sentiment
analysis. Applying lowercasing harmonizes these variations into a single
representation, enhancing the accuracy of such tasks.
Tokenizing
Tokenization is the fundamental procedure of decomposing a given text into
smaller entities known as tokens, typically encompassing words or sentences.
This process holds significant relevance in the realm of text manipulation,
enabling more streamlined analysis of textual data. Tokenization is applicable at both the word and the sentence level.
Stemming
Stemming is a fundamental procedure within the domain of natural language
processing (NLP) whereby inflected words are transformed into their respec-
tive base or root forms, resulting in what is termed a "stem". The primary
purpose of stemming is to establish a standardization of words and to
diminish the volume of words requiring processing in NLP applications. To
illustrate, the stem of terms such as "running", "runner", and "runs" is "run".
The utilization of stem forms permits the treatment of these words as iden-
tical entities within NLP applications, thereby enhancing both the accuracy
and efficiency of analyses. Several distinct algorithms have been developed for
the process of stemming, including the Porter stemmer, Snowball stemmer,
and Lancaster stemmer. Each algorithm adheres to distinct sets of rules and
produces dissimilar outcomes.
Lemmatization
Lemmatization is a technique employed in natural language processing (NLP)
to condense words to their base or dictionary form, referred to as the lemma.
Its objective is to cluster inflected word forms together for the purpose of
analyzing them as a unified entity. Lemmatization is regarded as a more
advanced approach than stemming, as it takes into account the surrounding
context of a word in order to ascertain the appropriate lemma. To illustrate,
the lemma for the words "am", "are", and "is" is "be". By applying lemma-
tization to a given text, the quantity of unique words within the corpus can
be diminished, thereby enhancing the accuracy of various text analysis tasks,
including text classification, sentiment analysis, and topic modeling. While
stemming involves the reduction of words to their base form through the
removal of suffixes or affixes, lemmatization determines the canonical form
of words by considering various factors such as morphology, context, and
part of speech. Thus, lemmatization is more informed by linguistic princi-
ples, leading to the production of valid words while preserving grammatical
accuracy.
In natural language processing, a corpus is a large and structured set of text
documents. It is used to train and evaluate models and algorithms for language
processing tasks such as text classification, information retrieval, sentiment
analysis, and machine translation. A corpus is formed after raw text data has been
collected and has undergone preprocessing steps such as cleaning, normalization,
and tokenization; these steps ensure that the corpus is clean and consistent so
that it can be used effectively. Once the text data has been cleaned and
normalized, it is typically organized into a structured format, such as a
document-term matrix or a set of preprocessed documents, which can then be used as
input to various NLP algorithms.
Here’s an example of a code that does all of the above in one go:
import os
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from string import punctuation
This part of code imports the following modules: os for file and directory
operations, nltk for natural language processing, stopwords and Porter-
Stemmer from the nltk.corpus and nltk.stem modules, respectively, for
removing stop words and stemming words, WordNetLemmatizer from the
nltk.stem module for lemmatization, and punctuation from the string
module for removing special characters and punctuation.
dir_path = r'D:/Data/Extracted'
output_path = r'D:/Data/Preprocessed'
This part of code then sets the paths to the input directory (dir_path)
containing the raw text files and the output directory (output_path) where
the preprocessed files will be stored.
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
In this part of the code, we define a set of English stop words using the
stopwords module of NLTK and initialize the Porter stemmer and WordNet lemmatizer
from the nltk.stem module.
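A sketch of the loop discussed next:

for root, dirs, files in os.walk(dir_path):
    for filename in files:
        if filename.endswith('.txt'):
            file_path = os.path.join(root, filename)
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read()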
Here, we are using the os.walk function to iterate over all the files in the
directory (dir_path). The os.path.join function is used to join the directory
path and file name to create the full file path. We only process files with
.txt extension. Inside the loop, we open each file in read mode and read its
contents into the text variable.
We then preprocess the text and write the preprocessed text to a new file. First,
we remove special characters and punctuation:
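A sketch of these steps, continuing inside the loop shown above (the output filename scheme is an assumption):

            # Remove special characters and punctuation
            text = ''.join([ch for ch in text if ch not in punctuation])
            # Convert all text to lowercase
            text = text.lower()
            # Tokenize the text into words
            words = nltk.word_tokenize(text)
            # Remove stop words
            words = [w for w in words if w not in stop_words]
            # Stem the remaining words
            words = [stemmer.stem(w) for w in words]
            # Write the preprocessed text to the output directory
            with open(os.path.join(output_path, filename), 'w', encoding='utf-8') as f:
                f.write(' '.join(words))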
In this part of the code, special characters and punctuation are removed
from the text using a list comprehension and the string.punctuation string.
Then, all text is converted to lowercase using the lower() function, which
ensures that case does not affect the analysis. The text is then tokenized into
words using the word_tokenize() function from the nltk module. The next
step involves removing stop words from the list of words using a list compre-
hension and the stop_words set. Finally, the words are stemmed using the
Porter stemmer, which reduces the inflected forms of the words to their base
or root form. Before this code can be run, the nltk library has to be installed
using pip, and the punkt, stopwords, and wordnet resources need to be downloaded
using the following code:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
There are cases when emoticons are relevant for the analysis of text
data. Emoticons are often used in social media and online communication
to convey sentiments and emotions, and removing them could potentially
remove important information from the text. In such cases, it might be neces-
sary to modify the preprocessing steps to retain emoticons or other relevant
symbols. However, it depends on the specific use case and the goals of the
analysis.
To retain all the emoticons in the text, we can modify the code to remove only the
punctuation marks that are not emoticons. We can create a list of such punctuation
marks and remove only those, for example: ".", ",", ":", ";", "?", "!", "-", "_",
"(", ")", "[", "]", "{", "}", "'", '"'. Here's how the modified code would look:
import os
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
This way, the $ and () around numbers will not be removed during text
normalization.
Now, in the next step, we remove the '(' and ')' characters and add a negative
sign to every number enclosed in parentheses:
import os
import re
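A sketch using a regular expression; the directory path is an assumption:

# Directory with the normalized text files (path is an assumption)
dir_path = 'D:/Data/Preprocessed'

for filename in os.listdir(dir_path):
    if filename.endswith('.txt'):
        file_path = os.path.join(dir_path, filename)
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
        # Replace a number wrapped in parentheses, e.g. (1,234.56), with -1,234.56
        text = re.sub(r'\((\d[\d,\.]*)\)', r'-\1', text)
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(text)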
Renaming Files
While creating a corpus, it is often necessary to rename the text files to better
reflect the content they contain. Renaming can help in maintaining consis-
tency and avoiding confusion while processing and analyzing the text data.
If the file names are descriptive and meaningful, it can be easier to identify
the contents of each file without having to open and read it. It can help in
organizing the files in a logical manner. For example, if the files are related
to a specific project or topic, renaming them with a common prefix or suffix
can help in grouping them together and identifying them as part of the same
collection. Renaming the files can also help in avoiding potential errors or
conflicts while processing the data. If the original file names contain special
characters or are too long, it may cause issues while accessing or manipulating
them programmatically. Renaming them with shorter and simpler names can
help in avoiding such errors.
To reflect the content of the file, in many cases, it may be necessary to
extract information from the file itself and use that information to name
the file. For example, if the file contains a news article, we may want to
extract the headline and use it as the filename. This can be achieved through
text processing techniques like regular expressions. By having descriptive file-
names, it becomes easier to locate and work with specific documents within
the corpus.
Here’s a code that renames all the text files in the directory and subdi-
rectories based on information in the file. In this example we have created
four markers (marker1 to marker4) using regex. We extract the text between
marker1 and marker2 and then between marker3 and marker4 and append
the extracted texts to create filename.
import os
import re
dir_path = 'D:/Data/Preprocessed/'
In this code, we set the directory path where the text files are stored in the
dir_path variable. Then, we use the os.listdir() function to iterate over all
files in the directory and check if the file ends with .txt extension. If yes, we
proceed with reading the contents of the file using the with open() statement.
Next, we define two regex patterns marker1 and marker2 to extract the
text between them from the file. We use the re.compile() function to create
a compiled regex pattern with re.DOTALL flag to match any character,
including newline. We then use the re.search() function to search for the
pattern in the text and extract the group between the markers using the
group() method. If the pattern is not found, we set the text1 variable to
an empty string. We repeat the same process to extract the text between
marker3 and marker4 and store it in the text2 variable. We then combine
the extracted text using a space and create a new string new_string.
Sorting Files
When creating a corpus, it is often necessary to include certain files while
excluding others. One way to accomplish this is by sorting the desired files
into a separate directory, leaving the undesired ones in their original location.
By separating the desired files in this way, it becomes easier to include only
the relevant files in the corpus for further analysis using Python. This can be
especially useful when dealing with large volumes of text data, as it helps to
streamline the data processing and analysis tasks.
For instance, when working with financial reports downloaded from SEC
EDGAR, you can sort 10K and 10Q files into separate folders using Python.
This can be achieved by searching for the keywords "10-K" or "10k" (for 10K
files), or "10-Q" or "10q" (for 10Q files) in the filenames of the downloaded
files, and copying the matching files to their respective destination folders.
Here’s a code that separates the 10-K and 10-Q files downloaded from SEC
EDGAR.
import os
import shutil

os.makedirs(os.path.dirname(dst_path), exist_ok=True)
shutil.move(src_path, dst_path)
print(f'Moved {src_path} to {dst_path}')

src_dir = 'D:/Data'
This code uses the os and shutil modules in Python to traverse the source
directory, search for files containing specific keywords, and move those files
to their corresponding destination directories. The move_files function is
responsible for the main logic of the script. It accepts three parameters: the
source directory, the destination directory, and a list of keywords. For each
file in the source directory, the function checks if the filename contains any
of the specified keywords, and if so, moves the file to the corresponding subdi-
rectory in the destination directory. The function also creates any necessary
subdirectories in the destination directory if they do not already exist.
Creating Corpus
Once we have sorted and renamed the text files, the next step is to create
a corpus from these files. A corpus is a fundamental resource for Natural
Language Processing (NLP) tasks, such as text classification, sentiment anal-
ysis, and information extraction. Each document in the corpus is typically
identified by a unique identifier, such as a filename. Here’s a code that creates
a corpus with the renamed files.
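A sketch consistent with the description that follows; the directory path is an assumption:

from nltk.corpus import PlaintextCorpusReader

# Directory containing the renamed, preprocessed text files (an assumption)
corpus_root = 'D:/Data/Preprocessed'
# Match all .txt files, excluding hidden files that start with a period
file_pattern = r'(?!\.)[\w-]+\.txt'

corpus = PlaintextCorpusReader(corpus_root, file_pattern)

# List the documents in the corpus and inspect the raw text of the first one
print(corpus.fileids())
print(corpus.raw(corpus.fileids()[0]))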
The code first defines the directory path containing the preprocessed text files
and sets the regular expression pattern to match all files with a .txt extension.
It then creates a corpus object using the PlaintextCorpusReader() func-
tion from the nltk library. The PlaintextCorpusReader() function takes two
arguments: the directory path and the regular expression pattern to match
the file names. The regular expression pattern in this code is r'(?!\.)[\w-]+\.txt',
which matches all file names with a .txt extension but excludes files that
start with a period (i.e., hidden files on Unix systems). The PlaintextCorpus-
Reader() function returns a corpus object, which can be used to access the
contents of the text files. For example, to access the raw text of a specific file
in the corpus, you can use the raw() method of the corpus object and specify
the file identifier (i.e., the file name with its extension) as the argument.
This corpus is stored in the memory of your Python environment. When
you create a corpus using the PlaintextCorpusReader function, it reads the
plaintext files in the specified directory and constructs the corpus in memory.
You can then access the corpus and perform various operations on it within
your Python code. However, the corpus is not saved as a separate file on your
hard drive, unless you explicitly save it using methods like pickle.dump().
The following code would save the corpus in the D:/Data folder:
import pickle
from nltk.corpus import PlaintextCorpusReader
corpus_root = r'D:/Data/Preprocessed'
file_pattern = r'(?!\.)[\w-]+\.txt'
corpus = PlaintextCorpusReader(corpus_root, file_pattern)
with open(r'D:/Data/my_corpus.bin', 'wb') as f:
    pickle.dump(corpus, f)
You can visualize this corpus as a table of two columns: the first column contains
the file names and the second contains the text of each file. You can access the
text of a specific file (example_file.txt) using the following code:
import pickle
from nltk.corpus.reader.api import CorpusReader
corpus_path = r'D:/Data/my_corpus.bin'
with open(corpus_path, 'rb') as f:
    corpus = pickle.load(f)
print(corpus.raw('example_file.txt'))
Matplotlib
Matplotlib is a widely used Python library for creating static, two-
dimensional visualizations. It provides a wide range of functions for creating
a variety of chart types, including scatter plots, line charts, bar charts,
histograms, and more. It can be used to create publication-quality plots for
use in academic research, as well as for creating visualizations for presentations
or reports.
Matplotlib is highly customizable, offering control over every aspect of a
visualization, from the font size to the line style. This makes it a powerful tool
for creating bespoke visualizations that suit the specific needs of a particular
research project. It also integrates well with other Python libraries, making it
easy to create visualizations that include data from multiple sources.
One of the key strengths of Matplotlib is its versatility. It can be used to
create simple visualizations quickly and easily, but it also offers a wide range
of advanced features for creating more complex visualizations. For example, it
can be used to create subplots, add annotations and labels, and create custom
color maps.
Matplotlib is also highly compatible with other Python libraries. It works
seamlessly with NumPy, Pandas, and SciPy, allowing researchers to quickly
create visualizations from their data. In addition, Matplotlib integrates with
Jupyter Notebook, a popular tool for data analysis and visualization, making
it easy to create interactive visualizations that can be shared with others.
A simple code using matplotlib that can make five different kinds of plots
is as follows:
import numpy as np
import matplotlib.pyplot as plt

# Create a histogram
x = np.random.normal(size=1000)
plt.hist(x, bins=30)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()
This code creates five different types of plots: a simple line plot, a scatter plot,
a bar chart, a histogram, and a pie chart. To create each plot, the code sets up
the data, specifies the plot type, adds axis labels and a title, and then displays
the plot using the show() function. You can use this code as a starting point
for creating your own visualizations in Matplotlib.
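Only the histogram portion of that listing is shown above; a minimal sketch of the remaining four plot types, using arbitrary illustrative data, would be:

import numpy as np
import matplotlib.pyplot as plt

# Line plot
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.title('Line Plot')
plt.show()

# Scatter plot
plt.scatter(np.random.rand(50), np.random.rand(50))
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot')
plt.show()

# Bar chart
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 8]
plt.bar(categories, values)
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Chart')
plt.show()

# Pie chart
plt.pie(values, labels=categories, autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()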
Apart from these, Matplotlib can create many other plot types. The following snippets use random data to create different plots:
Heatmap
import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(5, 5)
plt.imshow(data, cmap='hot', interpolation='nearest')
plt.colorbar()
plt.show()
3D Plot
Matplotlib can also create 3D plots, which can be useful for visualizing data
with multiple dimensions.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x = np.random.rand(50)
y = np.random.rand(50)
z = np.random.rand(50)
ax.scatter(x, y, z)
plt.show()
Box Plot
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(size=100)
plt.boxplot(data)
plt.show()
You can use your own data to make any plot using Matplotlib. Here’s a code
that uses data stored in D:/Data and makes a scatterplot:
import matplotlib.pyplot as plt
import pandas as pd
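A minimal continuation of that listing, assuming the data sit in a CSV file such as D:/Data/my_data.csv with the BTC and ETH columns used later in this book, might be:

# Hypothetical file and column names; adjust to your own dataset
data = pd.read_csv('D:/Data/my_data.csv')
plt.scatter(data['BTC'], data['ETH'])
plt.xlabel('BTC')
plt.ylabel('ETH')
plt.title('Scatterplot of BTC and ETH')
plt.show()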
Seaborn
Seaborn is a Python data visualization library built on top of Matplotlib.
It provides a high-level interface for creating a wide range of statistical
graphics, including heatmaps, scatterplots, line charts, and more. Seaborn
is particularly well-suited for visualizing complex, multidimensional data,
and it provides many advanced features for customizing visualizations and
performing statistical analysis.
One of the key strengths of Seaborn is its support for working with Pandas
dataframes. Seaborn’s functions are designed to work seamlessly with Pandas
dataframes, allowing researchers to easily visualize and explore their data.
Seaborn also provides many built-in themes and color palettes, which can
help to make visualizations more visually appealing and easier to interpret.
In addition to its built-in themes and color palettes, Seaborn offers a wide
range of advanced features for customizing visualizations. For example, it
provides functions for creating subplots, adding annotations and labels, and
adjusting plot aesthetics. Seaborn also offers support for advanced statistical
techniques such as regression analysis, data smoothing, and bootstrapping.
Seaborn is particularly well-suited for visualizing complex, multidimen-
sional data. It provides many advanced visualization functions for working
with categorical data, time series data, and statistical models. For example,
Seaborn provides functions for creating boxplots, violinplots, and swarm-
plots, which can be used to visualize the distribution of data across multiple
categories. Seaborn also provides functions for creating time series plots,
which can be used to visualize changes in data over time.
Here’s a code that makes some plots using Seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Create a histogram
x = np.random.normal(size=1000)
sns.histplot(x)
plt.show()
The results look very similar to those from Matplotlib. Here are some more plots:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
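A few representative Seaborn plots built from random data (the particular plot types chosen here are illustrative) might look like:

# Scatter plot with a fitted regression line
df = pd.DataFrame({'x': np.random.rand(100)})
df['y'] = 2 * df['x'] + np.random.normal(scale=0.1, size=100)
sns.regplot(data=df, x='x', y='y')
plt.show()

# Box plot of random groups
groups = pd.DataFrame({'group': np.repeat(['A', 'B', 'C'], 50),
                       'value': np.random.normal(size=150)})
sns.boxplot(data=groups, x='group', y='value')
plt.show()

# Violin plot of the same groups
sns.violinplot(data=groups, x='group', y='value')
plt.show()

# Heatmap of a random matrix
matrix = np.random.rand(5, 5)
sns.heatmap(matrix, annot=True, cmap='viridis')
plt.show()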
All these plots use random data. You can customize this code to use your own data, similar to what we did with Matplotlib.
The following are advanced features of Seaborn that are not available in Matplotlib; a brief example follows the list:

• Categorical Plot Types: Seaborn provides many plot types for visualizing categorical data, such as catplot() (formerly factorplot()), swarmplot(), countplot(), and pointplot(). These plot types make it easy to visualize relationships between categorical variables and to compare distributions across categories.
• Color Palettes: Seaborn provides many built-in color palettes for
customizing the colors of plots, including categorical palettes, sequential
palettes, and diverging palettes. These palettes are designed to be visually
appealing and to improve the readability of plots.
• Faceting: Seaborn provides advanced support for faceting, which allows
you to create multiple plots that display different subsets of data. For
example, you can use the FacetGrid() function to create a grid of plots
that display different subsets of the data, based on one or more categorical
variables.
• Regression Analysis: Seaborn provides many functions for performing
regression analysis and visualizing the results. For example, you can use
the lmplot() function to create a scatter plot with a regression line, or
the residplot() function to create a plot of the residuals from a regression
analysis.
• Plot Styling: Seaborn provides many advanced options for customizing the
appearance of plots, including the ability to adjust the size, color, and style
of plot elements such as markers and lines. Seaborn also provides many
built-in themes that can be used to improve the overall appearance of plots
and to make them easier to interpret.
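As a brief illustration of two of these features, the sketch below uses Seaborn's built-in tips dataset (downloaded automatically on first use) to combine regression analysis and faceting:

import seaborn as sns
import matplotlib.pyplot as plt

# Built-in example dataset shipped with Seaborn
tips = sns.load_dataset('tips')

# Regression analysis: scatter plot with fitted regression lines per group
sns.lmplot(data=tips, x='total_bill', y='tip', hue='smoker', palette='deep')
plt.show()

# Faceting: one panel per day, points split by a categorical variable
g = sns.FacetGrid(tips, col='day', hue='sex')
g.map(sns.scatterplot, 'total_bill', 'tip')
g.add_legend()
plt.show()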
Plotly
Plotly is an open-source data visualization library for creating interactive plots
and charts. It provides a high-level interface for creating a wide range of
chart types, including scatter plots, line charts, bar charts, and more. Plotly’s
interactive features allow users to explore and analyze their data in real time,
including zooming and panning, hover-over tooltips, and clickable legends.
One of the key strengths of Plotly is its ability to create highly interactive
and dynamic visualizations. Plotly allows users to create plots that can be
manipulated in real time, allowing them to explore their data from different
angles and perspectives. Plotly also provides advanced features for creating
dashboards and other interactive applications, making it a popular choice for
creating data-driven web applications.
import numpy as np
import plotly.graph_objects as go

# Create a histogram
x = np.random.normal(size=1000)
fig = go.Figure(data=go.Histogram(x=x))
fig.show()
Note that this figure is interactive: you can zoom, pan, and hover over individual bars, something that cannot be done with static Matplotlib or Seaborn output. This interactivity, together with its support for dashboards and web applications, is the main advantage Plotly offers over Matplotlib and Seaborn.
Bokeh
Bokeh is a Python data visualization library that focuses on creating interac-
tive visualizations for the web. It provides a high-level interface for creating
a wide range of charts, including scatter plots, line charts, bar charts, and
more. Bokeh is particularly well-suited for creating complex, interactive
visualizations with large datasets.
One of the key features of Bokeh is its ability to create interactive visu-
alizations that can be embedded in web applications. Bokeh provides many
tools for interactivity, such as zooming and panning, hover tooltips, and selec-
tion tools. These features make it easy to explore and analyze large datasets in
a web-based environment. Bokeh also provides many advanced features for
customization and styling of visualizations. For example, Bokeh provides a
wide range of color palettes, advanced axis labeling and formatting options,
and support for customizing the appearance of markers and lines. Another
key feature of Bokeh is its support for streaming data. Bokeh provides tools
for creating real-time visualizations that can update dynamically as new data
is received. This makes Bokeh particularly well-suited for use in applica-
tions that require real-time monitoring or analysis of streaming data, such
as financial trading or sensor data analysis.
An example of a code that shows crypto prices in real time would be:
import requests
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource
from bokeh.io import push_notebook
from datetime import datetime
import time
output_notebook()
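A sketch of such a listing, continuing from the imports above, polling a public price API (CoinGecko is used here purely as an example) and streaming the values into a Bokeh plot, might be:

# Data source that will be updated in place
source = ColumnDataSource(data=dict(time=[], price=[]))

p = figure(x_axis_type='datetime', title='BTC price (USD)',
           width=700, height=300)
p.line(x='time', y='price', source=source)

handle = show(p, notebook_handle=True)

# Poll the price API and stream new points into the plot
for _ in range(60):
    resp = requests.get('https://api.coingecko.com/api/v3/simple/price',
                        params={'ids': 'bitcoin', 'vs_currencies': 'usd'})
    price = resp.json()['bitcoin']['usd']
    source.stream({'time': [datetime.now()], 'price': [price]})
    push_notebook(handle=handle)
    time.sleep(10)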
Similarly, an example of a code that shows the stock prices of five stocks in real time is:
import yfinance as yf
import pandas as pd
from bokeh.plotting import figure, show, output_notebook, ColumnDataSource
from bokeh.models import HoverTool
from bokeh.palettes import Spectral10
import datetime as dt
import time
from IPython import display
output_notebook()
stock_data = {}
for stock in stocks:
stock_data[stock] = get_stock_data(stock, start_date,
end_date)
colors = Spectral10
p.legend.location = "top_left"
p.legend.click_policy = "hide"
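A fuller, simplified sketch along the same lines (the tickers, date range, and get_stock_data helper are illustrative assumptions, and the periodic refresh loop is omitted for brevity) would be:

import yfinance as yf
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.palettes import Spectral10
import datetime as dt

output_notebook()

stocks = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA']   # hypothetical tickers
end_date = dt.date.today()
start_date = end_date - dt.timedelta(days=30)

def get_stock_data(ticker, start, end):
    """Download daily closing prices for one ticker."""
    return yf.Ticker(ticker).history(start=start, end=end)['Close']

p = figure(x_axis_type='datetime', title='Stock prices',
           width=700, height=350)
p.add_tools(HoverTool(tooltips=[('date', '@x{%F}'), ('price', '@y{0.00}')],
                      formatters={'@x': 'datetime'}))

colors = Spectral10
for i, stock in enumerate(stocks):
    prices = get_stock_data(stock, start_date, end_date)
    source = ColumnDataSource(data=dict(x=prices.index, y=prices.values))
    p.line('x', 'y', source=source, color=colors[i], legend_label=stock)

p.legend.location = "top_left"
p.legend.click_policy = "hide"
show(p)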
Apart from the ones discussed above, there are other Python libraries that are useful for data visualization. Each library has its own strengths and weaknesses, so the choice of which library to use depends on the specific needs of the project. Data visualization is an essential part of the data analysis process, and researchers should
take the time to explore the various visualization libraries available in Python
to find the tools that best suit their needs. By effectively visualizing their
data, researchers can gain deeper insights into their research questions and
communicate their findings more effectively to others.
12
Data Visualization: Text Data
Text data poses unique challenges compared to numerical data, due to its
inherent complexity. Consequently, analyzing and comprehending text data
can be quite daunting. However, by employing appropriate visualization
techniques, we can reveal hidden insights and patterns that may otherwise
elude us. In this discussion, we will explore various visualization methods
frequently employed for analyzing text data. These methods include word
clouds, bar charts, heatmaps, scatterplots, and network graphs. We will eluci-
date how each technique can be utilized to analyze distinct aspects of text
data, such as word frequency, sentiment, and document similarity. Addition-
ally, we will demonstrate how Python libraries introduced in the preceding
chapter, namely Matplotlib, Seaborn, Plotly, and Bokeh, can be employed to
generate these visualizations for text data. While these libraries are commonly
employed for numerical data, they possess the capability to handle text
data when equipped with appropriate techniques and tools. Leveraging these
libraries, we can create visually captivating and informative visualizations that
facilitate deeper insights into text data.
Wordcloud
Word clouds are a popular method for visualizing the most frequently occur-
ring words in a corpus. They can provide a quick overview of the most
prominent words in a text, making it easier to identify themes and patterns.
Creating a word cloud in Python is a relatively simple process. The first
step is to extract the text from your corpus, and then use a word tokenizer
to break the text into individual words. You can then count the frequency of
each word using Python’s built-in Counter class, or use specialized libraries
such as NLTK. Once you have a dictionary of word frequencies, you can use
a word cloud library like wordcloud. Word clouds typically display the words
in different-sized bubbles, with the larger bubbles representing more frequent
words. You can customize the appearance of the word cloud by changing the
font, color scheme, and layout.
In addition to providing a quick overview of the most frequent words,
word clouds can also be useful for identifying outliers and anomalies in a
corpus. For example, if you notice a word in the word cloud that is unex-
pected or unusual, you can investigate further to see why it is appearing so
frequently.
One limitation of word clouds is that they do not provide any information
about the context in which the words are used. For example, two words may
have the same frequency in a corpus, but one may be used more frequently in
positive contexts while the other is used more frequently in negative contexts.
To get a better understanding of the context in which words are used, you
may want to use other visualization techniques such as bar charts, heatmaps,
or scatterplots.
Here is a code that generates a word cloud from a short sample text (the first paragraph of this chapter):
import random
import wordcloud
import matplotlib.pyplot as plt
In this code, we first define a random text corpus as a string variable text.
We then create a WordCloud object using the wordcloud library. Next,
we generate the word cloud using the generate method of the WordCloud
object, passing in the text corpus as an argument. Finally, we display the word
cloud using matplotlib. We use imshow to display the image, set interpo-
lation to ’bilinear’ to smooth the edges of the bubbles, and turn off the axis
using axis(’off ’). Finally, we call show to display the word cloud in a new
window.
Note that you can customize the appearance of the word cloud by passing
in various arguments to the WordCloud constructor. For example, you
can set the background color, font, and size of the word cloud using the
background_color, font_path, and max_font_size arguments, respectively.
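A complete minimal version of this example, continuing from the imports above and using the opening sentences of this chapter as the sample text, might look like:

text = ("Text data poses unique challenges compared to numerical data, due to its "
        "inherent complexity. By employing appropriate visualization techniques, "
        "we can reveal hidden insights and patterns that may otherwise elude us.")

# Build and render the word cloud
wc = wordcloud.WordCloud(background_color='white', max_font_size=80)
wc.generate(text)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()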
Here’s a code that picks normalized text files from D:/Data/Preprocessed
folder:
import os
import glob
import string
from collections import Counter
import wordcloud
from PIL import Image
import numpy as np
In this code, we first define the path to the directory containing the text files
using a wildcard * to match all files with a .txt extension. We then use the
glob module to create a list of all the file names in the directory. Next, we
define a function called get_words that reads the text from a file, converts
it to lowercase, removes punctuation, and splits it into a list of words. We
then use a loop to call this function for each file in the directory, and create
a list of all the words in the directory. Using the Counter class from the
collections module, we create a word frequency dictionary that counts the
number of occurrences of each word in the list. We then create a Word-
Cloud object using the wordcloud module, and generate the word cloud
from the word frequency dictionary using the generate_from_frequencies
method. Next, we create a mask from an image file using the Image module from the Pillow (PIL) library. This step takes an image stored in a local directory as a template for the shape of the new word cloud. We then create a colored word cloud image using the ImageColorGenerator class from the wordcloud module, passing in the mask as an argument. Finally, we save the image to the desired directory.
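A minimal sketch consistent with this description, continuing from the imports above (the mask image path and the output file name are assumptions), would be:

# Collect all preprocessed text files
file_paths = glob.glob('D:/Data/Preprocessed/*.txt')

def get_words(path):
    """Read a file, lowercase it, strip punctuation, and split into words."""
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read().lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

all_words = []
for path in file_paths:
    all_words.extend(get_words(path))

word_freq = Counter(all_words)

# Hypothetical mask image; any silhouette image will do
mask = np.array(Image.open('D:/Data/mask.png'))

wc = wordcloud.WordCloud(background_color='white', mask=mask)
wc.generate_from_frequencies(word_freq)

# Recolor the cloud from the mask image and save the result
image_colors = wordcloud.ImageColorGenerator(mask)
wc.recolor(color_func=image_colors)
wc.to_file('D:/Data/wordcloud.png')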
Matplotlib
Matplotlib is useful not only for numerical data but also for text data; for example, it can be used to make a word cloud. Here’s a code that generates random text using the random library and then makes a word cloud using matplotlib:
import random
import string
import matplotlib.pyplot as plt
from collections import Counter
import numpy as np
plt.figure(figsize=(6, 3))
colors = plt.get_cmap('viridis')(np.linspace(0, 1,
len(word_count)))
plt.axis('off')
plt.show()
if __name__ == "__main__":
random_text = generate_random_text(100)
print("Random text:\n", random_text)
create_word_cloud(random_text)
This Python code generates a simple word cloud from random text using the
matplotlib, random, and numpy libraries. The generate_random_text()
function creates a random string of words by iterating over a specified number
of words, generating random word lengths between 3 and 10 characters, and
constructing the words from random lowercase letters. The create_word_
cloud() function splits the generated text into a list of words and uses the
Counter class from the collections library to count the frequency of each
word. It then creates a scatter plot using matplotlib, placing each word
at a random position in the plot with a size determined by its frequency
multiplied by a scaling factor. Colors are generated using a colormap from
matplotlib (in this case, ’viridis’) and the numpy library, and each word
is assigned a color. Note that this simple implementation does not handle
collision detection or optimal positioning, so some words may overlap.
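A complete minimal sketch of these two functions, continuing from the imports above (the scaling factor and the use of plt.text for placement are assumptions), would be:

def generate_random_text(num_words):
    """Build a string of random lowercase 'words' of length 3 to 10."""
    words = []
    for _ in range(num_words):
        length = random.randint(3, 10)
        words.append(''.join(random.choices(string.ascii_lowercase, k=length)))
    return ' '.join(words)

def create_word_cloud(text, scale=15):
    """Place each word at a random position, sized by its frequency."""
    word_count = Counter(text.split())
    plt.figure(figsize=(6, 3))
    colors = plt.get_cmap('viridis')(np.linspace(0, 1, len(word_count)))
    for color, (word, count) in zip(colors, word_count.items()):
        x, y = random.random(), random.random()
        plt.text(x, y, word, fontsize=count * scale, color=color,
                 ha='center', va='center')
    plt.axis('off')
    plt.show()

if __name__ == "__main__":
    random_text = generate_random_text(100)
    print("Random text:\n", random_text)
    create_word_cloud(random_text)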
If normalized text files are stored in D:/Data/Preprocessed, then the code to make a word cloud from those files using Matplotlib would be:
import os
import random
import string
import matplotlib.pyplot as plt
from collections import Counter
import numpy as np
plt.figure(figsize=(12, 12))
colors = plt.get_cmap('viridis')(np.linspace(0, 1,
len(word_count)))
plt.axis('off')
plt.show()
if __name__ == "__main__":
file_path = 'D:/Data/Preprocessed/2.txt'
text = read_text_from_file(file_path)
create_word_cloud(text)
The read_text_from_file function reads the text data from a given file path
and returns the contents as a string. The create_word_cloud function takes
the input text and generates a word cloud visualization using the Counter
class from the collections module to count the occurrences of each word.
The word cloud is created using the matplotlib.pyplot library. It initializes
a figure of size 12×12 inches and assigns colors to the words based on a color
map. The word cloud is generated by iterating over each word and its corre-
sponding count. For each word, a random position is generated, and the word
is placed at that position using the scatter and text functions from pyplot.
The font size of each word is determined by its count. The resulting word
cloud is displayed using the show function. In the main part of the script,
a file path is specified as ’D:/Data/Preprocessed/2.txt’, and the read_text_
from_file function is called to retrieve the text from that file. The create_
word_cloud function is then invoked with the obtained text to generate the
word cloud visualization.
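A minimal sketch of the read_text_from_file helper, reusing the create_word_cloud function sketched earlier (the file-based version in the original listing uses a larger 12×12 figure), would be:

def read_text_from_file(file_path):
    """Return the contents of a text file as a single string."""
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()

if __name__ == "__main__":
    file_path = 'D:/Data/Preprocessed/2.txt'
    text = read_text_from_file(file_path)
    create_word_cloud(text)   # reuse the function sketched above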
It is pertinent to mention that once you have the word frequency dictionary, you can use any numerical data visualization method to visualize the distribution of word frequencies in the corpus. Here are some examples:
• Bar Charts: You can use a bar chart to display the frequency of the most
common words in the corpus. You can sort the dictionary by frequency
and select the top N words to display in the chart.
• Line Charts: You can use a line chart to show the trend of word frequencies
over time. For example, if you are analyzing a set of news articles, you can
plot the frequency of certain keywords over time to see how they change.
• Heat Maps: You can use a heat map to display the co-occurrence of words
in the corpus. You can create a matrix of word co-occurrences and use a
color map to show the frequency of each co-occurrence.
• Scatter Plots: You can use a scatter plot to show the relationship between
two variables, such as the frequency of two different words in the corpus.
• Network Graphs: You can use a network graph to show the relationships
between different words in the corpus. You can create nodes for each word
and edges between them based on their co-occurrence.
In order to use these methods, you need to first calculate the word frequency
dictionary as you have already done. Once you have the dictionary, you can
use any numerical data visualization tool of your choice, such as matplotlib,
seaborn, or plotly, to create the desired visualization.
Here’s a code that generates 1,000 random words from a word list and creates a bar chart of the ten most frequent words using matplotlib:
import random
import nltk
import collections
import matplotlib.pyplot as plt
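A minimal continuation of that listing, drawing the 1,000 words from the NLTK word list (an assumption about the word source), would be:

# Download the word list the first time (a no-op if already present)
nltk.download('words', quiet=True)
from nltk.corpus import words as nltk_words

# Draw 1,000 words at random from the word list
word_list = nltk_words.words()
random_words = [random.choice(word_list) for _ in range(1000)]

# Count frequencies and keep the ten most common words
word_counts = collections.Counter(random_words)
top_words = word_counts.most_common(10)

labels, counts = zip(*top_words)
plt.bar(labels, counts)
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Top 10 Words')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()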
Similarly, you can import your text data to get any plot of the top words in
the text data.
Network Map
Network graphs, also known as graphs or network diagrams, are visual repre-
sentations of relationships between entities. In the context of text data, these
entities can be words, phrases, or even entire documents. Network graphs
are particularly useful for exploring and understanding the structure and
connections within complex datasets.
In a network graph, entities are represented as nodes (or vertices), and
relationships between these entities are represented as edges (or links). The
relationships can be either directed or undirected, depending on the nature
of the dataset. For instance, in a citation network, edges might be directed
to represent the direction of citations, while in a co-occurrence network of
words, edges might be undirected as the relationship is mutual.
To create and analyze network graphs, several libraries are available in Python,
such as NetworkX, igraph, and Gephi (a standalone application). These
libraries provide tools for creating, manipulating, and visualizing network
graphs, as well as performing various graph analysis algorithms, such as
calculating centrality measures, detecting communities, and more.
This code generates the network map from D:/Data/Preprocessed/2.txt:
import nltk
import networkx as nx
import matplotlib.pyplot as plt
This code utilizes the Natural Language Toolkit (NLTK) and NetworkX
libraries in Python to construct a network graph, depicting the co-occurrence
of words within a text document. After obtaining the text from a specified
file, the script employs NLTK to tokenize the text into individual words. The
NetworkX library then generates a graph, with nodes representing words and
edges signifying their co-occurrence. If a pair of words appears together more
than once, the weight of the corresponding edge is incremented. The final
part of the script is dedicated to visualizing this graph. Using matplotlib,
it displays the graph with nodes and edges, where the nodes are labeled
with their corresponding words, and the edges are labeled with their weights,
showing the frequency of co-occurrence. This graph provides a spatial representation of the relationships between words, aiding in the analysis of textual data.
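A minimal sketch consistent with this description (the co-occurrence window size and the edge-weight filter used to keep the plot readable are assumptions) would be:

import nltk
import networkx as nx
import matplotlib.pyplot as plt
from itertools import combinations

nltk.download('punkt', quiet=True)

# Read and tokenize the document
with open('D:/Data/Preprocessed/2.txt', 'r', encoding='utf-8') as f:
    text = f.read()
tokens = nltk.word_tokenize(text.lower())

# Build a co-occurrence graph over a sliding window of tokens
G = nx.Graph()
window = 10   # assumed window size
for i in range(0, len(tokens), window):
    for w1, w2 in combinations(set(tokens[i:i + window]), 2):
        if G.has_edge(w1, w2):
            G[w1][w2]['weight'] += 1
        else:
            G.add_edge(w1, w2, weight=1)

# Keep only edges with weight greater than one so the plot stays readable
strong = [(u, v) for u, v, d in G.edges(data=True) if d['weight'] > 1]
H = G.edge_subgraph(strong).copy()

# Draw the graph with word labels and edge-weight labels
pos = nx.spring_layout(H, seed=42)
nx.draw(H, pos, with_labels=True, node_size=300, font_size=8)
edge_labels = nx.get_edge_attributes(H, 'weight')
nx.draw_networkx_edge_labels(H, pos, edge_labels=edge_labels, font_size=6)
plt.show()

Relationships between words can also be explored by embedding the words as vectors and projecting them into two dimensions. The next listing trains a Word2Vec model on the same file and reduces the resulting vectors with t-SNE: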
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
# Read text from file
file_path = 'D:/Data/Preprocessed/2.txt'
try:
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
    exit()
except IOError:
    print(f"Error: Unable to read file at {file_path}")
    exit()

if not text:
    print("Error: File is empty or could not be read correctly")
    exit()
# Tokenize the text
tokens = word_tokenize(text.lower())

# Train Word2Vec model
model = Word2Vec([tokens], min_count=1, vector_size=50)

# Collect the word vectors to project
words = list(model.wv.index_to_key)
word_vectors = model.wv[words]

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
word_vectors_2d = tsne.fit_transform(word_vectors)

# Plot the two-dimensional projection
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], s=5)
plt.title('t-SNE projection of word embeddings')
plt.show()
import pandas as pd
import plotly.express as px
from sklearn.manifold import TSNE

# Apply t-SNE to reduce the word vectors from the model above to three dimensions
tsne = TSNE(n_components=3, random_state=42)
word_vectors_3d = tsne.fit_transform(word_vectors)

# One row per word with its 3-D coordinates
data = {'word': words,
        'x': word_vectors_3d[:, 0],
        'y': word_vectors_3d[:, 1],
        'z': word_vectors_3d[:, 2]}
df = pd.DataFrame(data)

# Interactive 3-D scatter plot with word labels
fig = px.scatter_3d(df, x='x', y='y', z='z', text='word')
fig.update_layout(scene=dict(xaxis_title='Dimension 1',
                             yaxis_title='Dimension 2',
                             zaxis_title='Dimension 3'),
                  title='3D t-SNE Plot of Word Embeddings')
fig.show()
The resulting interactive figure can be rotated and zoomed.
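A minimal sketch of a UMAP visualization along these lines (the umap-learn package, the vectorizer settings, and the UMAP parameters are assumptions) would be:

import umap
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Load the 20 newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all',
                                remove=('headers', 'footers', 'quotes'))

# Vectorize the text data into a document-term count matrix
vectorizer = CountVectorizer(max_df=0.95, min_df=5, stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)

# Fit a UMAP model to reduce the count matrix to two dimensions
reducer = umap.UMAP(n_neighbors=15, n_components=2,
                    metric='hellinger', random_state=42)
embedding = reducer.fit_transform(X)

# Visualize the embedding, color-coded by newsgroup topic
plt.scatter(embedding[:, 0], embedding[:, 1],
            c=newsgroups.target, cmap='Spectral', s=2)
plt.colorbar(label='Newsgroup topic')
plt.title('UMAP projection of the 20 newsgroups dataset')
plt.show()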
This code demonstrates how to use UMAP to visualize text data in two
dimensions. It loads the 20 newsgroups dataset using scikit-learn, vector-
izes the text data using CountVectorizer, fits a UMAP model to the resulting
matrix, and visualizes the UMAP embedding using a scatter plot.
The fetch_20newsgroups function loads the 20 newsgroups dataset,
which contains approximately 20,000 newsgroup posts across 20 different
topics. The CountVectorizer class is used to vectorize the text data, which
counts the number of times each word appears in each document. The
resulting matrix is then passed to the UMAP class from the umap library to
fit a UMAP model. The resulting low-dimensional embedding is then visual-
ized using a scatter plot with color-coded labels corresponding to the different
topics.
Note that the max_df, min_df, and stop_words parameters of CountVectorizer control the preprocessing of the text data, and the n_neighbors, n_components, and metric parameters of UMAP control the structure of the resulting low-dimensional embedding.
13
Descriptive Statistics
import pandas as pd
import numpy as np
import scipy.stats as stats

df = pd.read_csv('D:/Data/my_data.csv')   # assumed path, consistent with later examples

mean_btc = np.mean(df['BTC'])
median_btc = np.median(df['BTC'])
mode_btc = stats.mode(df['BTC'])
print("Mean of BTC:", mean_btc)
print("Median of BTC:", median_btc)
print("Mode of BTC:", mode_btc[0][0])
A code that calculates some basic descriptive statistics grouped by the Year variable in the dataset and saves them as additional columns in the dataset is as follows:
import pandas as pd
import numpy as np
import scipy.stats as stats
This code groups the data by the "Year" column using the groupby method,
and then computes the mean, median, mode, standard deviation, and vari-
ance for the "BTC" column using the agg method. It then renames the
columns and merges the computed statistics with the original data using the
merge method. Finally, it saves the modified data to replace the existing CSV
file. The lambda function inside agg is used to extract the mode value from
the result of stats.mode, which returns a tuple containing both the mode
value and its count.
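A minimal sketch consistent with this description (pandas' mode() is used here in place of scipy.stats.mode to avoid version-specific indexing; the file path matches the one used elsewhere in this chapter) would be:

import pandas as pd

# Load the dataset
data = pd.read_csv('D:/Data/my_data.csv')

# Compute yearly statistics for BTC
yearly_stats = data.groupby('Year')['BTC'].agg(
    BTC_mean='mean',
    BTC_median='median',
    BTC_mode=lambda x: x.mode().iloc[0],
    BTC_std='std',
    BTC_var='var'
).reset_index()

# Merge the statistics back into the original data and overwrite the CSV
data = data.merge(yearly_stats, on='Year', how='left')
data.to_csv('D:/Data/my_data.csv', index=False)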
To export the descriptive statistics to a Word file as a publication-quality table, the following code would be useful:
import pandas as pd
import numpy as np
import scipy.stats as stats
from docx import Document
from docx.shared import Pt
This code computes basic descriptive statistics for the ‘BTC’ variable in a
dataset, grouped by year. It then creates a new Word document, adds a table
to the document, populates the table with the computed statistics, and saves
the document to a file. The pandas library is used for data manipulation,
scipy is used for computing the mode of the ‘BTC’ variable, and the docx
library is used for creating and saving the Word document. The table style is
set to Times New Roman font with a font size of 12pt, and missing data is
represented with an em dash (—).
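A minimal sketch of such an export using python-docx (the heading text, table style, and output file name are assumptions; mode is computed with pandas here) would be:

import pandas as pd
from docx import Document
from docx.shared import Pt

data = pd.read_csv('D:/Data/my_data.csv')
yearly = data.groupby('Year')['BTC'].agg(
    mean='mean', median='median',
    mode=lambda x: x.mode().iloc[0],
    std='std', var='var').reset_index()

doc = Document()
doc.add_heading('Descriptive Statistics of BTC by Year', level=1)
table = doc.add_table(rows=1, cols=len(yearly.columns))
table.style = 'Table Grid'

# Header row
for j, col in enumerate(yearly.columns):
    table.rows[0].cells[j].text = str(col)

# Data rows; missing values are shown as an em dash, per the description above
missing = '\u2014'
for _, row in yearly.iterrows():
    cells = table.add_row().cells
    for j, val in enumerate(row):
        if pd.isna(val):
            cells[j].text = missing
        elif isinstance(val, float):
            cells[j].text = f'{val:.2f}'
        else:
            cells[j].text = str(val)

# Apply Times New Roman, 12 pt to every cell
for r in table.rows:
    for c in r.cells:
        for p in c.paragraphs:
            for run in p.runs:
                run.font.name = 'Times New Roman'
                run.font.size = Pt(12)

doc.save('D:/Data/descriptive_stats.docx')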
Outlier Detection
Outlier detection is the process of identifying observations or data points
that deviate significantly from the rest of the dataset. In academic research in
accounting and finance, outlier detection is important because outliers can
have a significant impact on the results of statistical analysis. Outliers can
skew the results, leading to incorrect conclusions and hypotheses.
Python is well-suited for outlier detection because of its powerful data
manipulation and visualization libraries, such as Pandas, NumPy, and
Matplotlib. These libraries allow for efficient data exploration, identifica-
tion of outliers, and visualization of the results. Additionally, Python has
a wide range of statistical and machine learning packages, such as Scikit-
learn and Statsmodels, that can be used to identify outliers and handle them
appropriately.
To implement outlier detection in Python, the first step is to import the necessary libraries and load the data into a pandas DataFrame:
import pandas as pd
import numpy as np
import scipy.stats as stats
data = pd.read_csv('D:/Data/my_data.csv')
1. Z-Score Method
The Z-score method is based on the principle that data points with a z-score
greater than a certain threshold are considered outliers. To calculate the z-
score, we first need to standardize our data. Here is a sample code that detects
observations below 1 percentile and above 99 percentile of the observations
in data:
btc_mean = data['BTC'].mean()
btc_std = data['BTC'].std()
data['BTC_zscore'] = (data['BTC'] - btc_mean) / btc_std

p1 = data['BTC'].quantile(0.01)
p99 = data['BTC'].quantile(0.99)
outliers = data[(data['BTC'] < p1) | (data['BTC'] > p99)]
print(outliers)
2. IQR Method

The interquartile range (IQR) method flags observations that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 − Q1; a short sketch of this rule follows.
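A minimal sketch of this rule, reusing the data frame and the BTC column from the Z-score example above:

Q1 = data['BTC'].quantile(0.25)
Q3 = data['BTC'].quantile(0.75)
IQR = Q3 - Q1

# Flag observations outside the 1.5 * IQR fences
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
iqr_outliers = data[(data['BTC'] < lower) | (data['BTC'] > upper)]
print(iqr_outliers)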
3. Modified Z-Score Method

The modified Z-score method uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation. It is more robust to outliers and suitable for data with non-normal distributions. Here's the code for the modified Z-score method:

btc_median = data['BTC'].median()
MAD = (data['BTC'] - btc_median).abs().median()
data['BTC_modified_z'] = 0.6745 * (data['BTC'] - btc_median) / MAD
4. Tukey’s Fences
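Tukey's fences use the quartiles directly: observations beyond Q1 − k × IQR or Q3 + k × IQR are flagged, with k = 1.5 for the inner fences and k = 3 for the outer fences that mark extreme outliers. A minimal sketch, reusing Q1, Q3, and IQR from the IQR example above:

inner_low, inner_high = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outer_low, outer_high = Q1 - 3.0 * IQR, Q3 + 3.0 * IQR

possible_outliers = data[(data['BTC'] < inner_low) | (data['BTC'] > inner_high)]
extreme_outliers = data[(data['BTC'] < outer_low) | (data['BTC'] > outer_high)]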
Each of these methods has its advantages and drawbacks, and the choice of
method depends on the nature of the data and the research question being
addressed. It is essential to understand the underlying assumptions and limi-
tations of each method before applying them to your data. In some cases,
combining multiple methods can provide a more comprehensive picture of
potential outliers in the data.
# Create a scatterplot
plt.scatter(data['BTC'], data['ETH'])
plt.xlabel('BTC')
plt.ylabel('ETH')
plt.title('Scatterplot of BTC and ETH')
plt.show()
If the data points in the scatterplot form a clear pattern (e.g., upward or
downward sloping), this suggests a strong linear relationship between the
variables. Conversely, if the data points appear scattered with no discernible
pattern, this may indicate a weak or nonexistent relationship.
1. Autocorrelations
# Calculate autocorrelations
btc_autocorr = data['BTC'].autocorr(lag=1)
2. Moving Averages
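A moving average smooths a series by averaging it over a sliding window; a one-line sketch for the BTC series (the 30-day window is an assumption):

btc_ma = data['BTC'].rolling(window=30).mean()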
3. Exponential Smoothing
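Exponential smoothing weights recent observations more heavily; a one-line sketch using pandas' exponentially weighted mean (the span is an assumption):

btc_ewm = data['BTC'].ewm(span=30, adjust=False).mean()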
1. Simple Random Sampling

data = pd.read_csv("D:/Data/my_data.csv")
sample_size = 100
simple_random_sample = data.sample(n=sample_size,
random_state=42)
2. Systematic Sampling
interval = len(data) // sample_size
systematic_sample = data.iloc[::interval]
3. Stratified Sampling

strata_column = 'Year'   # assumed stratification variable
unique_strata = data[strata_column].unique()
stratified_sample = pd.concat(
    [data[data[strata_column] == s].sample(frac=0.1, random_state=42)
     for s in unique_strata])
4. Cluster Sampling
cluster_column = 'Year'        # assumed clustering variable
clusters_sample_size = 2       # number of clusters to draw (assumption)
unique_clusters = data[cluster_column].unique()
selected_clusters = np.random.choice(unique_clusters,
                                     size=clusters_sample_size, replace=False)
cluster_sample = data[data[cluster_column].isin(selected_clusters)]
14
Topic Modeling

The most common approaches to topic modeling are listed below; an LDA example follows the list.

1. Latent Dirichlet Allocation (LDA): This is the most widely used and
popular type of topic modeling. LDA models each document as a mixture
of topics, where each topic is a probability distribution over words. LDA
assumes that each word in a document is generated by one of the topics,
and that the topics themselves are generated from a Dirichlet distribution.
2. Non-negative Matrix Factorization (NMF): NMF is a matrix decompo-
sition technique that can be used for topic modeling. It decomposes a
matrix of word counts into two matrices: one that represents the topics
and one that represents the documents. The topics are represented as
non-negative linear combinations of the words in the vocabulary.
3. Probabilistic Latent Semantic Analysis (PLSA): PLSA is a variant of LDA
that uses a different approach to estimate the topic distributions. Instead
of assuming a Dirichlet prior on the topic distributions, PLSA directly
models the joint distribution of the words and the topics.
4. Correlated Topic Model (CTM): CTM is an extension of LDA that allows
for correlations between topics. It assumes that the topics are generated
from a multivariate normal distribution, and that each document is a
mixture of these topics.
5. Hierarchical Dirichlet Process (HDP): HDP is a Bayesian non-parametric
model that allows for an unbounded number of topics. It models the topic
proportions of each document as a distribution over an infinite number
of latent topics, which are shared across all documents.
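A minimal gensim LDA listing in the style used throughout this book (the directory path and hyperparameters are assumptions) would be:

import os
from gensim import corpora, models

dir_path = 'D:/Data/Preprocessed'
text_files = [f for f in os.listdir(dir_path) if f.endswith('.txt')]

# Read each preprocessed file and split it into a list of words
documents = []
for file_name in text_files:
    with open(os.path.join(dir_path, file_name), 'r', encoding='utf-8') as f:
        documents.append(f.read().split())

# Map words to integer IDs and build the bag-of-words corpus
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train an LDA model with ten topics
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary,
                            num_topics=10, passes=10, random_state=42)

for topic_id, topic in lda_model.print_topics(num_topics=10, num_words=10):
    print(topic_id, topic)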
The resulting output lists ten topics. The numbers before the words represent the
weight or importance of that word in a particular topic. Specifically, they
represent the probability of a word appearing in that topic. For example, in
the first topic, the word “walmart” has a weight of 0.016, which means that
it has a 1.6% chance of appearing in that topic. The sum of the weights for
all words in a topic adds up to 1.0.
The code uses the gensim library to perform topic modeling using Latent
Dirichlet Allocation (LDA). First, the directory path for preprocessed text
files is defined using the os library. The dir_path variable is set to the path
where the preprocessed text files are located. Next, a list of preprocessed text
files is created using a list comprehension. The os.listdir function is used
to get a list of files in the dir_path directory, and the list comprehension
filters the list to include only those files that end with the.txt extension. The
resulting list of file names is stored in the text_files variable.
After that, a list of preprocessed documents is created by iterating through
each file in the text_files list. The open function is used to open each file in
read mode, and the resulting text is split into a list of words using the split
method. Each list of words is then appended to the documents list.
The corpora.Dictionary function is used to create a dictionary from the preprocessed documents. The dictionary variable is set to the resulting dictionary object, which maps each unique word in the corpus to a unique integer ID.
This code creates 10 topics. The optimum number of topics can be determined in two ways:

• by computing the perplexity of models trained with different numbers of topics and choosing the number with the lowest perplexity, or
• by computing a coherence score over a range of topic numbers and choosing the number with the highest coherence.

Running a perplexity search over a range of topic counts produces output such as:
Number of topics: 18 | Perplexity: -5.20
Number of topics: 19 | Perplexity: -5.22
Number of topics: 20 | Perplexity: -5.25
Number of topics: 21 | Perplexity: -5.28
Number of topics: 22 | Perplexity: -5.30
Number of topics: 23 | Perplexity: -5.33
Number of topics: 24 | Perplexity: -5.35
Number of topics: 25 | Perplexity: -5.37
Optimal number of topics: 25
The code below automatically selects the optimum number of topics based on perplexity and creates those topics:
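A sketch of such a selection procedure, reusing the corpus and dictionary built in the listing above and treating the lowest log-perplexity as optimal (consistent with the output shown earlier), might be:

import numpy as np
from gensim import models

# Evaluate candidate topic counts by log-perplexity on the corpus
candidate_topics = range(18, 26)
perplexities = []
for n in candidate_topics:
    lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                          num_topics=n, passes=10, random_state=42)
    perp = lda.log_perplexity(corpus)
    perplexities.append(perp)
    print(f"Number of topics: {n} | Perplexity: {perp:.2f}")

# Refit the final model with the lowest-perplexity topic count
optimal_n = list(candidate_topics)[int(np.argmin(perplexities))]
print("Optimal number of topics:", optimal_n)
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary,
                            num_topics=optimal_n, passes=10, random_state=42)
for topic_id, topic in lda_model.print_topics(num_topics=optimal_n, num_words=10):
    print(topic_id, topic)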
The code below calculates the coherence score for a range of numbers of topics:
for num_topics in range(2, 8):
    lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    cm = CoherenceModel(model=lda, texts=documents, dictionary=dictionary, coherence='c_v')
    print(num_topics, cm.get_coherence())
After this you can run the LDA code based on this number. However, instead of this two-step approximation, you can also generate the optimum number of topics directly using the following code:
from gensim.models import CoherenceModel
from gensim.models import LdaModel # Import the LdaModel class
Non-negative Matrix Factorization (NMF)

import os
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
# Use the vectorizer to fit and transform the text data into a matrix of TF-IDF features
tfidf = vectorizer.fit_transform(file_paths)
This code first uses the os module to define the directory containing the prepro-
cessed text files (dir_path). It then creates a list of file paths for all files in
the directory (file_paths). Then it creates a TfidfVectorizer object called
vectorizer to convert the text data into a matrix of TF-IDF features. The
input parameter is set to ‘filename’ to indicate that the input data is a set
of file paths rather than raw text data, and the stop_words parameter is set
to ‘english’ to remove common English stop words (if not removed already).
This step is important because it helps to reduce the variability in the text data
and make it easier for the NMF model to identify and extract meaningful
topics.
The vectorizer is then used to fit and transform the text data into a matrix
of TF-IDF features (tfidf ). This matrix represents the text data in a format
that is suitable for input to the NMF algorithm. The code defines the number
of topics to extract from the text data (n_topics), which in this case is set to
50. This number can be adjusted depending on the size of the text data and
the desired level of granularity in the extracted topics.
NMF is then performed on the TF-IDF matrix (tfidf ) to extract the topics.
The NMF object is created with n_components = n_topics to specify the
number of topics, init = ‘nndsvd’ to initialize the factorization using a non-
negative double singular value decomposition, max_iter = 200 to set the
maximum number of iterations, and random_state = 0 to ensure repro-
ducibility of the results. The fit_transform method is used to fit the NMF
model to the TF-IDF matrix and transform the data into the factorized
matrices W and H.
Finally, the code prints the top 10 words for each topic by sorting the
entries in the H matrix (topic.argsort()[:-11:-1]) and using the feature_
names list to print the corresponding words. This step helps to identify
the most important words associated with each topic and provides insights
into the themes and patterns in the text data. The results are printed to the
console.
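A minimal sketch consistent with this description (it assumes the corpus in D:/Data/Preprocessed contains at least as many documents as topics) would be:

import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

dir_path = 'D:/Data/Preprocessed'
file_paths = [os.path.join(dir_path, f) for f in os.listdir(dir_path)
              if f.endswith('.txt')]

# Convert the documents into a TF-IDF matrix, reading directly from the file paths
vectorizer = TfidfVectorizer(input='filename', stop_words='english')
tfidf = vectorizer.fit_transform(file_paths)

# Factorize the TF-IDF matrix into 50 topics
n_topics = 50
nmf = NMF(n_components=n_topics, init='nndsvd', max_iter=200, random_state=0)
W = nmf.fit_transform(tfidf)
H = nmf.components_

# Print the ten highest-weighted words for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    top_words = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")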
In using NMF, you need to define the number of topics, which can be
between 1 and 1000. Unfortunately, NMF does not have a built-in method
to automatically select the optimal number of topics for a given corpus of
text data. However, there are several methods that you can use to estimate
the optimal number of topics based on the characteristics of the text data.
One common approach is to use a metric called “perplexity” to evaluate the
quality of the topic model at different numbers of topics. Perplexity measures
how well a model predicts new text data based on the topics it has learned.
It is a statistical concept and is beyond the scope of this book. Generally,
lower perplexity scores indicate better performance. You can train multiple
NMF models with different numbers of topics and choose the model that
produces the lowest perplexity score on a held-out set of text data. A code that
calculates selects the best NMF topics based on a range of topics provided by
the user would be:
import os
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.model_selection import train_test_split
# Use the vectorizer to fit and transform the text data into a matrix of TF-IDF features for the training set
tfidf_train = vectorizer.fit_transform(train_paths)
In this version of the code, we first split the set of preprocessed text files into
a training set and a validation set using the train_test_split function from
sklearn.model_selection. We then fit and transform the training set using
the TfidfVectorizer object.
Next, we define a range of numbers of topics to extract and iterate over
this range, performing NMF on the training set for each value of n_topics.
We then calculate the log-likelihood of the documents in the validation set
and compute the perplexity score for each value of n_topics.
After computing the perplexity score for each value of n_topics, we find
the index of the minimum perplexity score and extract the topics for the
corresponding N.
Another approach is to use visual inspection of the topic model results
to estimate the optimal number of topics. You can generate topic-word and
document-topic matrices for each number of topics and examine the resulting
topics to see if they are coherent and meaningful. If the topics are too general
or too specific, you can try adjusting the number of topics until you find a
satisfactory set of topics.
Keep in mind that the optimal number of topics may depend on the
specific characteristics of your text data and the goals of your analysis, so
it is important to evaluate multiple models and select the one that best meets
your needs.
# Get the file paths for all text files in the directory
directory = 'D:/Data/Preprocessed'
file_paths = [os.path.join(directory, f) for f in
              os.listdir(directory) if f.endswith('.txt')]
Topic 0:
3185, neighborhood, behalf, market, fiscal, histor, respect,
includ, reason, consolid
Topic 1:
walmart, sale, net, fiscal, billion, brand, histor, segment,
profit, 2019
Topic 2:
walmart, sale, 2019, consolid, billion, segment, net, incom,
brand, repr
In this code, first the necessary modules are imported, including os,
numpy, pandas, CountVectorizer, and LatentDirichletAllocation from
the sklearn.feature_extraction.text and sklearn.decomposition modules,
respectively. Next, the directory containing the preprocessed text files is set
to D:/Data/Preprocessed, and the file paths for all text files in the directory
are obtained using the os.listdir() function and a list comprehension.
The text data is read in from the files using a for loop, which reads each
file and appends the text to a list called documents. A CountVectorizer()
object is then created with the specified parameters max_df = 0.95, min_df
= 0.05, and stop_words = ‘english’. This vectorizer is used to convert the
preprocessed text data into a matrix of word counts.
Next, a LatentDirichletAllocation() object is created with n_topics =
10, max_iter = 50, learning_method = ‘online’, and random_state = 42.
This LDA model is then fit to the count matrix using the fit() method.
Finally, the top 10 words for each topic are printed out using a for loop that
iterates over the topics and uses the argsort() method to obtain the indices
of the top 10 words in each topic, which are then looked up in the feature_
names list obtained from the CountVectorizer() object. max_df and min_
df are parameters of the CountVectorizer() function in scikit-learn, which is
used to convert a corpus of text documents into a matrix of token counts.
max_df specifies the maximum document frequency of a term in the corpus, as a fraction between 0 and 1. Terms that appear in more than max_df fraction of the documents are excluded from the vocabulary. Similarly, min_df specifies the minimum document frequency: terms that appear in fewer than min_df fraction of the documents are excluded.
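A minimal sketch of the scikit-learn listing described above (note that current scikit-learn versions name the parameter n_components rather than n_topics) would be:

import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

directory = 'D:/Data/Preprocessed'
file_paths = [os.path.join(directory, f) for f in os.listdir(directory)
              if f.endswith('.txt')]

# Read the preprocessed documents into memory
documents = []
for path in file_paths:
    with open(path, 'r', encoding='utf-8') as f:
        documents.append(f.read())

# Convert the documents into a matrix of word counts
vectorizer = CountVectorizer(max_df=0.95, min_df=0.05, stop_words='english')
counts = vectorizer.fit_transform(documents)

# Fit an online LDA model with ten topics
lda = LatentDirichletAllocation(n_components=10, max_iter=50,
                                learning_method='online', random_state=42)
lda.fit(counts)

# Print the ten highest-weighted words for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")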
Correlated Topic Model (CTM)

import os
from gensim import corpora, models
from gensim.models import CoherenceModel
These imports provide what we need to build and evaluate our CTM model. Next, we define the path to the directory
containing our preprocessed text files. We create a list of the file names in the
directory using the os.listdir() function. We then load the preprocessed text
files into a list of lists of words. For each file, we read its contents and split
them into words using the split() function. We then append the resulting list
of words to our documents list.
We create a corpora.Dictionary object from our documents list, which
maps each unique word to a unique integer ID. We also create a corpus
object from our documents list using the doc2bow() function of our dictio-
nary object. This converts each document in our documents list to a
bag-of-words representation, where each word is represented by its integer
ID and its frequency in the document.
We then set the number of topics for our CTM model to 10, and train
the model on our corpus using the LdaModel() function from the models
module of gensim. We pass our corpus, dictionary, and the number of
topics to the function, along with various other hyperparameters that control
the behavior of the algorithm. After training the CTM model, we print the
topics and their corresponding words using the show_topics() method of our
ctm_model object. Finally, we compute the coherence score for our CTM
model using the CoherenceModel() function from gensim. We pass our
ctm_model, documents, dictionary, and the type of coherence measure we
want to use (‘c_v’ in this case) to the function. The coherence score measures the degree of semantic similarity between the words in the topics, and serves as a rough measure of the quality of the model.
In this code the number of topics is set to 10. Fortunately, there is a way to determine the optimum number of topics: calculate the coherence score over a range of topic numbers. The coherence score is a statistical concept whose details are beyond the scope of this book. The following code automatically calculates the coherence score for a range of topic numbers (in this case 5 to 10) and reports the optimum number of topics. You can use that information to decide how many topics you need.
import os
from gensim import corpora, models
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt

# documents, dictionary, and corpus are built as described in the previous section
topic_range = range(5, 11)
coherence_scores = []
for num_topics in topic_range:
    ctm_model = models.LdaModel(corpus=corpus, id2word=dictionary,
                                num_topics=num_topics, passes=10)
    coherence_model = CoherenceModel(
        model=ctm_model,
        texts=documents,
        dictionary=dictionary,
        coherence='c_v'
    )
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append(coherence_score)

plt.plot(list(topic_range), coherence_scores)
plt.show()
optimal_topic_num = list(topic_range)[coherence_scores.index(max(coherence_scores))]
print('Optimal number of topics:', optimal_topic_num)
The code sets the path to a directory of preprocessed text files and
loads the files into memory as a list of lists of words. Then it creates a
corpora.Dictionary object and a corpus object from the preprocessed docu-
ments. The code then defines a range of possible topic numbers that we want
to evaluate, sets some CTM model parameters, and loops over the range of
topic numbers. For each topic number, the code trains a CTM model using
the LdaModel() function from gensim, computes the coherence score for
the model using the CoherenceModel() function, and appends the coher-
ence score to a list. After computing the coherence scores for all of the topic
numbers, the code plots the coherence scores as a function of the number
of topics using matplotlib.pyplot. This allows us to visually identify the
number of topics that maximizes the coherence score.
Finally, the code selects the optimal number of topics by finding the topic
number that corresponds to the maximum coherence score and prints it to
the console.
Here’s a code that does both things in one step and outputs the topics based on the optimum number of topics according to the coherence score:
import os
from gensim import corpora, models
from gensim.models import CoherenceModel

# Load the preprocessed documents and build the dictionary and corpus
dir_path = 'D:/Data/Preprocessed'
documents = [open(os.path.join(dir_path, f), encoding='utf-8').read().split()
             for f in os.listdir(dir_path) if f.endswith('.txt')]
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Compute the coherence score for each candidate number of topics
topic_range = range(5, 11)
coherence_scores = []
for num_topics in topic_range:
    ctm_model = models.LdaModel(corpus=corpus, id2word=dictionary,
                                num_topics=num_topics, passes=10)
    coherence_model = CoherenceModel(
        model=ctm_model,
        texts=documents,
        dictionary=dictionary,
        coherence='c_v'
    )
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append(coherence_score)

# Select the number of topics with the highest coherence score
optimal_topic_num = list(topic_range)[coherence_scores.index(max(coherence_scores))]

# Train the CTM model on the corpus using the optimal number of topics
ctm_model = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=optimal_topic_num,
    chunksize=2000,
    decay=0.5,
    passes=10,
    eval_every=None,
    iterations=50,
    gamma_threshold=0.001,
    minimum_probability=0.01,
    random_state=None,
    alpha='asymmetric',
    eta=None
)

# Print the topics found with the optimal number of topics
for topic_id, topic in ctm_model.show_topics(num_topics=optimal_topic_num):
    print(topic_id, topic)
import os
from gensim import corpora, models
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt
This code assumes that each text file contains a single document. The code
imports necessary packages for creating and training an HDP model for topic
modeling on a set of preprocessed text files. It then defines the directory where
the preprocessed text files are located using the dir_path variable. A function
called read_documents() is defined to read in the text files from the direc-
tory and tokenize them using the word_tokenize() function from the NLTK
package. The documents variable is created by reading in all the prepro-
cessed text files and tokenizing them using the read_documents() function.
The dictionary variable is created by generating a dictionary mapping words
to unique integer IDs using the Dictionary() function from the Gensim
package. The corpus variable is created by converting the preprocessed text documents into bag-of-words vectors using the doc2bow() method of the dictionary; the HDP model is then trained on this corpus and dictionary.
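A minimal sketch of such an HDP listing (the tokenizer setup and the number of topics printed are assumptions) would be:

import os
import nltk
from nltk.tokenize import word_tokenize
from gensim import corpora
from gensim.models import HdpModel

nltk.download('punkt', quiet=True)
dir_path = 'D:/Data/Preprocessed'

def read_documents(path):
    """Read and tokenize every .txt file in the directory (one document per file)."""
    docs = []
    for file_name in os.listdir(path):
        if file_name.endswith('.txt'):
            with open(os.path.join(path, file_name), 'r', encoding='utf-8') as f:
                docs.append(word_tokenize(f.read()))
    return docs

documents = read_documents(dir_path)
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# HDP infers the number of topics from the data
hdp_model = HdpModel(corpus=corpus, id2word=dictionary)
for topic_id, topic in hdp_model.print_topics(num_topics=10, num_words=10):
    print(topic_id, topic)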
15
Word Embeddings
In recent years, there has been a significant increase in the amount of text
data available for accounting and finance research. This has led to a growing
interest in using text analysis techniques to extract insights and patterns from
large volumes of unstructured data. Word embeddings are a type of vector
representation of words in a corpus that capture the semantic and syntactic
relationships between them. This representation is created by training a
model on a large corpus of text, and the resulting vectors capture information
about the meanings and relationships between words.
Word embeddings are a powerful tool for NLP tasks, such as sentiment
analysis, text classification, and language generation, because they enable
computers to reason about the meaning of words in a way that is similar
to how humans do. By representing words as vectors, we can perform math-
ematical operations on them (such as addition and subtraction) to capture
relationships between words, such as synonymy, antonymy, and analogy.
In this chapter, you will learn how to create and use word embeddings in
Python. We will explore popular models for generating word embeddings,
such as Word2Vec and GloVe, and learn how to visualize and analyze them.
Additionally, you will learn how to use pre-trained embeddings and how to
train your own embeddings on your own corpus of text. By the end of this
chapter, you will have a solid understanding of how to represent words as
vectors, and how to use these representations to perform various NLP tasks.
Once we have these word embeddings, we can use them in various NLP tasks,
such as sentiment analysis, text classification, and topic modeling. By using
word embeddings, we can better capture the meaning of words in context
and improve the accuracy of our NLP models.
It is recommended to normalize text before using it for word embeddings.
Normalization helps to remove noise and inconsistencies in the text, which
can improve the quality of the embeddings. The reason why normalization
is important for word embeddings is that the algorithms used for creating
embeddings rely on statistical patterns in the text. If the text is noisy or incon-
sistent, the algorithms may not be able to accurately capture the relationships
between words, which can lead to poor quality embeddings.
A code to perform the word embedding in a set of text files containing
normalized text, stored in D:/Data/Preprocessed folder, is as follows:
import os
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
This code is written in Python and uses the gensim library to create a
Word2Vec model on a set of preprocessed text files. The first two lines import
the necessary libraries, including os for file handling, and gensim for creating
word embeddings. The third line defines the path to the preprocessed text
files, which is set to ‘D:/Data/Preprocessed’.
The code then defines a function called read_text_files that reads each file
in the folder and yields the preprocessed text as a list of words. This function
takes the path to the text files as an input and uses os.listdir to list all files in
the directory, and then opens each file, reads its contents, and applies simple_
preprocess to convert the text to a list of words. The yield statement generates
a generator object that can be used to iterate over the preprocessed text.
The next line uses the read_text_files function to read the preprocessed
text files and stores them in a list called "sentences". The list contains all the
preprocessed text from the files. The following line trains a Word2Vec model
on the preprocessed text data stored in the "sentences" list. The Word2Vec
function takes several parameters, including vector_size, which sets the
dimensionality of the output vectors, window, which sets the maximum
distance between the current and predicted word within a sentence, min_
count, which sets the minimum number of times a word must appear in the
corpus to be included in the model, and workers, which sets the number of
worker threads to train the model.
Finally, the last line saves the trained Word2Vec model to a file named
‘word2vec.model’. The trained model can then be loaded and used to
explore the relationships between words in the corpus, which can help in
understanding the semantic and syntactic patterns in the text data.
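A minimal sketch consistent with this description (vector_size is set to 100 to match the output discussed below) would be:

import os
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

folder_path = 'D:/Data/Preprocessed'

def read_text_files(path):
    """Yield each preprocessed file as a list of lowercase tokens."""
    for file_name in os.listdir(path):
        if file_name.endswith('.txt'):
            with open(os.path.join(path, file_name), 'r', encoding='utf-8') as f:
                yield simple_preprocess(f.read())

sentences = list(read_text_files(folder_path))

# Train the Word2Vec model on the preprocessed documents
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
model.save('word2vec.model')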
The output can be viewed using the following code:
from gensim.models import Word2Vec
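A minimal continuation of this listing, loading the saved model and querying the word "stock" used in the discussion below, might be:

model = Word2Vec.load('word2vec.model')

# Vector representation of a single word
print(model.wv['stock'])

# The five words closest to 'stock' in the embedding space
print(model.wv.most_similar('stock', topn=5))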
This output is the vector representation of the word "stock" and the top
5 most similar words to "stock" in the Word2Vec model, along with their
cosine similarity scores. The vector representation of "stock" is a list of 100
float values, which corresponds to the dimensions of the vector space used to
train the Word2Vec model. Each value represents the weight or importance
of the corresponding dimension in capturing the meaning of the word. For
example, the value 0.02866101 may correspond to the "financial" dimension,
while the value −0.00567884 may correspond to the "company" dimension.
The list of similar words, along with their similarity scores, indicates the
words that are closest in meaning or context to "stock" in the vector space.
The cosine similarity score is a measure of the angle between the vectors of
the words, where a score of 1 indicates that the vectors are identical, and
a score of 0 indicates that they are completely dissimilar. In this case, the
top 5 similar words to "stock" are "sale", "fiscal", "factor", "statement", and
"walmart", with cosine similarity scores ranging from 0.175 to 0.081.
The model can be saved in CSV format using the following code:
import pandas as pd
from gensim.models import Word2Vec
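A minimal continuation, writing one row per word and one column per embedding dimension (the output file name is an assumption), would be:

model = Word2Vec.load('word2vec.model')

# One row per word, one column per embedding dimension
vectors = pd.DataFrame(model.wv.vectors, index=model.wv.index_to_key)
vectors.to_csv('D:/Data/word2vec_vectors.csv')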
Apart from Word2Vec, the following techniques can also be used for word embeddings:

• FastText: FastText extends Word2Vec by representing each word as a bag of character n-grams, which allows it to build vectors for rare and out-of-vocabulary words and to scale well, particularly for larger datasets. FastText can be a more powerful word embedding technique than Word2Vec in scenarios where sub-word information is important for capturing word meanings, where one word may have many meanings, and where rare or Out-of-Vocabulary (OOV) words are prevalent.

• ELMo: ELMo is a newer word embedding technique that uses deep neural networks
to generate contextualized word embeddings, which can capture the meaning
of words in context. This is in contrast to traditional word embeddings,
which treat each word as a static entity independent of its context. ELMo
embeddings have been shown to improve the accuracy of many natural
language processing tasks, such as question answering and sentiment anal-
ysis. However, one disadvantage of ELMo is that it can be computationally
expensive and may require a large amount of training data to achieve optimal
performance.
In general, Word2Vec is a good starting point for most academic research
applications, as it is a widely used and well-established technique that is rela-
tively easy to implement and tune. Word2Vec can be particularly effective
in identifying patterns in the language used in financial documents, such as
annual reports, financial statements, and news articles. However, the choice of
word embedding technique should be guided by the specific needs and char-
acteristics of the research project, and researchers should carefully consider the
trade-offs between computational efficiency, accuracy, and the specific needs
of their research project when selecting a word embedding technique.
Word embeddings are a powerful tool that enables researchers to better
understand the meaning and relationships between words in a corpus of
text. By representing words as vectors in a dense vector space, word embed-
dings allow researchers to perform mathematical operations on words, such as
addition and subtraction, which can help in capturing relationships between
words, such as synonymy, antonymy, and analogy. This can be particularly
useful for academic researchers in accounting, finance, and other business
fields, who are dealing with large volumes of unstructured text data. By
using word embeddings, researchers can extract valuable insights and patterns
from this data that may not be easily discernible through traditional analysis
methods. Therefore, it is important for academic researchers to understand
the concept of word embeddings and how to apply them in their research
projects.
16
Text Classification
Text classification is the process of automatically assigning documents to predefined categories based on their content, and it can reveal patterns and trends that may not be immediately apparent to the human eye.
In most cases, normalized text is necessary for effective text categorization.
We have already covered normalization in previous chapters. By normalizing
the text data, you can ensure that the same words are represented consistently
across all documents, regardless of their formatting or capitalization. This
can help to improve the accuracy of text categorization models, as they can
more easily identify and compare the key features of each document. Word
embeddings are not always necessary for text categorization, but they can be
very useful in certain cases. For example, if the text data contains synonyms
or words with multiple meanings, word embeddings can help the model to
differentiate between them and accurately categorize the text. Word embed-
dings can also help to capture the context in which words are used, which can
be useful for categorizing text based on specific topics or themes. However,
it’s important to note that creating and using word embeddings requires
significant computational resources and may not be necessary for all text cate-
gorization tasks. In some cases, simpler feature engineering techniques, such
as bag-of-words or tf-idf (term frequency-inverse document frequency), may
be sufficient for accurately categorizing text. So, while word embeddings can
be a powerful tool for improving text categorization accuracy, their use should
be evaluated on a case-by-case basis depending on the specific requirements
and characteristics of the text data.
There are several text classification techniques that can be used for auto-
matically categorizing text documents based on their content. Here are some
of the most common techniques:
Naive Bayes
Naive Bayes is a simple but effective text classification algorithm that is based
on Bayes’ theorem. It works by calculating the probability of each class given
the words in the document, and then selecting the class with the highest
probability.
Here’s a code for Naïve Bayes text classification:
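A minimal sketch of such a classifier is shown below. The folder layout (normalized .txt files grouped in subfolders named after their class) and the D:/Data/Preprocessed path are assumptions, and the model is deliberately evaluated on the same documents it was trained on, as discussed next.

import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical layout: normalized .txt files, labels taken from subfolder names
data_path = 'D:/Data/Preprocessed'
texts, labels = [], []
for root, _, files in os.walk(data_path):
    for name in files:
        if name.endswith('.txt'):
            with open(os.path.join(root, name), 'r', encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(os.path.basename(root))

# Bag-of-words features and a multinomial Naive Bayes classifier
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

# Evaluating on the same documents (as discussed below, this overstates performance)
predictions = clf.predict(X)
print('Accuracy:', accuracy_score(labels, predictions))
print(confusion_matrix(labels, predictions))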
An accuracy of 1.0 means that the model correctly classified all of the test
documents. The confusion matrix shows the number of true positives, true
negatives, false positives, and false negatives for the test set. Since you only
have one category, the confusion matrix is a 1 × 1 matrix with a single
value of 1, which indicates that the model correctly classified all of the test
documents as belonging to the single category.
While a perfect accuracy may seem desirable, it’s important to note that
the model was trained and tested on the same dataset, which means that
it may not generalize well to new, unseen data. In practice, it’s common
to use cross-validation or holdout testing to evaluate the performance of a
text classification model on new data. Moreover, it’s important to consider
other performance metrics, such as precision, recall, and F1-score, which can
provide a more comprehensive evaluation of the model’s performance.
An example of a code that takes training data from D:/Data/Machine/Training and
makes predictions on test data in D:/Data/Machine/Test is as follows:
The script uses the sklearn library for building the classifier and nltk
library for handling stopwords. It loads training and test data from speci-
fied directories and evaluates the classifier’s performance using accuracy and
a classification report. The code starts by importing the necessary libraries.
1. Accuracy: The overall accuracy of the model is 0.85, which means the
model correctly predicted the class labels for 85% of the test instances.
This is a good accuracy score, indicating that the model has performed
well on the test dataset.
2. Precision, Recall, and F1-score: These are class-wise performance
metrics:
• Class 1: Precision is 0.90, recall is 0.82, and F1-score is 0.86. There
were 100 instances of this class in the test dataset.
• Class 2: Precision is 0.87, recall is 0.89, and F1-score is 0.88. There
were 120 instances of this class in the test dataset.
• Class 3: Precision is 0.80, recall is 0.84, and F1-score is 0.82. There
were 90 instances of this class in the test dataset.
• Class 4: Precision is 0.83, recall is 0.80, and F1-score is 0.81. There
were 110 instances of this class in the test dataset.
• Class 5: Precision is 0.85, recall is 0.88, and F1-score is 0.86. There
were 105 instances of this class in the test dataset.
• Class 6: Precision is 0.88, recall is 0.85, and F1-score is 0.86. There
were 95 instances of this class in the test dataset.
The precision, recall, and F1-score values for all classes are relatively high,
indicating that the model performed well on each class.
3. Macro Avg and Weighted Avg: These are the average values of precision,
recall, and F1-score across all classes:
• Macro Avg: The unweighted mean of the class-wise metrics is 0.85 for
precision, recall, and F1-score.
• Weighted Avg: The average of class-wise metrics, weighted by the
number of instances in each class (support), is also 0.85 for precision,
recall, and F1-score.
This output represents a well-performing text classification model with good
accuracy and class-wise performance. The model has been successful in
predicting class labels across all classes, as reflected by the high precision,
recall, and F1-score values.
Support Vector Machines (SVMs): SVMs are a type of machine learning
algorithm that can be used for text classification. They work by finding the
best boundary (hyperplane) between different classes of data, which can then
be used to classify new data points.
An example of a code that provides SVM text classification is as follows:
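A sketch that follows the description given after it is shown here; the data_path value and the .txt extension filter are assumptions.

import os
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

def load_data_from_directory(directory):
    """Load preprocessed text files; the label is the file name without its extension."""
    texts, labels = [], []
    for file_name in os.listdir(directory):
        if file_name.endswith('.txt'):
            with open(os.path.join(directory, file_name), 'r', encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(os.path.splitext(file_name)[0])
    return texts, labels

data_path = 'D:/Data/Preprocessed'
texts, labels = load_data_from_directory(data_path)
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Vectorization, TF-IDF weighting, and a linear-kernel SVM in one pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SVC(kernel='linear')),
])
text_clf.fit(X_train, y_train)

predicted = text_clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predicted))
print(classification_report(y_test, predicted))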
This Python code performs text classification using a Support Vector Machine
(SVM) classifier. It starts by importing necessary libraries and defining a func-
tion named load_data_from_directory that loads preprocessed text files and
their labels from a given directory. The function reads the content of each file
and extracts the file name without the extension as the label.
After defining the function, the data_path variable is set to the directory
containing the preprocessed text files. The code loads the text data and corre-
sponding labels using the load_data_from_directory function. The loaded
data is then split into training and test sets using the train_test_split function
from the sklearn.model_selection module. The function takes the loaded
texts and labels as input and randomly splits them into training and test
subsets, with 20% of the data used for testing and the remaining 80% used
for training.
Next, a pipeline named text_clf is created for text preprocessing and clas-
sification. The pipeline consists of three steps: vectorization, term frequency-
inverse document frequency (TF-IDF) transformation, and classification
using the SVM classifier. The CountVectorizer is responsible for converting
text data into a bag-of-words representation, while the TfidfTransformer
calculates the TF-IDF scores for the words. The SVC classifier with a linear
kernel is then used to classify the documents based on the processed text
features.
Once the pipeline is defined, the classifier is trained on the training data
using the fit method. The trained classifier is then used to predict the
class labels for the test data using the predict method. Finally, the accu-
racy of the classifier is calculated using the accuracy_score function from
sklearn.metrics, and the classification report is generated using the classifica-
tion_report function. The accuracy score and classification report are printed
to provide an overview of the classifier’s performance on the test dataset.
The output will be similar to the accuracy score and classification report shown
earlier. In this code, a single group of text files is randomly divided into
training and test data by the train_test_split call. If we want to provide the
training and test data manually, then the relevant code would be:
This code loads text and labels from specified directories, transforms the text
into numerical features using TF-IDF, trains an SVM classifier on the training
data, makes predictions on the test data, and evaluates the classifier’s perfor-
mance by calculating its accuracy and printing a classification report. The
output will be similar to earlier examples.
Decision Trees
Decision trees are another type of machine learning algorithm that can be
used for text classification. They work by recursively splitting the data into
smaller subsets based on the most informative features (words) until a deci-
sion can be made about the class of the document. A code that performs text
classification on normalized text files by randomly classifying text files into
training and test files, using decision trees is as follows:
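A sketch along these lines, assuming normalized .txt files stored in subfolders named after their class under D:/Data/Preprocessed (both assumptions), is:

import os
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical layout: normalized .txt files in subfolders named after their class
data_path = 'D:/Data/Preprocessed'
texts, labels = [], []
for root, _, files in os.walk(data_path):
    for name in files:
        if name.endswith('.txt'):
            with open(os.path.join(root, name), 'r', encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(os.path.basename(root))

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# TF-IDF features and a decision tree classifier
vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
clf = DecisionTreeClassifier(random_state=42).fit(X_train_vec, y_train)

y_pred = clf.predict(X_test_vec)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))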
Similarly, the code for manually assigning training and test data is as follows:
Random Forests
Random forests are an ensemble learning technique that combines multiple
decision trees to improve the accuracy of the classification. Each decision tree
is trained on a random subset of the data, and the final classification is based
on the majority vote of all the trees. Here’s an example of a code to perform
text classification using random forests. The code randomly divides the text
files into training and test data.
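A sketch following that outline is shown below; it mirrors the decision-tree sketch above, with the classifier swapped for a random forest (the folder layout and path remain assumptions).

import os
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical layout: normalized .txt files in subfolders named after their class
data_path = 'D:/Data/Preprocessed'
texts, labels = [], []
for root, _, files in os.walk(data_path):
    for name in files:
        if name.endswith('.txt'):
            with open(os.path.join(root, name), 'r', encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(os.path.basename(root))

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# TF-IDF features and a random forest classifier
vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train_vec, y_train)

y_pred = clf.predict(X_test_vec)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))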
This code first loads preprocessed text data from the specified directory and
splits it into training and test sets using the train_test_split function. Then,
it converts the text data into a numerical format using the TF-IDF (term
frequency-inverse document frequency) vectorizer. The code proceeds to train
a random forest classifier on the training set and evaluates its performance
by predicting labels for the test set. Finally, it prints the accuracy score and a
classification report that includes precision, recall, and F1-score for each class.
The code to manually allocate training data and test data is as follows:
Deep Learning
Deep learning techniques, such as Convolutional Neural Networks (CNNs)
and Recurrent Neural Networks (RNNs), have shown promising results in
text classification. These algorithms are able to learn more complex repre-
sentations of the text data and can be trained on very large datasets. Deep
learning has emerged as a powerful tool in the field of text classification,
with the potential to improve accuracy and efficiency in tasks such as senti-
ment analysis, topic classification, and document categorization. In academic
accounting and finance research, deep learning techniques have been used
to classify financial news articles based on sentiment, to predict stock prices
based on text data, and to identify fraudulent financial statements. While
deep learning methods require large amounts of labeled data and computa-
tional resources, they have shown promising results in improving the accuracy
of text classification tasks. In a separate full chapter, we will delve deeper into
the concepts and applications of deep learning in text classification.
These are just a few examples of the many text classification techniques that
are available. The choice of which technique to use depends on the specific
characteristics of the text data, such as the size of the dataset, the number of
classes, and the complexity of the classification task.
In this chapter, we explored various techniques for text classification, ranging
from traditional machine learning models such as Naive Bayes, Support Vector
Machines, Decision Trees, and Random Forests to deep learning methods. Each of
these models has its own strengths and weaknesses, and the choice of model
depends on the specific requirements of the task. Word embeddings and text
classification are essential tools in natural language processing and have
a wide range of applications in various industries. With the advancement
of machine learning and deep learning techniques, we can expect further
improvements in the accuracy of text classification models, which will lead
to more effective and efficient solutions for real-world problems.
17
Sentiment Analysis
Each technique has its own strengths and weaknesses, and the choice of tech-
nique depends on the specific requirements and constraints of the project. A
combination of different techniques may also be used for better accuracy and
performance. A detailed discussion of these methods follows:
Rule-Based Methods
Rule-based methods of sentiment analysis involve using a set of predefined
rules to classify text data into positive, negative, or neutral sentiment. These
methods are relatively simple and easy to implement, but they may not be as
accurate as more advanced techniques.
• Keyword Matching
This method involves creating a list of positive and negative keywords and
matching them against the text data. For example, the presence of words like
"good", "excellent", "happy", and "satisfied" might indicate positive senti-
ment, while words like "bad", "terrible", "disappointed", and "frustrated"
might indicate negative sentiment.
Suppose you are conducting a sentiment analysis of news articles related
to a particular company’s quarterly earnings report. You want to identify the
sentiment of the articles as either positive, negative, or neutral. You can use a
keyword matching approach to classify the sentiment of the articles based on
the presence of positive or negative keywords. For example, you might create
a list of positive keywords such as "record earnings", "strong growth", and
"positive outlook", and negative keywords such as "disappointing results",
"weak performance", and "negative guidance". You could then search for the
presence of these keywords in the text of the news articles and use this infor-
mation to classify the sentiment of each article. If an article contains more
positive keywords than negative keywords, you might classify it as having a
positive sentiment. Conversely, if an article contains more negative keywords
than positive keywords, you might classify it as having a negative sentiment.
If the article contains an equal number of positive and negative keywords,
you might classify it as having a neutral sentiment.
While this approach is relatively simple and easy to implement, it may
not be as accurate as more advanced techniques such as machine learning or
deep learning models. Additionally, the choice of keywords may be subjective
and may not capture the full range of sentiment expressed in the text. There-
fore, it is important to carefully evaluate the accuracy and effectiveness of the
keyword matching approach before using it in a sentiment analysis project.
Here’s a code that performs a sentiment analysis on all the text files located
in D:/Data/Preprocessed directory and its subdirectories.
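A simple sketch of this approach is shown below. The keyword lists reuse the examples above, and matching is done by substring counts, which is an assumption about how such a script would work rather than part of the original listing.

import os

# Hypothetical keyword lists, taken from the discussion above
positive_keywords = ['record earnings', 'strong growth', 'positive outlook']
negative_keywords = ['disappointing results', 'weak performance', 'negative guidance']

dir_path = 'D:/Data/Preprocessed'
for root, _, files in os.walk(dir_path):
    for name in files:
        if not name.endswith('.txt'):
            continue
        with open(os.path.join(root, name), 'r', encoding='utf-8') as f:
            text = f.read().lower()
        pos = sum(text.count(k) for k in positive_keywords)
        neg = sum(text.count(k) for k in negative_keywords)
        if pos > neg:
            sentiment = 'Positive'
        elif neg > pos:
            sentiment = 'Negative'
        else:
            sentiment = 'Neutral'
        print(name, sentiment)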
You can also load a sentiment dictionary saved as a CSV file into the code and
use it for sentiment analysis, as below:
The sentiment dictionary is read from a CSV file located in the D:/Data
directory, and a function is defined to classify the sentiment of a text based
on the sentiment dictionary.
Then, a TfidfVectorizer object is created to convert the text data into a matrix
of TF-IDF features, which is then used to fit and transform the text data into
a matrix of TF-IDF features. Next, the number of topics to extract is defined
and NMF is performed on the TF-IDF matrix to extract the topics. The top
words for each topic are then printed.
The sentiment dictionary is loaded from the CSV file Sentiment_
Dict.csv. A function classify_sentiment is defined to classify the sentiment
of a text based on the sentiment dictionary. This function takes in a text
string and the sentiment dictionary as input, and returns a sentiment label
(Positive, Negative, or Neutral) based on the sentiment score of the text. For
each topic, the top words are extracted and concatenated into a text string.
The classify_sentiment function is then used to determine the sentiment of
the topic. The topic text and sentiment are printed. The output is saved to a
CSV file. For each file in the directory, the topic scores and document senti-
ment are computed and stored in a dataframe. This dataframe is then saved
to a CSV file called Output.CSV in the D:/Data directory.
• Linguistic Rules:
This method involves using linguistic rules to identify sentiment in text data.
For example, negation words like "not" or "never" can change the sentiment
of a sentence, so a rule might be created to reverse the sentiment if a nega-
tion word is present. So, here’s an example of a linguistic rule for sentiment
analysis that could be used in academic research, particularly in the fields of
accounting and finance:
Linguistic Rule: If a sentence contains a negation word, such as "not" or
"never", the sentiment should be reversed.
Example: "The company’s financial performance was not good".
Using the linguistic rule, we would reverse the sentiment of the sentence,
so it would be classified as negative sentiment rather than neutral or positive
sentiment.
This rule is especially relevant in accounting and finance research, where
sentiment analysis is commonly used to analyze financial news articles or
earnings call transcripts. Negation words are often used to modify the senti-
ment of a sentence, and failing to account for them can lead to inaccurate
sentiment analysis results. By using a linguistic rule to account for negation
words, researchers can more accurately identify the sentiment expressed in
financial text data.
Linguistic rules are generally combined with dictionaries. An example of a
code of linguistic rule and dictionary usage is as follows:
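A sketch of one way to combine a negation rule with a sentiment dictionary is given below. The CSV layout (words in the first column, scores between −1 and 1 in the second, no header row) follows the description that comes after the code, while the Sentiment_Dict.csv name and the particular set of negation words are assumptions.

import os
import pandas as pd

# Load a sentiment dictionary: first column words, second column scores between -1 and 1
sentiment_dict = pd.read_csv('D:/Data/Sentiment_Dict.csv', header=None, index_col=0)[1].to_dict()
negation_words = {'not', 'never', 'no'}

def classify_sentiment(text, sentiment_dict):
    words = text.lower().split()
    scores = []
    for i, word in enumerate(words):
        if word in sentiment_dict:
            score = sentiment_dict[word]
            # Linguistic rule: reverse the score if the previous word is a negation
            if i > 0 and words[i - 1] in negation_words:
                score = -score
            scores.append(score)
    overall = sum(scores) / len(scores) if scores else 0.0
    if overall > 0:
        return 'Positive'
    elif overall < 0:
        return 'Negative'
    return 'Neutral'

for root, _, files in os.walk('D:/Data/Preprocessed'):
    for name in files:
        if name.endswith('.txt'):
            with open(os.path.join(root, name), 'r', encoding='utf-8') as f:
                print(name, classify_sentiment(f.read(), sentiment_dict))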
When a sentiment word is preceded by a negation word, its score is reversed. The
overall sentiment score for the text is
then calculated as the average of the sentiment scores for all words in the text.
Finally, the sentiment is classified based on the overall sentiment score.
Note that the sentiment dictionary used in this code should have words
in lowercase as the text is converted to lowercase while processing. Also, the
sentiment dictionary should be formatted as a CSV file with two columns:
the first column should contain the words, and the second column should
contain their corresponding sentiment scores. The sentiment scores can be
positive, negative, or neutral, with values between –1 and 1.
This code can be combined with the earlier code with topic modeling in
the following way:
Lexicon-Based Methods
Lexicon-based methods for sentiment analysis involve using a pre-defined
dictionary or lexicon of words and their corresponding sentiment scores to
determine the overall sentiment of a text. For example, a word like "love"
might have a high positive sentiment score, while a word like "hate" might
have a high negative sentiment score. The sentiment score of a text is calcu-
lated by summing the sentiment scores of all the words in the text. The
sentiment scores of the words can be based on a number of criteria such
as the word’s connotation, frequency, or context.
One popular lexicon used for sentiment analysis is the AFINN lexicon,
which contains a list of words and their corresponding sentiment scores
ranging from –5 (most negative) to 5 (most positive). The sentiment scores
in the AFINN lexicon are based on the valence of the word, or the degree
to which the word expresses a positive or negative emotion. Other sentiment
lexicons are also available.
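A sketch of an AFINN-based scoring loop follows; the directory path and the .txt filter are assumptions.

import os
from afinn import Afinn

afinn = Afinn()
dir_path = 'D:/Data/Preprocessed'

for file_name in os.listdir(dir_path):
    if file_name.endswith('.txt'):
        with open(os.path.join(dir_path, file_name), 'r', encoding='utf-8') as f:
            text = f.read()
        # afinn.score sums the AFINN word scores across the document
        score = afinn.score(text)
        print(file_name, score)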
In this code, we first import the necessary packages, including the AFINN
lexicon and the "os" package for directory operations. We then create an
instance of the AFINN lexicon using the Afinn() function. Next, we define
the directory path where the preprocessed text files are located using the dir_
path variable. We then loop through all the files in the directory using the
os.listdir() function and read the contents of each file using the open()
function. We then calculate the sentiment score of each file using the
afinn.score() function and print the filename and sentiment score using the
print() function. Note that the sentiment score returned by the AFINN
lexicon ranges from –5 (most negative) to 5 (most positive), with 0 indicating
neutral sentiment. You can modify the code to save the sentiment scores to
a file or perform additional analysis based on the sentiment scores. The
aggregate sentiment of a document is the sum of the AFINN scores of all the
words it contains, which is what afinn.score returns. The output lists each
file name with its sentiment score.
The output shows the negative, positive, and neutral sentiment scores
separately; the compound score gives the overall sentiment of the text.
These datasets are just a few examples of the many publicly available financial
training datasets that can be used for machine learning tasks. When selecting
a dataset for your project, it’s important to consider factors such as the size of
the dataset, the quality and accuracy of the labeling, and the relevance of the
data to your specific task.
Here are some of the most commonly used machine learning algorithms
for sentiment analysis:
• Naive Bayes:
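A sketch of such a model follows; the FinancialPhraseBank file path and its column names ('text' and 'sentiment') are assumptions about how the dataset has been saved locally.

import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical layout: a CSV with 'text' and 'sentiment' columns
train = pd.read_csv('D:/Data/FinancialPhraseBank.csv')
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train['text'])
clf = MultinomialNB().fit(X_train, train['sentiment'])

# Predict the sentiment of each preprocessed file using the same vectorizer
folder = 'D:/Data/Preprocessed'
predictions = []
for root, _, files in os.walk(folder):
    for name in files:
        if name.endswith('.txt'):
            with open(os.path.join(root, name), 'r', encoding='utf-8') as f:
                X_new = vectorizer.transform([f.read()])
            label = clf.predict(X_new)[0]
            predictions.append((name, label))
            print(name, label)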
The code loads the FinancialPhraseBank dataset and trains the Naive Bayes
algorithm on the dataset using a bag-of-words matrix representation of the
text data. The code then defines the path to a folder containing preprocessed
text files for testing. After this, the code loops through each file in the folder,
converts the preprocessed text data to a bag-of-words matrix representation
using the same vectorizer as before, predicts the sentiment label of the text
data using the trained Naive Bayes model, and stores the predicted sentiment
label in a list. Finally, the code prints the predicted sentiment label of each
file to the console.
The model, however, needs to be fine-tuned before final implementation.
The process for fine-tuning a Naive Bayes model for sentiment analysis using
scikit-learn’s GridSearchCV and Pipeline involves several steps. The labeled
dataset should be loaded into memory, and split into training and testing
sets. A pipeline object should be defined that consists of data preprocessing
steps and a Naive Bayes algorithm. A dictionary of hyperparameters should be
defined to be tuned during the grid search. A GridSearchCV object should
be defined that performs a grid search over the hyperparameters and uses
cross-validation to evaluate the performance of the pipeline on the training
set. The GridSearchCV object should be fit to the training data using the fit()
method to find the best hyperparameters. The performance of the trained
model should be evaluated on the testing set using metrics such as accuracy,
precision, recall, and F1-score. The trained model can be used to predict the
sentiment of new, unseen text data by applying the pipeline to the raw text
data. This process can be implemented as follows:
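A sketch of this fine-tuning workflow is shown below; the hyperparameter ranges are illustrative assumptions, and the FinancialPhraseBank layout matches the assumption used in the previous sketch.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

train = pd.read_csv('D:/Data/FinancialPhraseBank.csv')
X_train, X_test, y_train, y_test = train_test_split(
    train['text'], train['sentiment'], test_size=0.2, random_state=42)

# Preprocessing and the Naive Bayes classifier combined in one pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

# Hyperparameters to search over (illustrative values)
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 3],
    'nb__alpha': [0.1, 0.5, 1.0],
}

# Cross-validated grid search, then evaluation on the held-out test set
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))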
The code loads a training dictionary (in this case, the Financial PhraseBank),
vectorizes the training data using the TF-IDF algorithm, and then initializes
an SVM classifier with a linear kernel. It then loops through all the text files in
the specified folder and its subfolders, reads in each text file, vectorizes it using
the same TF-IDF vectorizer as the training data, and predicts the sentiment
using the SVM classifier. Finally, it prints out the predicted sentiment for
each text file. Note that this is just a basic example code and you may need
to modify it to suit your specific requirements.
To fine-tune your SVM model, you can adjust the hyperparameters of
the SVM classifier or try different kernels to see which one works best for
your dataset. Here are some of the hyperparameters you can tune in the
svm.SVC() function:
• C: The penalty parameter of the error term. A higher value of C will result
in a more complex decision boundary, which may lead to overfitting. You
can try different values of C to find the optimal value that balances between
overfitting and underfitting.
• gamma: The kernel coefficient for "rbf", "poly", and "sigmoid" kernels. A
higher value of gamma will result in a more complex decision boundary,
which may lead to overfitting. You can try different values of gamma to
find the optimal value that balances between overfitting and underfitting.
• kernel: The type of kernel to use. You can try different kernels such as
linear, polynomial, or radial basis function (RBF) to see which one works
best for your dataset.
To fine-tune your model, you can use techniques such as grid search or
randomized search to search over a range of hyperparameters and find the
optimal combination that maximizes the model’s performance. Grid search
involves exhaustively searching over a pre-defined range of hyperparame-
ters, while randomized search randomly samples hyperparameters from a
pre-defined distribution.
Here’s an example code that shows how to use grid search to fine-tune an
SVM model:
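A sketch of such a grid search is given below; the parameter ranges are illustrative assumptions, and X_train and y_train are the training texts and labels prepared as in the earlier examples.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# TF-IDF vectorizer and SVM classifier combined in one pipeline
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svm', SVC()),
])

# Search over vectorizer and classifier hyperparameters (illustrative ranges)
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'svm__C': [0.1, 1, 10],
    'svm__gamma': ['scale', 0.1, 1],
    'svm__kernel': ['linear', 'rbf'],
}

grid_search = GridSearchCV(text_clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print('Best parameters:', grid_search.best_params_)
print('Best cross-validation accuracy:', grid_search.best_score_)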
This code defines a pipeline that consists of a TF-IDF vectorizer and an SVM
classifier, and uses grid search to search over a range of hyperparameters for
both the vectorizer and the classifier. The GridSearchCV() function performs
an exhaustive search over the specified parameter grid and uses cross-validation
to score each combination of hyperparameters.
This code initializes an SVM classifier with the best hyperparameters that
were obtained from fine-tuning, and then fits the classifier to the entire
training data. It then loops through all the text files in the specified folder
and its subfolders, reads in each text file, vectorizes it using the same TF-IDF
vectorizer as the training data, and predicts the sentiment using the fine-tuned
SVM classifier. Finally, it prints out the predicted sentiment for each text file.
Instead of printing, you can modify this code to save the sentiments as csv
file.
More often, you need to perform sentiment analysis at the topic level rather
than at the document level. Here's a code that first creates topics from
the normalized text data and then performs sentiment analysis on each topic
using the SVM-based process above. The aggregate sentiment for the whole
document is the sum of the sentiments of all its topics.
• Random Forest:
This code uses the pandas library to load the Financial PhraseBank training
dictionary from the specified file path, and the sklearn library to vectorize
the text data using the CountVectorizer class and train the Random Forest
model using the RandomForestClassifier class. It then uses the os library
to iterate through all the preprocessed text files in the specified folder and
subfolders, reads each file using the open function, vectorizes the text data
using the same CountVectorizer object used for training, and predicts the
sentiment using the trained Random Forest model. The predicted sentiment
is then printed to the console.
You can again fine-tune this model using GridSearchCV() and apply the
best model to the preprocessed text data using the code below:
In this code, we perform grid search to find the best hyperparameters for the
Random Forest model using the GridSearchCV function, and save the best
model found using joblib.dump to the specified path model_path. We then
load the best model using joblib.load and use it to predict the sentiment
of each preprocessed file in the specified folder and subfolders, printing the
predicted sentiment to the console. You can modify this code to save the
sentiments to csv file also.
Similarly, a topic modeling approach to random forest sentiment analysis
would be:
In this code, we first load the training dictionary and split it into training
and testing sets. We create a TfidfVectorizer object to convert the text data
to numerical features and fit it on the training data. We then create a Latent-
DirichletAllocation model to extract topics from the text data and fit it on
the training data.
• Logistic Regression:
Here, we first load the Financial PhraseBank dataset and split it into training
and testing sets. We then create a TfidfVectorizer object to convert the text
data to numerical features and fit it on the training data. We create a Logis-
ticRegression model and fit it on the training data. We evaluate the model
on the testing data and print the accuracy. Finally, we classify the text files
in the specified folder and subfolders using the trained model and print the
predicted sentiment for each file.
These are just a few examples of the many machine learning algo-
rithms that can be used for sentiment analysis. Each algorithm has its own
strengths and weaknesses, and the choice of algorithm depends on the specific
requirements and characteristics of the research project.
18
Basic Regression
Linear Regressions
Linear regression is a fundamental method in statistical analysis, which is
widely used to model the relationships between a dependent variable and
one or more independent variables. However, linear regression has several
assumptions and limitations that need to be considered before using this
technique. The first assumption is linearity, which means that the relation-
ship between the dependent and independent variables should be linear. The
second assumption is independence, which assumes that the observations are
independent of each other. The third assumption is homoscedasticity, which
means that the variance of the error term is constant across all levels of the
independent variable. The fourth assumption is normality, which assumes
that the error term follows a normal distribution. These assumptions are
important to ensure the accuracy and validity of the results obtained from
simple linear regression. Failure to meet these assumptions can lead to inac-
curate or biased results. Therefore, it is important to carefully consider these
assumptions before using linear regression.
import pandas as pd
import statsmodels.api as sm

# Load the data and define the dependent and independent variables
data = pd.read_csv('D:/Data/my_data.csv')
X = data['ETH']
Y = data['BTC']

# Add an intercept term and fit the OLS model
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()
print(model.summary())
This Python code uses the pandas library to read in a CSV file containing data
on two variables, "ETH" and "BTC". It then uses the statsmodels.api library
to perform a simple linear regression analysis on the data, where "ETH" is the
independent variable and "BTC" is the dependent variable. The X variable is
created by selecting the "ETH" column from the dataset and the Y variable
is created by selecting the "BTC" column from the dataset. The X variable is
then augmented with a constant term using the add_constant function from
the statsmodels.api library. The OLS function from the statsmodels.api
library is then used to fit the linear regression model with Y as the depen-
dent variable and X as the independent variable, and the resulting model is
stored in the "model" variable. Finally, the code prints out a summary of
the regression analysis, which includes information about the model’s coeffi-
cients, statistical significance, goodness-of-fit measures, and other important
diagnostic statistics.
The result will look like:
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The results also report diagnostic tests that many other statistical packages
do not produce automatically when you run a regression. Here,
the Omnibus test and Jarque–Bera test indicate that the residuals are not
normally distributed, and the Durbin-Watson test suggests that there may
be some autocorrelation in the residuals. The condition number of 1160
suggests that there might be strong multicollinearity or other numerical prob-
lems. To interpret the results, we can also print the coefficients and intercept
separately:
print('Intercept:', model.params[0])
print('Coefficients:', model.params[1])
To see the scatterplot with regression line the following code will be useful:
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('D:/Data/my_data.csv')
X = data['ETH']
Y = data['BTC']
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()

# Scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='ETH', y='BTC', data=data, label='Data Points')

# Regression line
reg_line_x = X['ETH']
reg_line_y = model.params[0] + model.params[1] * reg_line_x
plt.plot(reg_line_x, reg_line_y, color='red', label='Regression Line')

plt.xlabel('ETH')
plt.ylabel('BTC')
plt.legend()
plt.show()
This code imports the necessary libraries, loads the dataset, defines the depen-
dent and independent variables, and fits an OLS regression model. It then
creates a scatterplot using seaborn.scatterplot and adds the regression line
to the plot using the model’s parameters. The output will
look like:
Multiple Regression
The corresponding statsmodels summary for the multiple regression (with ETH, BNB, and ADA as regressors) again notes that the standard errors assume a correctly specified covariance matrix and reports a large condition number (6.66e+03), which might indicate strong multicollinearity or other numerical problems.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Multiple regression with three independent variables using scikit-learn
data = pd.read_csv('D:/Data/my_data.csv')
X = data[['ETH', 'BNB', 'ADA']]
Y = data['BTC']
model = LinearRegression().fit(X, Y)
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)
You can export the results in a publication quality table in a word file with
the code below:
import pandas as pd
import statsmodels.api as sm
import os
from docx import Document
from docx.shared import Inches
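A minimal sketch of one way to write the regression results into a Word file with python-docx is shown below; it simply embeds the statsmodels summary rather than building a fully formatted journal-style table, and the file names are assumptions.

import pandas as pd
import statsmodels.api as sm
from docx import Document

# Fit the multiple regression with statsmodels to obtain a full summary
data = pd.read_csv('D:/Data/my_data.csv')
X = sm.add_constant(data[['ETH', 'BNB', 'ADA']])
model = sm.OLS(data['BTC'], X).fit()

# Write the summary text into a Word document
doc = Document()
doc.add_paragraph(str(model.summary()))
doc.save('D:/Data/regression_results.docx')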
Overfitting occurs when a model fits the training data too well but performs
poorly on new data. Regularization methods can help address this issue by
shrinking the coefficients of the predictors and reducing their variance.
from sklearn.linear_model import Lasso, Ridge

# Regularized fits on the same X and Y as in the multiple regression above
lasso_model = Lasso(alpha=0.1).fit(X, Y)
print('LASSO Coefficients:', lasso_model.coef_)
ridge_model = Ridge(alpha=0.1).fit(X, Y)
print('Ridge Coefficients:', ridge_model.coef_)
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

data = pd.read_csv('D:/Data/my_data.csv')
X = data[['ETH', 'BNB', 'ADA']]
Y = data['BTC']
model = LinearRegression().fit(X, Y)
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)

# 3D scatterplot of BTC against ETH and BNB
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data['ETH'], data['BNB'], Y)

# Regression plane based on the ETH and BNB coefficients (ADA's effect is not shown)
xx, yy = np.meshgrid(data['ETH'].unique(), data['BNB'].unique())
zz = model.intercept_ + model.coef_[0] * xx + model.coef_[1] * yy
ax.plot_surface(xx, yy, zz, alpha=0.3, color='red')

ax.set_xlabel('ETH')
ax.set_ylabel('BNB')
ax.set_zlabel('BTC')
plt.show()
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import plotly.express as px
import plotly.graph_objects as go

data = pd.read_csv('D:/Data/my_data.csv')
X = data[['ETH', 'BNB', 'ADA']]
Y = data['BTC']
model = LinearRegression().fit(X, Y)
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)

# Interactive 3D scatterplot of BTC against ETH and BNB
fig = px.scatter_3d(data, x='ETH', y='BNB', z='BTC')

# Regression plane based on the ETH and BNB coefficients
xx, yy = np.meshgrid(data['ETH'].unique(), data['BNB'].unique())
zz = model.intercept_ + model.coef_[0] * xx + model.coef_[1] * yy
fig.add_trace(go.Surface(x=xx, y=yy, z=zz, opacity=0.5, showscale=False))

fig.update_layout(scene=dict(xaxis_title='ETH', yaxis_title='BNB', zaxis_title='BTC'))
fig.show()
Regression Diagnostics
Linear regression relies on several key assumptions that must be true for the
results to be meaningful and accurate. If any of these assumptions are incor-
rect, the results of the regression may be misleading or incorrect. Therefore,
it’s important to check these assumptions before using linear regression.
This can be done through regression diagnostic techniques that test for
linearity, multicollinearity, heteroscedasticity, autocorrelation, and
normality of residuals. By verifying these assumptions, we can ensure
that the results of the linear regression are valid and reliable.
Before diving into the diagnostics, let’s load the data D:/Data/my_data.csv
and perform a multiple linear regression using the given data.
import pandas as pd
import statsmodels.api as sm
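Continuing from these imports, the diagnostics below assume a fitted baseline model; a minimal version of that fit, reusing the variable names from the earlier examples, is:

# Load the data and fit the baseline multiple regression used in the checks below
data = pd.read_csv('D:/Data/my_data.csv')
X = sm.add_constant(data[['ETH', 'BNB', 'ADA']])
Y = data['BTC']
model = sm.OLS(Y, X).fit()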
Linearity Check
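One simple way to check linearity is to plot each independent variable against the dependent variable; a sketch using matplotlib, continuing from the baseline fit above, is:

import matplotlib.pyplot as plt

# Scatter plot of each independent variable against the dependent variable
for col in ['ETH', 'BNB', 'ADA']:
    plt.figure()
    plt.scatter(data[col], Y)
    plt.xlabel(col)
    plt.ylabel('BTC')
    plt.show()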
The output will show the scatter plots of all the X variables and Y vari-
ables separately. If the relationships appear to be non-linear, you can consider
transforming the independent variables using functions such as logarithmic,
square root, or inverse transformations.
Multicollinearity
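Multicollinearity is commonly assessed with variance inflation factors (VIFs); values above about 10 are often taken as a warning sign. A sketch using statsmodels, continuing from the design matrix X of the baseline fit above, is:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each column of the design matrix X from the baseline fit
vif = pd.DataFrame()
vif["feature"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]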
print(vif)
Heteroscedasticity
from statsmodels.stats.diagnostic import het_breuschpagan

bp_test = het_breuschpagan(model.resid, X)
print("Breusch-Pagan test p-value:", bp_test[1])
The output will show Breusch–Pagan test p-value. A p-value less than 0.05
indicates the presence of heteroscedasticity. If heteroscedasticity is present,
you can consider using robust standard errors or weighted least squares
regression.
Autocorrelation
Autocorrelation occurs when the residuals are correlated across time. It does
not bias the coefficient estimates themselves, but it makes the usual standard
errors unreliable and the estimates inefficient. To test for autocorrelation, we
can use the Durbin–Watson test from the statsmodels library; a statistic near 2
indicates no autocorrelation, while values toward 0 or 4 point to positive or
negative autocorrelation, respectively.
from statsmodels.stats.stattools import durbin_watson
dw = durbin_watson(model.resid)
print("Durbin-Watson statistic:", dw)
Normality of Residuals
The normality assumption states that the residuals should follow a normal
distribution. To test this assumption, we can use the Shapiro–Wilk test from
the scipy.stats library and visually inspect the residuals using a Q–Q plot.
from scipy.stats import shapiro
import statsmodels.graphics.gofplots as gofplots
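Continuing with these imports, the test and the Q–Q plot can be applied to the fitted model's residuals as follows (a minimal sketch):

import matplotlib.pyplot as plt

# Shapiro-Wilk test on the residuals
stat, p_value = shapiro(model.resid)
print("Shapiro-Wilk test p-value:", p_value)

# Q-Q plot of the residuals against a 45-degree reference line
gofplots.qqplot(model.resid, line='45')
plt.show()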
A p-value less than 0.05 from the Shapiro–Wilk test indicates non-normal
residuals. If the residuals are not normally distributed, you can consider trans-
forming the dependent variable or using non-linear regression techniques
such as generalized linear models (GLMs) or machine learning algorithms.
Regression diagnostics are crucial for validating the assumptions of the
regression model and ensuring its accuracy. By carefully examining linearity,
multicollinearity, heteroscedasticity, autocorrelation, and the normality of
residuals, you can improve the quality of your regression analysis and enhance
the reliability of your research findings.
Linear regression is a powerful and widely used technique in accounting
and finance research. This chapter has provided a comprehensive overview
of both simple and multiple linear regression and its implementation using
Python. Regression diagnostics were also discussed to ensure the validity of
the models and conclusions. The use of Python for implementing linear
regression and advanced regression techniques provides researchers with a
flexible and accessible toolkit for their research endeavors. The upcoming
chapters will delve into more advanced regression techniques accompanied
by Python implementations, allowing researchers to confidently apply them
to their research projects.
19
Logistic Regression
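A minimal sketch of such a model, using the file name and columns described in the text that follows, is:

import pandas as pd
import statsmodels.api as sm

# Load the data; BTC is the binary dependent variable
data = pd.read_csv('D:/Data/my_data1.csv')
Y = data['BTC']
X = sm.add_constant(data[['ETH', 'BNB', 'SOL', 'ADA']])

# Fit the logistic regression and print the summary
model = sm.Logit(Y, X).fit()
print(model.summary())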
This code snippet imports the necessary libraries and loads the my_data1.csv
dataset into a DataFrame. It then defines the dependent variable (Y) as the
’BTC’ column and the independent variables (X ) as the ’ETH’, ’BNB’,
’SOL’, and ’ADA’ columns. After adding a constant term to the independent
variables, the code fits a logistic regression model using the Logit function
from the statsmodels library and prints a summary of the results. The output
will look like:
The logistic regression results show that the model has a relatively low
pseudo R-squared value of 0.001059, indicating a weak explanatory power.
Among the independent variables, only the coefficient of ETH is significant
at the 10% level (p-value = 0.079), suggesting a positive association between
ETH and the log-odds of BTC being equal to 1. The coefficients of BNB,
SOL, and ADA are not statistically significant (p-values > 0.1), indicating that
there is no strong evidence to support a relationship between these variables
and the log-odds of BTC being equal to 1 in this model.
To save the output of the logistic regression model in a publication-quality
table in Microsoft Word format, you can use the stargazer package. First,
install it by running !pip install stargazer. Then, you can modify your
code as follows:
import pandas as pd
import statsmodels.api as sm
from stargazer.stargazer import Stargazer
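A sketch of that workflow is shown below; it assumes the logit model fitted earlier in this chapter and calls pandoc through subprocess (whether Stargazer accepts a given statsmodels results object can depend on the library version):

import pandas as pd
import statsmodels.api as sm
from stargazer.stargazer import Stargazer
import subprocess

data = pd.read_csv('D:/Data/my_data1.csv')
X = sm.add_constant(data[['ETH', 'BNB', 'SOL', 'ADA']])
model = sm.Logit(data['BTC'], X).fit()

# Render the model as a LaTeX table and write it to disk
stargazer = Stargazer([model])
with open('output_table.tex', 'w') as f:
    f.write(stargazer.render_latex())

# Convert the LaTeX table to a Word document with pandoc (pandoc must be on the PATH)
subprocess.run(['pandoc', 'output_table.tex', '-o', 'output_table.docx'])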
This code will generate a .tex file with the LaTeX code for the table and then
convert it to a Word document (.docx) using pandoc. If you don't have pandoc
installed, you can download and install it from the official Pandoc website;
after installing, you may also need to add pandoc to your system PATH. The
output table will be saved as output_table.docx in the same folder as your
script. You can also use code similar to the one we used for linear regression,
but pandoc produces tables in the format typically used in the accounting and
finance literature.
Evaluating the performance of a logistic regression model is crucial to
understanding its predictive power and accuracy. Several metrics can be used
to assess the model, including the confusion matrix, ROC curve, AUC,
precision, recall, and F1-score.
Confusion Matrix
A confusion matrix is a table that represents the number of true positive (TP),
true negative (TN), false positive (FP), and false negative (FN) predictions
made by a binary classifier. It helps to visualize the performance of the model
in terms of its correct and incorrect predictions. A confusion matrix for this
regression can be calculated as:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
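Building on these imports, a sketch of the full calculation is shown below; the 20% test split and the max_iter value are assumptions.

import pandas as pd

data = pd.read_csv('D:/Data/my_data1.csv')
X = data[['ETH', 'BNB', 'SOL', 'ADA']]
Y = data['BTC']

# Hold out part of the data, fit the classifier, and tabulate the predictions
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print(confusion_matrix(Y_test, Y_pred))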
The confusion matrix shows that out of 952 instances, the model correctly
predicted 498 instances (251 true positives and 247 true negatives), while it
incorrectly predicted 454 instances (241 false positives and 213 false nega-
tives). The model’s accuracy can be calculated as (251 + 247)/952 = 0.525,
while the misclassification rate can be calculated as (241 + 213)/952 =
0.475.
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
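Using these imports, a sketch of the ROC and AUC calculation on the fitted statsmodels logit model (computed in-sample, for simplicity) is:

data = pd.read_csv('D:/Data/my_data1.csv')
Y = data['BTC']
X = sm.add_constant(data[['ETH', 'BNB', 'SOL', 'ADA']])

# Fit the logit model and obtain predicted probabilities
model = sm.Logit(Y, X).fit()
probs = model.predict(X)

# ROC curve and area under the curve
fpr, tpr, _ = roc_curve(Y, probs)
auc = roc_auc_score(Y, probs)
print('AUC:', auc)

plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()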
The output indicates that the logistic regression model was optimized success-
fully with a log-likelihood value of 0.692409. The algorithm required only
3 iterations to converge. The AUC score of 0.522284677717709 indicates
that the model’s discriminatory power is only slightly better than random
guessing, as a score of 0.5 corresponds to a random classifier. In general, an
AUC score of 0.6–0.7 is considered poor, 0.7–0.8 is considered fair, 0.8–0.9
is considered good, and 0.9–1.0 is considered excellent. Therefore, the AUC
score of 0.522284677717709 suggests that the model’s discriminatory power
is poor, and the model may not be suitable for practical applications where
accurate classification is important.
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import classification_report
The first three lines indicate that the optimization of the logistic regression
model was successful, and the algorithm required only three iterations to
converge.
The classification report shows the precision, recall, and F1-score for both
classes (class 0 and class 1). The precision of class 0 is 0.52, meaning that
52% of the instances predicted as class 0 actually belong to class 0, and the
precision of class 1 is 0.51, meaning that 51% of the instances predicted as
class 1 actually belong to class 1. The recall of class 0 is 0.51, meaning
that 51% of the actual class 0 instances are correctly identified, and the
recall of class 1 is 0.53, meaning that 53% of the actual class 1 instances
are correctly identified. The F1-score, the harmonic mean of precision and
recall, is 0.51 for class 0 and 0.52 for class 1.
The accuracy of the model is 0.52, which means that it correctly predicted
52% of the test set. The macro average of Precision, Recall, and F1-score is
also 0.52, which is the average of the scores for both the classes. The weighted
average of Precision, Recall, and F1-score is also 0.52, which is the average of
the scores weighted by the number of samples in each class.
In summary, the classification report suggests that the logistic regression
model has poor performance in distinguishing between the two classes, with
an accuracy of 0.52 and F1-scores of 0.51 and 0.52 for the two classes.
By leveraging Python’s extensive libraries and tools, academic researchers
can efficiently implement and analyze logistic regression models, gaining a
deeper understanding of the relationships between predictor variables and
binary response variables. The integration of Python in this field not only
streamlines the research process but also enhances the accuracy and inter-
pretability of the results, ultimately leading to more robust and impactful
findings in accounting and finance research.
20
Probit and Logit Regression
The probit model and logit model are both types of generalized linear models
(GLMs) used to analyze the relationship between a binary dependent vari-
able and one or more independent variables. While they are similar in many
aspects, they differ in the link function they use to model the relationship
between the dependent and independent variables.
Probit Regression
Probit regression is a statistical technique used to analyze the relationship
between a binary dependent variable and one or more independent variables,
that may or may not be binary. It is commonly employed in accounting and
finance research to study events with binary outcomes, such as bankruptcy
or fraud, which can be represented as either a 0 or 1. Probit regression is a
useful alternative to logistic regression when the underlying assumption of
the logistic model may not hold.
The probit regression equation is based on the cumulative distribution
function (CDF) of the standard normal distribution, also known as the probit
function:
Y* = Xβ + ε,
Y = 1 if Y* > 0, and Y = 0 if Y* ≤ 0.
The probit model estimates the probability of the binary outcome variable
Y being equal to 1, given the values of the independent variables X. This
probability is given by the CDF of the standard normal distribution evaluated
at Xβ:
P(Y = 1 | X) = Φ(Xβ).
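A sketch of the estimation described next, using the statsmodels formula interface and the my_data3.csv file referenced in the export code later in this chapter (the variable names follow the output shown below):

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv('D:/Data/my_data3.csv')

# Fit the probit model and print the summary
probit_results = smf.probit('BTC ~ ETH + BNB + SOL + ADA', data=data).fit()
print(probit_results.summary())

# Marginal effects evaluated at the means of the regressors
print(probit_results.get_margeff(at='mean').summary())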
This code uses the statsmodels library. The code reads the data from the
CSV file, adds a constant to the independent variables, defines the probit
regression model using the specified dependent variable and independent
variables, fits the model, and prints the results. The output will look like:
Optimization terminated successfully.
Current function value: 0.689319
Iterations 4
Probit Regression Results
==============================================================================
Dep. Variable: BTC No. Observations: 1554
Model: Probit Df Residuals: 1549
Method: MLE Df Model: 4
Date: Sun, 07 May 2023 Pseudo R-squ.: 0.002171
Time: 07:29:46 Log-Likelihood: -1071.2
converged: True LL-Null: -1073.5
Covariance Type: nonrobust LLR p-value: 0.3238
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.0506 0.048 1.065 0.287 -0.043 0.144
ETH 0.0001 0.000 0.884 0.377 -0.000 0.000
BNB -0.0010 0.001 -1.559 0.119 -0.002 0.000
SOL 0.1234 0.082 1.502 0.133 -0.038 0.284
ADA -0.0161 0.144 -0.112 0.911 -0.298 0.266
==============================================================================
Probit Marginal Effects
=====================================
Dep. Variable: BTC
Method: dydx
At: mean
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ETH 4.141e-05 4.69e-05 0.884 0.377 -5.04e-05 0.000
BNB -0.0004 0.000 -1.559 0.119 -0.001 0.000
SOL 0.0491 0.033 1.502 0.133 -0.015 0.113
ADA -0.0064 0.057 -0.112 0.911 -0.118 0.106
==============================================================================
The results can be exported to a word file in a table using the following code:
import os
from docx import Document
from docx.shared import Pt

# Create a new Word document and add the probit summary as small fixed-size text
doc = Document()
run = doc.add_paragraph().add_run(str(probit_results.summary()))
run.font.size = Pt(8)

# Save the Word document in the same folder as the data file
output_filename = os.path.join(os.path.dirname("D:/Data/my_data3.csv"),
                               "probit_results.docx")
doc.save(output_filename)
The AIC (Akaike information criterion) and BIC (Bayesian information criterion) help to compare different models. Lower
values of AIC and BIC indicate a better model. These criteria can be especially
useful when comparing probit models with different sets of independent vari-
ables, and it is often useful to compare the AIC and BIC of different models
to determine which model has the best fit to the data. AIC and BIC can be
calculated during the regression adding the following code after fitting the
model:
print(f"\nAIC: {probit_results.aic:.4f}")
print(f"BIC: {probit_results.bic:.4f}")
Model Diagnostics
# Calculate VIF for the probit regressors
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.read_csv("D:/Data/my_data3.csv")
independent_variables = ['ETH', 'BNB', 'SOL', 'ADA']
X = data[independent_variables]
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i)
                     for i in range(X.shape[1])]
vif["feature"] = X.columns
print(vif)
• Link test: The link test is used to check for model misspecification, i.e.,
whether important variables or higher-order terms have been omitted from
the model. The test creates two additional variables: the predicted values of
the dependent variable (Ŷ) and the squared predicted values (Ŷ²). Then,
a new probit regression model is fitted using Ŷ and Ŷ² as independent
variables. If the coefficient for Ŷ² is significant, it suggests that the original
model might be misspecified. In practice, this check is often accompanied by
the Breusch–Pagan test for heteroskedasticity in the residuals, which is
implemented as below:
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
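One way to run the test on the probit model is to use response residuals, i.e., the observed outcome minus the predicted probability (one reasonable choice among several), continuing from the fitted probit_results above:

# Response residuals and the original regressors with a constant term
resid = data['BTC'] - probit_results.predict()
exog = sm.add_constant(data[['ETH', 'BNB', 'SOL', 'ADA']])
bp_test = het_breuschpagan(resid, exog)
print(bp_test)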
The het_breuschpagan function returns four statistics. The output will look
like:
(4.609828115548001, 0.3297225985432579, 1.1521668138469945,
0.3303018227688186)
The first statistic (4.61 in this case) is the LM test statistic, and the second
statistic (0.33 here) is the p-value for the test. The null hypothesis of the
Breusch–Pagan test is that there is no heteroskedasticity in the residuals, and
the alternative hypothesis is that there is heteroskedasticity. Since the p-value
of the test (0.33) is greater than the typical significance level of 0.05, we
cannot reject the null hypothesis, and we conclude that there is no signifi-
cant evidence of heteroskedasticity in the residuals. The third statistic (1.15
in this case) is the F-statistic for the same null hypothesis, based on the
auxiliary regression of the squared residuals on the explanatory variables, and
the fourth statistic (0.33 here) is its p-value. Since this test is also not
significant (p-value > 0.05), it points to the same conclusion: there is no
significant evidence of heteroskedasticity in the model.
Logit Model
The logit model, also known as logistic regression, uses the logit link function.
The logit link function is the natural logarithm of the odds, which is the ratio
of the probability of the event (success) to the probability of the non-event
(failure). In mathematical terms, the logit model can be expressed as:
log(p / (1 − p)) = β0 + β1X1 + β2X2 + ⋯ + βnXn
where p is the probability of success (the dependent variable taking the value 1),
X1, X2, …, Xn are the independent variables, and β0, β1, …, βn are the
coefficients to be estimated.
While both probit and logit models are used to estimate the probability of
an event occurring, they differ in the way they model the relationship between
the dependent variable and independent variables. One of the primary differ-
ences between the two models is the link function used to transform the
linear predictor into a probability. The logit model uses the logit link func-
tion, while the probit model uses the probit link function. The functional
forms of the two models differ because the logistic distribution used in the
logit model has heavier tails than the standard normal distribution used in
the probit model. Moreover, the interpretation of coefficients differs between
the two models. In the logit model, the coefficients represent the change in
the log-odds of success for a one-unit increase in the independent variable,
while in the probit model, the coefficients represent the change in the z-score
for a one-unit increase in the independent variable.
Despite these differences, the choice between the two models often comes
down to personal preference or the specific research question being addressed.
Both models can produce similar results, especially when the relationship
between the dependent and independent variables is not extreme. Ultimately,
the choice of which model to use will depend on the researcher’s familiarity
with each method and which model is best suited for the research question
at hand.
Python implementation of logit (logistic regression) is quite similar to the
probit model. You can use the statsmodels library for both models, and the
process of fitting the model and interpreting the results is almost identical.
The only difference is that you use smf.logit() instead of smf.probit() to
define the logit model. Here’s an example of how to implement a logit model
in Python using the same dataset from the previous example:
import pandas as pd
import statsmodels.formula.api as smf
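Continuing from these imports, a minimal sketch of the logit fit with the same specification as the probit model is:

data = pd.read_csv('D:/Data/my_data3.csv')

# Same specification as the probit model, with the logit link instead
logit_results = smf.logit('BTC ~ ETH + BNB + SOL + ADA', data=data).fit()
print(logit_results.summary())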
The methods to export the results and to assess the quality of regression also
remain the same.
The logit model has the advantage of producing coefficients that can
be easily transformed into odds ratios, which can be more intuitive for
researchers to understand and interpret. The logit model is based on the
logistic distribution, as noted above.
21
Polynomial Regression
Y = β0 + β1X + β2X² + ⋯ + βnXⁿ + ε,
where Y is the dependent variable, X is the independent variable, β0, β1, …, βn
are the coefficients to be estimated, n is the degree of the polynomial, and ε is
the error term.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
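Building on these imports, a sketch of the fit described next is shown below; using ETH as the single regressor and BTC as the dependent variable is an assumption carried over from the earlier examples.

data = pd.read_csv('D:/Data/my_data.csv')
X = data[['ETH']]
Y = data['BTC']

# Third-degree polynomial features plus a constant, then an OLS fit
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = sm.add_constant(poly.fit_transform(X))
model = sm.OLS(Y, X_poly).fit()
print(model.summary())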
This code uses the statsmodels library to create and fit an ordinary least
squares (OLS) regression model with the third degree polynomial features.
It then prints a summary of the regression results, including the coefficients,
t-statistics, number of observations, and other important statistics such as
R-squared and adjusted R-squared.
The output will look like:
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.91e+10. This might indicate that there are strong multicollinearity or other numerical problems.
This code creates a Word document with the title “Regression Summary:”
and inserts the image using the docx module. Note that you may need to
adjust the file path for saving the image and the document, based on your
system configuration.
If you need to export a publication quality table that can directly be used
in a paper, the following code will be useful:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from docx import Document
from docx.shared import Inches
row_cells = table.add_row().cells
row_cells[0].text = str(result[0])
row_cells[1].text = str(round(result[1], 4))
row_cells[2].text = str(round(result[2], 4))
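A sketch showing how a loop like the one above can sit inside a complete script that builds the table is given below; the column layout, variable names, and file paths are assumptions.

import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
from docx import Document

# Hypothetical columns, following the earlier examples
data = pd.read_csv('D:/Data/my_data.csv')
X = data[['ETH']]
Y = data['BTC']

# Third-degree polynomial features plus a constant term
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = sm.add_constant(poly.fit_transform(X))
model = sm.OLS(Y, X_poly).fit()

# Build a simple three-column table: variable, coefficient, t-statistic
doc = Document()
doc.add_heading('Regression Summary', level=1)
table = doc.add_table(rows=1, cols=3)
hdr = table.rows[0].cells
hdr[0].text, hdr[1].text, hdr[2].text = 'Variable', 'Coefficient', 't-statistic'

names = ['const', 'ETH', 'ETH^2', 'ETH^3']
for result in zip(names, model.params, model.tvalues):
    row_cells = table.add_row().cells
    row_cells[0].text = str(result[0])
    row_cells[1].text = str(round(result[1], 4))
    row_cells[2].text = str(round(result[2], 4))

doc.add_paragraph(f'Observations: {int(model.nobs)}')
doc.add_paragraph(f'Adjusted R-squared: {round(model.rsquared_adj, 4)}')
doc.save('D:/Data/poly_regression_table.docx')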
This code creates a table in the Word document with three columns: variable,
coefficient, and t-statistic. The code extracts these values from the regression
model and adds them to the table, along with the number of observations
and the adjusted R-squared. Note that you may need to adjust the file path
for saving the document, based on your system configuration.
Choosing the appropriate degree for the polynomial regression model is
critical for accurate predictions and avoiding overfitting. Several model selec-
tion and validation techniques can help researchers determine the optimal
degree of the polynomial.
2. Cross-Validation
By comparing the cross-validation error across different polynomial degrees,
researchers can choose the degree that yields the lowest error. Cross-validation
scores can be calculated by inserting the following code into the main script:
from sklearn.model_selection import cross_val_score

# Perform k-fold cross-validation using negative mean squared error
scores = cross_val_score(lin_reg, X_poly, Y, cv=5, scoring='neg_mean_squared_error')
The cross_val_score function returns an array of scores, one for each fold.
With the neg_mean_squared_error scoring, each score is the negative of that
fold's mean squared error, so the scores are negative by construction and
values closer to zero indicate a better fit. For this data, the mean
cross-validation score is far below zero, which indicates that the model did
not fit the data well overall.
It’s worth noting that the interpretation of the MSE depends on the scale
of the target variable, and it may not be immediately clear how well the model
is performing based on the scores alone. It’s often useful to compare the cross-
validation scores to the mean and standard deviation of the target variable to
get a sense of how well the model is doing relative to the variability in the
data.
3. Regularization Techniques
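A sketch of a Lasso fit along the lines described next is shown here; the alpha and max_iter values are illustrative assumptions, and the fourth-degree features match the coefficient degrees reported in the output below.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

data = pd.read_csv('D:/Data/my_data.csv')
X = data[['ETH']]
Y = data['BTC']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Fourth-degree polynomial features of the single regressor
poly = PolynomialFeatures(degree=4, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Lasso fit; alpha controls regularization strength, max_iter the iteration budget
lasso = Lasso(alpha=1.0, max_iter=10000)
lasso.fit(X_train_poly, Y_train)

Y_pred = lasso.predict(X_test_poly)
print('Mean squared error:', mean_squared_error(Y_test, Y_pred))
print('Coefficients:')
for degree, coef in enumerate(lasso.coef_, start=1):
    print(f'Degree {degree}: {round(coef, 4)}')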
In this code, we first load and split the data into training and testing
sets. We then create polynomial features using PolynomialFeatures, and fit
a Lasso regression model using Lasso. The alpha parameter controls the
regularization strength, and the max_iter parameter controls the maximum
number of iterations. We then use the trained model to predict the test set
results and calculate the mean squared error. Note that Lasso regression can be
sensitive to the choice of alpha. You may need to experiment with different
values of alpha to find the best one for your data. The output will look like:
Mean squared error: 18797519.140195906
Coefficients:
Degree 1: 0.0
Degree 2: 27.0842
Degree 3: -0.0045
Degree 4: 0.0
Mean squared error is a measure of the difference between the actual values
and the predicted values of the test set. In this case, the mean squared error is
quite high at 18,797,519.14. This indicates that the Lasso model is not a very
good fit for the data. The coefficients indicate the relative importance of each
feature in predicting the dependent variable. In this case, the Lasso model
has assigned zero weight to the first and fourth degree polynomial features,
indicating that they are not contributing to the model. The second-degree polynomial feature (i.e., ETH²) has a positive weight of 27.0842, indicating a positive relationship with the dependent variable. The third-degree polynomial feature (i.e., ETH³) has a negative weight of −0.0045, indicating a negative relationship with the dependent variable. It's
worth noting that the Lasso model is a type of regularized regression that can
help prevent overfitting, by penalizing the coefficients of features that are not
important. However, in this case, the mean squared error is quite high, indi-
cating that the Lasso model is not a very good fit for the data. You may need
to try other types of regression or adjust the hyperparameters of the Lasso
model to improve its performance.
On the same data, ridge regression would be implemented as follows:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from docx import Document
from docx.shared import Inches

# Scale the polynomial features before fitting the Ridge model
# (X_train_poly comes from PolynomialFeatures and train_test_split in the full listing)
scaler = StandardScaler()
X_train_poly_scaled = scaler.fit_transform(X_train_poly)
The output will be saved in a publication quality table. This code creates a
table in the Word document with two columns: degree and coefficient. The
code fits a Ridge regression model with an alpha value of 1.0, and extracts
the coefficients, number of observations, and adjusted R-squared.
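The Ridge fitting step itself is not reproduced above; the following is a minimal sketch consistent with that description (alpha = 1.0, coefficients read from the fitted model). The variable names follow the earlier Lasso example and are assumptions.

# Fit Ridge on the scaled polynomial features; Y_train is assumed to come from
# the earlier train_test_split, and PolynomialFeatures is assumed to have been
# created with include_bias=False, as suggested by the Degree 1-4 output above
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_poly_scaled, Y_train)

for degree, coef in enumerate(ridge.coef_, start=1):
    print(f'Degree {degree}: {coef:.4f}')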
The main advantage of polynomial regression is that it can capture non-
linear relationships between the independent and dependent variables, which
cannot be modeled using linear regression. This makes polynomial regres-
sion a useful tool in fields such as physics, engineering, and finance, where
non-linear relationships between variables are common. However, one major drawback is the risk of overfitting: high-degree polynomials can fit noise in the sample and generalize poorly out of sample.
Quantile regression models the conditional quantiles of the response variable:
Q_Y(τ | X) = Xβ(τ),
where τ is the quantile of interest (a value between 0 and 1) and β(τ) is the vector of quantile-specific coefficients.
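The estimation code is not reproduced in this excerpt; the sketch below fits several quantiles with statsmodels. The file path is a placeholder, and the column names follow the regression output shown further below.

import pandas as pd
import statsmodels.api as sm

# Placeholder path; columns follow the QuantReg output shown below
data = pd.read_csv('D:/Data/my_data2.csv')
y = data['y']
X = sm.add_constant(data[['sex', 'dex', 'lex', 'kwit', 'job_tenure', 'censored']])

quantiles = [0.25, 0.5, 0.75]
for q in quantiles:
    result = sm.QuantReg(y, X).fit(q=q)
    print(f'Results for {q} Quantile Regression:')
    print(result.summary())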
The models can be fitted using the QuantReg class from the statsmodels library, and the results printed, including the estimated coefficients for each quantile. You can modify the quantiles list to include any desired quantiles between 0 and 1. The code gives an output for each of the quantiles; in this case, the output for the 0.25 quantile is as follows:
Results for 0.25 Quantile Regression:
QuantReg Regression Results
==============================================================================
Dep. Variable: y Pseudo R-squared: 0.2050
Model: QuantReg Bandwidth: 0.5631
Method: Least Squares Sparsity: 3.404
Date: Sat, 06 May 2023 No. Observations: 683
Time: 15:45:49 Df Residuals: 676
Df Model: 6
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 10.3025 0.819 12.573 0.000 8.694 11.912
sex -0.8047 0.118 -6.833 0.000 -1.036 -0.573
dex 0.1098 0.009 11.895 0.000 0.092 0.128
lex -0.0693 0.058 -1.184 0.237 -0.184 0.046
kwit -0.0166 0.145 -0.114 0.909 -0.302 0.269
job_tenure -0.0003 0.000 -1.782 0.075 -0.001 3.19e-05
censored 0.1063 0.154 0.689 0.491 -0.197 0.410
==============================================================================
1. Pseudo R-Squared
The pseudo R-squared reported by QuantReg (0.2050 here) compares the sum of weighted absolute residuals of the fitted model with that of an intercept-only model at the same quantile; higher values indicate a better fit at that quantile.
2. Residual Analysis
Analyze the residuals (differences between the observed values and the
predicted values) to assess the model’s performance. Ideally, the residuals
should be randomly distributed and not show any patterns, trends, or
heteroscedasticity. You can create residual plots, such as scatter plots of resid-
uals versus predicted values or histograms of residuals, to visually assess
the distribution and behavior of the residuals. A residual plot for this 0.25
quantile regression can be created as follows:
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Fit the 0.25 quantile regression model
quantile_025 = 0.25
model_025 = sm.QuantReg(y, X).fit(q=quantile_025)
# Calculate predicted values and residuals
y_pred_025 = model_025.predict(X)
residuals_025 = y - y_pred_025
# Scatter plot of residuals against predicted values
plt.scatter(y_pred_025, residuals_025, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.show()
The output will show the scatterplot of residuals for different predicted values
of Y:
A good residual plot for a regression model, whether linear or quantile regression, shows residuals scattered randomly around zero, with no visible patterns or trends and a roughly constant spread across the range of predicted values.
3. Coefficient Significance
Examine the t-statistics, p-values, and confidence intervals of the estimated coefficients to see which predictors are statistically significant at the chosen quantile. In the output above, sex and dex are significant at the 0.25 quantile, while lex, kwit, and censored are not.
Check the stability of the model by examining the condition number (as
already discussed above), as high condition numbers may indicate multi-
collinearity or numerical instability. You can also assess multicollinearity by
analyzing the correlation matrix or the variance inflation factor (VIF) of
the predictor variables. Addressing multicollinearity issues may improve the
quality of the model. VIF can be calculated as:
# Import necessary libraries
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load data
data = pd.read_csv('D:/Data/my_data2.csv')

# Compute the VIF for each independent variable
X = data[['sex', 'dex', 'lex', 'kwit', 'job_tenure', 'censored']]
vif = pd.DataFrame({'feature': X.columns,
                    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]})
print(vif)
This will print the VIF for all the independent variables, which in this case would be:
feature VIF
0 sex 2.429564
1 dex 38.462426
2 lex 42.388046
3 kwit 1.689730
4 job_tenure 2.957368
5 censored 5.402124
Compare the quantile regression model with other models, such as linear
regression or alternative quantile regression specifications, to see if it provides
a better fit or more meaningful insights into the data. This can help you
choose the most appropriate model for your research question.
Quantile regression offers several advantages over traditional linear regres-
sion, making it particularly well-suited for accounting and finance research.
One of the key benefits is its robustness to outliers. Unlike linear regression,
which minimizes squared residuals, quantile regression minimizes absolute
residuals. This makes it less sensitive to outliers and more robust when
dealing with skewed or heavy-tailed distributions, which are common in
financial data. Another advantage of quantile regression is its ability to esti-
mate heterogeneous relationships between predictor and response variables
across different quantiles. This is especially useful in cases where the impact
of predictor variables varies across different levels of the response variable.
By considering the relationships at various quantiles, researchers can gain a
more detailed understanding of the underlying processes and interactions,
which can lead to more accurate and insightful models. Quantile regression
enables researchers to perform a more comprehensive analysis of the distribu-
tion of the response variable by estimating multiple quantiles. This provides
important insights into the behavior of financial variables, such as tail risk
and extreme events. By examining the entire distribution, rather than just
the mean, researchers can develop a deeper understanding of the data and
better identify patterns and trends that may not be apparent with traditional
linear regression methods.
Despite its advantages, quantile regression also has some limitations and
drawbacks. One of the main disadvantages is its computational complexity.
Quantile regression often requires more computational resources and time
to estimate the model parameters compared to linear regression, particularly
when dealing with large datasets or multiple quantiles. This can be a limiting
factor for researchers working with extensive data or constrained computa-
tional resources. Another limitation of quantile regression is the difficulty in
interpreting the results, especially for non-experts or those unfamiliar with
the technique. Unlike linear regression, which provides a single coefficient
for each predictor variable that can be easily interpreted, quantile regression
produces different coefficients for each quantile. This can make the inter-
pretation of the results more complex and may require additional effort to
convey the findings to a broader audience. Quantile regression does not
provide a direct measure of goodness-of-fit like the R-squared value in linear
regression. Instead, it relies on pseudo R-squared, which can be harder to
interpret and compare with the goodness-of-fit values from other models.
This makes model selection and evaluation more challenging for researchers
using quantile regression. The presence of multicollinearity can still be an
issue in quantile regression, similar to linear regression. While quantile regres-
sion is robust to outliers and can handle heteroskedasticity, it does not inher-
ently address the problems caused by multicollinearity. Researchers should
still be cautious when dealing with highly correlated predictor variables and
consider appropriate techniques to mitigate the potential issues.
Quantile regression has emerged as an essential tool for academic
accounting and finance research due to its ability to model complex rela-
tionships, robustness to outliers, and comprehensive distributional analysis
capabilities. While it has some limitations, such as computational complexity
and challenges in interpretation, the advantages it offers make it a valuable
alternative to traditional linear regression methods. By employing quantile
regression, researchers can gain deeper insights into the underlying processes,
better understand the behavior of financial variables across different quan-
tiles, and ultimately contribute to the development of more accurate and
informative models. Embracing quantile regression in academic accounting
and finance research can help researchers uncover important findings that
may have been overlooked by conventional methods, thereby advancing the
understanding of financial markets and their dynamics.
23
Advanced Regressions
In the field of accounting and finance research, the complex and diverse
nature of data often requires the utilization of a diverse range of analyt-
ical tools. Standard linear regression models, such as ordinary least squares
regression and logistic regression, may not always be suitable due to their
assumptions not aligning with the characteristics of the data or research
question at hand. This necessitates researchers to explore alternative statis-
tical methods that respect the specific properties of the data and cater to the
unique needs of their research. In this chapter, we will discuss specific regression techniques that are suited to particular situations.
Tobit Regression
Tobit regression is a statistical method specifically designed to estimate rela-
tionships between variables when there is either left- or right-censoring in
the dependent variable. It is a type of generalized linear model that combines
elements of both linear regression and probit models, making it particularly
well-suited to deal with scenarios in which the range of outcomes is mechan-
ically constrained. In accounting and finance research, Tobit regression is
particularly beneficial in situations where the dependent variable is censored.
Censoring occurs when the value of the dependent variable is only partially
known; for instance, when the value of an observation falls below or above a
certain threshold, it is not exactly observable. A classic example in finance is
the situation where a firm’s liabilities exceed its assets, but the exact amount of
negative equity is not observed or reported, making it a case of left-censoring
at zero.
The Tobit model can be formally written as follows:
y_i* = βx_i + ε_i,
y_i = max(0, y_i*),
where y_i* is the latent (uncensored) variable, x_i is the vector of regressors, β is the coefficient vector, ε_i is a normally distributed error term, and y_i is the observed outcome, which equals the latent variable only when it exceeds the censoring point (here zero).
class TobitModel:
    # Note: the nll method and the stderr_/tvalues_/pvalues_ attributes used in
    # summary() are defined in the complete listing.
    def __init__(self, y, X, sigma=1, left=0):
        self.y = y
        self.X = X
        self.n, self.k = X.shape
        self.sigma = sigma
        self.left = left

    def fit(self):
        # Use a different initial guess
        x0 = np.concatenate([np.ones(self.k), [self.sigma]])
        # Use the BFGS optimizer, which also computes an approximation of the Hessian
        res = minimize(self.nll, x0, method='BFGS')
        self.params_ = res.x
        self.sigma_ = res.x[-1]
        self.beta_ = res.x[:-1]

    def summary(self):
        results = pd.DataFrame({
            'Variable': ['intercept', 'ETH_returns', 'BNB_returns', 'ADA_returns', 'sigma'],
            'Estimate': self.params_,
            'Standard Error': self.stderr_,
            't-statistic': self.tvalues_,
            'p-value': self.pvalues_
        })
        return results
As the Tobit model is not directly supported by any major Python statis-
tical package (though some custom solutions by user community do exist),
a custom solution needs to be created to compute additional statistics in the
code above. The following code will directly export the regression table in a
format typically used in accounting and finance papers to a word file.
from docx import Document
from docx.shared import Inches
class TobitModel:
    # Note: the nll, inference, and export_to_word methods referred to in the text
    # belong to the complete listing.
    def __init__(self, Y, X, sigma=1, left=0):
        self.Y = Y
        self.X = X
        self.n, self.k = X.shape
        self.sigma = sigma
        self.left = left

    def fit(self):
        # Use a different initial guess
        x0 = np.concatenate([np.ones(self.k), [self.sigma]])
        # Use the BFGS optimizer, which also computes an approximation of the Hessian
        res = minimize(self.nll, x0, method='BFGS')
        self.params_ = res.x
        self.sigma_ = res.x[-1]
        self.beta_ = res.x[:-1]

    def summary(self):
        results = pd.DataFrame({
            'Variable': ['intercept', 'ETH_returns', 'BNB_returns', 'ADA_returns', 'sigma'],
            'Estimate': self.params_,
            'Standard Error': self.stderr_,
            't-statistic': self.tvalues_,
            'p-value': self.pvalues_
        })
        return results
The script begins by importing necessary modules and defining the Tobit
model class. The class constructor initializes the dependent variable (Y ), the
independent variables (X ), and some optional parameters, including the stan-
dard deviation of the error term (sigma) and a left-censoring limit. The class
includes methods for fitting the model and summarizing or exporting the
results. The method nll calculates the negative log-likelihood of the model
given a set of parameters, differentiating between censored and uncensored
observations. The fit method uses the Broyden-Fletcher-Goldfarb-Shanno
(BFGS) algorithm to find the set of parameters that minimizes the nega-
tive log-likelihood, computes standard errors using the inverse of the Hessian
matrix (an approximation of the second derivative of the likelihood function),
and calculates the t-statistics and two-sided p-values for the parameters. The
summary method returns a DataFrame containing the estimated parameters,
their standard errors, t-statistics, and p-values. The export_to_word method
generates a Word document with a table that includes the same information
as the summary.
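Neither the nll method nor the standard-error and export steps appear in the excerpt above. The following is a minimal sketch consistent with this description (a left-censored normal log-likelihood, standard errors from the BFGS inverse-Hessian approximation, and a simple docx table for export); it is an illustration under those assumptions, not the chapter's exact code.

import numpy as np
import pandas as pd
from scipy.stats import norm
from docx import Document

# Methods to be added inside the TobitModel class above
def nll(self, params):
    # Negative log-likelihood of the left-censored Tobit model
    beta, sigma = params[:-1], abs(params[-1])
    xb = self.X @ beta
    censored = self.Y <= self.left
    ll_uncensored = norm.logpdf((self.Y - xb) / sigma) - np.log(sigma)
    ll_censored = norm.logcdf((self.left - xb) / sigma)
    return -(ll_uncensored[~censored].sum() + ll_censored[censored].sum())

def compute_inference(self, res):
    # Standard errors from the BFGS inverse-Hessian approximation (call after fit)
    self.stderr_ = np.sqrt(np.diag(res.hess_inv))
    self.tvalues_ = self.params_ / self.stderr_
    self.pvalues_ = 2 * (1 - norm.cdf(np.abs(self.tvalues_)))

def export_to_word(self, filename='tobit_results.docx'):
    # Write the summary table to a Word document
    results = self.summary()
    doc = Document()
    doc.add_heading('Tobit Regression Results', level=1)
    table = doc.add_table(rows=1, cols=len(results.columns))
    for j, col in enumerate(results.columns):
        table.rows[0].cells[j].text = col
    for _, row in results.iterrows():
        cells = table.add_row().cells
        for j, value in enumerate(row):
            cells[j].text = str(value)
    doc.save(filename)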
The script then loads a dataset, converts the Date column to datetime
format, and sorts the observations by date. It defines the dependent variable
as ‘BTC_returns’ and the independent variables as ‘ETH_returns’, ‘BNB_
returns’, and ‘ADA_returns’. It adds a constant to the matrix of indepen-
dent variables to account for the intercept term in the regression model. After
preparing the data, the script fits the Tobit model and exports the summary
of the results to a Word document. This implementation provides a compre-
hensive analysis of the relationship between Bitcoin returns and the returns
of three other cryptocurrencies under the assumption that Bitcoin returns are
left-censored at zero.
This was a censoring model. If we need to implement a trimming model, then the TobitModel class in the code above needs to be modified as follows:
from scipy.stats import norm
import numpy as np
import pandas as pd
from scipy.optimize import minimize
class TobitModel:
    def __init__(self, y, X, trim_percent=0.1):
        self.y = y
        self.X = X
        self.n, self.k = X.shape
        self.trim_percent = trim_percent

    def fit(self):
        # Use a different initial guess
        x0 = np.concatenate([np.ones(self.k), [1]])
        res = minimize(self.nll, x0, method='BFGS')
        self.params_ = res.x
        self.sigma_ = res.x[-1]
        self.beta_ = res.x[:-1]
        return self

    def summary(self):
        # The first column of X is assumed to be the constant, so the remaining
        # k - 1 columns are labelled X1 ... X(k-1) to match the k slope estimates
        results = pd.DataFrame({
            'Variable': ['intercept'] + ['X{}'.format(i) for i in range(1, self.k)],
            'Estimate': self.params_[:-1],
            'Standard Error': np.nan,
            't-statistic': np.nan,
            'p-value': np.nan
        })
        return results
Poisson Regression
Poisson regression is a generalized linear model form of regression analysis
used to model count data and contingency tables. The Poisson regression
model allows us to examine the relationship between a set of predictor vari-
ables and a count-dependent variable. In a Poisson regression model, the
response variable Y is assumed to follow a Poisson distribution, and the
logarithm of its expected value can be modeled by a linear combination of
unknown parameters. The canonical form of the Poisson regression model is
expressed as:
log(E[Y | X]) = β₀ + β₁X₁ + β₂X₂ + … + β_nX_n,
or, writing the expected count as μ,
y = log(μ) = β₀ + β₁x₁ + β₂x₂ + … + β_nx_n.
Here, y is the log of the expected count μ, x₁, x₂, …, x_n are the predictor variables, and β₀, β₁, …, β_n are the parameters to be estimated, reflecting the impact of the respective predictors on the expected log count.
import pandas as pd
import statsmodels.api as sm
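The Poisson regression code itself is not included in this excerpt; a minimal sketch continuing from these imports, using statsmodels' GLM with a Poisson family, is shown below. The file path and column names are placeholders.

# Placeholder path and column names
data = pd.read_csv('D:/Data/count_data.csv')
y = data['event_count']
X = sm.add_constant(data[['x1', 'x2', 'x3']])

# Fit a Poisson regression (a GLM with a log link by default)
poisson_result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(poisson_result.summary())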
Two-stage least squares (2SLS) regression addresses endogeneity by using instrumental variables. Consider a structural equation with an endogenous regressor X and an instrument Z:
Y = β₀ + β₁X + ε,
X = γ₀ + γ₁Z + u,
so that substituting the first stage into the structural equation gives
Y = β₀ + β₁(γ₀ + γ₁Z + u) + ε.
import pandas as pd
import statsmodels.api as sm

# Assign variables (data is assumed to have been loaded into a DataFrame)
endogenous_variable = data['Dep_Var']
exogenous_variables = data[['Indep_Var1', 'Indep_Var2', 'Indep_Var3',
                            'Indep_Var4', 'Indep_Var5']]
instrumental_variable = data['Instrumental_variable']
Stage 1: X = Zγ + u,
Stage 2: Y = X̂β + ε.
In Stage 1, the endogenous variable X is regressed on the set of instrumental variables Z, and the fitted values, denoted X̂, are obtained. In Stage 2, the outcome variable Y is regressed on X̂ together with the other exogenous variables, yielding consistent estimates of the coefficients β.
The application of 2SLS regression in academic research within the
accounting and finance domain is wide-ranging. It is commonly employed
in studies that investigate the impact of certain accounting practices, finan-
cial policies, or corporate governance mechanisms on firm performance,
investment decisions, or market outcomes. Moreover, 2SLS regression is
frequently utilized in examining the effects of financial regulations, mergers
and acquisitions, or capital structure choices on firm valuation and financial
performance.
The importance of 2SLS regression lies in its ability to provide researchers
with a rigorous method to address endogeneity problems that may arise in
empirical studies. By incorporating instrumental variables and employing a
two-stage estimation procedure, 2SLS allows researchers to obtain consistent
and unbiased estimates of causal relationships, thereby enhancing the validity
of their findings. This is particularly crucial in empirical research where estab-
lishing causal relationships is of paramount importance. However, 2SLS relies
on the availability of valid instrumental variables. Finding suitable instru-
ments that meet the necessary criteria, such as relevance and exogeneity, can
be challenging and may introduce additional sources of bias if improperly
chosen. Moreover, 2SLS regression assumes that there are no measurement
errors in the instrumental variables, which may not always hold in practice.
A python implementation of 2SLS would be as follows:
import pandas as pd
import statsmodels.api as sm
# Assign variables
endogenous_variable = data['Dep_Var']
exogenous_variables = data[['Indep_Var1', 'Indep_Var2',
'Indep_Var3', 'Indep_Var4', 'Indep_Var5']]
instrumental_variable = data['Instrumental_variable']
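The two-stage estimation itself does not appear in the excerpt. The sketch below carries out the two stages manually with sm.OLS, following the Stage 1/Stage 2 equations above; treating Indep_Var1 as the endogenous regressor is an assumption, and the second-stage standard errors from this manual approach are not corrected, so a dedicated IV estimator (for example linearmodels' IV2SLS, used later in this book) is preferable in practice.

# Assume Indep_Var1 is the endogenous regressor and the remaining columns are exogenous
y = data['Dep_Var']
endog_regressor = data['Indep_Var1']
exog = sm.add_constant(data[['Indep_Var2', 'Indep_Var3', 'Indep_Var4', 'Indep_Var5']])
instrument = data[['Instrumental_variable']]

# Stage 1: regress the endogenous regressor on the instrument and the exogenous variables
stage1 = sm.OLS(endog_regressor, pd.concat([exog, instrument], axis=1)).fit()
endog_fitted = stage1.fittedvalues.rename('Indep_Var1_hat')

# Stage 2: regress the outcome on the fitted values and the exogenous variables
stage2 = sm.OLS(y, pd.concat([exog, endog_fitted], axis=1)).fit()
print(stage2.summary())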
Time series data is a type of data that is collected over time at regular inter-
vals. It is a sequence of data points that are ordered by time, where each
data point represents a measurement or observation made at a specific point
in time. Time series data can be collected from a wide range of sources,
including sensors, financial markets, social media, weather stations, and more.
Time series data is unique in that it contains both temporal and structural
dependencies, meaning that each observation is dependent on previous obser-
vations and that the data points have a specific order. This makes time series
data useful for analyzing trends, patterns, and behaviors over time. Time
series analysis techniques can be used to identify trends and seasonality, make
predictions and forecasts, and detect anomalies or outliers. Examples of time
series data include annual financial statements, daily stock prices, hourly
weather data, monthly sales figures, and annual GDP growth rates. Time
series data is commonly used in fields such as accounting, finance, economics,
engineering, and environmental science, among others, where understanding
patterns and trends over time is crucial for making informed decisions.
Time series analysis is a set of statistical techniques used to analyze and
draw insights from data points collected over time. In accounting and
finance research, time series analysis plays a crucial role in understanding the
dynamics of financial markets, firm performance, and economic indicators.
By analyzing time series data, researchers can identify patterns, detect trends,
and make forecasts to support informed decision-making.
To effectively analyze time series data, researchers should understand the
following basic concepts:
Before analyzing time series data, it’s essential to preprocess the data to ensure
it’s in the correct format and has the desired properties.
First, let’s import the necessary libraries and load the data from the specified
file path:
import pandas as pd

# Load the data; the file path and date parsing shown here are assumptions
data = pd.read_csv('D:/Data/my_data4.csv', parse_dates=['Date'], index_col='Date')
print(data.head())
Missing values can then be filled in by interpolation (linear interpolation is assumed here):
interpolated_data = data.interpolate()
print(interpolated_data.head())
To check for stationarity, you can use the Augmented Dickey-Fuller (ADF)
test provided by the statsmodels library. If the data is non-stationary, you
can apply transformations like differencing or log transformation to achieve
stationarity.
import numpy as np
from statsmodels.tsa.stattools import adfuller
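A minimal sketch of the test loop, matching the output format shown below, is given here; it assumes interpolated_data contains a 'BTC' column, and the helper name run_adf is illustrative.

def run_adf(series, label):
    # Augmented Dickey-Fuller test: print statistic, p-value and critical values
    stat, pvalue, _, _, critical_values, _ = adfuller(series.dropna())
    print(f'Results of ADF test for {label}:')
    print(f'ADF Statistic: {stat}')
    print(f'p-value: {pvalue}')
    print('Critical Values:')
    for level, value in critical_values.items():
        print(f'   {level}: {value}')

btc = interpolated_data['BTC']
run_adf(btc, 'BTC')
run_adf(btc.diff(), 'differenced BTC')
run_adf(np.log(btc), 'log-transformed BTC')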
This code applies the ADF test to the original BTC data, the differenced BTC data (created by taking the first difference), and the log-transformed BTC data, to determine whether any of these transformations results in a stationary time series. The results will look like:
Results of ADF test for BTC:
ADF Statistic: -0.4283402641607983
p-value: 0.9052583521073132
Critical Values:
1%: -3.4621857592784546
5%: -2.875537986778846
10%: -2.574231080806213
Results of ADF test for differenced BTC:
ADF Statistic: -4.669125404631494
p-value: 9.613914488683148e-05
Critical Values:
1%: -3.4623415245233145
5%: -2.875606128263243
10%: -2.574267439846904
Results of ADF test for log-transformed BTC:
ADF Statistic: -0.9652852765849609
p-value: 0.7657142553972955
Critical Values:
1%: -3.4602906385073884
5%: -2.874708679520702
10%: -2.573788599127782
In this example, for the original BTC time series, the ADF statistic is −0.428,
and the p-value is 0.905, which suggests that the data is non-stationary and
that there is a high probability that it has a unit root (i.e., a trend). This is
supported by the critical values, which are all greater than the ADF statistic.
For the differenced BTC time series, which is the original series after differ-
encing (i.e., taking the first difference), the ADF statistic is −4.669, and the
p-value is 9.61e−05, which suggests that the data is now stationary and that
there is a low probability that it has a unit root. This is supported by the crit-
ical values, which are all less than the ADF statistic. For the log-transformed
BTC time series, which is the original series after taking the natural loga-
rithm, the ADF statistic is −0.965, and the p-value is 0.766, which suggests
that the data is still non-stationary and that there is a high probability that it
has a unit root. This is supported by the critical values, which are all greater
than the ADF statistic. ADF test results suggest that the original BTC time
series data is non-stationary and has a unit root, which means that it has a
trend that needs to be removed.
For our research question, if the data calls for a difference transformation,
it can be achieved as follows:
differenced_data = interpolated_data.diff().dropna()
print(differenced_data.head())
This code will apply the first-order difference transformation to all columns
in the interpolated_data DataFrame and store the result in a new DataFrame
called differenced_data. The dropna() method is used to remove any rows
with missing values that result from the differencing process.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Differenced BTC series from the previous step
btc_diff = differenced_data['BTC']

# Plot ACF
plt.figure(figsize=(12, 6))
plot_acf(btc_diff, lags=20, title='Autocorrelation Function (ACF) for BTC')
plt.show()

# Plot PACF
plt.figure(figsize=(12, 6))
plot_pacf(btc_diff, lags=20, title='Partial Autocorrelation Function (PACF) for BTC')
plt.show()
The PACF measures the correlation between an observation and a given lag after removing the effects of any shorter lags. If the PACF decays gradually or exponentially, it suggests a Moving Average (MA) process, whose order is then read from the ACF. If the PACF displays a sudden cut-off after a certain lag, it indicates an Autoregressive (AR) process of that order. If the PACF exhibits a mixture of gradual decay and sudden cut-offs, it signifies a mix of AR and MA components.
It is also essential to consider the confidence intervals. The confidence
intervals are shown as horizontal lines or shaded areas around zero. If the
autocorrelation or partial autocorrelation is within the confidence interval,
it is not statistically significant, and you can consider it as zero. If the auto-
correlation or partial autocorrelation is outside the confidence interval, it is
statistically significant.
There are several types of time series models, each with its own assumptions and applications. Some of the most commonly used are Autoregressive (AR), Moving Average (MA), ARMA, ARIMA, SARIMA, SARIMAX, Vector Autoregression (VAR), Vector Error Correction (VECM), and GARCH models.
Each of these models has different assumptions and is suited for different
types of time series data. AR, MA, and ARMA models are useful for analyzing
stationary time series, while ARIMA, SARIMA, and SARIMAX models can
handle non-stationary and seasonal data. VAR and VECM models are useful
for analyzing the relationships between multiple time series variables, while
GARCH models are used to model volatility in financial time series data.
We’ll discuss the most important time series models in greater detail.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.ar_model import AutoReg
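The body of this script is not reproduced in the excerpt; the sketch below is consistent with the description that follows (weekly resampling, interpolation, differencing, an ADF check, and an AR(1) fit on the BTC column). The file path is illustrative.

# Load and preprocess the data (path is a placeholder)
data = pd.read_csv('D:/Data/your_timeseries.csv', parse_dates=['Date'], index_col='Date')
weekly = data['BTC'].resample('W').mean().interpolate()
btc_diff = weekly.diff().dropna()

# Check stationarity of the differenced series
adf_stat, p_value, *_ = adfuller(btc_diff)
print(f'ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}')

# Fit an AR(1) model to the differenced BTC data
ar_model = AutoReg(btc_diff, lags=1).fit()
print(ar_model.summary())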
This code will import the data from the specified file path, preprocess it by
setting the index to the ‘Date’ column and resampling to weekly frequency,
interpolate missing values, apply the difference transformation, check station-
arity using the ADF test, and fit an AR(1) model to the differenced BTC data.
If the AR model is to be employed on the main dataset i.e., my_data4.csv,
then the code has to be modified accordingly. Keep in mind that the choice
of the AR model’s order should be informed by the ACF and PACF plots, as
well as model selection criteria such as Akaike Information Criterion (AIC)
and Bayesian Information Criterion (BIC). You can export the results in a
publication quality table in word file as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.ar_model import AutoReg
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Pt
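The export code itself is not included in the excerpt; a minimal sketch that writes the AR coefficients and standard errors to a Word table, continuing from the imports above and assuming ar_model from the earlier sketch, is:

doc = Document()
doc.add_heading('AR(1) Model Results', level=1)
table = doc.add_table(rows=1, cols=3)
hdr = table.rows[0].cells
hdr[0].text, hdr[1].text, hdr[2].text = 'Variable', 'Coefficient', 'Std. Error'
for name, coef, se in zip(ar_model.params.index, ar_model.params, ar_model.bse):
    row = table.add_row().cells
    row[0].text = str(name)
    row[1].text = f'{coef:.4f}'
    row[2].text = f'{se:.4f}'
doc.save('ar_results.docx')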
A moving average model of order q, MA(q), can be written as
Y_t = μ + ε_t + θ₁ε_{t−1} + θ₂ε_{t−2} + … + θ_qε_{t−q},
where Y_t is the observed value at time t, μ is the mean of the time series, ε_t is the error term at time t, and θ_i is the parameter associated with the error term at t − i.
The MA model is useful for modeling time series with short-term depen-
dencies or when the underlying process generating the data is best described
by moving averages of past errors. A second-order MA model can be
implemented on D:/Data/my_data4.csv as follows:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.arima.model import ARIMA

# Load the data (path as given in the text; date parsing assumed)
data = pd.read_csv('D:/Data/my_data4.csv', parse_dates=['Date'], index_col='Date')

# Step 3: Determine the order of the MA model (q) using the ACF plot
column_name = 'BTC'  # Replace with the desired column name
plot_acf(data[column_name], lags=20)
plt.show()

# Based on the ACF plot, choose an appropriate value for q
q = 2  # Replace with the appropriate value based on the ACF plot

# Fit the MA(q) model, i.e. an ARIMA(0, 0, q)
ma_result = ARIMA(data[column_name], order=(0, 0, q)).fit()
print(ma_result.summary())
SARIMAX Results
==============================================================================
Dep. Variable: BTC No. Observations: 1554
Model: ARIMA(0, 0, 2) Log Likelihood -15744.331
Date: Sun, 07 May 2023 AIC 31496.662
Time: 10:59:56 BIC 31518.056
Sample: 10-01-2017 HQIC 31504.618
- 01-01-2022
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 1.787e+04 561.039 31.851 0.000 1.68e+04 1.9e+04
ma.L1 1.4355 0.010 142.757 0.000 1.416 1.455
ma.L2 0.9999 0.014 71.448 0.000 0.972 1.027
sigma2 3.663e+07 0.320 1.14e+08 0.000 3.66e+07 3.66e+07
===================================================================================
Ljung-Box (L1) (Q): 370.45 Jarque-Bera (JB): 2025.11
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 6.10 Skew: 1.99
Prob(H) (two-sided): 0.00 Kurtosis: 6.94
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of
gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with
condition number 3.25e+22. Standard errors may be unstable.
Ljung-Box (L1) (Q) and Prob(Q): The Ljung-Box test is a statistical test used
to check for autocorrelation in the residuals of the model. A high test statistic
value (Q) and a low p-value (Prob(Q)) indicate that there is significant auto-
correlation in the residuals, suggesting that the model may not have captured
all the underlying patterns in the data. In this case, the test statistic is 370.45,
and the p-value is 0.00, indicating evidence of significant autocorrelation in
the residuals. Jarque–Bera (JB) and Prob(JB): The Jarque–Bera test is a statis-
tical test used to determine whether the residuals are normally distributed.
A high test statistic value (JB) and a low p-value (Prob(JB)) indicate that
the residuals are not normally distributed. In this case, the test statistic is
2025.11, and the p-value is 0.00, suggesting that the residuals do not follow
a normal distribution.
The ARMA(p, q) model combines both components and can be written as
Y_t = c + φ₁Y_{t−1} + … + φ_pY_{t−p} + ε_t + θ₁ε_{t−1} + … + θ_qε_{t−q},
where c is a constant, the φ_i are the autoregressive coefficients, the θ_j are the moving average coefficients, and ε_t is a white-noise error term.
In simpler terms, the ARMA model represents the current value of the time
series as a linear combination of its past values (autoregressive component),
past error terms (moving average component), and a constant term. The
model aims to capture both the dependence on previous values and the effect
of random shocks in the data.
The ARIMA model is a generalization of the Autoregressive (AR) and MA
models, combining their strengths to better capture the dynamics of time
series data. ARIMA models can also incorporate differencing to handle non-
stationary data. An ARIMA model is defined by three parameters: p, d , and
q. The p parameter refers to the order of the AR component, the d parameter
represents the degree of differencing, and the q parameter indicates the order
of the MA component. An ARIMA(p, d, q) model can be written as:
Φ(B)(1 − B)^d Y_t = Θ(B)ε_t,
where Φ(B) and Θ(B) are the lag polynomials for the AR and MA components, respectively, B is the backshift operator, and (1 − B)^d represents the differencing operator applied d times.
ARIMA (Autoregressive Integrated Moving Average) models are particu-
larly useful for forecasting time series data with complex patterns, such as
trends, seasonality, and a combination of short- and long-term dependencies.
By choosing the appropriate p, d , and q parameters, researchers can capture
a wide range of time series behaviors using ARIMA models.
SARIMA (Seasonal Autoregressive Integrated Moving Average) is an exten-
sion of the ARIMA model that specifically addresses seasonality. In addition
to the AR, differencing, and MA components, SARIMA models include
seasonal components for each of these features. This allows the model to
capture complex seasonal patterns in the time series data.
SARIMAX (Seasonal Autoregressive Integrated Moving Average with
Exogenous Regressors) is an extension of the ARMA model that incorpo-
rates both seasonality and exogenous variables, making it more versatile and
capable of capturing complex patterns in time series data, especially when
there are seasonal effects or additional external factors influencing the obser-
vations. This enhanced modeling capability allows SARIMAX to provide
more accurate forecasts and insights compared to the simpler ARMA model.
There is no universally “best” model among ARMA, ARIMA, SARIMA,
and SARIMAX, as the choice of the most suitable model depends on the
specific characteristics of the time series data being analyzed. The optimal
model for a given dataset may vary based on factors such as trend, seasonality,
and the presence of exogenous variables.
To select the best model for your data, consider the following:
• If your time series is stationary, with no trend or seasonality, an ARMA model may be sufficient.
• If your time series has a trend but no seasonality, an ARIMA model with differencing is usually appropriate.
• If your time series exhibits both trend and seasonality, a SARIMA model
may be the most appropriate choice.
• If your time series has trend, seasonality, and is affected by external factors
or exogenous variables, a SARIMAX model is likely the best option.
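The code that produces the following output is not reproduced in the excerpt; a sketch consistent with the reported model (an ARIMA(2, 1, 3) fitted to the BTC series with statsmodels, whose results are labelled "SARIMAX Results") is:

from statsmodels.tsa.arima.model import ARIMA

# data is assumed to be the DataFrame loaded earlier in the chapter
arima_result = ARIMA(data['BTC'], order=(2, 1, 3)).fit()
print(arima_result.summary())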
SARIMAX Results
==============================================================================
Dep. Variable: BTC No. Observations: 1554
Model: ARIMA(2, 1, 3) Log Likelihood -12914.550
Date: Sun, 07 May 2023 AIC 25841.100
Time: 11:27:37 BIC 25873.187
Sample: 0 HQIC 25853.033
- 1554
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.7486 0.033 22.721 0.000 0.684 0.813
ar.L2 -0.9021 0.034 -26.280 0.000 -0.969 -0.835
ma.L1 -0.7920 0.036 -22.196 0.000 -0.862 -0.722
ma.L2 0.9124 0.040 22.678 0.000 0.834 0.991
ma.L3 -0.0092 0.017 -0.533 0.594 -0.043 0.025
sigma2 9.793e+05 1.45e+04 67.450 0.000 9.51e+05 1.01e+06
===================================================================================
Ljung-Box (L1) (Q): 0.03 Jarque-Bera (JB): 7231.20
Prob(Q): 0.86 Prob(JB): 0.00
Heteroskedasticity (H): 0.08 Skew: 0.16
Prob(H) (two-sided): 0.00 Kurtosis: 13.57
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of
gradients (complex-step).
A bivariate VAR(1) model can be written as:
Y1(t) = c1 + a11·Y1(t−1) + a12·Y2(t−1) + e1(t),
Y2(t) = c2 + a21·Y1(t−1) + a22·Y2(t−1) + e2(t).
Here, Y1(t) and Y2(t) represent the values of the two time series at time t, c1 and c2 are constants, a11, a12, a21, and a22 are the coefficients to be estimated, and e1(t) and e2(t) are the error terms.
VAR models can be estimated using ordinary least squares (OLS) on a
per-equation basis. To determine the optimal number of lags to include
in the model, researchers can use information criteria, such as the Akaike
Information Criterion (AIC) or the Bayesian Information Criterion (BIC).
Some key features of VAR models include:
• Impulse Response Functions (IRFs): IRFs show the response of each vari-
able to a one-time shock in another variable while holding all other shocks
constant. They help researchers analyze the dynamic effects of shocks to
the system and understand the transmission of shocks across variables.
• Variance Decomposition: Variance decomposition measures the propor-
tion of the forecast error variance of each variable that is attributable to
shocks in other variables in the system. It provides insights into the relative
importance of each variable in explaining the forecast error variance of the
other variables.
• Granger Causality Tests: Granger causality tests are used to determine
whether one variable can predict another variable better than using the past
values of the dependent variable alone. Granger causality does not imply
true causality but can provide evidence of a predictive relationship between
variables.
To implement a VAR model in Python, researchers can use the VAR class
from the statsmodels.tsa.vector_ar.var_model module, which provides tools
for estimating, analyzing, and forecasting VAR models. An implementation
of VAR model on D:/Data/my_data4.csv with BTC as Y 1 and ETH as Y 2,
assuming that the series are stationary is as follows:
import pandas as pd
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import adfuller
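The body of this implementation is not shown in the excerpt; a minimal sketch consistent with the description that follows (ADF tests on BTC and ETH, lag selection by AIC, VAR fit) is given below. The file path and date parsing are assumptions.

# Load the data and set the Date column as the index
data = pd.read_csv('D:/Data/my_data4.csv', parse_dates=['Date'], index_col='Date')
selected = data[['BTC', 'ETH']]

# ADF test for each selected column
for col in selected.columns:
    stat, pvalue, *_ = adfuller(selected[col])
    print(f'{col}: ADF statistic = {stat:.3f}, p-value = {pvalue:.4f}')

# Fit a VAR model, choosing the lag order by AIC
model = VAR(selected)
lags = model.select_order(maxlags=10)
results = model.fit(lags.aic)
print(results.summary())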
This code loads time series data of Bitcoin and Ethereum prices from a CSV
file, sets the date column as the index, selects the columns of interest, and
performs an Augmented Dickey-Fuller (ADF) test for each of the selected
columns to check for stationarity. A vector autoregression (VAR) model is
then fitted to the selected data, and the optimal lag order is determined using
the Akaike information criterion (AIC). Finally, the VAR model is fit with
the optimal lag order, and a summary of the model results is printed. This
code provides a basic example of how to perform time series analysis using
the VAR model in Python.
If the series are non-stationary, we also need to difference them before fitting the model. The code in this case would be:
selected_diff = selected.diff().dropna()  # difference the non-stationary series first
model = VAR(selected_diff)
lags = model.select_order(maxlags=10)
optimal_lag_order = lags.aic
print(f"Optimal lag order (AIC): {optimal_lag_order}")
One of the main advantages of the VAR model is its ability to capture
the linear interdependencies among multiple time series, which makes it
particularly useful for analyzing the behavior of interconnected economic
and financial variables. The model is easy to estimate using ordinary least
squares (OLS) and provides a straightforward way to examine the dynamic
effects of shocks to the system through impulse response functions (IRFs).
Additionally, VAR models can be used for forecasting multiple time series
simultaneously, making them suitable for a variety of applications.
However, VAR model requires a large number of parameters to be esti-
mated, which can lead to overfitting, especially in cases where the number
of time series and lags is high. This problem can be mitigated by using
model selection criteria, such as the Akaike Information Criterion (AIC) or
the Bayesian Information Criterion (BIC), to determine the optimal number
of lags. Another disadvantage of VAR models is that they are purely linear,
which means they may not accurately capture complex, non-linear relation-
ships between time series. This limitation can be addressed by employing
non-linear models, such as the Nonlinear Vector Autoregression (NVAR)
model or the Vector Autoregression Moving Average (VARMA) model. VAR
models are not suitable for analyzing time series with structural breaks or
regime shifts, as the model assumes constant coefficients throughout the
entire sample period. In such cases, more advanced techniques, such as the
Markov-switching VAR model or the time-varying parameter VAR model,
may be more appropriate. VAR model is essentially a reduced-form model,
which means it does not provide a structural interpretation of the relation-
ships among the variables. Researchers who are interested in identifying causal
relationships or understanding the underlying economic mechanisms may
need to employ structural models, such as structural VAR models, which
impose identifying restrictions based on economic theory.
The VECM can be written in the form
ΔY_t = ΠY_{t−1} + Γ₁ΔY_{t−1} + … + Γ_{p−1}ΔY_{t−p+1} + ε_t,
where Π = αβ′ captures the long-run (cointegration) relationships, with α the loading (adjustment) coefficients and β the cointegrating vectors, and the Γ_i capture the short-run dynamics.
Advantages of VECM
• Long-run and Short-run Dynamics: VECM allows for the simultaneous
analysis of long-run equilibrium relationships and short-run adjustments
among multiple time series variables, providing a comprehensive under-
standing of their interdependencies.
• Cointegration: By considering the cointegration relationships between
non-stationary variables, VECM avoids the problem of spurious regres-
sion, which can lead to misleading results when analyzing non-stationary
data.
To implement VECM in Python, you can use the VECM class from
the statsmodels.tsa.vector_ar.vecm module. Before applying VECM, you’ll
need to determine the appropriate lag order (p) and the rank of the cointe-
gration matrix (∏) using model selection criteria or tests for cointegration,
such as the Johansen test, as follows:
import pandas as pd
from statsmodels.tsa.vector_ar.vecm import VECM, select_order, coint_johansen
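The estimation code is not included in the excerpt; a minimal sketch continuing from these imports (the file path, the columns, and the cointegration rank of 1 are assumptions) is:

# Load the two series (path and columns are placeholders)
data = pd.read_csv('D:/Data/my_data4.csv', parse_dates=['Date'], index_col='Date')
series = data[['BTC', 'ETH']]

# Choose the lag order for the VECM
lag_order = select_order(series, maxlags=10, deterministic='ci')
print(f'Lag order selected by AIC: {lag_order.aic}')

# Johansen cointegration test to determine the cointegration rank
johansen = coint_johansen(series, det_order=0, k_ar_diff=lag_order.aic)
print('Trace statistics:', johansen.lr1)
print('Critical values (90%, 95%, 99%):', johansen.cvt)

# Fit the VECM with the chosen lag order and an assumed rank of 1
vecm_result = VECM(series, k_ar_diff=lag_order.aic, coint_rank=1,
                   deterministic='ci').fit()
print('Loading coefficients (alpha):\n', vecm_result.alpha)
print('Cointegration relations (beta):\n', vecm_result.beta)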
The VECM output provides the loading coefficients (alpha) for each variable in the system, as well as the cointegration relations (beta) for each cointegrating vector.
GARCH Model
The Generalized Autoregressive Conditional Heteroskedasticity (GARCH)
model is a time series model commonly used to model volatility clustering,
where large price changes in a time series tend to be followed by more large
price changes, and small price changes tend to be followed by more small
price changes. This phenomenon is often observed in financial data, where
sudden market shocks can lead to prolonged periods of high volatility.
The GARCH model is based on the idea that the variance of a time series
is a function of its own past values and the past values of its own squared
residuals, which are the deviations of the actual values from the predicted
values. The model assumes that the squared residuals follow an Autoregressive
(AR) process and a Moving Average (MA) process, with the variance of the
time series being a function of these past squared residuals.
The GARCH(p, q) model can be represented by the following equation:
σ_t² = ω + Σᵢ αᵢ ε²_{t−i} + Σⱼ βⱼ σ²_{t−j},
where ω is a constant, the αᵢ terms weight the lagged squared residuals (the ARCH component), and the βⱼ terms weight the lagged conditional variances (the GARCH component).
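The chapter's GARCH code is not included in this excerpt. One common way to fit a GARCH(1,1) in Python is the third-party arch package; using arch here, together with the file path and the use of percentage returns, is an assumption, not necessarily the author's implementation.

import pandas as pd
from arch import arch_model  # third-party 'arch' package

# Placeholder path; percentage returns are commonly used for numerical stability
data = pd.read_csv('D:/Data/my_data4.csv', parse_dates=['Date'], index_col='Date')
returns = 100 * data['BTC'].pct_change().dropna()

garch = arch_model(returns, vol='GARCH', p=1, q=1)
garch_result = garch.fit(disp='off')
print(garch_result.summary())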
Panel data, also known as cross-sectional time series data, is a type of data
that combines both cross-sectional and time series dimensions. In accounting
and finance research, panel data typically consists of observations on multiple
entities (such as firms, countries, or individuals) over multiple time periods.
This data structure enables researchers to study the dynamics of relationships
between variables and analyze the effects of various factors over time.
Panel data plays a crucial role in accounting and finance research as it
allows researchers to control for unobserved heterogeneity between entities,
which can lead to more accurate estimates and inferences. It also provides
a richer source of information by combining both cross-sectional variation
and time series variation, which can help in identifying causal relationships.
Furthermore, panel data enables researchers to study dynamic relationships
and analyze the impact of past events on current outcomes.
The structure of panel data includes two dimensions: cross-sectional and
time series. The cross-sectional dimension refers to the individual entities
being observed, such as companies or countries, while the time series dimen-
sion refers to the observations of these entities over different time periods.
In a balanced panel, each entity has the same number of observations across
all time periods. In contrast, an unbalanced panel has an unequal number of
observations for different entities or time periods.
Working with panel data offers several benefits, such as the ability to
control for unobserved fixed effects, which can reduce omitted variable
bias and improve the accuracy of estimates. Moreover, it exploits both
within-entity and between-entity variations, which can improve the power of
statistical tests and increase the precision of estimates. Pooled OLS models assume that the relationships between variables are the same across all entities and time periods. They can provide efficient estimates when these assumptions hold, but they may suffer from omitted variable bias and other issues
when unobserved heterogeneity is present.
The choice between fixed effects and random effects models depends on
the underlying assumptions and the research question being addressed. The
Hausman test is a widely used statistical test that helps researchers decide
between these two models. The test compares the estimates obtained from
both models and evaluates the null hypothesis that the random effects model
is consistent and efficient. If the Hausman test fails to reject the null hypoth-
esis, the random effects model is preferred, as it provides more efficient
estimates. However, if the test rejects the null hypothesis, it suggests that the
random effects model is inconsistent, and the fixed effects model should be
used instead. In Python, Hausman statistic can be calculated as under:
import pandas as pd
import numpy as np
import scipy.stats as stats
from linearmodels import PanelOLS, RandomEffects
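The calculation itself is not shown in the excerpt; the sketch below follows the description that comes next (a manual Hausman statistic built from the parameter estimates and covariance matrices of the two models). It assumes data is a DataFrame with a (entity, time) MultiIndex and the crypto columns used elsewhere in the chapter; the regressor names are assumptions.

# Fit the fixed effects and random effects models
exog = data[['ETH', 'BNB', 'ADA']]
fe_res = PanelOLS(data['BTC'], exog, entity_effects=True).fit()
re_res = RandomEffects(data['BTC'], exog).fit()

# Hausman statistic: (b_FE - b_RE)' [Var(b_FE) - Var(b_RE)]^(-1) (b_FE - b_RE)
diff = fe_res.params - re_res.params
stat = float(diff.T @ np.linalg.inv(fe_res.cov - re_res.cov) @ diff)
dof = len(diff)
p_value = 1 - stats.chi2.cdf(stat, dof)
print('Hausman statistic:', stat, 'p-value:', p_value)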
This code calculates the Hausman test statistic manually using the param-
eter estimates and covariance matrices of the fixed effects and random effects
models. It then calculates the p-value using the chi-squared distribution from
the scipy.stats library. If the p-value is smaller than your chosen significance
level (e.g., 0.05), you would reject the null hypothesis and prefer the fixed
effects model. If the p-value is larger than the significance level, you would
not reject the null hypothesis, and the random effects model might be more
appropriate.
A pooled OLS specification can be written as Y_it = α + βX_it + ε_it. Pooled OLS also assumes that the error terms are uncorrelated across observations; in practice, however, errors are often correlated within the same entity or during the same time period. This can lead to inefficient estimation and may adversely affect hypothesis testing.
Pooled OLS models offer a simple and intuitive approach to panel data
analysis. However, researchers should be cautious when using them, as they
may not provide accurate results when unobserved heterogeneity or correla-
tions between observations are present in the data. Alternative methods, such
as fixed effects or random effects models, may be more suitable in these cases
and should be considered as part of a robust panel data analysis.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
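The model code is not reproduced in the excerpt; a minimal sketch of the specification interpreted below (BTC regressed on ETH, BNB and ADA with year fixed effects entered through the C(Year) formula term) is given here. The file path is a placeholder.

# Load the data and construct a Year column for the fixed effects
data = pd.read_csv('D:/Data/my_panel_data.csv', parse_dates=['Date'])
data['Year'] = data['Date'].dt.year

model = smf.ols('BTC ~ ETH + BNB + ADA + C(Year)', data=data).fit()
print(model.summary())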
Notes:
[1] Standard Errors assume that the covariance matrix of the
errors is correctly specified.
[2] The condition number is large, 1.46e+04. This might indicate
that there are strong multicollinearity or other numerical
problems.
The model includes fixed effects for the Year variable, which represents the
time series dimension of the panel data. The fixed effects are represented by
the C(Year) terms in the model formula. The results indicate that the model
has a high R-squared value of 0.932, suggesting a strong relationship between
the independent variables and the dependent variable. The coefficients for the
ETH and BNB variables are both statistically significant, with p-values of less
than 0.001. However, the coefficient for the ADA variable is not statistically
significant, with a p-value of 0.388. This suggests that ADA may not have a
significant effect on BTC. The fixed effects coefficients suggest that Year has
a significant effect on BTC, with the largest coefficient for the Year 2021.
import pandas as pd
import statsmodels.formula.api as smf
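The estimation code is not shown in the excerpt; a minimal sketch of the mixed (random effects) model interpreted below is given here. The file path and the use of 'UniqueIdentifier' as the grouping (entity) column are assumptions.

# Random effects for each entity via a mixed linear model
data = pd.read_csv('D:/Data/my_panel_data.csv')
re_model = smf.mixedlm('BTC ~ ETH + BNB + ADA', data=data,
                       groups=data['UniqueIdentifier'])
re_result = re_model.fit()
print(re_result.summary())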
The model is estimated using the mixed linear model regression method, with
random effects for each entity in the panel data. Intercept term estimates the
expected value of the dependent variable (BTC) when all the explanatory
variables are equal to zero. In this model, the intercept is 6726.135, which
means that when all the explanatory variables are equal to zero, the expected
value of BTC is around 6726.135. The coefficients for the explanatory vari-
ables (ETH, BNB, and ADA) estimate the expected change in the dependent
variable (BTC) for a one-unit increase in each of the explanatory variables,
holding all other variables constant. For example, a one-unit increase in ETH
is associated with a 5.911 increase in BTC, on average, while holding BNB
and ADA constant. Similarly, a one-unit increase in BNB is associated with a
40.393 increase in BTC, on average, while holding ETH and ADA constant.
The standard errors of the coefficients indicate the amount of variation in
the estimated coefficients due to sampling error. The z-scores measure the
number of standard errors away from the null hypothesis of zero effect and are
used to test the statistical significance of the coefficients. In this model, all the
explanatory variables are highly statistically significant, as indicated by their
low p-values and high z-scores. Group variance estimate of 20,647,469.979
indicates the amount of variation in the dependent variable that is explained
by the random effects for each entity in the panel data. This value is impor-
tant for assessing the relative importance of the random effects versus the
fixed effects in explaining the variation in the dependent variable.
Python implementation using GLS for the same data would be:
import pandas as pd
from linearmodels import PanelOLS
from linearmodels.panel import RandomEffects
# Set the index as a MultiIndex with Date as the first level and the unique
# identifier as the second level
data = data.set_index(['Date', 'UniqueIdentifier'])
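The estimation step itself is not shown; a minimal sketch fitting the random effects model by feasible GLS with linearmodels (the regressor names follow the earlier examples, and the add_constant import is an addition of this sketch) is:

import statsmodels.api as sm  # used here only for add_constant

exog = sm.add_constant(data[['ETH', 'BNB', 'ADA']])
re_result = RandomEffects(data['BTC'], exog).fit()
print(re_result)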
Dynamic panel data models, which include lagged values of the dependent variable among the regressors, are widely used in accounting and finance research, as they allow
values of variables on current outcomes, as well as capture time-varying effects
and unobserved heterogeneity. Dynamic panel data models have been used
to analyze a wide range of topics, such as corporate governance, financial
performance, risk management, and capital structure decisions.
The key difference between static and dynamic panel data models lies in
the inclusion of lagged dependent variables as explanatory variables. While
static panel data models only consider the contemporaneous relationships
between variables, dynamic panel data models incorporate the past values of
the dependent variable to capture the persistence or inertia in the relation-
ships. The general equation for a linear dynamic panel data model can be
written as follows:
Y_it = α + β₁Y_{i,t−1} + β₂X_it + μ_i + λ_t + ε_it,
where Y_it is the dependent variable, Y_{i,t−1} is the lagged dependent variable, X_it is a matrix of independent variables, α is the intercept term, β₁ and β₂ are vectors of coefficients, μ_i and λ_t are the unobserved entity-specific and time-specific effects, respectively, and ε_it is the error term; i and t index the entity and the time period, respectively.
There are several types of dynamic panel data models, each with distinct
features and suited for different research scenarios. Some of the most
common dynamic panel data models include:
1. Linear Dynamic Panel Data Models: These models assume a linear rela-
tionship between the lagged dependent variable, independent variables,
and the current value of the dependent variable. They are the most basic
form of dynamic panel data models. Linear dynamic panel data models
can capture the persistence or inertia in the relationships between vari-
ables, allowing researchers to study both short-term and long-term effects
import pandas as pd
import xarray as xr
from linearmodels.panel import PanelOLS, FirstDifferenceOLS, compare
from linearmodels.iv import IV2SLS
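The model-fitting code is not included in the excerpt; a minimal sketch consistent with the comparison described below (a fixed effects estimate and a first-difference estimate of a model with a lagged dependent variable, compared with the compare function) is given here. The DataFrame df, its (entity, time) MultiIndex, and the column names are assumptions.

# Create the lagged dependent variable within each entity
df['y_lag'] = df.groupby(level=0)['y'].shift(1)
df = df.dropna()
exog = df[['y_lag', 'x1', 'x2']]

# Fixed effects and first-difference estimates of the dynamic specification
fe_res = PanelOLS(df['y'], exog, entity_effects=True).fit()
fd_res = FirstDifferenceOLS(df['y'], exog).fit()

print(compare({'Fixed effects': fe_res, 'First differences': fd_res}))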
Finally, the results of both models are compared and printed using the
compare function from linearmodels.
• Dynamic Probit Models: These models are used when the dependent vari-
able is binary, representing two possible outcomes. The dynamic probit
model includes lagged values of the dependent variable as explanatory vari-
ables, capturing persistence or state-dependence in the binary outcomes.
For instance, a dynamic probit model could be used to analyze corpo-
rate bankruptcy predictions, where the binary outcome could represent
bankruptcy (1) or no bankruptcy (0).
import pandas as pd
from linearmodels.panel import PooledOLS
import statsmodels.api as sm
# Dummy entity
df['Year'] = 'Year_1'  # replace with the actual entity column if available
import pandas as pd
import statsmodels.api as sm
# Dummy entity
df['Year'] = df['Date'].dt.year
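A pooled dynamic probit along the lines described above could be sketched as follows; the use of sm.Probit and the column names ('entity', 'bankrupt', 'x1') are assumptions, not the chapter's exact code.

# Lag the binary outcome within each entity to capture state dependence
df = df.sort_values(['entity', 'Date'])
df['bankrupt_lag'] = df.groupby('entity')['bankrupt'].shift(1)
df = df.dropna(subset=['bankrupt_lag'])

exog = sm.add_constant(df[['bankrupt_lag', 'x1']])
probit_result = sm.Probit(df['bankrupt'], exog).fit()
print(probit_result.summary())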
• Dynamic Tobit Models: These models are used when the dependent vari-
able is censored, meaning that it is only observed within a certain range.
The dynamic Tobit model incorporates past values of the dependent vari-
able to capture dynamic relationships in the censored outcomes. This could
be useful, for example, in analyzing the impact of past financial perfor-
mance on current dividend payouts, where the dividend payout cannot
fall below zero.
import pandas as pd
import numpy as np
from scipy import stats
from scipy.optimize import minimize
Dynamic panel data models with time-varying coefficients allow the intercept and the slope coefficients to change over time. A general form is
Y_it = α_t + β_t Y_{i,t−1} + γ_t X_it + μ_i + λ_t + ε_it,
where Y_it is the dependent variable, Y_{i,t−1} is the lagged dependent variable, X_it is a matrix of independent variables, α_t, β_t, and γ_t are time-varying intercepts and coefficients, μ_i and λ_t are the unobserved entity-specific and time-specific effects, respectively, and ε_it is the error term; i and t index the entity and the time period. In these models, the coefficients β_t and γ_t are allowed to vary over time, capturing the temporal changes in the relationships between the dependent variable and the lagged dependent variable or independent variables. This allows the model to capture more complex dynamics, such as non-stationarity, structural breaks, or regime changes in the data.
Estimation of dynamic panel data models with time-varying coefficients
can be challenging, as it requires the estimation of a larger number of
parameters and involves more complex assumptions. Common estimation
techniques include Maximum Likelihood, Bayesian methods, and Kalman
Filtering. These methods often rely on strong assumptions about the process
governing the time-varying coefficients, such as linearity or stationarity, and
may require large amounts of data for reliable estimation.
4. Dynamic Panel Threshold Models: These are a type of dynamic panel data
model that is particularly useful for examining non-linear relationships in
panel data. These models allow for different regimes or states in the data,
which can capture discontinuities or threshold effects in the relationships
between variables. In a Dynamic Panel Threshold Model, the relationship
between the dependent variable and the independent variables can change
depending on the value of a threshold variable. The threshold variable can
be one of the independent variables or an entirely separate variable. The
model assumes that the data switches between different regimes when the
threshold variable crosses a certain threshold value.
A two-regime dynamic panel threshold model can be written as
Y_it = (α₁ + β₁Y_{i,t−1} + γ₁X_it)·I(Z_it ≤ τ) + (α₂ + β₂Y_{i,t−1} + γ₂X_it)·I(Z_it > τ) + ε_it,
where Y_it is the dependent variable, Y_{i,t−1} is the lagged dependent variable, X_it is a matrix of independent variables, α₁, α₂, β₁, β₂, γ₁, and γ₂ are parameters to be estimated, Z_it is the threshold variable, τ is the threshold value, I(·) is the indicator function, and ε_it is the error term; i and t index the entity and the time period, respectively.
The key feature of Dynamic Panel Threshold Models is their ability to
capture different dynamics in different regimes, which can help uncover more
complex and nuanced relationships in the data. For example, these models
can show how the effect of a financial shock or policy change may differ
above and below a certain threshold. Estimating Dynamic Panel Threshold
Models can be quite challenging due to the presence of endogeneity and
autocorrelation issues, as well as the need to determine the threshold value.
The estimation usually involves iterative procedures, where the threshold
value and the model parameters are estimated simultaneously. Hansen (1999)
developed an estimator for static panel threshold models, and this approach
can be extended to dynamic panel threshold models.
5. Dynamic Panel Data Models with Spatial Dependence: These are a sophis-
ticated type of dynamic panel data models that account for spatial inter-
actions or dependencies between the entities under study. These models
are increasingly used in fields such as regional science, environmental
economics, urban planning, and other areas where spatial relationships are
crucial. In the context of accounting and finance research, dynamic panel
data models with spatial dependence could be used to analyze phenomena
such as the diffusion of corporate practices across firms, the impact of
geographical proximity on investment decisions, or the spillover effects of
financial shocks across countries or regions.
Spatial dependence in a dynamic panel data model implies that the dependent
variable for one entity at a particular time period could be influenced not just
by its own past values or the current and past values of other variables, but
also by the values of the dependent variable for other entities. This spatial
interaction is typically modeled through a spatial weight matrix that captures
the relationship or proximity between entities.
The general form of a spatial dynamic panel data model can be written as
follows:
Y_it = α_i + β_iY_{i,t−1} + γ_iX′_it + ρ Σ_j w_ij Y_jt + ε_it,
where Y_it is the dependent variable, Y_{i,t−1} is the lagged dependent variable, X′_it is a matrix of independent variables, α_i is the entity-specific intercept term, β_i and γ_i are the entity-specific coefficients for the lagged dependent variable and the independent variables, respectively, ρ is the spatial autoregressive parameter, w_ij are the elements of the spatial weight matrix, and ε_it is the error term; i and t index the entity and the time period, respectively.
Estimating DPD models can be challenging, as they involve a large number of parameters to be estimated and may require specialized estimation techniques.
# Breusch-Pagan test
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(model.resid, model.model.exog)
print('Breusch-Pagan test p-value:', bp_test[1])
# White test
from statsmodels.stats.diagnostic import het_white
white_test = het_white(model.resid, model.model.exog)
print('White test p-value:', white_test[1])
This code uses the fixed effects model (model) already estimated earlier in this chapter. An adaptation of this code can be appended to the end of any regression code.
import pandas as pd
import matplotlib.pyplot as plt
# Durbin-Watson test
from statsmodels.stats.stattools import durbin_watson
dw_test = durbin_watson(model.resid)
print('Durbin-Watson test statistic:', dw_test)
# Breusch-Godfrey test for serial correlation
from statsmodels.stats.diagnostic import acorr_breusch_godfrey
bg_test = acorr_breusch_godfrey(model, nlags=1)
print('Breusch-Godfrey test p-value:', bg_test[1])
# VIF test for multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame({'VIF': [variance_inflation_factor(model.model.exog, i)
                            for i in range(model.model.exog.shape[1])]},
                   index=model.model.exog_names)
print('VIF:\n', vif)
# Cook's distance for influential observations (data is the DataFrame used to fit the model)
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(model)
cooks_dist = pd.Series(influence.cooks_distance[0],
                       index=data.index, name="Cook's Distance")
print("Cook's distance:\n", cooks_dist)
# Leverage plot
from statsmodels.graphics.regressionplots import plot_leverage_resid2
fig, ax = plt.subplots(figsize=(10, 6))
plot_leverage_resid2(model, ax=ax)
plt.show()
• Normality tests: Normality assumptions are often made about the error
terms in panel data models. Diagnostic tests for normality include the
Jarque–Bera test, the Shapiro–Wilk test, and the Kolmogorov–Smirnov
test.
# Jarque-Bera test
from scipy.stats import jarque_bera
jb_test = jarque_bera(model.resid)
print('Jarque-Bera test statistic:', jb_test[0])
print('Jarque-Bera test p-value:', jb_test[1])
# Shapiro-Wilk test
from scipy.stats import shapiro
shapiro_test = shapiro(model.resid)
print('Shapiro-Wilk test statistic:', shapiro_test[0])
print('Shapiro-Wilk test p-value:', shapiro_test[1])
# Kolmogorov-Smirnov test
from scipy.stats import kstest
ks_test = kstest(model.resid, 'norm')
print('Kolmogorov-Smirnov test statistic:', ks_test[0])
print('Kolmogorov-Smirnov test p-value:', ks_test[1])
For all these tests, rejection of the null hypothesis, as indicated by the test statistic and p-value, indicates that the residuals are not normally distributed.
# RESET test
from statsmodels.stats.diagnostic import linear_reset
reset_test = linear_reset(model)
print('RESET test p-value:', reset_test.pvalue)
In the context of the RESET test, a significant p-value suggests that the
observed relationship between the dependent variable and the independent
variables cannot be adequately explained by the current model, indicating a
potential functional form misspecification. The Hausman test has already been discussed earlier.
Panel data analysis is a powerful and widely used method in academic
research, particularly in the fields of accounting and finance. Panel data, also
known as longitudinal data, refers to data that contains
observations on multiple entities (such as firms or individuals) over time.
This data structure enables researchers to analyze both cross-sectional and
time series variations, providing valuable insights into various phenomena.
In academic accounting and finance research, panel data analysis has
proven to be invaluable in investigating complex relationships and addressing
key research questions. One major advantage of panel data is its ability to control for unobserved, time-invariant heterogeneity across entities and to capture dynamics over time.
In PCA, variables with larger variances receive more weight, which can sometimes lead to biased results if the variables are not appropriately scaled or if the variance is not necessarily indicative of the variable's importance.
Here is an example with a dataset, D:/Data/acctg.csv, in which gvkey is the cross-sectional identifier and fyear is the time variable. The dependent variable is ue_ce, and the remaining eight variables are independent variables. A PCA implementation, assuming there are no missing values and all variables are numerical, is as follows:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Load the data and standardize the eight independent variables
df = pd.read_csv('D:/Data/acctg.csv')
features = df.drop(columns=['gvkey', 'fyear', 'ue_ce'])
features_scaled = StandardScaler().fit_transform(features)
# Apply PCA
pca = PCA(n_components=3)  # Reduce dimensionality to 3
features_pca = pca.fit_transform(features_scaled)
# Combine the components with the dependent variable
pca_df = pd.DataFrame(features_pca, columns=['Principal Component 1', 'Principal Component 2', 'Principal Component 3'])
pca_df['ue_ce'] = df['ue_ce'].values
print(pca_df.head())
The resulting data will consist of four variables, i.e., the dependent variable and the three principal components. The result will look like:
Principal Component 1 Principal Component 2 Principal Component 3 ue_ce
0 -0.949901 -0.575974 -0.053595 0.016199
1 -0.778554 0.913777 0.329411 -0.019262
2 0.248846 -0.455588 -0.063551 0.019104
3 -1.022333 -0.022026 -0.178154 -0.002423
4 -0.994182 -0.230897 -0.12789 0.016183
Here’s a Python code that implements Kaiser Criterion, Scree Plot, and
Cumulative Explained Variance Rule on the data mentioned earlier:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load your dataset and standardize the independent variables
df = pd.read_csv('D:/Data/acctg.csv')
features = df.drop(columns=['gvkey', 'fyear', 'ue_ce'])
features_scaled = StandardScaler().fit_transform(features)
# Apply PCA with all components retained
pca = PCA()
pca.fit(features_scaled)
# Kaiser Criterion: retain components with eigenvalue greater than 1
eigenvalues = pca.explained_variance_
n_components_kaiser = len(eigenvalues[eigenvalues > 1])
print('Components retained by the Kaiser criterion:', n_components_kaiser)
# Scree plot
plt.plot(np.arange(1, len(eigenvalues) + 1), eigenvalues, 'ro-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Eigenvalue')
plt.show()
# Cumulative explained variance rule: retain components explaining, e.g., 80% of the variance
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_cumvar = int(np.argmax(cum_var >= 0.80)) + 1
print('Components retained by the cumulative variance rule:', n_components_cumvar)
# Number of folds
n_folds = 5
The cross-validated scores are then plotted against the number of compo-
nents. The optimal number of components is the one that gives the highest
cross-validated score.
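A minimal sketch of this cross-validation approach, assuming the standardized feature matrix features_scaled, the DataFrame df with the dependent variable ue_ce, and n_folds from the snippets above, is:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Evaluate a PCA + regression pipeline for each candidate number of components
cv = KFold(n_splits=n_folds, shuffle=True, random_state=42)
scores = []
for k in range(1, features_scaled.shape[1] + 1):
    pipeline = Pipeline([('pca', PCA(n_components=k)),
                         ('reg', LinearRegression())])
    score = cross_val_score(pipeline, features_scaled, df['ue_ce'],
                            cv=cv, scoring='r2').mean()
    scores.append(score)

# The optimal number of components maximizes the cross-validated score
best_k = int(np.argmax(scores)) + 1
print('Optimal number of components:', best_k)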
Parallel analysis, along with the Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy recommended for factor analysis, is demonstrated below. The code performs parallel analysis by comparing the eigenvalues from the PCA with those obtained from random data of the same dimensions. The optimal number of components is the number of components whose eigenvalues exceed the corresponding random benchmark.
import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_kmo
# KMO measure of sampling adequacy on the standardized features
kmo_all, kmo_model = calculate_kmo(pd.DataFrame(features_scaled))
print('Overall KMO:', kmo_model)
# Parallel analysis: 95th percentile of eigenvalues of random data of the same shape
random_eigs = [np.sort(np.linalg.eigvalsh(np.corrcoef(np.random.normal(size=features_scaled.shape), rowvar=False)))[::-1] for _ in range(100)]
random_val = np.percentile(random_eigs, 95, axis=0)
print('Components whose eigenvalues exceed the random benchmark:', int(np.sum(eigenvalues > random_val)))
Factor Analysis
Factor Analysis is a multifaceted, statistical procedure predominantly utilized
in the behavioral and social sciences, but its utility extends to the domain of
accounting and finance research as well. It is a technique to identify latent
variables or ‘factors’ from a set of observed variables. By doing so, it can
condense the information contained in the original variables into a smaller
set of factors with minimum loss of information.
import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
# Loading data and dropping the identifier and dependent variables
df = pd.read_csv('D:/Data/acctg.csv')
features = df.drop(columns=['gvkey', 'fyear', 'ue_ce'])
# Checking adequacy with Bartlett's test of sphericity
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value, p_value = calculate_bartlett_sphericity(features)
print(chi_square_value, p_value)
# Fit the factor analyzer and check eigenvalues
fa = FactorAnalyzer(n_factors=3, rotation='varimax')
fa.fit(features)
ev, v = fa.get_eigenvalues()
print('Eigenvalues:', ev)
# Factor loadings and variance explained by each factor
print(fa.loadings_)
print(fa.get_factor_variance())
You should adjust the number of factors (n_factors) based on the scree
plot and your research questions. Note that the factor loadings represent the
correlation between the observed variables and the factors.
The output consists of two parts:
The first matrix shown is the ‘Factor Loadings’ matrix, which reflects the
correlation between each variable and the inferred factors. Each row repre-
sents a variable, and each column represents a factor. A larger absolute value
for a loading indicates a stronger association between the variable and the
factor. Loadings can be positive or negative, indicating the direction of the
relationship between a variable and a factor.
The second array shown in the output is the ‘Factor Variance’ matrix. This
matrix contains three rows. The first row shows the ‘SS Loadings’, which is
the sum of squared loadings for each factor (the sum of the square of each
column in the loadings matrix). It represents the variance in all the variables
accounted for by each factor. The second row shows the ‘Proportion Var’,
which is the proportion of the total variance accounted for by each factor. The
third row shows the ‘Cumulative Var’, which is the cumulative proportion of
the total variance accounted for by the current and all preceding factors.
The interpretation of these results is largely contingent on the context
of data and the research questions you are trying to answer. Your task as a
researcher is to assign meaningful interpretations to the factors based on the
loadings of the variables on the factors.
Cluster Analysis
Cluster analysis, a fundamental technique in the domain of multivariate anal-
ysis, holds substantial importance in academic research within accounting
and finance. This analytical technique strives to identify homogeneous
subgroups within larger heterogeneous datasets. It offers a versatile method-
ology for exploratory data analysis, hypothesis generation, and pattern recog-
nition, facilitating an effective distillation of the essence from large volumes
of data.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Standardize the variables and assign each observation to one of 3 clusters
df = pd.read_csv('D:/Data/acctg.csv')
features_scaled = StandardScaler().fit_transform(df.drop(columns=['gvkey', 'fyear']))
df['cluster'] = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(features_scaled)
Please replace '3' with the desired number of clusters and adapt the variable names to match your data. This script adds a new column, cluster, to your DataFrame, indicating the cluster to which each row belongs.
Determining the optimal number of clusters for a clustering analysis is a
critical aspect and involves selecting a suitable number of groups that effec-
tively capture the underlying structure within the data. Several methods exist
to aid in this decision-making process. In this section, we will discuss two
commonly used approaches: the elbow method and the silhouette method.
We will illustrate the implementation of these techniques using the acctg.csv
dataset mentioned earlier in Python.
The elbow method aims to identify the number of clusters by evaluating the within-cluster sum of squares (WCSS) for different values of k. The WCSS represents the sum of the squared distances between each data point and the centroid of the cluster to which it is assigned.
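A minimal sketch of the elbow method, assuming the standardized feature matrix features_scaled from the earlier k-means snippet, is:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Compute the WCSS (inertia) for k = 1..10 and look for the "elbow" in the curve
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(features_scaled)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()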
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
# Silhouette method: average silhouette score for k = 2 to 10
sil_scores = [silhouette_score(features_scaled, KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(features_scaled)) for k in range(2, 11)]
plt.plot(range(2, 11), sil_scores, 'go-')
plt.xlabel('Number of clusters (k)'); plt.ylabel('Average silhouette score')
plt.show()
In the plot, the average silhouette score is plotted against the number of clus-
ters (k). Higher silhouette scores indicate better clustering results, where data
points within clusters are more similar to each other than to those in other
clusters. The optimal number of clusters can be determined by selecting the
value of k that maximizes the average silhouette score (4 in this case).
In addition to the widely used k-means clustering algorithm, several other
methods have been developed for clustering analysis. These methods offer
alternative approaches to identifying clusters within datasets, each with its
own underlying assumptions and characteristics.
1. Hierarchical Clustering:
Hierarchical clustering builds a nested hierarchy of clusters, either by successively merging smaller clusters (agglomerative) or by repeatedly splitting larger ones (divisive). The results are usually visualized with a dendrogram, and the number of clusters does not need to be fixed in advance.
2. Density-Based Clustering:
Density-based methods, such as DBSCAN, group together points that lie in densely populated regions of the feature space and treat points in sparse regions as noise or outliers. They can discover clusters of arbitrary shape. A brief scikit-learn sketch of several of these alternatives is given after this list.
3. Model-Based Clustering:
Model-based clustering approaches assume that the data points are gener-
ated from a mixture of probability distributions. These methods estimate the
parameters of the underlying distributions and assign data points to clusters
based on their likelihood of belonging to a specific distribution. Exam-
ples of model-based clustering algorithms include Gaussian Mixture Models
(GMM) and Finite Mixture Models (FMM).
4. Fuzzy Clustering:
Fuzzy clustering allows data points to belong to multiple clusters with varying
degrees of membership. Unlike traditional hard clustering algorithms that
assign each data point to a single cluster, fuzzy clustering assigns membership
values indicating the degree to which each point belongs to different clusters.
Fuzzy C-Means (FCM) is a well-known algorithm for fuzzy clustering.
5. Spectral Clustering:
Spectral clustering combines techniques from linear algebra and graph theory
to partition data points into clusters. It leverages the spectral properties
of the data similarity matrix and performs dimensionality reduction using
eigenvectors. Spectral clustering can effectively handle complex datasets with
non-linear structures and has shown promising results in various applications.
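As a brief illustration of these alternatives, the following scikit-learn sketch assumes the standardized feature matrix features_scaled from the earlier k-means snippet; the parameter values are illustrative only (fuzzy c-means is not part of scikit-learn and is omitted here).

from sklearn.cluster import AgglomerativeClustering, DBSCAN, SpectralClustering
from sklearn.mixture import GaussianMixture
# Hierarchical (agglomerative) clustering
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(features_scaled)
# Density-based clustering (DBSCAN); a label of -1 marks noise points
db_labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(features_scaled)
# Model-based clustering with a Gaussian Mixture Model
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(features_scaled)
# Spectral clustering
spec_labels = SpectralClustering(n_clusters=3, random_state=42, assign_labels='kmeans').fit_predict(features_scaled)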
import pandas as pd
from sklearn.cross_decomposition import CCA
Discriminant Analysis
import pandas as pd
import statsmodels.api as sm
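A minimal sketch continuing from these imports is shown below. It assumes a hypothetical binary group indicator column named group in the dataset; for two groups, the discriminant function coefficients are proportional to the coefficients from a regression of the group indicator on the predictors, so an OLS fit provides coefficients, t-statistics, and p-values.

df = pd.read_csv('D:/Data/acctg.csv')
# 'group' is a hypothetical binary class indicator (e.g., distressed vs. healthy firms)
X = sm.add_constant(df.drop(columns=['gvkey', 'fyear', 'group']))
da_model = sm.OLS(df['group'], X).fit()
print(da_model.summary())  # coefficients, t-statistics, and p-values for each variable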
The output will be a table with DA coefficients, t-stats, and p-values for all
independent variables.
Multivariate analysis holds an indispensable role in the sphere of
accounting and finance research, illuminating the interconnectedness of
multiple variables and their collective impact on financial phenomena. The
utilization of these sophisticated statistical techniques allows researchers to
apprehend the intricate nature of financial markets, business operations, and
economic systems. It provides the ability to scrutinize the complex rela-
tionships among financial variables that would otherwise remain obscured
in univariate or bivariate analyses. The diverse suite of multivariate analysis methods, including Principal Component Analysis, Cluster Analysis, and Discriminant Analysis, ensures the flexibility and adaptability required to address the multifaceted research questions prevalent
in the field.
Part VI
Advanced Topics
27
Deep Learning
Deep learning has found extensive applications across various domains due
to its capacity to learn complex patterns and representations from large
datasets. In image recognition and computer vision, deep learning models
like Convolutional Neural Networks (CNNs) have demonstrated outstanding
performance in tasks such as object detection, image classification, and facial
recognition. These applications have significant implications for industries
like security, automotive, health care, and retail.
In the realm of natural language processing (NLP), deep learning tech-
niques, including Recurrent Neural Networks (RNNs), Long Short-Term
Memory (LSTM) networks, and Transformer models, have revolutionized
tasks like machine translation, sentiment analysis, text summarization, and
chatbots. Deep learning has also been employed to develop highly accu-
rate speech recognition systems, enabling voice assistants to understand and
respond to spoken commands, and generating realistic human-like speech
for text-to-speech systems. Furthermore, deep learning models have been
utilized for creating personalized recommendations in industries such as e-
commerce, entertainment, and social media. In health care, deep learning
has been applied to various applications, including medical image analysis,
drug discovery, and patient monitoring.
Deep learning is extremely useful for academic research in various
fields, including accounting and finance. Its ability to handle large and
complex datasets and learn intricate patterns makes it particularly suitable
for addressing research questions in these areas. One of the key applications
of deep learning in accounting and finance research is financial statement
analysis. Deep learning models can be employed to analyze financial state-
ments, extracting valuable information that may help predict future financial
performance, evaluate solvency, and assess credit risk.
Another area where deep learning has made a significant impact is fraud
detection. By leveraging deep learning techniques, researchers can detect
fraudulent activities in financial data, identifying anomalies, irregularities, or
suspicious patterns that may indicate financial fraud or earnings manipula-
tion. Additionally, deep learning models have proven valuable in predicting
the likelihood of corporate bankruptcy or financial distress by analyzing
various financial ratios, market data, and other relevant features.
Furthermore, deep learning has found applications in algorithmic trading
and portfolio management. Researchers use these advanced models to develop
trading strategies and optimize portfolio management by predicting asset
prices, modeling market sentiment, and identifying patterns in financial time
series data.
An example of application of deep learning in accounting and finance
research is to predict corporate bankruptcy. Predicting corporate bankruptcy
is an essential task in accounting research, as it helps investors, financial
institutions, and other stakeholders assess the financial health of companies
and make informed decisions. To develop a deep learning model that can
effectively predict the likelihood of bankruptcy based on historical financial
data, researchers must first collect financial data of companies, including both
bankrupt and non-bankrupt firms. This data can be obtained from sources
like financial statements, stock market data, and economic indicators, with
relevant features such as financial ratios, market capitalization, and stock price
volatility.
Once the data has been collected, it needs to be cleaned and preprocessed
to address missing values, outliers, and scaling issues. The dataset is then
divided into training, validation, and testing sets to facilitate model training,
tuning, and evaluation. Researchers can choose a suitable deep learning archi-
tecture for the problem, such as a feedforward neural network, recurrent
neural network (RNN), or long short-term memory (LSTM) network. These
architectures can effectively model the relationships between financial features
and bankruptcy likelihood.
During the model training phase, the chosen deep learning model is trained on the training dataset using a suitable loss function and optimization algorithm, such as binary cross-entropy loss with the Adam optimizer.
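As a minimal illustration of such a model, the following PyTorch sketch trains a small feedforward network on a hypothetical feature matrix X of financial ratios and a binary bankruptcy label y; the data, layer sizes, and training settings are illustrative assumptions rather than a prescribed specification.

import torch
import torch.nn as nn

# Hypothetical data: 1,000 firms, 10 financial ratios, binary bankruptcy label
X = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000,)).float()

model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1)              # one logit; probability obtained via sigmoid inside the loss
)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    optimizer.zero_grad()
    logits = model(X).squeeze(1)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()

# Predicted bankruptcy probabilities for the same (illustrative) firms
probs = torch.sigmoid(model(X).squeeze(1))
print(probs[:5])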
Neuron
In the context of artificial neural networks (ANN) and deep learning, a
neuron, also referred to as a node or artificial neuron, is a fundamental
computational unit that mimics the basic functioning of biological neurons
in the human brain. It receives input from other neurons or external sources,
processes the information, and produces an output based on the input and
its internal activation function.
1. Inputs: These are the incoming connections from other neurons or data
sources. Each input is associated with a weight, which represents the
strength or importance of the connection.
2. Weighted Sum: The neuron computes the weighted sum of its inputs by
multiplying each input by its corresponding weight and then summing the
results. This sum is also known as the neuron’s net input or pre-activation
value.
3. Bias: A bias term is added to the weighted sum to shift the activation func-
tion, enabling the neuron to learn more complex relationships between
inputs and outputs. The bias acts as a constant offset and is also a learnable
parameter.
4. Activation Function: The activation function is applied to the weighted
sum (plus bias) to transform the neuron’s output, introducing non-
linearity to the network. Common activation functions include the
sigmoid, hyperbolic tangent (tanh), ReLU (rectified linear unit), and
softmax functions.
5. Output: The result of applying the activation function is the neuron’s
output, which can be sent to other neurons in the network or used as
the final output for the model.
During training, the weights and biases of neurons in the network are
adjusted through a process called backpropagation, which involves mini-
mizing a loss function using optimization algorithms like gradient descent.
This process allows the artificial neural network to learn complex patterns
and relationships in the input data.
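To make these steps concrete, here is a minimal NumPy sketch of a single neuron's forward computation; the input values, weights, and bias are arbitrary illustrative numbers.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])      # inputs
w = np.array([0.4, 0.1, -0.6])      # weights, one per input
b = 0.2                             # bias (a learnable constant offset)

z = np.dot(w, x) + b                # weighted sum plus bias (pre-activation value)
output = sigmoid(z)                 # activation function applied to the pre-activation
print(z, output)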
Transformer Models
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast,
BertForSequenceClassification, Trainer, TrainingArguments
These imports include the necessary libraries and modules for handling data,
creating the model, and training the model.
df = pd.read_csv("D:/python
codes/Repositories/FinancialPhraseBank/all-data.csv",
encoding='utf-8', header=None, names=['sentiment', 'text'],
dtype=str)
df['sentiment'] = df['sentiment'].map({'positive': 2,
'negative': 0, 'neutral': 1})
df['sentiment'] = df['sentiment'].astype(int)
train_df, val_df = train_test_split(df, test_size=0.2,
random_state=42)
train_df.reset_index(drop=True, inplace=True)
val_df.reset_index(drop=True, inplace=True)
Here, the data is loaded from the CSV file and the sentiment labels are
converted to integer values. The integer values have to be within a range [0,
C-1], where C is the number of classes. In our case, we have 3 classes: posi-
tive, negative, and neutral, so the labels should be in the range [0, 2]. The
dataset is then split into training and validation sets. test_size = 0.2 in the
train_test_split function specifies that 20% of the dataset should be used as
the validation set, and 80% should be used as the training set. random_state
= 42 sets the random seed used for shuffling the data before splitting it into
the training and validation sets. By setting the random seed to a fixed value,
you can ensure that the data will be split in the same way every time you run
the code. This is useful for reproducibility purposes.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
The tokenizer is responsible for converting text into the format required by
the BERT model. The model is an instance of BertForSequenceClassification,
which is a BERT model with a classification head that is fine-tuned for sequence classification tasks.
MAX_LEN = 128
BATCH_SIZE = 16
EPOCHS = 3
class SentimentDataset(torch.utils.data.Dataset):
...
This part sets the hyperparameters for the model and creates a custom
PyTorch Dataset class to handle the input data. The dataset instances are
then created for training and validation. MAX_LEN: Maximum length of
input text sequences. Sequences that are longer than MAX_LEN will be
truncated and sequences that are shorter will be padded with special tokens.
BATCH_SIZE is the number of training examples processed in one iteration during training. EPOCHS is the number of times the entire training dataset is passed forward and backward through the neural network during training.
training_args = TrainingArguments(
...
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset
)
Here, the training arguments are set, and a Trainer instance is created to
handle the training and evaluation of the model.
trainer.train()
This line starts the training process using the Trainer instance.
The predict_sentiment() function, shown in the full listing below, takes a text input, preprocesses it using the tokenizer, and feeds it to the trained model to predict the sentiment.
if sentiment == 2:
    print("The sentiment of the given 10-K statement is Positive.")
elif sentiment == 0:
    print("The sentiment of the given 10-K statement is Negative.")
else:
    print("The sentiment of the given 10-K statement is Neutral.")
Finally, this part uses the predict_sentiment() function to predict the senti-
ment of a new 10-K statement and prints the result. It is pertinent to mention
that BERT model and tokenizer handle text preprocessing internally, which
includes lowercasing and tokenization. As a result, there is no need for any
additional normalization before using the tokenizer. BERT tokenizer expects
input text in a specific format, and it will handle the necessary preprocessing
steps internally. For example, the tokenizer will convert the text to lower-
case (since it’s using ‘bert-base-uncased’) and split it into tokens, including
WordPiece tokens for out-of-vocabulary words.
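For instance, passing a short, hypothetical sentence to the tokenizer shows the WordPiece tokens it produces:

tokens = tokenizer.tokenize("Net revenues increased substantially during fiscal 2023.")
print(tokens)  # lowercased WordPiece tokens produced by 'bert-base-uncased'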
A sample of detailed code is:
import torch
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast,
BertForSequenceClassification, Trainer, TrainingArguments
# Hyperparameters
MAX_LEN = 128
BATCH_SIZE = 16
EPOCHS = 3
# Dataset class
class SentimentDataset(torch.utils.data.Dataset):
def __init__(self, dataframe, tokenizer, max_len):
self.tokenizer = tokenizer
self.data = dataframe
self.text = dataframe.text
self.sentiment = dataframe.sentiment
self.max_len = max_len
def __len__(self):
return len(self.text)
    def __getitem__(self, index):
        text = str(self.text[index])
        inputs = self.tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=self.max_len,
padding='max_length',
return_token_type_ids=True,
truncation=True
)
return {
'input_ids': torch.tensor(inputs['input_ids'],
dtype=torch.long),
'attention_mask':
torch.tensor(inputs['attention_mask'], dtype=torch.long),
'labels': torch.tensor(self.sentiment[index],
dtype=torch.long)
}
# Training arguments
training_args = TrainingArguments(
output_dir='./results',
num_train_epochs=EPOCHS,
per_device_train_batch_size=BATCH_SIZE,
per_device_eval_batch_size=BATCH_SIZE,
warmup_steps=500,
weight_decay=0.01,
logging_dir='./logs',
)
# Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset
)
# Start training
trainer.train()
# Prediction function
def predict_sentiment(text, model, tokenizer):
inputs = tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=MAX_LEN,
padding="max_length",
return_token_type_ids=True,
truncation=True
)
input_ids = torch.tensor([inputs['input_ids']],
dtype=torch.long)
attention_mask = torch.tensor([inputs['attention_mask']],
dtype=torch.long)
with torch.no_grad():
outputs = model(input_ids,
attention_mask=attention_mask)
logits = outputs.logits
sentiment = torch.argmax(logits, dim=1).item()
return sentiment
However, since 10-K statements are long, a good approach is to use a combination of topic modeling and sentiment analysis. Topic modeling can be used
to identify the main themes or topics in a large document like a 10K, which
can then be used to inform the sentiment analysis. For example, if the topic
is about positive earnings, then it’s likely that the sentiment will also be posi-
tive. This approach can help to reduce the amount of noise in the data and
improve the accuracy of the sentiment analysis. A code that performs LDA
on all the text files and then calculates the sentiment score as an aggregate of
sentiments for all the topics is provided below. This code creates 20 topics.
Determining the optimum number of topics has already been discussed in
the section on Topic Modeling.
import os
import torch
import pandas as pd
import gensim
from gensim import corpora
from gensim.models import LdaModel
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast,
BertForSequenceClassification, Trainer, TrainingArguments
# Hyperparameters
MAX_LEN = 128
BATCH_SIZE = 16
EPOCHS = 3
# Dataset class
class SentimentDataset(torch.utils.data.Dataset):
def __init__(self, dataframe, tokenizer, max_len):
self.tokenizer = tokenizer
self.data = dataframe
self.text = dataframe.text
self.sentiment = dataframe.sentiment
self.max_len = max_len
def __len__(self):
return len(self.text)
    def __getitem__(self, index):
        text = str(self.text[index])
        inputs = self.tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=self.max_len,
padding='max_length',
return_token_type_ids=True,
truncation=True
)
return {
'input_ids': torch.tensor(inputs['input_ids'],
dtype=torch.long),
'attention_mask':
torch.tensor(inputs['attention_mask'], dtype=torch.long),
'labels': torch.tensor(self.sentiment[index],
dtype=torch.long)
}
# Training arguments
training_args = TrainingArguments(
output_dir='./results',
num_train_epochs=EPOCHS,
per_device_train_batch_size=BATCH_SIZE,
per_device_eval_batch_size=BATCH_SIZE,
warmup_steps=500,
weight_decay=0.01,
logging_dir='./logs',
)
# Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset
)
# Start training
trainer.train()
# LDA Model
def get_lda_model(texts, num_topics=20, passes=20):
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = LdaModel(corpus, num_topics=num_topics,
id2word=dictionary, passes=passes)
return lda_model, dictionary
return sentiment_score
A 10-K is a very lengthy and complex document. This code can also be applied to any other accounting and finance text data.
FNN Models
BERT is computationally very intensive, and it may take several days to produce output, depending on the size of your data. An FNN, which is simpler and computationally less intensive than BERT, can give good results when combined with topic modeling such as LDA. However, an FNN's performance might not be as good as BERT's. A sample FNN code with topic modeling is provided below. This code is similar in all respects to the code above, the only difference being the FNN instead of BERT.
import os
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import gensim
from gensim import corpora
from gensim.models import LdaModel
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
# Tokenizer
tokenizer = gensim.utils.simple_preprocess
# Hyperparameters
MAX_LEN = 128
BATCH_SIZE = 16
EPOCHS = 3
EMBEDDING_DIM = 100
HIDDEN_DIM = 64
NUM_CLASSES = 3
# Dataset class
class SentimentDataset(torch.utils.data.Dataset):
def __init__(self, dataframe, tokenizer, max_len):
self.tokenizer = tokenizer
self.data = dataframe
self.text = dataframe.text
self.sentiment = dataframe.sentiment
self.max_len = max_len
def __len__(self):
return len(self.text)
    def __getitem__(self, index):
        # Tokenize with simple_preprocess and map tokens to integer indices, padded to max_len
        # (vocab is assumed to be a word-to-index dictionary built from the training corpus)
        tokens = self.tokenizer(str(self.text[index]))[:self.max_len]
        indices = [vocab.get(token, 0) for token in tokens]
        indices += [0] * (self.max_len - len(indices))
        return {
            'input_ids': torch.tensor(indices, dtype=torch.long),
            'labels': torch.tensor(self.sentiment[index], dtype=torch.long)
        }
train_dataloader = DataLoader(train_dataset,
batch_size=BATCH_SIZE, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
# FNN model
class FNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super(FNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # Mean-pool the word embeddings over the sequence (a simple assumption), then apply two linear layers
        embedded = self.embedding(x).mean(dim=1)
        return self.fc2(torch.relu(self.fc1(embedded)))
# Training loop
for epoch in range(EPOCHS):
model.train()
for batch in train_dataloader:
optimizer.zero_grad()
input_ids = batch['input_ids']
labels = batch['labels']
outputs = model(input_ids)
loss = loss_function(outputs, labels)
loss.backward()
optimizer.step()
# Validation
model.eval()
total_val_loss = 0
correct_predictions = 0
with torch.no_grad():
for batch in val_dataloader:
input_ids = batch['input_ids']
labels = batch['labels']
outputs = model(input_ids)
loss = loss_function(outputs, labels)
total_val_loss += loss.item()
_, predicted = torch.max(outputs, 1)
correct_predictions += (predicted ==
labels).sum().item()
# LDA Model
def get_lda_model(texts, num_topics=20, passes=20):
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = LdaModel(corpus, num_topics=num_topics,
id2word=dictionary, passes=passes)
return lda_model, dictionary
sentiment = model(torch.tensor([indices]))
_, predicted = torch.max(sentiment, 1)
sentiment_score += predicted.item() * prob
return sentiment_score
The code first imports the necessary libraries and reads the labeled dataset for
training the FNN model. It then preprocesses the dataset and tokenizes the
text, splitting it into training and validation sets. A custom PyTorch dataset
class, SentimentDataset, is created to manage the input–output pairs from
the preprocessed dataset. Then, the FNN model is defined and trained on the
dataset using the Adam optimizer and cross-entropy loss. The performance of
the model is evaluated on the validation set after each training epoch. After
training the FNN model, the code focuses on applying LDA to a set of text
documents in a directory. Each document is read, preprocessed, and added
to a list of texts. Then, the LDA model is trained on this list of preprocessed
texts to generate topic distributions for each document.
A custom function, sentiment_score_for_topics, calculates the sentiment
scores of the topics within each document using the trained FNN model.
This function tokenizes and pads the terms extracted from the topics, and the
FNN model predicts the sentiment of these terms. The sentiment scores are
then weighted by the probability of the topics in the document and summed
to produce the final sentiment score for each document. Finally, the senti-
ment scores are calculated for all text files in the directory and printed. This
code demonstrates how topic modeling can be combined with sentiment
analysis to assess the sentiment of a collection of documents based on their
topics.
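A minimal sketch of the sentiment_score_for_topics() function described above is given below. It assumes the trained lda_model, dictionary, and FNN model from the preceding code, together with a word-to-index mapping vocab that is assumed to have been built from the training corpus.

def sentiment_score_for_topics(lda_model, dictionary, bow, model, vocab, max_len=MAX_LEN):
    # Weight each topic's predicted sentiment by the topic's probability in the document
    sentiment_score = 0.0
    model.eval()
    for topic_id, prob in lda_model.get_document_topics(bow):
        # Top terms of the topic, mapped to integer indices and padded to max_len
        terms = [dictionary[term_id] for term_id, _ in lda_model.get_topic_terms(topic_id, topn=10)]
        indices = [vocab.get(term, 0) for term in terms][:max_len]
        indices += [0] * (max_len - len(indices))
        with torch.no_grad():
            outputs = model(torch.tensor([indices], dtype=torch.long))
        _, predicted = torch.max(outputs, 1)
        sentiment_score += predicted.item() * prob
    return sentiment_score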
LSTM Models
LSTM is a type of RNN. A modification of the code to compute the sentiment score after creating topics through LDA is as follows:
import os
import torch
import pandas as pd
import gensim
from gensim import corpora
from gensim.models import LdaModel
from sklearn.model_selection import train_test_split
from torch import nn
from torch.utils.data import DataLoader
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR
# Hyperparameters
MAX_LEN = 128
BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 1e-4
LR_STEP = 1
GAMMA = 0.95
EMBED_DIM = 100
HIDDEN_DIM = 256
NUM_LAYERS = 2
# Dataset class
class SentimentDataset(torch.utils.data.Dataset):
def __init__(self, dataframe, tokenizer, max_len):
self.tokenizer = tokenizer
self.data = dataframe
self.text = dataframe.text
self.sentiment = dataframe.sentiment
self.max_len = max_len
def __len__(self):
return len(self.text)
    def __getitem__(self, index):
        text = str(self.text[index])
        inputs = self.tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=self.max_len,
padding='max_length',
return_token_type_ids=True,
truncation=True
)
return {
'input_ids': torch.tensor(inputs['input_ids'],
dtype=torch.long),
'attention_mask':
torch.tensor(inputs['attention_mask'], dtype=torch.long),
'labels': torch.tensor(self.sentiment[index],
dtype=torch.long)
}
# Create the dataset and dataloaders
train_dataset = SentimentDataset(train_df, tokenizer, MAX_LEN)
val_dataset = SentimentDataset(val_df, tokenizer, MAX_LEN)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,
shuffle=False)
# Training loop
for epoch in range(EPOCHS):
model.train()
for i, batch in enumerate(train_loader):
optimizer.zero_grad()
input_ids, attention_mask, labels = batch['input_ids'],
batch['attention_mask'], batch['labels']
outputs = model(input_ids)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
model.eval()
total_loss, total_correct, total_count = 0, 0, 0
for batch in val_loader:
input_ids, attention_mask, labels = batch['input_ids'],
batch['attention_mask'], batch['labels']
with torch.no_grad():
outputs = model(input_ids)
loss = criterion(outputs, labels)
_, predicted = torch.max(outputs, 1)
total_loss += loss.item()
total_correct += (predicted == labels).sum().item()
total_count += labels.size(0)
scheduler.step()
# Prediction function
def predict_sentiment(text, model, tokenizer):
inputs = tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=MAX_LEN,
padding="max_length",
return_token_type_ids=True,
truncation=True
)
input_ids = torch.tensor([inputs['input_ids']],
dtype=torch.long)
attention_mask = torch.tensor([inputs['attention_mask']],
dtype=torch.long)
    with torch.no_grad():
        outputs = model(input_ids)
    sentiment = torch.argmax(outputs, dim=1).item()
    return sentiment
# LDA Model
def get_lda_model(texts, num_topics=10, passes=20):
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = LdaModel(corpus, num_topics=num_topics,
id2word=dictionary, passes=passes)
return lda_model, dictionary
return sentiment_score
The code uses the PyTorch framework for building and training the model.
The first part of the code imports the necessary libraries, including PyTorch,
Pandas, Gensim, and Scikit-learn. It then defines the LSTM model class,
which inherits from the PyTorch nn.Module class. The LSTMModel class
takes as input several hyperparameters, such as the vocabulary size, embed-
ding dimension, hidden dimension, number of layers, and number of output
classes. The forward() method of the class processes the input data by first
passing it through an embedding layer, followed by the LSTM layer, and
finally through a linear layer to produce the output.
Next, the code defines the hyperparameters, such as the maximum
sequence length, batch size, number of epochs, and learning rate. It also
defines the dataset class and creates the training and validation datasets and
data loaders. The model is then initialized with the hyperparameters, and the
loss function and optimizer are defined. The code then trains the model for
the specified number of epochs, evaluating the validation loss and accuracy
at each epoch.
After training, the code defines a predict_sentiment() function that takes
as input a text string and returns the predicted sentiment of the text using
the trained model. The code then defines an LDA topic modeling function,
which takes as input a list of texts and returns an LDA model and dictionary.
The code also defines a sentiment_score_for_topics() function, which takes
as input an LDA model, text, and the trained sentiment analysis model, and
returns the sentiment score of the text by computing the sentiment scores of
the individual topics in the text.
Finally, the code iterates over all the text files in a given directory, prepro-
cesses them, and passes them through the LDA model and sentiment analysis
model to compute the sentiment scores for the texts. The sentiment scores are
then printed along with the corresponding file names. This code can be useful
for analyzing the sentiment of large volumes of textual data and extracting
insights from them.
The RNN (LSTM) model might not perform as well as the BERT model in certain cases, and you may need to fine-tune the hyperparameters to achieve
better performance. Fine-tuning a model involves adjusting its hyperparame-
ters to achieve better performance. Some key steps include training for more
epochs, adjusting the learning rate, tuning the learning rate scheduler, modi-
fying the LSTM architecture, using pre-trained word embeddings, applying
regularization techniques, optimizing the batch size, and performing model
selection. It’s essential to monitor the model’s performance on a validation set
during this process to avoid overfitting and ensure generalization to unseen
data. Keep in mind that fine-tuning can be time-consuming, and it might
require multiple iterations of experimentation to find the best combination
of hyperparameters.
GRU Model
# GRU Model
class GRUModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, num_classes, dropout=0.5):
        super(GRUModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers,
                          batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # Embed the token indices, run the GRU, and classify using the final hidden state of the last layer
        embedded = self.embedding(x)
        output, hidden = self.gru(embedded)
        return self.fc(hidden[-1])
Again, just like the LSTM, the GRU model also needs to be fine-tuned.
CNN Model
Convolutional Neural Networks can capture local patterns in the input data
and are often used for image recognition tasks. They can also be applied to
text data, with 1D convolutions, to detect local patterns in the text. CNNs
generally require less training time compared to RNNs (like LSTM or GRU)
but may not perform as well in capturing long-range dependencies in the text.
Here is a 1D implementation of CNN in the context of sentiment analysis
after LDA:
import os
import torch
import pandas as pd
import gensim
from gensim import corpora
from gensim.models import LdaModel
from sklearn.model_selection import train_test_split
from torch import nn
from torch.utils.data import DataLoader
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR
# Hyperparameters
MAX_LEN = 128
BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 1e-4
LR_STEP = 1
GAMMA = 0.95
EMBED_DIM = 100
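# A minimal 1D CNN text classifier used here for illustration; the layer sizes,
# filter count, and kernel size are assumptions rather than a prescribed architecture
class CNNModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes, num_filters=100, kernel_size=3):
        super(CNNModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, x):
        # (batch, seq_len) -> embeddings (batch, seq_len, embed_dim) -> (batch, embed_dim, seq_len)
        embedded = self.embedding(x).permute(0, 2, 1)
        features = self.pool(self.relu(self.conv(embedded))).squeeze(-1)
        return self.fc(features)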
# Training loop
for epoch in range(EPOCHS):
model.train()
for i, batch in enumerate(train_loader):
optimizer.zero_grad()
input_ids, attention_mask, labels = batch['input_ids'],
batch['attention_mask'], batch['labels']
outputs = model(input_ids)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
model.eval()
total_loss, total_correct, total_count = 0, 0, 0
for batch in val_loader:
input_ids, attention_mask, labels = batch['input_ids'],
batch['attention_mask'], batch['labels']
with torch.no_grad():
outputs = model(input_ids)
loss = criterion(outputs, labels)
_, predicted = torch.max(outputs, 1)
total_loss += loss.item()
total_correct += (predicted == labels).sum().item()
total_count += labels.size(0)
scheduler.step()
# Prediction function
def predict_sentiment(text, model, tokenizer):
inputs = tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=MAX_LEN,
padding="max_length",
return_token_type_ids=True,
truncation=True
)
input_ids = torch.tensor([inputs['input_ids']],
dtype=torch.long)
attention_mask = torch.tensor([inputs['attention_mask']],
dtype=torch.long)
with torch.no_grad():
outputs = model(input_ids)
_, predicted = torch.max(outputs, 1)
sentiment = predicted.item()
return sentiment
# LDA Model
def get_lda_model(texts, num_topics=20, passes=20):
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = LdaModel(corpus, num_topics=num_topics,
id2word=dictionary, passes=passes)
return lda_model, dictionary
return sentiment_score
In the beginning, necessary libraries such as PyTorch, pandas, and Gensim are
imported. The preprocessing and LDA functions are assumed to be already
defined, as they remain unchanged. Two deep learning models, an LSTM-
based and a 1D CNN-based model, are defined as separate classes. Both
models share a common architecture: They start with an embedding layer,
followed by layers specific to the model (either LSTM or 1D Convolution
layers), and end with a fully connected layer for classification. Hyperpa-
rameters for the models, such as the maximum sequence length, batch size,
learning rate, and embedding dimensions, are defined as well.
A custom PyTorch dataset class is created to handle the tokenization and
padding of the input text. This class also takes care of mapping the input
text to their corresponding sentiment labels. DataLoader objects are created
for both the training and validation datasets, which are then used during the
training process. The training loop iterates through the DataLoader objects,
feeding the input text into the selected deep learning model, computing the
loss using the CrossEntropyLoss criterion, and updating the model’s parame-
ters using the AdamW optimizer. The learning rate scheduler is also employed
to adjust the learning rate during training. After each epoch, the model's loss and accuracy are evaluated on the validation set.
Autoencoder Model
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from gensim import corpora
from gensim.models import LdaModel
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
texts = []
for file_name in file_list:
with open(os.path.join(dir_path, file_name), "r") as f:
texts.append(preprocess_data(f.read()))
test_data_encoded = encoder.predict(test_data)
test_data_decoded = autoencoder.predict(test_data)
test_data_reconstruction_error = np.mean(np.square(test_data -
test_data_decoded), axis=1)
In this code, we first preprocess the 10-K statement text files and apply
Latent Dirichlet Allocation (LDA) to extract topics from the documents.
By applying LDA, we transform the high-dimensional text data into a
lower-dimensional representation based on the learned topic distributions,
which allows us to focus on topic-level features for anomaly detection. We
then construct an autoencoder model for anomaly detection on the topic
distributions. Autoencoders are unsupervised neural networks that consist
of an encoder and a decoder. The encoder maps the input data to a lower-
dimensional latent space, and the decoder reconstructs the input data from
the latent representation. In this context, we train the autoencoder to mini-
mize the reconstruction error between the input topic distributions and the
reconstructed topic distributions. After training the autoencoder, we calculate
the reconstruction errors for both the training and test datasets. These errors
represent the difference between the original topic distributions and their
reconstructed counterparts. We set a threshold based on the 95th percentile
of the training data’s reconstruction errors to identify anomalies. Documents
with reconstruction errors above the threshold are considered anomalous,
indicating that their topic distributions deviate significantly from the norm.
The code outputs the number of anomalies detected in the test data, which
can be interpreted as documents with unusual topic distributions.
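To make the autoencoder step concrete, here is a minimal Keras sketch. It assumes train_data and test_data are document-by-topic arrays of LDA topic distributions produced by the preceding steps; the latent dimension and training settings are illustrative assumptions.

import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# train_data and test_data are assumed to be (n_documents, n_topics) arrays
n_topics = train_data.shape[1]
inp = Input(shape=(n_topics,))
encoded = Dense(8, activation='relu')(inp)                  # encoder: compress to a small latent space
decoded = Dense(n_topics, activation='softmax')(encoded)    # decoder: reconstruct the topic distribution
autoencoder = Model(inp, decoded)
encoder = Model(inp, encoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(train_data, train_data, epochs=50, batch_size=16, verbose=0)

# Reconstruction error with the anomaly threshold at the 95th percentile of training errors
train_err = np.mean(np.square(train_data - autoencoder.predict(train_data)), axis=1)
test_err = np.mean(np.square(test_data - autoencoder.predict(test_data)), axis=1)
threshold = np.percentile(train_err, 95)
print('Anomalies detected in test data:', int(np.sum(test_err > threshold)))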
The use of deep learning in accounting and finance has been steadily
increasing as the field recognizes the potential of these powerful models to
extract valuable insights from vast amounts of unstructured text data. By
employing deep learning techniques, practitioners in accounting and finance
can automate complex tasks, enhance decision-making processes, and gain a
competitive edge in the industry.
Some of the most promising applications of deep learning in accounting
and finance include fraud detection, risk management, sentiment anal-
ysis, and financial document analysis. By leveraging advanced models such
as Convolutional Neural Networks (CNNs), Recurrent Neural Networks
(RNNs), LSTMs, and autoencoders, researchers and professionals can
uncover hidden patterns and relationships in the data that traditional
methods might overlook.
Moreover, the combination of deep learning with other machine learning
techniques, such as Latent Dirichlet Allocation (LDA) for topic modeling,
can provide a more comprehensive understanding of financial documents
and their underlying structures. This integration enables the identification
of anomalies and unusual patterns in the data, which can be crucial for early
detection of potential risks and fraud.
As the volume of financial data continues to grow, the importance of deep
learning in accounting and finance will only become more pronounced. By
staying ahead of these advancements and understanding the latest develop-
ments in deep learning techniques, professionals in the field can ensure they
are well-equipped to tackle the challenges of the future and drive innovation
in accounting and finance.
With the knowledge gained from this chapter on various deep learning
models and their applications, we hope to inspire further exploration and
adoption of these techniques in the accounting and finance domain, leading
to more efficient, accurate, and insightful analysis of financial data.
Most Common Errors and Solutions
• ValueError: This error is raised when a function receives an argument that has the right type but an inappropriate value.
Solution: Make sure that the values you're using are appropriate for the context. For example, if you're trying to convert a string to an integer using int(), make sure the string represents a valid integer.
• ImportError: This error is raised when an import statement fails to find
the module definition or when a from … import fails to find a name that
is to be imported.
Solution: Check the spelling and case of your module name. Also, ensure
that the module is installed and accessible in the Python path.
• ZeroDivisionError: This error is raised when you try to divide a number
by zero.
Solution: Add checks in your code to prevent division by zero. For
example, you might add an if statement to check that the denominator
is not zero before performing the division.
• FileNotFoundError: This error is raised when Python can't locate the file you're trying to open with the built-in open() function.
Solution: Ensure the file exists at the path you’re specifying. Also, check
for any typos in the filename or path.
• AttributeError: This error occurs when you try to access an attribute or
method that doesn’t exist on an object.
Solution: Check the object you’re trying to access the attribute or method
on. Ensure that it is defined and that it is spelled correctly. Use the dir()
function to see all the attributes and methods of an object.
• KeyError: This error is raised when a dictionary is accessed with a key that
does not exist in the dictionary.
Solution: Before accessing a key, check if it exists in the dictionary using
the in keyword or the dictionary’s get() method, which returns None
instead of raising an error if the key is not found.
• IndexError: This error is raised when you try to access an index that does
not exist in a list, tuple, or string.
Solution: Make sure the index you’re trying to access exists. You can use
the len() function to check the length of your sequence.
• ModuleNotFoundError: This error occurs when the module you are
trying to import cannot be found.
Solution: Check the spelling and case of the module name. Make sure the
module is installed in your Python environment and is in your Python
path.
• MemoryError: This error is raised when an operation runs out of memory.
Solution: Review your code to find places where you could reduce memory
usage. This could involve deleting objects that aren’t needed anymore,
using generators instead of lists where possible, or optimizing your algo-
rithms.
• OverflowError: This error occurs when the result of an arithmetic opera-
tion is too large to be expressed by Python’s built-in numerical types.
Solution: Use the decimal or fractions modules for more precise arith-
metic, or use the numpy or scipy libraries if you’re working with large
numerical datasets.
• StopIteration: This error is raised by the next() function (or an iter-
ator object’s __next__() method) to signal that there are no further items
produced by the iterator.
Solution: If you’re writing a function or method that uses next(), consider
using the default parameter to avoid raising a StopIteration error. In
Python 3.7 and later, you can use the for loop to iterate over the items
in an iterator, which automatically handles the StopIteration error.
Index
L
Latent Dirichlet Allocation (LDA) 213–224
Lemmatization 136–137
Lexicon-based methods 278–281
Limiting by date range 104–105
Line chart 158, 163
Linguistic rules 274–278
Lists 21
Literals 33
Logistic regression 298–300, 319–322
Logit regression 339–341
Long short-term memory (LSTM) 467
Lowercasing 134
LSTM model 485–491

M
Machine learning algorithms 282–300
Machine learning libraries 82
Matplotlib 158, 184
Matplotlib library 74
Mean 199
Median 199
Metacharacters 34
Mode 199
Modified Z-score method 205
Moving average 208
Multiple regression 309–314

N
Naive Bayes 250–255, 283–287
Negative binomial regression 374–376
Network graph 187

O
Object Oriented Programming in Python 26
Outlier detection 204
Overview of python programming language 5

P
Pandas 67
Panel data 411–429
Panel data diagnostics 429–433
Pearson’s correlation coefficient 206
Perplexity 217
Pie Chart 159, 169
Plotly 167
Poisson regression 373–374
Polynomial regression 343–355
Pooled OLS models 414–417
Portfolio management 462
Principal component analysis 436–441
Probabilistic latent semantic analysis (PLSA) 229–231
Probit and logit regression 329–341
Probit regression 329–335
PyMC3 80
PyTorch 86

Q
Quantifiers 37
Quantile regression 357–364

R
Random effects models 417–419
Random forests 261–264, 294–298