Building an Effective Data Science Practice
A Framework to Bootstrap and Manage a Successful Data Science Practice
Vineet Raina
Srinath Krishnamurthy
Part I: Fundamentals ..... 1
Chapter 1: Introduction: The Data Science Process ..... 3
What We Mean by Data Science ..... 4
The Data Science Process ..... 6
Machine Learning ..... 8
Data Capture (from the World) ..... 9
Data Preparation ..... 12
Data Visualization ..... 13
Inference ..... 14
Data Engineering ..... 15
Terminology Chaos: AI, ML, Data Science, Deep Learning, Etc. ..... 15
Conclusion ..... 20
Further Reading ..... 21
References ..... 21
Chapter 5: Regression ..... 57
Data Capture ..... 58
Data Preparation ..... 59
Data Visualization ..... 60
Machine Learning ..... 61
Inference ..... 62
Conclusion ..... 62
Chapter 7: Clustering ..... 75
Data Capture ..... 76
Data Preparation ..... 78
Handling Missing Values ..... 78
Normalization ..... 79
Data Visualization ..... 79
Machine Learning ..... 80
Similarity of Observations ..... 81
Data Visualization Iteration ..... 84
Inference ..... 86
Interpreting the Dendrogram ..... 86
Actionable Insights for Marketing ..... 88
Conclusion ..... 89
Further Reading ..... 89
Reference ..... 89
Complex Anomalies ..... 108
Collective Anomalies ..... 108
Contextual Anomalies ..... 109
Conclusion ..... 111
Further Reading ..... 112
References ..... 112
Chapter 9: Recommendations ..... 113
Data Capture ..... 114
Items and Interactions ..... 114
Quantifying an Interaction ..... 114
Example Data ..... 115
Data Preparation ..... 116
Normalization ..... 117
Handling Missing Values ..... 118
Data Visualization ..... 118
Machine Learning ..... 119
Clustering-Based Approach ..... 119
Inference ..... 120
End-to-End Automation ..... 121
Conclusion ..... 122
Further Reading ..... 122
References ..... 123
Processing Videos ..... 129
Video Classification ..... 130
Object Tracking ..... 130
Data Science Process for Computer Vision ..... 131
The World and Data Capture ..... 131
Data Preparation ..... 132
Data Visualization ..... 134
Machine Learning ..... 134
Inference ..... 136
Data Engineering ..... 137
Conclusion ..... 138
Further Reading ..... 138
References ..... 139
Transforming Images ..... 182
Libraries and Tools ..... 184
Libraries ..... 184
Tools ..... 184
Data Engineering ..... 185
Conclusion ..... 185
Logistic Regression ..... 214
Support Vector Machine ..... 220
Decision Tree ..... 225
Random Forest ..... 232
Gradient Boosted Trees ..... 234
Artificial Neural Network ..... 238
Convolutional Neural Network ..... 247
Evaluating and Tuning Models ..... 249
Evaluating Models ..... 249
Tuning Models ..... 252
Cross-Validation ..... 253
Libraries and Tools ..... 255
Data Engineering ..... 255
Conclusion ..... 256
Further Reading ..... 256
References ..... 256
Data Scientist ..... 313
Chief Data Scientist ..... 313
Deviations in Skills ..... 314
Conclusion ..... 314
Data Quality ..... 338
Importance of Data Quality ..... 339
Dimensions of Data Quality ..... 341
Measuring Data Quality ..... 343
Ensuring Data Quality ..... 344
Resistance to Data Quality Efforts ..... 345
Data Protection and Privacy ..... 346
Encryption ..... 346
Access Controls ..... 347
Identifiable/Protected/Sensitive Information ..... 348
Federated Learning ..... 350
Legal and Regulatory Aspects ..... 350
When Are These Relevant? ..... 351
Nondiscrimination ..... 351
Explainability and Accountability ..... 352
Explainable AI: What Is an “Explanation”? ..... 353
Cognitive Bias ..... 354
Cognitive Bias and Data Science Projects ..... 355
Conclusion and Further Reading ..... 356
References ..... 356
Index ..... 359
About the Authors
Vineet Raina is a Chief Data Scientist at GS Lab, India, and has led
the effort of setting up a data science group at GS Lab, which has now
successfully executed data science projects in diverse fields like healthcare,
IoT, and communication. He has also led research projects in Computer
Vision and Demand Forecasting and developed new data science
algorithms/techniques in areas like model performance tuning.
Vineet is a computer science engineer from Pune University with a
master’s degree from BITS Pilani. For most of his 17-year professional
career, he has been associated with data science projects, and he holds
two US patents. Prior to joining GS Lab, he worked at SAS for seven
years building data science products. He has presented papers at global
conferences and has given talks at colleges on topics related to data
science. He has also been associated with universities for research projects
in the field of data science.
About the Technical Reviewer
Jojo Moolayil is an artificial intelligence
professional and published author of three
books on machine learning, deep learning, and
IoT. He is currently working with Amazon Web
Services as a research scientist – AI in their
Vancouver, BC, office.
In his current role with AWS, he works on
researching and developing large-scale AI
solutions for combating fraud and enriching
the customer’s payment experience in the
cloud. He is also actively involved as a tech
reviewer and AI consultant with leading publishers and has reviewed over
a dozen books on machine learning, deep learning, and business analytics.
You can reach out to Jojo at www.jojomoolayil.com/ or www.linkedin.com/in/jojo62000.
Acknowledgments
The culture of interdisciplinary innovation fostered at GS Lab has been
instrumental in providing us with the experiences that formed the
foundation of this book.
The editorial team at Apress – Celestin Suresh John, Aditee Mirashi,
Jim Markham, and Matthew Moodie – has been extremely helpful in
coordinating and supporting the development of this book. Jojo John
Moolayil reviewed the early drafts and provided suggestions that helped
improve the quality of the book.
Mugdha Hardikar helped create some of the visualizations in Chapter 15.
Introduction
An increasing number and variety of organizations are eager to adopt data
science now, regardless of their size and sector. In our collaborations and
discussions with technology leaders from various companies, we noticed
some recurring patterns:
1. Such as https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/
PART I
Fundamentals
This part talks about data science and how it could be beneficial to your
business. The three chapters in this part lay the groundwork for the rest of
this book.
In Chapter 1, we shall cover our perspective of what data science is and
introduce the data science process around which the rest of the book is
based.
In Chapter 2, we cover the business aspects around data science.
We touch upon various benefits of data science to help you understand its
importance for your business and, at the same time, cover aspects to help you
evaluate the readiness of your business for data science. This chapter also
introduces the notion of how business needs drive the data science.
In Chapter 3, we introduce the two cultures of data science and how
these matter to your business. The cultures permeate every aspect from
the technical ecosystem to the processes, as we shall see in the rest of the
book.
CHAPTER 1
Introduction: The Data Science Process
…May 29, 1919. What was to become perhaps the most important
eclipse in the history of physics. The eclipse allowed the scientists to
collect data about the position of certain stars, which would indicate the
course taken by the light from the stars as it grazed the sun on its way to us
on Earth.
Nov 1919. Analyses of the data confirmed Einstein’s predictions.
General relativity was established as a more accurate description of the laws
of gravitation in our universe than the existing standard – Newton’s.
1. Note that this model is a result of “engineering” rather than science. Though the notion of “model” is not limited to science, in this book, we will use the term model to refer to models created by applying the scientific method.
Figure 1-1. Data science process (iterations are implicit across any of these steps). The figure depicts the steps as a pipeline: the World (the reality we are trying to model and the source of data), Data Capture (ingest data from sensors, devices, databases, etc.), Data Preparation (cleanse and transform data, signal processing), Data Visualization (visual analytics to gain insights useful for modeling), Machine Learning (train models that reflect the real-world phenomena), and Inference (use the models in real-world applications and processes for predictions, insights, etc.), with Data Engineering running beneath all the steps.
require going back and forth repeatedly between the different steps – we
refer to these as iterations of the data science process. Let us look at a few
examples:
As we see, the data science process does not end with creating
models – it continues beyond, to ensuring that the models can be deployed
to production operations and monitored. This leads to continual gathering
of observational evidence to validate or improve the models over time.
The entire ecosystem of techniques, tools, and skills is oriented around
this data science process – we shall keep returning to it and see how it is
realized as part of end-to-end solutions. This data science process will also
form the basis for how we categorize the various techniques, tools, and
technologies in Part 3.
Let us look at a few simple examples to separately illustrate each
step in the process. We shall start with the step that is at the heart of the
scientific method – machine learning – which results in candidate models.
Machine Learning
Let us start by looking at the machine learning step in Figure 1-2, because
that is fundamental to understanding the other steps of the process.
[Figure 1-2: from the observations of Experience and Salary, the machine learning step produces the model Salary = 8 x Experience + 40; predicting for Experience = 2.5 gives Salary = 8 x 2.5 + 40 = 60K.]
2. The algorithm in this case is linear regression, which we shall cover in Chapter 16.
This model can now be used in the next step, inference, to predict the
salary of a new individual. For example, if the individual has 2.5 years’
experience (a value that was not present in the original observations), this
model will predict their salary to be 60K.
Note that, like all models, this is only a sufficiently useful representation
of reality – there are bound to be some errors and approximations. For
example, from the data, we see that 6 years’ experience should result in a
salary of 87K, but our model will predict 88K, with an error of 1K.
Nevertheless, if we consider all the data points together, this model yields
an acceptable error overall.3 If the errors are acceptable, as in this example,
the model is useful. If the error is too high, we would continue to search for
a better model.
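To make this step concrete, here is a minimal sketch in Python using scikit-learn (one common library choice; the book discusses libraries and tools in Part 3). The salary figures are illustrative stand-ins for the observations in Figure 1-2, chosen so that the learned line comes out close to Salary = 8 x Experience + 40.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative observations: years of experience vs. salary (in thousands).
# These are stand-in values, not the exact data from Figure 1-2.
experience = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
salary = np.array([49, 55, 65, 71, 81, 87])

# Machine learning step: linear regression searches for the line
# (slope and intercept) that best fits the observations.
model = LinearRegression().fit(experience, salary)
print(f"Learned model: Salary = {model.coef_[0]:.1f} x Experience + {model.intercept_:.1f}")

# Inference step: use the model for a value not present in the observations.
print("Predicted salary for 2.5 years:", model.predict([[2.5]])[0])
```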
The machine learning algorithm and the resulting model can get
increasingly complex in structure. While we have an equation that can
represent our model in this case, some other models would be represented
as trees or complex graph topologies and so forth. We shall cover these in
Chapter 16.
3. Model performance evaluation will also be covered in Chapter 16 in more detail.
Data Capture (from the World)
[Figure: a rainfall prediction model is asked to predict for Temperature = 22 and Humidity = 75.]
4. This is an extremely simplistic view. But the fundamental formulation of predicting precipitation (rainfall in our example) based on other factors such as temperature, humidity, etc., would be applicable, especially for localized weather predictions, for example, when you have your own weather station on your industry premises.
Since the wind speed was not captured and the model uses only
temperature and humidity, it predicts rainfall of 8 mm for a temperature
of 22 degrees and 75% humidity irrespective of the wind speed. This
prediction will be correct only if the current wind speed was also close
to 10 kmph. The prediction may be incorrect if the wind speed is very
different from 10 kmph (say 20 kmph).
The model does not represent reality sufficiently because it was
unable to learn if and how rainfall is impacted by wind speed.5 The main
underlying issue here was that we did not capture all the data necessary to
be able to create a model that could sufficiently represent reality.
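The effect of a missing attribute can be simulated in a few lines. The sketch below uses synthetic data of our own (not the book's dataset): rainfall is generated from temperature, humidity, and wind speed, but the model is trained only on temperature and humidity captured while the wind speed hovered around 10 kmph, so it gives the same answer for 22 degrees and 75% humidity no matter what the wind is actually doing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def rainfall(temp, humidity, wind):
    # Hypothetical "true" process: wind speed genuinely affects rainfall.
    return 0.1 * temp + 0.08 * humidity - 0.4 * wind + rng.normal(0, 0.2, len(temp))

# Training data was captured when the wind speed was always close to 10 kmph.
temp = rng.uniform(15, 35, 200)
humidity = rng.uniform(40, 95, 200)
wind_train = rng.normal(10, 0.5, 200)
X_train = np.column_stack([temp, humidity])   # wind speed was not captured
y_train = rainfall(temp, humidity, wind_train)

model = LinearRegression().fit(X_train, y_train)

# The model returns the same prediction for 22 C / 75% humidity regardless
# of the current wind, so it is badly wrong when the wind is far from 10 kmph.
print("Predicted rainfall:", model.predict([[22, 75]])[0])
print("Actual at wind 10 kmph:", rainfall(np.array([22.0]), np.array([75.0]), np.array([10.0]))[0])
print("Actual at wind 20 kmph:", rainfall(np.array([22.0]), np.array([75.0]), np.array([20.0]))[0])
```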
While this example is rather simplistic, many data science projects
in the real world face initial failures due to a similar issue. Data scientists
may create models that look good, that is, they have very little error on the
data/observations based on which the model was created, but sometimes
they mysteriously fail in production. In Chapter 16, we shall cover some
5. We have only mentioned wind speed to simplify the illustration. In reality, there would be other factors such as atmospheric pressure, etc., as well.
Data Preparation
The data captured often needs to be transformed in certain ways to extract
maximum information from it. Figure 1-5 illustrates an example of this
which is typical in sales prediction scenarios.
6. The role of domain experts (business analysts) will be covered in Chapter 21.
7. This is a conventional quirk of software systems to store a time zone-neutral representation of time.
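Since Figure 1-5 is not reproduced here, the following is a hypothetical pandas sketch of the kind of transformation it describes for sales data: timestamps stored in a time zone-neutral form such as UTC are converted to the store's local time and expanded into fields such as day of week and month, which typically carry more signal for sales prediction than the raw timestamp. The column names and time zone are our assumptions, not the book's.

```python
import pandas as pd

# Hypothetical raw sales records with UTC timestamps (column names are ours).
sales = pd.DataFrame({
    "sold_at_utc": ["2021-03-05 18:30:00", "2021-03-06 02:15:00", "2021-03-07 20:45:00"],
    "amount": [120.0, 85.5, 240.0],
})

# Convert the time zone-neutral timestamps to the store's local time zone.
local = (pd.to_datetime(sales["sold_at_utc"], utc=True)
           .dt.tz_convert("America/New_York"))

# Derive features that machine learning algorithms can use directly.
sales["local_date"] = local.dt.date
sales["day_of_week"] = local.dt.day_name()
sales["month"] = local.dt.month
sales["is_weekend"] = local.dt.dayofweek >= 5

print(sales)
```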
Data Visualization
Once a data scientist has prepared the data, they first explore the
data and the relationships between the various fields. We refer to this step as
data visualization. Typically, during this visual analysis, the data scientist
discerns patterns that help them decide which machine learning
techniques are likely to perform better.
For example, a linear relationship between two fields, such as Salary
and Experience, would prompt the data scientist to try a linear regression
model first. Similarly, they can determine based on visual patterns
whether there is a correlation between some fields or whether certain rows
are outliers. These analyses help the data scientist arrive at the design of
the machine learning step. Techniques used for data visualization will be
covered in Chapter 15.
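As a simple illustration (not a reproduction of the book's figures), a scatter plot is often the first visualization used to check whether two fields such as Experience and Salary are linearly related; matplotlib is one common library choice.

```python
import matplotlib.pyplot as plt

# Illustrative prepared data (same stand-in values used earlier).
experience = [1, 2, 3, 4, 5, 6]
salary = [49, 55, 65, 71, 81, 87]

# A roughly straight-line pattern here would prompt the data scientist
# to try a linear regression model first in the machine learning step.
plt.scatter(experience, salary)
plt.xlabel("Experience (years)")
plt.ylabel("Salary (thousands)")
plt.title("Exploring the Salary vs. Experience relationship")
plt.show()
```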
Inference
This is the step after machine learning and is executed once the data
science team has determined that a sufficiently good model has been
created. The resulting model is typically deployed as part of a production
system to infer information about new observations based on their
attributes. For example, if current readings of temperature, humidity, and
wind speed are available, the rainfall model can be used to infer rainfall
later that day. Or, given a future date, we can use the sales model to predict
potential sales.
The act of deploying models into production systems is a transition
from the scientific discipline to the (software) engineering discipline.
This transitional boundary has its own unique set of challenges and
corresponding techniques and tools to address these challenges.
This niche area has, in the past couple of years, burgeoned into the
discipline christened ML Ops that we shall cover in more detail in
Chapter 17.
Recall that, for the scientific method, we need to continue gathering
information about how well the model performs. Observational evidence
that either validates the model or uncovers issues with the model is
crucial to improving the model as part of the scientific method. Thus, the
observations collected regarding the model are typically persisted and
effectively become a part of the World for the next big iteration through the
data science process.
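What deploying a model looks like in code varies widely, but a minimal sketch of the hand-off from the science side to the engineering side might be: persist the trained model, load it inside a production service, and call it for each new observation. The joblib-based approach below is one common convention, not necessarily the one used later in this book.

```python
import joblib
from sklearn.linear_model import LinearRegression

# Science side: train (as in the earlier sketch) and persist the model.
model = LinearRegression().fit([[1], [2], [3], [4], [5], [6]],
                               [49, 55, 65, 71, 81, 87])
joblib.dump(model, "salary_model.joblib")

# Engineering side: a production service loads the artifact once at startup
# and calls it for each new observation it needs to score.
deployed_model = joblib.load("salary_model.joblib")

def predict_salary(experience_years: float) -> float:
    """Infer the salary (in thousands) for a new individual."""
    return float(deployed_model.predict([[experience_years]])[0])

print(predict_salary(2.5))

# The requests and predictions would typically also be logged and persisted,
# becoming part of the "World" for the next iteration of the process.
```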
Data Engineering
All the steps we have seen so far require software engineering techniques
specialized for efficient storage of and computation over data. Data science
experiments at big data scale often need dedicated engineering support
for ensuring optimized and ready access to data – especially when there
are multiple applications or sources of data, which is rather common these
days.
We shall look at data engineering in more detail throughout Part 3.
For now, the key takeaway is that we regard data engineering as the
engineering backbone that enables the science and thus as a pervasive
horizontal within the data science process.
Having briefly covered what data science is and given an overview of
the data science process, let us look at how it relates to some of the other
buzzwords rampant these days.
Terminology Chaos: AI, ML, Data Science, Deep Learning, Etc.
8. Philosophical note: there are varying perspectives of what AI is. We shall skip over the philosophical angles to this.
9. You may know this AI engine by the name of spam filter.
email. Of course, as spammers get more creative, more such rules would
need to be added to the engine by the developer. Also, as the complexity of
patterns in the text that indicate spam increase over time, the complexity
of the rules would increase correspondingly.
It can also be difficult to adapt the engine based on false positives.
That is, if the engine declares an email to be spam but the user overrides
this and says that the email was not spam, it can get tricky to adapt the rules
accordingly unless the reason, the underlying pattern that caused this
exception, is traced.
Note that this illustrative example of a spam filter, and the solution we
have described, is rather simplistic. In general, depending on the complexity of
the problem, the rules in a rule-based AI system can be made increasingly
complex and sophisticated.10 Rule-based AI has been sufficient to power
expert systems in industries for decades and to become so strong at chess
that grandmasters eventually ceased competing against chess engines in
the mid-2000s.
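For contrast with the data science approach described next, here is a deliberately naive sketch of what a hand-written, rule-based spam check might look like; the keywords, domain, and thresholds are invented for illustration.

```python
# Hypothetical hand-crafted rules: every new spam pattern the developer
# notices has to be encoded as another explicit rule.
SUSPICIOUS_PHRASES = ["you have won", "claim your prize", "wire transfer"]

def is_spam(subject: str, body: str, sender: str) -> bool:
    text = (subject + " " + body).lower()
    if any(phrase in text for phrase in SUSPICIOUS_PHRASES):
        return True
    if sender.endswith(".example-free-offers.biz"):   # blocklisted domain
        return True
    if subject.isupper() and len(subject) > 20:       # SHOUTING subject line
        return True
    return False

print(is_spam("YOU HAVE WON A FREE CRUISE", "claim your prize today",
              "promo@deals.example-free-offers.biz"))
```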
Data science approach
Now, consider a different data-driven scientific approach where we try
to infer whether the new email is spam or not based on its resemblance to
past spam or non-spam emails from our historical email dataset.
According to the data science process,11 we begin with data capture,
that is, by gathering observations (data) about emails: a
large body of emails that have been labeled as spam or non-spam. The
data scientist then attempts to build a model based on this data. To do this,
they would first perform the data preparation step to apply techniques to
convert the email text into a format that is conducive to machine learning
algorithms. They will then perform some data visualization to determine
10. You may have heard of some techniques like minimax, alpha-beta pruning, etc., which are widely used in rule-based AI systems. There is a huge body of literature in this area that should not be overlooked.
11. Refer to Figure 1-1.
HYPE NOTES
AI has gone through various phases of decline and revival in the past several
decades. After a decade of relative hype-stagnation in the mid- to late 2000s,
the popularity of data science-based approach to AI that started around 2012
has propelled AI back into the hype-limelight.
12. Based on what they see, there may be iterations to data preparation before we proceed to the next step.
It is important to bear in mind that the term data science started gaining
currency only in the last decade. Deep learning, while theoretically rather old,
also started becoming tractable and useful only in the last decade.
But AI has been around for a long, long time. Expert systems controlled
complex machinery, and computer programs became stronger than any
grandmaster at chess, long before the term data science was popular. Thus, it
is quite possible that your business may not really need data science – more
traditional AI, for example, rule-based approaches, suffices in many cases.
13. We shall cover more details of computer vision in Chapter 10.
Let us look at an example where one might feel the allure of data science,
but rule-based systems may suffice. Consider a smart-city solution which
integrates multiple devices, systems, and sensors to allow actionable policies.
One common category of policies you may have encountered is that of parking
policies – tiered parking rates based on time of day and day of the week are
common. A smart-city platform may allow more complex policies that can
dynamically increase or decrease the parking rate in a parking space based
on the current occupancy, nearby traffic, and weather conditions, since all
this information has been integrated into the smart-city platform. In this case,
one might feel tempted to take an approach of creating a data science-based
model to estimate parking demand and decide pricing accordingly. This has a
couple of issues. First, we would need to collect a lot of historical data about
parking in various traffic and weather conditions before we can even get
started. Second, it may be a lot simpler and sufficient to define a set of rules
based on which the parking rate should be varied. An example rule could be
that if traffic in nearby lanes doubles, then hike the parking rate by 20%. An
analyst could be given access to an application in which they can observe how
the rules are performing and tweak them over time.
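The rule in this example is simple enough to express directly in code. The sketch below is hypothetical (the 20% hike and the doubling threshold come from the example above; the rest is invented) and shows the kind of logic an analyst could tweak over time without any machine learning.

```python
def adjusted_parking_rate(base_rate: float,
                          current_traffic: float,
                          baseline_traffic: float) -> float:
    """Rule-based dynamic pricing: if nearby traffic has doubled
    relative to its baseline, hike the parking rate by 20%."""
    if baseline_traffic > 0 and current_traffic >= 2 * baseline_traffic:
        return base_rate * 1.20
    return base_rate

# Example: base rate of 3.00/hour; traffic jumped from 40 to 90 vehicles/min.
print(adjusted_parking_rate(3.00, current_traffic=90, baseline_traffic=40))
```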
Note that a rule-based approach can often act as a stand-in, bootstrap mechanism
until sufficient data is available to create data science-based predictive models.
In the parking example that we just saw, once we have enabled rule-based
dynamic pricing and collected data for several months, data science can then be
used to determine optimal pricing rules based on predictive models.
Conclusion
We covered the data science process around which the rest of the book is
oriented. We also saw how the data science process effectively allows us
to apply the scientific method to data using software, that is, to do data
science.
We then clarified some of the terms related to data science and AI, to
help determine which is most appropriate for the business problems at
hand. The kinds of business problems determine the kinds of techniques
that are best suited to solve those problems. Based on the kinds of
techniques required, you can identify the appropriate technologies and the
mix of skills needed in your team.
In the next chapter, we shall therefore begin looking into how data
science fits into a business and how it benefits a business.
Further Reading
The gripping account of Eddington and Dyson’s expeditions is given in
Dyson, Eddington, and Davidson (1919). Thankfully, data collection for
data science projects is a tad easier. At least, in this Internet and cloud-
based data era, we seldom need to worry about situations such as “a strike
of the steamship company” while collecting data.
Varied definitions of data science, and explanations of what data
science is, abound in the literature. For more theoretical and historical
background of data science, refer to Chapters 1, 2, and 8 of Braschler,
Stadelmann, and Stockinger (2019).
References
Braschler, Martin, Thilo Stadelmann and Kurt Stockinger. Applied Data
Science, Lessons Learned for the Data-Driven Business. Cham, Zug,
Switzerland: Springer, 2019.
Dyson, F. W., A. S. Eddington and C. Davidson. “A Determination of
the Deflection of Light by the Sun’s Gravitational Field, from Observations
Made at the Total Eclipse of May 29, 1919.” Philosophical Transactions of
The Royal Society. London, 1920. 291-333.
CHAPTER 2
Data Science and Your Business
[Figure: how data science fits into a business – the business strategy, the business process, the data it generates (product usage, transactions, customers), data technologies, and the data science team, with labeled flows (a) through (j) that the walkthrough below refers to.]
Let us walk through this, beginning from the business process box. When
the business strategy (a) is executed, either products are created, or certain
operational processes are automated (b). An example of a product would be
a cloud-hosted app your technology company may have created. An example
of operational automation could be SCADA1 systems or IoT infrastructure
that you might have set up as part of digitalization of your enterprise.
In either of these cases, data is generated (c). For example, data can
be about how customers are using your service online or about how the
equipment on your shop floor is performing. All this data can be used (d)
by a data science team2 (e), to create models that help improve processes/
products through operational optimizations/product enhancements (f ),
and/or to provide strategic insights (g) to the business.
1. Supervisory control and data acquisition (SCADA) is widely adopted by industries for monitoring and controlling devices.
2. Part 4 of this book covers the skills, roles, and typical structure in a data science team.
Operational Optimizations
The data science team can create predictive models indicating when
certain activities might occur – these predictive models can be used for
optimized scheduling of activities.
As an example, consider the case of predicting the inventory
requirements at a gas station. Suppose we use data science to create
models that can predict, based on past usage data, when each type of fuel
would need a refill at each gas station in the neighborhood of the main
terminal of the gas company. Knowing the refill needs well in advance
allows the creation of an optimized plan of deliveries from the main
terminal to the gas stations for future dates. Optimizations can be based
on the type/amounts of fuels needed at the gas stations, the type of trucks
available on the future dates, and the optimal route to complete the refills
on those dates.
ROI NOTES
Product Enhancements
If you are building technology products, data science models can help
add differentiating features to your product. One common example is
that of recommender engines that suggest movies on Netflix or products
on Amazon. Recommender engines often make recommendations based
on the past choices of similar users. Voice-based technologies, chatbots,
etc., are other examples which can be added to any product for improving
customer interaction.
Suppose you have a technology solution for customer care. You could
add a feature into your solution that could automatically detect during a
voice call, whether the customer is disappointed or angry or having any
negative sentiments. To achieve this, you might use a speech transcription
service that converts the speech to text in real time. You would then feed
this text to a data science-based sentiment analysis model which can
classify the emotion of the customer. Once your product has access to
customer sentiment in real time, it can provide various business benefits.
It can act as a real-time feedback loop for the team lead. Offline analytics can
also determine which customer care executives faced difficult customers
and how well they handled those situations. As the creator of the product, you
can provide these analytics as value-added services to your customer (the
customer care company).
Strategic Insights
Data-driven transformations to business strategy have been on the
rise, especially since the big data revolution. Data science is playing an
increasing role in strategy decisions and management consulting.
Consider the problem of identifying an optimal location in which to
open a new store. Traditionally, this has been done based on demographic
factors of customers and geographic factors such as proximity to transport
hubs, shopping centers, and competitive stores. In the past few years,
there has been a rapid rise of data about movement of people using their
mobile app and location – referred to as mobility data. Given this increased
visibility into the mobility of potential customers, it is now possible to
create more sophisticated ML models that include mobility aspects into
the decision-making along with geographic and demographic factors.3
Such models can determine which among a set of candidate locations is
likely to attract the largest number of visits.
3. Refer to Karamshuk (2013) for a detailed example, where mobility factors were based on Foursquare check-in data.
4. For an example model, refer to Jeremy Curuksu, “Developing a business strategy by combining machine learning with sensitivity analysis,” https://aws.amazon.com/blogs/machine-learning/developing-a-business-strategy-by-combining-machine-learning-with-sensitivity-analysis/, November 13, 2019.
A Cautionary Tale
One example we have seen is a company that invested in having the ability
to predict the yield of a chemical product based on the various control
parameters and readings taken during the chemical process that runs
for several days at a time. Smart folks used cutting-edge deep learning
algorithms and were able to predict the yield of the product over time as
the process ran. They achieved exceptional accuracy in these predictions
and even built an application that would show the predicted trend of the
yield, along with notifying key personnel when the actual yield varied
significantly from the predictions.
Having achieved this milestone, they now wanted to go back and
tune the control parameters to improve the yield. Since extremely
complex neural networks were used, they had no explainability5 for their
predictions and had not really gained actionable insights into how the
control parameters impact the yield. Thus, it turned out that all their
investment thus far did not generate the significant business value that was
hoped for.
On the other hand, the examples we saw in the previous section,
particularly those of operational optimization, are cases where the route
to business RoI was outlined before embarking on data science. Once we
know the business goals, we can attempt to start data science.
5. We shall cover this crucial aspect in more detail over Chapters 16, 20, and 23.
HIRING NOTES
The complexity and extent of data sources may also dictate the initial hires
of your data science team – there are some data scientists who will do the
necessary data integration and cleansing6 and others who expect properly
curated data to be made available for them so they focus purely on the
scientific analysis.7
6. Also referred to as data wrangling, munging, and other colorful terms.
7. We will look at the skills framework of a data science team in Chapter 21.
Conclusion
Business considerations are paramount – even for something as cool as
data science; and particularly for something as interdisciplinary as data
science. Data science can require quite some investment and readiness
from the business – if done right, the RoI can be quite good. But there are a
few pitfalls as well, and a hasty decision to embark on data science is often
recognized as such only in hindsight, rather late.
In this chapter, we have covered the benefits that can accrue from
data science, as well as the aspects to consider before plunging into data
science. In the next chapter, we shall delve deeper into the two cultures
of data science and how they relate to your business. As you head toward
forming a data science team, these considerations will help identify the
kind of skills and scientific culture that would be required of your team
members to best align with your business’ goals.
Further Reading
The business and operational aspects of data science introduced in this
chapter will be covered in fuller detail in Part 4 of this book.
Applications of data science to business strategy and management
consulting are covered in quite some depth in Curuksu (2018).
References
Curuksu, Jeremy David. Data Driven, An Introduction to Management
Consulting in the 21st Century. New York, NY: Springer, 2018.
Karamshuk, Dmytro, et al. “Geo-Spotting: Mining Online Location-
based Services for Optimal Retail Store Placement.” Proceedings of the 19th
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining. Association for Computing Machinery, 2013. 793-801.
CHAPTER 3
Monks vs. Cowboys: Data Science Cultures
from around 40K and gets raised by (approximately) 8K every year. Based
on this understanding, we can predict the salary for future observations of
experience as well. Thus, this model1 fulfils both purposes.
So, given the two purposes – explaining the observations and
predicting future observations – the two cultures of data science pivot
around whether the focus is on both purposes or only on prediction.
To clarify this further, let us revisit a few examples we saw in the earlier
chapters, delving deeper into the problems being solved.
Recall the cautionary tale from Chapter 2, where we saw that the ability
to predict the yield of a chemical was not sufficient to help tune the control
parameters for optimal yield. In this case, one requires deeper insights
into how the chemical process works in nature, that is, given the control
parameters and initial mixture, how the reaction will proceed over time
and the resulting yield. Merely predicting the yield is a relatively simpler
problem than determining the interrelationships among all the control
parameters themselves and their collective effect on the yield. In these
kinds of cases, especially involving natural processes, it is often beneficial
to focus on estimating the truth underlying the data, that is, the process
that might have generated the data. This allows us to control certain
parameters to influence the underlying process itself to a great degree.
Once understood fully, the chemical process itself is largely unchanging –
unless new parameters/chemicals are introduced, we can continue to use
the same model forever.
Contrast this to the problem of predicting potential demand for a
product at your stores. In this case, we may use the historical sales data
and several other data sources such as trends from social media, events
1. Recall as mentioned in Chapter 1 that this is a simplified example for illustrative purposes. More realistic models would incorporate additional factors, and a similar model (linear regression) would then determine the relationship between salary and the combined effect of all the other factors.
that might happen around the store location, etc., to build a complex
model for predicting demand. In this case, our primary goal is prediction
so that we can plan for inventory accordingly.
It may be useful to know which factors affect the demand more than
others; in some cases, such as social media factors, we may even attempt
to drive trends. But this aspect of controlling the underlying factors is
secondary and incidental – demand prediction is the goal. As the sales
data, social trends, and nearby events keep changing, we are also okay to
frequently update our models with the new data.
Compared to the chemical process example earlier, the models in the
demand prediction case are looking at shorter-term, relatively contingent
aspects, rather than at long-term truths underlying natural processes.
These two extreme examples highlight the hallmarks of the two cultures
of data science. One culture focuses on fully understanding the underlying
process out of which the observations are generated, and the other culture
focuses primarily on being able to predict future observations accurately.
Correspondingly, the choices of mathematical, statistical, and algorithmic
techniques tend to differ among these two cultures.
We refer to the first culture, focusing on deciphering the underlying
truths, as the monastic culture. We refer to the second culture, focusing
on empirical goals with predictive accuracy, as the wild-west culture.
In our experience, while expert data scientists can navigate both the
monastic and the wild-west territories, they tend to have an innate default,
predominant culture. We shall cover the cultural spectrum of data
scientists in a later section of this chapter.
We believe that, ideally, the problem statement and the business goals
should dictate which culture/approach is appropriate for the problem. For
some businesses, it is possible that a hybrid approach is suitable.
Hybrid Cultures
For example, consider a weather company. Since the business is primarily
reliant on the natural weather processes, it makes a lot of sense to invest
effort in understanding the underlying weather systems in greater detail.
This enables the business to technologically advance the field, creating
better instruments and sources of gathering data as well as advanced
weather models. For this aspect of their business, a monastic culture is
appropriate.
The customer-facing end of your business might not just be limited to
predicting weather. Maybe you are offering advanced services for shipping
companies such as routing algorithms based on your weather predictions
and other data about oceans. Or maybe you are offering services to
coastline industries to predict disruption using models that incorporate
your weather models along with other data of the customer’s industry. In
these cases, the more specific models for your customers could be based
on the wild-west culture, because understanding the underlying processes
of your client’s data is not too beneficial to your core business. It is also
reasonable to frequently update these models that are tailored to your
customers.
If, at this point, the two cultures seem a bit abstract, don’t worry –
we shall continue to add more details demarcating these two cultures
throughout this book. The primary reason for introducing these two
cultures so early in this book is that we believe adopting the culture
appropriate to your business is one of the keys to increasing chances of
success of your data science practice.
In the following sections, we elaborate on some of the factors pertaining
to these two cultures and map these factors to your business’ goals.
Cultural Differences
Table 3-1 summarizes the key differences between the two cultures. We
shall continue to add to this table in Chapter 20.
Mindset
Monastic culture: Find the underlying, eternal truth (nature) which led to (caused) the observations.
Wild-west culture: Find what works now. Can update frequently. Empiricism is the only eternal truth.
Purposes
Monastic culture: Estimation of truth behind the observations, which enables prediction and deeper, accurate causative2 insights.
Wild-west culture: Predictive accuracy is the primary goal. Causation is often a casualty. Causative insights are either irrelevant, less accurate, or just good to have.
Evaluation
Monastic culture: How close to the truth is my estimation?
Wild-west culture: Am I getting the predictions as accurately as I wanted to?
In the next section, we shall look at how these cultural factors relate to
your business.
2. Note that we are referring to causality loosely and intuitively here. Data scientists never confidently derive causality; they only attempt to derive insights that are indicative of likely causation. In Chapter 20, we shall be more technically precise about this.
3. Another example: human speech production based on other health factors.
4. Also refer to other examples in Chapter 2.
5. Another example: clinical trials for a disease condition.
Data homogeneity
Monastic culture: Your data is from a specific population/environment so that a single, true, underlying (natural) process that generated the data can be determined, for example, a chemical process in your company6.
Wild-west culture: You are in big data territory, with multiple data sources and varied population/environments, where a single underlying truth may not exist, for example, demand prediction using social media and historical sales.
In some cases, even within one business, different problems may seem
to require different cultures. When we look at the extant problems in a
business cohesively, one predominant culture tends to emerge – often this
predominant culture then suffices for future problems as well.
For the occasional problem that deviates significantly from the
predominant culture, a pragmatic approach would be to get a consultant
on board for that specific problem. For example, if you inculcate a wild-
west culture and there is one problem that requires etiological insights
(e.g., say biological causative factors are relevant to the context of your
problem), you can get a consulting monk (e.g., biostatistician) to help.
6. Clinical trials are the archetype of data homogeneity.
7. Seen in Chapter 1.
8. Automation can be at various levels, from automated feature extraction (e.g., from images using CNNs) to automating model choices (e.g., AutoML). We shall cover these in Part 3.
Data Engineering
The engineering requirements are often driven by the data science
culture. This is primarily related to the data homogeneity factor discussed
previously.
If your data scientists are primarily monkish, they may prefer
homogenous data because it’s more amenable to discovering an
underlying “truth.” With heterogeneous data, it is less likely that there will
be a single underlying truth to be discovered. Thus, they might typically
work on data that is relatively contained and homogenous. This often
implies that more data engineering effort may be spent to provide clean,
relevant subsets of the data. Such dataset sizes typically are amenable
to analysis on a single machine.9 Monks often tend to analyze a single
homogenous dataset for several weeks crafting their models as they obtain
increasing insights about the truth.
Cowboys, on the other hand, do not have a specific preference for
homogenous data. They often work with heterogeneous data obtained
from multiple sources, characteristic of big data. At the very extreme, they
would run deep learning on multiple GPUs during the machine learning
step. Given the heterogeneous nature of their data, they also iterate more
rapidly over multiple variations, starting from the data preparation step.
9. As big as needed, often on the cloud. But still, typically a single machine rather than a cluster.
Conclusion
The two cultures in data science have been introduced – the undercurrent
of differences between these cultures will appear throughout this book.
This discussion will continue in Chapter 20, where we will summarize
further differences, including technical aspects.
We also touched upon how the business goals can outline the appropriate
culture – to establish the appropriate culture, the data science team needs to
be formed accordingly of monks or cowboys. In Chapter 22, we shall revisit
the defining characteristics of monks and cowboys, particularly with respect to
their skills and background – this will help outline the team formation aspects
based on the desired culture.
Summary of Part 1
With this, we conclude Part 1 of this book. We covered the data science process
in Chapter 1 and how data science relates to your business in Chapter 2. In
this chapter, we saw more details of the scientific approach, especially the two
cultures within data science. The key takeaway is that business goals should
determine how the data science practice is bootstrapped – this part of the
book has broadly covered these formative factors.
We are now ready to delve deeper into the classes of problems that are
solved using data science. We shall cover these in Part 2.
PART II
Classes of Problems
In Part 1, you read about what data science is and how it promises to be
useful for businesses in general. You will now naturally be curious to learn
more about the concrete problems data science can solve and understand
how these problems relate to the problems in your business/organization.
A good way to learn about the data science problems is to first
understand the classes of problems. Many individual problems can be
mapped to one of these classes and require similar treatment though
each problem has its own unique set of challenges to be addressed. This
part of the book talks about the classes of problems that are solved using
data science and contains one chapter for each class that walks through a
concrete problem from that class – the chapter first establishes a business
motive and then transforms that motive to a concrete data science
problem and shows how you could solve it by choosing the appropriate
techniques for each step of the data science process. The techniques we
have chosen for each problem are just some examples meant only to give
an overview of the thought process that goes into solving such a problem.
Based on the exact nature of the problem that you might want to solve,
you will have to yourself design the steps of the data science process by
choosing from the plethora of techniques that are covered in more detail
later in Part 3. We will also cover the libraries/tools that help you apply
these techniques in Part 3.
CHAPTER 4
Classification
Let’s begin with a common class of problems called classification
problems. A classification problem requires you to infer/predict the
class/category to which a new observation belongs based on values of
some observed attributes. For example, infer whether a mail is “Spam” or
“Regular” based on the body of the mail, sender’s email address, etc.; infer
whether a digital payment transaction is “Fraud” or “Non-Fraud” based
on the details of the transaction like the location of the transaction, the
amount and mode of payment, etc.
Let’s say an automobile company has launched a new car and the
marketing team has executed an effective advertising strategy leading to
a steady stream of inquiries from interested customers. Data science can
help identify interested customers who are likely to eventually buy the car
so that the sales team can focus on such customers leading to improved
sales. This is a classification problem since the goal here is to infer whether
an interested customer belongs to the class of customers who buy the car
or to the class who don’t. Let’s look at the detailed steps of the data science
process that you could follow to achieve this goal. Since this is the first end-
to-end problem in the book, we will discuss its steps in more detail
than for the problems covered in the following chapters. We will also use
this first problem to introduce some new terms that we will use throughout
the book.
Data Capture
The goal is to predict which customers would eventually buy the car. The
source of all magic in data science is data. You could use the past data
containing details of interested customers of previous similar cars if it
is available. If that data is not available, you would define a strategy to
capture the relevant details of the first few interested customers. This data
would then be used in the later steps of the data science process to build
models that can predict which interested customers coming in future are
likely to buy the car based on the trends seen in the initial customers. So,
as part of your data capture strategy, you might direct the sales team to
capture the Gender, Age, Occupation, and Annual Income1 of each initial
customer along with the Outcome for that customer indicating whether the
customer purchased the car or not. You believe that it should be possible
to predict with reasonable accuracy the Outcome of an interested customer
based on the values of the other four variables. The variables based on
which you make the predictions (Gender, Age, etc., in this case) are referred
to as features, and the variable whose value you try to predict (Outcome
in this case) is referred to as the target. Note that the values of the target
variable here just like in all classification problems are classes (Purchased,
Not Purchased, etc.). A snapshot of the data collected is shown in Figure 4-1.
Assume that, in this case, the sales team manually enters these details in
an Excel file. We will discuss how you can capture data programmatically
and the tools/libraries that help you do this in Part 3.
1 Annual income reported in USD.
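To make this concrete, here is a minimal sketch (not from the book) of how such manually captured data could be loaded for the later steps. The file name and column names are assumptions based on the description above; Part 3 covers data capture tools in detail.

```python
# Hypothetical example: load the manually captured customer data from Excel.
import pandas as pd

df = pd.read_excel("interested_customers.xlsx")   # assumed file name
print(df.head())                                  # first few captured observations
print(df["Outcome"].value_counts())               # e.g., counts of Purchased / Not Purchased / Open
```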
Data Preparation
After you have captured the data, you need to prepare the data in various
ways for building effective models. The data we have captured in the
previous step has some rows where Outcome is Open, which indicates an
ongoing inquiry. These are customers who had initiated an inquiry and were
in active discussions with the sales team when the data was captured. Since
we do not know whether these customers will eventually buy the car or
not, the data of such customers is not relevant for building our predictive
model. As part of preparing the data, we will remove the observations for
such customers. If you look at the snapshot of data prepared by the data
preparation step in Figure 4-2, you will notice that the data preparation
step has removed observations of such customers.
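As a rough illustration, and assuming the DataFrame `df` from the earlier data capture sketch, this preparation step could be expressed as:

```python
# Drop rows whose Outcome is still "Open" (ongoing inquiries).
prepared = df[df["Outcome"] != "Open"].copy()
print(len(df) - len(prepared), "ongoing inquiries removed")
```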
While working on classification problems, you might also run into
the scenario where the classes in the target variable are not equally
represented. For example, you might have many more customers
whose Outcome is Not Purchased compared to those whose Outcome is
Purchased. This is referred to as class imbalance and might lead to low
prediction quality for the under-represented class.2 In such cases, the
dataset can be modified to make the classes more balanced before building models.3
2 We recommend an in-depth study of the class imbalance problem.
3 There are a few simple techniques commonly used for modifying datasets to
make them more balanced. We recommend familiarizing yourself with these.
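As a hedged illustration of the balancing techniques mentioned in the footnotes (the book does not prescribe a specific one), random oversampling of the minority class is one simple option:

```python
import pandas as pd

# One simple option: randomly oversample the minority class ("Purchased" here is
# assumed to be the rarer outcome) until both classes have the same number of rows.
minority = prepared[prepared["Outcome"] == "Purchased"]
majority = prepared[prepared["Outcome"] == "Not Purchased"]

oversampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, oversampled]).sample(frac=1, random_state=0)
print(balanced["Outcome"].value_counts())
```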
Data Visualization
In this step, you can analyze your prepared data using powerful
visualizations to get insights into the trends. These insights are useful
in various ways and can help you build effective models in the machine
learning step. Figure 4-3 shows a visualization created in this step based
on our prepared data. The visualization here is a stacked bar chart that
stacks the number of customers who didn’t purchase the car on top
of the customers who purchased the car for every income segment.
The overall height of each bar depicts the total number of interested customers for the
corresponding segment, and the green portion in each bar indicates the
interested customers who actually purchased the car in that segment.
We can see that we have had more interested customers in the past for
higher-income segments. Also, the percentage of interested customers
who purchased the car seems to be higher in the higher-income segments.
This means that Annual Income seems to have an impact on whether
a customer will eventually buy the car, that is, Annual Income seems to
impact the Outcome. This insight should make you feel more confident
about your decision of using Annual Income for building models that
predict the Outcome. You could visualize other features as well, and once
you feel confident about your choice of features, you are ready for the next
step, machine learning.
As mentioned earlier, since this is a classification problem, the values
in the target variable are classes (Purchased, Not Purchased) which we
have stacked in this visualization. The visualizations would be different
if the target variable contained continuous values (e.g., salary) as in the
case of regression problems which are discussed in the next chapter.
Data visualization is a vast subject in itself, and the art of designing the
right visualizations for your problem can help you easily uncover trends
which would otherwise be difficult to identify. We will look at a few more
popular visualizations and the tools/libraries you can use to create these
visualizations in Part 3.
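A sketch of how such a stacked bar chart could be produced, assuming the `prepared` DataFrame from the earlier sketches and the column names used in this chapter (both are assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Segment customers by annual income (in thousands of USD) and count outcomes.
bins = [40, 60, 80, 100, 120, 140, 160]
segments = pd.cut(prepared["Annual Income"], bins=bins)
counts = prepared.groupby([segments, "Outcome"]).size().unstack(fill_value=0)

# Stack Not Purchased on top of Purchased, as in Figure 4-3.
counts[["Purchased", "Not Purchased"]].plot(kind="bar", stacked=True)
plt.xlabel("Annual Income")
plt.ylabel("Number of Customers")
plt.show()
```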
Figure 4-3. Number of Customers per Annual Income segment (40-60 through 140-160), stacked by Outcome
Machine Learning
Since the values in our target variable are classes (Purchased, Not
Purchased), we will use one of the classification machine learning
algorithms in this step. Classification machine learning algorithms learn
from the past observations to infer the class for a new observation. We
choose a decision tree algorithm here for simplicity since we haven’t yet
introduced the more complex algorithms which are discussed later in
Part 3 of the book. The corresponding tools/libraries that implement such
algorithms are also discussed in Part 3.
The decision tree algorithm builds a decision tree model based on our
prepared data. Figure 4-4 shows a partial view of our decision tree model
that focuses on the portions relevant to this discussion and omits other
details. Note that the decision tree has learned that females below 35
years of age do not buy the car irrespective of their occupation or income.
On the other hand, males above 35 years of age with income above 100K
buy the car irrespective of their occupation. This model is now capable of
predicting which interested customers in the future are likely to buy the car.
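A minimal sketch of this step with scikit-learn's decision tree classifier, reusing the `prepared` DataFrame from the earlier sketches; the encoding and parameters here are our assumptions, not the book's:

```python
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# One-hot encode the categorical features so the tree can work with numbers.
X = pd.get_dummies(prepared[["Gender", "Age", "Occupation", "Annual Income"]])
y = prepared["Outcome"]

model = DecisionTreeClassifier(max_depth=4, random_state=0)  # depth is an arbitrary choice
model.fit(X, y)
print(model.score(X, y))   # accuracy on the data the model was trained on
```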
Figure 4-4. Partial view of the decision tree model created by the
machine learning step (the root node splits on Gender, the next level
splits on Age, and the leaves carry predictions such as Purchased)
Inference
Now that all the hard work has been done, it is time to reap the benefits. You
can now deploy the model you just created in a production environment and
request application developers to create an app that a sales executive can
use for predicting whether a new customer will purchase the car. The sales
executive will fill the details (Gender, Age, Occupation, Annual Income)
of a new interested customer in the app which will pass on these details to
your deployed model for prediction. The model will traverse the tree based
on the details to do the prediction. Figure 4-5 shows how the deployed
decision tree model traverses the tree based on the details of a new customer
to predict the Outcome. The new customer here is a female of age 30, so
the model will go left4 at the Gender node and then go left again at the Age node
and predict Not Purchased, indicating that the customer is not likely to buy
the car. The app will receive this predicted Outcome and show it to the
sales executive.
4 Reader's left.
Figure 4-5. Traversal of the decision tree for the new customer
(female, age 30), starting at the Gender node and ending at the
prediction Not Purchased
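A sketch of what the deployed model's prediction call could look like for this new customer, reusing `model` and `X` from the machine learning sketch; the occupation and income values are made up purely for illustration:

```python
import pandas as pd

# Hypothetical new interested customer entered by the sales executive.
new_customer = pd.DataFrame(
    [{"Gender": "Female", "Age": 30, "Occupation": "Engineer", "Annual Income": 90}]
)
# Encode with exactly the same feature columns as the training data.
new_X = pd.get_dummies(new_customer).reindex(columns=X.columns, fill_value=0)
print(model.predict(new_X))   # e.g., ["Not Purchased"]
```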
Data Engineering
Data engineering takes care of storage and access of data throughout the
data science process as shown in Figure 4-6. In this example, we assumed
that data was stored in a spreadsheet in the data capture step and later
read into an appropriate data structure in the later steps. So we didn’t
require heavy data engineering for our scenario, but data engineering
becomes important for ensuring efficient storage and fast access when you
are dealing with large amounts of data. We will look at a few techniques for
efficient storage and access of data and related tools/libraries in Part 3. We
will skip the data engineering section for other problems in the following
chapters unless a problem requires unique treatment from the point of
view of data engineering.
Figure 4-6. Data engineering supports the storage and access of data
throughout the data science process
Conclusion
Classification problems are among the most common types of problems
that data scientists work on. So if you are setting up a data science practice,
it is highly likely that you or your team will end up working on one. In this
chapter, we looked at what classification problems are and discussed a
concrete scenario to demonstrate how such problems are tackled.
CHAPTER 5
Regression
A regression problem requires you to infer/predict a quantity
corresponding to a new observation based on the values of some observed
attributes. The problems we briefly discussed in Chapter 1 which aimed at
predicting the salary of a person based on their experience and predicting
the amount of rainfall based on the temperature, pressure, etc., were both
examples of regression problems.
Let’s say you are an insurance company that offers health insurance
policies and want to optimize the insurance premium for maximizing
profits. While you may want to offer an affordable premium to attract
customers, you would charge a higher premium for customers who are
likely to claim high amounts in order to reduce your losses. You can build
models that predict the amount that a customer is likely to claim and use
that as one of the factors for deciding the final premium for that customer.
This is a regression problem since we want to predict a quantity, the Claim
Amount. Let's look at the possible techniques you could use in the different
steps of the data science process for this problem.
Data Capture
You would begin by looking at the stored policy records of past users and
their claim details. Let’s look at the factors that can affect the amount that
a customer claims. Older people are more likely to need medical care
and hence might claim higher amounts. People with higher BMI could
claim higher amounts because they are at higher risk of having heart
disease. Similarly, smokers might claim higher amounts because of the
adverse effects smoking has on health. There are many other factors, like gender,
profession, etc., on which the amount a person will claim depends. You
might pull all such relevant details (Age, Smoking Status, etc.) that can act
as features for our model along with the Claim Amount which is our target
from the policy records into another location for easy access in subsequent
steps of the data science process. For simplicity, we will only focus on Age,
Smoking Status, and Gender as features. Note that the target variable here
just like in all regression problems is continuous valued. We will assume
that each customer has only one yearly policy and the Claim Amount is the
total amount that was claimed in the corresponding year by that customer.
A snapshot of the data extracted from policy records is shown in Figure 5-1.
Each row here corresponds to one user and their policy.1
1 Claim amount measured in USD.
Data Preparation
Since this is a regression problem, you will need to choose a regression
machine learning algorithm in your machine learning step later. As is
often the case, let’s say you decide to use the linear regression algorithm in
your initial experiments. Linear regression tries to create a linear equation
that explains how the value of the target variable can be calculated from
the values of features. The example from Chapter 1 where the machine
learning step came up with the equation “Salary (K) = 8 × Experience (yrs)
+ 40” is an example of linear regression. It is obvious that such an equation
will work only for numeric features, that is, features whose values are
numbers. In our captured data, Age is a numeric feature, but we also have
features like Gender that are categorical, that is, they contain categories or
classes (e.g., Male/Female).
So as part of preparing our data, we will need to convert our categorical
features (Gender, Smoking Status) to numeric features. Each of these
features has just two possible values so you can easily encode one value as
0 and the other value as 1. Hence, for Gender, you could encode Male as 1
and Female as 0. For Smoking Status, you could encode Smoker as 1 and
Non Smoker as 0. Figure 5-2 shows a partial snapshot of the prepared data
after the encoding.
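A small sketch of this encoding with pandas, using a tiny made-up sample in place of the extracted policy records (the column names follow the text; the values are invented):

```python
import pandas as pd

# A tiny made-up sample standing in for the extracted policy records.
policies = pd.DataFrame({
    "Age":            [45, 30, 52],
    "Gender":         ["Male", "Female", "Male"],
    "Smoking Status": ["Smoker", "Non Smoker", "Non Smoker"],
    "Claim Amount":   [12.4, 4.1, 9.8],   # in thousands of USD (illustrative values)
})

# Encode the two categorical features as 0/1, as described above.
policies["Gender"] = policies["Gender"].map({"Male": 1, "Female": 0})
policies["Smoking Status"] = policies["Smoking Status"].map({"Smoker": 1, "Non Smoker": 0})
print(policies)
```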
Data Visualization
Let us now explore our target variable and its relationships with the
features visually. Since this is a regression problem, the target variable
contains continuous values, and hence, we design visualizations that
would be useful in such a scenario. Let us focus on visualizing the
relationship of the target Claim Amount with Age. You could generate
a scatter plot that displays each customer/policy as a marker whose x
coordinate is based on the customer’s Age and y coordinate is based on
their Claim Amount. You can simplify the exploration by restricting it to
just one segment of customers at a time – let’s look at customers who are
females and nonsmokers. If you look at Figure 5-3, you will notice that
Claim Amount has a linear relationship with Age. Let’s say you notice a
linear relationship of Claim Amount with Age for other segments (male
smokers, male nonsmokers, etc.) as well. This makes this scenario suitable
for trying out the linear regression algorithm which could figure out this
linear relationship. So now you have a convincing reason to use linear
regression in the machine learning step.
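A sketch of such a scatter plot with matplotlib, reusing the encoded `policies` DataFrame from the previous sketch:

```python
import matplotlib.pyplot as plt

# Restrict to one segment at a time: females (0) who are nonsmokers (0).
segment = policies[(policies["Gender"] == 0) & (policies["Smoking Status"] == 0)]
plt.scatter(segment["Age"], segment["Claim Amount"])
plt.xlabel("Age")
plt.ylabel("Claim Amount (K USD)")
plt.show()
```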
Machine Learning
Since our target variable contains continuous values, we will use one of the
regression machine learning algorithms in this step. Regression machine
learning algorithms learn from the past observations to predict a quantity
corresponding to a new observation. We have already discussed various
reasons for using the linear regression algorithm to build our predictive
model for this case.
The linear regression algorithm builds a linear regression model based
on the prepared data which, as discussed earlier, is a linear equation that
explains how the target variable value can be calculated from the values
of features. Figure 5-4 shows the linear equation created by the linear
regression algorithm for this problem. Based on this equation, we can
tell that Claim Amount increases by 0.4K with every year of Age. We can
also tell that for males (Gender = 1), Claim Amount is higher by 2.1K as
compared to females. And smokers (Smoking Status = 1) tend to claim 2.9K
more than nonsmokers. This model is now capable of predicting the Claim
Amount for a new customer based on their age, gender, and smoking status.
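A minimal sketch of this step with scikit-learn's LinearRegression, again reusing the `policies` DataFrame from the data preparation sketch:

```python
from sklearn.linear_model import LinearRegression

features = ["Age", "Gender", "Smoking Status"]
X = policies[features]
y = policies["Claim Amount"]

reg = LinearRegression().fit(X, y)
# With enough real data, these learned coefficients would correspond to the kind
# of equation shown in Figure 5-4 (e.g., roughly +0.4K per year of Age).
print(dict(zip(features, reg.coef_)), reg.intercept_)
```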
Inference
The linear regression model we just created can be deployed and used for
predicting the Claim Amount for a new customer. The model will simply
use the equation to calculate the Claim Amount using the Age, Gender,
and Smoking Status. Figure 5-5 shows how the deployed linear regression
model calculates the Claim Amount using the equation for two new
customers. The model predicts that the second customer who is a male
smoker is likely to claim a higher amount than the first customer despite
being the younger one. So you will recommend that the second customer
should be charged a higher premium.
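A sketch of the corresponding inference call for two new customers; the attribute values are made up to mirror the scenario described above (an older female nonsmoker and a younger male smoker):

```python
import pandas as pd

new_customers = pd.DataFrame({
    "Age":            [55, 40],
    "Gender":         [0, 1],    # 0 = Female, 1 = Male
    "Smoking Status": [0, 1],    # 0 = Non Smoker, 1 = Smoker
})
print(reg.predict(new_customers))  # predicted Claim Amounts in K USD
```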
Conclusion
In this chapter, we introduced regression problems and discussed a
specific problem in detail as an example. We discussed the data science
process for this problem to give an overview of the kind of techniques that
could be used for this class of problems. As mentioned earlier, each new
problem will require you to decide the techniques best suited for solving it.
CHAPTER 6
Natural Language
Processing
AI, as discussed in Chapter 1, refers to computers behaving intelligently
like humans. One aspect of human intelligence is the capability of
understanding and speaking languages. The subfield of AI that focuses on
making computers seemingly intelligent in understanding and generating
languages just like humans is called natural language processing. We will
henceforth refer to this subfield by the popular acronym NLP. Teaching
computers how to understand and speak natural languages offers a
plethora of benefits. Humans can do mathematical calculations, but
when computers learn to do them, they can perform much more complex
calculations much faster than humans can. Similarly, when computers
learn human languages, they can process much more language data which
opens up myriad possibilities.
Just like its parent field AI, there are two approaches to NLP: the
rule-based approach and the data science approach. Let's look at a document
classification problem in detail and see how you can solve it using the data
science approach. Let’s say your company has a personal assistant product
that helps the user manage their to-do list, emails, meetings, devices,
etc., based on voice commands from them. Let’s say the user of this
personal assistant is an engineer who receives emails related to product
development, research work, trainings, etc. They may find it useful to
create a folder for each such category and organize their emails by moving
each email to its relevant folder. You could add an interesting feature to the
assistant that automatically moves the user’s emails to their correct folders.
For example, if the user has received an email requesting them to complete
an online training, the assistant could figure out, based on the text in the
email, that the email belongs to the category “Trainings” and hence move
it to the folder “Trainings”. This problem falls under document classification
since the goal here is to assign each document (email in this case) to a
category. Let’s take a look at how data science can help you achieve this
goal by choosing the appropriate techniques in each step.
Data Capture
The assistant silently observes the user in the initial period as they move
the emails to the appropriate folders. This initial manual movement of
emails captures the necessary data using which the assistant will learn how
to automatically determine the right folder for each email in the future
based on the email text without asking the user. Figure 6-1 shows the
folders and some sample emails that have been moved into these folders
by the user.
Figure 6-1. Folders and sample emails moved into them by the user
(e.g., an email beginning "Hi Tom, This is in context of our
discussion on the security fix…")
Data Preparation
The solution to this problem is different from the ones we discussed so
far, as all the steps of the data science process are automated inside the
assistant, instead of a data scientist performing each step. As part of the
data preparation, the assistant could strip the greeting (e.g., Hi Tom,) and
closing (e.g., Regards, Rich) from each email and tag the remaining email
body with its Category which is the name of the folder containing that
email. Figure 6-2 shows a snapshot of the consolidated data which has
the stripped email bodies tagged with their categories. The idea is that the
assistant will use this data to build a model that learns how to infer the
category of an email by looking at the important words in the email body.
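A rough, simplified sketch of the kind of preprocessing described here and in the inference section later (greeting/closing removal, lowercasing, punctuation and stop word removal); it is not the assistant's actual code, and lemmatization would require an NLP library such as NLTK or spaCy, so it is omitted:

```python
import re

STOP_WORDS = {"the", "is", "in", "of", "our", "on", "a", "this", "to", "for"}  # tiny illustrative set

def preprocess(email_text):
    lines = email_text.strip().splitlines()
    # Drop the first (greeting) and last (closing) lines, keep the body.
    body = " ".join(lines[1:-1]) if len(lines) > 2 else email_text
    words = re.findall(r"[a-z]+", body.lower())   # lowercase and strip punctuation
    return [w for w in words if w not in STOP_WORDS]

print(preprocess("Hi Tom,\nThis is in context of our discussion on the security fix\nRegards, Rich"))
# e.g., ['context', 'discussion', 'security', 'fix']
```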
Figure 6-2. Snapshot of the prepared data: stripped email bodies tagged with their categories

Email_Body                                                                    Category
This is in context of our discussion on the security fix…                     Product Development
The release date has been moved to last week of …                             Product Development
Great progress on the design of new visualizations. These visualizations...   Research Work
Thanks for the demo. A few suggestions you could consider..                   Research Work
…                                                                             …
Email_Words                                      Category
'context', 'discussion', 'security', 'fix', …    Product Development
…                                                …
The assistant will create a model that learns to infer the Category of
an email. Hence, Category in the prepared data is the target variable for
the model. But the model will also need features whose values it can use
to make the prediction. So, the assistant will use some mechanism to
extract features from the Email_Base_Words. Let’s assume the assistant
uses the bag-of-words technique to achieve this – we will look at a more
advanced technique used for this purpose in Chapter 14. Bag-of-words
will determine the vocabulary which is the total set of unique base words
across all emails and then create one feature for every base word in the
vocabulary. The value of a feature for an email is the number of times
the corresponding base word occurred in that email. Figure 6-5 shows a
1 For example, verb, noun, etc.
partial view of the features extracted using the bag-of-words technique; the
figure also shows the target variable Category. You can see that there is one
feature corresponding to each base word in our vocabulary. The value of
the feature fix for the first email is “1” because the base word fix appeared
once in the list of base words for this email. The value of this feature for
the second email is “0” as the second email does not contain the word fix.
Similarly, the value of feature visualization is “2” for the third email as the
list of base words for this email contains the word “visualization” twice.
Now that the features and target are available, the assistant can move on to
the next steps of the data science process. Since there is no data scientist
actively involved here who can look at visualizations and draw insights
from them, the assistant will move directly to the machine learning step.
Figure 6-5. Partial view of the bag-of-words features (one column per
vocabulary base word, e.g., context, security, fix, release, design,
visualization, demo, suggestion, attend, train, course, …) and the
target variable Category for each email
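A sketch of the bag-of-words step using scikit-learn's CountVectorizer on a couple of made-up preprocessed emails; in the assistant, the inputs would be the base words and folder names produced by the earlier steps:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up preprocessed email bodies and their categories (folder names).
email_words = [
    "context discussion security fix",
    "release date moved last week",
    "great progress design new visualization visualization",
]
categories = ["Product Development", "Product Development", "Research Work"]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(email_words)   # one column per vocabulary word
print(vectorizer.get_feature_names_out())
print(features.toarray())                          # word counts per email, as in Figure 6-5
```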
Machine Learning
The assistant will now give this prepared data to a machine learning
algorithm which can learn how to infer the Category to which an email
belongs based on the values of the features. This problem is now reduced
to a plain classification problem, so we won’t go into too much detail
as we have already discussed such classification problems in an earlier
chapter. The assistant could try different available classification machine
learning algorithms and use the one that gives the best results.
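For illustration, one possible choice (ours, not one the book prescribes) is a multinomial naive Bayes classifier, which pairs naturally with bag-of-words counts:

```python
from sklearn.naive_bayes import MultinomialNB

# `features` and `categories` come from the bag-of-words sketch above.
clf = MultinomialNB()
clf.fit(features, categories)
```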
Inference
The assistant till now was just observing and learning; now it becomes
active and tries to route new emails to their correct folders. It will use the
model that it has created to infer the category of each new incoming email
and move it to the folder corresponding to that category. But the model, as
discussed in the previous section, can only infer the category based on the
values of the features. This means that the assistant will need to extract the
feature values from each new email and then pass them to the model for
inference. To do this, it will follow the same preprocessing steps described
in the data preparation section earlier. Figure 6-6 shows a new incoming
email and the preprocessing it goes through before inference. The greeting
and closing are removed first, followed by removal of punctuation,
conversion to lowercase, extraction of words, and removal of stop words.
This is followed by lemmatization, and finally, the bag-of-words is applied
on the list of base words to generate the feature values which are passed
on to the model. Note that the values of features fix and release are both “1”
as the list of base words for this email has one occurrence of both words.
The model then infers based on the feature values that the category of
the email is Product Development, so the assistant moves the email to the
folder Product Development.
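A sketch of this inference flow, reusing the `preprocess` function, the fitted `vectorizer`, and the classifier from the earlier sketches:

```python
# Preprocess the new email and extract features with the *already fitted* vectorizer.
new_email_words = " ".join(preprocess("Hi,\nThe list of fixes for the upcoming release\nThanks"))
new_features = vectorizer.transform([new_email_words])
print(clf.predict(new_features))   # e.g., ['Product Development']
```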
Figure 6-6. Inference using the model with the features extracted
from the new email (the email "Hi, The list of fixes for the upcoming
release …" is preprocessed, converted to bag-of-words features, and
the classification model predicts Category = Product Development)
It may happen that the model infers the category for an email
incorrectly and the assistant ends up putting the email in the wrong folder.
When the user reads the email, they will move it to the right folder which
acts as feedback for the assistant. The next time the assistant repeats the
entire process of using the emails and their folders to build a model, the
new model will automatically learn from the new emails and their folders.
When the assistant starts using the new model, emails that might have
been incorrectly classified earlier might now get correctly classified. So, over
time, the assistant keeps getting better at moving the emails to the correct
folders.
Conclusion
NLP is a popular subfield of AI that is advancing at a fast pace. We saw the
two aspects of NLP and briefly looked at a few common NLP problems.
We discussed the data science approach to NLP and walked through the
steps for making a personal assistant product capable of moving emails
to the appropriate folders. This involved a discussion of some common
text preprocessing steps used in NLP. The steps also covered how text is
transformed into numeric features that are used by the ML models.
CHAPTER 7
Clustering
We tend to determine, almost instinctively, when two objects are similar or
dissimilar to each other. For example, when we see all the myriad objects
in nature, we tend to divide them into two groups: one of objects that can ingest
food and convert it to energy, reproduce, etc., and a second
group of objects that do not show these characteristics. Once we discern
two such apparently distinct groups, we give them a name – living and
nonliving things. While this may seem like a simplistic example, similar
groupings based on biological characteristics, evolution, etc., lead to
various biological taxonomies.1
The same tendency is seen when a company such as a retail store
is interested in grouping its customers based on their demographics,
purchase patterns, and other personal details.
These are a couple of examples of a fundamental aspect of our human
“intelligence” – our innate tendency and ability to create groups of similar
objects/observations, that is, create groups such that an observation is
more similar to other observations in its group, than it is to observations in
other groups. We refer to such groups of observations as clusters. And we
refer to the class of problems that involve identifying clusters given a set of
observations as clustering.
1 For example, see https://tree.opentreeoflife.org/
Data Capture
Let us suppose we have data about customer transactions at a store,
containing details of the various products purchased by a customer as
shown in Figure 7-1.
2 In real systems, Customer, Product, and Trans_id would be UUIDs (universally
unique identifiers) rather than such simple names/numbers. Also, products would
have a complex hierarchy of categories, and a transaction would capture the SKU.
Finally, the timestamp of transaction would be captured rather than the Trans_date.
We are glossing over these details for simplicity.
Data Preparation
In this case, let us suppose we are interested only in recent trends and
patterns. So, we shall first filter the data to focus only on the transactions
in the last month, say April. We can filter the rows based on Trans_date to
achieve this.
Then, we shall aggregate and pivot this data of recent transactions to
obtain the structure shown in Figure 7-2.
Figure 7-2. Aggregated product quantities per customer for April (one
row per customer, one column per product):
Alice      4   4   1   2   3   3
Bob       15  12   9   9   9   6
Chandler   2   4   1   3   5   5
Dilbert   12   8  20  16   8   4
Einstein   2   2   4   5   1   1
Note that we have aggregated the product quantity per customer for
the entire month of April, for example, Alice has bought four items of
Potato Chips overall in April.
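A sketch of this filtering and pivoting with pandas, using a tiny made-up sample of transactions (the column names are assumptions):

```python
import pandas as pd

# A tiny made-up sample of transaction rows.
transactions = pd.DataFrame({
    "Customer":   ["Alice", "Alice", "Bob", "Alice"],
    "Product":    ["Potato Chips", "Fruit Juice", "Potato Chips", "Potato Chips"],
    "Quantity":   [2, 1, 5, 2],
    "Trans_date": ["2021-04-02", "2021-03-28", "2021-04-10", "2021-04-20"],
})

transactions["Trans_date"] = pd.to_datetime(transactions["Trans_date"])
april = transactions[transactions["Trans_date"].dt.month == 4]   # keep only April

# One row per customer, one column per product, total quantity for April
# (zero where a customer never bought a product).
pivot = april.pivot_table(index="Customer", columns="Product",
                          values="Quantity", aggfunc="sum", fill_value=0)
print(pivot)
```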
Handling Missing Values
In our case, we do not have any missing values – if a customer did not
purchase a particular product, we would simply have a value of zero in the
preceding table.
Normalization
It is possible that some attributes have rather different ranges of values
than others. For instance, some products may be purchased more
frequently than other products (e.g., socks may be purchased much more
frequently than an electronic item). In such cases, we can normalize the
values so that the quantities of all products would fall in similar ranges.
This helps ensure that products that sell more do not bias the clustering
heavily.
In our example, we believe the products do not have such variation. So
we can proceed with the previous data.
Data Visualization
A data visualization technique that is especially applicable to clustering
is the dendrogram. This is typically used in a second iteration after the
machine learning step. So we shall look at it in a subsection of the machine
learning step.
Machine Learning
We shall use a technique called agglomerative clustering.3 The idea of this
is very simple, and it works as follows:
1. Begin with single-observation clusters, that is, create
as many clusters as there are observations, and
assign an observation to each cluster.
2. Merge the two most similar clusters into a single
cluster.
3. Repeat step 2 until all observations end up in one
single cluster – this yields a hierarchy of clusters.
3 Also known as bottom-up hierarchical clustering.
Similarity of Observations
There are several ways to define the similarity of two observations – refer to
Figure 7-4 for a couple of them.
4 Where the coordinates of a point in different dimensions are based on the values
of different features of the corresponding observation.
5 For example, suppose the horizontal and vertical axes in Figure 7-4 represented
the (normalized) values of purchases of products P1 and P2, respectively. Then
A clearly prefers P2 over P1, while both B and C slightly prefer P1 over P2. This is
indicated by the direction of the corresponding vectors.
6 Other linkages exist, such as centroid and Ward.
The y axis represents the dissimilarity, that is, lower values indicate
higher similarity. In our case, since we used cosine similarity, the
dissimilarity is computed as (1 – cosine_similarity), which is also referred
to as the cosine distance. Thus, we can see from the plot that the cosine
distance between the clusters of Dilbert and Einstein is around 0.03
(height of the magenta horizontal line), while the cosine distance between
the magenta and the green clusters is around 0.26 (height of the blue
horizontal line).
Having understood how to read the dendrogram, let us now see how to
interpret the dendrogram and what insights can be inferred from it.
Note the role that the similarity measure played in clustering – even though
Dilbert has bought many more products than Einstein, they are deemed similar
because we had used cosine similarity, and they both have a similar direction,
that is, similar preferences. It seems, for example, that they both prefer fruit/
juices to chocolates/cakes.
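A sketch of agglomerative clustering with cosine distance and a dendrogram using SciPy; `pivot` is the per-customer quantity table from the data preparation sketch (or the table in Figure 7-2), and average linkage is our own choice for illustration:

```python
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Cosine distance groups customers with similar preferences (directions),
# regardless of how much they buy overall.
Z = linkage(pivot.values, method="average", metric="cosine")
dendrogram(Z, labels=pivot.index.tolist())
plt.ylabel("Cosine distance (1 - cosine similarity)")
plt.show()
```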
Inference
In the earlier chapters, we saw examples where we predicted some value
for new observations in the inference step. In case of clustering, we instead
infer some insights based on the clusters created – this is also referred to as
knowledge discovery.7
7 Clustering problems are rather common in KDD/data mining projects – see Chapter 23.
Figure: clusters annotated with labels such as "L1: Health-conscious,
Junk-food-lovers" and "L2: Health-conscious, Sweet-tooth,
Fried-food-lovers"
8 Note that this is merely an illustrative example that we hope will be intuitive to
most readers – we are not considering the "are chocolates junk food?" kind of
debates here.
9 We refer to this kind of marketing practice as cross-selling.
Conclusion
Clustering problems are encountered frequently in any business that has
captured a lot of data and is aiming to derive some insights from it. Solving
clustering problems typically leads to new knowledge about the customer,
process, domain, etc., and is one of the common ways to conduct
knowledge discovery from data – more commonly referred to as KDD. We
shall look at KDD projects again in Chapter 23.
In this chapter, we looked at one end-to-end example of a clustering
problem. The thought process and techniques used for clustering are often
applicable in other areas – we shall see an example of this in Chapter 9,
“Recommendations.”
Further Reading
James et al. (2013) provide excellent introductory coverage of clustering
problems and the typical challenges. The book also covers more details of various
techniques for clustering, including agglomerative clustering.
Reference
James, Gareth, et al. An Introduction to Statistical Learning. New York:
Springer, 2013.
CHAPTER 8
Anomaly Detection
Often, we tend to have an intrinsic notion of whether an observation is
unexpected or abnormal. Sensors that seem to behave erratically or give
readings that are rarely seen; an unheard-of combination of symptoms/
test readings, or a rare pattern in a medical image such as a CT scan; and
unusual network traffic in an IT system – these are a few cases that
tend to attract attention. Detecting such abnormal occurrences in data is
referred to as anomaly detection.
Broadly, there are three categories of anomaly detection based on the
kind of data you have:
• Pure data: You have data that you know does not
contain any abnormal observations. In this case, you
would need to train a model that learns what normal
observations are. You can then determine if any new
observations are not regarded as normal by the model,
that is, if any new observations are novel with respect to
the model. A novel observation can potentially indicate
an anomaly. This subcategory of anomaly detection is
often referred to as novelty detection.
1 www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/semantic/HeartDisease/
2 Healthy, in this context, implies not having a heart disease.
Data Preparation
The usual data preparation aspects such as handling missing values,
etc., would apply here. Also, a quick look at the range of the values of the
various features indicates that some features such as sex are categorical,
while others are numeric. The numeric features also have varying ranges.
Usually, for algorithms that work on numeric features, it can be useful
to normalize them such that all the values fall between 0 and 1. In this case,
let us also prepare a normalized dataset. (In some cases, we might do this
only after an initial exploratory visual analysis of the data.)
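A sketch of this normalization using scikit-learn's MinMaxScaler on a couple of made-up numeric features standing in for the HeartDisease data:

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Made-up values standing in for two of the numeric features.
numeric_features = pd.DataFrame({
    "SerumCholestoral": [240, 210, 273, 360, 126],
    "MaxHeartRate":     [150, 168, 132, 110, 172],
})

scaler = MinMaxScaler()   # rescales each feature to the 0-1 range
normalized = pd.DataFrame(scaler.fit_transform(numeric_features),
                          columns=numeric_features.columns)
print(normalized)
```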
Data Visualization
Typically, we would first like to see if any of the values of individual
features are outliers. This can be seen, for example, with box plots. Also, we
can try to see if certain combinations of feature values are rather rare, for
example, with scatter plots.
Note that we are using the term outlier here in the sense that a point
in a plot is a “visual” outlier. An outlier in a plot may or may not indicate
anomalous observations; visually eyeballing outliers is often just a starting
point for the analysis.
For example, this kind of analysis might indicate which features are
more likely to identify anomalies and help gain more intuition about
the data.
Box Plots
Let us first look at a box plot of one of the features, SerumCholestoral, in
Figure 8-1.
This plot indicates that most of the observations tend to fall between
126 and 360 (with half of the observations falling between 210 and 273).
We can regard values outside this range of 126–360 as outliers. The
library we have used for plotting indicates extreme outliers using crosses
and mild outliers using circles. In this case, we see there are three outliers,
of which one is an extreme outlier.
Let’s take a closer look at what the different visual elements in this box
plot represent at a high level. The lower edge of the box represents the first
quartile (Q1) and indicates that 25% of the observations fall below this
value (210). The upper edge of the box represents the third quartile (Q3)
and indicates that 75% of the observations fall below this value (273). The
thick horizontal line inside the box represents the median and indicates
that 50% of the observations fall below this value (240). The distance
between the third quartile and first quartile is known as interquartile
range (i.e., IQR = Q3 – Q1). The upper whisker extends up to the largest
observation that falls within a distance of 1.5 times the IQR measured from
Q3 (so the upper whisker extends up to 360). The lower whisker extends up
to the smallest observation that falls within a distance of 1.5 times the IQR
measured from Q1 (so the lower whisker extends up to 126). And as you
can see in Figure 8-1, the observations beyond the two whiskers are drawn
as outliers. Note that, in this case, there are no outliers below the lower
whisker.
A box plot is a quick way to get intuition about the values of a single
feature.
You can jot down some of the observations that look interesting, for
example, if they are outliers in the box plots of multiple features, etc., for
further discussion with a domain expert or business analyst. In this case,
the domain expert could be a diagnostician/cardiologist.
This analysis only used the numeric features. We can further include
categorical features in the analysis, using conditional box plots.
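A sketch of producing such a box plot with matplotlib (the book does not say which plotting library it used), reusing the made-up `numeric_features` from the earlier sketch:

```python
import matplotlib.pyplot as plt

# Whiskers extend to 1.5 * IQR beyond the first and third quartiles;
# observations beyond the whiskers are drawn as outliers.
plt.boxplot(numeric_features["SerumCholestoral"], whis=1.5)
plt.ylabel("SerumCholestoral")
plt.show()
```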
Scatter Plots
Scatter plots are used to visualize the relationship between two numeric
features, as shown in Figure 8-3. Note that here, we have used the
normalized feature values so that both features are on the same scale.
Machine Learning
When we use box and scatter plots, we are effectively eyeballing the
distance between the observations, that is, observations that are seen to be
far away are regarded as anomalies. There are several algorithms that work
on detecting anomalies based on the distances between the observations
in a similar way, by extending the notion to multiple features.
We shall use the local outlier factor (LOF) algorithm. Our choice here
is due to the following reasons. First, it is intuitive and highly resonates
with what we tend to visually sense as outliers. Second, it is one of the
few algorithms that can reasonably be used in both cases – unlabeled
data or pure data. Finally, in the case of a healthcare example like this,
we may be interested in finding patients who are different from “similar”
patients, rather than patients who are different w.r.t. the entire cohort. This
naturally leads us to find “local” anomalies, that is, observations which are
anomalous w.r.t. similar observations, but not necessarily w.r.t. the overall
dataset. LOF finds observations that are relatively isolated compared to
their neighbors and is thus able to detect local anomalies.
We shall use the term “local density” of an observation to refer to the
density of the neighbors around that observation, that is, if the neighbors
are densely packed around an observation, the local density of that
observation is high.3 LOF works on the intuition that if an observation has
a local density that is lower than the local density of its neighbors, then the
observation is more anomalous than its neighbors. Figure 8-4 provides a
quick intuition of this using a toy example.
3 The number of neighbors to be considered in calculating the local density is a
parameter that has to be tuned empirically.
Figure 8-4. Local outlier factor: a toy example with two features4
As we see from this figure, the points that are relatively isolated
compared to their neighbors get a higher outlier (anomaly) score. The
anomaly score, in this case, is effectively a function of the deviation of the
local density of an observation as compared to its neighbors. For example,
the local density of point A is similar to the local density of its two closest
neighbors, so A gets a score similar to its two closest neighbors. On the
other hand, the local density of point B is less than the local density of its
closest neighbors, due to which B gets a higher score than its neighbors.
4 https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html
If we run LOF on our HeartDisease dataset, but using only the two
features SerumCholestoral and MaxHeartRate, we get the results shown in
Figure 8-5.
We can extend this to run the LOF algorithm using multiple features
in our normalized HeartDisease dataset to obtain anomaly scores for all
the observations and rank them by the anomaly scores. When we ran LOF
using seven selected features, including SerumCholestoral, MaxHeartRate,
and Sex, we obtained anomaly scores as shown in Figure 8-6.
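A sketch of this step using scikit-learn's LocalOutlierFactor on the normalized features from the data preparation sketch; the feature set and the number of neighbors are our assumptions:

```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=3)       # number of neighbors is a hyperparameter
lof.fit_predict(normalized)                   # -1 = flagged as outlier, 1 = inlier
scores = -lof.negative_outlier_factor_        # higher score = more anomalous

# Rank observations by anomaly score, as described for Figure 8-6.
ranked = normalized.assign(anomaly_score=scores).sort_values("anomaly_score", ascending=False)
print(ranked)
```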
Inference
Now that a model has generated anomaly scores for the observations in
the machine learning step, we can proceed to determine anomalies. In
order to do so, we would first set an appropriate threshold for the anomaly
scores. For example, if we kept a threshold of 1.3, we would flag the first
five observations as anomalies. This is where the insights of the analyst
are useful in deciding an appropriate threshold. The analyst can also
determine if the relative anomaly scores seem appropriate, for example,
checking that a normal observation isn’t getting a high anomaly score.
Based on the inputs from the analyst, we can perform further iterations by
modifying the parameters5 and features used in the LOF algorithm.
5 Such parameters to an ML algorithm, such as the number of neighbors to be
considered in calculating local density, are referred to as hyperparameters. We
shall look at hyperparameters in Chapter 16.
Note that the LOF algorithm could be used with pure data for novelty detection
as well. In this case, each new observation is independently tested for novelty.
The local density of the new observation is compared to the local density of
its neighbors (in the pure data) to determine its anomaly score. If the new
observation’s local density is significantly lower than the local densities of its
neighbors, it can be regarded as a novel observation.
Anatomy of an Anomaly
Having seen an example of anomaly detection, let us delve slightly more
formally into what exactly we mean when we say, “this observation is an
anomaly.” We rarely mean that an observation is abnormal in itself – we
typically mean that it feels abnormal in relation to other observations that
we have seen.
This intuition is reflected in the anomaly score generated by most
anomaly detection techniques. The anomaly scores can be used to
determine relative anomalousness as we saw in the HeartDisease example
earlier. For example, a business analyst could rank the observations
based on their anomaly score and determine an appropriate threshold –
observations whose scores exceed this threshold can be flagged as an
anomaly. In cases where we need deeper insight into the anomalous
behavior, this approach of including a business analyst in the loop can be
quite beneficial. For example, an analyst can tailor the false-alarm6 rate
according to the domain requirements:
6 False alarm, that is, when the algorithm incorrectly flags a normal observation as
an anomaly.
Complex Anomalies
Anomalies do not necessarily occur in isolation. Often it is a particular
group of observations that is anomalous, or an anomaly depends on
additional contextual information. We shall look at a few examples of each of
these cases.
Collective Anomalies
Several common anomalies occur as a group of observations, such that
each observation in itself isn’t abnormal, but the occurrence of the entire
group is abnormal. For example, if on a computer, a buffer overflow,
remote session initiation, and data transfer events all occur together, it is
an anomaly potentially indicating a hack. But individually, each of those
three events can occur during normal operations.8
7 More precisely, points that belong to a cluster containing only that one point.
8 A buffer overflow by itself can be due to a bug in an application.
Contextual Anomalies
In many cases, anomalies depend on the context within which the data is
observed. Common contexts are time and location of an observation.
For example, an individual might typically spend at most $500 on any
day, except for a vacation season like Christmas when they might spend up
to $2000. Now, if they spend, say, $2000 in December, it is normal. But if in
July they spend even $1000, it could be regarded as anomalous.
Similarly, the geographic context of location (captured as, say, latitude/
longitude), nearby events/attractions, and other such information can play
an important role.
Time Series
A sequence of observations seen over time is
referred to as time series data. Examples include sensor readings, financial
transactions, and ECG readings. Here, time is a context for the sequence of
observations that we are analyzing for anomalies. Given its importance
and widespread applicability, we cover a couple of additional examples of
this special case.
It may be inappropriate to ignore the time information and treat
such data as simply a sequence. Incorporating time information into the
analysis can uncover anomalies that depend on the periodicity of the data.
Conclusion
In this chapter, we covered the various nuances and types of anomaly
detection problems. We also looked at one example to get a feel of both
human ways (e.g., box plots) and algorithmic, slightly mysterious, ways to
determine anomalies.
Anomaly detection is one of the areas where the role of the domain
expert (or business analyst) can be significant – to try to make sense of
when a model flags an anomaly or misses one. The importance of this
would vary depending on the criticality and impact of an anomaly in your
business setting.
Further Reading
There are numerous survey papers for anomaly detection. One of the
classics is Chandola, Banerjee, and Kumar (2009). Though anomaly
detection techniques have evolved a lot since then, the conceptual
framework of classifying types of data, problems/applications, and
techniques is still largely applicable.
For more details and up-to-date coverage of the field, refer to
Mehrotra, Mohan, and Huang (2017). It covers applications of anomaly
detection in various domains, followed by the approaches and algorithms
used for anomaly detection.
References
Chandola, Varun, Arindam Banerjee and Vipin Kumar. "Anomaly
detection: A survey." ACM Computing Surveys, Volume 41, Issue 3, July 2009.
Mehrotra, Kishan G., Chilukuri K. Mohan and HuaMing Huang.
Anomaly Detection Principles and Algorithms. Cham, Switzerland:
Springer, 2017.
CHAPTER 9
Recommendations
Once upon a time, you may have had a close friend recommending a book,
song, or movie to you – since the friend knew your “tastes,” you would
typically check out their recommendations.
In the online world today, websites and mobile apps (and the
companies that build them) have collected data about all their visitors and
customers at a granular level of possibly each click that has happened on
the website/app. This data includes every book/song/movie/product they
have purchased/rejected and liked/disliked.
Based on this data, if a company is able to determine a user’s “taste,”
it can then masquerade as their friend and recommend stuff that they
might be interested in. This not only acts as an excellent mechanism to
cross-sell/upsell and thus increase sales for a company but also adds
tremendous value for the user given the breadth of inventory these
companies would carry (think of Amazon.com, Netflix, etc.) and thus
increases customer engagement.
In this chapter, we shall look at an end-to-end example of how
individual “taste” is estimated from data of past customer purchases,
ratings, etc., and how recommendations are made to improve the user
experience.
Data Capture
In this section, we shall first look at the generic notion of items/
interactions and some common variations in regard to how this data
would be captured. Then we look at the example data used in this chapter
for determining recommendations.
Quantifying an Interaction
When a user interacts with an item, how do we capture the nature and
quality of the interaction? Broadly, there are two ways to capture the
feedback of a user regarding an item:
Example Data
In our current example, we shall refer to the sample data shown in
Figure 9-1.
User       Movie        Rating
Alice      Titanic      4
Alice      Terminator   1
Chandler   Titanic      2
Chandler   Terminator   1
Figure 9-1. Movies rated by users, one row per user-movie pair
Data Preparation
We shall pivot the data to obtain the structure shown in Figure 9-2.
Figure 9-2. Movies rated by users, one row per user, one column per
movie (Titanic, You've Got Mail, Terminator, Terminator 2, Hot Shots,
Scary Movie, My Best Friend's Wedding, Men in Black)
Normalization
Users vary not only in their tastes/preferences but also in regard to how
they provide feedback. Two common variations in how users provide
feedback are
1 Subtracts the mean of a user's ratings from each of their ratings.
2 Divides the mean-centered ratings by the standard deviation of a user's ratings.
that they will probably like. In this case, adjusting their score using the
aforementioned techniques could be inappropriate.3
In our current illustrative example, we shall not apply any
normalization.
Data Visualization
Systems that provide recommendations to a user are typically automated
end-to-end. Thus, we are not covering data visualization for such
recommender systems.
3 See Ning, Desrosiers, and Karypis (2015) for a more detailed coverage of these aspects.
Machine Learning
The high-level approach we shall take is, given a user A, to (1) find
users whose tastes are similar to A's and (2) recommend movies that
those similar users have rated highly but A has not yet rated.
Clustering-Based Approach
We first cluster the users based on the ratings they gave to the movies – this
will enable us to find users similar to a given user (step 1). Conceptually,
this is like what we saw in Chapter 7. We shall thus reuse the agglomerative
clustering4 technique that we saw in Chapter 7 to yield a hierarchy of
clusters as shown in Figure 9-3.
4 We have done clustering based on Euclidean distance in this case. Refer to
Chapter 7 for details.
Figure 9-3. Clusters of users based on how they rated the movies
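A sketch of this clustering step with scikit-learn, on a small made-up slice of the ratings matrix (the values are only loosely based on Figure 9-2, and Euclidean distance is used as in the chapter):

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# A made-up slice of the user-by-movie ratings matrix.
ratings = pd.DataFrame(
    {"Titanic": [4, 5, 2, 3, 2], "Terminator": [1, 3, 1, 5, 4], "Scary Movie": [3, 3, 5, 2, 1]},
    index=["Alice", "Bob", "Chandler", "Dilbert", "Einstein"],
)

clustering = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = clustering.fit_predict(ratings)
print(dict(zip(ratings.index, labels)))   # which cluster each user falls into
```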
Inference
Once clusters are formed, we can then predict how user A would rate
a movie based on ratings given to it by other users in their cluster.5 For
example, we could calculate a simple average of the ratings that other users
in their cluster have given to a movie; or we could calculate a weighted
average based on how similar another user is to A.
5 In practice, we could also include users from other similar clusters – refer to Xue
et al. (2005) for more details.
In our simplistic example, we can thus fill the table with predicted
ratings shown in red in Figure 9-4.
Figure 9-4. The ratings matrix of Figure 9-2 with predicted ratings
filled in (shown in red in the original figure) for the movies each
user had not rated
The entries with high predicted ratings would then lead to the
following recommendations by the system6:
• My Best Friend’s Wedding would be recommended to
Bob.
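A sketch of this prediction logic, reusing the `ratings` and `labels` from the previous sketch; in practice you would only predict ratings for movies the user has not yet rated:

```python
import numpy as np

def predict_rating(user, movie, ratings, labels):
    # Average the ratings given to `movie` by the other users in `user`'s cluster.
    cluster = labels[list(ratings.index).index(user)]
    peers = [u for u, l in zip(ratings.index, labels) if l == cluster and u != user]
    return np.mean([ratings.loc[p, movie] for p in peers])

print(predict_rating("Bob", "Titanic", ratings, labels))
```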
End-to-End Automation
As users watch and rate movies, our data will continue to grow. The
clustering algorithm can be automated to run periodically to form clusters
of users. When recommendations are to be shown to a user, we would then
run the inference step.
How frequently the clusters are updated would depend on the domain
and use case. In our case, we can update the clusters every few days.
6 Note that, unlike in Chapter 7, here, we are not interested in understanding what
the clusters represent, etc.
Conclusion
Recommender systems are now a vital part of several online services –
websites and mobile apps. In this chapter, we covered an end-to-end
example of building a recommender system using a clustering technique
we first saw in Chapter 7.
The techniques to build recommender systems are continuing
to evolve rapidly – see the “Further Reading” section for some new
developments.
Further Reading
Our clustering-based approach to recommendation in this chapter is
inspired by Xue et al. (2005).7
One of the earliest recommender systems at Internet scale was at
Amazon.com to recommend products online. For a brief history of
recommender systems and some recent developments using deep
learning techniques, refer to Hardesty (2019).
7 Note that for conceptual simplicity, we've used agglomerative clustering with
Euclidean distance; the paper actually used k-means clustering with Pearson
correlation as the similarity measure.
References
Hardesty, Larry. The history of Amazon's recommendation algorithm. 22 November
2019. <www.amazon.science/the-history-of-amazons-recommendation-algorithm>.
Ning, Xia, Christian Desrosiers and George Karypis. “A Comprehensive
Survey of Neighborhood-Based Recommendation Methods.”
Recommender Systems Handbook. Ed. Francesco Ricci, Lior Rokach and
Bracha Shapira. New York: Springer, 2015.
Xue, Gui-Rong and Lin, Chenxi and Yang, Qiang and Xi, WenSi and
Zeng, Hua-Jun and Yu, Yong and Chen, Zheng. “Scalable Collaborative
Filtering Using Cluster-Based Smoothing.” Proceedings of the 28th Annual
International ACM SIGIR Conference on Research and Development
in Information Retrieval. Salvador, Brazil: Association for Computing
Machinery, 2005. 114–121.
CHAPTER 10
Computer Vision
The term computer vision fundamentally refers to the ability of a software
algorithm to process visual information such as images and videos, similar
to how humans process visual information. Under the overarching field
of artificial intelligence, the field of computer vision is one of the more
complex, higher levels of “intelligence.” The tremendous successes of deep
learning approaches to computer vision resuscitated neural networks, data
science, and AI in general, starting from 2012.
In this chapter, we cover various problems that fall in the category of
computer vision. We begin by looking at various types of problems that
involve processing images and then move on to problems that involve
processing videos. We shall then have a brief look at some public datasets
and competitions that have propelled this field into the limelight in the
past decade. Finally, we shall wrap up with a brief example of how the data
science approach can be used to solve computer vision problems.
Processing Images
As humans, when we perceive a scene or an image visually, we draw
various conclusions:
Image Classification/Regression
The ability to assign an overall target class or numeric value to an image
has several applications. For example:
1 That is, determining the label of an image such as tree, bird, cat, etc.
Object Detection
When we look at any scene, we not only detect the objects such as persons,
vehicles, etc. but also their locations – in our human perception, we
typically detect the type of objects and their locations simultaneously.
Figure 10-1 shows an example of detecting objects such as persons and
airplanes in an image.
2. This example is taken from the documentation of the open source Mask-RCNN library at https://github.com/matterport/Mask_RCNN
3. Briefly introduced in Chapter 1.
Processing Videos
As humans, when we perceive a scene visually over time, or when we
watch a recorded video, we draw various conclusions:
Video Classification
The problem here is to classify an entire video, for example, to determine the kind of activity happening in the video or to determine the genre of a video. Such classifications find use in a variety of applications. These are challenging problems, and the techniques are an active area of research.
Object Tracking
This is the most commonly encountered kind of problem in video analysis and finds applications in varied domains.
4. This figure is sourced from www.kaggle.com/c/prostate-cancer-grade-assessment/overview; our example is based on this Kaggle competition.
Data Preparation
One of the challenges that we see from the biopsy image is that there is a large amount of background whitespace that does not contain any useful information. Also, the image is typically longer in one direction than the other – most image classification techniques work best with square images and can tolerate some variation in the aspect ratio, but probably won’t work well with the kinds of variations in our images.
Since in our case, we are interested more in localized patterns that
indicate cancerous cells rather than a pattern in the entire image, one
technique is to slice the image up into smaller tiles and reassemble the
tiles into a square image. While doing so, we pick the tiles with the least
background whitespace, that is, the most tissue, thus covering most of the
tissue regions. By taking this approach, we resolve both the challenges.
Figure 10-3 shows an example of this.5
5. Refer to www.kaggle.com/iafoss/panda-16x128x128-tiles; this novel data preparation technique turned out to be crucial to solving this problem.
But how should we break the image into tiles, that is, what size should the tiles be? Very small tiles will ensure that nearly no background whitespace is present and most of the tissue region is retained, but they can break the localized patterns up across multiple tiles, making the patterns difficult to detect. On the other hand, large tiles would retain most of the relevant patterns but may incorporate a lot of background whitespace. These variations are initially eyeballed using visual analysis, as in the previous figure.
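As a rough sketch of this tiling idea (not the exact code from the competition), the following assumes the biopsy image is available as a NumPy array of shape (height, width, 3) with a near-white background; it pads the image, cuts it into square tiles, keeps the tiles with the most tissue (i.e., the least whitespace), and reassembles them into a single square image.

```python
import numpy as np

def tile_image(img, tile_size=128, n_tiles=16):
    """Cut img into tiles, keep the n_tiles with the most tissue,
    and reassemble them into a square grid image."""
    h, w, c = img.shape
    # Pad with white so that height and width are multiples of tile_size
    pad_h = (tile_size - h % tile_size) % tile_size
    pad_w = (tile_size - w % tile_size) % tile_size
    img = np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)), constant_values=255)

    # Split the padded image into tiles of shape (tile_size, tile_size, c)
    tiles = (img.reshape(img.shape[0] // tile_size, tile_size,
                         img.shape[1] // tile_size, tile_size, c)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, tile_size, tile_size, c))

    # A lower pixel sum means a darker tile, i.e., more tissue; keep the darkest tiles
    order = np.argsort(tiles.reshape(tiles.shape[0], -1).sum(axis=1))
    chosen = tiles[order[:n_tiles]]
    if len(chosen) < n_tiles:  # small image: pad with plain white tiles
        pad = np.full((n_tiles - len(chosen), tile_size, tile_size, c),
                      255, dtype=img.dtype)
        chosen = np.concatenate([chosen, pad])

    # Arrange the chosen tiles into a sqrt(n_tiles) x sqrt(n_tiles) square grid
    side = int(np.sqrt(n_tiles))
    rows = [np.hstack(chosen[r * side:(r + 1) * side]) for r in range(side)]
    return np.vstack(rows)
```

The tile_size and n_tiles parameters correspond directly to the trade-offs discussed above and can be varied across experiments.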
In some cases, images may contain other marks, for example, pen
marks drawn by a pathologist. It would be better to remove this noise in
the images6 – this also falls in data preparation – and visualize (some of)
the cleaned images to validate the cleansing process.
6. The field of image processing that deals with filters, cleaning, filling, etc., in images is referred to as morphology. Data scientists may need to gain some expertise in morphological techniques for improving the accuracy in computer vision solutions.
Data Visualization
After preparing the data, you would typically want to design visualizations to gain insights such as how tissue coverage varies with the size and number of selected tiles. But we will skip directly to the machine learning step for conciseness.
Machine Learning
Our prepared data now has the transformed square images with maximum
tissue region and the ISUP grade. We shall train a neural network using
this prepared data to predict ISUP grade directly from a transformed
square image.7 From the literature, we see that the EfficientNet family
of models, described in Tan and Le (2019), is apparently state of the art.
The EfficientNet family of models has architectures called B0, B1 … up to
B7. EfficientNet-B0 is the simplest architecture with the least complexity
and is typically used to create a first baseline model. So, let us use an
EfficientNet-B0 model for our problem.8 We shall look at CNNs, on which
EfficientNet is based, in more detail in Chapter 16.
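As an illustrative sketch (and only a sketch: it assumes the transformed square images have been resized to 224×224 and treats the ISUP grade, 0–5, as a plain 6-class classification target), a baseline EfficientNet-B0 model could be set up with Keras along these lines:

```python
import tensorflow as tf

# Pretrained EfficientNet-B0 backbone without its ImageNet classification head.
# Note: the tf.keras EfficientNet models expect raw pixel values in the 0-255 range.
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")

# Add a small head that predicts one of the 6 ISUP grades (0-5)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(6, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds / val_ds would be datasets of (square image, ISUP grade) pairs
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```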
If the model doesn’t perform well, we may need to go back to
improving the data preparation methodologies, for example, using the
tiling approach with varying tile sizes to determine which data preparation
results in the best model. But how to determine whether a model is
performing well? In other words, how do we evaluate the performance of a
model in this case?
7. Without determining the intermediate Gleason score that we saw in Figure 10-2.
8. We’re acting as cowboys here – treating an EfficientNet model as a black box, interested only in predictive accuracy rather than understanding which regions are possibly cancerous and led to the final grade, etc.
9. A variable like ISUP grade, whose values are an ordered set of categories, is referred to as an ordinal. When the target variable is an ordinal, the problem is referred to as ordinal regression, which is slightly different from both classification and regression that we saw earlier. (Ordinals are discussed further in Chapter 14.)
During multiple iterations and experiments, we may have several models that
we have trained. In practice, rather than use the best of these models, it is
often better to use an ensemble of models.
An ensemble model basically takes the predictions from all its constituent
underlying models and aggregates them to give the final result. As a crude
example, the ensemble might simply return the majority result, for example,
if a majority of the models is predicting an ISUP grade of 3, then the final
result is 3.
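A minimal sketch of such a majority-vote ensemble, assuming each trained model exposes a Keras-style predict method that returns per-class probabilities for a batch of images:

```python
import numpy as np

def ensemble_predict(models, images):
    """Predict ISUP grades by majority vote across the given models."""
    # Each model's predicted grade per image: shape (n_models, n_images)
    votes = np.stack([np.argmax(m.predict(images), axis=1) for m in models])
    # For each image, pick the grade predicted by the largest number of models
    return np.array([np.bincount(v, minlength=6).argmax() for v in votes.T])
```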
Inference
During inference, we will need to run the exact data preparation steps
that we used during training. For example, prior to creating our final
model in the machine learning step, we would have used certain values
for the size and number of selected tiles and transformed the image into
a square image as explained in the data preparation section. Those exact
transformations, with the same values for the size and number of selected
tiles, should be applied to the new images during inference as well.
One common technique applied during inference for computer vision problems
is that of test-time augmentation (TTA). TTA is intended to increase robustness
against variations in the orientation or positioning of the image.
Now for TTA, we can create an augmented image by, for example, shifting
the original image slightly to the right. Then the model is used to infer
the probabilities of the ISUP grades from both the original image and the
augmented image. By averaging the probabilities of each ISUP grade across
the original and augmented images, we obtain the final probabilities of the
ISUP grades. We then predict the ISUP grade with the highest probability as
our final result.
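A minimal sketch of this flavor of TTA, assuming a trained model with a Keras-style predict method and using a small horizontal shift as the only augmentation:

```python
import numpy as np

def predict_with_tta(model, images, shift=10):
    """Average predictions over the original images and a shifted copy."""
    # Shift each image a few pixels to the right (np.roll wraps around,
    # which is acceptable here given the white background of our tiles)
    shifted = np.roll(images, shift, axis=2)  # images: (n, height, width, channels)
    probs = (model.predict(images) + model.predict(shifted)) / 2.0
    return probs.argmax(axis=1)  # final ISUP grade per image
```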
Data Engineering
In our dataset, each individual image can be quite large, and the overall
dataset size can be in the order of hundreds of gigabytes. Typically, if you
have a team of data scientists working collaboratively on large datasets, a
shared file system will be useful. This way, as data scientists perform the
various data preparation steps we saw earlier, the modified images will
also be accessible to the entire team.
Also, for training models such as CNN-based models, you would need
to use powerful GPUs because the computations are rather intensive and
complex. While it is possible to set up an on-premises infrastructure of this
sort, it is increasingly common and cost-effective to use cloud services
such as AWS10 for such deep learning experimentation. For example, AWS
Elastic File System (EFS) can be used as a shared file system, and GPU
machines of varying sizes can be used to train models.
Conclusion
In this chapter, we covered the various subclasses of problems that fall
under the ambit of computer vision. We also looked at an end-to-end
example of detecting prostate cancer grade from biopsy images.
Further Reading
Refer to Liu, Ouyang, and Wang (2020) for a survey of object detection
techniques based on deep learning.
Chollet (2018) has an excellent chapter that introduces technical
details of deep learning for computer vision.
Lu et al. (2019) cover a wide range of topics in medical imaging.
Refer to Prostate cANcer graDe Assessment (PANDA) Challenge for
more details of the prostate cancer grading process.
10. Amazon Web Services, https://aws.amazon.com/
References
Chollet, Francois. Deep Learning with Python. NY, USA: Manning, 2018.
Liu, L., W. Ouyang and X. Wang. “Deep Learning for Generic Object
Detection: A Survey.” International Journal of Computer Vision 128 (2020):
261–318.
Lu, Le, et al. Deep Learning and Convolutional Neural Networks
for Medical Imaging and Clinical Informatics. Cham, Switzerland:
Springer, 2019.
Prostate cANcer graDe Assessment (PANDA) Challenge. www.kaggle.com/c/prostate-cancer-grade-assessment/overview/additional-resources. n.d.
Tan, Mingxing and Quoc V. Le. “EfficientNet: Rethinking Model Scaling
for Convolutional Neural Networks.” Proceedings of the 36th International
Conference on Machine Learning. Long Beach, California, 2019.
CHAPTER 11
Sequential Decision-Making
One of the more advanced manifestations of “intelligence” is the ability to
voluntarily take decisions that knowingly accept losses in the short term with a view to gaining a desired outcome in the longer run. The “desired
outcome” can take various forms – maximizing profits/rewards (e.g., in
an investment strategy), maximizing the chances to realize a targeted goal
(e.g., winning in a competition such as chess), or saving patients with life-
threatening diseases.
The common thread to these is the ability to take a sequence of
decisions – buy/sell, or the move to make, or the test/treatment to
recommend. And these decisions are to be taken dynamically in an
environment which is itself ever changing and influenced by the decisions
taken as well.
Reinforcement learning (RL) is a branch of ML that deals with this
aspect of automating sequential decision-making to maximize long-
term rewards, often at the seeming cost of short-term losses. While
sequential decision-making is a class of problems, RL is currently the de
facto framework within which these kinds of problems are formulated
and solved. In the rest of this chapter, we shall thus cover reinforcement
learning.
1. RL has been around for a long time, with the theory originating in the late 1950s. But RL combined with deep learning (deep RL) has led to the recent popularity of RL since 2017. In this book, we simply use the term RL to refer to the overall field.
The RL Setting
As an illustrative example for this chapter, let us consider a diagnostic
expert Dr. House, whose only goal is to save his patient’s life by diagnosing
and treating the patient’s mysterious ailment in time before they succumb
to the illness. Unfortunately, this one-pointed focus causes him to violate
legal, ethical, or professional rules and conventions when required, if it
enables him to improve the chances of saving his patient’s life. This can
often result in drastic measures such as subjecting the patient to extreme
pain and suffering temporarily, or violating hospital policies/legalese,
if it offers slightly improved odds of diagnosing the patient faster than
traditional tests for which there isn’t sufficient time.
Each case that House handles begins with a patient exhibiting
intriguing symptoms. After House takes several sequential decisions that
each recommend further tests resulting in new observations, a case is
concluded when the root cause of the patient’s symptoms is determined in
time or when the patient dies. House being the expert that he is, the latter
outcome has been rather rare.
Let us see how an AI engine, Dr. Nestor,2 can be trained to diagnose like
House. While we would like Nestor to learn the diagnostic skills of House,
we also want to temper it with some aversion to illegal actions. In the
process of looking at how Nestor can be trained to achieve this, we shall
cover the basic variations of RL and some of the challenges.
2. The name is a nod to the robot Nestor-10 in Asimov’s Little Lost Robot which is not fed with the constraint to prevent harm to humans.
Training Nestor
The agent can be trained in a few different ways depending on the kind
of data and domain. In our current example, we will be structuring the
training into multiple phases. Note that, depending on the details of the
problem being solved, there can be numerous variations in regard to
the kinds of phases used – the phases we cover are oriented toward our
illustrative example of Nestor.
In this section, we shall first understand how Nestor interacts with its
environment while attempting to solve a single case; this interaction with
the environment is common for all the training phases that follow.
Episode
We provide an intuitive notion of what an episode is, based on Figure 11-1
which shows what happens within each episode – this forms the basic
framework for RL.
Figure 11-1. The basic framework within an episode: at each step t, the agent/actor (decision maker) takes an action At on the environment; the environment state moves from St to St+1, and the agent receives the next observation Ot+1 along with a reward Rt+1 computed by the reward function.
• -0.2 for any illegal interim action, that is, such actions
are penalized slightly.
At the end of each episode, Nestor “learns” based on the total reward
it got for all its actions. For example, it knows that a sequence of decisions
that led to a total reward of 1 is better than a sequence of decisions that led
to a total reward of 0.6, etc.
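This interaction is commonly coded as a loop over the steps of an episode. The sketch below is purely illustrative: the environment and agent objects are hypothetical stand-ins, assumed to expose reset/step and act/learn methods respectively, and the learning logic itself is not shown.

```python
def run_episode(env, agent):
    """Run one episode and return the total reward the agent collected."""
    observation = env.reset()                         # initial symptoms of the new case
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(observation)               # e.g., recommend a test or treatment
        observation, reward, done = env.step(action)  # environment reacts; e.g., reward of -0.2 for an illegal action
        total_reward += reward
    agent.learn(total_reward)                         # learn from the total reward at the end of the episode
    return total_reward
```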
Having understood how Nestor interacts with its environment, let us
now look at the phases of training.
3. Final observation is that the patient gets diagnosed/cured or dies.
Training Phases
In this section, we shall look at the phases that could go into one possible
way of training Nestor.
Past Cases
In our current example, we shall first use the past case files of House
during the episodes, because we want Nestor to diagnose like House
would.
To execute the previous framework on past case files of House, the
setup can be such that the agent mimics the action of the expert (House),
and the environment/observation is updated to reflect what actually
occurred in the case as a result of the action. The mechanism of assigning
rewards to each action, and learning based on the total reward at the end
of an episode, is the same as mentioned in the previous section. In this
way, the basic framework explained earlier unfolds, and the agent learns
from the past case files and their handling by the expert.
Supervised Exploration
Now, we allow the agent to take its decisions independently, rather
than imitate the expert. In this case, we shall continue to supervise the
agent, that is, the action recommended by the agent isn’t executed in the
real world unless approved by a supervising expert diagnostician. The
framework can be slightly modified so that if an utterly nonsensical4 action
is suggested, the agent gets a severe rebuke immediately (e.g., a reward
of -1) and the episode (for the agent) is terminated. Also, if, for a particular
episode, the actions recommended by the agent are entirely nonsensical,
we can fall back to allow the agent to imitate the expert for that episode (as
in the previous section). Apart from these minor modifications, the rest
of the framework continues in the same way, that is, rewards are assigned
similarly, and the agent learns similarly at the end of each episode.
Once Nestor has been trained and is found to be taking reasonable
decisions, it can be exploited to take decisions on new cases. (Our kid AI
engine is now a teenager.)
Supervised Exploitation
When being exploited, the environment-agent interaction shown in
Figure 11-1 and the supervisory setup in the previous section still apply,
with one important difference. In the earlier supervised exploration phase,
Nestor was trying to learn as much as possible about which decisions are
better – so it could often take new random decisions to see what reward it
gets. In the exploitation phase, Nestor will instead only take decisions that
it expects will result in maximum overall reward for the entire episode. The
agent can now be considered as assisting an expert by suggesting novel
approaches which the expert carefully evaluates – Nestor is now a part of
the expert’s “team” of doctors, effectively.
4. Nonsensical, that is, incorrect, inappropriate, or impossible to execute.
Having seen an example of training and using an agent, let us now look
at a few variations, particularly in regard to the data used for training.
5. It “learned” further by playing thousands of matches with human players.
Simulated Data
In addition to learning from human games, the initial version of AlphaGo
also learned from data simulated by self-play, that is, playing games
against itself. This version of AlphaGo was strong enough to defeat Lee
Sedol, the Go World Champion.
The next version of AlphaGo – called AlphaGo Zero – was trained
solely by self-play, that is, the engine played games against itself starting
from random play without any use of human data. This way, the engine
would not be limited by the human understanding of the game. AlphaGo
Zero defeated AlphaGo by 100 games to 0.
Motivated by these successes, the latest AlphaZero system was created,
which generalized the AlphaGo Zero approach into a single algorithm that
learns games such as chess, Go, and shogi. See Silver et al. (2018) for more
details. It is this version we referred to at the beginning of this chapter.
Similarly, in simulations for areas such as self-driving cars,6 the
agent can try arbitrary random trajectories of the car that may result in
crashes until it learns to drive properly. This is one of the more popular
environments for developers to learn about RL.
Challenges in RL
Having covered some basic concepts around RL, in this section, we look at
a few common, practical challenges in RL.
6. For example, see Amazon DeepRacer.
Availability of Data
Reinforcement learning requires a lot of data. For games such as chess and
Go, in addition to the historical archive of grandmaster games, simulations
can be used to generate an arbitrary amount of data, which is one of the
reasons those were the first areas to be “solved.” For realistic applications
where the risk of using simulations is very high, and which therefore rely
on expert imitation and/or supervision, the availability of sufficient data is
a primary challenge.
Information in Observations
Even if we manage to get a large amount of data, the other challenge
in realistic situations is ensuring that each observation has sufficient
information. As Figure 11-1 shows, an observation is only a subset of the
overall state of the environment. In other words, an observation may not
capture sufficient information from the environment. Often, data that
humans “observe” are not available in an automated way, for example, for
automating diagnosis/treatment, many aspects seen by an expert doctor
when they size up a patient may not be captured as “readings” in the
patient records. Humans also tend to have an intuition that leads them to gather more information (i.e., expand the scope of an observation) in some cases. House, for example, has been known to break into a patient’s house to get more information or to regard an apparent personality trait such as bravery or altruism as a symptom. Knowing when to expand the scope of observations is one of the keys to decision-making and is a challenge to automate.
Conclusion
In this chapter, we only touched upon some of the fundamentals of
modern deep reinforcement learning and covered a basic setting
for RL. Numerous variations abound regarding the way rewards are
determined, how actions are taken based on the observations, and so
forth. We shall point to some of the relevant literature in the next section.
The reader may have realized that Dr. House in our illustrative example
is based on the eponymous TV series. The semi-fictional narrative in this chapter is indicative of how classical science fiction, somewhat reminiscent of a few of Asimov’s works, is increasingly becoming a reality, especially as machines can take decisions that seem “intuitive,” like humans do.
Our take is that, apart from gaming systems, we are far from letting AI
agents loose in the field to take sequential decisions for maximizing overall
reward. Initial adoption will likely be along the lines of augmenting human
experts to improve decision-making based on novel, alternative decision
paths suggested by the agent. There is increasing research in this regard,
even in fields such as healthcare.
Further Reading
For a detailed introduction to RL, including intuitive and some
mathematical details, refer to Sutton and Barto (2018).
For the milestone publication of AlphaZero, refer to Silver et al. (2018)
and its supplementary material.
Amazon DeepRacer is one of the more popular web services to start learning RL principles with a hands-on approach.
For a general survey of reinforcement learning in healthcare – it
doesn’t get more real than in healthcare! – refer to Yu, Liu, and Nemati
(2020). It provides a good overview in general of the extant jargon and
variations in the field of RL, followed by applications in healthcare
including diagnosis and dynamic treatment regimes, and the significant
challenges. An example of a real-world application in the field of dynamic
treatment recommendation is Wang et al. (2018).
References
Silver, David, et al. “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.” Science 362 (7 December 2018): 1140–1144.
Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction, 2nd edition. Cambridge, Massachusetts: The MIT Press, 2018.
Wang, Lu, et al. “Supervised Reinforcement Learning with Recurrent
Neural Network for Dynamic Treatment Recommendation.” Proceedings of
ACM Conference (Conference’17). New York, NY, USA: ACM, 2018.
Yu, Chao, Jiming Liu and Shamim Nemati. Reinforcement Learning in
Healthcare: A Survey. https://arxiv.org/abs/1908.08796. 2020.
PART III
Techniques and Technologies
In Part II, we looked at a few data science problems and how they are
solved by choosing the appropriate techniques in the different steps of the
data science process. The skill of choosing the right techniques depends
on a good conceptual understanding of the host of techniques available
for each step of the data science process. In this part, we cover the various
techniques used in different steps of the data science process and the
technologies (libraries, tools, services, etc.) that you can use to apply these
techniques. This part forms the technical meat of the book.
In Chapter 12, we first cover an overview of the techniques and
technologies involved in all the steps of the data science process. In
Chapters 13–17, we look at each step of the data science process in
more detail – one chapter per step of the data science process, covering the
techniques/technologies for that step. Then, in Chapter 18, we cover other
important tools and services that cut across multiple steps in the data
science process.
In Chapter 19, we look at a reference architecture which brings
together these technologies to enable an operational data science team.
Having thus grasped the concepts behind the techniques, in Chapter 20,
we wrap up the discussion about monks vs. cowboys that we started in
Chapter 3. We fill in the differences in their praxis, particularly regarding how
the culture determines the preferred choices of techniques and technologies.
CHAPTER 12
Techniques and Technologies: An Overview
In this chapter, we provide a very high-level overview of the various
techniques and technologies used for data science. This brief overview
is intended to establish the framework for specifics covered in
Chapters 13–19.
Figure 12-1 and Figure 12-2 show some of the techniques and
technologies, respectively, used in each step of the data science process.
1. Some of the technologies depicted within one step might be useful in other steps as well. We have included them within the step where we find them most useful.
Figure 12-3. Some tools and services that cut across multiple steps of
the data science process
Thus, by the end of Chapter 18, we would have covered the various
technologies used for data science. Then in Chapter 19, we shall look at
how these technologies come together in a reference architecture to enable
the operations of a data science team.
Note that the three figures in this chapter are intended to capture
some of the key representative terms and concepts in the field of data
science at the time of writing. We find them useful also as a framework
for categorizing the overall field, that is, whenever a new technique or
technology is encountered, we find it useful to place them within an
appropriate category in one of these figures. This helps capture the
primary capability of a technique/technology and enables communication
with other team members and stakeholders about it. Placing a technique
or technology in an appropriate category can also be useful to determine,
compare, and evaluate alternatives.
CHAPTER 13
Data Capture
In this chapter, we shall cover the typical techniques and technologies
used in the data capture step of the data science process.
For data capture, the very first activity is to identify what data is
relevant and which are the sources of the relevant data. Once these are
determined, the further activities in the data capture step, and the relevant
components, are shown in Figure 13-1.
Figure 13-1. Components of the data capture step: ingestion (2), using tools such as Spark, Airflow, and Kafka, loads data from the sources into the data storage, which is then used by the data science team for the next steps like data preparation.
1. In some cases, the data could also have been collected by manual entries in an Excel file by personnel.
In some cases, you may source data from external APIs.
In a few rare cases, you may want to extract some information from
websites directly. This is referred to as web scraping and requires some
knowledge of the layout of a source web page. It is not a recommended
approach, because whenever the structure of a website changes, you
would need to modify the way you extract information – thus, maintenance
and reliability tend to be an issue with this approach. But this approach
is followed in rare cases – for example, if you are building an information
system based on publicly available information from multiple websites
that do not expose an API.
A data scientist could theoretically query these disparate sources to
obtain the data they need for analysis. But a significantly more effective
approach would be to collate this data into a single, centralized sandbox,
that is, data storage, from which the data science team can perform their
analysis. This requires ingesting the data from these multiple data sources,
which we shall look at next.
Ingestion (2)
Various ETL2 and workflow libraries/tools can be used to merge, transform, and ingest data into the central data storage from multiple data sources – we are referring to this broadly as ingestion. Spark, Airflow, and Kafka are commonly used – often together – and represent three common categories of tools:
• Spark is a distributed data processing engine used to transform and move large volumes of data.
• Airflow is a workflow orchestration tool used to schedule and manage ingestion pipelines.
• Kafka is a distributed streaming platform used to ingest continuous streams of data in (near) real time.
Data Storage
The data storage contains all the ingested data and is used by the rest
of the data science team for the other steps in the data science process,
beginning with data preparation.
2. Extract, transform, and load: refers to extracting data from one or more sources, transforming the data, and finally loading it into a destination system.
The data storage can consist of one or more of the following: data lake
(3), data warehouse (4), and shared file system (5). We shall look at each of
these in the following subsections.
3. Before S3, HDFS (Hadoop Distributed File System) was a popular choice.
4. When engineers have to handle row-level updates themselves without support for atomicity, etc., from the storage technology, it often leads to issues related to data inconsistency, duplication, etc.
5. Using Redshift Spectrum.
In any case, we would suggest that you choose the right technology, that is,
data lake and/or data warehouse, for your team and organization – without
getting distracted by the notion of lakehouse.
Programmatic Access
Tabular data stored in a data lake in a format such as CSV, JSON, or Parquet can be read using programmatic APIs in a language such as Python, R, Scala, etc.
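For instance, a data scientist could read such files from an S3-based data lake directly into a pandas DataFrame, as in the sketch below (the bucket and paths are made up, and the s3fs package is assumed to be installed so that pandas understands s3:// URLs):

```python
import pandas as pd

# Read tabular data directly from the data lake into DataFrames
orders = pd.read_parquet("s3://example-data-lake/sales/orders.parquet")
clicks = pd.read_csv("s3://example-data-lake/web/clickstream.csv")

print(orders.head())
```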
Apart from tabular data, other kinds of data such as images, etc.,
can be accessed using the programmatic APIs provided by the data lake.
For example, AWS provides libraries in programming languages such as
Python to access data stored in S3.
6. AWS also provides paid services of its own, for example, AWS Kinesis is an alternative to Kafka.
For data lakes and data warehouses, paid tools and services on the
cloud are typically preferred these days as they offer a flexible pricing
model and elastic scaling.7 They are often more cost-effective than
deploying an open source warehouse (such as PostgreSQL) and managing
it in your own infrastructure.
Data Engineering
Within the data capture step, a domain expert or the chief data scientist typically
determines which data is to be captured and the appropriate data sources.
The further activities mentioned in this chapter – of ingesting the data
into the data storage and enabling efficient access to it – primarily fall
within the ambit of data engineering. The various tools mentioned in this
chapter are also typically owned by data engineering.
Conclusion
In this chapter, we covered the various components and activities involved
in the data capture step of the data science process, including some of the
technologies used. We also touched upon some aspects around opting for
an open source vs. a paid tool.
7. Elastic scaling refers to the capability of automatically or easily scaling the infrastructure to meet the demands.
CHAPTER 14
Data Preparation
This chapter is dedicated to the data preparation step of the data science
process. The captured data is typically explored to understand it better.1
Such exploration may reveal that the captured data is in a form which
cannot be directly used to build models – we saw one such case in
Chapter 6 where the data captured for predicting the category of an email
consisted of just emails and their folders. This data had to go through a
lot of preparation before it could be used for building models. In some
other cases, it may seem that the data could be given to ML algorithms
directly, but preparing the data in various ways might result in more
effective models. We saw such a case in one of the examples we discussed
in Chapter 1, where the captured data contained the timestamps and sale
amounts for transactions at the checkout counter of a store. In this case,
we felt that the sale amount might have some trends based on what day of the week it is, what month it is, etc. Hence, we transformed the data so that it contains the hour, day, month, etc., along with the corresponding aggregated sale amount, hoping that this would enable ML algorithms to find such trends. Preparing the data in such ways might thus result in better models.
1. This may also involve looking at statistics like mean, median, standard deviation, etc.
Feature Scaling
Another task that we have discussed earlier, and you will likely be doing
in the data preparation step, is transforming your features so that they
have the same scale.2 To do this, you could perform a min-max scaling
that transforms the feature values so that they range from 0 to 1 – the smallest feature value is thus transformed to 0 and the largest to 1. The other common
approach to achieve this is a technique called standardization. This
approach transforms each feature value by subtracting the mean from it
and dividing the difference by the standard deviation. You can see that this
approach does not restrict the transformed values to a specific range.
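A small illustration of the two approaches using scikit-learn, on a made-up feature matrix of heights and weights:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[150.0, 50.0],
              [160.0, 60.0],
              [180.0, 90.0]])   # e.g., height (cm) and weight (kg)

# Min-max scaling: each feature is mapped to the range [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardization: subtract the mean and divide by the standard deviation;
# the result is not restricted to any specific range
print(StandardScaler().fit_transform(X))
```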
Text Preprocessing
Let’s revisit the text preprocessing tasks from the NLP problem in
Chapter 6, where the goal was to build a model that could infer/predict the
category of an email based on the text in the email. For achieving this goal,
the personal assistant stripped the greeting and closing from each email
body in the data and preprocessed the remaining email body by removing
punctuation, converting it to lowercase, extracting individual words, and
removing stop words which it thought were not useful for inferring the
category of an email. As the next step, the assistant used a technique called
lemmatization on the words in all emails to convert the inflected forms of
words to their base forms because it thought that the inflected forms might
not give any additional clue about the category of an email. Lemmatization
also reduced the vocabulary that the assistant had to deal with. The
assistant then extracted features from the lists of base words of all emails
using the bag-of-words technique.
2. There are various reasons why you might want to do this (cf. Chapters 7 and 8).
In this section, we will discuss another technique that you can use to reduce words to their base forms, known as stemming. We will also see a different technique, known as TF-IDF, that can be used for extracting features from lists of base words.
Stemming
Lemmatization performs a sophisticated analysis on each word, considering its part of speech, and accordingly maps the word to its correct base form. Stemming, on the other hand, adopts a cruder approach that applies rules to simply cut off portions of words to arrive at their base forms. For example, if you perform stemming3 on the word “working,” the suffix “ing” will be chopped off and you will get “work” as the base word. This crude approach obviously works faster than lemmatization, which performs a detailed analysis.
However, stemming could sometimes give you a result that is not a valid word. For example, performing stemming on the word “dries” might simply remove the suffix “es” and return “dri” as the base word, which is not a valid word. Let’s look at another word, “worse.” If you pass the word “worse” along with its part of speech “adjective” to the lemmatization operation, it will be able to figure out with its detailed analysis that the base word is “bad.” On the other hand, there is no way that the approach of cutting off portions of words adopted by stemming can produce the result “bad” from the original word “worse.” To summarize, stemming and lemmatization have their strengths and weaknesses, and you need to decide which one suits your requirements.
3. For example, using Porter Stemmer.
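The following sketch contrasts the two approaches using NLTK (assuming the WordNet data has been downloaded once via nltk.download):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download("wordnet") may be needed once before using the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("working"))                  # 'work' - suffix chopped off
print(stemmer.stem("dries"))                    # 'dri'  - not a valid word
print(lemmatizer.lemmatize("worse", pos="a"))   # 'bad'  - uses the part of speech 'adjective'
```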
TF-IDF
We used the bag-of-words technique in Chapter 6 to extract features
from the lists of base words of all emails. This technique determined
the vocabulary (total set of unique base words across all emails) and
then created one feature for every base word in the vocabulary. It then
calculated the value of a feature for an email by counting the number of
times the corresponding base word occurred in that email. Figure 14-1
shows a partial view of the features extracted from the lists of base words
of all emails using the bag-of-words technique – the figure also shows the
target variable Category. Note that there is a feature for every unique base
word. You can see that the value of feature visualization for the third email
is 2 because the base word visualization was present twice in the list of
base words of this email.
Figure 14-1. Partial view of the bag-of-words features extracted from the lists of base words of all emails, along with the target variable Category (Product Development, Research Work, Trainings). There is one feature per base word in the vocabulary, and each cell holds the count of that base word in the corresponding email.
So you can see that the bag-of-words technique uses a very basic
approach of counting the number of occurrences of a base word to
calculate the value of a feature. There are several enhancements you can
make to this approach of calculating features which could lead to more
effective features. And more effective features could result in more effective
models which is crucial for success in data science. Let’s look at a few such
enhancements.
Let’s say there was an email in our data which was very long and
contained the word visualization many times. The bag-of-words technique
would simply count the occurrences and give a high value for the feature
visualization for this email. The ML algorithm would think that the feature
visualization has a much higher value for this email compared to the
third email in the previous figure. But this is probably true only because
this email is much longer than the third email. So it will be a good idea to
make adjustments for the email size while calculating feature values. You
can do this by simply dividing the original value of the feature for an email
by the size of the list of base words for that email. The modified value of
the feature is a more effective representative of how frequently the base
word occurs in the email. We refer to this modified feature value as term
frequency (TF).
However, this modified feature calculation process still has limitations.
The frequent occurrence of a base word in an email may not be of special
significance if that word is in general a very common word and thus
occurs frequently in other emails too. But our modified feature calculation
process does not consider this aspect, so it gives a high feature value if
a base word occurs frequently in an email even if it is a very common
word. So we could modify our feature value calculation process further
by scaling down the feature value considerably if the corresponding base
word is a very common word. To do this, we can multiply the feature
value calculated previously by a value which is called inverse document
frequency (IDF). IDF is equal to the logarithm of the quotient obtained by dividing the total number of emails by the number of emails that contain the corresponding base word. For a very common base word, this quotient is close to 1, so its IDF is close to 0 and the feature value is scaled down considerably.
4. On similar lines, you can figure out what effect this operation will have for a rare base word.
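In practice, you would rarely compute TF-IDF by hand; libraries such as scikit-learn provide it directly. A small sketch with made-up preprocessed email texts follows; note that scikit-learn’s TfidfVectorizer uses a slightly smoothed variant of the IDF formula described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

emails = [
    "meeting product plan release",
    "visualization chart visualization dashboard",
    "training schedule register session",
]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(emails)   # sparse matrix: one row per email

print(vectorizer.get_feature_names_out())     # the vocabulary (one feature per word)
print(features.toarray())                     # TF-IDF value of each feature per email
```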
higher level of fatigue than the value Low, and so on. Thus, these possible
values have a natural order. Such a categorical variable whose possible
values have a natural order is called an ordinal variable. It will be ideal to
preserve this order while converting such a variable into a numeric variable.
For example, since the value Low indicates a higher level of fatigue than the
value Very Low, you should encode Low using a higher number compared
to the number you use to encode Very Low, and so on. So you could simply
encode Very Low as 0, Low as 1, Moderate as 2, High as 3, and Very High as
4. Thus, for all patients whose level of fatigue was originally Very Low, the
encoded value of the variable will be 0. Similarly, for patients whose level of
fatigue was Low, the encoded value will be 1, and so on.
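A minimal sketch of this ordinal encoding with pandas, assuming the patient data is in a DataFrame with a Fatigue column holding these category names:

```python
import pandas as pd

df = pd.DataFrame({"Fatigue": ["Very Low", "Low", "Moderate", "High", "Very High", "Low"]})

# Map each category to a number that preserves the natural order
fatigue_order = {"Very Low": 0, "Low": 1, "Moderate": 2, "High": 3, "Very High": 4}
df["Fatigue"] = df["Fatigue"].map(fatigue_order)

print(df)
```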
Let’s look at a different kind of categorical variable now. Figure 14-2
shows a partial view of the data where each observation corresponds to a
movie. The variable Length indicates the length of the movie in minutes,
variable Rating indicates the average rating given by users, and Genre
indicates what type of movie it is. For each movie, the categorical variable
Genre can take one of the three possible values: Action, Comedy, and
Horror. These possible values are just names of movie genres and do not
have a natural order. Such a categorical variable whose possible values are
just names that do not have a natural order is called a nominal variable.
To convert this variable into a numeric variable, you could obviously
encode Action as 0, Comedy as 1, and Horror as 2 as shown in Figure 14-3.
But this approach could be misleading in some cases. Let’s say you were
trying to identify clusters of movies based on Euclidean distance between
movies (we discussed clustering in Chapter 7). If the encoded data in
Figure 14-3 is used for clustering5 and distances are calculated based on
the three variables, the third movie seen in the table will seem closer to
the second movie than to the first movie. This is because the length and
rating of the third movie is equal to the length and rating of the other two
movies, but the encoded genre of the third movie is closer to the encoded
5. We have avoided feature scaling for simplicity of discussion.
genre of the second movie than to the encoded genre of the first movie.
However, this interpretation is not correct because you can see in the
original data (Figure 14-2) that the genre of the third movie is simply
different from the genres of the other two movies – it is not closer to one
and farther from the other. So we will discuss a different technique to
convert the categorical variable genre into numeric form.
Figure 14-3. Movie data with the Genre variable encoded as a number (Action = 0, Comedy = 1, Horror = 2)

Length   Rating   Genre
...      ...      ...
125      8.1      0
125      8.1      1
125      8.1      2
...      ...      ...
We will replace the variable Genre in Figure 14-2 with three new
variables corresponding to the three possible values of Genre variable.
Figure 14-4 shows the movie data with these three new variables. You can
see that there is a variable for each genre. If a movie belongs to a particular
genre, the value of the variable corresponding to that genre is 1, and the
value of the other two variables is 0. For example, the first movie seen in this
figure belongs to the genre Action, so the value of variable Action is 1, the
value of variable Comedy is 0, and the value of variable Horror is 0. This is
the general idea behind the popular technique known as one-hot encoding.
So we can see how the data and the nature of problem being solved affect
our choice of the technique for converting categorical variables into
numeric variables.
Figure 14-4. Movie data with Genre replaced by three one-hot encoded variables

Length   Rating   Action   Comedy   Horror
...      ...      ...      ...      ...
125      8.1      1        0        0
125      8.1      0        1        0
125      8.1      0        0        1
...      ...      ...      ...      ...
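With pandas, for example, one-hot encoding the Genre column is a single call; the sketch below uses a toy version of the movie data (the resulting columns are named Genre_Action, Genre_Comedy, and Genre_Horror, corresponding to the Action, Comedy, and Horror variables of Figure 14-4):

```python
import pandas as pd

movies = pd.DataFrame({
    "Length": [125, 125, 125],
    "Rating": [8.1, 8.1, 8.1],
    "Genre":  ["Action", "Comedy", "Horror"],
})

# Replace Genre with one binary column per genre
encoded = pd.get_dummies(movies, columns=["Genre"])
print(encoded)
```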
Transforming Images
As we discussed in the beginning of the chapter, preparing the data in
various ways can result in more effective models. For computer vision
problems, a common task that is carried out in data preparation is
transforming original images to produce new images, with the aim of
building effective models. The exact techniques you use to transform
the images depend on the effect you want the transformations to have for
creating better models. We talked about transforming images in Chapter 10
where we were looking for localized patterns in images with a lot of
background whitespace. Accordingly, the technique/approach we adopted
was to slice the image into tiles, choose the tiles with most tissue, and
reassemble them to create a square image that has a large tissue region.
Other common ways of transforming images include the following:
• Flipping horizontally/vertically
• Zooming in/out
• Shifting horizontally/vertically
Libraries
Data manipulation and analysis libraries usually include functions for
aggregation and transformation of data. Pandas, for example, is a popular
library for processing data in Python – it has functions to aggregate and
transform data, including date/time transformations.
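For instance, the hour/day/month aggregation of sale amounts described at the beginning of this chapter could be done roughly as follows (with a made-up toy dataset of transactions):

```python
import pandas as pd

sales = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-03-01 09:15", "2021-03-01 09:40", "2021-03-02 18:05"]),
    "amount":    [250.0, 120.0, 560.0],
})

# Derive hour, day of week, and month from the timestamp ...
sales["hour"] = sales["timestamp"].dt.hour
sales["day_of_week"] = sales["timestamp"].dt.day_name()
sales["month"] = sales["timestamp"].dt.month

# ... and aggregate the sale amount at that granularity
aggregated = sales.groupby(["month", "day_of_week", "hour"])["amount"].sum().reset_index()
print(aggregated)
```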
If you are working with big data, for example, on Spark clusters, you
might use Spark library functions for data preparation. Also, the relatively
recent Koalas library provides a pandas-compatible API to perform Spark
operations – this enables data scientists familiar with pandas to work with
Spark without a learning curve.
ML libraries usually incorporate functions for common data
preparation techniques as well. Scikit-learn, for example, provides
functions for filling missing values, standardization, one-hot encoding, etc.
There also exist libraries oriented toward a specific class of problem; for example, the Natural Language Toolkit (NLTK) is a popular library for NLP, which also includes functions for stemming, lemmatization, etc. Similarly,
for computer vision, libraries like scikit-image and Keras provide APIs to
ease the task of transforming images.
Tools
In the past few years, tools like Paxata, Trifacta, etc., have gained popularity as they enable data preparation using an intuitive and friendly user interface. This ease of use allows not only data scientists but also analysts and business stakeholders to work on these tools, thus enabling wider participation in the data preparation step.
Data Engineering
There are two broad areas of data engineering activities to support data
preparation:
Conclusion
In this chapter, we revisited a few data preparation tasks and saw that
multiple techniques are available for each task. We also saw that a deep
understanding of these techniques is important for deciding which
technique is most suited for our problem.
CHAPTER 15
Data Visualization
We looked at data preparation in the previous chapter; let’s now delve deeper into the techniques and technologies for data visualization, which is the next step in our data science process.
We emphasized the importance and benefits of designing effective
visualizations with concrete examples while discussing the different
classes of problems. We also mentioned how data visualization is a vast
subject in itself that covers many different types of charts, legends, layouts,
etc. Each of these provides a variety of simple and advanced options to the
users for greater control. Also, there are mechanisms to add interactive
features to your visualizations or combine existing visualizations to create
your own custom visualizations. We will look at some of these general
aspects and then delve deeper into a few visualizations and how they could
provide insights needed for building effective models in data science. We
will conclude the chapter by discussing a few popular libraries and tools in
the data visualization category.
Graphs/Charts/Plots
While some people will use these terms loosely and interchangeably, a few
others could give enlightening discourses on how they all mean different
things. Without going deeper into that discussion, what we would like to
highlight is that these are the basic building blocks using which you can
design your visualizations. You might already have used or come across
some of these like bar chart, scatter plot, pie chart, etc., but we consider it a
useful investment to go beyond these and broaden your awareness of other
highly effective ones.
Many of these like bar chart, scatter plot, etc. use a horizontal and
vertical axis and display visual elements representing the data. The
position or size of each visual element along each axis is derived using
the values in the data. Suppose you have data related to demographics of
students living in a region. A scatter plot could show each student in this
dataset as a circular marker whose position along the horizontal axis is
based on the height of the student and position on the vertical axis is based
on the weight of the student.
There are other charts that do not use axes like pie chart, treemap,
etc. A treemap shows categories in a hierarchy as nested rectangular tiles
whose area represents some property of the categories. For example,
Figure 15-1 shows the cellular phone sales at a store for different models
from different manufacturers.1 You can see that there is a tile for every
manufacturer whose area represents the total sale for that manufacturer.
And the tiles for the different models from that manufacturer are nested
within the tile of the manufacturer such that the area of each nested tile
represents the sale for its corresponding model. For example, there is a
large red tile whose area represents the total sale for manufacturer Xiaomi,
1. This is purely synthetic data to illustrate a treemap; the data is not intended to represent actual sales of any of the manufacturers mentioned.
and it contains three nested red tiles for the three models of Xiaomi. The
area of each nested red tile represents the sale for the corresponding
model of Xiaomi.
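A treemap of this kind can be produced, for example, with Plotly Express; the sketch below uses made-up sales numbers in the spirit of Figure 15-1:

```python
import pandas as pd
import plotly.express as px

sales = pd.DataFrame({
    "manufacturer": ["Xiaomi", "Xiaomi", "Xiaomi", "Samsung", "Samsung"],
    "model":        ["M1", "M2", "M3", "S1", "S2"],
    "units_sold":   [400, 250, 150, 300, 200],
})

# Nested tiles: one tile per manufacturer, containing a tile per model,
# with the area of each tile proportional to units sold
fig = px.treemap(sales, path=["manufacturer", "model"], values="units_sold")
fig.show()
```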
Another category among the basic building blocks is the node link
diagrams. These make use of nodes and links for showing entities and their
connections and are useful for visualizing network data.
Legends
You can visualize more variables in existing charts by making use of visual
properties like color, size, etc. We discussed previously a scatter plot that
shows each student as a marker whose x coordinate and y coordinate
are based on height and weight, respectively. You could also add Gender
(which contains discrete categories Male and Female) to this visualization
by showing markers for male students in blue and markers for female
students in green. A discrete legend will be added to the plot that shows
what each color represents. Figure 15-2 shows the scatter plot and the
discrete legend enabling visualization of Gender along with height
and weight.
2. The dataset used for all the height/weight plots in this chapter is derived from the NHANES data at https://pypi.org/project/nhanes/
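A plot along the lines of Figure 15-2 can be produced with seaborn, for instance; the sketch below uses a few made-up student records rather than the actual dataset:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

students = pd.DataFrame({
    "Height": [150, 155, 162, 158, 170, 165],
    "Weight": [42, 48, 55, 50, 63, 57],
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
})

# Color the markers by Gender; seaborn adds the discrete legend automatically
sns.scatterplot(data=students, x="Height", y="Weight", hue="Gender")
plt.show()
```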
Figure 15-4. Bubble plot of Weight vs. Height. The size of a bubble
indicates the Age of that student
Layouts
Often you will need to add multiple related charts to your visualization
so that you can compare the data across charts. Layouts allow you to add
multiple charts to your visualization and organize them in different ways.
Depending on your need, you can pick the appropriate layout that makes
your visualization more effective and conveys more insight into trends.
One of the simplest layouts is the overlay layout, which just overlays one axes-
based chart on another. Let’s say you have a stacked bar chart that stacks
the revenue from electronics goods on top of revenue from software for
every year. The overall height of each bar thus indicates the total revenue
for the corresponding year. On this bar chart, you could overlay a scatter
plot that shows the revenue target for each year. Figure 15-5 shows the
scatter plot overlaid on the bar chart. Notice that the bar is lower than the
scatter marker for year 1990 which means that the revenue target was not
met that year as the total revenue was below the revenue target. On the
other hand, in year 1995, the total revenue exceeded the revenue target as
the bar is higher than the scatter marker for that year.
3. Note that different libraries might use different names for this layout.
for every pair of country and year. And you could specify that each cell
should automatically display a bar chart (showing sales of models) for the
corresponding country and year. Figure 15-6 shows the data lattice layout
showing a bar chart for every pair of country and year. The simple bar
chart showing sales for different models would have allowed visualization
of two variables (Model and Sale); the data lattice layout allows you to
visualize two more variables: Country and Year.
Figure 15-6. Data lattice of sales for every <country, year> pair. Each
cell in the lattice shows the sale per model4
4. This is purely synthetic data to illustrate a data lattice; the data is not intended to represent actual sales of any of the car models mentioned.
Options
All charts, legends, and layouts expose options to the users in the form
of properties which can be set to different values to avail more features
or control the output better. For example, you could use the appropriate
option in scatter plot to change the symbol to square shape instead of
circular, or you could use the appropriate option to change an axis of an
axes-based chart to use logarithmic scale. The spectrum of options varies
from the basic ones that result in just cosmetic changes to advanced
ones that involve complex calculations and have a significant impact on
the visualization. We won’t go into further details but encourage you to
increase your awareness of different options available as the insight gained
from a visualization can be greatly enhanced by using the right options.
Interactive Visualizations
Some tools let you design interactive visualizations which enable a deeper
exploration of data or provide dynamic views which are not possible
with simple visualizations. Let’s again consider the simple bar chart we
discussed earlier that shows the sales for car models – each car model is
represented as a bar whose height represents the sale for that model. You
could add an animation to this visualization that shows how the sales for
models changed on a monthly basis. Under the hood, the visualization
creates a bar chart (showing sales for models) for every month and
displays them in quick succession in chronological order. By observing
how the bars increase or decrease with time, you can figure out clearly how
the sale for each model changed with time.
Another important interactive feature is the drill down feature. When
you look at a visualization, you might want to drill deeper into details.
Let’s say you are looking at a bar chart showing the regional sales of a retail
chain in the United States.5 You might want to analyze further the trends
in a particular region and look at the sales in different states of that region.
You might decide to go even deeper into a state and look at the sales for
different cities in that state. You can achieve this by adding a drill down
feature to your visualization. When the visualization is rendered, it starts
with a bar chart showing sales for regions. Figure 15-7 (A) shows this bar
chart. You can then select the bar for a region and choose to drill down
into it; this causes the visualization to go to the next level in the hierarchy
which is State and display the sales for the states in that region. So, if
you chose to drill down on the bar corresponding to the East region as
highlighted in Figure 15-7 (A), you would see the bar chart in Figure 15-7
(B). You can further select the bar for a state and drill down to see the next
level which shows the sales for different cities in that state. Figure 15-7 (C)
shows the bar chart that would be displayed if you chose to drill down on
state New York.
5. The sample dataset used in this example is sourced from www.kaggle.com/rohitsahoo/sales-forecasting
Histogram
You can use a histogram to look at the distribution of values of a numeric
variable. A histogram divides the entire range of values into smaller
intervals called bins and shows a bar for each bin whose height indicates
the number of values falling in that bin. You can draw several conclusions
by looking at a histogram. For example, if all the bars are of similar height,
it means that the number of values in each bin is similar. In other words,
the values are more or less uniformly distributed across the intervals
or bins. On the other hand, if you look at Figure 15-8 which shows the
histogram for weights of students in a class, you can see that the bars for
intervals 30–35, 35–40, and 40–45 are high and the bars for intervals below
30 or above 45 are very small. This means that there are many students
whose weights fall in the intervals 30–35, 35–40, and 40–45 and there are
very few students whose weights fall in the intervals below 30 or intervals
above 45. In other words, a vast majority of students have weights between
30 and 45.
Let’s now see how you could use histograms to derive useful insights
for building ML models. Let’s say you are working toward creating
a model that will be deployed on a wearable device and will predict whether a person has a heart condition based on some health parameters like temperature, blood pressure, etc., that it measures. For building
this model, let’s say you have captured the health parameter values for
approximately a thousand normal people and a thousand people with
heart condition. You could prepare your data in the data preparation
step in such a way that you have a row for each person that contains
their parameter values and a label whose value is Diseased if the person
has a heart condition or Healthy if the person is healthy. The variables
corresponding to the parameters are your features, and the variable
containing the labels Healthy/Diseased is your target. Now you want to
evaluate whether the feature corresponding to a particular parameter is
useful for predicting heart condition. For this, you could plot the histogram
of this feature for healthy people and overlay it on top of the histogram
of this feature for people with heart condition. Figure 15-9 shows these
overlaid histograms for the Healthy and Diseased class. The x axis shows the intervals of feature values, and the y axis shows the number of people.
You can see that the histogram for Healthy class has bars from feature
value 10 to 70 and the histogram for Diseased class has bars from value
50 to 130. So there is a good separation between the histograms of the two
classes – the histogram for Healthy class is shifted left compared to that of
Diseased class. Note that there are high bars of Healthy class for feature
value less than 50 but no bars of Diseased class for feature value less than
50. So our data is indicating that healthy people often have feature value
below 50, but people with heart condition never seem to have a value
below 50. Hence, if you come across a case whose feature value is below
50, you could infer that the person is likely to be healthy. Similarly, our data indicates that people with a heart condition often have a value above 70, but healthy people never seem to have a value above 70. So if you come across a case whose feature value is above 70, you could infer that the person is likely to have a heart condition. So we can see that the value of this feature could give some indication about the presence of a heart condition in a person. Hence, an ML model could learn to use the value of this feature to try to predict whether a person has a heart condition. So it would be a good idea to use this feature while building a model that predicts the presence of a heart condition. You can thus see how looking at histograms gives you hints about which features could be effective in building your predictive models.
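A sketch of such overlaid histograms using Seaborn is shown below; the file name and the columns feature_value and Class (containing the Healthy/Diseased labels) are hypothetical:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

health = pd.read_csv("health_parameters.csv")  # hypothetical columns: feature_value, Class

# Overlay the histogram of the feature for one class on top of the other
sns.histplot(data=health, x="feature_value", hue="Class", bins=20)
plt.ylabel("Number of people")
plt.show()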
Kernel Density Estimate (KDE) Plot
A kernel density estimate (KDE) plot is another way of looking at the distribution of values of a numeric variable: instead of bars, it shows a smooth curve whose height represents the probability density of the values. You can see in the figure that the area under the curve between weight 30 and 45 is much larger than the area under the curve below weight 30. This
means that the probability of a student’s weight falling between 30 and
45 is much more than the probability of the weight falling below 30. This
is because we saw in the histogram in Figure 15-8 that a vast majority of
students have weights between 30 and 45. You can thus see that a KDE plot
is like a smoother version of a histogram.
The benefit of using a KDE plot instead of histogram becomes evident
when you plot multiple distributions. Let’s revisit the feature whose
distributions for healthy people and people with heart condition were
plotted using histograms in Figure 15-9. Let’s now overlay the KDE plot for
these thousand healthy people on the KDE plot for the thousand people
with heart condition instead of histograms as shown in Figure 15-11. The x
axis in this figure represents the feature value, and the y axis represents the
probability density. You can see that the overlaid KDE plots in this figure
look like smooth versions of the overlaid histograms in Figure 15-9, but the
benefit you get with KDE plots is that the KDE plots look less cluttered and
are more readable than the histograms. The reduction in clutter would be
more evident if you were comparing, say, four overlaid histograms with the
corresponding four overlaid KDE plots.
Figure 15-11. Overlaid Kernel Density Estimate (KDE) plots for the
“Healthy” class and “Diseased” class
Let’s now see how you can get insights from KDE plots for building
models. We can see in the figure that there is good separation between
the KDE plots of the two classes – KDE plot of Healthy class is shifted left
compared to KDE plot of Diseased class. Note that the area under the
Healthy curve below feature value 50 is large, whereas the area under the
Diseased curve below feature value 50 is close to 0. This means that the
probability of a healthy person’s feature value being less than 50 is high,
whereas the probability of the value being below 50 for a person with
heart condition is almost nil. So our plots are indicating that the value of the feature for a healthy person will often be below 50 and the value for a person with a heart condition will rarely be below 50. Hence, if you come across a person whose feature value is 40, you could infer that the person is likely to be healthy. Similarly, the area under the Diseased curve above feature value 75 is large, whereas the area under the Healthy curve above feature value 75 is close to 0. So our plots are indicating that the feature value for a person with a heart condition will often be above 75 and the value for a healthy person will rarely be above 75. Hence, if you come across a person whose feature value is 90, you could conclude that they are likely to have a heart condition. Since the value of this feature gives you some hint about the presence or absence of a heart condition, you could use this feature to build models that predict heart condition. Thus, you could evaluate the suitability of this feature for building models by looking at the KDE plots.
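Creating the corresponding overlaid KDE plots with Seaborn is a one-line change from the histogram sketch above; the same hypothetical file and columns are assumed:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

health = pd.read_csv("health_parameters.csv")  # hypothetical columns: feature_value, Class

# kdeplot draws a smooth probability density curve for each class
sns.kdeplot(data=health, x="feature_value", hue="Class", common_norm=False)
plt.ylabel("Probability density")
plt.show()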
Libraries
There are several visualization libraries geared toward data science. For
example, in the Python ecosystem, Matplotlib is one of the oldest, and still
very popular, libraries. Seaborn6 is another popular library in Python used
by data scientists. We have used these two libraries for creating the scatter
plots, histograms, and KDE plots in this chapter.
6. Also based on Matplotlib.
Tools
The space of data visualization tools, which also comprises business intelligence (BI) and visual analytics tools, is one of the oldest in the analytics arena. Tools like Tableau have a rich history dating back to the turn of the century. The expansion of data science in the past few years has seen several BI tools extend their capabilities to support the data preparation and data visualization needs of data scientists, including integration with Python and other languages.
In our experience, Tableau and SAS Visual Analytics are two tools that
represent the wide gamut of capabilities offered for data visualization and
are popular among data scientists in academia and industry alike. Other
popular tools include PowerBI, Looker, etc. – this is a very crowded space
with numerous popular tools.7
Being aware of this category of tools will enable you to identify if your
organization already has a tool (such as a BI tool) which can be leveraged
by data scientists as well. This can significantly improve collaboration
between the data scientists and other stakeholders. These tools also make it easy to embed views and dashboards into existing web applications – this can be very useful for incorporating the visualizations created by data scientists into existing internal operations portals.8
Data Engineering
If the data science team is working with data sizes that can fit on a single
machine and using visualization libraries such as Matplotlib, Seaborn, etc.,
then there aren’t many data engineering activities required for the data
visualization step.
7. Recent acquisitions of Tableau by Salesforce and Looker by Google continue to rapidly evolve this space.
8. This especially applies to DSI-Proc projects; see Chapter 23.
But when the data size is large and clusters such as Spark are used,
data engineering would ensure that efficient ad hoc queries are supported
for data visualization.
In addition to ensuring efficient queries, when BI or visual analytics
tools are used for data visualization, the following data engineering
activities are typically required:
Conclusion
In this chapter, we tried to emphasize that data visualization is much
more than just bar charts and scatter plots. We discussed some general
aspects related to the field of data visualization and also discussed how we
can derive insights from visualizations. These insights can help us build
effective ML models which is the focus of our next chapter.
CHAPTER 16
Machine Learning
Machine learning, as we have seen, is at the heart of the data science
process as it is in this step that the actual models are built. This chapter is
dedicated to the ML algorithms/techniques you can use to build models
and libraries that implement these algorithms. Awareness of different ML algorithms and an intuitive understanding of the underlying concepts are crucial for the success of the entire data science process. We will start with
a general categorization of ML algorithms and then look at a few popular
algorithms. We will then discuss model performance evaluation that can
help you evaluate the effectiveness of your models. This evaluation helps
you choose the best model from multiple candidate models you might
have built and also gives you an idea of how well the chosen model is likely
to perform when deployed in production.
Supervised Learning
A supervised learning algorithm, as you might have guessed from the
name, requires human supervision. As part of this supervision, you need
to tell the algorithm what the correct labels are for existing observations.
The algorithm learns the relationships between these observations and
their labels and is then able to predict the label for a new observation. For
example, if you are using a supervised algorithm to build a model that
can predict whether a new digital payment transaction is fraudulent, you
will not only provide the details of the existing transactions (like location
of transaction, amount and mode of payment, etc.) but will also need to
provide labels Fraud/Non-Fraud for these transactions to the algorithm.
The model built by the algorithm will then be able to predict which new
transaction is fraudulent based on the details of the transaction. We have
seen earlier that the variables corresponding to such details (like amount
of payment, etc.) based on which prediction is done are called features and
the variable corresponding to the labels is called target. Let’s again look
at the example from Chapter 4 where we were trying to predict whether
an interested customer is likely to eventually buy the car or not based on
their gender, age, occupation, and annual income. Figure 16-1 shows the
data that we passed to the machine learning algorithm. Note that we not
only passed the details of past customers (like gender, age, etc.) to the
algorithm, but we also passed the labels (Purchased/Not Purchased) for
those customers. Hence, the algorithm we used in that example (decision
tree algorithm) is a supervised learning algorithm. The figure also points
out the features, target variable, and labels for this case.
Unsupervised Learning
An unsupervised learning algorithm works without human supervision,
that is, you do not provide any labels for the observations. The algorithm
tries to learn patterns on its own from the unlabeled data. Clustering
algorithms, like agglomerative clustering that we saw in Chapter 7, are
examples of unsupervised learning algorithms. A clustering algorithm
tries to divide the set of unlabeled observations into groups or clusters
such that observations belonging to the same cluster are more similar
than observations belonging to different clusters. We saw in Chapter 7, for
example, that running a clustering algorithm on customer data can group
customers with similar characteristics into segments. Another popular
clustering algorithm is k-means.
Some anomaly detection algorithms, such as local outlier factor that
we saw in Chapter 8, are also unsupervised algorithms as they aim to
detect anomalous observations in data without any labels. Other examples of popular unsupervised anomaly detection algorithms are isolation forest and one-class SVM.
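As a rough sketch, scikit-learn provides implementations of several such unsupervised algorithms; the tiny, unlabeled dataset X below is hypothetical:

from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

# Hypothetical unlabeled observations: one row per customer, columns could be age and income
X = [[25, 40000], [27, 42000], [45, 90000], [47, 95000], [30, 30000], [52, 99000]]

# Group the observations into two clusters (customer segments)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(X)

# Flag unusual observations; -1 indicates an anomaly, 1 a normal observation
iso = IsolationForest(random_state=0)
anomalies = iso.fit_predict(X)

print(segments, anomalies)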
It is important to note that unsupervised learning algorithms are
particularly important for KDD/data mining projects – we shall revisit this
in Chapter 23.
Reinforcement Learning
We had looked at reinforcement learning in Chapter 11. These algorithms
that aim to take decisions to optimize long-term results are rather different
in nature from the supervised and unsupervised algorithms, as they
interact with their environment and learn based on rewards given for their
decisions.
Linear Regression
The linear regression algorithm is one of the most popular regression
machine learning algorithms. As mentioned earlier, this algorithm uses the prepared data to build a linear regression model, which is a linear equation that explains how the value of the target variable can be calculated from the values of the features. It is a favorite with many data scientists who
often start their initial experimentation on regression problems with linear
regression models. The equation created by the linear regression algorithm
is used to calculate the target value for a new observation. The equation
also provides insight into how the target variable is related to the features.
Let’s try and understand how the linear regression algorithm works by
revisiting the example from regression class of problems1 where the goal
was to predict target Claim Amount based on the values of features Gender,
Age, and Smoking Status. Figure 16-2 shows the prepared data from that
example. Recall that Male was represented as 1 and Female as 0 in Gender
feature and Smoker was represented as 1 and Non Smoker as 0 in Smoking
Status feature.
1. Chapter 5.
The algorithm assumes that the target can be calculated using a linear equation of the form
Claim Amount = b0 + b1 × Gender + b2 × Age + b3 × Smoking Status
where b1, b2, and b3 are weights for features Gender, Age, and Smoking
Status, respectively, and b0 is the constant term (also called intercept
term). As per this equation, the algorithm would predict the Claim
Amount for the customer whose Gender = 0, Age = 40, and Smoking Status
= 0 to be
b0 + b1 × 0 + b2 × 40 + b3 × 0
The column Predicted Claim Amount in Figure 16-3 shows the claim
amounts that the algorithm would predict for the existing customers
using this equation. The error in prediction for an existing customer is the
difference between the predicted claim amount and the actual amount
claimed by that customer. The column Prediction Error in this figure shows
the prediction errors for the existing customers. Note that the value in this
column is simply the value in the column Predicted Claim Amount minus
the value in the column Claim Amount.
2. We have skipped discussion of feature scaling to focus on the main concepts of linear regression.
Figure 16-3. Predicted claim amounts and prediction errors for the
existing customers
The algorithm tries to estimate b0, b1, b2, and b3 for which the overall prediction error – typically measured as the sum of the squared prediction errors across all existing customers – will be minimum. In other words, it tries to calculate the values of b0, b1, b2, and b3 for which this overall error has the minimum value, by making use of techniques like gradient descent. In this case, it finds that the overall error is minimized when b0 = -6.9, b1 = 2.1, b2 = 0.4, and b3 = 2.9. Once the algorithm has thus zeroed in on the right values for the weights and intercept term, it uses these values to concretize the original equation
Claim Amount = b0 + b1 × Gender + b2 × Age + b3 × Smoking Status
into
Claim Amount = -6.9 + 2.1 × Gender + 0.4 × Age + 2.9 × Smoking Status
or, rearranging the terms,
Claim Amount = 0.4 × Age + 2.1 × Gender + 2.9 × Smoking Status - 6.9
This equation now represents the linear regression model and can be used to predict the claim amount for a new customer. It also gives you insights into how the target is related to the features: for example, each additional year of Age increases the predicted Claim Amount by 0.4, and being a Smoker (Smoking Status = 1) increases it by 2.9.
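A rough sketch of fitting such a model with scikit-learn is shown below; the tiny dataset only mirrors the structure of the prepared data (Gender, Age, Smoking Status, Claim Amount) and its values are hypothetical:

from sklearn.linear_model import LinearRegression

# Hypothetical prepared data: [Gender, Age, Smoking Status] and the claim amounts
X = [[1, 40, 1], [0, 35, 0], [1, 55, 0], [0, 60, 1], [1, 30, 0]]
y = [14.0, 7.5, 17.2, 20.3, 5.6]

model = LinearRegression()
model.fit(X, y)

# The learned weights (b1, b2, b3) and intercept term (b0) define the equation
print(model.coef_, model.intercept_)

# Predict the claim amount for a new customer: Female, 40 years old, nonsmoker
print(model.predict([[0, 40, 0]]))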
Logistic Regression
Just like many data scientists begin experimentation for regression
problems with linear regression, logistic regression3 is a favorite for initial
experimentation for classification problems. Let’s try and understand
logistic regression using an example similar to the insurance example in
the previous section. However, this time, our goal is to predict only whether
a new customer will make a claim or not instead of predicting the claim
amount. Let’s assume we have data from a different insurance company
which contains Gender, Age, and Smoking Status for each existing customer
like the data in the previous section. But, instead of Claim Amount, this data
contains a variable called Claim Status whose value for a customer is either Claim or No Claim, indicating whether that customer has made a claim.
3. A lot of interesting content is available online that explains why this algorithm is called “regression” even though it is used so commonly for classification problems.
4. We have skipped discussion of feature scaling to focus on the main concepts of logistic regression.
The logistic regression algorithm makes use of the logistic (sigmoid) function f(x) = 1/(1 + e^(-x)). For x = 0, the value of this function is 0.5. This is because the e^(-x) term in the function definition becomes 1 for x = 0, causing the overall value of the function to be 1/(1+1) or 0.5. As x increases toward large positive values, the e^(-x) term becomes close to 0, causing the overall function value to become close to 1. And as x decreases toward large negative values, the e^(-x) term becomes very large, causing the overall function value to become close to 0. So the value of the function is guaranteed to be between 0 and 1.
The algorithm models the probability of a claim as
p_claim = f(b0 + b1 × Gender + b2 × Age + b3 × Smoking Status)
where p_claim is the probability that the customer will make a claim,
f is the logistic function,
b0 is the intercept term,
b1, b2, b3 are weights for the features.
Using this equation, the algorithm would predict the probability of making a claim for a customer whose Gender = 1, Age = 50, and Smoking Status = 0 to be
f(b0 + b1 × 1 + b2 × 50 + b3 × 0)
The algorithm could also predict the probability of not making a claim
by subtracting the probability of making a claim from 1. This could be
represented as
p_no_claim = 1 - p_claim = 1 - f(b0 + b1 × Gender + b2 × Age + b3 × Smoking Status)
where p_no_claim is the probability that the customer will not make a claim.
Using this equation, the algorithm would predict the probability of not
making a claim for a customer whose Gender = 0, Age = 30, and Smoking
Status = 0 to be
1 - f(b0 + b1 × 0 + b2 × 30 + b3 × 0)
5. This is an extremely simplified explanation that skips a discussion of the actual function which the algorithm tries to minimize.
Let’s say the algorithm finds that b0 = -80, b1 = 10, b2 = 2, b3 = 10 are the optimum values for the weights and intercept term. It will use these values to concretize the equation
p_claim = f(b0 + b1 × Gender + b2 × Age + b3 × Smoking Status)
into
p_claim = f(-80 + 10 × Gender + 2 × Age + 10 × Smoking Status)
Now this equation represents the logistic regression model and can be
used to predict the probability of making a claim for a new customer. For
example, the probability that a 37-year-old male smoker (Gender = 1, Age
= 37, Smoking Status = 1) will make a claim is given by
f(-80 + 10 × 1 + 2 × 37 + 10 × 1)
= f(14)
which is very close to 1 – so the model predicts that this customer is very likely to make a claim.
We can see that for larger values of age, the value of the inner
expression (which is input for the logistic function in this equation) will
also be larger. And the plot of logistic function discussed earlier tells us
that as input value becomes larger, the logistic function value becomes
larger too. So the probability of making a claim will be larger for larger
values of age. This means that older people are more likely to make a
claim. Similarly for smokers (Smoking Status = 1), the inner expression
increases by 10. With this increase in the value of inner expression, logistic
function value also increases. So the probability of making a claim is
higher for smokers. In other words, smokers are more likely to make a
claim. You could similarly analyze the effect of gender on the likelihood of
making a claim.
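As a rough sketch, such a logistic regression model could be built with scikit-learn as follows; the tiny dataset is hypothetical, with 1 indicating that the customer made a claim:

from sklearn.linear_model import LogisticRegression

# Hypothetical prepared data: [Gender, Age, Smoking Status] and the claim status (1 = Claim)
X = [[1, 55, 1], [0, 30, 0], [1, 62, 0], [0, 45, 1], [1, 25, 0], [0, 58, 1]]
y = [1, 0, 1, 1, 0, 1]

model = LogisticRegression()
model.fit(X, y)

# Probabilities of No Claim and Claim for a 37-year-old male smoker
print(model.predict_proba([[1, 37, 1]]))

# The learned weights and intercept define the expression inside the logistic function
print(model.coef_, model.intercept_)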
Let’s analyze the equation of the model further to find more insights. We will assume a model probability cutoff of 0.5 for this discussion. Now, let’s say we were interested in analyzing the female nonsmokers (Gender = 0, Smoking Status = 0) group. Substituting the values of Gender and Smoking Status for this group in the equation of the model gives us
p_claim = f(-80 + 10 × 0 + 2 × Age + 10 × 0)
which is equivalent to
p_claim = f(2 × Age - 80)
Since the logistic function value crosses 0.5 exactly when its input is 0, the model predicts a claim for a female nonsmoker only when 2 × Age - 80 is greater than 0, that is, when her age is above 40.
Support Vector Machine
The support vector machine (SVM) is another popular supervised learning algorithm, commonly used for classification problems. Let’s understand the idea behind it with a simple example. Let’s say our data contains two features f1 and f2 and a target variable t whose value is either N (representing the Negative class) or P (representing the Positive class). This data is shown in Figure 16-7.
f1    f2    t
…     …     …
1     4     P
4     2     N
…     …     …
Figure 16-8. Scatter plot for our simple data and the imaginary
separating line
As the scatter plot in Figure 16-8 shows, it is possible to draw a line – more generally, a hyperplane – that separates the observations of the two classes. And you could use a linear SVM to find this hyperplane for such linearly separable data, which can then be used to predict the class for a new observation.
However, data is not always linearly separable. Let’s look at a popular
toy example to understand this. Figure 16-9 shows this toy data with
just one feature f1 and target t whose value could be P (positive) or N
(negative). Figure 16-10 shows the plot for this data in which you can
see that the data is not linearly separable because you cannot find a
hyperplane that separates the positive and negative observations. For such
cases where the data is not linearly separable, you could use a nonlinear
SVM to separate the classes.
f1    t
…     …
-5    P
1     N
3     P
…     …
In such cases, a nonlinear SVM effectively transforms the data by deriving new features so that the transformed data becomes linearly separable. For our toy data, deriving a new feature f2 whose value is the square of f1 gives the following data:
f1    f2    t
…     …     …
-5    25    P
1     1     N
3     9     P
…     …     …
If you now plot this transformed data, as shown in Figure 16-12, you can see that the positive and negative observations have become linearly separable.
Figure 16-12. Scatter plot for toy data with derived feature
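In scikit-learn, both variants are available through the SVC class; the toy data below is hypothetical, and the rbf kernel performs a transformation that is similar in spirit to the derived-feature trick above:

from sklearn.svm import SVC

# Hypothetical toy data with one feature f1; P is encoded as 1 and N as 0
X = [[-5], [-4], [1], [0], [3], [4]]
y = [1, 1, 0, 0, 1, 1]

# A linear SVM cannot separate this data...
linear_svm = SVC(kernel="linear").fit(X, y)

# ...but a nonlinear SVM with the rbf kernel can
nonlinear_svm = SVC(kernel="rbf").fit(X, y)
print(nonlinear_svm.predict([[-6], [0.5], [5]]))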
Decision Tree
The decision tree algorithm is another popular supervised machine learning algorithm that is used for both classification and regression problems. We already saw an example of a decision tree when we discussed classification problems. Let’s try and understand the decision tree algorithm in more detail with another example. Let’s say you are building an app for the employees of your organization which can monitor their health and give timely warnings about the onset of various illnesses. One of the features of this app indicates whether the user is at risk of heart disease. Behind the scenes, the app will use a machine learning model to predict whether the user is likely to have heart disease. Your job is to build this model using the available data, which has roughly an equal number of healthy people and people with heart disease.
Figure 16-13 shows a snapshot of the data for adults that we will use for
building this predictive model. Gender and Smoking Status in the table are
straightforward. The Weight Category column can have one of three values: Normal (indicating that the person is in the normal weight range for their height and gender), Underweight (indicating that the person is below the normal weight range), and Overweight (indicating that the person is above the normal weight range). The Exercises column indicates whether the person exercises
regularly or not. Health Status column indicates whether the person is healthy
or has heart disease. We will try and build a simple model that predicts the value of the target variable Health Status based on the values of the other four columns, which are our features for this example. In reality, you would want to use more sophisticated features to build a very accurate model.
If we give these features and target variable to the decision tree algorithm,
it will build a decision tree model. A decision tree model is simply a tree6 in
which non-leaf nodes inspect values of features and leaf nodes predict the
target value. Figure 16-14 shows a partial view of the decision tree model
built by the decision tree algorithm for the previous data. You can see that the
model inspects the value of Exercises feature in the first non-leaf node and
goes right7 if the value is Yes and goes left if the value is No. If the value is Yes,
it then checks the value of the feature Smoking Status in another non-leaf
node. If the value of Smoking Status is Smoker, it goes right, and if the value is
Non Smoker, it goes left to a leaf node. This leaf node predicts the target value
as Healthy. Other non-leaf and leaf nodes could be similarly explained.
6. A tree is a representation that looks like an inverted real-life tree: with a root node at the top and leaf nodes at the bottom.
7. Reader’s right.
Figure 16-14. Partial view of the decision tree model for predicting heart disease
Let’s now see how this model predicts whether a new employee is at risk of heart disease. Say the new employee does not exercise, does not smoke, and is overweight; Figure 16-15 shows how the model traverses the tree using these feature values for making the prediction. The model will start at the root node Exercises
and take the left8 branch corresponding to No. It will again take the left
branch corresponding to Non Smoker from node Smoking Status and take
the rightmost branch corresponding to Overweight from node Weight
Category. It thus reaches the leaf node Heart Disease (surrounded by red
rectangle in the figure) and hence predicts Heart Disease for the employee.
Figure 16-15. Traversing the decision tree for predicting heart disease for the employee
8. Reader’s left.
Let’s take a step back and understand why the model predicts Heart
Disease when a person reaches this leaf node inside the red rectangle.
The model knows from the tree structure that if a person reaches this
leaf node, they must belong to the group of non-exercising nonsmoking
overweight people. And the model observed that all people in this group
in the data it was provided had heart disease.9 So the model confidently
predicts Heart Disease for this person. Similarly if a person reaches the
leaf node inside the green rectangle, the model knows that they must
belong to the group of exercising nonsmoking people and all people in
this group in the data were healthy. So the model confidently predicts
this person as Healthy. The important point to note here is that the model confidently predicted the target value as Heart Disease in the first case because the leaf node reached corresponds to a group (non-exercising, nonsmoking, overweight people) which has a very high predominance of people with heart disease in the data provided. And the model confidently predicted the target value as Healthy in the second case because the leaf node reached corresponds to a group (exercising nonsmoking people) which has a very high predominance of healthy people in the data provided. So we can see that, to be able to confidently predict the target value, the leaf nodes should correspond to groups that have a very high predominance of one class (Healthy or Heart Disease). Or, in short, the leaf nodes need to have a very high predominance of one class.
The algorithm uses measures like entropy to evaluate whether a
node has a very high predominance of one class. Just like the leaf nodes
inside the red and green rectangles correspond to certain groups as mentioned previously, every node in the tree corresponds to a certain group. For example, the node to the right of the root node corresponds to the group of people who exercise regularly. The entropy of a node measures how mixed the classes are in the node’s group: it is 0 when the group contains only one class, and it is highest when the classes are present in equal proportion.
9. This is true for our data even though it is not visible in the partial view we have shown in Figure 16-13.
For the root node, where the proportions of the Healthy and Heart Disease classes are both roughly 0.5, the entropy is
-0.5 × log2(0.5) - 0.5 × log2(0.5)
= -0.5 × (-1) - 0.5 × (-1)
= 1
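As a small illustration, the entropy of a node can be computed from the class proportions in its group as follows:

from math import log2

def entropy(proportions):
    # proportions are the fractions of each class in the node's group
    return -sum(p * log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))   # maximum impurity: 1.0
print(entropy([1.0, 0.0]))   # a pure node: 0.0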
Now we can finally look at how the algorithm creates the tree. The
algorithm starts with the root node which corresponds to the group of “all”
people which has roughly equal number of healthy people and people
with heart disease in our data. So no class is predominant in the root node.
It then splits the root node on that feature from the feature set for which
the child nodes have a high predominance of one class. So it will pick the
feature for which the child nodes have the lowest entropy. It then repeats
this for each child node, that is, it splits each child node further using
the feature that results in children with high predominance of one class
(or lowest entropy). And it keeps doing this to build the tree. Thus, the
algorithm keeps splitting to create nodes with high predominance of one
class. Hence, by the time it has reached the leaf nodes, there will be a very
high predominance of one class. This is the reason that the leaf node in the
red rectangle earlier had a very high predominance of class Heart Disease
(the group of people corresponding to this leaf node had people with heart
disease only and no healthy people as we saw earlier) and the leaf node in
the green rectangle earlier had a very high predominance of class Healthy
(the group of people corresponding to this leaf node had only healthy
people and no people with heart disease as we saw earlier). And as we
saw earlier, a very high predominance of one class in a leaf node gives the
model the confidence to predict that class as the target value for the person
who has reached that leaf node during prediction.
Decision tree models also provide insights into the trends within data
and the underlying processes. For example, you can tell looking at the tree
that if a person exercises and does not smoke, the decision tree will predict
that person to be healthy irrespective of that person’s gender and weight
category. More specifically, the important insight we gain is that even an
overweight person is likely to be healthy if that person exercises and does
not smoke.
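A sketch of building such a decision tree with scikit-learn is shown below; the tiny DataFrame is hypothetical, and the categorical features are converted into numeric columns (as discussed in Chapter 14) before training:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical rows of the health data
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Smoking Status": ["Smoker", "Non Smoker", "Non Smoker", "Smoker"],
    "Weight Category": ["Overweight", "Normal", "Overweight", "Underweight"],
    "Exercises": ["No", "Yes", "No", "Yes"],
    "Health Status": ["Heart Disease", "Healthy", "Heart Disease", "Healthy"],
})

X = pd.get_dummies(data.drop(columns=["Health Status"]))  # categorical features to numeric
y = data["Health Status"]

tree = DecisionTreeClassifier(criterion="entropy")  # split nodes using entropy
tree.fit(X, y)

# Print the learned tree structure to inspect the splits and the leaf predictions
print(export_text(tree, feature_names=list(X.columns)))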
Random Forest
A close relative of the decision tree algorithm is the random forest
algorithm. The random forest algorithm can be used for classification as well as regression problems. We saw that the model created by the decision
tree algorithm is simply a decision tree. The model created by the random
forest algorithm, on the other hand, is a forest of decision trees. So the
random forest algorithm, instead of trusting the prediction from a single
decision tree, forms a team of multiple decision trees each of which gives
its prediction for a new observation. The predictions from all the trees are
considered to arrive at the final prediction for the new observation.
For classification problems, each decision tree in the forest predicts
the class which it thinks the new observation might belong to. In other
words, each tree recommends a class for the new observation. The class
that is recommended by the largest number of trees is chosen as the final
predicted class. Let’s assume we gave the data of healthy people and
people with heart disease from previous section to the random forest
algorithm which builds a forest of trees. Figure 16-16 shows how the
trees collaborate to predict whether a new employee has heart disease.
You can see that 80 trees think this new employee is healthy and 20 feel
the employee has heart disease. Since a larger number of trees think the
employee is healthy, the model predicts that the employee is healthy.
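A minimal sketch with scikit-learn, using hypothetical encoded feature rows and labels, could look like this:

from sklearn.ensemble import RandomForestClassifier

# Hypothetical encoded feature rows and Health Status labels
X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
y = ["Heart Disease", "Healthy", "Heart Disease", "Healthy"]

# The model is a forest of 100 decision trees; each tree votes and the majority class wins
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[1, 0, 1, 0]]))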
Gradient Boosted Trees
Gradient boosting is another popular technique that combines multiple decision trees. Here, the trees are built sequentially, with each new tree trying to correct the errors made by the trees built before it. Let’s walk through a simplified regression example with features f1, f2, f3 and a numeric target t; one of the rows in this data has f1 = 5, f2 = 7, f3 = 2, and t = 10 (this original data appears in the top-left table of the accompanying figure). The algorithm first builds a decision tree that predicts t from the features; the bottom-left table shows the target values predicted by this first tree (predicted_t), which is 8 for the row just mentioned.
The algorithm now builds a second decision tree that can predict the
errors that the first tree will make in predicting target values. For this,
the algorithm calculates the error made by the first tree for each existing
observation which is the difference between the actual target value
and target value predicted by the first tree (i.e., t minus predicted_t in
bottom-left table). The top-right table shows the errors made by the first
tree (in column error_1st_tree10) along with the feature values for existing
observations that are passed to the second tree. One of the rows in this
data tells the second tree that the error made by the first tree is 2 when f1
is 5, f2 is 7, and f3 is 2. The second tree uses this data to learn to predict
the errors that the first tree will make. Once the second tree has learned, it
can be used to predict the errors that the first tree makes. The additional
column predicted_error_1st_tree in the bottom-right table shows the errors
that the second tree predicts will be made by the first tree for the existing
observations. For example, for the observation whose f1 is 5, f2 is 7, and
f3 is 2, the second tree predicts that the first tree will make an error of 1.7,
whereas the actual error made by the first tree was 2.
Now that the trees are built, predicting the target value for new
observations is simple. Let’s say we have a new observation which is quite
similar to the existing observation in our original data (which was shown
in the top-left table of previous figure). So for this new observation too,
f1 is 5, f2 is 7, and f3 is 2. Let’s see how the gradient boosted trees predict
the target value using the feature values for this new observation – notice
how this prediction approach explained next tries to make the predicted
target value close to the actual target value for this new observation (we
will assume the actual target value for the new observation to also be 10
like the existing observation since the new observation is very similar
to the existing observation, but this actual target value will obviously be
unknown to us in practical scenarios). Now, for these feature values, the
10. The errors will be negative for rows where actual target value is less than predicted target value.
first tree will predict a target value of 8 (we can tell this from the bottom-left table of the previous figure). Now the second tree is used to improve this prediction and make it closer to the actual target value. For these feature values, the second tree will predict an error of 1.7 by the first tree (we can tell this from the bottom-right table of the previous figure). The predicted target
value from the first tree will now be corrected using the error of the first
tree as predicted by the second tree to get the final target value that will
be predicted for this observation. So the predicted error given by the
second tree (1.7) is added to the predicted target value from the first tree
(8) to get the corrected target value (9.7) that gets finally predicted for this
observation. Thus, the final predicted target value is close to the actual
target value (which we assumed to be 10 as mentioned earlier).
Let’s see how we can improve the predictions further. We saw in the
bottom-right table previously that the actual error made by the first tree is
2 when f1 is 5, f2 is 7, and f3 is 2, but the second tree predicts the error to
be 1.7. This means that the second tree itself has an error of 0.3 for these
feature values. So you could configure the algorithm to build a third tree
that can predict the errors of the second tree. This is shown in Figure 16-
18. You can see in the figure that the third tree learns from the data passed
to it that the error made by the second tree is 0.3 when f1 is 5, f2 is 7, and
f3 is 2. And when the third tree is in turn asked to predict the error made
by the second tree for these feature values, it predicts 0.25 as you can see
in the figure. Once the three trees are built, when we make a prediction for
the new observation (whose f1 is 5, f2 is 7, f3 is 2), the first tree will predict
that the target value is 8, the second tree will predict that the first tree will
make an error of 1.7, and the third tree will predict that the second tree will
make an error of 0.25. So the final predicted target value will be equal to
the predicted target value from the first tree (8) plus error of the first tree as
predicted by the second tree (1.7) plus error of the second tree as predicted
by the third tree (0.25) which is equal to 9.95. Hence, the third tree further
helps us make the final predicted target value closer to actual target value
(which we assumed to be 10).
You can configure the algorithm to use more and more trees
sequentially in a similar fashion. And even when you have many trees,
while predicting for a new observation, the predictions by all trees are
added, and the resulting sum is predicted as the final target value. This can
be represented as a simple equation:
Final Predicted Target Value = Prediction by 1st tree + Prediction by 2nd tree
+ Prediction by 3rd tree + Prediction by 4th tree + Prediction by 5th tree + …
The tables in Figure 16-18 extend the earlier ones with the data passed to the third tree: for the row with f1 = 5, f2 = 7, and f3 = 2, error_1st_tree is 2 and error_2nd_tree is 0.3.
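The following sketch mimics this idea with two sequential trees built using scikit-learn’s DecisionTreeRegressor; the tiny dataset is hypothetical, and libraries such as scikit-learn’s GradientBoostingRegressor, XGBoost, and LightGBM implement the full algorithm with refinements (such as a learning rate) that we have skipped here:

from sklearn.tree import DecisionTreeRegressor

# Hypothetical data with features f1, f2, f3 and target t
X = [[5, 7, 2], [3, 1, 8], [6, 2, 4], [9, 5, 1]]
t = [10.0, 4.0, 7.0, 12.0]

# First tree predicts the target directly
tree1 = DecisionTreeRegressor(max_depth=1).fit(X, t)
errors1 = [actual - pred for actual, pred in zip(t, tree1.predict(X))]

# Second tree learns to predict the errors made by the first tree
tree2 = DecisionTreeRegressor(max_depth=1).fit(X, errors1)

# Final prediction = prediction of tree 1 + error of tree 1 as predicted by tree 2
x_new = [[5, 7, 2]]
print(tree1.predict(x_new)[0] + tree2.predict(x_new)[0])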
Artificial Neural Network
Artificial neural networks are another popular family of machine learning algorithms, commonly used for both classification and regression problems. Let’s understand them using an example where we want to automatically categorize incoming emails into one of three categories: Product Development, Research Work, or Trainings. Let’s say the prepared data contains a feature for each important word (such as context, security, fix, and release) whose value is 1 if that word occurs in the email and 0 otherwise, along with a target variable Category that contains the category label for each existing email.
We will convert the single target variable Category into three target
variables corresponding to the three possible categories: Product
Development, Research Work, and Trainings. Figure 16-20 shows the
prepared data11 that has been transformed12 to have three target variables
which can be seen in the last three columns. If an email belongs to a certain
category, the value of the corresponding target variable for that email is 1
and the other two target variables have a value of 0. For example, the first
email in the figure belongs to category Product Development, so the value
of target variable Product Development is 1, the value of target variable
Research Work is 0, and the value of target variable Trainings is also 0. Let’s
now build an artificial neural network that can predict the values of the
three target variables using the values of the features.
11. We have skipped discussion of feature scaling to focus on the main concepts of artificial neural networks.
12. We saw similar transformations when we converted categorical variables into numeric variables in Chapter 14.
The figure shows the neural network we will use: it has an input neuron for each feature (context, security, fix, release, and so on), two hidden layers, and an output neuron for each of the three target variables (Product Development, Research Work, and Trainings).
The first column in this figure is the input layer and contains a neuron for
each feature. Note there is a neuron for feature context, a neuron for feature
security, and so on. Each neuron in this layer simply receives the value of
the corresponding feature for an email and outputs the same value without
changing it. The output of each neuron in the input layer (which is simply a
feature value as we just saw) is fed to all neurons in hidden layer 1. Thus, each
neuron in the first hidden layer receives all feature values as input.
The neurons in hidden layer 1, hidden layer 2, and output layer are
different from neurons in the input layer. What each neuron in these three
layers does is similar to what we saw in logistic regression. It calculates
the weighted sum of its inputs, adds a bias, and uses the sigmoid function
to convert the resulting value into a value between 0 and 1. This can be
represented as an equation:
output = f(w1 × input1 + w2 × input2 + … + wn × inputn + bias)
where f is the sigmoid (logistic) function and w1, w2, …, wn are the weights of the neuron.
13. For example, random values.
To learn from the data, the neural network starts with some initial values13 for the weights of its neurons. It is then fed the feature values of the first email and produces predicted values for the three target variables. The network measures the error in its prediction by comparing these predicted target values with the actual target values for this email, which are known from the data provided. It then adjusts the weights of the neurons
in the network in such a way that the error for this email decreases. The
figure also depicts these steps of measuring the error and adjusting the
weights. Let’s take a closer look at the process of adjusting the weights. The
error in prediction for this email depends on the actual target values and
predicted target values. The predicted target values themselves depend on
the feature values for this email and weights of neurons. So we could say
that the error depends on the feature values, weights of neurons, and actual
target values for the email. Since the feature values and actual target values
for this email are fixed as present in the data, the error for this email ends
up being a function of the weights of neurons.14 So the neural network could
use techniques similar to gradient descent to change the weights from their
current values to new values in a way that the error decreases.
The figure depicts these steps for the first email: its feature values (context = 1, security = 1, fix = 1, release = 0, …) are fed to the network, the predicted target values (Product Development = 0, Research Work = 0, Trainings = 0) are compared with the actual target values (1, 0, 0), the error is measured, and the weights are adjusted.
14. Similar to how the overall prediction error in linear regression was a function of the weights b0, b1, b2, and b3.
The neural network then performs these steps of adjusting the weights
to reduce the error (which it performed for the first email) for all other
emails as well. With the new weights, the neural network should make
fewer errors in predicting the target values for existing emails. And then the neural network repeats this entire process of adjusting the weights for all emails one more time to reduce the errors further. It continues to repeat this entire process many times until the errors for the existing emails are minimized. At this stage, for the existing emails, the neural network is able to predict target values that are close to the actual target values.
Figure 16-23 shows that the neural network now predicts target values
that are close to the actual target values if you feed it the feature values of
the first email. So the network predicts a value close15 to 1 for the target
variable Product Development, value close to 0 for Research Work, and
value close to 0 for Trainings for the first existing email. However, the real
benefit comes from the fact that the neural network is now able to predict
reasonably well the category for even new incoming emails that it hasn’t
seen earlier.16 For example, if a new incoming email has similar text as the
first existing email (and hence similar feature values), the neural network
will still predict a value close to 1 for target variable Product Development,
value close to 0 for Research Work, and value close to 0 for Trainings. In
other words, the neural network will be able to correctly predict that the
new incoming email belongs to the category Product Development.
15. Figure 16-23 shows the predicted target value for Product Development to be exactly 1 for simplicity. The predicted target values for the other two target variables are shown to be exactly 0 in the figure for the same reason.
16. Assuming that the neural network after learning from a large set of existing emails will generalize well for new emails.
Figure 16-23. Predictions for the first email after the neural network
has learned
Like we just saw, the neural network might do a great job of correctly
predicting the category for new incoming emails. However, the network
doesn’t give us a simple picture of how it is doing the predictions unlike
some other models we discussed earlier. If you recall our discussion on
linear regression, you might remember that a linear regression model gives
you a simple equation that explains how the value of target variable is
calculated using the feature values. And that equation gives many insights
about the underlying processes. However, with neural networks, all we
know is that many neurons are doing lots of computations to arrive at the
predicted target values. We don’t have a simple picture that explains how
the target values are related to feature values.
Finally, let’s talk a bit about designing neural networks. For this
problem, we chose to have two hidden layers in our network. If you decide
to use neural networks for your problem, you will need to decide how
many hidden layers your network should have and how many neurons
each hidden layer should have. We won’t dwell further on this aspect but
recommend a study of the approaches used to decide the appropriate
number of hidden layers and neurons.
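If you use a deep learning library such as Keras (part of TensorFlow), a network like the one described above can be sketched as follows; the number of features, the layer sizes, and the randomly generated training data are all hypothetical:

import numpy as np
from tensorflow import keras

# Hypothetical prepared data: 1000 emails, 300 binary word features, 3 target variables
X = np.random.randint(0, 2, size=(1000, 300)).astype("float32")
y = keras.utils.to_categorical(np.random.randint(0, 3, size=1000), num_classes=3)

model = keras.Sequential([
    keras.layers.Input(shape=(300,)),
    keras.layers.Dense(64, activation="sigmoid"),   # hidden layer 1
    keras.layers.Dense(32, activation="sigmoid"),   # hidden layer 2
    keras.layers.Dense(3, activation="sigmoid"),    # one output neuron per target variable
])

# Repeatedly adjust the weights to reduce the prediction error on the existing emails
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=10, batch_size=32)

# Predicted values for the three target variables for the first email
print(model.predict(X[:1]))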
Convolutional Neural Network
A convolutional neural network (CNN) is a type of neural network that is widely used for problems involving images. Let’s say you are building a model that classifies the grayscale images captured by a surveillance camera. The CNN takes the image as input and passes it
through the convolutional and pooling layers as you can see in the figure.
Convolutional layers apply filters containing weights to the image – each filter uses its weights to calculate a weighted sum of the pixel values in a small portion of the image, producing one calculated value for that portion; the filter does this for every portion of the image, resulting in a matrix of calculated values known as a feature map. Pooling layers
reduce the size of the feature maps – this is often done by retaining only
the largest value or the average value from each small chunk of a feature
map. The next step seen in the figure is flattening which flattens/converts
the generated feature maps into a plain one-dimensional vector of feature
values. As you can see in the figure, this feature value vector is passed on
to the regular neural network which outputs the predictions in its final
(rightmost) layer.
The figure shows this pipeline: the grayscale image passes through convolution and pooling, the resulting feature maps are flattened into a feature value vector, and the regular neural network at the end outputs the predictions.
This CNN learns to correctly predict the class for an image using a
process that is similar to the learning process we discussed earlier for the
regular neural network. The CNN will require a set of existing images and a
label for each image indicating the class to which the image belongs. It will
then calculate and assign such values to filter weights and neuron weights
that will produce the correct predictions for existing images. And with
these appropriate values of filter weights and neuron weights, the CNN
could correctly predict the class for a new unseen image as well. This is the
10000-foot view of the learning process that skips many details.
Since a CNN has a few layers in addition to the usual layers of a regular
neural network, there are some additional design aspects you need to
consider while designing a CNN. Refer to Chollet (2018) for a coverage
of some of these aspects. You could also explore standard architectures
available out there17 and see if any of those work well for your problem.
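As a sketch, a small CNN for, say, 64 × 64 grayscale images and two classes could be defined in Keras as follows; the sizes and the number of filters are hypothetical:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(64, 64, 1)),                      # 64 x 64 grayscale image
    keras.layers.Conv2D(16, kernel_size=3, activation="relu"),  # convolutional layer (16 filters)
    keras.layers.MaxPooling2D(pool_size=2),                     # pooling layer
    keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),                                     # flatten feature maps into a vector
    keras.layers.Dense(64, activation="relu"),                  # regular neural network
    keras.layers.Dense(2, activation="softmax"),                # predictions for the two classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()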
Evaluating Models
To objectively determine whether a model is performing well, we need
to check how it performs on unseen data, that is, data that is not used for
creating the model. In order to achieve this, we set aside some of the data
from the original dataset for testing; this data set aside is called the test
data and is usually represented as a percentage of the total dataset. All data
except the test data is used by the ML algorithm to train the model and is
referred to as the training data. Thus, if you hear something like a “70-30 split,” or “we set aside 30% for testing,” it means that 70% of the original dataset is used to train the model and the remaining 30% is kept aside as unseen test data for evaluating it.
17. Such as EfficientNet that we saw in Chapter 10.
The model thus trained on the training data is used to predict the
target values for the observations in the test data. The performance of the
model on the test set is evaluated by checking how close these predicted
target values are to the actual target values. For example, for classification
problems, we could check what fraction of the observations in the test set
have a predicted target value which is the same as the actual target value.
This is known as the accuracy of the model. There are numerous other metrics for evaluating the performance of classification models, like precision, recall, F1 score, and AUC. Similarly, there are metrics for regression problems, like mean squared error, mean absolute error, and R squared.
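With scikit-learn, the train-test split and this evaluation can be sketched as follows; a standard sample dataset is used here purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # a standard sample dataset, used only for illustration

# A "70-30 split": 30% of the data is set aside as unseen test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Fraction of test observations whose predicted target equals the actual target
print(accuracy_score(y_test, model.predict(X_test)))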
How this train-test split happens is one of the most important factors
to ensure that the data science process, that is, the scientific method, is
applied correctly and successfully. We have seen several cases where a
claim is made to having created a good model, but the model does not
perform well in production – one of the primary reasons is that something
was overlooked in the train-test split.
There can be several aspects to carefully consider in how the train-test
split is done, depending on the specifics of the data and the problem being
solved. The following are a few common considerations and nuances:
Tuning Models
ML algorithms typically provide various “knobs” to tune how they build
the model. For example, the random forest algorithm allows you to specify
the number of trees, maximum depth of the tree, etc.
The “knobs,” or parameters to the algorithm that are used to specify
how training should happen, are referred to as hyperparameters. And
determining the right combination of values for the hyperparameters that
results in the best model performance is referred to as hyperparameter
tuning. For example, you might find out in your scenario that you
have the best performance with random forest for the combination of
hyperparameter values in which the number of trees is 30, maximum
depth of tree is 8, etc.
The simplest way to perform hyperparameter tuning is to train models
with various combinations of hyperparameter values using the train set
and determine the combination whose model performs the best on the
test set. And the model corresponding to the best combination is chosen
as the final model. However, with this approach, we end up choosing the
final model by seeing the test set – a gross violation of the scientific method
which demands that the test set be “unseen”.
To address this issue, the train set is typically split further to create a
“validation set” as shown in Figure 16-25.
Figure 16-25 depicts this: the original dataset is first split 70-30 into train and test data, and the train data is then split further 80-20 into train and validation sets.
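A sketch of this tuning approach with a validation set, using scikit-learn and a random forest (the hyperparameter values tried below are hypothetical), could look like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 70-30 train-test split, then an 80-20 train-validation split within the train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

# Try a few hyperparameter combinations and evaluate each one on the validation set
best_score, best_params = 0, None
for n_trees in [10, 30, 100]:
    for depth in [4, 8]:
        model = RandomForestClassifier(n_estimators=n_trees, max_depth=depth, random_state=0)
        score = accuracy_score(y_val, model.fit(X_tr, y_tr).predict(X_val))
        if score > best_score:
            best_score, best_params = score, (n_trees, depth)

# The test set is touched only once, to evaluate the finally chosen model
final_model = RandomForestClassifier(n_estimators=best_params[0], max_depth=best_params[1],
                                     random_state=0).fit(X_train, y_train)
print(best_params, accuracy_score(y_test, final_model.predict(X_test)))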
Cross-Validation
The notion of creating a train-validation split can be generalized further.
For example, after creating the train-test split, let us split the train data into
multiple parts called “folds.” Figure 16-26 shows such a split that creates
three folds.
Figure 16-26 shows the original dataset split 70-30 into train and test data, with the train data divided into three folds of 33.3% each.
In this scheme, each fold takes a turn as the validation data while the remaining folds are used for training, and the validation results are averaged to choose the best hyperparameter combination; the test data, as before, is used only for model testing. The more complex nested cross-validation strategies
use multiple folds for both validation and test data.
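Scikit-learn’s GridSearchCV automates this folding: it evaluates every hyperparameter combination across the folds of the train data and refits the best one, keeping the test data untouched for the final evaluation. A minimal sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

param_grid = {"n_estimators": [10, 30, 100], "max_depth": [4, 8]}

# cv=3 splits the train data into three folds for validation
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # final evaluation on the unseen test data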
Data Engineering
Data engineering is primarily required when dealing with a large amount
of data for training models.
Conclusion
We discussed categories of ML algorithms in this chapter and looked at
a few ML algorithms in detail. We discussed how these algorithms work,
how they make predictions, and what kind of insights they provide. We
also talked about model evaluation and tuning and mentioned a few
popular ML technologies. The next chapter focuses on deploying and
using the ML models in production systems for inference.
Further Reading
For a coverage of ML techniques, refer to James et al. (2013). For hands-
on examples of ML in Python, refer to Géron (2019). Specifically for deep
learning, we recommend Chollet (2018).
References
Chollet, Francois. Deep Learning with Python. NY, USA: Manning, 2018.
Géron, Aurélien. Hands-on Machine Learning with Scikit-Learn, Keras
& Tensorflow, 2nd ed. Sebastopol, CA: O’Reilly, 2019.
James, Gareth, et al. An Introduction to Statistical Learning. New York:
Springer, 2013.
CHAPTER 17
Inference
Once models are created in the machine learning step, they need to be
deployed as part of real-world processes and production systems. This is
done in the inference step of the data science process.
In the inference step, we perform the tasks required to push the
models to the production systems so that they can be used by other
applications and to monitor the performance of these models.
Figure 17-1 shows the various activities, techniques, and technologies
that go into this last mile of data science. In this chapter, we shall cover
Figure 17-1 in detail. We first cover the model release process, wherein
the models created during internal experimentation are prepared to be
pushed to the production systems. Then, we cover the production system
itself, including how the models are deployed, used for predictions/
inferences by applications, and monitored. We shall cover the diagram
starting from the bottom and moving upward in the numeric sequence
indicated in the boxes/arrows. While doing so, we shall discuss the various
components, techniques, and technologies used in each activity. We shall
then touch upon a few factors to consider while choosing between open
source and paid tools for inference. Finally, we shall mention the data
engineering aspects involved in inference.
Figure 17-1 depicts this flow from internal experimentation to production: models move through the model release process into an inference server, which comprises the inference/prediction service (e.g., Sagemaker Inference, Lambda, Seldon Core), model monitoring (e.g., Arize, Seldon Deploy, Sagemaker Model Monitor), and A/B testing of models, with the predictions consumed by mobile/web apps.
Model Registry
The models created by the data science team in the machine learning step
are stored in a model registry. Models in the model registry are typically
versioned. Models would also have a life cycle, for example, under testing,
in production, etc.
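For example, MLflow provides a model registry; a minimal sketch of logging and registering a scikit-learn model is shown below, where the tracking URI, the toy model, and the registered model name are all hypothetical:

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression

# Hypothetical tracking server that hosts the model registry
mlflow.set_tracking_uri("http://mlflow.example.internal:5000")

model = LogisticRegression().fit([[0, 1], [1, 0], [1, 1], [0, 0]], [0, 1, 1, 0])

with mlflow.start_run():
    # Log the model and register a new version of it under a (hypothetical) registry name
    mlflow.sklearn.log_model(model, "model", registered_model_name="claim-predictor")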
Model Converter
A model created by the data science team is typically represented initially
in the technology stack that they use, for example, if the data science team
uses Python and ML libraries such as scikit-learn, then a Python pickle
file would represent an ML model. The pickle file would capture, say, the
structure of a decision tree or the equation of a linear regression model.
This ML model typically needs to be converted to a form suitable for
deploying to the targeted production systems.
The model converter step involves converting ML models either to an interexchange format or directly for a target system such as mobile/web apps. Let us briefly look at both these options and when they are used.
Interexchange Format
Several ML libraries and platforms exist in various programming
languages – we saw some of these in Chapter 16. To enable interoperability
among these, interexchange formats have been developed in the machine learning community. An interexchange format, such as PMML2 or ONNX,3 provides a standard representation of models so that a model created using one ML library4 can be deployed and executed in a different environment.
1. The MLFlow API, invoked as part of the experimentation code in the machine learning step, enables this.
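As a sketch, converting a scikit-learn model to the ONNX interexchange format could look like this, using the skl2onnx package; the toy model and the number of input features are hypothetical:

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]], [0, 1, 1, 0])

# Declare the shape of the model's input: any number of rows, two float features
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 2]))])

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())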
Target System
In some cases, a model can be converted directly to target a specific
production system,5 without using an interexchange format. This is possible if
the ML library tech stack is standardized across the data science and software
engineering teams. For example, if TensorFlow is agreed as a standard across
the teams, then the TensorFlow models can be converted using TensorFlow
Lite, Tensorflow.js, etc., to target systems such as mobile/web apps, where the
model must execute directly on the user’s device/browser.6
2. https://wikipedia.org/wiki/Predictive_Model_Markup_Language
3. https://onnx.ai/
4. Such as the ones we saw in Chapter 16.
5. Refer to the “Mobile and Web Applications” section for examples of when this may be appropriate.
6. Or on IoT devices as well.
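As a sketch, a Keras/TensorFlow model can be converted for mobile deployment with TensorFlow Lite as follows; the model here stands in for a hypothetical, already trained model:

import tensorflow as tf

# A hypothetical, already trained Keras model
model = tf.keras.Sequential([tf.keras.layers.Input(shape=(4,)),
                             tf.keras.layers.Dense(1, activation="sigmoid")])

# Convert the model to the TensorFlow Lite format used on mobile and IoT devices
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)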
Model Packaging
Model packaging involves creating a deployable artifact from the
converted model. The type of artifact depends on whether the model is
to be deployed to a server or an end user’s system (e.g., mobile app/web
browser).
Production
The models created by the data science team are used by applications in
production systems. We shall now look at how these applications typically
invoke the models that have been deployed.
7. In case of (1.2), occasionally the converted model is directly integrated without packaging into a module. This depends on the low-level design of the app.
Inference/Prediction Service
This refers to the services (most commonly REST APIs) that expose
the models. Typically, data engineers or software engineers can easily
implement a REST API layer on top of the packaged models.
But if there are a large number of models, or if you need the ability
to scale rapidly to several thousands of concurrent requests, etc., you
may instead want to consider using services such as Amazon Sagemaker
Inference, Seldon, Algorithmia, etc. In our experience, we have also
found serverless techniques rather appropriate. For example, Amazon
API Gateway coupled with AWS Lambda is a cost-effective way to deploy
models that can scale easily.
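As a sketch, a simple REST prediction service wrapping a packaged scikit-learn model could be implemented with Flask as follows; the pickle file name and the input format are hypothetical:

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the packaged model once at startup (hypothetical pickle file)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [[0, 40, 1]]}
    features = request.get_json()["features"]
    predictions = model.predict(features).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)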
Model Monitoring
To know how effective the models are, you need to know when the
predictions given by a model turned out to be correct or incorrect.
Let’s refer to the example in Chapter 4 on classification – depending on
whether a prospective sale materialized or not, we can determine whether
the prediction given by the model was correct or not. Often a model in
production tends to start drifting over time, that is, increasingly starts
giving incorrect predictions. Detecting this early and fixing/upgrading the
model is essential.
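A minimal sketch of such monitoring is shown below; it assumes that the true outcomes arrive later and can be joined to the logged predictions, and that a drop in rolling accuracy relative to the accuracy at release time is treated as a sign of drift:

import pandas as pd

def rolling_accuracy(log: pd.DataFrame, window: str = "7D") -> pd.Series:
    """log has columns: timestamp, prediction, actual."""
    log = log.sort_values("timestamp").set_index("timestamp")
    correct = (log["prediction"] == log["actual"]).astype(float)
    return correct.rolling(window).mean()

def is_drifting(log: pd.DataFrame, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag drift if recent accuracy falls well below the accuracy
    measured when the model was released (the baseline)."""
    recent = rolling_accuracy(log).iloc[-1]
    return recent < baseline - tolerance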
ML Ops
The discipline of releasing and maintaining models in production is
referred to as ML Ops. It has evolved as a discipline over the past couple of
years and broadly covers both the model release process and the inference
server blocks.
Correspondingly, a dedicated ML Ops role is also increasingly seen in
data science teams – we shall look at this role in more detail in Chapter 21.
8. Typically done using a deep learning model that is trained to produce clean audio from noisy audio.
In the early stages, you can simply store model predictions directly to
your data lake or data warehouse.9 At some point, you would begin to have
several models in production, and deploying and monitoring the models
at scale would start gaining importance.
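For the early stages mentioned above, a minimal sketch of storing predictions could look like the following; pandas with a Parquet-based lake is an assumption, and the S3 path, columns, and partitioning are illustrative:

from datetime import datetime, timezone

import pandas as pd

def log_predictions(model_name, model_version, observation_ids, predictions):
    # One row per prediction, with enough context to join against the
    # true outcomes later for model monitoring.
    df = pd.DataFrame({
        "model_name": model_name,
        "model_version": model_version,
        "observation_id": observation_ids,
        "prediction": predictions,
        "predicted_at": datetime.now(timezone.utc),
    })
    df.to_parquet(
        "s3://my-data-lake/predictions/",
        partition_cols=["model_name", "model_version"],
    )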
Once you reach a stage where a more advanced tool may seem more
effective for all these activities, you can then consider using a paid tool,
possibly with an open source strategy. You may want to begin by adopting
the new tool for some of the components in an incremental fashion.
For example, you could begin by using open source Seldon10 for the
model release process and the inference/prediction service and later at an
appropriate juncture adopt their enterprise solution11 that covers the entire
inference server as well.
Data Engineering
All the activities in the inference step can be considered as a part of data
engineering. When the number of models in production increases and
the team grows, a small, specialized ML Ops group can be carved out if
needed. Refer to Chapters 21 and 22 for more details.
Also, from the data engineering perspective, the data captured from
model monitoring is on par with the rest of the data, that is, the model
monitoring system is yet another data source. Here, we circle back to
the data capture step of the data science process – the data about the
predictions given by a model and its performance (evaluated against the
true outcomes) is used in the further iterations of the data science process
to tune and upgrade the models.
9. See Chapter 13.
10. That is, Seldon Core.
11. That is, Seldon Deploy.
Conclusion
In this chapter, we covered the various activities involved in the inference
step of the data science process. We also touched upon a few tools and
libraries typically used in the various activities for specific purposes and
covered a few pointers related to choosing between open source and paid
tools.
The predictions and performance of a model in production are
effectively a new data source for the data capture step of the data science
process. This loop forms the maximal iteration – from data capture to
inference, back to data capture – of the data science process that we first
saw in Chapter 1.
CHAPTER 18
Other Tools and Services

Development Environment
The dev environment is used by data scientists and data engineers to write
the code for all the steps in the data science process from data capture
to machine learning. If you are just starting out with one or two data
scientists, and you are able to make the data available to them in CSV files,
they can simply do the analysis on their respective machines using an IDE1
of their choice, such as Spyder or RStudio. But often a data science team
works in a highly collaborative environment, coding in environments such
as notebooks that are in a shared location for discussion with the rest of the
team. Ideally, the notebooks should also support collaborative editing by
multiple data scientists simultaneously.
1. Integrated development environment.
Jupyter notebooks are the most common environment for data
scientists. Other popular environments are Databricks, Sagemaker Studio,
JupyterHub, and Zeppelin. Some of these allow a mix of R/Python/SQL in
a single notebook, which can be useful if your data science team has a mix
of these skills.
The dev environment should also allow the data science team to
register common, standard libraries to ensure the entire team is working
with the same versions of the various libraries.
We shall look at how a dev environment is used in conjunction with all
the other components in Chapter 19.
Experiment Registry
An experiment registry is where all the experiments executed by the data
science team would be stored. An experiment registry needs to support the
following:
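As an illustration of what gets stored, the following minimal sketch uses MLFlow (one common choice for this component, as noted in Chapter 17); the experiment name, parameters, and toy data are illustrative:

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data standing in for the prepared observations.
X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("fuel-sales-forecast")

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=200, max_depth=5, random_state=0)
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

    # Everything needed to compare and reproduce runs later.
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", rmse)
    mlflow.sklearn.log_model(model, "model")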
Compute Infrastructure
Compute resources serve three broad purposes in the operations of a data
science team:
2. We covered model registry in Chapter 17.
3. Databricks, for example, integrates MLFlow into its environment.
AutoML
One of the motivations of AI is to automate a lot of the repetitive work done
by humans. What if we aim to automate the work done by data scientists?
This is the vision of AutoML.
4. Covered in Chapter 15.
5. Covered in Chapter 13.
6. Elastic Container Service.
Purpose of AutoML
AutoML serves two primary purposes:
7. Searching through the various architectures of a neural network to determine the best architecture is referred to as Neural Architecture Search or NAS. The EfficientNet family of models we saw in Chapter 10 was a result of NAS.
AutoML Cautions
It is important to be cautious in the use of AutoML. If someone in the team
is using AutoML, it is important that they know how to evaluate model
performance, so that they can determine if the resulting models are good
enough. They also need to have a complete understanding of the data
being fed to the AutoML and the formulation of the problem itself (which
is the target variable, which are the features, etc.). A person with these
skills is referred to as an ML engineer these days – refer to Chapter 21 for
more details.
AutoML does not reduce the need for domain understanding –
particularly when it comes to identifying the right features and some
domain-specific data preparation steps. For example, your goal may be to
predict when the fuel stock at various gas stations in a region would need
replenishment, so that you can then optimize the dispatch of fuel to the
gas stations from the main terminal. In this case, it may seem at first sight
that we need to predict the inventory of a gas station. But inventory will
typically not have a clear pattern – what is more likely to have a pattern
is the sales at a gas station. The sales may depend on day of the week,
whether a day was a holiday, the weather, and so forth. Once we have a
model that can predict sales, we can deduce the future inventory based
on current inventory and future sales. From the future inventory, we can
determine when the inventory will be low and need replenishment. Even
if one is using AutoML, this kind of problem formulation still needs to be
done by the data scientist or ML engineer.
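A minimal sketch of this formulation, with purely hypothetical numbers, is shown below – predicted sales are subtracted from the current inventory to deduce the day on which replenishment is needed:

# Hypothetical figures for one gas station.
current_inventory = 12000          # litres currently in the tank
reorder_level = 3000               # dispatch fuel before dropping below this
predicted_daily_sales = [2100, 1800, 2600, 2400, 1900, 3100, 2800]

inventory = current_inventory
for day, sales in enumerate(predicted_daily_sales, start=1):
    inventory -= sales             # future inventory = current inventory - future sales
    if inventory < reorder_level:
        print(f"Schedule a fuel dispatch on or before day {day}")
        break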
While some data preparation aspects such as normalization or missing
value handling can be done by AutoML, the primary focus of AutoML is to
automate the machine learning step. A human still needs to execute the
other steps of the data science process.
8. This term is coined by Forrester.
9. See Chapter 21 for understanding the role of citizen data scientist.
The primary reason for this has been the mismatch in technology
stacks – while data scientists use languages such as R and Python, creating
a prototype application to showcase a model requires classical web app
skills of JavaScript, web server, REST APIs, etc. To address this problem,
there are a couple of approaches that we see in the industry today10:
• Enable data scientists or data engineers to quickly
create prototype apps that can invoke their models/
scripts using a technology they are familiar with.
Plotly Dash and Bokeh are examples of this – they
allow quickly prototyping web applications using only
Python code.11 So, any models/visualizations created
by the data scientists/engineers can directly be plugged
in to quickly create an interactive web application, as
the sketch following this list illustrates.
10. If you happen to be using an advanced tool such as SAS VDMML, then this is readily available already. So, you wouldn’t need these approaches.
11. An equivalent in the R ecosystem is Shiny.
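The sketch below shows what such a prototype could look like in Plotly Dash, assuming a pickled model that takes a single numeric feature; the feature, labels, and file name are illustrative:

import pickle

from dash import Dash, Input, Output, dcc, html

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

app = Dash(__name__)
app.layout = html.Div([
    html.H3("Sales prediction prototype"),
    dcc.Slider(id="temperature", min=0, max=45, step=1, value=25),
    html.Div(id="prediction"),
])

@app.callback(Output("prediction", "children"), Input("temperature", "value"))
def show_prediction(temperature):
    # Invoke the data scientist's model directly from Python.
    predicted = model.predict([[temperature]])[0]
    return f"Predicted sales: {predicted:.0f} litres"

if __name__ == "__main__":
    app.run_server(debug=True)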
Apart from this, one can find numerous libraries that are specific to a
particular domain such as medical imaging, etc. Typically you would want
to explore whether any such libraries/services – preferably open source –
exist for your domain, which you can leverage.
The skills needed to make effective use of these AI services and
libraries are similar to the skills required to use AutoML. Thus, ML
engineers – who we encountered in the AutoML section earlier – are often
ideally placed to leverage AI services/libraries as well.
When to Use
If you are building an application that involves a class of problem covered
in Table 18-1, it may be prudent to begin by using one of the corresponding
services. This may require much less investment than building a model
from scratch yourself. It can also hasten your time to
market and allow you to focus on getting the basic functionality of your
application out in the hands of users. The same applies if you would
like to automate some processes in your organization using some of the
capabilities mentioned earlier.
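For example, a sentiment analysis capability can be tried out in a few lines against a prebuilt service such as Amazon Comprehend; this sketch assumes AWS credentials are configured and boto3 is installed, and the region and text are illustrative:

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

response = comprehend.detect_sentiment(
    Text="The delivery was late and the package was damaged.",
    LanguageCode="en",
)
print(response["Sentiment"], response["SentimentScore"])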
While we can rely on these services to a great extent as they are used
by several organizations, there is no guarantee of how they will perform in
your specific business and domain. Thus, it is important to evaluate the
performance of the models critically in any case.
If your product or application aims to differentiate itself in the market
by leveraging your own domain expertise and data, then it would make
sense to create your own models. In this case, it would still be useful to use
the previously mentioned services as a baseline benchmark.
One final aspect to consider, as always with cloud services, is security.
It is important to read the fine print – some cloud services may use your
data to continually improve their services, and you may need to opt out
from them explicitly.
Conclusion
In this chapter, we saw an assortment of various categories of tools that are
used across multiple steps of the data science process.
With this chapter, we wrap up our coverage of techniques, tools, and
technologies. In the next chapter, we shall look at a reference architecture
that illustrates how the various technologies discussed in Part 3, so far,
come together in executing the entire data science process.
CHAPTER 19
Reference Architecture
So far, we have covered the various tools and technologies that are used by a
data science team to execute the various steps of the data science process. In
this chapter, we shall now look at a reference architecture that can be tailored
and used for your data science team’s operations. The reference architecture
brings together the various tools and technologies that we have seen so far, to
enable the data science process for rapid experimentation and deployment.
Figure 19-1 shows the reference architecture. We have already looked
at the individual components in the earlier chapters of Part 3 – this chapter
covers how they all come together.
It is important to note that not all blocks are necessary to begin with –
depending on the kinds of data science projects and the data science
culture, different blocks would evolve over time. But eventually once your
team reaches a level of maturity, nearly all these blocks would be needed.
At a high level, there are two aspects to data science as seen in
Figure 19-1: the systems that support the data science experimentation
activities and the production systems that consume the models created by
the data science team.
In this chapter, we shall walk through Figure 19-1 in detail. We shall
first look at the experimentation side, followed by the transition from
experimentation to production. Generally, we shall walk through the
blocks in the numeric sequence indicated in the boxes/arrows.
Figure 19-1. Reference architecture: dev environment (Jupyter, Databricks, Sagemaker Studio, Spyder, RStudio); core infra with shared file system (Amazon EFS) and data warehouse (Snowflake, Redshift, Synapse); experiment registry (MLFlow); query engine (SparkSQL, Presto); batch jobs (Spark, Airflow); visual analytics (BI tools, e.g., Tableau); AI services (AWS Rekognition, Azure Cognitive Services); and an inference server comprising model monitoring (Arize, Seldon Deploy, Sagemaker Model Monitor), AB testing, and the inference/prediction service (Sagemaker Inference, Lambda, Seldon Core)
Experimentation
In this section, we cover the various components pertaining to the
experimentation activities of the data science team.
(1.2) If you have a big data setup where a query engine such as
SparkSQL or Presto is used, then the data science team should be able to
run SQL using this query engine.
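For example, from a notebook in the dev environment, a data scientist could run ad hoc SQL via SparkSQL roughly as follows; the table and column names are illustrative, and it is assumed the tables are already registered with the query engine:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ad-hoc-analysis").getOrCreate()

daily_sales = spark.sql("""
    SELECT station_id, sale_date, SUM(litres_sold) AS litres_sold
    FROM sales
    GROUP BY station_id, sale_date
""")
daily_sales.show(10)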
(1.3) If the data science team is exposing some of the models to other
internal stakeholders using simple applications or workflows, it would
be good to enable integration between the dev environment and these
applications/workflows. For example, KNIME Server, which can be used
for this purpose as we saw in Chapter 18, allows invoking workflows using REST
APIs – such APIs can thus be invoked from the dev environment to execute
workflows.
(1.4) The data science team may need to access AutoML services
on the cloud, such as Amazon Sagemaker Autopilot, Google Cloud
AutoML, etc.
Ingestion (3)
We had looked at aspects related to ingestion in Chapter 13. The goal of
ingestion is to have the data available in reasonably organized form in the
data lake or warehouse (in 4).
Analytics (5)
A lot of the data analysis by the data scientists will likely happen within the
dev environment (1) using the core infra (4). But in slightly larger teams
working on big data, the following needs begin to be felt:
1. Data analysts and citizen data scientists are described in Chapter 21.
AutoML (7)
AutoML services were covered in Chapter 18. The AutoML services used
would need access to the data for training the AutoML models.
In case you are planning to use cloud AutoML services but have an
on-prem setup of the core infra (4), your IT team may need to enable your
data science team to use AutoML from their dev environment (1.4).
Having covered the experimentation side, let us now see what steps/
components are required to ensure that the models created by data
scientists see the light of day as part of a production system.
AI Services
AI services (11), such as for speech processing, computer vision, time
series forecasting, etc., were covered in Chapter 18. These services are
typically exposed as APIs and SDKs, and can be invoked directly from
mobile or web apps (10.1).
Typically, the data science team2 would perform some experiments
to evaluate the fitment of an AI service for your specific use case before
it is used in production – this is why we have depicted the AI services
component as cutting across the experimentation and production systems.
Conclusion
In this chapter, we covered a reference architecture that can be tailored to
your specific needs. We also covered the usage of various components in
this reference architecture to support the data science process.
We shall revisit the reference architecture in Chapter 23 – there we
shall look at how the type of data science project influences the need for
various blocks of the reference architecture.
2. Possibly an ML engineer, see Chapter 21.
CHAPTER 20
Monks vs. Cowboys: Praxis

Goals of Modeling
Recall from Chapter 3 that the primary purposes of a model are
• Explaining observations by estimating the underlying truth: this can be broken down further into two granular goals, simplicity of representation and attribution.
• Predicting values of the target variable for new observations: this can be broken down further into the two granular goals of interpretability and accuracy of predictions.
1. Sourced from The Newton Project “Untitled Treatise on Revelation (section 1.1).” <www.newtonproject.ox.ac.uk/view/texts/normalized/THEM00135>
2. Or more generally, a hyperplane.
3. That is, the equations that map inputs to an output at each neuron.
4. This does not seem to be a standard term in the industry yet, but we find it rather apt.
5. See Chapter 14.
6. Absolute value of the weight to be more precise.
Prediction: Interpretability
This goal refers to the need for us, humans, to be able to understand and
interpret how the model is generating a specific prediction for a new
observation. In other words, we as humans need to be able to interpret
each step that a model took to reach the target prediction from the input
observation. Let us look at a few examples:
Prediction: Accuracy
This simply refers to the notion that we want our model to accurately
predict values of the target variable for future, unseen observations. Note
that this goal does not include any notions of whether we as humans are
able to understand (interpret) why the model predicts something – it only
talks about the goal of getting the prediction right. At the time of writing,
in terms of the power of accurate predictions, techniques such as deep
learning and gradient boosted trees7 generally lead the pack.
7. Particularly XGBoost.
Having looked at the four goals and some examples of how a few
techniques fare toward achieving these goals, we can look at a more
formal “grading” matrix to capture these notions for all techniques – this
matrix also helps identify which techniques are typically preferred by each
culture.
Grading ML Techniques
In the previous section, we discussed how each purpose of modeling can
be broken into two goals each, resulting in the four goals of modeling. We
also discussed a few examples of how various techniques fare against these
goals. This is summarized by the grades8 we’ve given in Figure 20-1 – the
figure also shows how the two cultures differ in the way they approach
these goals and how this determines the techniques that they usually
prefer. We had seen in Chapter 3 that the monastic culture focuses on both
purposes, so all four goals are equally relevant to monks as shown in the
table. The wild-west culture focuses only on the purpose of predicting
values, so the goals toward the bottom of the table are more relevant to
cowboys. Consequently, since the ML techniques toward the left have
generally high grades in all four goals, these techniques are preferred in
monastic culture. Similarly, techniques toward the right have very high
grades for the bottom goals and hence are preferred in wild-west culture.
8. Various data scientists might give slightly different grades.
Figure 20-1. Some ML techniques graded for each goal of modeling, and
how these grades influence the preferred techniques of the two cultures
The following are examples of the (somewhat subjective) reasoning
behind a couple of cases:
The following are a few classical monastic techniques that we haven’t covered
elsewhere in the book:
Cultural Differences
Table 20-1 summarizes all the differences between the two cultures – it
elaborates on a few points in Table 3-1 and also adds a few new points
based on our coverage earlier in this chapter.
Table 20-1. Differences between the two cultures

Mindset
  Monastic: Find the underlying, eternal truth (nature) which led to (caused) the observations.
  Wild west: Find what works now. Can update frequently. Empiricism is the only eternal truth.

Purposes
  Monastic: Estimation of truth behind the observations, which enables prediction and deeper, accurate causative insights.
  Wild west: Predictive accuracy is the primary goal. Causation is often a casualty. Causative insights are either irrelevant, less accurate, or just good to have.

Evaluation
  Monastic: How close to the truth is my estimation?
  Wild west: Am I getting the predictions as accurately as I wanted to?

Evaluation – what is evaluated
  Monastic: The estimated “truth” includes attribution as well as interpretable and accurate predictions by a model. Also, models with simpler representations are preferred.
  Wild west: Primary focus is on accuracy of predictions by a model. Interpretability of predictions and attribution are occasionally good to have.

Domain expertise
  Monastic: Domain understanding significantly leveraged to craft features; this is because attribution is a primary goal, so features that are well understood are preferred.
  Wild west: Lesser domain expertise often suffices when the techniques used automate feature extraction; for example, using CNN, relevant features are automatically extracted from an image.

How many features used for modeling
  Monastic: Try to avoid “the curse of dimensionality.” Find a few features that contain most information – this reduces model complexity and facilitates attribution.
  Wild west: Any additional information can help improve predictive accuracy and is useful. Techniques like deep learning, gradient boosted trees, etc., can be used even with hundreds of features.

ML techniques
  Monastic: Prefer the ones with a high grade for all the goals in Figure 20-1.
  Wild west: Prefer the ones with grade A+ for “accuracy of prediction” in Figure 20-1.

Attribution
  Monastic: Statistical tests, important features as identified by the ML model.
  Wild west: Important features as identified by the ML model.

Model performance evaluation
  Monastic: Statistical tests, AIC/BIC, cross-validation.9
  Wild west: Cross-validation.

Model upgrade frequency
  Monastic: Can take longer to create a model. But once created, since it represents a long-term truth, upgrades are less frequent.
  Wild west: Models are typically created quickly and upgraded often through rapid iterations as new data is obtained.

9. Covered in Chapter 16.
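As a small illustration of the “Attribution” and “ML techniques” rows above, the sketch below contrasts a linear model, whose coefficients can be read as attribution (after suitable statistical tests), with a gradient boosted model, whose feature importances play that role; the data is synthetic:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)

# Monastic leaning: a simple representation whose coefficients are
# themselves the attribution.
linear = LinearRegression().fit(X, y)
print("coefficients:", linear.coef_)

# Wild-west leaning: a technique graded highly for accuracy of prediction;
# attribution comes from the model's feature importances.
gbt = GradientBoostingRegressor(random_state=0).fit(X, y)
print("feature importances:", gbt.feature_importances_)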
Conclusion
In this chapter, we elaborated on the differences between the two data
science cultures that were first introduced in Chapter 3.
It was Leo Breiman who first highlighted the existence of two such
cultures when it comes to creating models from data. His original paper,
the comments by D.R. Cox and Brad Efron, and Breiman’s responses are
in Breiman (2001). Using our terminology, we would say that Breiman was
apparently the first monk to leave the monastery and venture into the wild
west. After years of doing data science there, including contributions to
techniques like random forest, he returned to the monastery with not only
a new set of techniques but also a new perspective – the aforementioned
article details Breiman’s journey and the welcome he got from a few
monks when he returned to the monastery.
Our description of the two cultures is largely based on our personal
experience. These two cultures have been given various names in the
past – for the record, our “monastic” culture is somewhat akin to Breiman’s
“data modeling” culture, and our “wild-west” culture is somewhat akin to
Breiman’s “algorithmic modeling” culture. It is important to note though
that the implications and details of how the two cultures practice data
science have evolved since Breiman’s original paper – particularly during
the Big Data era (starting around 2006–2010) and then the deep learning
revolution (since around 2013).
A more recent survey of the two cultures is Efron (2020). This paper,
along with the Breiman discussions mentioned earlier, helped us put some
structure around our observations of the two cultures. Efron (2020) also
provided the very useful term “attribution” that we have adopted.
We shall see how the various factors covered in this chapter are useful
while building the data science team in Chapter 22.
In Chapter 23, we shall see how the choice of culture is influenced by
the type of data science projects. We shall also revisit the goals of modeling
and Figure 20-1 in the context of explainability.
Summary of Part 3
In Chapters 12 to 18, we covered the various techniques and technologies
used in the data science process. In Chapter 19, we saw how they all come
together in a reference architecture to support the operations of a data
science team. Finally, in this chapter, we covered more details of how
the two data science cultures differ in the way they practice data science,
particularly in regard to their choice of ML techniques.
Thus far in the book, we have covered the business and technological
aspects that need to be factored into building a data science practice. In
the next, final part, we shall look at the practical aspects of building a data
science team and executing data science projects.
References
Breiman, Leo. “Statistical Modeling: The Two Cultures.” Statistical Science
2001: 199–231.
Efron, Bradley. “Prediction, Estimation, and Attribution.” Journal of the
American Statistical Association 2020: 636–655.
PART IV
CHAPTER 21
The Skills Framework

Figure 21-1. The skills framework: chief data scientist; data scientist (ML libraries, notebooks, algorithms literature); data science technician (ML libraries, notebooks); data analyst (multimodal PAML, BI tools); ML engineer (ML services); ML Ops (production systems, ML libraries) – plotted along the skill dimensions, including domain expertise
Domain Expertise
Deep understanding of the domain, and how the data relates to the
domain, is essential to formulate the right problem statement, determine
the data science approach to solving the problem, and, finally, to evaluate
the correctness of the solution. Several data science problems are oriented
toward automating the routine work conducted by experts in domains
such as finance, retail, healthcare, etc. – in these cases, the domain
expertise of these folks is critical to the success of the data science team.
Refer to Chapter 20 for a discussion around how the domain expertise
requirements depend on the data science culture.
An effective data science team thus requires these three kinds of skills:
data analysis, software engineering, and domain expertise. We shall now
look at the typical roles in a data science team and how these roles require
a combination of skills along these three dimensions.
1. Data strategy refers to the overarching vision around the capture and utilization of data oriented toward achieving business goals.
2. For example, multimodal PAML tools such as SAS VDMML – refer to Chapter 18.
Data Analyst
This is a traditional role in most organizations and something you might be
familiar with. Data analysts are experts in the domain. They typically use
BI tools along with query languages such as SQL, but their programming
expertise may be limited. They can extract insights from data using various
visualization and statistical techniques.
If your company already has a data analyst, it would be great to involve
them in a consulting fashion. Given their domain expertise of the products
and business processes, they are often ideally positioned to evaluate
model performance in production systems and the business impact of
incorporating data science models in the operations of your organization.
ML Ops
ML Ops requires an understanding of both the models that the data science
team creates and the production systems that the engineering/IT team
creates and maintains. This is a niche role that is oriented primarily toward
the inference step of the data science process.
Data Engineer
A data engineer performs the data engineering step of the data science
process. They are thereby responsible for storing, tracking, transforming,
evaluating, and maintaining the data assets for use by the entire data
science team. Data engineers typically fulfil the following responsibilities:
3. Refer to Chapter 18.
Data Architect
The data architect is responsible for deciding the entire data and compute
infrastructure aligned with budgetary constraints. This includes the
choice of tools that are best suited for the data science team. Given
the interdisciplinary team, the data architect needs to ensure that the
architecture enables smooth collaboration among all the roles across the
various steps in the data science process.
We had looked at a reference architecture for data science teams in
Chapter 19. The data architect is responsible for tailoring this reference
architecture to the specific needs and constraints of your organization.
ML Engineer
The last few years have seen a tremendous rise in AI services on the cloud
and AutoML. We covered these in Chapter 18. As we saw in that chapter, an
engineer could use such a service or library to create a model, for example,
to predict the inventory requirements at a store. For this, they would need
to understand the domain and the data, but do not have to know the
details of the data science process, ML techniques, etc., that go into the
creation of the model.
This has led to the relatively new ML engineer role – an engineer with
a good understanding of the domain who can use these services and
libraries and evaluate the resulting models to ensure they meet the desired
goals.
Compared to the data science technician role, ML engineers require
less depth in data analysis skills since they do not need to know the data science
process fully. But ML engineers would need to have stronger engineering
skills in order to use the AutoML libraries and cloud services effectively.
Software engineers can easily be upskilled to ML engineers with
minimal training of ML basics such as model performance evaluation,
combined with a knowledge of the AI/AutoML services and libraries.
Data Scientist
As we see from Figure 21-1, a data scientist has a good mix of skills across
all three dimensions. Typically, a data scientist is skilled at applying
the scientific method tailored to the domain they are working in.
Correspondingly, they work closely with the domain experts to gain deep
understanding of the business and the domain.
In a small team, they may work with the chief data scientist to help
design the experiments and also execute them. In large teams, the data
scientist may focus on designing experiments with the chief data scientist
and delegate the execution to data science technicians and ML engineers.
They also work with data engineers, defining requirements for the data
pipeline. As the data preparation and data visualization steps become
more repeatable, the data scientist works with the data engineers to
automate these steps for rapid iterations.
A data scientist usually has a deep understanding of the algorithms –
they can thus modify existing open source implementations when
necessary. Some data scientists can also create new algorithms and
techniques as required.
Deviations in Skills
The depiction of some of the roles in Figure 21-1 represents our idealistic
view, and deviations from these are not uncommon in practical scenarios;
particularly, the ideal data scientist and chief data scientist depicted here
are generally regarded as unicorns. In many teams, the other roles usually
augment and fill in for any shortcomings in the skills of these primary
roles, for example, if a (chief ) data scientist has less software engineering
skills, then other engineers such as the data architect, data engineer, or ML
engineer fill in to compensate for this.
The choice of culture (monastic or wild west) can also tend
to influence which skill (data analysis or software engineering) is
predominant in a data scientist. For example, the software engineering
skills of a monk may possibly be lesser than the ideal depicted here.
Conclusion
In this chapter, we covered a skills framework to explain the various roles
that go into forming an interdisciplinary data science team. Which of
these roles are required in a team depends on the specifics of the data
science culture and the business – we covered some of the aspects in the
descriptions of each of the roles.
In the next chapter, we shall look at aspects around building and
structuring a data science team composed of these roles.
CHAPTER 22
Building and Structuring the Team
In the previous chapter, we saw the various roles and skills that go into
forming an interdisciplinary data science team. In this chapter, we shall
look at a few typical team structures that are seen in practice and then
cover some pointers around hiring data scientists, with a particular focus
on the chief data scientist.
1. Covered in Chapter 18.
2. ML Ops was covered in Chapter 17.
Figure (mature team structure): chief data scientist; data architect; data analyst; citizen data scientist; ML Ops; and per-project groups (Project 1, Project 2) of data scientists, data engineers, data science technicians, and ML engineers
Team Evolution
Once the incubation team successfully executes a couple of projects,
more opportunities to apply data science typically present themselves. As
you grow your team toward the mature structure covered in the previous
section, you would typically strive to maintain a right balance along the
following lines:
3. Refer to Chapter 19 to recall how these various tools are used together.
4. Such as ML Flow for experiment registry, etc.
Note that these pointers would apply to hiring any data scientist.
Specifically in regard to the chief data scientist role, the body of work
(points 1 and 2) would be the primary indicator of culture. The factors
based on academic background (point 3) are observations based on our
experience working with data scientists across the cultural spectrum.
5. See Chapter 2.
available, you can likely upskill a smart software engineer in your existing
IT/software development team to become an ML engineer quickly.6
If the consultant suggests that existing AI services or AutoML may not
be a good fit, then you can consider hiring a senior or chief data scientist,
depending on complexity of the problem and the potential RoI you see
from the pilot data science project.
6. Recall from the previous chapter that software engineers can easily be upskilled to ML engineers with minimal training of ML basics such as model performance evaluation, combined with a knowledge of the AI/AutoML services and libraries.
Notes on Upskilling
You may have heard that data engineers can be upskilled to data scientists.
While this may be true in some cases, we often find data engineers lacking
the requisite background in data analysis and the thought process required
to apply the scientific method effectively.
If upskilling is your primary staffing strategy, then data engineers, just
like software engineers, can often be upskilled to ML engineers to begin
with. Once they show promise as ML engineers, you can then consider
upskilling them to data scientists.
Conclusion
We covered the typical team structures and evolution and some pointers
regarding hiring data scientists, particularly the chief data scientist.
In the next chapter, we shall revisit the team structures in the context of
various types of data science projects.
CHAPTER 23
Data Science Projects
that folks often tend to shy away from during regular operations. We then
look at the legal and regulatory considerations unique to data science
projects. We finally wrap up by looking at cognitive bias and how/when to
guard against it in data science projects.
Using Table 23-1, you can identify which type of project best suits
your business needs. This also helps set expectations about the outcomes
accordingly with the team as well as other stakeholders.
In the following subsections, we cover some examples of each type of
data science project and summarize some typical traits of each type of data
science project.
1. We saw a similar example in Chapter 7.
The terms data mining and KDD are used in a few different flavors. Some, for
example, regard data mining as one of the steps in KDD. In this book, however,
we are using the terms data mining and KDD interchangeably.
2. When used to optimize the internal operations of an organization, then the project also has some elements of the classical notion of operations research (OR). This can be felt in the first example that follows in the main text.
3. Deviations specific to the project at hand are not uncommon.
Table 23-2. Traits of data science projects

KDD
  Data Science Process: Data Capture would be completed prior to project kick-off. Inference: deploying and monitoring models needed less often, and kept simple.
  Data Science Culture: Monastic, as the goal is to uncover insights.
  Team Structure: 1–2 data scientists; data analyst / citizen data scientist; part-time data engineer.
  Reference Architecture: Interactive tools for ad hoc Data Preparation and Data Visualization.
  Typical Artifacts: Insights and conclusions documented; ML models; experiments captured in slide-decks or in experiment registry (e.g., ML Flow).

DSI-Proc
  Data Science Process: Data Capture would typically be completed prior. Inference involves integration with existing systems/applications used in the processes.
  Data Science Culture: Monastic preferred, to enable the analysts to easily understand and validate the models w.r.t. domain. If wild-west adopted, human-in-the-loop typically essential.
  Team Structure: Similar to KDD. Would additionally involve engineers for integration and ML Ops activities.
  Reference Architecture: Data science applications using tools such as Plotly Dash, e.g., to test the models before integrating with the processes. Leverage any existing BI tools. Integrate with existing systems using APIs etc. for Inference. Model monitoring and feedback design needed to gauge improvement in processes.
  Typical Artifacts: Data science application/portal to showcase models to stakeholders before integrating in the process. Service (e.g., REST API) that is integrated into existing systems used in the process.

DSI-Prod
  Data Science Process: Data Capture would typically be completed prior. Machine Learning and Inference: model-tuning/configuration per customer may be needed for B2B product.
  Data Science Culture: Wild-west, as the goal typically is to add functionality (with accuracy) to existing product rather than uncovering insights.
  Team Structure: ML engineers if AI services are used; else data scientists. Data engineers are key to ensure automated data pipelines for continuously arriving data.
  Reference Architecture: Similar to DSI-Proc. (Data science applications may not be needed, as the product itself can be used to invoke the models.)
  Typical Artifacts: Service or library encapsulating the model is integrated into the product. REST APIs. Javascript/Java/Swift libraries for browsers and mobile devices.
KPIs
Key performance indicators, or KPIs, are used to quantify the performance
of a team and the overall progress toward the goals. In this section, we shall
mention a few KPIs and metrics that can be used for data science projects
or the overall team. Note that we are not covering business KPIs such as
RoI, etc., that determine the contribution of the data science team to the
business; rather we are covering operational KPIs that are used to track the
performance of the data science team against its own goals. It is assumed
that the goals of the data science team have already been mapped to
desired business outcomes (see Chapter 2). If you have a mature practice
with multiple projects, then you may want to capture the KPIs at a project
level and aggregate them for tracking the overall team performance.
Model Performance
This is the KPI that determines how well a given model is performing.
Achieving a desired model performance is typically the immediate goal of
a data science project.
An appropriate metric of model performance needs to be determined –
refer to the coverage of model performance evaluation and metrics in
Chapter 16.
Each iteration or sprint of a project should capture the improvement in
model performance, which indicates the progress of the core data science.
The model performance KPI ensures the team is progressing in
the right direction and captures the progress made; the other metrics
mentioned in the following only track how efficiently the team is working
toward that direction.
WHAT IS AN EXPERIMENT?
Effort-Cost Trade-Offs
Note that in the preceding table, we have typically included effort as the
metric. One common way to reduce efforts is to procure tools, for example,
such as the ones we saw in Part 3 for the various steps of the data science
process. Like for any other project, data science projects also involve an
appropriate trade-off among the following factors:
Data Quality
Data quality is often a polarizing topic of discussion within a company,
particularly so within the data science team. When you have data coming
in from multiple sources and have set up data pipelines to capture data
meant to be used by data scientists in the data science process, the
expectation of a data scientist is that the data is of “good quality.” But often
what a data scientist means by “good quality” turns out to be different
from what the software or data engineers interpret as “good quality” – this
leads to discovery of issues rather late in the data science process. It is
not uncommon to see several man-months of effort wasted due to a data
quality issue that was uncovered too late.
In this section, we shall first cover the importance of data quality
and how the severity of impact resulting from data quality issues could
vary depending on the type of the data science project. Then, we look at
a few dimensions of data quality, which will help “define” data quality
appropriately for your team. Then, we shall touch upon some aspects
around measuring and ensuring data quality. Finally, we shall look at the
typical reasons why data quality often takes a backseat and how to address
these reasons. Our focus in these sections is on making sure that the data
is of expected quality before data scientists begin work on it in the data
science process.4
4. However, once this expected quality is ensured, data scientists themselves can also play a role in improving the quality further in the data preparation step of the data science process.
Then it is often assumed at first to be an issue with the model, for example,
maybe the model wasn’t tested on a sufficiently large amount of test data, so
it is not generalizing well for production scenarios. Only after that, quite
late, does the attention shift to validating the data quality. In this way, the
underlying data quality issues are uncovered rather late, and significant
effort of the data science team could have been wasted by then.
Severity of Impact
The severity and nature of impact of these issues would depend on your
business and the problem being solved. Nevertheless, the type of data
science project typically can help arrive at some initial approximations of
severity/nature of impact. This is shown in Table 23-4.
DSI-Prod
  Severity: Low – moderate. Since the product was sustaining the business already, if a new functionality introduced does not perform as expected, the impact is relatively low. Moderate impact could be expected if many users engaged with the new functionality, and a poor experience led to customer churn.

DSBP
  Severity: High. Since the entire business relies on the data and resulting models, the impact would be high. But in this case, the issues and impact also tend to get highlighted (and addressed) much sooner as the model is the primary focus of the customer-facing product. Thus, we categorize impact as high rather than as critical.
5. In organizations where data stewardship practices exist, the data stewards would be consulted to understand the existing dimensions and tailor/enhance them as needed.
This ensures that everyone is aligned with the definition and the
resulting data quality requirements, as well as the state of data quality
at any point in time.
The definition of data quality typically takes the form of identifying
various dimensions along which quantifiable metrics can be determined.
The choice of dimensions may vary from one organization to another. As
the data science practice matures, the choice of dimensions often evolves
as well – more dimensions may get added, or existing ones refined further.
In any case, there are a few typical dimensions that are seen repeatedly –
Table 23-5 captures some of them.
Bronze
  Typical usage: Early-stage exploration to understand the data that is available; feasibility analysis of a data science project.
  Data: Raw data that has not been processed in any way.

Silver
  Typical usage: Early iterations of the data science process, more oriented toward data preparation and data visualization steps.
  Data: Data that is transformed to standard formats, cleansed, and any basic errors such as invalid values, inconsistency across tables, etc., have been identified and marked.

Gold
  Typical usage: Creating data science models.
  Data: Observations of interest, with requisite quality, distilled from multiple silver tables.
6. In Table 23-5, we have tailored some of the dimensions in regard to typical data engineering and data science activities.
Encryption
Typically, the data collection, storage, and processing would be encrypted
with appropriate mechanisms to comply with the standards set in your
organization for security and regulatory compliance.
It is important to ensure that these mechanisms are also followed by
the data science team during various steps of the data science process. The
following are a few examples of such organizational standards and how
they apply to the data science team:
Access Controls
It is important to ensure that data access is limited to the right team
members. For example, if a data scientist is working on a specific project,
then they should be given access only to the tables and rows that are
needed for their analysis. Various data lakes and warehouses support
access control mechanisms for this purpose.
Occasionally, for exploratory purposes, a data scientist may need
access to more data, or even the entire data lake – in these cases, expanded
access can be given for a limited period of time.
It may also be useful to restrict write access to certain data; for
example, you may define a policy to ensure that gold data is written to/
updated by only certain authorized data engineers or data scientists.
Finally, in case some members in the data science team have access to
protected or identifiable information, all such access needs to be audited.
Identifiable/Protected/Sensitive Information
Personally identifiable information (PII) refers to attributes such as
name, phone number, address, etc., that can be used to identify a specific
individual. Even voice recordings are potentially PII since an individual
can potentially be identified using their voice.
Protected information expands the scope of PII to include any other
information collected about an individual using your services that can
be used to identify the individual. For example, as per the US Health
Insurance Portability and Accountability Act (HIPAA), Protected Health
Information (PHI) includes information such as the medical record
number, health insurance details, etc., of an individual in addition to the
PII information covered earlier.
Finally, certain regulations may cover aspects related to the processing
of “sensitive” data. For example, as per GDPR (Article 9), sensitive data
includes
Federated Learning
Often organizations wish to collaborate to create new models. But given
the sensitive nature of data which cannot be shared, such collaboration
has been a challenge. This especially applies to industries such as
healthcare and finance.
To overcome this challenge, the notion of federated learning has recently
begun to gain traction in these industries. The following are
the primary concepts of federated learning:
One of the key capabilities needed to ensure the security – right down
to the hardware level – of both the model and data is that of trusted
execution environments (TEE) such as Intel SGX. Refer to Federated
Learning through Revolutionary Technology (2020) for an example which
includes further technical details.
Nondiscrimination
As we have seen throughout this book, data science relies on data, learns
patterns from the data, and makes predictions based on the past data.
Various ML algorithms may “learn” in different ways,8 and the resulting
models would predict in different ways – but they are ultimately based
upon the data that was fed to them during training. If the data itself
is biased in some way, then the resulting models too are likely to be
biased. There have been several examples of this in the past few years
in varied areas of performance of work, financial loans, education,
personal preferences, and so forth – refer to O’Neil (2017). The GDPR
also outlines these aspects under “profiling,” for example, see Recital 71
and Article 22.
7. As these two types of projects are likely to have consumer data.
8. As seen in Chapter 16.
9. There are other ethical aspects to accountability, such as an organization’s willingness and drive toward transparency, fairness, etc., that we shall not cover here.
10. Since you only know the predicted value but do not have a simple picture of how it was calculated internally by the model.
Cognitive Bias
We began by looking at Eddington and Einstein in Chapter 1 – let us now
circle back to them.
The tendency to actively seek for, and focus on, data and evidence that
supports one’s preexisting beliefs is referred to as confirmation bias.
Recall from Chapter 3 that the monastic culture aims to find eternal
truths, while the wild-west culture is fine with short-term, seemingly
contingent truths.
Contingent truths are typically not worth having a “belief” in – thus,
cowboys are usually less subject to confirmation bias. For example, if
they detect a drop in model performance on new data, indicating that the
model does not work for those observations, they will not try to defend
the model. They will simply retrain using the new observations to build
models that represent new short-term truths – it’s not such a big deal.
Eternal truths, on the other hand, are arrived at after quite some
deliberation and demand “belief” – monks invest significant effort in
distilling “the truth” and believe in it once distilled. Monks may choose to
overlook the occasional paltry evidence (observation) that contradicts an
eternal truth or examine it in order to prove it misleading/incorrect.
It is unfortunate that the term “confirmation bias” seems to have a
negative connotation – after all, the development of all science relies on a
healthy dose of confirmation bias. But occasionally, a data scientist can be
subject to an unhealthy dose of confirmation bias – this can be aggravated
by business realities and looming deadlines – which then harms the
science and can lead to incorrect conclusions/decisions. For example,
a data scientist might reject even good observations that disprove their
truth/model, thus leading to incorrect conclusions.
Confirmation bias is a primary example of the gamut of cognitive
biases, that is, the biases that human reasoning is typically subject to. In
addition to confirmation bias, there are numerous other cognitive biases
that also play an important role in science, in general, and data science, in
particular.
While we have seen several of these biases at work, it is gladdening to
see this aspect of data science and machine learning receiving focus in
the literature in the past few years. For example, see Kliegr, Bahník, and
Fürnkranz (2020) and Miller (2018).
References
Collins, Harry and Trevor Pinch. The Golem: what everyone should know
about science, 2nd edition. Cambridge, UK: Cambridge University Press,
1998.
DAMA. DAMA DMBOK, 2nd edition. NJ, USA: Technics Publications,
2017.