Introduction to Data Science
Contents
List of Tables
List of Figures
Preface
1 Introduction
1.1 A brief history of data science
1.2 Data science role and skill tracks
1.2.1 Engineering
1.2.2 Analysis
1.2.3 Modeling
1.3 What kind of questions can data science solve?
1.3.1 Prerequisites
1.3.2 Problem type
1.4 Structure data science team
1.5 List of potential data science careers
5 Data Pre-processing
5.1 Data Cleaning
5.2 Missing Values
5.2.1 Impute missing values with median/mode
5.2.2 K-nearest neighbors
Appendix
Bibliography
Index
List of Tables
List of Figures
10.1 Test mean squared error for the ridge regression
10.2 Test mean squared error for the lasso regression
Preface
Author      Big Data (Spark)  R Code  Python Code  Data Preprocessing  Deep Learning  ML Models  Career Path  Project Cycle
Lin and Li  3                 3       3            3                   3              3          3            3
Saltz       1                 3       0            1                   0              3          0            0
Shah        0                 3       3            1                   0              3          1            0
Irizarry    0                 3       0            3                   0              3          0            0
Robinson    0                 0       0            0                   0              0          3            3
Kelleher    1                 0       0            1                   0              2          3            3
Cady        0                 0       3            3                   0              3          1            1
Grus        1                 0       3            2                   3              3          1            0
To repeat the code for big data, like the R notebooks used in this book, you need to set up Spark in Databricks. Follow the instructions in Section 4.3 on setting up and using the Spark environment. Then run the "Create Spark Data" notebook to create Spark data frames. After that, you can run the pyspark notebook to learn how to use pyspark.
Complementary Reading
If you are new to R, we recommend R for Marketing Research
and Analytics by Chris Chapman and Elea McDonnell Feit. The
book is practical and provides repeatable R code. Part I & II of the
book cover basics of R programming and foundational statistics.
It is an excellent book on marketing analytics.
If you are new to Python, we recommend the Python version
of the book mentioned above, Python for Marketing Research and
Analytics by Jason Schwarz, Chris Chapman, and Elea McDonnell
Feit.
If you want to dive deeper into some of the book’s topics, there
are many places to learn more.
• For machine learning, Python Machine Learning, 3rd Edition by Raschka and Mirjalili is a good book on implementing machine learning in Python. Applied Predictive Modeling by Kuhn and Johnson is an applied, practitioner-friendly text using the R package caret.
• For statistics models in R, a recommended book is An Intro-
duction to Statistical Learning (ISL) by James, Witten, Hastie,
and Tibshirani. A more advanced treatment of the topics in ISL
is The Elements of Statistical Learning by Friedman, Tibshi-
rani, and Hastie.
Acknowledgements
About the Authors
1 Introduction
“When you’re fundraising, it’s AI. When you’re hiring, it’s ML.
When you’re implementing, it’s logistic regression.”
For outsiders, data science is whatever magic can get useful information out of data. Everyone has heard about big data, and data science trainees now need the skills to cope with such big data sets. What are those skills? You may hear about Hadoop, a system that uses MapReduce to process large data sets distributed across a cluster of computers, or about Spark, a system built on top of Hadoop that speeds up the same work by loading massive data sets into shared memory (RAM) across clusters, with an additional suite of machine learning functions for big data. These new skills are for dealing with the organizational artifacts of data sets too large for a single computer's memory or hard disk and with large-scale cluster computing; they are not for solving the real problem better. A lot of data means more sophisticated tinkering with computers, especially a cluster of computers. The computing and programming skills needed to handle big data used to be the biggest hurdle for traditional analysis practitioners to become successful data scientists. However, this hurdle is significantly reduced with the cloud computing revolution, as described in Chapter 2. After all, it isn't the size of the data that's important; it's what you do with it. Your first reaction to all of this might be some combination of skepticism and confusion. We want to address this upfront: we had that exact reaction.
To declutter, let's start with a brief history of data science. If you hit up the Google Trends website, which shows search keyword information over time, and check the term "data science," you will find that the history of data science goes back a little further than 2004. From the way the media describe it, you may feel that machine learning algorithms are new and that there was never "big" data before Google. That is not true. There are new and exciting developments in data science, but many of the techniques we use are based on decades of work by statisticians, computer scientists, mathematicians, and scientists from many other fields.
In the early 19th century, when Legendre and Gauss came up with the least-squares method for linear regression, probably only physicists would use it to fit linear regressions to their data. Now, nearly anyone can build a linear regression model in Excel with just a little bit of self-guided online training. In 1936, Fisher came up with linear discriminant analysis.
Most of the classical statistical models are of the first type, stochastic data models. Black-box models, such as random forest, GBM, and deep learning, are algorithmic models. As Breiman pointed out, algorithmic models can be used on large, complex data as a more accurate and informative alternative to stochastic data modeling on smaller data sets. Those algorithms have developed rapidly, with much-expanded applications in fields outside traditional statistics. That is one of the most important reasons that statisticians are not the mainstream of today's data science, both in theory and in practice. We observe that Python is passing R as the most commonly used language in data science, mainly due to many data scientists' backgrounds. Since 2000, the approaches to getting information out of data have shifted from traditional statistical models to a more diverse toolbox that includes machine learning and deep learning models. To help readers who are traditional data practitioners, we provide both R and Python code.
What is the driving force behind this shift? John Tukey identified four forces driving data analysis (there was no "data science" back in 1962):
“We don’t see things as they are, we see them as we are. [by
Anais Nin]”
1.2.1 Engineering
Data engineering is the foundation that makes everything else possible. It mainly involves building the data pipeline infrastructure. In the (not so) old days, when data was stored on local servers, computers, or other devices, building the data infrastructure could be a massive IT project that involved both the software and the hardware used to store the data and perform data ETL (i.e., extract, transform, and load) processes.
2 This is based on Industry recommendations for academic data science programs (https://github.com/brohrer/academic_advisory), with modifications. It is a collection of thoughts from different data scientists across industries about what a data scientist does and what differentiates an exceptional data scientist.
(3) Production
1.2.2 Analysis
Analysis turns raw information into insights in a fast and often
exploratory way. In general, an analyst needs to have decent do-
main knowledge, do exploratory analysis efficiently, and present
the results using storytelling.
(3) Storytelling
1.2.3 Modeling
Modeling is a process that dives deeper into the data to discover patterns we don't readily see. A fancy machine learning model is the first thing that comes to mind when the general public thinks about data science. Unfortunately, fancy models occupy only a small part of a typical data scientist's day-to-day time. Nevertheless, many of those models are powerful tools.
Getting data from different sources and dumping it into a data lake, a dumping ground of amorphous data, is far from producing the data schema that analysts and scientists would use. A data lake is a storage repository that stores a vast amount of raw data in its native format, including XML, JSON, CSV, Parquet, etc. It is a data cesspool rather than a data lake. The data engineer's job is to get a clean schema out of the data lake by transforming and formatting the data. Some common problems to resolve are:
• Enforce new tables’ schema to be the desired one
• Repair broken records in newly inserted data
• Aggregate the data to form the tables with a proper granularity
One cannot make a silk purse out of a sow's ear. Data scientists need data, and the data must be sound and relevant. The supply problem mentioned above is a case in point: there were relevant data, but they were not sound, so all the later analytics based on those data were built on sand. Of course, data almost always have noise, but the noise has to be within a certain range. Generally speaking, the accuracy requirement for the independent variables of interest and the response variable is higher than for the others. For question 2 above, these are the variables related to the "new promotion" and "sales of P1197".
The data has to be helpful for the question. If we want to predict
which product consumers are most likely to buy in the next three
months, we need to have historical purchasing data: the last buy-
ing time, the amount of invoice, coupons, etc. Information about
customers’ credit card numbers, ID numbers, and email addresses
will not help much.
Often, data quality is more important than quantity, but you cannot overlook quantity either. If the quality can be guaranteed, then usually the more data, the better. If we have a specific and reasonable question with sound and relevant data, then congratulations, we can start doing data science!
1. Description

Description is the most basic analytical problem, and yet the most crucial and common one. We will need to describe and explore the dataset before moving on to more complex analysis. For problems such as customer segmentation, after we cluster the sample, the next step is to figure out each class's profile by comparing the descriptive statistics of various variables. Questions of this kind are:
• What is the annual income distribution?
• Are there any outliers?
• What are the mean active days of different accounts?
Data description is often used to check data, find the appropriate
data preprocessing method, and demonstrate the model results.
2. Comparison
3. Clustering
4. Classification
5. Regression
6. Optimization
Under this arrangement, there is no standalone data science team; each team hires its own data science people. For example, a marketing analytics group consists of a data engineer, a data analyst, and a data scientist. The team leader is a marketing manager who has an analytical mind and in-depth business knowledge. The advantages are apparent.
• Data science resources align with the organization very well.
• Data science professionals are first-class members and valued in
their team. The manager is responsible for data science profes-
sionals’ growth and happiness.
• The insights from the data are quickly put into action.
It works well in the short term for both the company and the data
science hires. However, there are also many concerns.
• It sacrifices data science hires’ professional growth since they
work in silos and specialize in a specific application. It is also
difficult to share knowledge across different applied areas.
• It is harder to move people around since they are tied to a specific function of a specific organization.
• There is no career path for data science people, and it is difficult
to retain talent.
Role                          Skills
Data infrastructure engineer  Go, Python, AWS/Google Cloud/Azure, logstash, Kafka, and Hadoop
Data engineer                 Spark/Scala, Python, SQL, AWS/Google Cloud/Azure, data modeling
BI engineer                   Tableau/Looker/Mode, etc., data visualization, SQL, Python
Data analyst                  SQL, basic statistics, data visualization
Data scientist                R/Python, SQL, basic + applied statistics, data visualization, experimental design
Research scientist            R/Python, advanced statistics + experimental design, ML, research background, publications, conference contributions, algorithms
Applied scientist             ML algorithm design, often with an expectation of fundamental software engineering skills
Machine learning engineer     More advanced software engineering skillset, algorithms, machine learning algorithm design, system design
The above table shows some data science roles and common tech-
nical keywords in job descriptions. Those roles are different in the
following key aspects:
• How much business knowledge is required?
• Does it need to deploy code in the production environment?
• How frequently is data updated?
• How much engineering skill is required?
FIGURE 1.1: Different roles in data science and the skill requirements
BI engineers and data analysts are close to the business, and hence they need to know the business context well. The critical difference is that BI engineers build automated dashboards, so they are engineers. They are usually experts in SQL and have the engineering skill to write production-level code that builds the downstream data pipeline and automates their work. Data analysts are technical but not engineers. They analyze ad hoc data and deliver the results through presentations. The data is, most of the time, structured. They need to know the basics of coding (SQL or sometimes R/Python) but are rarely asked to write production-level code. This role was conflated with "data scientist" by many companies but is now much better defined in mature companies.
The most significant difference between a data analyst and a data
scientist is the requirement of mathematics and statistics. Data
analysts usually don’t need to have a quantitative background or
have an advanced degree. The analytics they do are mostly descrip-
tive with visualizations. Most data scientists have a quant back-
ground and do A/B experiments and sometimes machine learning
models. They mainly handle structured and ad hoc data.
Research scientists are experts who have a research background. They do rigorous analysis and make causal inferences by framing experiments, developing hypotheses, and testing whether they hold. They are researchers who can create new models and publish peer-reviewed papers. Most small and mid-sized companies do not have this role.
The applied scientist role aims to fill the gap between data/research scientists and data engineers. Applied scientists have a decent scientific background, but they are also experts in applying their knowledge and implementing solutions at scale. They have a different focus than research scientists: instead of scientific discovery, they focus on real-life applications. They usually need to pass a coding bar.
Machine learning engineers have a more advanced software engineering skillset and understand the efficiency of different algorithms and system design. They focus on deploying the models. They are
We want to start the book with soft skills for data scientists. There are many university courses, online self-learning modules, and excellent books that teach technical skills, but far fewer resources discuss the soft skills of data scientists in detail. Yet soft skills are essential for data scientists to succeed in their careers, especially in the early stages. This chapter also introduces the project cycle and some common pitfalls of data science projects in real life.
A data scientist needs to have the capability to view the problem from 10,000 feet above the ground and also to drill down to the very bottom of the details. To convert a business question into a data science problem, a data scientist needs to communicate using language other people can understand and obtain the required information through formal and informal conversations.
In the entire data science project cycle, including defining, planning, developing, and implementing, every step needs a data scientist involved to ensure the whole team correctly determines the business problem and reasonably evaluates the business value and likelihood of success. Corporations are investing heavily in data science and machine learning, and there is a very high expectation of return on the investment.
However, it is easy to set an unrealistic goal and an inflated estimate of a data science project's business impact. The team's data scientist should lead and navigate the discussions to ensure that data and analytics, not wishful thinking, back the goal. Many data science projects over-promise on business value and are too optimistic about the delivery timeline. These projects eventually fail by not delivering the promised business impact within the promised timeline. As data scientists, we need to identify these issues early and communicate with the entire team to ensure the project has a realistic deliverable and timeline. The data science team also needs to work closely with data owners on different tasks: for example, identifying relevant internal and external data sources, evaluating the data's quality and relevance to the project, and working with the infrastructure team to understand the available computation resources (i.e., hardware and software). It is easy to create scalable computation resources through cloud infrastructure for a data science project. However, you need to evaluate the cost of the dedicated computation resources and make sure it fits the budget.
In summary, data science projects are much more than data and
analytics. A successful project requires a data scientist to lead
many aspects of the project.
The types of data used and the final model development define the
different kinds of data science projects.
2.4.1.1 Offline and Online Data
There are offline and online data. Offline data are historical data
stored in databases or data warehouses. With the development of
data storage techniques, the cost to store a large amount of data
is low. Offline data are versatile and rich in general (for example,
websites may track and keep each user’s mouse position, click and
typing information while the user is visiting the website). The data
is usually stored in a distributed system, and it can be extracted
in batch to create features used in model training.
Online data are real-time information that flows into models to trigger automatic actions. Real-time data can change frequently (for example, the keywords a customer is searching for can change at any given time). Capturing and using real-time online data requires integrating a machine learning model into the production infrastructure. This used to be a steep learning curve for data scientists not familiar with computer engineering, but the cloud infrastructure makes it much more manageable. Based on the offline and online data and model properties, we can separate data science projects into three categories, as described below.
2.4.1.2 Offline Training and Offline Application

If the model is trained on offline historical data and the output is a report, there is no need for real-time execution. Usually, there is no run-time constraint on the machine learning model unless it runs beyond a reasonable time frame, such as a few days. We can call this type of data science project an "offline training, offline application" project.
2.4.1.3 Offline Training and Online Application
Another type of data science project uses offline data for train-
ing and applies the trained model to real-time online data in the
production environment. For example, we can use historical data
to train a personalized advertisement recommendation model that
provides a real-time ad recommendation. The model training uses
historical offline data. The trained model then takes customers' online real-time data as input features and runs in real time to provide an automatic action.
similar to the “offline training, offline application” project. But to
put the trained model into production, there are specific require-
ments. For example, as features used in the offline training have
to be available online in real-time, the model’s online run-time has
to be short enough without impacting user experience. In most
cases, data science projects in this category create continuous and
scalable business value as the model could run millions of times a
day. We will use this type of data science project to describe the
project cycle.
2.4.1.4 Online Training and Online Application
(1) the business team, which may include members from the
business operation team, business analytics, insight, and
metrics reporting team;
(2) the technology team, which may include members from
the database and data warehouse team, data engineering
team, infrastructure team, core machine learning team,
and software development team;
(3) the project, program, and product management team de-
pending on the scope of the data science project.
1. Shadow mode
2. A/B testing
Data preparation and preprocessing can take 60% to 80% of the total time for a given data science project, but people often don't realize that.
When a lot of data has already been collected across the organization, people assume we have enough data for everything. That leads to the mistake of being too optimistic about data availability and quality. We do not need "big data"; we need data that can help us solve the problem. The data available may be of low quality, and we need to put substantial effort into cleaning it before we can use it. There is "unexpected" effort involved in bringing the right and relevant data to a specific data science project. To ensure smooth delivery of data science projects, we need to account for this "unexpected" work at the planning stage. Data scientists all know that data preprocessing and feature engineering is usually the most time-consuming part of a data science project. However, people outside data science are often not aware of it, and we need to educate other team members and the leadership team.
When the model runs in real time in production, each instance's total run time (i.e., model latency) should not impact the customer's user experience. Nobody wants to wait even one second to see the results after clicking the "search" button. In the production stage, feature availability is crucial to running a real-time model, and engineering resources are essential for model production. However, in traditional companies, it is common for a data science project to fail to scale to real-time applications due to a lack of computation capacity, engineering resources, or a non-tech culture and environment.
As the business problem evolves rapidly, the data and the model in the production environment need to change accordingly, or the model's performance deteriorates over time. The online production environment is more complicated than model training and testing. For example, when we pull online features from different sources, some may be missing at a specific time, the model may run into a time-out, and different software versions can cause compatibility problems. We need regular checkups during the entire life cycle of the model, from implementation to retirement. Unfortunately, people often don't set up a monitoring system for data science projects, and this is another common mistake: missing the necessary online checkup. It is essential to set up a monitoring dashboard and automatic alarms, and to create model tuning, retraining, and retirement plans.
1. Demography
• age: age of the respondent
• gender: male/female
• house: 0/1 variable indicating whether the customer owns a house
2. Sales in the past year
• store_exp: expense in store
• online_exp: expense online
• store_trans: number of in-store transactions
• online_trans: number of online transactions
3. Survey on product preference
1. Strongly disagree
2. Disagree
3. Neither agree nor disagree
4. Agree
5. Strongly agree
• Q1. I like to buy clothes from different brands
• Q2. I buy almost all my clothes from some of my favorite brands
• Q3. I like to buy premium brands
• Q4. Quality is the most important factor in my purchasing de-
cision
• Q5. Style is the most important factor in my purchasing decision
• Q6. I prefer to buy clothes in store
• Q7. I prefer to buy clothes online
• Q8. Price is important
• Q9. I like to try different styles
• Q10. I like to make decisions myself and don't need too many suggestions from others
There are 4 segments of customers:
1. Price
2. Conspicuous
3. Quality
4. Style
str(sim.dat, vec.len = 3)
$$\beta_g = (1, 0, -1) \times \gamma, \quad g = 1, \ldots, 40$$

$$\beta_g = (1, 0, 0) \times \gamma, \quad g = 41, \ldots, 80$$

$$\boldsymbol{\beta}^T = \left( \frac{40}{3}, \underbrace{1, 0, -1}_{\text{question } 1}, \ldots, \underbrace{1, 0, 0}_{\text{question } 41}, \ldots, \underbrace{0, 0, 0}_{\text{question } 81}, \ldots, \underbrace{0, 0, 0}_{\text{question } 120} \right) \times \gamma$$
For each value of 𝛾, 20 data sets are simulated. The bigger 𝛾 is,
the larger the corresponding parameter. We provided the data sets
with 𝛾 = 2. Let’s check the data:
## 6 1 0 0 1 1 0 1
There are a few built-in functions in Keras for data loading and pre-processing. The IMDB dataset contains 50,000 movie reviews (25,000 for training and 25,000 for testing), along with each review's binary sentiment: positive or negative. The raw data contains the text of each movie review, and it has to be pre-processed before being fed to any machine learning model. By using Keras's built-in functions, we can easily get the processed dataset (i.e., a numerical data frame) for machine learning algorithms. Keras's built-in functions perform the tasks of converting the raw review text into a numerical data frame.
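As a quick illustration, loading the pre-processed IMDB data with the keras R package looks roughly like the sketch below; the num_words cutoff is an assumption, not a value taken from the book.

library(keras)
# Load the reviews already encoded as integer word indices; keep only the
# 10,000 most frequent words (an assumed cutoff)
imdb <- dataset_imdb(num_words = 10000)
# Unpack the training and test splits (25,000 reviews each)
c(c(train_x, train_y), c(test_x, test_y)) %<-% imdb
length(train_x)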
4.2.1 Hadoop
The very first problem internet companies faced was how to store the large amounts of data they had collected so that it could be analyzed later. Google developed its own file system to provide efficient, reliable access to data using large clusters of commodity hardware. The open-source version is known as the Hadoop Distributed File System (HDFS). Both systems use MapReduce to allocate computation across computation nodes on top of the file system. Hadoop is written in Java, and writing MapReduce jobs in Java is the most direct way to interact with Hadoop, which is not familiar to many in the data and analytics community. To help better use
4.2.2 Spark
Spark works on top of a distributed file system, including HDFS, with better data and analytics efficiency by leveraging in-memory operations. Spark is more tailored for data processing and analytics, and the need to interact with Hadoop directly is greatly reduced. The Spark system includes an SQL-like framework called Spark SQL and a parallel machine learning library called MLlib. Fortunately for many in the analytics community, Spark also supports R and Python. We can interact with data stored in a distributed file system using parallel computing across nodes easily with R and Python through the Spark API, without worrying about lower-level details of distributed computing. We will introduce how to use an R notebook to drive Spark computations.
4.3.2 R Notebook
For this book, we will use R notebooks for examples and demos, and the corresponding Python notebooks will be available online too. An R notebook contains multiple cells, and, by default, the content within each cell is R script. Usually, each cell is a well-managed segment of a few lines of code that accomplishes a specific task. For example, Figure 4.2 shows the default cell for an R notebook. We can type in R scripts and comments just as we would in the R console. By default, only the result from the last line is shown below the cell. However, you can use the print() function to output results for any line. If we move the mouse to the middle of the lower edge of the cell, below the results, a "+" symbol shows up, and clicking on the symbol inserts a new cell below. When we click any area within a cell, it becomes editable, and a few icons appear in the top right corner of the cell where we can run the cell, as well as add a cell below or above, copy the cell, cut the cell, etc. One quick way to run the cell is Shift+Enter when the cell is selected. Users become familiar with the notebook environment quickly.
By putting the magic command %md, %sql, or %python at the beginning of a cell, the cell becomes a Markdown cell, a SQL script cell, or a Python script cell, accordingly. For example, Figure 4.3 shows a markdown cell with its scripts and its actual appearance after exiting editing mode. Markdown cells provide a straightforward way to describe what each cell is doing as well as what the entire notebook is about. It is a better way than a simple comment within the code.
# Install sparklyr
if (!require("sparklyr")) {
install.packages("sparklyr")
}
# Load sparklyr package
library(sparklyr)
library(dplyr)
##
## Attaching package: 'dplyr'
head(iris)
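The cell that creates the Spark connection and copies iris to Spark is not reproduced here; the following is a minimal sketch, assuming a local Spark installation rather than the Databricks cluster used in the book.

# Connect to a local Spark instance (the book uses a Databricks cluster)
sc <- spark_connect(master = "local")
# Copy the local iris data frame to Spark and keep an R reference to it
iris_tbl <- sdf_copy_to(sc = sc, x = iris, overwrite = TRUE)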
The sdf_copy_to() line copies the iris dataset from the local node to the Spark cluster environment: "sc" is the Spark connection we just created; "x" is the data frame that we want to copy; "overwrite" is the option for whether we want to overwrite the target object if an SDF with the same name already exists in the Spark environment. Finally, sdf_copy_to() returns an R object representing the copied SDF (i.e., it creates a "pointer" to the SDF so that we can refer to iris_tbl in the R notebook to operate on the iris SDF). Now
or using the head() function to return the first few rows in iris_tbl:
head(iris_tbl)
iris_tbl %>%
mutate(Sepal_Width = round(Sepal_Width * 2) / 2) %>%
group_by(Species, Sepal_Width) %>%
summarize(count = n(), Sepal_Length = mean(Sepal_Length),
stdev = sd(Sepal_Length))
Even though we can run nearly all of the dplyr functions on an SDF, we cannot apply functions from other packages directly to an SDF (such as ggplot()). For functions that only work on local R data frames, we must copy the SDF back to the local node as an R data frame. To do that, we use the collect() function. The following code uses collect() to collect the results of a few operations and assign the collected data to iris_summary, a local R data frame:
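A sketch of that collect() step, mirroring the earlier dplyr pipeline (the book's notebook may use slightly different code):

iris_summary <- iris_tbl %>%
  mutate(Sepal_Width = round(Sepal_Width * 2) / 2) %>%
  group_by(Species, Sepal_Width) %>%
  summarize(count = n(), Sepal_Length = mean(Sepal_Length),
            stdev = sd(Sepal_Length)) %>%
  collect()   # bring the aggregated result back as a local R data frame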
library(ggplot2)
ggplot(iris_summary, aes(Sepal_Width, Sepal_Length,
color = Species)) +
geom_line(size = 1.2) +
geom_errorbar(aes(ymin = Sepal_Length - stdev,
ymax = Sepal_Length + stdev),
width = 0.05) +
geom_text(aes(label = count), vjust = -0.2, hjust = 1.2,
color = "black") +
theme(legend.position="top")
After fitting the k-means model, we can apply the model to predict on other datasets through the sdf_predict() function. The following code applies the model to iris_tbl again to predict the cluster memberships and collects the results as a local R object (i.e., prediction) using the collect() function:
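The model-fitting cell itself is not reproduced here; the sketch below shows the two steps described, with assumed arguments (in recent sparklyr versions, ml_predict() replaces sdf_predict()):

# Fit a k-means model with three clusters on the two petal measurements
fit2 <- ml_kmeans(iris_tbl, ~ Petal_Width + Petal_Length, k = 3)
# Score the same SDF and collect the predictions as a local data frame
prediction <- collect(sdf_predict(fit2, iris_tbl))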
prediction %>%
ggplot(aes(Petal_Length, Petal_Width)) +
geom_point(aes(Petal_Width, Petal_Length,
col = factor(prediction + 1)),
size = 2, alpha = 0.5) +
geom_point(data = fit2$centers, aes(Petal_Width, Petal_Length),
col = scales::muted(c("red", "green", "blue")),
pch = 'x', size = 12) +
scale_color_discrete(name = "Predicted Cluster",
labels = paste("Cluster", 1:3)) +
labs(x = "Petal Length",
y = "Petal Width",
title = "K-Means Clustering",
These procedures cover the basics of big data analysis that a data
scientist needs to know as a beginner. We have an R notebook on
the book website that contains the contents of this chapter. We
also have a Python notebook on the book website.
state       division
Alabama     East South Central
Alaska      Pacific
Arizona     Mountain
Arkansas    West South Central
California  Pacific
The results from the above query return only one row, as expected. Sometimes we want to find aggregated values based on groups that can be defined by one or more columns. Instead of writing multiple SQL queries to calculate the aggregated value for each group, we can use GROUP BY in the SELECT statement to calculate the aggregated value for every group at once. For example, if we want to find how many states are in each division, we can use the following:
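The query itself is not shown in this extract; a hedged sketch, sent through the Spark connection with DBI (the table name division and its columns are taken from the surrounding text):

library(DBI)
# Count how many states fall into each division
dbGetQuery(sc, "
  SELECT division, COUNT(state) AS n_states
  FROM division
  GROUP BY division
")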
The database system is usually designed such that each table contains a specific piece of information, and we often need to JOIN multiple tables to achieve a specific task. There are a few typical types of JOINs: inner join (keep only rows that match the join condition from both tables), left outer join (rows from the inner join + unmatched rows from the first table), right outer join (rows from the inner join + unmatched rows from the second table), and full outer join (rows from the inner join + unmatched rows from both tables). The typical JOIN statement is illustrated below:
For example, let us join the division table and the metrics table to find the average population and income for each division, with the results ordered by division name:
group by division
order by division
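Only the tail of that query survives above; a complete hedged sketch follows (the metrics table and its population and income columns are assumptions based on the text):

# Average population and income by division, ordered by division name
dbGetQuery(sc, "
  SELECT d.division,
         AVG(m.population) AS avg_population,
         AVG(m.income)     AS avg_income
  FROM division d
  INNER JOIN metrics m
    ON d.state = m.state
  GROUP BY division
  ORDER BY division
")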
In real life, depending on the stage of data cleanup, data has the
following types:
1. Raw data
2. Technically correct data
3. Data that is proper for the model
4. Summarized data
5. Data with fixed format
Raw data is the first-hand data that analysts pull from the database, market survey responses from your clients, experimental results collected by the R & D department, and so on. These data may be very rough, and R sometimes can't read them directly. The table title could span multiple lines, or the format may not meet the requirements:
• 50% is used to represent the percentage rather than 0.5, so R will read it as a character;
• The missing values of sales are represented by "-" instead of being left blank, so R will treat the variable as a character or factor type;
• The data is in a slideshow document, or the spreadsheet is not ".csv" but ".xlsx";
• …
Most of the time, you need to clean the data so that R can import it. Some data formats require a specific package. Technically correct data is the data that, after preliminary cleaning or format conversion, R (or another tool you use) can successfully import.
Assume we have loaded the data into R with reasonable column names, variable formats, and so on. That does not mean the data is entirely correct. There may be observations that do not make sense, such as a negative age, a discount percentage greater than 1, or missing data. Depending on the situation, there may be a variety of problems with the data. It is necessary to clean the data before modeling. Moreover, different models have different requirements on the data. For example, some models may require the variables to be on a consistent scale, some may be susceptible to outliers or collinearity, some may not be able to handle categorical variables, and so on. The modeler has to preprocess the data to make it proper for the specific model.
Sometimes we need to aggregate the data. For example, add up
the daily sales to get annual sales of a product at different loca-
tions. In customer segmentation, it is common practice to build
a profile for each segment. It requires calculating some statistics
such as average age, average income, age standard deviation, etc.
Q3 Q4 Q5 Q6 Q7
Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
1st Qu.:1.00 1st Qu.:2.00 1st Qu.:1.75 1st Qu.:1.00 1st Qu.:2.50
Median :1.00 Median :3.00 Median :4.00 Median :2.00 Median :4.00
Mean :1.99 Mean :2.76 Mean :2.94 Mean :2.45 Mean :3.43
3rd Qu.:3.00 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.00
Max. :5.00 Max. :5.00 Max. :5.00 Max. :5.00 Max. :5.00
Q8 Q9 Q10 segment
Min. :1.0 Min. :1.00 Min. :1.00 Conspicuous:200
1st Qu.:1.0 1st Qu.:2.00 1st Qu.:1.00 Price :250
Median :2.0 Median :4.00 Median :2.00 Quality :200
Mean :2.4 Mean :3.08 Mean :2.32 Style :350
3rd Qu.:3.0 3rd Qu.:4.00 3rd Qu.:3.00
Max. :5.0 Max. :5.00 Max. :5.00
age store_exp
Min. :16.00 Min. : 155.8
1st Qu.:25.00 1st Qu.: 205.1
Median :36.00 Median : 329.8
Mean :38.58 Mean : 1358.7
3rd Qu.:53.00 3rd Qu.: 597.4
Max. :69.00 Max. :50000.0
NA's :1 NA's :1
survey. Therefore, there are not many papers on missing value imputation in the prediction model. Those who want to study further can refer to Saar-Tsechansky and Provost's comparison of different imputation methods (Saar-Tsechansky and Provost, 2007b) and De Waal, Pannekoek, and Scholtus' book (de Waal et al., 2011).
Since the missing values here are numeric, we can use the preProcess() function; the result is the same as with the impute() function. preProcess() is a powerful function that can link to a variety of data preprocessing methods. We will use the function later for other data preprocessing steps.
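A hedged sketch of both preProcess() uses discussed here, with caret; the column choices are assumptions for illustration.

library(caret)
# Median imputation for numeric columns with missing values
imp <- preProcess(sim.dat[, c("age", "store_exp")], method = "medianImpute")
demo_imp <- predict(imp, sim.dat[, c("age", "store_exp")])

# Centering and scaling with the same function; this produces the
# transformed data checked in the next paragraph
sca <- preProcess(sim.dat[, c("income", "age")],
                  method = c("center", "scale"))
transformed <- predict(sca, sim.dat[, c("income", "age")])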
Now the two variables are in the same scale. You can check the
result using summary(transformed). Note that there are missing val-
ues.
(Figure: density plots of X1 and X2.)
There are different transformations that may help remove skewness, such as the log, square root, or inverse transformations. However, it is often difficult to determine from plots which transformation is most appropriate for correcting skewness. The Box-Cox procedure automatically identifies a transformation from the family of power transformations indexed by a parameter λ (Box and Cox, 1964).
$$x^{*} = \begin{cases} \dfrac{x^{\lambda}-1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}$$
describe(sim.dat)
The last line of the output shows the estimates of 𝜆 for each vari-
able. As before, use predict() to get the transformed result:
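The preProcess() call that estimates λ is not reproduced here; a hedged sketch (the choice of columns is an assumption):

# Estimate Box-Cox lambdas and apply the transformation
trans <- preProcess(sim.dat[, c("store_trans", "online_trans")],
                    method = "BoxCox")
transformed <- predict(trans, sim.dat[, c("store_trans", "online_trans")])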
(Figure: histograms of store_trans before and after transformation.)
## Box-Cox Transformation
##
## 1000 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 4.00 5.35 7.00 20.00
##
## Largest/Smallest: 20
## Sample Skewness: 1.11
##
## Estimated Lambda: 0.1
## With fudge factor, Lambda = 0 will be used for transformations
## [1] -0.2155
(Figure: scatterplot matrix of age, income, store_exp, online_exp, store_trans, and online_trans.)
It is also easy to observe pairwise relationships from the plot: age is negatively correlated with online_trans but positively correlated with store_trans. It seems that older people tend to purchase from the local store. The amount of expense is positively correlated with income. A scatterplot matrix like this can reveal a lot of information before modeling.
In addition to visualization, there are some statistical methods to
define outliers, such as the commonly used Z-score. The Z-score
for variable Y is defined as:
$$Z_{i} = \frac{Y_{i} - \bar{Y}}{s}$$

A related, more robust measure is the modified Z-score, which uses the median absolute deviation (MAD):

$$M_{i} = \frac{0.6745\,(Y_{i} - \bar{Y})}{MAD}$$
## [1] 59
The spatial sign transformation projects each sample onto the unit sphere:

$$x_{ij}^{*} = \frac{x_{ij}}{\sqrt{\sum_{j=1}^{p} x_{ij}^{2}}}$$

where $x_{ij}$ represents the $i$th observation of the $j$th variable. As shown in the equation, every observation for sample $i$ is divided by its norm, and the denominator is the Euclidean distance to the center of the p-dimensional predictor space. Three things to pay attention to here:
# KNN imputation
sdat <- sim.dat[, c("income", "age")]
imp <- preProcess(sdat, method = c("knnImpute"), k = 5)
sdat <- predict(imp, sdat)
transformed <- spatialSign(sdat)
transformed <- as.data.frame(transformed)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
plot(income ~ age, data = sdat, col = "blue", main = "Before")
plot(income ~ age, data = transformed, col = "blue", main = "After")
(Figure: scatterplots of income against age before and after the spatial sign transformation.)
Some readers may have noticed that the above code does not seem to standardize the data before the transformation. Recall from the introduction of KNN that preProcess() with method = "knnImpute" standardizes the data by default.
5.6 Collinearity
Collinearity is probably the technical term known by the most non-technical people. When two predictors are very strongly correlated, including both in a model may lead to confusion or problems with a singular matrix. The corrplot package has an excellent function of the same name, corrplot(), that can visualize the correlation structure of a set of predictors. The function has an option to reorder the variables in a way that reveals clusters of highly correlated ones.
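A hedged sketch of the plot described below; the exact options (hierarchical ordering, ellipses in the upper triangle) follow the description but are assumptions.

library(corrplot)
# Numeric columns of the customer data
num_vars <- sim.dat[, c("age", "income", "store_exp", "online_exp",
                        "store_trans", "online_trans")]
# Coefficients in the lower triangle, ellipses in the upper triangle,
# variables reordered by hierarchical clustering
corrplot.mixed(cor(num_vars, use = "complete.obs"),
               order = "hclust", upper = "ellipse")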
(Figure: correlation plot from corrplot.mixed(), with correlation coefficients in the lower triangle and ellipses in the upper triangle; for example, the correlation between age and online_trans is -0.74.)
The closer the correlation is to 0, the lighter the color is and the closer the shape is to a circle. An elliptical shape means the correlation is not equal to 0 (because we set upper = "ellipse"); the greater the correlation, the narrower the ellipse. Blue represents a positive correlation and red a negative correlation, and the direction of the ellipse also changes with the correlation. The correlation coefficients are shown in the lower triangle of the matrix.

The variable relationships from the previous scatterplot matrix are clear here: the negative correlation between age and online shopping, and the positive correlation between income and the amount of purchasing. Some correlations are very strong (such as the correlation between online_trans and age, which is -0.7414), which means the two variables contain duplicated information.
Section 3.5 of Applied Predictive Modeling (Kuhn and Johnson, 2013) presents a heuristic algorithm to remove the minimum number of predictors to ensure that all pairwise correlations are below a certain threshold. The caret function findCorrelation() implements this heuristic.
## [1] 2 6
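The call that produced the output above is not reproduced in this extract; a hedged sketch with findCorrelation() (the cutoff and the column selection are assumptions):

# Indices of predictors to drop so the remaining pairwise correlations
# stay below the cutoff
num_vars <- sim.dat[, c("age", "income", "store_exp", "online_exp",
                        "store_trans", "online_trans")]
findCorrelation(cor(num_vars, use = "complete.obs"), cutoff = 0.7)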
# make a copy
zero_demo <- sim.dat
# add two sparse variables: zero1 has only one unique value;
# zero2 has 1 as its first element and 0 for all remaining elements
zero_demo$zero1 <- rep(1, nrow(zero_demo))
zero_demo$zero2 <- c(1, rep(0, nrow(zero_demo) - 1))
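The call that produces the output below is not shown here; presumably it is caret's nearZeroVar(), which flags the two sparse columns just added (a hedged sketch with the default thresholds):

# Return the indices of zero- and near-zero-variance columns
nearZeroVar(zero_demo, freqCut = 95/5, uniqueCut = 10)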
## [1] 20 21
## Female Male
## [1,] 1 0
## [2,] 1 0
## [3,] 0 1
## [4,] 0 1
## [5,] 0 1
## [6,] 0 1
dummyVars() can also use the formula format. The variables on the right-hand side can be both categorical and numeric. For a numerical variable, the function keeps the variable unchanged. The advantage is that you can apply the function to a data frame without removing the numerical variables. The function can also create interaction terms:
## 1 120963 0
## 2 122008 0
## 3 0 114202
## 4 0 113616
## 5 0 124253
## 6 0 107661
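The calls that produced the two outputs above are not reproduced here; a hedged sketch of how dummyVars() is typically used for both cases (the exact formulas are assumptions):

library(caret)
# One-hot encode the categorical variable gender
dmy <- dummyVars(~ gender, data = sim.dat)
head(predict(dmy, sim.dat))
# A formula can also create an interaction between a numeric and a
# categorical variable, splitting income by gender
dmy_int <- dummyVars(~ income:gender, data = sim.dat)
head(predict(dmy_int, sim.dat))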
This chapter focuses on some of the most frequently used data manipulations and shows how to implement them in R and Python. It is critical to explore the data with descriptive statistics (mean, standard deviation, etc.) and data visualization before analysis, to transform the data so that its structure is in line with the requirements of the model, and to summarize the results after the analysis.
When the data is too large to fit in a computer's memory, we can use a big data analytics engine such as Spark on a cloud platform (see Chapter 4). Even though the user interfaces of many data platforms are much friendlier now, it is still easier to manipulate the data as a local data frame. Spark's R and Python interfaces aim to keep the data manipulation syntax consistent with popular packages for local data frames. As shown in Section 4.4, we can run nearly all of the dplyr functions on a Spark data frame once the Spark environment is set up, and the Python interface pyspark uses a syntax similar to pandas. This chapter focuses on data manipulation of standard data frames, which is also the foundation of big data manipulation.
Even when the data can fit in memory, there may be situations where it is slow to read and manipulate because of its relatively large size. Some R packages can make the process faster at the cost of familiarity, especially for data wrangling, but they avoid the hurdle of setting up a Spark cluster and working in an unfamiliar environment. It is not a topic of this chapter, but Appendix A briefly introduces some alternative R packages for reading, writing, and wrangling a data set that is relatively large but not too big to fit in memory.
1. Display
2. Subset
3. Summarize
4. Create new variable
5. Merge
# Read data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
Display
• tbl_df(): Convert the data to a tibble, which offers better checking and printing capabilities than a traditional data frame. It adjusts the output width to fit the current window.
tbl_df(sim.dat)
glimpse(sim.dat)
Subset
Get rows with income more than 300000:
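The code for this subset is not included in this extract; a one-line hedged sketch:

# Keep only customers with income above 300,000
dplyr::filter(sim.dat, income > 300000)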
# summarize by segment, then keep segments with more than 200 customers
sim.dat %>%
  group_by(segment) %>%
  summarise(ave_online_exp = mean(online_exp),
            n = n()) %>%
  filter(n > 200)
dplyr::distinct(sim.dat)
dplyr::slice(sim.dat, 10:15)
It is equivalent to sim.dat[10:15,].
dplyr::top_n(sim.dat,2,income)
If you want to select columns instead of rows, you can use select(). The following is some sample code:
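The sample code is not included in this extract; a hedged example of selecting a few columns:

# Keep only the demographic columns
dplyr::select(sim.dat, age, gender, income, house)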
Summarize
A standard marketing problem is customer segmentation. It usually starts with designing a survey and collecting data, followed by a cluster analysis on the data to get customer segments. Once we have the segments, the next step is to understand what each group of customers looks like by summarizing some key metrics. For example, we can do the following data aggregation for the different segments of clothing customers.
(0.73). They are very likely to be digital natives and prefer online shopping.

You may notice that the Style group purchases more frequently online (online_trans), but the expense (online_exp) is not higher. It makes us wonder what the average expense per transaction is, so that we have a better idea about the price range of the group.

The analytical process is cumulative rather than a set of independent steps: the current step sheds new light on what to do next, and sometimes you need to go back to fix something in the previous steps. Let's check the average one-time online and in-store purchase amounts:
sim.dat %>%
group_by(segment) %>%
summarise(avg_online = round(sum(online_exp)/sum(online_trans), 2),
avg_store = round(sum(store_exp)/sum(store_trans), 2))
## # A tibble: 4 x 3
## segment avg_online avg_store
## <chr> <dbl> <dbl>
## 1 Conspicuous 442. 479.
## 2 Price 69.3 81.3
## 3 Quality 126. 105.
## 4 Style 92.8 121.
The Price group has the lowest average one-time purchase amount, and the Conspicuous group pays the highest. When we build customer profiles in real life, we also need to look at the survey summarization. You may be surprised how much information simple data manipulations can provide.
Another common task is to check which columns have missing values. It requires the program to look at each column in the data. In this case, you can use summarise_all():
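A hedged sketch of counting missing values in every column with summarise_all():

sim.dat %>%
  dplyr::summarise_all(~ sum(is.na(.)))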
In this case, you will apply a window function to the columns and return a column of the same length. mutate() can do it for you and append one or more new columns:
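A hedged reconstruction of the mutate() call described next:

# Append a new column with the total expense across the two channels
sim.dat <- sim.dat %>%
  dplyr::mutate(total_exp = store_exp + online_exp)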
The above code sums up two columns and appends the result (total_exp) to sim.dat. Another similar function is transmute(). The difference is that transmute() deletes the original columns and keeps only the new ones.
Merge
Similar to SQL, there are different joins in dplyr. We create two small data sets to show how the functions work.
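The code that builds the two data sets and runs the joins is not reproduced here; a hedged reconstruction consistent with the outputs below:

x <- data.frame(ID = c("A", "B", "C"), x1 = 1:3)
y <- data.frame(ID = c("B", "C", "D"), y1 = c(TRUE, TRUE, FALSE))
x
y
# Keep all rows of x and add matching columns from y
dplyr::left_join(x, y, by = "ID")
# Keep only rows present in both tables
dplyr::inner_join(x, y, by = "ID")
# Keep all rows from both tables
dplyr::full_join(x, y, by = "ID")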
## ID x1
## 1 A 1
## 2 B 2
## 3 C 3
## ID y1
## 1 B TRUE
## 2 C TRUE
## 3 D FALSE
## ID x1 y1
## 1 A 1 <NA>
## 2 B 2 TRUE
## 3 C 3 TRUE
## ID x1 y1
## 1 B 2 TRUE
## 2 C 3 TRUE
## ID x1 y1
## 1 A 1 <NA>
## 2 B 2 TRUE
## 3 C 3 TRUE
## 4 D <NA> FALSE
## simulate a matrix
x <- cbind(x1 =1:8, x2 = c(4:1, 2:5))
dimnames(x)[[1]] <- letters[1:8]
apply(x, 2, mean)
## x1 x2
## 4.5 3.0
## [[1]]
##
## 1 3 7
## 2 1 1
##
## [[2]]
##
## 2 4 6 8
## 1 1 1 1
## [,1] [,2]
## 0% 1 2.0
## 25% 1 3.5
## 50% 2 5.0
## 75% 4 6.5
## 100% 7 8.0
Results can have different lengths for each call. This is a trickier
example. What will you get?
The data frame sdat includes only numeric columns. Now we can go ahead and use apply() to get the mean and standard deviation of each column:
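A hedged sketch of the apply() call described above; the definition of sdat (the numeric columns of sim.dat) is an assumption.

sdat <- sim.dat[, c("age", "income", "store_exp", "online_exp",
                    "store_trans", "online_trans")]
# Column-wise mean and standard deviation, ignoring missing values
apply(sdat, 2, function(col) c(mean = mean(col, na.rm = TRUE),
                               sd = sd(col, na.rm = TRUE)))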
Even though the average online expense is higher than the store expense, the standard deviation of store expense is much higher than that of online expense, which indicates there are very likely some unusually large or small purchases in store. We can check quickly:
summary(sdat$store_exp)
summary(sdat$online_exp)
There are some odd values in store expense. The minimum value is -500, which indicates that you should preprocess the data before analyzing it. Checking those simple statistics helps you better understand your data and gives you some idea of how to preprocess and analyze it. How about using lapply() and sapply()? Run the following code and compare the results:
sdat <- sim.dat[1:5, 1:6]
sdat
For the above data sdat, what if we want to reshape the data to have a column indicating the purchasing channel (i.e., store_exp or online_exp) and a second column with the corresponding expense amount, while keeping the rest of the columns the same? It is a task of changing data from "wide" to "long".
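A hedged sketch of that wide-to-long reshape using tidyr (the new column names Channel and Expense are assumptions):

library(tidyr)
# Stack store_exp and online_exp into a channel/amount pair of columns
msdat <- gather(sdat, Channel, Expense, store_exp, online_exp)
msdat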
data = msdat)
summary(fit)
## # A tibble: 2 x 4
## # Groups: house [1]
## house gender total_online_exp total_store_exp
## <chr> <chr> <dbl> <dbl>
## 1 Yes Female 413. 1007.
## 2 Yes Male 533. 1218.
The above code also uses functions from the dplyr package introduced in the previous section. Here we use the package::function notation to make the package name clear; it is not necessary if the package is already loaded.
Another pair of functions that do opposite manipulations are separate() and unite().
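The separate() call itself is not included in this extract; a hedged reconstruction using the msdat data frame created above:

# Split the Channel column (e.g., "store_exp") into Source and Type
sepdat <- msdat %>%
  tidyr::separate(Channel, c("Source", "Type"))
sepdat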
You can see that the function separates the original column "Channel" into two new columns, "Source" and "Type". You can use sep = to set the string or regular expression used to separate the column. By default, it is "_".
The unite() function will do the opposite: combining two columns.
It is the generalization of paste() to a data frame.
sepdat %>%
unite("Channel", Source, Type, sep = "_")
$$\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\epsilon} \quad (7.1)$$
situation, the algorithm that penetrates the data the most wins. There are other similar problems, such as self-driving cars, chess games, facial recognition, and speech recognition. But in most data science applications, the information is incomplete. If you want to know whether a customer is going to purchase again or not, it is unlikely that you have a 360-degree view of that customer. You may have their historical purchasing record, discounts, and service received. But you don't know whether the customer saw your advertisement, had a friend recommend a competitor's product, or encountered an unhappy purchasing experience somewhere. There could be a myriad of factors that influence the customer's purchase decision, while what you have as data is only a small part. To make things worse, in many cases you don't even know what you don't know. Deep learning doesn't have any advantage in solving those problems. Instead, some parametric models often work better in this situation. You will comprehend this more after learning about the different types of model error. Assume we have $\hat{f}$, an estimator of $f$. Then we can further get $\hat{\mathbf{y}} = \hat{f}(\mathbf{X})$.
The prediction error is divided into two parts, systematic error and random error:

$$E(\mathbf{y} - \hat{\mathbf{y}})^2 = E[f(\mathbf{X}) + \boldsymbol{\epsilon} - \hat{f}(\mathbf{X})]^2 = \underbrace{E[f(\mathbf{X}) - \hat{f}(\mathbf{X})]^2}_{(1)} + \underbrace{Var(\boldsymbol{\epsilon})}_{(2)} \quad (7.2)$$
This is also called the Mean Squared Error (MSE), where (1) is the systematic error. It exists because $\hat{f}$ usually does not entirely describe the "systematic relation" between $\mathbf{X}$ and $\mathbf{y}$, which refers to the stable relationship that exists across different samples or over time. Model improvement can help reduce this kind of error. (2) is the random error, which represents the part of $\mathbf{y}$ that cannot be explained by $\mathbf{X}$; a more complex model does not reduce this error. There are three reasons for random error:
The systematic error $E[f(\mathbf{X}) - \hat{f}(\mathbf{X})]^2$ can be further decomposed as:

$$
\begin{aligned}
E\left(f(\mathbf{X}) - E[\hat{f}(\mathbf{X})] + E[\hat{f}(\mathbf{X})] - \hat{f}(\mathbf{X})\right)^2
&= E\left(E[\hat{f}(\mathbf{X})] - f(\mathbf{X})\right)^2 + E\left(\hat{f}(\mathbf{X}) - E[\hat{f}(\mathbf{X})]\right)^2 \quad (7.3) \\
&= [Bias(\hat{f}(\mathbf{X}))]^2 + Var(\hat{f}(\mathbf{X}))
\end{aligned}
$$
source('http://bit.ly/2KeEIg9')
# randomly simulate some non-linear samples
x = seq(1, 10, 0.01) * pi
e = rnorm(length(x), mean = 0, sd = 0.2)
fx <- sin(x) + e + sqrt(x)
dat = data.frame(x, fx)
(Figures: the simulated data, fx plotted against x, with fitted curves overlaid.)
The resulting plot (Fig. 7.3) indicates that the smoothing method fits the data much better and has a much smaller bias. However, the method has a high variance: if we simulate different subsets of the sample, the resulting curve changes significantly:
# sample 1
idx1 = sample(1:length(x), 100)
dat1 = data.frame(x1 = x[idx1], fx1 = fx[idx1])
p1 = ggplot(dat1, aes(x1, fx1)) +
  geom_smooth(span = 0.03) + geom_point()
# sample 2
idx2 = sample(1:length(x), 100)
dat2 = data.frame(x2 = x[idx2], fx2 = fx[idx2])
p2 = ggplot(dat2, aes(x2, fx2)) +
geom_smooth(span = 0.03) +
geom_point()
# sample 3
idx3 = sample(1:length(x), 100)
dat3 = data.frame(x3 = x[idx3], fx3 = fx[idx3])
p3 = ggplot(dat3, aes(x3, fx3)) +
geom_smooth(span = 0.03) +
geom_point()
# sample 4
idx4 = sample(1:length(x), 100)
dat4 = data.frame(x4 = x[idx4], fx4 = fx[idx4])
p4 = ggplot(dat4, aes(x4, fx4)) +
geom_smooth(span = 0.03) +
geom_point()
(Figure: smoothing fits on the four random subsets; the fitted curves differ noticeably across samples.)
The fitted lines (blue) change over the different samples, which means the fit has high variance; people also call this overfitting. Fitting the linear model on the same four subsets, the result barely changes:
(Figure: linear fits on the same four subsets; the fitted lines barely change.)
If the future data is not exactly like the past data, the model prediction may be far off. A simple model like ordinary linear regression instead tends to underfit, which leads to bad predictions by learning too little from the data; it systematically over-predicts or under-predicts the data regardless of how well future data resemble past data. Without evaluating models, the modeler will not know about the problem before seeing future samples. Data splitting and resampling are fundamental techniques for building sound models for prediction.
may need to increase the training set. If the sample size is small, you can use cross-validation or the bootstrap, which are the topics of the next section.
The next decision is which samples go into the test set. There is a desire to make the training and test sets as similar as possible. A simple way is to split the data by random sampling, which, however, does not control for any of the data attributes, such as the percentage of retained customers in the data. So it is possible that the distribution of outcomes is substantially different between the training and test sets. There are three main ways to split the data that account for the similarity of the resulting data sets. We will describe the three approaches using the clothing company customer data as an example.
# load data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
library(caret)
# set the random seed to ensure reproducibility
set.seed(3456)
trainIndex <- createDataPartition(sim.dat$segment,
p = 0.8,
list = FALSE,
times = 1)
head(trainIndex)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 6
## [6,] 7
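A hedged sketch of using trainIndex to create the two sets (the object names datTrain and datTest follow the code below):

datTrain <- sim.dat[trainIndex, ]
datTest <- sim.dat[-trainIndex, ]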
According to the setting, there are 800 samples in the training set and 200 in the test set. Let's check the distribution of the two sets:
datTrain %>%
dplyr::group_by(segment) %>%
dplyr::summarise(count = n(),
percentage = round(length(segment)/nrow(datTrain), 2))
## # A tibble: 4 x 3
## segment count percentage
## <chr> <int> <dbl>
## 1 Conspicuous 160 0.2
## 2 Price 200 0.25
## 3 Quality 160 0.2
## 4 Style 280 0.35
datTest %>%
dplyr::group_by(segment) %>%
dplyr::summarise(count = n(),
percentage = round(length(segment)/nrow(datTrain), 2))
## # A tibble: 4 x 3
## segment count percentage
## <chr> <int> <dbl>
## 1 Conspicuous 40 0.05
## 2 Price 50 0.06
## 3 Quality 40 0.05
## 4 Style 70 0.09
The segment proportions are preserved across the two sets (note that the test-set percentages above are divided by nrow(datTrain); relative to the test set itself they are 0.2, 0.25, 0.2, and 0.35). In practice, the distributions may not be exactly identical, but they should be close.
library(lattice)
# select variables
testing <- subset(sim.dat, select = c("age", "income"))
set.seed(5)
# select 5 random samples
startSet <- sample(1:dim(testing)[1], 5)
start <- testing[startSet, ]
# save the rest in data frame 'samplePool'
samplePool <- testing[-startSet, ]
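The maximum dissimilarity sampling call referenced next is not reproduced in this extract; a hedged reconstruction (the object name newSamp is hypothetical):

# Select 5 new samples from samplePool that are maximally dissimilar to
# the initial set, using minimum dissimilarity as the group distance
newSamp <- maxDissim(start, samplePool, obj = minDiss, n = 5)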
The obj = minDiss argument in the above code tells R to use minimum dissimilarity to define the distance between groups. Next, randomly select 5 samples from samplePool into the data frame RandomSet:
(Figure: the initial set, maximum dissimilarity sampling, and random sampling shown in the age-income space.)
(Figure: 100 simulated observations from an AR(1) process with φ = -0.9.)
Fig. 7.6 shows 100 simulated time series observations. The goal is to make sure both the training and test sets cover the whole period.
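The code that creates the time slices is not reproduced here; a hedged reconstruction whose window sizes are inferred from the printed indices below:

# Rolling-origin splits: 36 observations to train, the next 12 to test
timeSlices <- createTimeSlices(1:length(timedata), initialWindow = 36,
                               horizon = 12, fixedWindow = TRUE)
str(timeSlices, max.level = 1)
trainSlices <- timeSlices[["train"]]
testSlices <- timeSlices[["test"]]
trainSlices[[1]]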
## List of 2
## $ train:List of 53
## $ test :List of 53
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36
testSlices[[1]]
## [1] 37 38 39 40 41 42 43 44 45 46 47 48
7.2.2 Resampling
You can consider resampling as repeated splitting. The basic idea is: use part of the data to fit the model and then use the rest of the data to calculate model performance. Repeat the process multiple times and aggregate the results. The differences among resampling techniques usually center on the way subsamples are chosen. There are two main reasons that we may need resampling:
Denote by $x_{\kappa i}$ the predictors for the samples in the left-out fold $\kappa$. The process of k-fold cross-validation is as follows:

1: Randomly partition the data into k folds of roughly equal size
2: for κ = 1 to k do
3:   Fit the model using all samples not in fold κ
4:   Predict the samples in fold κ and record the performance
5: end for
6: Aggregate the results
library(caret)
class <- sim.dat$segment
# create k-folds
set.seed(1)
cv <- createFolds(class, k = 10, returnTrain = T)
str(cv)
## List of 10
## $ Fold01: int [1:900] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold02: int [1:900] 1 2 3 4 5 6 7 9 10 11 ...
## $ Fold03: int [1:900] 1 2 3 4 5 6 7 8 10 11 ...
## $ Fold04: int [1:900] 1 2 3 4 5 6 7 8 9 11 ...
## $ Fold05: int [1:900] 1 3 4 6 7 8 9 10 11 12 ...
## $ Fold06: int [1:900] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold07: int [1:900] 2 3 4 5 6 7 8 9 10 11 ...
## $ Fold08: int [1:900] 1 2 3 4 5 8 9 10 11 12 ...
## $ Fold09: int [1:900] 1 2 4 5 6 7 8 9 10 11 ...
## $ Fold10: int [1:900] 1 2 3 5 6 7 8 9 10 11 ...
The above code creates ten folds (k = 10) according to the customer segments (we set class to be the categorical variable segment). The function returns a list of 10 elements containing the row indices of the training set for each fold.
Once you know how to split the data, repeating the process comes naturally.
The apparent error rate is the error rate when the data is used twice, both to fit the model and to check its accuracy, and it is apparently over-optimistic. The modified bootstrap estimate reduces the bias but can be unstable with small sample sizes. This estimate can also be unduly optimistic when the model severely over-fits, since the apparent error rate will then be close to zero. Efron and Tibshirani (Efron and Tibshirani, 1997) discuss another technique, called the "632+ method," for adjusting the bootstrap estimates.
8
Measuring Performance
# install any required packages that are not yet installed
# (p_to_install is assumed to be a character vector of missing package names)
if (length(p_to_install) > 0) {
  install.packages(p_to_install)
}
estimated values and the actual values. The Root Mean Squared Error (RMSE) is the square root of the MSE.

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
Both are common measures of regression model performance. Let's use the previous income prediction as an example and fit a simple linear model:
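A sketch of the fit that produces the summary below (the object name fit is an assumption; the formula matches the Call shown in the output):

fit <- lm(income ~ store_exp + online_exp + store_trans + online_trans,
          data = sim.dat)
summary(fit)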
##
## Call:
## lm(formula = income ~ store_exp + online_exp + store_trans +
## online_trans, data = sim.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128768 -15804 441 13375 150945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85711.680 3651.599 23.47 < 2e-16 ***
## store_exp 3.198 0.475 6.73 3.3e-11 ***
## online_exp 8.995 0.894 10.06 < 2e-16 ***
## store_trans 4631.751 436.478 10.61 < 2e-16 ***
## online_trans -1451.162 178.835 -8.11 1.8e-15 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31500 on 811 degrees of freedom
## (184 observations deleted due to missingness)
## Multiple R-squared: 0.602,Adjusted R-squared: 0.6
## F-statistic: 306 on 4 and 811 DF, p-value: <2e-16
The fitted object fit shows the RMSE is 31,530 (reported, rounded, as the Residual standard error near the bottom of the output).
Another common performance measure for regression models is R-squared, often denoted as $R^2$. It is the square of the correlation between the fitted values and the observed values. It is often interpreted as the percentage of the information in the data that the model can explain. The above model returns an R-squared of 0.6, which indicates the model can explain 60% of the variance in the variable income. While $R^2$ is easy to explain, it is not a direct measure of model accuracy; it measures correlation. Here the $R^2$ value is not low, yet the RMSE is $3.153\times10^4$, which means the average difference between the model fit and the observation is about 31,530. That can be a big discrepancy from an application point of view. When the response variable has a large scale and high variance, a high $R^2$ doesn't mean the model is accurate enough. It is also important to remember that $R^2$ depends on the variation of the outcome variable: if the data has a response variable with higher variance, a model based on it tends to have a higher $R^2$.
We used $R^2$ to show the impact of the error from independent and response variables in Chapter 7, where we didn't consider the impact of the number of parameters (because the number of parameters was very small compared to the number of observations). However, $R^2$ increases as the number of parameters increases, so people usually use the Adjusted R-squared, which is designed to mitigate this issue. The original $R^2$ is defined as:

$$R^2 = 1 - \frac{RSS}{TSS}$$

where $RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ and $TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2$.
Since RSS always decreases as the number of parameters increases, $R^2$ always increases with more parameters. The Adjusted R-squared corrects for the number of parameters:

$$Adjusted\ R^2 = 1 - \frac{RSS/(n-p-1)}{TSS/(n-1)}$$

Another commonly used criterion is Mallows' $C_p$:

$$C_p = \frac{1}{n}\left(RSS + 2p\hat{\sigma}^2\right)$$
The R functions AIC() and BIC() calculate the AIC and BIC values from a fitted model object.
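For example, applied to the linear model fitted earlier (assuming the fitted object is named fit):

AIC(fit)  # Akaike information criterion
BIC(fit)  # Bayesian information criterion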
8.2 Classification Model Performance
Separate the data into training and testing sets. Fit the model using the training data (xTrain and yTrain) and apply the trained model to the testing data (xTest and yTest) to evaluate model performance. We use 80% of the sample for training and the remaining 20% for testing.
set.seed(100)
# separate the data to be training and testing
trainIndex <- createDataPartition(disease_dat$y, p = 0.8,
list = F, times = 1)
xTrain <- disease_dat[trainIndex, ] %>% dplyr::select(-y)
xTest <- disease_dat[-trainIndex, ] %>% dplyr::select(-y)
# the response variable needs to be a factor
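# the following lines are an assumed completion of the split (not shown above):
yTrain <- as.factor(disease_dat[trainIndex, ]$y)
yTest <- as.factor(disease_dat[-trainIndex, ]$y)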
Apply the trained random forest model to the testing data to get
two types of predictions:
• probability (a value between 0 and 1)
## 0 1
## 47 0.831 0.169
## 101 0.177 0.823
## 196 0.543 0.457
## 258 0.858 0.142
## 274 0.534 0.466
## 369 0.827 0.173
## 389 0.852 0.148
## 416 0.183 0.817
## 440 0.523 0.477
## 642 0.836 0.164
• category prediction (0 or 1)
## 146 232 269 302 500 520 521 575 738 781
## 0 0 1 0 0 0 1 0 0 0
## Levels: 0 1
The confusion matrix of the predicted class (yhat) versus the observed class (yTest) is:

##     yTest
## yhat  1  0
##    1 56  1
##    0 15 88
$$Total\ accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
You can calculate the total accuracy when there are more than
two categories. This statistic is straightforward but has some dis-
advantages. First, it doesn’t differentiate different error types. In
a real application, different types of error may have different im-
pacts. For example, it is much worse to tag an important email as
spam and miss it than failing to filter out a spam email. Provost
et al. (Provost F, 1998) discussed in detail the problem of using
total accuracy on different classifiers. There are some other met-
rics based on the confusion matrix that measure different types of
error.
Precision measures how accurate the positive predictions are (i.e., among the emails predicted as spam, what percentage are actually spam?):

$$precision = \frac{TP}{TP + FP}$$
$$Sensitivity = \frac{TP}{TP + FN}$$

$$Specificity = \frac{TN}{TN + FP}$$
Because different types of error have different impacts, in the spam email case we want to make sure the model's specificity is high enough.
Second, total accuracy doesn’t reflect the natural frequencies of
each class. For example, the percentage of fraud cases for insurance
may be very low, like 0.1%. A model can achieve nearly perfect
accuracy (99.9%) by predicting all samples to be negative. The
percentage of the largest class in the training set is also called the
no-information rate. In this example, the no-information rate is
99.9%. You need to get a model that at least beats this rate.
$$Kappa = \frac{P_0 - P_e}{1 - P_e}$$

where $P_0$ is the observed accuracy and $P_e$ is the accuracy expected by chance.
Kappa        Agreement
< 0          Less than chance agreement
0.01–0.20    Slight agreement
0.21–0.40    Fair agreement
0.41–0.60    Moderate agreement
0.61–0.80    Substantial agreement
0.81–0.99    Almost perfect agreement
# install.packages("fmsb")
kt<-fmsb::Kappa.test(table(yhat,yTest))
kt$Result
##
## Estimate Cohen's kappa statistics and test the
## null hypothesis that the extent of agreement is
## same as random (kappa=0)
##
## data: table(yhat, yTest)
## Z = 9.7, p-value <2e-16
## 95 percent confidence interval:
## 0.6972 0.8894
## sample estimates:
## [1] 0.7933
kt$Judgement
8.2.3 ROC
Receiver Operating Characteristic (ROC) curve uses the predicted
class probabilities and determines an effective threshold such that
values above the threshold are indicative of a specific event. We
have shown the definitions of sensitivity and specificity above. The
sensitivity is the true positive rate and specificity is true negative
rate. “1 - specificity” is the false positive rate. ROC is a graph
of pairs of true positive rate (sensitivity) and false positive rate
(1-specificity) values that result as the test’s cutoff value is varied.
The Area Under the Curve (AUC) is a common measure for two-class problems. There is usually a trade-off between sensitivity and specificity. If the threshold is set lower, then more samples are predicted as positive and hence the sensitivity is higher. Let's
look at the predicted probability yhatprob in the swine disease ex-
ample. The predicted probability object yhatprob has two columns,
one is the predicted probability that a farm will have an outbreak,
the other is the probability that farm will NOT have an outbreak.
So the two add up to have value 1. We use the probability of out-
break (the 2nd column) for further illustration. You can use roc()
function to get an ROC object (rocCurve) and then apply different
functions on that object to get needed plot or ROC statistics. For
example, the following code produces the ROC curve:
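(A sketch of creating the ROC object first, assuming the pROC package and the outbreak probabilities in the second column of yhatprob:)

library(pROC)
rocCurve <- roc(response = yTest, predictor = yhatprob[, 2])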
plot(1-rocCurve$specificities,
rocCurve$sensitivities,
type = 'l',
xlab = '1 - Specificities',
ylab = 'Sensitivities')
(Figure: the ROC curve, with sensitivity on the y-axis and 1 - specificity on the x-axis.)
The first argument of roc() is response, the vector of observed outcomes. The second argument, predictor, is the continuous prediction (probability or link function value). The x-axis of the ROC curve is
“1 - specificity” and the y-axis is “sensitivity”. ROC curve starts
from (0, 0) and ends with (1, 1). A perfect model that correctly
identifies all the samples will have 100% sensitivity and specificity
which corresponds to the curve that also goes through (0, 1). The
area under the perfect curve is 1. A model that is totally useless
corresponds to a curve that is close to the diagonal line and an
area under the curve about 0.5.
You can visually compare different models by putting their ROC curves on one plot, or use the AUC to compare them. DeLong et al. came up with a statistical test to compare AUC based on U-statistics (E.R. DeLong, 1988), which can give a p-value and confidence interval. You can also use the bootstrap to get a confidence interval for the AUC (Hall P, 2004).
We can use the following code in R to get an estimate of AUC and
its confidence interval:
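A sketch, assuming the pROC package and the rocCurve object from above:

auc(rocCurve)     # point estimate of the AUC
ci.auc(rocCurve)  # confidence interval for the AUC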
table(yTest)
## yTest
## 1 0
## 71 89
(Figure: gain chart, with the percentage of samples found on the y-axis and the percentage of samples tested on the x-axis.)
9
Regression Models
$$f(\mathbf{X}) = \mathbf{X}\boldsymbol{\beta} = \beta_0 + \sum_{j=1}^{p}\mathbf{x}_{.j}\beta_j$$

$$RSS(\beta) = \sum_{i=1}^{N}(y_i - f(\mathbf{x}_{i.}))^2 = \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\right)^2$$
Before fitting the model, we need to clean the data, such as re-
moving bad data points that are not logical (negative expense).
To fit a linear regression model, let us first check if there are any
missing values or outliers:
(Figure: histogram of total_exp, with frequency on the y-axis.)
y <- modeldat$total_exp
# Find data points with Z-score larger than 3.5
zs <- (y - mean(y))/mad(y)
modeldat <- modeldat[-which(zs > 3.5), ]
(Figure: pairwise correlation plot of the survey questions Q1-Q10, showing strong positive and negative correlations among several of the questions.)
(3) if all the variables in the dataset except the response vari-
able are included in the model, we can use . at the right
side of ~
(4) if we want to consider the interaction between two vari-
ables such as Q1 and Q2, we can add an interaction term
Q1*Q2
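For example (a sketch, assuming the cleaned data frame modeldat from above):

# use all other variables as predictors
lm(log(total_exp) ~ ., data = modeldat)
# include Q1, Q2, and their interaction
lm(log(total_exp) ~ Q1 * Q2, data = modeldat)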
##
## Call:
## lm(formula = log(total_exp) ~ ., data = modeldat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1749 -0.1372 0.0128 0.1416 0.5623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.09831 0.05429 149.18 < 2e-16 ***
## Q1 -0.14534 0.00882 -16.47 < 2e-16 ***
## Q2 0.10228 0.01949 5.25 2.0e-07 ***
## Q3 0.25445 0.01835 13.87 < 2e-16 ***
## Q6 -0.22768 0.01152 -19.76 < 2e-16 ***
## Q8 -0.09071 0.01650 -5.50 5.2e-08 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.226 on 805 degrees of freedom
## Multiple R-squared: 0.854,Adjusted R-squared: 0.853
confint(lmfit,level=0.9)
## 5 % 95 %
## (Intercept) 8.00892 8.18771
## Q1 -0.15987 -0.13081
## Q2 0.07018 0.13437
## Q3 0.22424 0.28466
## Q6 -0.24665 -0.20871
## Q8 -0.11787 -0.06354
Q-Q Plot is used to check the normality assumption for the resid-
ual. For normally distributed residuals, the data points should fol-
low a straight line along the Q-Q plot. The more departure from
a straight line, the more departure from a normal distribution for
the residual.
(Figure: regression diagnostic plots (Residuals vs Fitted, Normal Q-Q, Cook's distance), with observations 155, 678, and 960 flagged.)
## Q1 Q2 Q3 Q6 Q8 total_exp
## 155 4 2 1 4 4 351.9
## 678 2 1 1 1 2 1087.3
## 960 2 1 1 1 3 658.3
It is not easy to see from the above output why those records are outliers. It becomes clear once we condition on the independent variables (Q1, Q2, Q3, Q6, and Q8). Let us examine the value of total_exp for samples with the same Q1, Q2, Q3, Q6, and Q8 answers as the 3rd row above, as sketched below.
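A sketch of how such a subset could be pulled (the object name datcheck and the use of dplyr::filter are assumptions; the answer values come from the 3rd row above):

datcheck <- modeldat %>%
  dplyr::filter(Q1 == 2 & Q2 == 1 & Q3 == 1 & Q6 == 1 & Q8 == 3)
nrow(datcheck)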
## [1] 87
summary(datcheck$total_exp)
variables and income as the response variable. First load and preprocess the data:
library(lattice)
library(caret)
library(dplyr)
library(elasticnet)
library(lars)
# Load Data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
ymad <- mad(na.omit(sim.dat$income))
# Calculate Z values
zs <- (sim.dat$income - mean(na.omit(sim.dat$income)))/ymad
# which(na.omit(zs>3.5)) find outlier
# which(is.na(zs)) find missing values
idex <- c(which(na.omit(zs > 3.5)), which(is.na(zs)))
# Remove rows with outlier and missing values
sim.dat <- sim.dat[-idex, ]
set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)
From the result, we can see that the optimal number of variables is
(Figure: variable importance plot; Q6, Q2, Q3, and Q1 rank above the other questions.)
The above plot shows that Q1, Q2, Q3, and Q6 are more important than the other variables. Now let's fit a PCR model with the number of principal components as the hyper-parameter:
(Figure: cross-validated RMSE versus the number of components for PCR and PLS.)
The plot confirms our choice of using a model with three compo-
nents for both PLS and PCR.
10
Regularization Methods
library(NetlifyDS)
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2 \quad (10.1)$$
You can fit a ridge regression with the lm.ridge() function from MASS or the enet() function from elasticnet. If you know the value of $\lambda$, you can use either function to fit the ridge regression. A more convenient way is to use the train() function from caret. Let's use the 10 survey questions to predict the total purchase amount (the sum of online and store purchases), as sketched below.
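A minimal sketch of the tuning call behind the output below (the grid, seed, and the objects trainx/trainy holding the questions and the purchase amount are assumptions):

# candidate values of the ridge penalty
ridgeGrid <- data.frame(lambda = seq(0, 0.1, length = 20))
set.seed(100)
ridgeRegTune <- train(trainx, trainy,
                      method = "ridge",
                      tuneGrid = ridgeGrid,
                      preProc = c("center", "scale"),
                      trControl = trainControl(method = "cv", number = 10))
ridgeRegTune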
## Ridge Regression
##
## 999 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 899, 899, 899, 899, 899, 900, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000 1744 0.7952 754.0
## 0.005263 1744 0.7954 754.9
## 0.010526 1744 0.7955 755.9
## 0.015789 1744 0.7955 757.3
## 0.021053 1745 0.7956 758.8
## 0.026316 1746 0.7956 760.6
## 0.031579 1747 0.7956 762.4
## 0.036842 1748 0.7956 764.3
## 0.042105 1750 0.7956 766.4
## 0.047368 1751 0.7956 768.5
## 0.052632 1753 0.7956 770.6
## 0.057895 1755 0.7956 772.7
## 0.063158 1757 0.7956 774.9
## 0.068421 1759 0.7956 777.2
## 0.073684 1762 0.7956 779.6
## 0.078947 1764 0.7955 782.1
## 0.084211 1767 0.7955 784.8
## 0.089474 1769 0.7955 787.6
## 0.094737 1772 0.7955 790.4
## 0.100000 1775 0.7954 793.3
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was lambda
## = 0.005263.
The results show that the best value of $\lambda$ is 0.005 and the corresponding RMSE and $R^2$ are 1744 and 0.7954. You can see from Figure 10.1 that as $\lambda$ increases, the RMSE first decreases slightly and then increases.
plot(ridgeRegTune)
FIGURE 10.1: Test mean squared error for the ridge regression
Once you have the tuning parameter value, there are different functions to fit a ridge regression. Let's look at how to use enet() in the elasticnet package.
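A sketch of the fit (trainx/trainy and the use of the tuned penalty value from above are assumptions):

library(elasticnet)
ridgefit <- enet(x = as.matrix(trainx), y = trainy,
                 lambda = 0.005263, normalize = TRUE)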
Note that ridgefit only assigns the value of the tuning parameter for ridge regression. Since the elastic net model includes both the ridge and lasso penalties, we need to use the predict() function to get the model fit. You can get the fitted results by setting s = 1 and mode = "fraction"; here s = 1 means we only use the ridge penalty. We will come back to this when we get to lasso regression.
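A sketch of getting the fitted values (the object name ridgePred is an assumption):

ridgePred <- predict(ridgefit, newx = as.matrix(trainx),
                     s = 1, mode = "fraction", type = "fit")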
By setting type = "fit", the above returns a list object. The fit
item has the predictions:
names(ridgePred)
head(ridgePred$fit)
## 1 2 3 4 5 6
## 1290.5 224.2 591.4 1220.6 853.4 908.2
ridgeCoef<-predict(ridgefit,newx = as.matrix(trainx),
s=1, mode="fraction", type="coefficients")
It also returns a list and the estimates are in the coefficients item:
10.2 LASSO
Even though ridge regression shrinks the parameter estimates towards 0, it won't shrink any estimate to exactly 0, which means it includes all predictors in the final model. So it can't select variables. That may not be a problem for prediction, but it is a huge disadvantage if you want to interpret the model, especially when the number of variables is large. A popular alternative to the ridge penalty is the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996).
Similar to ridge regression, the lasso adds a penalty. The lasso coefficients $\hat{\beta}_{\lambda}^{L}$ minimize the following:

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j| \quad (10.2)$$
The only difference between lasso and ridge is the penalty. In sta-
tistical parlance, ridge uses 𝐿2 penalty (𝛽𝑗2 ) and lasso uses 𝐿1
penalty (|𝛽𝑗 |). 𝐿1 penalty can shrink the estimates to 0 when 𝜆 is
big enough. So lasso can be used as a feature selection tool. It is
a huge advantage because it leads to a more explainable model.
Similar to other models with tuning parameters, lasso regression requires cross-validation to tune the parameter. You can use train() in a similar way as shown in the ridge regression section. To tune the parameter, we need to set up cross-validation and the parameter range. It is also advised to standardize the predictors:
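A minimal sketch (the fraction grid is an assumption chosen to cover the selected value):

lassoGrid <- data.frame(fraction = seq(0.8, 1, length = 20))
set.seed(100)
lassoTune <- train(trainx, trainy,
                   method = "lasso",
                   tuneGrid = lassoGrid,
                   preProc = c("center", "scale"),
                   trControl = trainControl(method = "cv", number = 10))
lassoTune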
The results show that the best value of the tuning parameter (fraction in the output) is 0.957, and the RMSE and $R^2$ are 1742 and 0.7954. The performance is nearly the same as ridge regression. You can see from Figure 10.2 that as the fraction increases, the RMSE first decreases and then increases slightly.
plot(lassoTune)
FIGURE 10.2: Test mean squared error for the lasso regression
Once you select a value for the tuning parameter, there are different functions to fit a lasso regression, such as lars() in lars, enet() in elasticnet, and glmnet() in glmnet. They all have very similar syntax; a sketch using enet() follows.
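(A sketch; lambda = 0 removes the ridge penalty so only the lasso penalty remains, and s is set to the tuned fraction. The object names are assumptions.)

lassoModel <- enet(x = as.matrix(trainx), y = trainy,
                   lambda = 0, normalize = TRUE)
lassoFit <- predict(lassoModel, newx = as.matrix(trainx),
                    s = 0.957, mode = "fraction", type = "fit")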
Again by setting type = "fit", the above returns a list object. The
fit item has the predictions:
head(lassoFit$fit)
## 1 2 3 4 5 6
## 1357.3 300.5 690.2 1228.2 838.4 1010.1
It also returns a list and the estimates are in the coefficients item:
etc. This algorithm works well for lasso regression especially when
the dimension is high.
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j| \quad (10.3)$$

$$\min_{\beta}\left\{\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2\right\} \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le s \quad (10.4)$$

For any value of the tuning parameter $\lambda$, there exists an $s$ such that the coefficient estimates that optimize equation (10.3) also optimize equation (10.4). Similarly, for ridge regression, the two representations are equivalent:

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2 \quad (10.5)$$

$$\min_{\beta}\left\{\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2\right\} \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le s \quad (10.6)$$
When $p = 2$, the lasso estimates ($\hat{\beta}_{lasso}$ in figure 10.3) have the smallest RSS among all points that satisfy $|\beta_1| + |\beta_2| \le s$ (i.e. within the diamond in figure 10.3). The ridge estimates have the smallest RSS among all points that satisfy $\beta_1^2 + \beta_2^2 \le s$ (i.e. within the circle in figure 10.3). As $s$ increases, the diamond and circle regions expand and become less restrictive. If $s$ is large enough, the restricted region will cover the least squares estimate ($\hat{\beta}_{LSE}$), and equations (10.4) and (10.6) will simply yield the least squares estimate.
Figure 10.3 illustrates the situation for the lasso (left) and ridge (right). The least squares estimate is marked as $\hat{\beta}$. The restricted regions are in grey: the diamond region is for the lasso and the circle region is for the ridge. The least squares estimates lie outside the grey regions, so they differ from the lasso and ridge estimates. The ellipses centered around $\hat{\beta}$ represent contours of RSS: all the points on a given ellipse share the same RSS, and the RSS increases as the ellipses expand. The lasso and ridge estimates are the first points at which an expanding ellipse touches the grey region. Since ridge regression has a circular restricted region without sharp points, the point of contact cannot fall exactly on an axis. But it is possible for the lasso, since its region has corners on the axes; when the contact point is on an axis, one of the parameter estimates is exactly 0. If p > 2, the ridge region becomes a sphere or hypersphere, while the lasso region becomes a polytope whose corners lie on the axes; in that case, when the contact point falls on an axis or a lower-dimensional face, multiple coefficient estimates can equal 0 simultaneously.
10.4 Elastic Net
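The elastic net combines the ridge and lasso penalties. A minimal sketch of tuning it with caret (the grid values and object names are assumptions):

enetGrid <- expand.grid(lambda = c(0, 0.01, 0.1),
                        fraction = seq(0.8, 1, length = 20))
set.seed(100)
enetTune <- train(trainx, trainy,
                  method = "enet",
                  tuneGrid = enetGrid,
                  preProc = c("center", "scale"),
                  trControl = trainControl(method = "cv", number = 10))
enetTune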
Elasticnet
999 samples
10 predictor
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.9579 and lambda = 0.
The results show that the best values of the tuning parameters are fraction = 0.9579 and lambda = 0. This also indicates that the final model is lasso only (the ridge penalty parameter lambda is 0). The RMSE and $R^2$ are 1742.2843 and 0.7954, respectively.
$$\min_{\beta_0,\boldsymbol{\beta}}\ \frac{1}{N}\sum_{i=1}^{N} w_i\, l(y_i, \beta_0 + \boldsymbol{\beta}^{T}\mathbf{x}_i) + \lambda\left[(1-\alpha)\|\boldsymbol{\beta}\|_2^2/2 + \alpha\|\boldsymbol{\beta}\|_1\right]$$

where

$$l(y_i, \beta_0 + \boldsymbol{\beta}^{T}\mathbf{x}_i) = -\log[\mathcal{L}(y_i, \beta_0 + \boldsymbol{\beta}^{T}\mathbf{x}_i)]$$
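A sketch of the fit that the following plot is based on (the data objects are assumptions; glmnet's default setting corresponds to the lasso penalty):

library(glmnet)
glmfit <- glmnet(as.matrix(trainx), trainy)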
plot(glmfit, label = T)
(Figure: coefficient paths plotted against the L1 norm; the top axis shows the number of nonzero coefficients.)
Each curve in the plot represents one predictor. The default setting is $\alpha = 1$, which means there is only the lasso penalty. From left to right, the $L_1$ norm is increasing, which means $\lambda$ is decreasing. The bottom x-axis is the $L_1$ norm (i.e. $\|\boldsymbol{\beta}\|_1$). The upper x-axis is the effective degrees of freedom (df) for the lasso. You can check the details of every step by:
print(glmfit)
Df %Dev Lambda
1 0 0.000 3040
2 2 0.104 2770
3 2 0.192 2530
4 2 0.265 2300
5 3 0.326 2100
6 3 0.389 1910
7 3 0.442 1740
8 3 0.485 1590
9 3 0.521 1450
...
The first column Df is the degrees of freedom (i.e. the number of non-zero coefficients), %Dev is the percentage of deviance explained, and Lambda is the value of the tuning parameter $\lambda$. By default, the function
will try 100 different values of 𝜆. However, if as 𝜆 changes, the %Dev
doesn’t change sufficiently, the algorithm will stop before it goes
through all the values of 𝜆. We didn’t show the full output above.
But it only uses 68 different values of 𝜆. You can also set the value
of 𝜆 using s= :
coef(glmfit, s = 1200)
## Q5 .
## Q6 .
## Q7 .
## Q8 .
## Q9 .
## Q10 .
## 1 2
## [1,] 6004 5968
## [2,] 7101 6674
## [3,] 9158 8411
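The cross-validated fit plotted below can be obtained with cv.glmnet(); a sketch (object names are assumptions):

set.seed(100)
cvfit <- cv.glmnet(as.matrix(trainx), trainy)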
We can plot the object using plot(). The red dotted line is the
cross-validation curve. Each red point is the cross-validation mean
squared error for a value of 𝜆. The grey bars around the red points
indicate the upper and lower standard deviation. The two gray dot-
ted vertical lines represent the two selected values of 𝜆, one gives
the minimum mean cross-validated error (lambda.min), the other
gives the error that is within one standard error of the minimum
(lambda.1se).
plot(cvfit)
(Figure: cross-validated mean-squared error versus log(λ); the top axis shows the number of nonzero coefficients.)
## [1] 12.57
## [1] 1200
## Q2 653.7
## Q3 624.5
## Q4 .
## Q5 .
## Q6 .
## Q7 .
## Q8 .
## Q9 .
## Q10 .
$$y_i \sim Bernoulli(\theta_i)$$

$$\log\left(\frac{\theta_i}{1-\theta_i}\right) = \eta_{\boldsymbol{\beta}}(x_i) = \beta_0 + \sum_{g=1}^{G}\mathbf{x}_{i,g}^{T}\boldsymbol{\beta}_{g}$$
library(MASS)
dat <- read.csv("http://bit.ly/2KXb1Qi")
fit <- glm(y~., dat, family = "binomial")
levels(as.factor(trainy))
So the model predicts the probability of outcome "1". Take a baby example of 3 observations and 2 values of $\lambda$ to show the usage of the predict() function:

newdat = as.matrix(trainx[1:3, ])
predict(fit, newdat, type = "link", s = c(2.833e-02, 3.110e-02))

##         1       2
## 1  0.1943  0.1443
## 2 -0.9913 -1.0077
## 3 -0.5841 -0.5496

The first column of the above output is the predicted link function value when $\lambda = 0.02833$. The second column is the predicted link function value when $\lambda = 0.0311$. Similarly, you can change the setting for type to produce different outputs. You can use the cv.glmnet() function to tune parameters. The parameter setting is nearly the same as before; the only difference is the setting of type.measure. Since the response is categorical, not continuous, we have different performance measurements. The most common settings of type.measure for classification are:

• class: error rate
• auc: the area under the ROC curve for the dichotomous problem
For example:
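A sketch of the cross-validation call behind the figure below (object names are assumptions):

set.seed(100)
cvfit <- cv.glmnet(as.matrix(trainx), trainy,
                   family = "binomial",
                   type.measure = "class",
                   nfolds = 10)
plot(cvfit)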
(Figure: cross-validated misclassification error versus log(λ).)
The above uses the error rate as the performance criterion and 10-fold cross-validation. Similarly, you can get the $\lambda$ value for the minimum error rate and the error rate that is within 1 standard error of the minimum:
cvfit$lambda.min
## [1] 2.643e-05
cvfit$lambda.1se
## [1] 0.003334
You can get the parameter estimates and make predictions in the same way as before.
way dummy variables are encoded. Thus the group lasso (Yuan
and Lin, 2007) method has been proposed to enable variable se-
lection in linear regression models on groups of variables, instead
of on single variables. For logistic regression models, the group
lasso algorithm was first studied by Kim et al. (Y. Kim and Kim,
2006). They proposed a gradient descent algorithm to solve the
corresponding constrained problem, which does, however, depend
on unknown constants. Meier et al. (L Meier and Buhlmann, 2008)
proposed a new algorithm that could work directly on the penal-
ized problem and its convergence property does not depend on
unknown constants. The algorithm is especially suitable for high-
dimensional problems. It can also be applied to solve the corre-
sponding convex optimization problem in generalized linear mod-
els. The group lasso estimator proposed by Meier et al. (L Meier
and Buhlmann, 2008) for logistic regression has been shown to be
statistically consistent, even with a large number of categorical
predictors.
In this section, we illustrate how to use the logistic group lasso
algorithm to construct risk scoring systems for predicting disease.
Instead of maximizing the log-likelihood in the maximum likeli-
hood method, the logistic group lasso estimates are calculated by
minimizing the convex function:
$$S_\lambda(\boldsymbol{\beta}) = -l(\boldsymbol{\beta}) + \lambda\sum_{g=1}^{G} s(df_g)\,\|\boldsymbol{\beta}_g\|_2$$
$\{0.96\lambda_{max}, 0.96^2\lambda_{max}, 0.96^3\lambda_{max}, \dots, 0.96^{100}\lambda_{max}\}$. Here $\lambda_{max}$ is defined as

$$\lambda_{max} = \max_{g\in\{1,\dots,G\}}\left\{\frac{1}{s(df_g)}\left\|\mathbf{x}_g^{T}(\mathbf{y} - \bar{\mathbf{y}})\right\|_2\right\}, \quad (10.8)$$
such that when 𝜆 = 𝜆𝑚𝑎𝑥 , only the intercept is in the model. When
𝜆 goes to 0, the model is equivalent to ordinary logistic regression.
Three criteria may be used to select the optimal value of $\lambda$. One is the AUC, which you have seen many times in this book by now. The log-likelihood score used in Meier et al. (L Meier and Buhlmann, 2008) is taken as the average of the log-likelihood of the validation data over all cross-validation sets. Another is the maximum correlation coefficient of Yeo and Burge (Yeo and Burge, 2004).
devtools::install_github("netlify/NetlifyDS")
library("NetlifyDS")
The package includes the swine disease breakout data and you can
load the data by:
data("sim1_da1")
Dummy variables from the same question are in the same group:
index[1:50]
...
$ auc : num [1:100] 0.573 0.567 0.535 ...
$ log_likelihood : num [1:100] -554 -554 -553 ...
$ maxrho : num [1:100] -0.0519 0.00666 ...
$ lambda.max.auc : Named num [1:2] 0.922 0.94
..- attr(*, "names")= chr [1:2] "lambda" "auc"
$ lambda.1se.auc : Named num [1:2] 16.74 0.81
..- attr(*, "names")= chr [1:2] "" "se.auc"
$ lambda.max.loglike: Named num [1:2] 1.77 -248.86
..- attr(*, "names")= chr [1:2] "lambda" "loglike"
$ lambda.1se.loglike: Named num [1:2] 9.45 -360.13
..- attr(*, "names")= chr [1:2] "lambda" "se.loglike"
$ lambda.max.maxco : Named num [1:2] 0.922 0.708
..- attr(*, "names")= chr [1:2] "lambda" "maxco"
$ lambda.1se.maxco : Named num [1:2] 14.216 0.504
..- attr(*, "names")= chr [1:2] "lambda" "se.maxco"
plot(cv_fit)
(Figure: cross-validated AUC versus the tuning parameter λ.)
The x-axis is the value of the tuning parameter and the y-axis is the AUC. The two dashed lines mark the value of $\lambda$ that gives the maximum AUC and the value within one standard deviation of the maximum. Once you choose the value of the tuning parameter, you can use fitglasso() to fit the model. For example, we can fit the model using the parameter value that gives the maximum AUC, which is $\lambda = 0.922$:
coef(fitgl)
0.922
Intercept -5.318e+01
Q1.A 1.757e+00
Q1.B 1.719e+00
Q2.A 2.170e+00
Q2.B 6.939e-01
Q3.A 2.102e+00
Q3.B 1.359e+00
...
11
Tree-Based Methods
CART can refer to the tree model in general, but most of the time,
it represents the algorithm initially proposed by Breiman (Breiman
et al., 1984). After Breiman, there are many new algorithms, such
as ID3, C4.5, and C5.0. C5.0 is an improved version of C4.5, but
since C5.0 is not open source, the C4.5 algorithm is more popular.
C4.5 was a major competitor of CART. But now, all those seem
outdated. The most popular tree models are Random Forest (RF)
and Gradient Boosting Machine (GBM). Although the basic tree algorithm is out of favor in applications, it is important to understand its mechanism because the later models are built on the same foundation.
The original CART algorithm targets binary classification, and the
later algorithms can handle multi-category classification. A single
tree is easy to explain but has poor accuracy. More complicated
tree models, such as RF and GBM, can provide much better pre-
diction at the cost of explainability. As the model becomes more complicated, it behaves more like a black box, which makes it difficult to explain the relationships among predictors. There is always
a trade-off between explainability and predictability.
The reason why it is called a "tree" is, of course, that the structure of the model resembles a tree. For classification trees, a commonly used splitting criterion is the Gini index, defined as:

$$Gini\ Index = \sum_i p_i(1-p_i)$$

For a two-class problem, the Gini index of a node is $p_1(1-p_1) + p_2(1-p_2)$.
It is easy to see that when the sample set is pure, one of the probabilities is 0 and the Gini score is at its smallest. Conversely, when $p_1 = p_2 = 0.5$, the Gini score is at its largest, in which case the purity of the node is the smallest. Let's look at an example. Suppose
we want to determine which students are computer science (CS)
majors. Here is the simple hypothetical classification tree result
obtained with the gender variable.
The Gini impurity for the node “Gender” is the following weighted
average of the above two scores:
$$\frac{3}{5}\times\frac{5}{18} + \frac{2}{5}\times 0 = \frac{1}{6}$$
purity, B is the second, and A has the lowest purity. We need less
information to describe nodes with higher purity.
So entropy decreases from 1 to 0.39 after the split and the IG for
“Gender” is 0.61.
Information Gain Ratio (IGR)
ID3 uses information gain as the splitting criterion to train the
classification tree. A drawback of information gain is that it is
biased towards choosing attributes with many values, resulting in
overfitting (selecting a feature that is non-optimal for prediction)
(HSSINA et al., 2014).
To understand why, let's look at another hypothetical scenario. Assume that the training set has students' birth month as a feature.
You might say that the birth month should not be considered in
this case because it intuitively doesn’t help tell the student’s ma-
jor. Yes, you’re right. However, practically, we may have a much
more complicated dataset, and we may not have such intuition
for all the features. So, we may not always be able to determine
whether a feature makes sense or not. If we use the birth month
to split the data, the corresponding entropy of the mode “Birth
Month” is 0.24 (the sum of column “Weighted Entropy” in the ta-
ble), and the information gain is 0.76, which is larger than the IG
of “Gender” (0.61). So between the two features, IG would choose
“Birth Month” to split the data.
$$Gain\ Ratio = \frac{Information\ Gain}{Split\ Information}$$

where the split information is the entropy of the split proportions,

$$Split\ Information = -\sum_i P(v_i)\log_2 P(v_i)$$

and $P(v_i)$ is the fraction of samples falling in branch $i$.
The split information for the birth month is 3.4, and the gain ratio is 0.22, which is smaller than that of gender (0.63). So the gain ratio prefers gender as the splitting feature rather than the birth month. The gain ratio favors attributes with fewer categories and leads to better generalization (less overfitting).
In equation (11.1), 𝑦1̄ and 𝑦2̄ are the average of the sample in 𝑆1
and 𝑆2 . The way regression tree grows is to automatically decide
on the splitting variables and split points that can maximize SSE
reduction. Since this process is essentially a recursive segmenta-
tion, this approach is also called recursive partitioning.
Take a look at this simple regression tree for the height of 10
students:
SSE for the 10 students in root node is 522.9. After the split, SSE
decreases from 522.9 to 168.
11.3 Tree Pruning
There are several ways to limit how far a tree grows:

• Minimum sample size at a node: requiring a minimum sample size at each node helps prevent leaf nodes with only one sample. The sample size can be a tuning parameter. If it is too large, the model tends to under-fit; if it is too small, the model tends to over-fit. In the case of severe class imbalance, the minimum sample size may need to be smaller because the number of samples in a particular class is small.
• Maximum depth of the tree: If the tree grows too deep, the model
tends to over-fit. It can be a tuning parameter.
• Maximum number of terminal nodes: Limit on the terminal
nodes works the same as the limit on the depth of the tree.
They are proportional.
• The number of variables considered for each split: the algorithm
randomly selects variables used in finding the optimal split point
at each level. In general, the square root of the number of all
variables works best, which is also the default setting for many
functions. However, people often treat it as a tuning parameter.
Remove branches
Another way is to first let the tree grow as much as possible and
then go back to remove insignificant branches. The process reduces
the depth of the tree. The idea is to overfit the training set and then
correct using cross-validation. There are different implementations.
• cost/complexity penalty
The idea is that the pruning minimizes the penalized error $SSE_\lambda = SSE + \lambda\times(\text{number of terminal nodes})$ for a given value of the tuning parameter $\lambda$.
You train a complete tree using the subset (1) and apply the tree
on the subset (2) to calculate the accuracy. Then prune the tree
based on a node and apply that on the subset (2) to calculate
another accuracy. If the accuracy after pruning is higher or equal
to that from the complete tree, then we set the node as a terminal
node. Otherwise, keep the subtree under the node. The advantage
of this method is that it is easy to compute. However, when the
size of the subset (2) is much smaller than that of the subset (1),
there is a risk of over-pruning. Some researchers found that this
method results in more accurate trees than pruning process based
on tree size (F. Espoito and Semeraro, 1997).
• Error-complexity pruning
This method is to search for a trade-off between error and com-
plexity. Assume we have a splitting node 𝑡, and the corresponding
subtree $T$. The error cost of the node is defined as:

$$R(t) = r(t)\times p(t)$$

where $r(t)$ is the misclassification rate at node $t$ and $p(t)$ is the ratio of the node's sample size to the total sample size. The multiplication $r(t)\times p(t)$ cancels out the sample size of the node. If we keep node $t$, the error cost of the subtree $T$ is the sum of the error costs of its leaves, denoted $R(T)_t$. The error-complexity measure of the node is:

$$a(t) = \frac{R(t) - R(T)_t}{no.\ of\ leaves - 1}$$
point 𝑡, all the samples under 𝑡 will be classified as from one cat-
egory, say category 𝑐. If we prune the subtree, the expected error
rate is:
$$E(t) = \frac{n_t - n_{t,c} + k - 1}{n_t + k}$$

where:

$k$ = number of categories
$n_t$ = sample size under node $t$
$n_{t,c}$ = number of samples under node $t$ that belong to category $c$
The sample average for region 𝑅1 is 163, for region 𝑅2 is 176. For
a new observation, if it is female, the model predicts the height to
be 163, if it is male, the predicted height is 176. Calculating the
mean is easy. Let’s look at the first step in more detail which is to
divide the space into 𝑅1 , 𝑅2 , … , 𝑅𝐽 .
In theory, the region can be any shape. However, to simplify the
problem, we divide the predictor space into high-dimensional rect-
angles. The goal is to divide the space in a way that minimize RSS.
Practically, it is nearly impossible to consider all possible parti-
tions of the feature space. So we use an approach named recursive
binary splitting, a top-down , greedy algorithm. The process starts
from the top of the tree (root node) and then successively splits
the predictor space. Each split produces two branches (hence bi-
nary). At each step of the process, it chooses the best split at that
particular step, rather than looking ahead and picking a split that
leads to a better tree in general (hence greedy).
Calculate the RSS decrease after the split. For different $(j, s)$, search for the combination that minimizes the RSS, that is, minimize the following:

$$\sum_{i:\, x_i\in R_1(j,s)}(y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i\in R_2(j,s)}(y_i - \hat{y}_{R_2})^2$$
(Figure: cross-validated RMSE profile for the regression tree tuning parameter.)
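One way the tree inspected below could have been produced (a sketch; the method, tuning length, and object names are assumptions):

set.seed(100)
rpartTune <- train(sim.dat[, paste0("Q", 1:10)], sim.dat$total_exp,
                   method = "rpart2",   # tune over maximum tree depth
                   tuneLength = 10,
                   trControl = trainControl(method = "cv", number = 10))
rpartTree <- rpartTune$finalModel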
print(rpartTree)
## n= 999
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 999 1.581e+10 3479.0
## 2) Q3< 3.5 799 2.374e+09 1819.0
## 4) Q5< 1.5 250 3.534e+06 705.2 *
## 5) Q5>=1.5 549 1.919e+09 2326.0 *
## 3) Q3>=3.5 200 2.436e+09 10110.0 *
You can see that the final model picks Q3 and Q5 to predict total expenditure. To visualize the tree, you can convert the rpart object to a party object using partykit and then use the plot() function:
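For example (a sketch):

plot(partykit::as.party(rpartTree))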
(Figure: the fitted regression tree, with Q3 at the root node.)
When fitting tree models, people need to choose the way to treat
categorical predictors. If you know some of the categories have
higher predictability, then the first approach may be better. In
the rest of this section, we will build tree models using the above
two approaches and compare them.
Let’s build a classification model to identify the gender of the
customer. The population is relatively balanced (55.4% male and
44.6% female).
the model. We can compare the model results from the two ap-
proaches:
CART
1000 samples
11 predictor
2 classes: 'Female', 'Male'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 901, 899, 900, 900, 901, 900, ...
Resampling results across tuning parameters:
......
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.00835.
plot.roc(rpartRoc,
type = "s",
print.thres = c(.5),
print.thres.pch = 3,
print.thres.pattern = "",
print.thres.cex = 1.2,
col = "red", legacy.axes = TRUE,
print.thres.col = "red")
plot.roc(rpartFactorRoc,
type = "s",
add = TRUE,
print.thres = c(.5),
print.thres.pch = 16, legacy.axes = TRUE,
print.thres.pattern = "",
print.thres.cex = 1.2)
240 11 Tree-Based Methods
legend(.75, .2,
c("Grouped Categories", "Independent Categories"),
lwd = c(1, 1),
col = c("black", "red"),
pch = c(16, 3))
(Figure: ROC curves for the grouped-categories and independent-categories models.)
A single tree has two main drawbacks:

1. Low accuracy
2. Instability: a small change in the training data can lead to a very different tree.
Then fit the model using the train() function in the caret package. Here we just set the number of trees to 1000; you can tune that parameter.
set.seed(100)
bagTune <- caret::train(trainx, trainy,
method = "treebag",
nbagg = 1000,
metric = "ROC",
trControl = trainControl(method = "cv",
summaryFunction = twoClassSummary,
classProbs = TRUE,
savePredictions = TRUE))
bagTune
## Bagged CART
##
## 1000 samples
## 11 predictor
## 2 classes: 'Female', 'Male'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 901, 899, 900, 900, 901, 900, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7093 0.6533 0.6774
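The random forest below can be tuned similarly; a sketch (the mtry grid and number of trees are assumptions):

set.seed(100)
rfTune <- train(trainx, trainy,
                method = "rf",
                ntree = 1000,
                tuneGrid = data.frame(mtry = 1:5),
                metric = "ROC",
                trControl = trainControl(method = "cv",
                                         summaryFunction = twoClassSummary,
                                         classProbs = TRUE,
                                         savePredictions = TRUE))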
rfTune
## Random Forest
##
## 1000 samples
## 11 predictor
## 2 classes: 'Female', 'Male'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 899, 900, 900, 899, 899, 901, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.7169 0.5341 0.8205
## 2 0.7137 0.6334 0.7175
## 3 0.7150 0.6478 0.6995
## 4 0.7114 0.6550 0.6950
## 5 0.7092 0.6514 0.6882
##
## ROC was used to select the optimal model using
## the largest value.
## The final value used for the model was mtry = 1.
Since the bagged tree is a special case of random forest, you can fit a bagged tree by setting $mtry = p$. The importance() function returns the importance of each predictor, as sketched below:
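(rfit below is assumed to be a forest fit directly with the randomForest package, for example:)

library(randomForest)
set.seed(100)
# dat is assumed to contain gender, Q1-Q10, and segment
rfit <- randomForest(gender ~ ., data = dat, ntree = 1000)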
importance(rfit)
## MeanDecreaseGini
## Q1 9.056
## Q2 7.582
## Q3 7.611
## Q4 12.308
## Q5 5.628
## Q6 9.740
## Q7 6.638
## Q8 7.829
## Q9 5.955
## Q10 4.781
## segment 11.185
varImpPlot(rfit)
(Figure: variable importance plot for rfit; segment and Q4 have the largest mean decrease in Gini.)
It is easy to see from the plot that segment and Q4 are the top two
variables to classify gender.
$$\overline{err} = \frac{1}{N}\sum_{i=1}^{N} I(y_i \neq G(x_i))$$
The algorithm produces a series of classifiers 𝐺𝑚 (𝑥), 𝑚 =
1, 2, ..., 𝑀 from different iterations. In each iteration, it finds the
best classifier based on the current weights. The misclassified sam-
ples in the 𝑚𝑡ℎ iteration will have higher weights in the (𝑚+1)𝑡ℎ it-
eration and the correctly classified samples will have lower weights.
As it moves on, the algorithm will put more effort into the “diffi-
cult” samples until it can correctly classify them. So it requires the
algorithm to change focus at each iteration. At each iteration, the
algorithm will calculate a stage weight based on the error rate. The
final prediction is a weighted average of all those weak classifiers
using stage weights from all the iterations:
$$G(x) = sign\left(\sum_{m=1}^{M}\alpha_m G_m(x)\right)$$
Algorithm 5 AdaBoost.M1

1: The response variable has two values: +1 and -1
2: Initialize the observations to have the same weights: $w_i = \frac{1}{N},\ i = 1, \dots, N$
3: for m = 1 to M do
4:   Fit a classifier $G_m(x)$ to the training data using weights $w_i$
5:   Compute the error rate of the $m$th classifier: $err_m = \frac{\sum_{i=1}^{N} w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$
6:   Compute the stage weight for the $m$th iteration: $\alpha_m = \log\frac{1-err_m}{err_m}$
7:   Update the weights: $w_i \leftarrow w_i\, e^{\alpha_m I(y_i \neq G_m(x_i))},\ i = 1, \dots, N$
8: end for
9: Output $G(x) = sign\left(\sum_{m=1}^{M}\alpha_m G_m(x)\right)$
set is $p$. $f(x)$ is the model prediction in the range $(-\infty, +\infty)$ and the predicted event probability is $\hat{p} = \frac{1}{1+\exp[-f(x)]}$. Gradient boosting for this problem iteratively fits trees to the gradient of the corresponding loss.
When using the tree as the base learner, basic gradient boosting
has two tuning parameters: tree depth and the number of iter-
ations. You can further customize the algorithm by selecting a
different loss function and gradient(Hastie T, 2008). The final line
of the loop includes a regularization strategy. Instead of adding
(𝑗)
𝑓𝑖 to the previous iteration’s 𝑓𝑖 , only a fraction of the value is
added. This fraction is called learning rate which is 𝜆 in the algo-
rithm. It can take values between 0 and 1 which is another tuning
parameter of the model.
The way to calculate variable importance in boosting is similar
to a bagging model. You get variable importance by combining
measures of importance across the ensemble. For example, we can
calculate the Gini index improvement for each variable across all
trees and use the average as the measurement of the importance.
Boosting is a very popular method for classification. It is one of the
methods that can be directly applied to the data without requir-
ing a great deal of time-consuming data preprocessing. Applying
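The tuning grid gbmGrid used below is not shown above; one possible grid consistent with the selected values (an assumption) is:

gbmGrid <- expand.grid(n.trees = 1:10,
                       interaction.depth = c(1, 3, 5),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = c(6, 10))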
set.seed(100)
gbmTune <- caret::train(x = trainx,
y = trainy,
method = "gbm",
tuneGrid = gbmGrid,
metric = "ROC",
verbose = FALSE,
                        trControl = trainControl(method = "cv",
                                                 # twoClassSummary is needed so ROC can be the metric
                                                 summaryFunction = twoClassSummary,
                                                 classProbs = TRUE,
                                                 savePredictions = TRUE))
1000 samples
11 predictor
2 classes: 'Female', 'Male'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 899, 900, 900, 899, 899, 901, ...
Resampling results across tuning parameters:
ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 4,
interaction.depth = 3, shrinkage = 0.01 and n.minobsinnode = 6.
The results show that the tuning parameter settings that lead to
the best ROC are n.trees = 4 (number of trees), interaction.depth =
3 (depth of tree), shrinkage = 0.01 (learning rate) and n.minobsinnode
= 6 (minimum number of observations in each node).
Now, let’s compare the results from the three tree models.
plot.roc(rpartRoc,
type = "s",
print.thres = c(.5), print.thres.pch = 16,
print.thres.pattern = "", print.thres.cex = 1.2,
col = "black", legacy.axes = TRUE,
print.thres.col = "black")
plot.roc(treebagRoc,
type = "s",
add = TRUE,
print.thres = c(.5), print.thres.pch = 3,
legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "red", print.thres.col = "red")
plot.roc(rfRoc,
type = "s",
add = TRUE,
print.thres = c(.5), print.thres.pch = 1,
legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "green", print.thres.col = "green")
plot.roc(gbmRoc,
type = "s",
add = TRUE,
print.thres = c(.5), print.thres.pch = 10,
legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "blue", print.thres.col = "blue")
(Figure: ROC curves comparing the single tree, bagged tree, random forest, and boosted tree models.)
Since the data here doesn’t have many variables, we don’t see
a significant difference among the models. But you can still see
those ensemble methods are better than a single tree. In most
of the real applications, ensemble methods perform much better.
Random forest and boosted trees can serve as baseline models. Before exploring different models, you can quickly run a random forest to see the performance and then try to improve on it. If the performance you get from the random forest is not much better than guessing, you should consider collecting more data or reframing the problem instead of trying other models, because it usually means the current data is not enough to solve the problem.
$$G(x) = sign\left(\sum_{m=1}^{M}\alpha_m G_m(x)\right)$$

$$f(x) = \sum_{m=1}^{M}\beta_m b(x; \gamma_m) \quad (11.2)$$

$$\min_{\{\beta_m,\gamma_m\}_1^M}\ \sum_{i=1}^{N} L\left(y_i, \sum_{m=1}^{M}\beta_m b(x_i; \gamma_m)\right) \quad (11.3)$$

$$\min_{\beta,\gamma}\ \sum_{i=1}^{N} L(y_i, \beta b(x_i; \gamma))$$

$$(\beta_m, \gamma_m) = \arg\min_{\beta,\gamma}\ \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma))$$
on the previous basis function 𝑓𝑚−1 (𝑥). And then add the new
basis 𝑏(𝑥; 𝛾𝑚 )𝛽𝑚 to the previous basis function to get a new basis
function 𝑓𝑚 (𝑥) without changing any parameters from previous
steps. Assume we use the squared-error loss, $L(y, f(x)) = (y - f(x))^2$. Then we have:

$$L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)) = (y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma))^2$$
where 𝑦𝑖 − 𝑓𝑚−1 (𝑥𝑖 ) is the residual for sample i based on the pre-
vious model. That is to say, it is fitting the new basis using the
residual of the previous step.
However, the squared-error loss is generally not a good choice. For
a regression problem, it is susceptible to outliers. It doesn’t fit the
classification problem since the response is categorical. Hence we
often consider other loss functions.
Now let us go back to the AdaBoost.M1 algorithm. It is actually a special case of the above forward stagewise model when the loss function is the exponential loss $L(y, f(x)) = \exp(-y f(x))$ and the base learner at each iteration is a classifier $G_m(x) \in \{-1, 1\}$. If we use the exponential loss, the optimization problem is
$$\begin{array}{rl}
(\beta_m, G_m) & = \arg\min_{\beta,G}\sum_{i=1}^{N}\exp[-y_i(f_{m-1}(x_i) + \beta G(x_i))] \\
 & = \arg\min_{\beta,G}\sum_{i=1}^{N}\exp[-y_i\beta G(x_i)]\cdot \exp[-y_i f_{m-1}(x_i)] \\
 & = \arg\min_{\beta,G}\sum_{i=1}^{N} w_i^{m}\exp[-y_i\beta G(x_i)]
\end{array} \quad (11.4)$$
where $w_i^m = \exp[-y_i f_{m-1}(x_i)]$. It does not depend on $\beta$ or $G(x)$, so we can consider it as the weight for each sample. Since the weight depends on $f_{m-1}(x_i)$, it changes at each iteration. We can further decompose equation (11.4):
$$\begin{array}{l}
\arg\min_{\beta,G}\sum_{i=1}^{N} w_i^{m}\exp[-y_i\beta G(x_i)] \\
= \arg\min_{\beta,G}\sum_{i=1}^{N}\left\{ w_i^{m} e^{-\beta} I(y_i = G(x_i)) + w_i^{m} e^{\beta} I(y_i \neq G(x_i))\right\} \\
= \arg\min_{\beta,G}\sum_{i=1}^{N}\left\{ w_i^{m} e^{-\beta}[1 - I(y_i \neq G(x_i))] + w_i^{m} e^{\beta} I(y_i \neq G(x_i))\right\} \\
= \arg\min_{\beta,G}\left\{(e^{\beta} - e^{-\beta})\sum_{i=1}^{N} w_i^{m} I(y_i \neq G(x_i)) + e^{-\beta}\sum_{i=1}^{N} w_i^{m}\right\}
\end{array} \quad (11.5)$$
When $\beta > 0$, the solution of equation (11.5) is:

$$G_m = \arg\min_{G}\sum_{i=1}^{N} w_i^{m} I(y_i \neq G(x_i))$$

$$\beta_m = \frac{1}{2}\ln\frac{1 - err_m}{err_m}$$

where

$$err_m = \frac{\sum_{i=1}^{N} w_i^{m} I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i^{m}}$$
We can go ahead and get the weight for the next iteration:

$$w_i^{m+1} = w_i^{m} e^{-\beta_m y_i G_m(x_i)} = w_i^{m} e^{\alpha_m I(y_i \neq G_m(x_i))}\, e^{-\beta_m}$$

where $\alpha_m = 2\beta_m = \log\frac{1-err_m}{err_m}$ is the same as the $\alpha_m$ in AdaBoost.M1.
12
Deep Learning
$$f(\mathbf{X}) = \sum_{m=1}^{M} g_m(\boldsymbol{\omega}_m^{T}\mathbf{X})$$

1. $\boldsymbol{\omega}^{T} = (\frac{1}{2}, \frac{\sqrt{3}}{2})$, $v = \frac{1}{2}x_1 + \frac{\sqrt{3}}{2}x_2$, $g(v) = \frac{1}{1+e^{-v}}$
2. $\boldsymbol{\omega}^{T} = (1, 0)$, $v = x_1$, $g(v) = (v+5)\sin\left(\frac{1}{v/3+0.1}\right)$
3. $\boldsymbol{\omega}^{T} = (0, 1)$, $v = x_2$, $g(v) = e^{v^2/5}$
4. $\boldsymbol{\omega}^{T} = (1, 0)$, $v = x_1$, $g(v) = (v+0.1)\sin\left(\frac{1}{v/3+0.1}\right)$
Here is how you can simulate the data and plot it using R:
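(The grid object M is not defined above; a possible setup using the plot3D package, with an assumed range:)

library(plot3D)
# create a 2-D grid of (x1, x2) values; the range is an assumption
M <- mesh(seq(-2, 2, length.out = 40), seq(-2, 2, length.out = 40))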
x1 <- M$x
x2 <- M$y
## setting 1 map X using w to get v
v <- (1/2) * x1 + (sqrt(3)/2) * x2
# apply g() on v
g1 <- 1/(1 + exp(-v))
par(mfrow = c(2, 2), mar = c(0, 0, 1, 0))
surf3D(x1, x2, g1, colvar = g1, border = "black", colkey = FALSE,
box = FALSE, main = "Setting 1")
## setting 2
v <- x1
g2 <- (v + 5) * sin(1/(v/3 + 0.1))
surf3D(x1, x2, g2, colvar = g2, border = "black", colkey = FALSE,
box = FALSE, main = "Setting 2")
## setting 3
v <- x2
g3 <- exp(v^2/5)
surf3D(x1, x2, g3, colvar = g3, border = "black", colkey = FALSE,
box = FALSE, main = "Setting 3")
## setting 4
v <- x1
g4 <- (v + 0.1) * sin(1/(v/3 + 0.1))
surf3D(x1, x2, g4, colvar = g4, border = "black", colkey = FALSE,
box = FALSE, main = "Setting 4")
(Figure: surface plots of the simulated data under Settings 1-4.)
$$\hat{y}^{(i)} = \sigma(w^{T} x^{(i)} + b)$$

where $\sigma(z) = \frac{1}{1+e^{-z}}$. The following figure summarizes the process:
There are two types of layers. The last layer connects directly to
the output. All the rest are intermediate layers. Depending on your
definition, we call it “0-layer neural network” where the layer count
only considers intermediate layers. To train the model, you need
a cost function which is defined as equation (12.2).
$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) \quad (12.2)$$

where

$$L(\hat{y}^{(i)}, y^{(i)}) = -y^{(i)}\log(\hat{y}^{(i)}) - (1 - y^{(i)})\log(1 - \hat{y}^{(i)})$$
$b^{[1]}$ is the column vector of the four bias parameters shown above. $z^{[1]}$ is a column vector of the four non-activated neurons. When you apply an activation function to a matrix or vector, you apply it element-wise. $W^{[1]}$ is the matrix formed by stacking the four row vectors:

$$W^{[1]} = \begin{bmatrix} w_1^{[1]T} \\ w_2^{[1]T} \\ w_3^{[1]T} \\ w_4^{[1]T} \end{bmatrix}$$
So if you have one sample, you can go through the above forward propagation process to calculate the output $\hat{y}$ for that sample. If you have $m$ training samples, you need to repeat this process for each of the $m$ samples. We use the superscript (i) to denote a quantity associated with the $i$-th sample. You need to do the same calculation for all $m$ samples.

For i = 1 to m, do:
where

$$X = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & | \end{bmatrix},$$

$$A^{[l]} = \begin{bmatrix} | & | & & | \\ a^{[l](1)} & a^{[l](2)} & \cdots & a^{[l](m)} \\ | & | & & | \end{bmatrix}_{l = 1\ or\ 2},$$

$$Z^{[l]} = \begin{bmatrix} | & | & & | \\ z^{[l](1)} & z^{[l](2)} & \cdots & z^{[l](m)} \\ | & | & & | \end{bmatrix}_{l = 1\ or\ 2}$$
You can add layers like this to get a deeper neural network as
shown in the bottom right of figure 12.1.
• Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

(Figure: the sigmoid activation function.)
When the output has more than 2 categories, people use the softmax function as the output-layer activation function:

$$f_i(\mathbf{z}) = \frac{e^{z_i}}{\sum_{j=1}^{J} e^{z_j}} \quad (12.3)$$

where $\mathbf{z}$ is a vector.
• Hyperbolic Tangent (tanh) Function

$$tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \quad (12.4)$$

(Figure: the tanh activation function.)
The tanh function crosses point (0, 0) and the value of the function
is between 1 and -1 which makes the mean of the activated neurons
closer to 0. The sigmoid function doesn’t have that property. When
you preprocess the training input data, you sometimes center the
data so that the mean is 0. The tanh function is doing that data
processing in some way which makes learning for the next layer a
little easier. This activation function is used a lot in the recurrent
neural networks where you want to polarize the results.
• Rectified Linear Unit (ReLU) Function
The most popular activation function is the Rectified Linear Unit
(ReLU) function. It is a piecewise function, or a half rectified func-
tion:
$$R(z) = max(0, z)$$

(Figure: the ReLU activation function.)
A variant is the leaky ReLU:

$$R(z)_{Leaky} = \begin{cases} z & z \ge 0 \\ az & z < 0 \end{cases}$$

(Figure: the leaky ReLU activation function.)
You may notice that all activation functions are non-linear. Since
the composition of two linear functions is still linear, using a lin-
ear activation function doesn’t help to capture more information.
That is why you don’t see people use a linear activation function
in the intermediate layer. One exception is when the output 𝑦 is
continuous, you may use linear activation function at the output
layer. To sum up, for intermediate layers:
• ReLU is usually a good choice. If you don't know what to choose, start with ReLU. Leaky ReLU usually works slightly better than ReLU but is not used as much in practice; either one works fine. People usually use a = 0.01 as the slope for leaky ReLU. You can try different values, but most people use a = 0.01.
• tanh is used sometimes, especially in recurrent neural networks. But you almost never see the sigmoid function used as an intermediate-layer activation function.
For the output layer:
• When it is binary classification, use sigmoid with binary cross-
entropy as loss function
• When there are multiple classes, use softmax function with cat-
egorical cross-entropy as loss function
• When the response is continuous, use identity function (i.e. y =
x)
12.2.5 Optimization
So far, we have introduced the core components of deep learning
architecture, layer, weight, activation function, and loss function.
With the architecture in place, we need to determine how to update
the network based on a loss function (a.k.a. objective function).
In this section, we will look at variants of optimization algorithms
that will improve the training speed.
12.2.5.1 Batch, Mini-batch, Stochastic Gradient Descent
it will start over and take another pass through the training set. This means one more decision to make: the optimal number of epochs. It is decided by looking at the trends of performance metrics on a holdout set of training data. We discussed data splitting and
sampling in section 7.2. People often use a single holdout set to
tune the model in deep learning. But it is important to use a big
enough holdout set to give high confidence in your model’s overall
performance. Then you can evaluate the chosen model on your test
set that is not used in the training process. MGD is what everyone
in deep learning uses when training on a large data set.
$$V_t = \beta V_{t-1} + (1-\beta)\theta_t$$

And we have:

$$V_0 = 0$$
$$V_1 = \beta V_0 + (1-\beta)\theta_1$$
$$V_2 = \beta V_1 + (1-\beta)\theta_2$$
$$\vdots$$
$$V_{100} = \beta V_{99} + (1-\beta)\theta_{100}$$

For example, with $\beta = 0.95$:

$$V_0 = 0,\quad V_1 = 0.05\theta_1,\quad V_2 = 0.0475\theta_1 + 0.05\theta_2,\ \dots$$
The black line in the right plot of figure 12.8 is the exponentially
weighted averages of simulated temperature data with 𝛽 = 0.95.
1
𝑉𝑡 is approximately average over the previous 1−𝛽 days. So 𝛽 =
0.95 approximates a 20 days’ average. The red line corresponds to
𝛽 = 0.8, which approximates 5 days’ average. As 𝛽 increases, it
averages over a larger window of the previous values, and hence the
curve gets smoother. A larger 𝛽 also means that it gives the current
value 𝜃𝑡 less weight (1 − 𝛽), and the average adapts more slowly.
It is easy to see from the plot that the averages at the beginning
are more biased. The bias correction can help to achieve a better
estimate:
$$V_t^{corrected} = \frac{V_t}{1 - \beta^{t}}$$
$$V_1^{corrected} = \frac{V_1}{1 - 0.95} = \theta_1$$

$$V_2^{corrected} = \frac{V_2}{1 - 0.95^2} = 0.4872\theta_1 + 0.5128\theta_2$$
(Figure 12.8: simulated daily temperature data and exponentially weighted averages with β = 0.95, 0.8, and 0.5.)
Gradient descent with momentum updates the parameters using the exponentially weighted averages of the gradients:

$$w = w - \alpha V_{dw}; \quad b = b - \alpha V_{db}$$

RMSprop scales the update by the square root of an exponentially weighted average of the squared gradients:

$$w = w - \alpha \frac{dw}{\sqrt{S_{dw}}}; \quad b = b - \alpha \frac{db}{\sqrt{S_{db}}}$$
$$\min_{w,b} J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + penalty$$

$$L_2\ penalty = \frac{\lambda}{2m} \parallel w \parallel_2^2 = \frac{\lambda}{2m}\sum_{i=1}^{n_x} w_i^2$$

$$L_1\ penalty = \frac{\lambda}{m}\sum_{i=1}^{n_x} |w_i|$$
For a neural network,

$$J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2}\sum_{l=1}^{L} \parallel w^{[l]} \parallel_F^2$$

where

$$\parallel w^{[l]} \parallel_F^2 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}} (w_{ij}^{[l]})^2$$
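As a hedged illustration (the penalty weight and layer sizes below are assumptions, not the book's values), an L2 penalty can be attached to a keras layer through kernel_regularizer:

library(keras)

model_l2 <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(784),
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dense(units = 10, activation = "softmax")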
sent the colors red, blue, and green, respectively. Similarly, you can process the image as a 3-d array, or you can vectorize the array as shown in figure 12.10.
Let's look at how to use the keras R package for a toy example.
install.packages("keras")
library(keras)
install_keras()
You can run the code in this section in the Databrick community
edition with R as the interface. Refer to section 4.3 for how to
set up an account, create a notebook (R or Python) and start
a cluster. For an audience with a statistical background, using a well-managed cloud environment has the following benefits:

• Minimum language barrier in coding for most statisticians
• Zero setup, which saves time by using the cloud environment
• Getting familiar with the current trend of cloud computing in an industrial context
You can also run the code on your local machine with R and the required Python packages (keras uses the Python TensorFlow backend engine). Different versions of Python may cause errors when running install_keras(). Here are the things you could do when you encounter a Python backend issue on your local machine:
• Run reticulate::py_config() to check the current Python config-
uration to see if anything needs to be changed.
• By default, install_keras() uses the virtual environment ~/.virtualenvs/r-reticulate. If you don't know how to set up a custom environment, refer to the install_keras() documentation: https://tensorflow.rstudio.com/reference/keras/install_keras/
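The MNIST data inspected below can be loaded with keras's built-in dataset; a minimal sketch:

mnist <- dataset_mnist()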
str(mnist)
List of 2
$ train:List of 2
..$ x: int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
..$ y: int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
$ test :List of 2
..$ x: int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
..$ y: int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...
Now we prepare the features (x) and the response variable (y) for both the training and testing datasets, as sketched below, and then check the structure of x_train and y_train using the str() function.
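A minimal preparation sketch, assuming the list structure shown by str(mnist) above:

x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y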
str(x_train)
str(y_train)
Now let's plot a chosen 28x28 matrix as an image using R's image() function. In R's image() function, the way an image is shown is rotated 90 degrees from the matrix representation, so an additional step is needed to rearrange the matrix before image() can show it in the actual orientation; one possible approach is sketched below.
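One common way to do this (a sketch; the helper name is hypothetical and this is not necessarily the book's code) is to reverse the row order and transpose before plotting:

# flip the row order, then transpose, so image() draws the digit upright
rotate_for_image <- function(m) t(apply(m, 2, rev))
image(rotate_for_image(input_matrix), col = grey.colors(255), axes = FALSE)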
input_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[6,] 0 0 0 0 0 0 0 0 0 0 9 80 207
[7,] 0 0 0 0 0 0 39 158 158 158 168 253 253
[8,] 0 0 0 0 0 0 226 253 253 253 253 253 253
[9,] 0 0 0 0 0 0 139 253 253 253 238 113 215
[10,] 0 0 0 0 0 0 39 34 34 34 30 0 31
[11,] 0 0 0 0 0 0 91 0 0 0 0 0 0
[12,] 0 0 0 0 0 0 0 0 0 0 0 11 33
[13,] 0 0 0 0 0 0 0 0 0 0 11 167 253
[14,] 0 0 0 0 0 0 0 0 0 0 27 253 253
[15,] 0 0 0 0 0 0 0 0 0 0 18 201 253
[16,] 0 0 0 0 0 0 0 0 0 0 0 36 87
...
into 784 columns (i.e., features), and then rescale the values to be between 0 and 1 by dividing the original pixel values by 255, as described in the cell below.
# step 1: reshape
x_train <- array_reshape(x_train,
c(nrow(x_train), 784))
x_test <- array_reshape(x_test,
c(nrow(x_test), 784))
# step 2: rescale
x_train <- x_train / 255
x_test <- x_test / 255
str(x_train)
str(x_test)
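The model definition itself does not appear here; the following sketch is consistent with the summary() output shown below, but the dropout rates are assumptions:

dnn_model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = "softmax")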
The above dnn_model has 4 dense layers: the first layer has 256 nodes, the 2nd layer 128 nodes, the 3rd layer 64 nodes, and the last layer 10 nodes. The activation function for the first 3 layers is relu, and the activation function for the last layer is softmax, which is typical for classification problems. The model details can be obtained through the summary() function. The number of parameters of each dense layer can be calculated as: (number of input features + 1) times (number of nodes in the layer). For example, the first layer has (784 + 1) x 256 = 200,960 parameters; the 2nd layer has (256 + 1) x 128 = 32,896 parameters. Please note that dropout only randomly drops a certain proportion of node activations for each batch; it does not reduce the number of parameters in the model. The dnn_model we just defined has 242,762 parameters in total to be estimated.
summary(dnn_model)
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
dense_1 (Dense) (None, 256) 200960
________________________________________________________________________________
dropout_1 (Dropout) (None, 256) 0
________________________________________________________________________________
dense_2 (Dense) (None, 128) 32896
________________________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
________________________________________________________________________________
dense_3 (Dense) (None, 64) 8256
________________________________________________________________________________
dropout_3 (Dropout) (None, 64) 0
________________________________________________________________________________
dense_4 (Dense) (None, 10) 650
================================================================================
Total params: 242,762
Trainable params: 242,762
Non-trainable params: 0
________________________________________________________________________________
Now we can feed the data (x and y) into the neural network that we just built to estimate all the parameters in the model. Here we define three hyperparameters for this model: epochs, batch_size, and validation_split, as sketched below. It takes just a couple of minutes to finish.
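A sketch of the compile and training calls. The epochs, batch size, and validation split below match the str(dnn_history) output that follows, while the loss/optimizer choices and the one-hot encoding step are assumptions:

dnn_model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics = c("accuracy")
)

# one-hot encode the 10-class labels
y_train_onehot <- to_categorical(y_train, 10)

dnn_history <- dnn_model %>% fit(
  x_train, y_train_onehot,
  epochs = 15,
  batch_size = 128,
  validation_split = 0.2
)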
str(dnn_history)
List of 2
$ params :List of 8
..$ metrics : chr [1:4] "loss" "acc" "val_loss" "val_acc"
..$ epochs : int 15
..$ steps : NULL
..$ do_validation : logi TRUE
..$ samples : int 48000
..$ batch_size : int 128
..$ verbose : int 1
..$ validation_samples: int 12000
$ metrics:List of 4
..$ acc : num [1:15] 0.83 0.929 0.945 0.954 0.959 ...
..$ loss : num [1:15] 0.559 0.254 0.195 0.165 0.148 ...
..$ val_acc : num [1:15] 0.946 0.961 0.967 0.969 0.973 ...
..$ val_loss: num [1:15] 0.182 0.137 0.122 0.113 0.104 ...
- attr(*, "class")= chr "keras_training_history"
plot(dnn_history)
12.2.7.3 Prediction
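A sketch of the evaluation and prediction steps whose outputs appear below (the one-hot encoding mirrors the training step above; the exact calls are assumptions):

dnn_model %>% evaluate(x_test, to_categorical(y_test, 10))
dnn_pred <- dnn_model %>% predict_classes(x_test)
head(dnn_pred)
# number of misclassified test images
sum(dnn_pred != y_test)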
$acc
[1] 0.981
[1] 7 2 1 0 4 1
[1] 190
index_image = 34
You start from the top left corner of the image and put the filter on the top left 3 x 3 sub-matrix of the input image and take the element-wise product. Then you add up the 9 numbers. Move the filter one step at a time until it reaches the bottom right. The detailed process is shown in figure 12.12, and a small sketch of the operation follows.
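A minimal R sketch of the 2-D convolution operation (valid padding, stride 1); this is illustrative code, not the book's implementation:

conv2d <- function(img, filter) {
  f <- nrow(filter)
  out_r <- nrow(img) - f + 1
  out_c <- ncol(img) - f + 1
  out <- matrix(0, out_r, out_c)
  for (i in 1:out_r) {
    for (j in 1:out_c) {
      patch <- img[i:(i + f - 1), j:(j + f - 1)]
      out[i, j] <- sum(patch * filter)  # element-wise product, then sum
    }
  }
  out
}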
Let's use edge detection as an example to see how the convolution operation works. Given a picture such as the left of figure 12.13, you want to detect the vertical lines. For example, there are vertical lines along the hair and the edges of the bookcase. How do you do that? There are standard filters for operations like blurring, sharpening, and edge detection. To get the edges, you can use the following 3 x 3 filter to convolve over the image.
kernel_vertical
image = magick::image_read("http://bit.ly/2Nh5ANX")
kernel_vertical = matrix(c(1, 1, 1, 0, 0, 0, -1, -1, -1),
nrow = 3, ncol = 3)
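A sketch of how the edge images plotted below could be computed, using magick's image_convolve() (the book's exact call may differ; the horizontal kernel is simply the transpose of the vertical one):

# rotate the filter by 90 degrees for horizontal edges
kernel_horizontal <- t(kernel_vertical)
image_edge_vertical <- magick::image_convolve(image, kernel_vertical)
image_edge_horizontal <- magick::image_convolve(image, kernel_horizontal)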
plot(image)
plot(image_edge_vertical)
plot(image_edge_horizontal)
So the output image has a lighter region in the middle that corresponds to the vertical edge of the input image. When the input image is large, such as the 1020 x 711 image in figure 12.13, the edge will not appear as thick as it does in this small example. To detect the horizontal edge, you only need to rotate the filter by 90 degrees. The right image in figure 12.13 shows the horizontal edge detection result. You can see how the convolution operator detects a specific feature from the image.
the value for p and also the pixel value used. Or you can just use 0 to pad and make the output the same size as the input.
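For reference (a standard formula, not quoted from the text): with an n x n input, an f x f filter, padding p, and stride s, the output size is

$$\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1$$

so "same" padding with stride 1 and an odd filter size chooses $p = (f-1)/2$ to keep the output the same size as the input.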
library(EBImage)
library(dplyr)

# convert the color image to a 2-D grayscale matrix by averaging the RGB channels
gray_eggshell = apply(eggshell, c(1, 2), mean)

# Pooling with filter size f, stride s, and type "mean" or "max". The wrapping
# function definition and the 2-D branch are truncated in the original text, so
# the function name `pooling` and the 2-D branch below are reconstructions.
pooling <- function(image, f, s, type = "mean") {
    if (length(dim(image)) == 3) {
        # get image dimensions
        col <- dim(image[, , 1])[2]
        row <- dim(image[, , 1])[1]
        # calculate new dimension size
        c <- (col - f)/s + 1
        r <- (row - f)/s + 1
        # create new image object (rows first, then columns)
        newImage <- array(0, c(r, c, 3))
        # loop over the RGB layers
        for (rgb in 1:3) {
            m <- image[, , rgb]
            m3 <- matrix(0, ncol = c, nrow = r)
            i <- 1
            if (type == "mean")
                for (ii in 1:r) {
                  j <- 1
                  for (jj in 1:c) {
                    # mean of the f x f patch
                    m3[ii, jj] <- mean(as.numeric(m[i:(i + (f - 1)),
                                                    j:(j + (f - 1))]))
                    j <- j + s
                  }
                  i <- i + s
                } else for (ii in 1:r) {
                  j <- 1
                  for (jj in 1:c) {
                    # max of the f x f patch
                    m3[ii, jj] <- max(as.numeric(m[i:(i + (f - 1)),
                                                   j:(j + (f - 1))]))
                    j <- j + s
                  }
                  i <- i + s
                }
            newImage[, , rgb] <- m3
        }
    } else if (length(dim(image)) == 2) {
        # 2-D grayscale image: same logic on a single channel
        c <- (ncol(image) - f)/s + 1
        r <- (nrow(image) - f)/s + 1
        newImage <- matrix(0, ncol = c, nrow = r)
        i <- 1
        for (ii in 1:r) {
            j <- 1
            for (jj in 1:c) {
                patch <- as.numeric(image[i:(i + (f - 1)), j:(j + (f - 1))])
                newImage[ii, jj] <- if (type == "mean") mean(patch) else max(patch)
                j <- j + s
            }
            i <- i + s
        }
    }
    newImage
}
Let’s apply both max and mean pooling with filter size 10 (𝑓 = 10)
and stride 10 (𝑠 = 10).
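A sketch of the calls (object names are placeholders consistent with the reconstruction above):

gray_max <- pooling(gray_eggshell, f = 10, s = 10, type = "max")
gray_mean <- pooling(gray_eggshell, f = 10, s = 10, type = "mean")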
You can see the result by plotting the output image (figure 12.18).
The top left is the original color picture. The top right is the 2D
grayscale picture. The bottom left is the result of max pooling.
The bottom right is the result of mean pooling. The max-pooling
gives you the value of the largest pixel and the mean-pooling gives
the average of the patch. You can consider it as a representation
of features, looking at the maximal or average presence of differ-
ent features. In general, max-pooling works better. You can gain
some intuition from the example (figure 12.18). The max-pooling
“picks” more distinct features and average-pooling blurs out fea-
tures evenly.
numbers with the corresponding numbers from the top left region
of the color input image and add them up. Add a bias parameter
and apply an activation function which gives you the first number
of the output image. Then slide it over to calculate the next one.
The final output is a 2-D 4 × 4 matrix. If you want to detect features in the red channel only, you can use a filter with the second and third channels set to all 0s. With different choices of the parameters, you can get different feature detectors. You can use more than one filter, and each filter has multiple channels. For example, you can use one 3 × 3 × 3 filter to detect the horizontal edge and another to detect the vertical edge. Figure 12.18 shows an example of one layer with two filters. Each filter has a dimension of 3 × 3 × 3. The output dimension is 4 × 4 × 2. The output has two channels because we have two filters on the layer. The total number of parameters is 56 (each filter has 27 weights and one bias parameter b, so 28 × 2 = 56).
We use the following notations for layer 𝑙:
str(x_train)
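The CNN definition itself is not shown here; the following is a sketch of a possible architecture for 28 × 28 × 1 input, assuming x_train has been reshaped to c(nrow(x_train), 28, 28, 1). All layer sizes, dropout rates, and the optimizer are assumptions, not the book's exact choices:

cnn_model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_dropout(rate = 0.25) %>%
  layer_flatten() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 10, activation = "softmax")

cnn_model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_adadelta(),
  metrics = c("accuracy")
)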
summary(cnn_model)
Train the model and save each epoch's history, as sketched below. Please note that since we are not using a GPU, it takes a few minutes to finish; please be patient while waiting for the results. The training time can be significantly reduced when running on a GPU.
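A sketch of the training and test-set evaluation whose result appears below (the hyperparameter values are assumptions; the one-hot labels mirror the earlier DNN sketch):

cnn_history <- cnn_model %>% fit(
  x_train, y_train_onehot,
  epochs = 10,
  batch_size = 128,
  validation_split = 0.2
)

cnn_model %>% evaluate(x_test, to_categorical(y_test, 10))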
## loss accuracy
## 0.02153786 0.99290001
12.3.5.3 Prediction
# model prediction
cnn_pred <- cnn_model %>%
predict_classes(x_test)
head(cnn_pred)
[1] 7 2 1 0 4 1
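The count printed below is presumably the number of misclassified test images; a one-line sketch:

sum(cnn_pred != y_test)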
[1] 72
• Music generation
• Sentiment analysis
A trained CNN accepts a fixed-size vector as input (such as a 28 × 28 image) and produces a fixed-size vector as output (such as the probabilities of being one of the ten digits). An RNN has a much more flexible structure: it can operate over sequences of vectors and produce sequences of outputs, and both can vary in size. To understand what that means, let's look at some examples of RNN structures.
The information flows from one step to the next with a repetitive structure until the last time step input $x^{<T_x>}$, and then it outputs $\hat{y}^{<T_y>}$. In this example, $T_x = T_y$. The architecture changes when $T_x$ and $T_y$ are not the same. The model shares the parameters $W_{ya}$, $W_{aa}$, $W_{ax}$, $b_a$, $b_y$ for all time steps of the input.
$$L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>} \log(\hat{y}^{<t>}) - (1 - y^{<t>}) \log(1 - \hat{y}^{<t>})$$

$$L(\hat{y}, y) = \sum_{t=1}^{T_y} L^{<t>}(\hat{y}^{<t>}, y^{<t>})$$
The above defines the forward process. Same as before, backward propagation computes the gradients of the loss with respect to the parameters by the chain rule for differentiation.
In this RNN structure, the information only flows from left to right. So at any position, it only uses data from earlier in the sequence to make a prediction. It does not work when predicting the current word requires information from later words. For example, consider the following two sentences:

The first three words alone are not enough to know whether the word "April" is part of a person's name. It is a person's name in sentence 1 but not in sentence 2, yet the two sentences have the same first three words. In this case, we need a model that allows the information to flow in both directions. A bidirectional RNN takes data from both earlier and later in the sequence. The disadvantage is that it needs the entire
The word "male" has a score of -1 for the "gender" feature, while "female" has a score of 1. Both "Apple" and "pumpkin" have a high score for the "food" feature and much lower scores for the rest. You
can set the number of features to learn, usually more than what
we list in the above figure. If you use 200 features to represent the
words, then the learned embedding for each word is a vector with
a length of 200.
For language-related applications, text embedding is the most crit-
ical step. It converts raw text into a meaningful vector representa-
tion. Once we have a vector representation, it is easy to calculate
typical numerical metrics such as cosine similarity. There are many
pre-trained text embeddings available for us to use. We will briefly
introduce some of these popular embeddings.
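As a small illustration of the cosine similarity mentioned above (the vectors here are made up and do not come from any pre-trained embedding):

cosine_similarity <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
v_apple <- c(0.1, 0.9, -0.2)    # hypothetical 3-dimensional embeddings
v_pumpkin <- c(0.2, 0.8, -0.1)
cosine_similarity(v_apple, v_pumpkin)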
The first widely used embedding is word2vec. It was first intro-
duced in 2013 and was trained by a large collection of text in
an unsupervised way. Training the word2vec embedding vector
uses bag-of-words or skip-gram. In the bag-of-words architecture,
the model predicts the current word based on a window of sur-
rounding context words. In skip-gram architecture, the model uses
the current word to predict the surrounding window of context
words. There are pre-trained word2vec embeddings based on a
large amount of text (such as wiki pages, news reports, etc.) for
general applications.
GloVe (Global Vectors) embedding is an extension of word2vec and performs better. It uses a unique version of the square loss function. However, words are composed of meaningful components, such as radicals; "eat" and "eaten" are different forms of the same word. Both word2vec and GloVe use word-level information, and they treat each word uniquely based on its context.
The fastText embedding was introduced to use the word's internal structure to make the process more efficient. It uses morphological information to extend the skip-gram model. New words that are not in the training data can be represented well. It also supports 150+ different languages. The above-mentioned embeddings (word2vec, GloVe, and fastText) do not consider the words' context (i.e., the same word always has the same embedding vector). However, the same word may have different meanings in different contexts. BERT (Bidirectional Encoder Representations from Transformers) was introduced to add context-level information in text-related applications. As of early 2021, BERT is generally considered the best language model for common application tasks.
For sentence 1, you need to use “she” in the adjective clause af-
ter “which” because it is a girl. For sentence 2, you need to use
“he” because it is a boy. This is a long-term dependency example
where the information at the beginning can affect what needs to
come much later in the sentence. RNN needs to forward propa-
gate information from left to right and then backpropagate from
right to left. It can be difficult for the error associated with the
Machine learning algorithms cannot deal with raw text, and we have to convert text into numbers before feeding it into an algorithm. Tokenization is one way to convert text data into a numerical representation. For example, suppose we have 500 unique words for all reviews in the training dataset. We can label each word by the rank (i.e., from 1 to 500) of its frequency in the training data, as the small sketch below illustrates.
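A minimal tokenization sketch with keras (the toy reviews are made up):

reviews <- c("great movie and great acting", "terrible movie")
tokenizer <- text_tokenizer(num_words = 500) %>% fit_text_tokenizer(reviews)
# each word is replaced by its frequency rank, e.g. "great" -> 1, "movie" -> 2, ...
texts_to_sequences(tokenizer, reviews)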
Now we load the IMDB dataset, and we can check the structure of the loaded object by using the str() command, as sketched below.
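A sketch of the loading and padding steps (the vocabulary size and maximum review length are assumptions; max_unique_word is reused by the embedding layers later):

max_unique_word <- 2500   # hypothetical vocabulary size
max_review_len <- 100     # hypothetical maximum review length

imdb <- dataset_imdb(num_words = max_unique_word)
str(imdb)

# pad/truncate every review to the same length
x_train <- pad_sequences(imdb$train$x, maxlen = max_review_len)
x_test <- pad_sequences(imdb$test$x, maxlen = max_review_len)
y_train <- imdb$train$y
y_test <- imdb$test$y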
The x_train and x_test objects are now numerical matrices ready to be used for recurrent neural network models.
Simple Recurrent Neural Network
Like the DNN and CNN models we trained earlier, RNN models are relatively easy to train using keras after the pre-processing stage. In the following example, we use layer_embedding() to fit an embedding layer based on the training dataset, which has two parameters: input_dim (the number of unique words) and output_dim (the length of the dense vectors). Then, we add a simple RNN layer by calling layer_simple_rnn(), followed by a dense layer layer_dense() to connect to the binary response variable; a sketch of the model follows.
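A sketch of such a model (the unit sizes mirror the LSTM model shown later and are assumptions; so is the compile step):

rnn_model <- keras_model_sequential()

rnn_model %>%
  layer_embedding(input_dim = max_unique_word, output_dim = 128) %>%
  layer_simple_rnn(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 1, activation = 'sigmoid')

rnn_model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)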
batch_size = 128
epochs = 5
validation_split = 0.2

rnn_history <- rnn_model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = validation_split
)
Epoch 1/5
plot(rnn_history)
rnn_model %>%
evaluate(x_test, y_test)
loss accuracy
0.4365373 0.8010000
# create the model object before adding layers (this line is implied by the pipeline below)
lstm_model <- keras_model_sequential()

lstm_model %>%
  layer_embedding(input_dim = max_unique_word, output_dim = 128) %>%
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 1, activation = 'sigmoid')
batch_size = 128
epochs = 5
validation_split = 0.2
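A sketch of the compile and fit calls that produce lstm_history (the loss/optimizer choices are assumptions):

lstm_model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

lstm_history <- lstm_model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = validation_split
)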
Epoch 1/5
plot(lstm_history)
lstm_model %>%
evaluate(x_test, y_test)
A.1 readr
read_csv("2015,2016,2017
1,2,3
4,5,6")
## # A tibble: 2 x 3
## `2015` `2016` `2017`
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
The major functions of readr turn flat files into data frames:

• read_csv(): reads comma-delimited files
• read_csv2(): reads semicolon-separated files (common in countries where , is used as the decimal place)
• read_tsv(): reads tab-delimited files
• read_delim(): reads files with any delimiter
• read_fwf(): reads fixed-width files. You can specify fields either by their widths with fwf_widths() or by their position with fwf_positions()
• read_table(): reads a common variation of fixed-width files where columns are separated by white space

For example, you can read the book's simulated customer data directly from its URL, as sketched below.
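A sketch (the same URL appears later in this appendix):

sim.dat <- read_csv("http://bit.ly/2P5gTw4")
head(sim.dat)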
# A tibble: 6 x 19
age gender income house store_exp online_exp store_trans online_trans Q1
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int> <int>
1 57 Female 1.21e5 Yes 529. 304. 2 2 4
2 63 Female 1.22e5 Yes 478. 110. 4 2 4
3 59 Male 1.14e5 Yes 491. 279. 7 2 5
4 60 Male 1.14e5 Yes 348. 142. 10 2 5
5 51 Male 1.24e5 Yes 380. 112. 4 4 4
6 59 Male 1.08e5 Yes 338. 196. 4 5 4
# ... with 10 more variables: Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>, Q6 <int>,
# Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
The function reads the file into R as a tibble. You can consider a tibble as the next iteration of the data frame. Tibbles differ from data frames in the following ways:

• It never changes an input's type (i.e., no more stringsAsFactors = FALSE!)
• It never adjusts the names of variables
• It has a refined print method that shows only the first 10 rows and all the columns that fit on the screen. You can also control the default print behavior by setting options.

Refer to http://r4ds.had.co.nz/tibbles.html for more information about tibbles.
When you run read_csv(), it prints out a column specification that gives the name and type of each column. To better understand how readr works, it is helpful to type in some small data sets and check the results:
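A sketch that reproduces the output below: because one row contains character entries, readr parses all three columns as character.

read_csv("2015,2016,2017
100,200,300
canola,soybean,corn")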
## # A tibble: 2 x 3
## `2015` `2016` `2017`
## <chr> <chr> <chr>
## 1 100 200 300
## 2 canola soybean corn
You can also add comments at the top and tell R to skip those lines:
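A sketch (the comment text here is made up; the book's toy data may differ):

read_csv("# a comment line to skip
Date,Food,Mood
Monday,carrot,happy
Tuesday,carrot,happy
Wednesday,carrot,happy
Thursday,carrot,happy
Friday,carrot,happy
Saturday,carrot,extremely happy
Sunday,carrot,extremely happy", skip = 1)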
## # A tibble: 7 x 3
## Date Food Mood
## <chr> <chr> <chr>
## 1 Monday carrot happy
## 2 Tuesday carrot happy
## 3 Wednesday carrot happy
## 4 Thursday carrot happy
If you don't have column names, set col_names = FALSE; then R will assign the names "X1", "X2", … to the columns:
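A sketch that reproduces the output below:

read_csv("Saturday,carrot,extremely happy
Sunday,carrot,extremely happy", col_names = FALSE)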
## # A tibble: 2 x 3
## X1 X2 X3
## <chr> <chr> <chr>
## 1 Saturday carrot extremely happy
## 2 Sunday carrot extremely happy
## # A tibble: 1 x 10
## X1 X2 X3 X4 X5 X6 X7 X8 X9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 every man is a poet when he is in
## # ... with 1 more variable: X10 <chr>
## # A tibble: 1 x 5
## X1 X2 X3 X4 X5
## <chr> <chr> <chr> <chr> <chr>
## 1 THE UNBEARABLE RANDOMNESS OF LIFE
Another situation you will often run into is missing values. In marketing surveys, people often use "99" to represent a missing value. You can tell R to set all observations with the value "99" to missing when you read the data:
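A sketch that reproduces the output below (Q3 becomes all NA, so readr parses it as logical):

read_csv("Q1,Q2,Q3
5,4,99", na = "99")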
## # A tibble: 1 x 3
## Q1 Q2 Q3
## <dbl> <dbl> <lgl>
## 1 5 4 NA
For writing data back to disk, you can use write_csv() and write_tsv(). The following two characteristics of these functions increase the chances of the output file being read back in correctly:
• Encode strings in UTF-8
• Save dates and date-times in ISO8601 format so they are easily
parsed elsewhere
For example:
write_csv(sim.dat, "sim_dat.csv")
For other data types, you can use the following packages:

• haven: SPSS, Stata, and SAS data
• readxl and xlsx: Excel data (.xls and .xlsx)
• DBI: together with a database-specific backend such as RMySQL, RSQLite, or RPostgreSQL, lets you read data directly from the database using SQL
Some other useful materials:
• For getting data from the internet, you can refer to the book
“XML and Web Technologies for Data Sciences with R”.
# read data
sim.dat <- readr::read_csv("http://bit.ly/2P5gTw4")
sim.dat %>%
group_by(gender) %>%
summarise(Avg_online_trans = mean(online_trans))
dt <- data.table(sim.dat)
class(dt)
dt[, mean(online_trans)]
## [1] 13.55
sim.dat[,mean(online_trans)]
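The data.frame call above raises an error because a data frame does not evaluate column names inside [ , ]. The grouped summary whose output appears below is presumably the data.table version; a sketch:

dt[, mean(online_trans), by = gender]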
## gender V1
## 1: Female 15.38
## 2: Male 11.26
You can group by more than one variable. For example, group by "gender" and "house", as sketched below:
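A sketch of the call whose output follows:

dt[, mean(online_trans), by = .(gender, house)]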
## gender house V1
## 1: Female Yes 11.312
## 2: Male Yes 8.772
## 3: Female No 19.146
## 4: Male No 16.486
Different from a data frame, a data.table query takes three arguments, DT[i, j, by]: i selects rows, j computes on columns, and by groups the computation.
SELECT
gender,
avg(online_trans)
FROM
sim.dat
GROUP BY
gender
R code:
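A data.table sketch of the equivalent query (the alias avg mirrors the SQL below):

dt[, .(avg = mean(online_trans)), by = .(gender, house)]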
is equal to SQL:
SELECT
gender,
house,
avg(online_trans) AS avg
FROM
sim.dat
GROUP BY
gender,
house
R code:
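A sketch of the corresponding data.table query, with the age filter placed in the i argument:

dt[age < 40, .(avg = mean(online_trans)), by = .(gender, house)]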
is equal to SQL:
SELECT
gender,
house,
avg(online_trans) AS avg
FROM
sim.dat
WHERE
age < 40
GROUP BY
gender,
house
You can see the analogy between data.table and SQL. Now let's focus on operations in data.table.

• select row
• select column

Selecting columns in data.table doesn't need $, as sketched below:
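A sketch whose first six values appear below:

head(dt[, age])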
## [1] 57 63 59 60 51 59
• tabulation

In data.table, .N means to count.
# row count
dt[, .N]
## [1] 1000
# counts by gender
dt[, .N, by= gender]
## gender N
## 1: Female 554
## 2: Male 446
## gender count
## 1: Female 292
## 2: Male 86
Order table:
Since data.table keeps some characteristics of the data frame, they share some operations:
dt[order(-online_exp)][1:5]
You can also order the table by more than one variable. The fol-
lowing code will order the table by gender, then order within gender
by online_exp:
dt[order(gender, -online_exp)][1:5]
2,NA,5000,NA,10,500,NA,NA),
ncol=length(vars), byrow=TRUE)
Now let’s edit the data we just simulated a little by adding tags
to 0/1 binomial variables:
In the real world, the data always includes some noise, such as missing values and wrong imputations. So we will add some noise to the data:
So far we have created part of the data. You can check it using
summary(sim.dat). Next, we will move on to simulate survey data.
nf <- 800
for (j in 1:20) {
set.seed(19870 + j)
x <- c("A", "B", "C")
sim.da1 <- NULL
for (i in 1:nf) {
# sample(x, 120, replace=TRUE)->sam
sim.da1 <- rbind(sim.da1, sample(x, 120, replace = TRUE))
}
# r = 0.5
# s1 <- c(rep(c(1/2, 0, -1/2), 40),
# r = 1
# s1 <- c(rep(c(1, 0, -1), 40),
# rep(c(1, 0, 0), 40),
# rep(c(0, 0, 0), 40))
# link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3
# r = 2
# s1 <- c(rep(c(2, 0, -2), 40),
# rep(c(2, 0, 0), 40),
# rep(c(0, 0, 0), 40))
#
# link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3/0.5
for (i in 1:120) {
ind <- c(ind, rep(i, 2))
}