BUSINESS ANALYTICS
Communicating with Numbers
Johnson
Purchasing and Supply Management
Sixteenth Edition
PROJECT MANAGEMENT
Larson
Project Management: The Managerial Process
Eighth Edition
MANAGEMENT SCIENCE
Schindler
Business Research Methods
Thirteenth Edition
BUSINESS FORECASTING
Sterman
Business Dynamics: Systems Thinking and Modeling for a
Complex World
OPERATIONS MANAGEMENT
Stevenson
Operations Management
Fourteenth Edition
BUSINESS STATISTICS
McGuckian
Connect Master: Business Statistics
BUSINESS ANALYTICS
Communicating with Numbers
Sanjiv Jaggia
California Polytechnic State University
Alison Kelly
Suffolk University
Kevin Lertwachara
California Polytechnic State University
Leida Chen
California Polytechnic State University
BUSINESS ANALYTICS
Some ancillaries, including electronic and print components, may not be available to
customers outside the United States.
1 2 3 4 5 6 7 8 9 LWI 24 23 22 21 20
ISBN 978-1-260-57601-6
MHID 1-260-57601-9
The Internet addresses listed in the text were accurate at the time of publication. The
inclusion of a website does not indicate an endorsement by the authors or McGraw-Hill
Education, and McGraw-Hill Education does not guarantee the accuracy of the information
presented at these sites.
mheducation.com/highered
Dedicated to our
families
In this text, we have chosen Excel with Analytic Solver (an Excel add-in) and R as the software packages due to their accessibility, ease of use, and powerful features for demonstrating analytics concepts and performing analytics on real-world data. In most chapters, the instructor can select to cover either Excel or R (or both) based on the course objectives and the technical background of the students. These software packages can be accessible to students through university computer labs, educational licenses, and free, open-source licenses. For this edition, all examples and exercise problems are solved using the latest versions of the software as of writing, namely, Microsoft Office Professional 2016, Analytic Solver 2019, and R version 3.5.3. We recommend that the same versions of the software be used in order to replicate the results presented in the text. When new versions of the software packages are released in the future, we plan to incorporate any substantial changes in future editions of the text or provide updates online if the differences are relatively minor.
Sanjiv Jaggia
Alison Kelly
Kevin Lertwachara
Leida Chen
Computer Software
The text includes hands-on tutorials and problem-solving examples
featuring Microsoft Excel, Analytic Solver (an Excel add-in software
for data mining analysis), as well as R (a powerful software that
merges the convenience of statistical packages with the power of
coding).
Throughout the text, students learn to use the software to solve
real-world problems and to reinforce the concepts discussed in the
chapters. Students will also learn how to visualize and communicate
with data using charts and infographics featured in the software.
FOR INSTRUCTORS
FOR STUDENTS
No surprises.
The Connect Calendar and Reports tools keep you on track
with the work you need to get done and your assignment
scores. Life gets busy; Connect tools help you keep learning
through it all.
R Package
R is a powerful software that merges the convenience of statistical
packages with the power of coding. It is open source as well as
cross-platform compatible and gives students the flexibility to work
with large data sets using a wide range of analytics techniques. The
software is continuously evolving to include packages that support
new analytical methods. In addition, students can access rich online
resources and tap into the expertise of a worldwide community of R
users. In Appendix C, we introduce some fundamental features of R
and also provide instructions on how to obtain solutions for many
solved examples in the text.
As with other texts that use R, differences between software
versions are likely to result in minor inconsistencies in analytics
outcomes in algorithm-rich Chapters 9, 10, and 11. In these
chapters, the solved examples and exercise problems are based on
R version 3.5.3 on Microsoft Windows. In order to replicate the
results with newer versions of R, we suggest a line of code in these
chapters that sets the random number generator to the one used on
R version 3.5.3.
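As a minimal sketch, and assuming the mechanism is base R's RNGversion function (the exact line suggested in those chapters may differ), the following code resets the random number generator defaults to those of R 3.5.3 before a seed is set:

# Sketch only: revert to the random number generator defaults of R 3.5.3
RNGversion("3.5.3")     # use pre-3.6.0 sampling behavior
set.seed(1)             # illustrative seed; use the seed given in each example
sample(1:10, size = 3)  # results now match those produced under R 3.5.3

Running this once at the top of a script is usually enough; later calls to set.seed then reproduce the same draws as the older version.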
Analytic Solver
The Excel-based user interface of Analytic Solver reduces the
learning curve for students allowing them to focus on problem
solving rather than trying to learn a new software package. The
solved examples and exercise problems are based on Analytic
Solver 2019. Newer versions of Analytic Solver will likely produce the
same analysis results but may have a slightly different user interface.
For consistency, we recommend that you use Analytic Solver 2019
with this text.
Analytic Solver can be used with Microsoft Excel for Windows (as
an add-in), or “in the cloud” at AnalyticSolver.com using any device
(PC, Mac, tablet) with a web browser. It offers comprehensive
features for prescriptive analytics (optimization, simulation, decision
analysis) and predictive analytics (forecasting, data mining, text
mining). Its optimization features are upward compatible from the
standard Solver in Excel. Note: Analytic Solver software is not free
with the purchase of this text, but low-cost licenses are available. If
interested in having students get low-cost academic access for class
use, instructors should send an email to support@solver.com to get
their course code and receive student pricing and access information
as well as their own access information.
Student Resources
Students have access to data files, tutorials, and detailed progress
reporting within Connect. Key textbook resources can also be
accessed through the Additional Student Resources page:
www.mhhe.com/JaggiaBA1e.
ACKNOWLEDGMENTS
We would like to thank the following instructors for their feedback
and careful reviews during the development process:
Sung K. Ahn, Washington State University
Triss Ashton, Tarleton State University
Anteneh Ayanso, Brock University
Matthew Bailey, Bucknell University
Palash Bera, Saint Louis University
Matthew Brenn, University of Akron
Paul Brooks, Virginia Commonwealth University
Kevin Brown, Asbury University
Kevin Caskey, SUNY New Paltz
Paolo Catasti, Virginia Commonwealth University
Michael Cervetti, University of Memphis
Jimmy Chen, Bucknell University
Jen-Yi Chen, Cleveland State University
Rex Cheung, San Francisco State University
Alan Chow, University of South Alabama
Matthew Dean, University of Southern Maine
Alisa DiSalvo, St. Mary’s University Twin Cities
Mark Dobeck, Cleveland State University
Joan Donohue, University of South Carolina
Kathy Enget, University at Albany-SUNY
Kathryn Ernstberger, Indiana University Southeast
Robertas Gabrys, University of Southern California
Ridvan Gedik, University of New Haven
Roger Grinde, University of New Hampshire
Thomas Groleau, Carthage College
Babita Gupta, California State University Monterey Bay
Serina Al Haddad, Rollins College
Dan Harpool, University of Arkansas Little Rock
Fady Harfoush, Loyola University Chicago
Paul Holmes, Ashland University
Ping-Hung Hsieh, Oregon State University
Kuang-Chung Hsu, University of Central Oklahoma
Jason Imbrogno, University of Northern Alabama
Marina Johnson, Montclair State University
Jerzy Kamburowski, University of Toledo
Reza Kheirandish, Clayton State University
Esther Klein, St. Francis College
Bharat Kolluri, University of Hartford
Mohammad Merhi, Indiana University
Jacob Miller, Southern Utah University
Sinjini Mitra, California State University Fullerton
Kyle Moninger, Bowling Green State University
Rex Moody, Angelo State University
Ebrahim Mortaz, Pace University
Kristin Pettey, Southwestern College
Daniel Power, University of Northern Iowa
Zbigniew Przasnyski, Loyola Marymount University
Sharma Pillutla, Towson University
Roman Rabinovich, Boston University
Michael Ratajczyk, Saint Mary’s College
R. Christopher L. Riley, Delta State University
Leslie Rush, University of Hawaii West Oahu
Avijit Sarkar, University of Redlands
Dmitriy Shaltayev, Christopher Newport University
Mike Shurden, Lander University
Pearl Steinbuch, Boston University
Alicia Strandberg, Villanova University School of Business
David Taylor, Sacred Heart University
Stan Taylor, California State University Sacramento
Pablo Trejo, Miami Dade College
Nikhil Varaiya, San Diego State University
Timothy Vaughan, University of Wisconsin Eau Claire
Fen Wang, Central Washington University
Kanghyun Yoon, University of Central Oklahoma
John Yu, Saint Joseph’s University
Oliver Yi, San Jose State University
Liang Xu, Slippery Rock University of Pennsylvania
Yong Xu, Ferris State University
Jay Zagorsky, Boston University
BRIEF CONTENTS
CHAPTER 1 Introduction to Business Analytics 2
CHAPTER 2 Data Management and Wrangling 30
CHAPTER 3 Data Visualization and Summary Measures 80
CHAPTER 4 Probability and Probability Distributions 138
CHAPTER 5 Statistical Inference 176
CHAPTER 6 Regression Analysis 210
CHAPTER 7 Advanced Regression Analysis 258
CHAPTER 8 Introduction to Data Mining 316
CHAPTER 9 Supervised Data Mining: k-Nearest Neighbors and Naïve Bayes 366
CHAPTER 10 Supervised Data Mining: Decision Trees 410
CHAPTER 11 Unsupervised Data Mining 474
CHAPTER 12 Forecasting with Time Series Data 518
CHAPTER 13 Introduction to Prescriptive Analytics 560
APPENDIXES
Appendix A Big Data Sets: Variable Description and Data Dictionary 608
Appendix B Getting Started with Excel and Excel Add-Ins 614
Appendix C Getting Started with R 621
Appendix D Statistical Tables 628
Appendix E Answers to Selected Exercises 632
Index 652
CONTENTS
CHAPTER 1
1.2 Types of Data 8
Sample and Population Data 8
Cross-Sectional and Time Series Data 9
Structured and Unstructured Data 10
Big Data 12
CHAPTER 2
DATA MANAGEMENT AND WRANGLING 30
2.1 Data Management 32
Data Modeling: The Entity-Relationship Diagram 33
Data Retrieval in the Database Environment 35
Data Warehouse and Data Mart 36
2.2 Data Inspection 39
2.3 Data Preparation 45
Handling Missing Values 46
Subsetting 50
CHAPTER 3
3.4 Summary Measures 114
Measures of Central Location 114
Measures of Dispersion 119
Measures of Shape 121
Measures of Association 123
3.5 Detecting Outliers 127
A Boxplot 127
z-Scores 129
CHAPTER 4
CHAPTER 5
STATISTICAL INFERENCE 176
5.1 Sampling Distributions 178
The Sampling Distribution of the Sample Mean 178
The Sampling Distribution of the Sample Proportion 182
5.2 Estimation 185
Confidence Interval for the Population Mean μ 186
Using Excel and R to Construct a Confidence Interval for μ 18
Confidence Interval for the Population Proportion p 190
5.3 Hypothesis Testing 194
Hypothesis Test for the Population Mean μ 197
Using Excel and R to Test μ 201
Hypothesis Test for the Population Proportion p 203
CHAPTER 6
REGRESSION ANALYSIS 210
6.1 The Linear Regression Model 212
The Components of the Linear Regression Model 212
Estimating a Linear Regression Model with Excel or R 216
Categorical Variables with Multiple Categories 218
6.2 Model Selection 222
The Standard Error of the Estimate, se 223
The Coefficient of Determination, R2 224
The Adjusted R2 225
One Last Note on Goodness-of-Fit Measures 227
6.3 Tests of Significance 229
Test of Joint Significance 229
Test of Individual Significance 231
A Test for a Nonzero Slope Coefficient 233
Reporting Regression Results 235
CHAPTER 7
7.4 Cross-Validation Methods 300
The Holdout Method 301
Using Analytic Solver and R for the Holdout Method for the
Logistic Regression Model 305
The k-Fold Cross-Validation Method 306
CHAPTER 8
8.2 Similarity Measures 322
Similarity Measures for Numerical Data 323
Similarity Measures for Categorical Data 327
8.3 Performance Evaluation 332
Data Partitioning 332
Oversampling 333
Performance Evaluation in Supervised Data Mining 334
Performance Evaluation for Classification Models 335
Using Excel to Obtain the Confusion Matrix and Performance
Measures 337
Selecting Cut-off Values 338
Performance Charts for Classification 340
Using Excel to Obtain Performance Charts for Classification
344
Performance Evaluation for Prediction 345
Using Excel to Obtain Performance Measures for Prediction
347
CHAPTER 9
CHAPTER 10
10.2 Classification Trees 415
Using Analytic Solver and R to Develop a Classification Tree
422
10.3 Regression Trees 440
Using Analytic Solver and R to Develop a Prediction Tree 443
CHAPTER 11
CHAPTER 12
CHAPTER 13
INTRODUCTION TO PRESCRIPTIVE
ANALYTICS 560
13.1 Overview of Prescriptive Analytics 562
APPENDIXES
Appendix A Big Data Sets: Variable Description and Data
Dictionary 608
Appendix B Getting Started with Excel and Excel Add-Ins 614
Appendix C Getting Started with R 621
Appendix D Statistical Tables 628
Appendix E Answers to Selected Exercises 632
Index 652
BUSINESS ANALYTICS
Communicating with Numbers
1 Introduction to Business
Analytics
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 1.1 Explain the importance of business
analytics.
LO 1.2 Explain the various types of data.
LO 1.3 Describe variables and types of
measurement scales.
LO 1.4 Describe different data sources and file
formats.
INTRODUCTORY CASE
Vacation in Belize
After graduating from a university in southern California, Emily
Hernandez is excited to go on a vacation in Belize with her
friends. There are different airlines that offer flights from
southern California to Belize City. Emily prefers a direct flight
from Los Angeles but is also worried about staying within her
budget. Other, less expensive, options would mean making one
or even two additional stops en route to Belize City. Once
arriving at Belize City, she plans to take a sea ferry to one of
the resort hotels on Ambergris Caye Island. Purchasing a
vacation package from one of these hotels would be most cost
effective, but Emily wants to make sure that the hotel she
chooses has all the amenities she wants, such as an early
check-in option, a complimentary breakfast, recreational
activities, and sightseeing services. She also wants to make
sure that the hotel is reputable and has good customer
reviews.
Emily starts researching her options for flights and hotels by
searching for deals on the Internet. She has organized and
kept meticulous records of the information she has found so
that she can compare all the options. She would like to use this
information to:
1. Find a flight that is convenient as well as affordable.
2. Choose a reputable hotel that is priced under $200 per night.
A synopsis of this case is provided at the end of Section 1.2.
BUSINESS ANALYTICS
Business analytics combines qualitative reasoning with
quantitative tools to identify key business problems and
translate data analysis into decisions that improve business
performance.
These three categories of analytics serve different functions in
the problem-solving process. Descriptive analytics is often referred
to as business intelligence (BI), which provides organizations and
their users with the ability to access and manipulate data
interactively through reports, dashboards, applications, and
visualization tools. BI uses past data integrated from multiple
sources to inform decision making and identify problems and
solutions. Most BI questions can be solved using complex queries of
the enterprise databases. For example, a typical BI question for an
online music streaming company would be, “During the first quarter
of 2020, how many country songs recommended by the music
service were skipped by a U.S.-based female listener within five
seconds of playing?” The answer to this question can be found by
querying and summarizing historical data and is useful for making
decisions related to the effectiveness of the music recommendation
system of the company.
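To make this concrete, the short R sketch below answers a question of this type using a small, hypothetical data frame of listening events; the variable names (Genre, ListenerCountry, Sex, Recommended, SecondsPlayed) are illustrative and do not come from an actual data set in this text.

# Hypothetical listening events; in practice these records would be
# queried from the company's enterprise databases
events <- data.frame(
  Genre           = c("Country", "Country", "Pop", "Country"),
  ListenerCountry = c("US", "US", "US", "CA"),
  Sex             = c("Female", "Female", "Female", "Female"),
  Recommended     = c(TRUE, TRUE, TRUE, FALSE),
  SecondsPlayed   = c(3, 45, 2, 4)
)

# Count recommended country songs skipped by U.S.-based female listeners
# within five seconds of playing
sum(events$Genre == "Country" & events$ListenerCountry == "US" &
    events$Sex == "Female" & events$Recommended == TRUE &
    events$SecondsPlayed < 5)

The same counting-and-filtering logic is what a database query would perform on the company's historical data.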
Predictive and prescriptive analytics, on the other hand, are commonly considered advanced analytics. They focus on building predictive and prescriptive models that help organizations understand what might happen in the future. Advanced analytics problems are solved using statistics and data mining techniques. For the online music streaming company example, an advanced analytics question would be, “What are the key factors that
influence a U.S.-based female listener’s music choice?” The answer
to this question cannot be directly found in the enterprise databases
and requires complex learning and analysis of the relevant historical
data.
The three categories of business analytics can also be viewed
according to the level of sophistication and business values they
offer. For many people, using predictive analytics to predict the
future is more valuable than simply summarizing data and describing
what happened in the past. In addition, predictive techniques tend to
require more complex modeling and analysis tools than most
descriptive techniques. Likewise, using prescriptive techniques to
provide actionable recommendations could be more valuable than
predicting a number of possible outcomes in the future. Turning data-
driven recommendations into action also requires thoughtful
consideration and organizational commitment beyond developing
descriptive and predictive analytical models.
Figure 1.1 categorizes business analytics into three stages of
development based on its value and the level of organizational
commitment to data-driven decision making. Most chapters of this
text can also be grouped within each of the three analytics
categories as shown in Figure 1.1.
1.2 TYPES OF DATA
LO 1.2
Explain the various types of data.
Every day, consumers and businesses use many kinds of data from
various sources to help make decisions. An important first step for
making decisions is to find the right data and prepare it for the
analysis. In general, data are compilations of facts, figures, or other
contents, both numerical and nonnumerical. Data of all types and
formats are generated from multiple sources. Insights from all of
these data can enable businesses to make better decisions, such as
deepening customer engagement, optimizing operations, preventing
threats and fraud, and capitalizing on new sources of revenue. We
often find a large amount of data at our disposal. However, we also
derive insights from relatively small data sets, such as from
consumer focus groups, marketing surveys, or reports from
government agencies.
Data that have been organized, analyzed, and processed in a
meaningful and purposeful way become information. We use a
blend of data, contextual information, experience, and intuition to
derive knowledge that can be applied and put into action in specific
situations.
FIGURE 1.2 Screen shot of airfare and hotel search results from Orbitz.com
Source: Orbitz.com
Before we analyze the information that Emily has gathered, it is
important to understand different types of data. In this section, we
focus on the various data categorizations.
FIGURE 1.3 Homeownership Rate (in %) in the U.S. from 2000 through 2018
Source: Federal Reserve Bank of St. Louis
Big Data
Nowadays, businesses and organizations generate and gather more
and more data at an increasing pace. The term big data is a catch-
phrase, meaning a massive volume of both structured and
unstructured data that are extremely difficult to manage, process,
and analyze using traditional data-processing tools. Despite the
challenges, big data present great opportunities to gain knowledge
and business intelligence with potential game-changing impacts on
company revenues, competitive advantage, and organizational
efficiency. More formally, a widely accepted definition of big data is
“high-volume, high-velocity and/or high-variety information assets
that demand cost-effective, innovative forms of information
processing that enable enhanced insight, decision making, and
process automation” (www.gartner.com). The three characteristics
(the three Vs) of big data are:
Volume: An immense amount of data is compiled from a single
source or a wide range of sources, including business
transactions, household and personal devices, manufacturing
equipment, social media, and other online portals.
Velocity: In addition to volume, data from a variety of sources get
generated at a rapid speed. Managing these data streams can
become a critical issue for many organizations.
Variety: Data also come in all types, forms, and granularity, both
structured and unstructured. These data may include numbers,
text, and figures as well as audio, video, e-mails, and other
multimedia elements.
In addition to the three defining characteristics of big data, we
also need to pay close attention to the veracity of the data and the
business value that they can generate. Veracity refers to the
credibility and quality of data. One must verify the reliability and
accuracy of the data content prior to relying on the data to make
decisions. This becomes increasingly challenging with the rapid
growth of data volume fueled by social media and automatic data
collection. Value derived from big data is perhaps the most important
aspect of any analytics initiative. Having a plethora of data does not
guarantee that useful insights or measurable improvements will be
generated. Organizations must develop a methodical plan for
formulating business questions, curating the right data, and
unlocking the hidden potential in big data.
Big data, however, do not necessarily imply complete
(population) data. Take, for example, the analysis of all Facebook
users. It certainly involves big data, but if we consider all Internet
users in the world, Facebook users are only a very large sample.
There are many Internet users who do not use Facebook, so the
data on Facebook do not represent the population. Even if we define
the population as pertaining to those who use online social media,
Facebook is still one of many social media portals that consumers
use. And because different social media are used for different
purposes, data collected from these sites may very well reflect
different populations of Internet users; this distinction is especially
important from a strategic business standpoint. Therefore, Facebook
data are simply a very large sample.
In addition, we may choose not to use big data in its entirety even
when they are available. Sometimes it is just inconvenient to analyze
a very large data set as it is computationally burdensome, even with
a modern, high-capacity computer system. Other times, the
additional benefits of working with big data may not justify the
associated costs. In sum, we often choose to work with a small data
set, which in a sense is a sample drawn from big data.
EXAMPLE 1.1
In the introductory case, Emily is looking for roundtrip flight
schedules offered by different airlines from Los Angeles,
California, to Belize City. Once arriving at Belize City, she plans
to purchase a vacation package from one of the resort hotels
on Ambergris Caye Island. She has compiled information on
flight schedules and hotel options from search results on
Orbitz.com. Her initial search yields 1,420 flight schedules and
19 hotels. In its current format, Emily is overwhelmed by the
amount of data and knows that she needs to refine her search.
She would like to focus on flights that are convenient (short
duration) and relatively inexpensive. For hotels, she would like
an affordable hotel (priced under $200 per night) with good
reviews (above-average review on a 5-point scale). Given
Emily’s preferences, summarize the online information in
tabular form.
SOLUTION:
We first search for roundtrip flight schedules where priority is
given to short flight times and low prices. Even though the flight
information is not in a perfect row-column configuration, the
structured nature of the data allows us to summarize the
information in a tabular format. The four most relevant options
are presented in Panel A of Table 1.2. Given these options, it
seems that the American/Delta choice might be the best for
her. Similarly, once we refine the search to include only hotels
that fall within her budget of under $200 per night and have an
average consumer rating above 4 on a 5-point scale, the
number of hotel options declines from 19 to five. These five
options are presented in Panel B of Table 1.2. In this case, her
choice of hotel is not clear-cut. The hotel with the
highest rating is also the most expensive (X’Tan Ha
– The Waterfront), while the least expensive hotel has the
lowest rating (Isla Bonita Yacht Club) and the rating is based
on only 11 reviews. Emily will now base her final hotel choice
on online reviews. These reviews constitute unstructured data
that do not conform to a row-column format.
There are four airline options that she finds suitable. With a price of $495.48, the
Delta/American option is the cheapest, but the flight time is over 20 hours each way.
The United/Delta option offers the shortest flight times; however, it comes with a
price tag of $929.91. Emily decides on the American/Delta option, which seems
most reasonable with a price of $632.91, a flight time of under eight hours to Belize,
and only five hours on the return. In regard to hotels, Bonita Yacht Club offers the
cheapest price of $130, but it comes with a relatively low rating of 4.2.
In addition, Emily is concerned about the credibility of the rating as it is based on
only 11 reviews. Although not the cheapest, Emily decides to go with X’Tan Ha – The
Waterfront. It has the highest rating of 4.6 and the price still falls within her budget of
under $200. In these reviews, Emily consistently finds key phrases such as “great
location,” “clean room,” “comfortable bed,” and “helpful staff.” Finally, the pictures
that guests have posted are consistent with the images published by the resort on its
website.
EXERCISES 1.2
Applications
1. A few years ago, it came as a surprise when Apple’s iPhone 4 was found to have
a problem. Users complained of weak reception, and sometimes even dropped
calls, when they cradled the phone in their hands in a particular way. A survey at a
local store found that 2% of iPhone 4 users experienced this reception problem.
a. Describe the relevant population.
b. Is 2% associated with the population or the sample?
2. Many people regard video games as an obsession for youngsters, but, in fact, the
average age of a video game player is 35 years old. Is the value 35 likely the
actual or the estimated average age of the population? Explain.
3. An accounting professor wants to know the average GPA of the students enrolled
in her class. She looks up information on Blackboard about the students enrolled
in her class and computes the average GPA as 3.29. Describe the relevant
population.
4. Recent college graduates with an engineering degree continue to earn high
salaries. An online search revealed that the average annual salary for an entry-
level position in engineering is $65,000.
a. What is the relevant population?
b. Do you think the average salary of $65,000 is computed from the population?
Explain.
5. Research suggests that depression significantly increases the risk of developing
dementia later in life. Suppose that in a study involving 949 elderly persons, it was
found that 22% of those who had depression went on to develop dementia,
compared to only 17% of those who did not have depression.
a. Describe the relevant population and the sample.
b. Are the numbers 22% and 17% associated with the population or a sample?
6. Go to www.zillow.com and find the sale price of 20 single-family homes sold in
Las Vegas, Nevada, in the last 30 days. Structure the data in a
tabular format and include the sale price, the number of bedrooms,
the square footage, and the age of the house. Do these data represent cross-
sectional or time series data?
7. Go to www.finance.yahoo.com to get the current stock quote for Home Depot
(ticker symbol = HD). Use the ticker symbol to search for historical prices and
create a table that includes the monthly adjusted close price of Home Depot stock
for the last 12 months. Do these data represent cross-sectional or time series
data?
8. Go to the New York Times website at www.nytimes.com and review the front
page. Would you consider the data on the page to be structured or unstructured?
Explain.
9. Conduct an online search to compare small hybrid vehicles (e.g., Toyota Prius,
Ford Fusion, Chevrolet Volt) on price, fuel economy, and other specifications. Do
you consider the search results structured or unstructured data? Explain.
10. Find Under Armour’s annual revenue from the past 10 years. Are the data
considered structured or unstructured? Explain. Are they cross-sectional or time
series data?
11. Ask 20 of your friends about their online social media usage, specifically whether
or not they use Facebook, Instagram, and Snapchat; how often they use each
social media portal; and their overall satisfaction of each of these portals. Create
a table that presents this information. Are the data considered structured or
unstructured? Are they cross-sectional or time series data?
12. Ask 20 of your friends whether they live in a dormitory, a rental unit, or other form
of accommodation. Also find out their approximate monthly lodging expenses.
Create a table that uses this information. Are the data considered structured or
unstructured? Are they cross-sectional or time series data?
13. Go to the U.S. Census Bureau website at www.census.gov and search for the
most recent median household income for Alabama, Arizona, California, Florida,
Georgia, Indiana, Iowa, Maine, Massachusetts, Minnesota, Mississippi, New
Mexico, North Dakota, and Washington. Do these data represent cross-sectional
or time series data? Comment on the regional differences in income.
EXAMPLE 1.2
In the introductory case, Emily has conducted an online search
on airfares and hotels for her planned vacation to Ambergris
Caye Island. She has summarized the information in Table 1.2.
What types of variables are included in the airfare and hotel
data?
SOLUTION:
Airlines and hotels are categorical variables because the
observations—the names—are merely labels. On the other
hand, the roundtrip price, the average rating, the number of
reviews, and the price per night are numerical variables
because the observations are all meaningful numbers. Note
that the roundtrip price, the number of reviews, and the price
per night represent discrete variables because they can only
assume a countable number of values. The average rating is
continuous because it is characterized by uncountable values
within the 0 to 5 interval. The date and time variables are
considered numerical; however, if we were to consider the days
of the week (e.g., Monday, Tuesday, etc.) in the flight
schedules, then the variable days would be considered a
categorical variable.
TABLE 1.3 Companies of the DJIA and Exchange Where Stock is Traded
TABLE 1.4
Category Rating
Excellent 5
Very good 4
Good 3
Fair 2
Poor 1
In Table 1.4, the value attached to excellent (5 stars) is higher than
the value attached to good (3 stars), indicating that the response of
excellent is preferred to good. However, we can easily redefine the
ratings, as we show in Table 1.5.
TABLE 1.5
Category Rating
Excellent 100
Good 70
Fair 50
Poor 40
In Table 1.5, excellent still receives a higher value than good, but
now the difference between the two categories is 30 points (100 –
70), as compared to a difference of 2 points (5 – 3) when we use the
first classification. In other words, differences between categories
are meaningless with ordinal data. (We also should note that we
could reverse the ordering so that, for instance, excellent equals 40
and poor equals 100; this renumbering would not change the nature
of the data.)
As mentioned earlier, observations of a categorical variable are
typically expressed in words but are coded into numbers for
purposes of data processing. When summarizing the results of a
categorical variable, we typically count the number of observations
that fall into each category or calculate the percentage of
observations that fall into each category. However, with a categorical
variable, we are unable to perform meaningful arithmetic operations,
such as addition and subtraction.
With data that are measured on the interval scale, we are able to
categorize and rank the data as well as find meaningful differences
between observations. The Fahrenheit scale for temperatures is an
example of interval-scaled data. Not only is 60 degrees Fahrenheit
hotter than 50 degrees Fahrenheit, the same difference of 10
degrees also exists between 90 and 80 degrees Fahrenheit.
The main drawback of interval-scaled data is that the value of
zero is arbitrarily chosen; the zero point of interval-scaled data does
not reflect a complete absence of what is being measured. No
specific meaning is attached to 0 degrees Fahrenheit other than to
say it is 10 degrees colder than 10 degrees Fahrenheit. With an
arbitrary zero point, meaningful ratios cannot be constructed. For
instance, it is senseless to say that 80 degrees is twice as hot as 40
degrees; in other words, the ratio 80/40 has no meaning.
The ratio scale represents the strongest level of measurement.
The ratio scale has all the characteristics of the interval scale as well
as a true zero point, which allows us to interpret the ratios between
observations. The ratio scale is used in many business applications.
Variables such as sales, profits, and inventory levels are expressed
on the ratio scale. A meaningful zero point allows us to state, for
example, that profits for firm A are double those of firm B. Variables
such as weight, time, and distance are also measured on a ratio
scale because zero is meaningful.
Unlike nominal- and ordinal-scaled variables (categorical
variables), arithmetic operations are valid on interval- and ratio-
scaled variables (numerical variables). In later chapters, we will
calculate summary measures, such as the mean, the median, and
the variance, for numerical variables; we cannot calculate these
measures if the variable is categorical in nature.
MEASUREMENT SCALES
The observations for any variable can be classified into one of
four major measurement scales: nominal, ordinal, interval, or
ratio.
Nominal: Observations differ merely by name or label.
Ordinal: Observations can be categorized and ranked;
however, differences between the ranked observations are
meaningless.
Interval: Observations can be categorized and ranked, and
differences between observations are meaningful. The main
drawback of the interval scale is that the value of zero is
arbitrarily chosen.
Ratio: Observations have all the characteristics of interval-
scaled data as well as a true zero point; thus, meaningful
ratios can be calculated.
Nominal and ordinal scales are used for categorical variables,
whereas interval and ratio scales are used for numerical
variables.
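As a practical illustration, the R sketch below shows one common way these scales are represented in the software used in this text: nominal data as factors, ordinal data as ordered factors, and interval- or ratio-scaled data as numeric vectors. The variable names and values are hypothetical.

# Nominal: labels only, no natural ordering
major  <- factor(c("Accounting", "Finance", "Marketing"))

# Ordinal: categories can be ranked, but differences are not meaningful
rating <- factor(c("Good", "Excellent", "Fair"),
                 levels = c("Poor", "Fair", "Good", "Very good", "Excellent"),
                 ordered = TRUE)
rating[1] < rating[2]    # TRUE: ordered categories can be compared

# Interval (arbitrary zero) and ratio (true zero): numeric vectors
tempF <- c(50, 60, 80, 90)    # degrees Fahrenheit (interval)
sales <- c(120, 240, 360)     # sales in $1,000s (ratio)
mean(sales)                   # arithmetic operations are valid for numerical data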
EXAMPLE 1.3
FILE
Tween_Survey
SOLUTION:
EXERCISES 1.3
Applications
14. Which of the following variables are categorical and which are numerical? If the
variable is numerical, then specify whether the variable is discrete or continuous.
a. Points scored in a football game.
b. Racial composition of a high school classroom.
c. Heights of 15-year-olds.
15. Which of the following variables are categorical and which are numerical? If the
variable is numerical, then specify whether the variable is discrete or continuous.
a. Colors of cars in a mall parking lot.
b. Time it takes each student to complete a final exam.
c. The number of patrons who frequent a restaurant.
16. In each of the following scenarios, define the type of measurement scale.
a. A kindergarten teacher marks whether each student is a boy or a girl.
b. A ski resort records the daily temperature during the month of January.
c. A restaurant surveys its customers about the quality of its waiting staff on a
scale of 1 to 4, where 1 is poor and 4 is excellent.
17. In each of the following scenarios, define the type of measurement scale.
a. An investor collects data on the weekly closing price of gold throughout the
year.
b. An analyst assigns a sample of bond issues to one of the following credit
ratings, given in descending order of credit quality (increasing probability of
default): AAA, AA, BBB, BB, CC, D.
c. The dean of the business school at a local university categorizes students by
major (i.e., accounting, finance, marketing, etc.) to help in determining class
offerings in the future.
18. In each of the following scenarios, define the type of measurement scale.
a. A meteorologist records the amount of monthly rainfall over the past year.
b. A sociologist notes the birth year of 50 individuals.
c. An investor monitors the daily stock price of BP following the 2010 oil disaster in
the Gulf of Mexico.
19. FILE Major. A professor records the majors of her 30 students. A portion of the
data is shown in the accompanying table.
Student Major
1 Accounting
2 Management
⋮ ⋮
30 Economics
Fixed-Width Format
In a data file with a fixed-width format (or fixed-length format), each
column starts and ends at the same place in every row. The actual
data are stored as plain text characters in a digital file. Consider the
information in Table 1.7. It shows the first name, telephone number,
and annual salary for three individuals.
Delimited Format
Another widely used file format to store tabular data is a delimited
format. In Figure 1.6, we show the information in Table 1.7 in a
delimited format, where each piece of data is separated by a
comma.
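As a sketch of how these two formats are read in practice, the R code below imports the individuals in Table 1.7 from a fixed-width file and from a comma-delimited file; the file names and column widths are assumptions and would need to match the actual files.

# Fixed-width format: each column starts and ends at the same position,
# so the column widths must be supplied explicitly
fixedData <- read.fwf("employees_fixed.txt",
                      widths = c(10, 12, 6),
                      col.names = c("Name", "Phone", "Salary"))

# Delimited format: a separator character (here a comma) marks each value,
# so no widths are needed
delimData <- read.csv("employees.csv")

head(fixedData)
head(delimData)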
EXERCISES 1.4
Applications
21. A used car salesperson recently sold two Mercedes, three Toyota, six Ford, and
four Hyundai sedans. He wants to record the sales data.
a. Organize the data into a fixed-width format (eight characters for brand name
and four characters for the number of cars sold).
b. Organize the data in a delimited format.
c. Code the data in XML format.
d. Code the data in HTML table format.
e. Code the data in JSON format.
22. Last year, Oracle hired four finance majors from the local university. Robert
Schneider started with a salary of $56,000, Chun Zhang with $52,000, Sunil
Banerjee with $58,000, and Linda Jones with $60,000. Oracle wants to record the
hiring data.
a. Organize the data into a fixed-width format (10 characters for first name, 10
characters for last name, and six characters for salary).
b. Organize the data in a delimited format.
c. Code the data in XML format.
d. Code the data in HTML table format.
e. Code the data in JSON format.
23. The following table lists the population, in millions, in India and China, the two
most populous countries in the world, for the years 2013 through 2017.
a. Organize the data into a fixed-width format (four characters for Year, eight
characters for India, and eight characters for China).
b. Organize the data in a delimited format.
24. The following table lists the top five countries in the world in terms of their
happiness index on a 10-point scale, and their corresponding GDP per capita as
reported by the United Nations in 2017.
a. Organize the data into a fixed-width format (11 characters for Country, 10
characters for Happiness, and six characters for GDP).
b. Organize the data in a delimited format.
25. The following three students were honored at a local high school for securing
admissions to prestigious universities.
Name University
Bridget Yale
Minori Stanford
Matthew Harvard
a. Organize the data into a fixed-width format (10 characters for Name and 10
characters for University).
b. Organize the data in a delimited format.
c. Code the data in XML format.
d. Code the data in HTML table format.
e. Code the data in JSON format.
26. According to Forbes, Michael Trout of the Los Angeles Angels, with earnings of
$39 million, was the highest-paid player in baseball in 2019. Bryce Harper of the
Philadelphia Phillies ranked second at $36.5 million. The Boston Red Sox pitcher
David Price was baseball’s third-highest-paid player at $32 million.
a. Code the data on the player name, baseball team, and salary in the XML
format.
b. Repeat part a using the HTML table format.
c. Repeat part a using the JSON format.
27. According to U.S. News and World Report, a statistician was the best business
profession in 2019 with 12,600 projected jobs and a median salary of $84,060. A
mathematician was the second-best profession with 900 projected jobs and a
median salary of $103,010. Interestingly, both of these career paths are related to
data analytics.
a. Code the information on the profession, projected jobs, and median salary in
XML format.
b. Repeat part a using the HTML table format.
c. Repeat part a using the JSON format.
Case Study
Since 1940, the Billboard magazine has published a variety of weekly music
popularity charts. Today, the magazine uses a combination of sales volume, airplay
on radio, digital downloads, and online streams in order to determine the chart
rankings. One of the most highly watched Billboard charts is the Hot 100, which
provides a weekly ranking of the top 100 music singles. Each entry on the Hot 100
chart lists the current rank, last week’s rank, highest position, and the number of
weeks the song has been on the chart. Other Billboard charts rank music by genre
such as Pop, Rock, R&B, and Latin or rank the popularity of music albums and
artists.
Maya Alexander is a reporter for her university’s newspaper, The Campus
Gazette. She wants to launch a new Film & Music column that will include
commentary, summarized data, and statistics on music popularity. In an effort to
convince her editor to give her this assignment, Maya researches and evaluates
what The Campus Gazette might be able to use from the Billboard website
(http://www.billboard.com).
Our readers may also be interested in following the top music singles by
genre. For example, Table 1.8 is an example of a summary table listing the
top five singles in Country, Pop, and Rock, which is easily compiled from
three different Billboard charts. This summary table can be complemented
with written commentaries based on information provided by the popularity
charts such as the highest chart position and the number of weeks on the
chart for each song.
Based on the most recent sales and market performance, the Billboard
website also organizes songs and albums into categories such as ‘Gains in
Performance,’ ‘Biggest Gain in Streams,’ and ‘Biggest Gain in Digital Sales.’
These songs and albums have not yet reached the top 5 positions, but their
rankings are quickly moving up on the popularity charts. The Campus
Gazette can establish itself as a music trendsetter on campus by introducing
our readers to these up-and-coming songs and albums in a new Film & Music
column. Figure 1.11 shows an example of an up-and-coming single on the
Hot 100 chart. Similar to the popularity charts, these music categories are
formatted using the HTML standard on the Billboard website and can be
readily imported into a text document. We can create commentaries on
selected songs and albums from these lists to introduce up-and-coming
music to our readers.
Report 1.1. Finland is the happiest country in the world, according to the 2018
Happiness Index Report by the United Nations (http://www.worldhappiness.report). In
fact, several Scandinavian countries have consistently held the top spots among the
156 countries included in the annual Happiness Index Report in the past several
years. Visit the Happiness Index website, explore, and write a report based on the
current data provided on the website.
Report 1.2. Millions of tourists visit Yosemite National Park in California each year.
Stunning waterfalls, giant redwood trees, and spectacular granite rock formations are
among the main attractions at the iconic park. However, the winding roads leading to
the Yosemite Valley may be closed occasionally due to severe weather conditions.
Visit a weather forecast website such as http://www.weather.com and explore the
weather data around Yosemite Park. Write a report to advise a tourist planning a
visit.
2 Data Management and Wrangling
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 2.1 Describe the key concepts related to data
management.
LO 2.2 Inspect and explore data.
LO 2.3 Apply data preparation techniques to
handle missing values and to subset data.
LO 2.4 Transform numerical variables.
LO 2.5 Transform categorical variables.
INTRODUCTORY CASE
Gaining Insights into Retail Customer Data
Organic Food Superstore is an online grocery store that
specializes in providing organic food products to health-
conscious consumers. The company offers a membership-
based service that ships fresh ingredients for a wide range of
chef-designed meals to its members’ homes. Catherine Hill is a
marketing manager at Organic Food Superstore. She has been
assigned to market the company’s new line of Asian-inspired
meals. Research has shown that the most likely customers for
healthy ethnic cuisines are college-educated millennials (born
between 1982 and 2000).
In order to spend the company’s marketing dollars
efficiently, Catherine wants to focus on this target demographic
when designing the marketing campaign. With the help of the
information technology (IT) group, Catherine has acquired a
representative sample that includes each customer’s
identification number (CustID), sex (Sex), race (Race),
birthdate (BirthDate), whether the customer has a college
degree (College), household size (HouseholdSize), zip code
(ZipCode), annual income (Income), total spending in 2017
(Spending2017), total spending in 2018 (Spending2018), total
number of orders during the past 24 months (NumOfOrders),
number of days since the last order (DaysSinceLast), the
customer’s rating on the last purchase (Satisfaction), and the
channel through which the customer was originally acquired
(Channel). Table 2.1 shows a portion of the data set.
DATA WRANGLING
Data wrangling is the process of retrieving, cleansing,
integrating, transforming, and enriching data to support
subsequent data analysis.
DATA MANAGEMENT
Data management is the process that an organization uses to
acquire, organize, store, manipulate, and distribute data.
The most common type of database (a collection of data) is
the relational database. A relational database consists of one or
more logically related data files, where each data file is a two-
dimensional grid that consists of rows and columns.
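To illustrate what “logically related” means, the brief R sketch below links a hypothetical CUSTOMER table to a hypothetical ORDER table through a shared customer ID, which is how related data files in a relational database are connected; the column names and values are illustrative only.

# CustID is the primary key of customers and a foreign key in orders
customers <- data.frame(CustID = c(101, 102, 103),
                        Name   = c("Hill", "Chen", "Lopez"))
orders    <- data.frame(OrderID = c(9001, 9002, 9003),
                        CustID  = c(102, 101, 102),
                        Amount  = c(54.20, 19.99, 87.50))

# Relate the two data files through the shared key (similar to a join)
merge(customers, orders, by = "CustID")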
a) CUSTOMER table
b) ORDER table
c) ORDER_DETAIL table
Order_ID Product_ID Quantity
1482141 4305 1
1482141 4330 2
⋮ ⋮ ⋮
1484001 4378 1
d) PRODUCT table
While simple queries like the previous one are useful, we often
need to compile data from multiple tables and apply more than one
selection criteria. For example, we may want to retrieve the customer
names, order IDs, and order quantities of organic sweet potato
purchases on October 15, 2018. The following SQL query returns
this information.
FIGURE 2.3 Star schema of a data mart for Organic Food Superstore
DATA WAREHOUSE AND DATA MART
A data warehouse is a central repository of data from multiple
departments within an organization to support managerial
decision making. Analytics professionals tend to acquire data
from data marts, which are small-scale data warehouses that
only contain data that are relevant to certain subjects or
decision areas. Data in a data mart are organized using a
multidimensional data model called a star schema, which
includes dimension and fact tables.
Each of the dimension tables has a 1:M relationship with the fact
table. Hence, the primary keys of the dimension tables are also the
foreign keys in the fact table. At the same time, the combination of
the primary keys of the dimension tables forms the composite
primary key of the fact table. The fact table is usually depicted at the
center surrounded by multiple dimension tables forming the shape of
a star. In reality, it is not uncommon to see multiple fact tables
sharing relationships with a group of dimension tables in a data mart.
One of the key advantages of the star schema is its ability to
“slice and dice” data based on different dimensions. For example, in
the data model shown in Figure 2.3, sales data can be retrieved
based on who the customer is; which product or product category is
involved; the year, quarter, or month of the order; where the
customer lives; or through which channel the order was submitted
simply by matching the primary keys of the dimension tables with the
foreign keys of the fact table.
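As a small illustration of slicing and dicing, the R sketch below joins a hypothetical fact table of sales to a hypothetical product dimension table and then totals sales by category and quarter; the tables and column names are illustrative, not the actual Organic Food Superstore data mart.

# Hypothetical fact table (sales) and dimension table (product)
sales   <- data.frame(ProductID = c(1, 2, 1, 3),
                      Quarter   = c("Q1", "Q1", "Q2", "Q2"),
                      Amount    = c(200, 150, 300, 120))
product <- data.frame(ProductID = c(1, 2, 3),
                      Category  = c("Produce", "Dairy", "Produce"))

# Match the dimension table's primary key to the fact table's foreign key,
# then summarize the facts along the chosen dimensions
joined <- merge(sales, product, by = "ProductID")
aggregate(Amount ~ Category + Quarter, data = joined, FUN = sum)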
In recent years, new forms of databases to support big data have
emerged. The most notable is the NoSQL or “Not Only SQL”
database. The NoSQL database is a non-relational database that
supports the storage of a wide range of data types including
structured, semi-structured, and unstructured data. It also offers the
flexibility, performance, and scalability needed to handle extremely
high volumes of data. Analytics professionals will likely see NoSQL
databases implemented alongside relational databases to support
organizations’ data needs in today’s environment.
EXERCISES 2.1
Applications
1. Which of the following statements correctly describe the data wrangling process?
Select all that apply. Explain if incorrect.
a. Data wrangling is the process of retrieving, cleansing, integrating, transforming,
and enriching data.
b. Data wrangling is the process of defining and modeling the structure of a
database to represent real-world events.
c. The objectives of data wrangling include improving data quality and reducing
the time and effort required to perform analytics.
d. Data wrangling focuses on transforming the raw data into a format that is more
appropriate and easier to analyze.
2. Which of the following statements about entity-relationship diagrams (ERDs) are
correct? Select all that apply. Explain if incorrect.
a. An entity usually represents persons, places, things, or events about which we
want to store data.
b. A foreign key is an attribute that uniquely identifies each instance of the entity.
c. A composite key is a key that consists of more than one attribute.
d. A relationship between entities represents certain business facts or rules.
3. Which of the following statements correctly identify and describe the key elements
of a relational database? Select all that apply. Explain if incorrect.
a. A table in a relational database is a two-dimensional grid that
contains actual data.
b. A field or a column represents a characteristic of a physical object, an event, or
a person.
c. A relational database includes software tools for advanced data visualization.
d. A tuple or a record in a table represents a physical object, an event, or a
person.
4. Which of the following statements best describes what a foreign key is? Select all
that apply. Explain if incorrect.
a. It is an attribute that uniquely identifies each instance of the entity.
b. It is a primary key that consists of multiple attributes.
c. It is the primary key of a related database table.
d. It is a single occurrence of an entity.
5. Which type of relationship―one-to-one (1:1), one-to-many (1:M), or many-to-
many (M:N)―do the following business rules describe?
a. One manager can supervise multiple employees, and one employee may report
to multiple managers.
b. A business department has multiple employees, but each employee can be
assigned to only one department.
c. A company can have only one CEO, and each CEO can work for only one
company.
d. An academic adviser can work with multiple students, while each student is
assigned to only one adviser.
e. A golf course offers a membership to many members, and a golfer can
potentially sign up for a membership at multiple golf courses.
f. A soccer team consists of multiple players, while an individual player can play
for only one team at a time.
6. Which of the following statements correctly describe benefits of Structured Query
Language (SQL)? Select all that apply. Explain if incorrect.
a. SQL can be used to manipulate structured, semi-structured, and unstructured
data.
b. SQL commands allow users to select data based on multiple selection criteria.
c. SQL can be used to compile data from multiple tables.
d. SQL commands are relatively simple and intuitive.
7. Which of the following statements about data warehouses and data marts are
correct? Select all that apply. Explain if incorrect.
a. A data warehouse is a subset of the enterprise database that focuses on one
particular subject or decision area.
b. The dimension table describes business dimensions of interest, such as
customer, product, location, and time, while the fact table contains facts about
the business operation.
c. A star schema represents a multidimensional data model.
d. A data warehouse is the central repository of data from multiple departments
within a business enterprise to support managerial decision making.
Once the raw data are extracted from the database, data
warehouse, or data mart, we usually review and inspect the data set
to assess data quality and relevant information for subsequent
analysis. In addition to visually reviewing data, counting and sorting
are among the very first tasks most data analysts perform to gain a
better understanding and insights into the data. Counting and sorting
data help us verify that the data set is complete or that it may have
missing values, especially for important variables. Sorting data also
allows us to review the range of values for each variable. We can
sort data based on a single variable or multiple variables.
In Example 2.1, we demonstrate how to use counting and sorting
features in Excel and R to inspect and gain insights into the data.
While these features also allow us to detect missing values, we
discuss the treatment of missing values in Section 2.3.
FILE
Gig
EXAMPLE 2.1
BalanceGig is a company that matches independent workers
for short-term engagements with businesses in the
construction, automotive, and high-tech industries. The ‘gig’
employees work only for a short period of time, often on a
particular project or a specific task. A manager at BalanceGig
extracts the employee data from their most recent work
engagement, including the hourly wage
(HourlyWage), the client’s industry (Industry), and
the employee’s job classification (Job). A portion of the Gig
data set is shown in Table 2.3.
EmployeeID HourlyWage Industry Job
1 32.81 Construction Analyst
2 46 Automotive Engineer
⋮ ⋮ ⋮ ⋮
Using Excel
a. Open the Gig data file. Note that the employee data are
currently sorted by their employee ID in column A. Scroll to the
end of the data set and note that the last record is in row 605.
With the column heading in row 1, the data set has a total of
604 records.
b. We use two Excel functions, COUNT and COUNTA, to inspect
the number of values in each column. The COUNT function
counts the number of cells that contain numeric values and,
therefore, can only apply to the EmployeeID and HourlyWage
variables. The COUNTA function counts the number of cells
that are not empty and is applicable to all four variables.
Because HourlyWage is a numerical variable, we can enter
either =COUNT(B2:B605) or =COUNTA(B2:B605) in an empty
cell to count the number of values for HourlyWage. We get
604 values, implying that there are no missing values.
Similarly, we enter =COUNTA(C2:C605) and
=COUNTA(D2:D605) in empty cells to count the number of
values for the Industry (column C) and Job (column D)
variables. Because these two variables are non-numerical, we
use COUNTA instead of COUNT. Verify that the number of
records for Industry and Job are 594 and 588, respectively,
indicating that there are 10 and 16 blank or missing values,
respectively, in these two variables.
c. To count the number of employees in each industry, we use
the COUNTIF function. Entering
=COUNTIF(C2:C605,“=Automotive”) in an empty cell will show
that 190 of the 604 employees worked in the automotive
industry. Similarly, entering =COUNTIF(B2:B605,“>30”) in an
empty cell will show that 536 employees earned more than
$30 per hour. Note that the first parameter in the COUNTIF
function is the range of cells to be counted, and the second
parameter specifies the selection criterion. Other logical
operators such as >=, <, <=, and <> (not equal to) can also be
used in the COUNTIF function.
d. To count the number of employees with multiple
selection criteria, we use the COUNTIFS function. For
example, entering =COUNTIFS(C2:C605, "=Automotive",
B2:B605,">30") in an empty cell will show that 181 employees
worked in the automotive industry and earned more than $30
per hour. Additional data ranges and selection criteria can be
added in corresponding pairs. The >=, <, <=, and <> operators
can also be used in the COUNTIFS function.
e. To sort all employees by their hourly wage, highlight cells A1
through D605. From the menu, click Data > Sort (in the Sort &
Filter group). Make sure that the My data has headers
checkbox is checked. Select HourlyWage for the Sort by
option and choose the Smallest to Largest (or ascending)
order. Click OK.
At the top of the sorted list, verify that there are three
employees with the lowest hourly wage of $24.28. To sort data
in descending order, repeat step e but choose the Largest to
Smallest (or descending) order. Verify that the highest hourly
wage is $51.00.
f. To sort the data based on multiple variables, again highlight
cells A1:D605 and go to Data > Sort. Choose Industry in the
Sort by option and the A to Z (or ascending) order. Click the
Add Level button and choose Job in the Then by option and
the A to Z order. Click the Add Level button again and choose
HourlyWage in the second Then by option and the Smallest to
Largest order. Click OK. We see that the lowest- and the
highest-paid accountants who worked in the automotive
industry made $28.74 and $49.32 per hour, respectively.
Similarly, sorting the data by industry in descending order
(Z to A) and then by job classification and hourly wage in
ascending order reveals that the lowest- and the highest-paid
accountants in the Tech industry made $36.13 and $49.49 per
hour, respectively.
g. To resort the data set to its original order, again highlight cells
A1:D605 and go to Data > Sort. Select each of the Then by
rows and click the Delete Level button. Choose EmployeeID in
the Sort by option and the Smallest to Largest order.
Using R
Before following all R instructions, make sure that you have
read Appendix C (“Getting Started with R”). We assume that
you have downloaded R and RStudio and that you know how
to import an Excel file. Throughout the text, our goal is to
provide the simplest way to obtain the relevant output. We
denote all function names in boldface and all options within a
function in italics.
a. Import the Gig data file into a data frame (table) and label it
myData. Keep in mind that the R language is case sensitive.
b. We use the dim function in R to count the number of
observations and variables. Verify that the R output shows 604
observations and four variables. Enter:
> dim(myData)
c. Two common functions to display a portion of data are head
and View. The head function displays the first few
observations in the data set, and the View function (case
sensitive) displays a spreadsheet-style data viewer where the
user can scroll through rows and columns. Verify that the first
employee in the data set is an analyst who worked in the
construction industry and made $32.81 per hour. Enter:
> head(myData)
> View(myData)
d. R stores missing values as NA, and we use the is.na function
to identify the observations with missing values. R labels
observations with missing values as “True” and
observations without missing values as “False.” In
order to inspect the Industry variable for missing values, enter:
> is.na(myData$Industry)
e. For a large data set, having to look through all observations is
inconvenient. Alternately, we can use the which function
together with the is.na function to identify “which”
observations contain missing values. The following command
identifies 10 observations by row number as having a missing
value in the Industry variable. Verify that the first observation
with a missing Industry value is in row 24. Enter:
> which (is.na(myData$Industry))
f. To inspect the 24th observation, we specify row 24 in the
myData data frame. Enter:
> myData[24,]
Note that there are two elements within the square bracket,
separated by a comma. The first element identifies a row
number (also called row index), and the second element after
the comma identifies a column number (also called column
index). Leaving the second element blank will display all
columns. To inspect an observation in row 24 and column 3,
we enter myData[24, 3]. In a small data set, we can also
review the missing values by scrolling to the specific rows and
columns in the data viewer produced by the View function. As
mentioned earlier, the treatment of missing values is
discussed in Section 2.3.
g. To identify and count the number of employees with multiple
selection criteria, we use the which and length functions. In
the following command, we identify which employees worked
in the automotive industry with the which function and count
the number of these employees using the length function. The
double equal sign (==), also called equality operator, is used
to check whether the industry is automotive. In R, text
characters such as ‘Automotive’ are enclosed in quotation
marks. Enter:
> length(which(myData$Industry=='Automotive'))
We can also use the >, >=, <, <=, and != (not equal to)
operators in the selection criteria. For example, using the
following command, we can determine the number of
employees who earn more than $30 per hour. Enter:
> length(which(myData$HourlyWage > 30))
Note that there are 190 employees in the automotive industry
and there are 536 employees who earn more than $30 per
hour.
h. To count how many employees worked in a particular industry
and earned more than a particular wage, we use the and
operator (&). The following command shows that 181
employees worked in the automotive industry and earned
more than $30 per hour. Enter:
> length(which(myData$Industry=='Automotive' &
myData$HourlyWage > 30))
i. We use the order function to sort the observations of a
variable. In order to sort the HourlyWage variable and store
the ordered data set in a new data frame called sortedData1,
enter:
> sortedData1 <- myData[order(myData$HourlyWage),]
> View(sortedData1)
The View function shows that the lowest and
highest hourly wages are $24.28 and $51.00, respectively. By
default, the sorting is performed in ascending order. To sort in
descending order, enter:
> sortedData1 <- myData[order(myData$HourlyWage,
decreasing = TRUE),]
j. To sort data by multiple variables, we specify the variables in
the order function. The following command sorts the data by
industry, job classification, and hourly wage, all in ascending
order, and stores the ordered data in a data frame called
sortedData2. Enter:
> sortedData2 <- myData[order(myData$Industry,
myData$Job, myData$HourlyWage),]
> View(sortedData2)
The View function shows that the lowest-paid accountant who
worked in the automotive industry made $28.74 per hour.
k. To sort the data by industry and job classification in ascending
order and then by hourly wage in descending order, we insert
a minus sign in front of the hourly wage variable. Verify that
the highest-paid accountant in the automotive industry made
$49.32 per hour. Enter:
> sortedData3 <- myData[order(myData$Industry,
myData$Job, -myData$HourlyWage),]
> View(sortedData3)
l. The industry and job classification variables are non-
numerical. As a result, to sort the data by industry in
descending order and then by job classification and hourly
wage in ascending order, we use the xtfrm function with the
minus sign in front of the Industry variable. Enter:
> sortedData4 <- myData[order(-xtfrm(myData$Industry),
myData$Job, myData$HourlyWage),]
> View(sortedData4)
The View function reveals that the lowest- and the highest-
paid accountants in the technology industry made $36.13 and
$49.49 per hour, respectively.
m. To sort the data by industry, job, and hourly wage, all in
descending order, we use the decreasing option in the order
function. Verify that the highest-paid sales representative in
the technology industry made $48.87. Enter:
> sortedData5 <- myData[order(myData$Industry,
myData$Job, myData$HourlyWage, decreasing = TRUE),]
> View(sortedData5)
n. To export the sorted data from step m as a comma-separated
value file, we use the write.csv function. Verify that the
exported file is in the default folder (e.g., My Document on
Microsoft Windows). Other data frames in R can be exported
using a similar statement. Enter:
> write.csv(sortedData5,"sortedData5.csv")
Summary
There are a total of 604 records in the data set. There are no
missing values in the HourlyWage variable. The Industry and
Job variables have 10 and 16 missing values, respectively.
190 employees worked in the automotive industry, 536
employees earned more than $30 per hour, and 181
employees worked in the automotive industry and earned
more than $30 per hour.
The lowest and the highest hourly wages in the
data set are $24.28 and $51.00, respectively. The three
employees who had the lowest hourly wage of $24.28 all
worked in the construction industry and were hired as
Engineer, Sales Rep, and Accountant, respectively.
Interestingly, the employee with the highest hourly wage of
$51.00 also worked in the construction industry in a job type
classified as Other.
The lowest- and the highest-paid accountants who worked in
the automotive industry made $28.74 and $49.32 per hour,
respectively. In the technology industry, the lowest- and the
highest-paid accountants made $36.13 and $49.49 per hour,
respectively. Note that the lowest hourly wage for an
accountant is considerably higher in the technology industry
compared to the automotive industry ($36.13 > $28.74).
There are many ways to count and sort data to obtain useful
insights. To gain further insights, students are encouraged to
experiment with the Gig data using different combinations of
counting and sorting options than the ones used in Example 2.1.
EXERCISES 2.2
Mechanics
8. FILE Exercise_2.8. The accompanying data set contains two numerical
variables, x1 and x2.
a. For x2, how many of the observations are equal to 2?
b. Sort x1 and then x2, both in ascending order. After the variables have been
sorted, what is the first observation for x1 and x2?
c. Sort x1 and then x2, both in descending order. After the variables have been
sorted, what is the first observation for x1 and x2?
d. Sort x1 in ascending order and x2 in descending order. After the variables have
been sorted, what is the first observation for x1 and x2?
e. How many missing values are there in x1 and x2?
9. FILE Exercise_2.9. The accompanying data set contains three numerical
variables, x1, x2, and x3.
a. For x1, how many of the observations are greater than 30?
b. Sort x1, x2, and then x3 all in ascending order. After the variables have been
sorted, what is the first observation for x1, x2, and x3?
c. Sort x1 and x2 in descending order and x3 in ascending order. After the
variables have been sorted, what is the first observation for x1, x2, and x3?
d. How many missing values are there in x1, x2, and x3?
10. FILE Exercise_2.10. The accompanying data set contains three numerical
variables, x1, x2, and x3, and one categorical variable, x4.
a. For x4, how many of the observations are less than three?
b. Sort x1, x2, x3, and then x4 all in ascending order. After the variables have
been sorted, what is the first observation for x1, x2, x3, and x4?
c. Sort x1, x2, x3, and then x4 all in descending order. After the variables have
been sorted, what is the first observation for x1, x2, x3, and x4?
d. How many missing values are there in x1, x2, x3, and x4?
e. How many observations are there in each category in x4?
Applications
11. FILE SAT. The following table lists a portion of the average writing and math SAT
scores for the 50 states as well as the District of Columbia, Puerto Rico, and the
U.S. Virgin Islands for the year 2017 as reported by the College Board.
⋮ ⋮ ⋮
a. Sort the data by writing scores in descending order. Which state has the highest
average writing score? What is the average math score of that state?
b. Sort the data by math scores in ascending order. Which state has the lowest
average math score? What is the average writing score of that state?
c. How many states reported an average math score higher than 600?
d. How many states reported an average writing score lower than 550?
12. FILE Fitness. A social science study conducts a survey of 418 individuals about
how often they exercise, marital status, and annual income. A portion of the
Fitness data is shown in the accompanying table.
page 45
ID Exercise Married Income
⋮ ⋮ ⋮ ⋮
a. Sort the data by annual income. Of the 10 highest income earners, how many
of them are married and always exercise?
b. Sort the data by marital status and exercise both in descending order. How
many of the individuals who are married and exercise sometimes earn more
than $110,000 per year?
c. How many missing values are there in each variable?
d. How many individuals are married and unmarried?
e. How many married individuals always exercise? How many unmarried
individuals never exercise?
13. FILE Spend. A company conducts a consumer survey with questions about
home ownership (OwnHome: Yes/No), car ownership (OwnCar: Yes/No), annual
household spending on food (Food), and annual household spending on travel
(Travel). A portion of the data is shown in the accompanying table.
a. Sort the data by home ownership, car ownership, and the travel spending all in
descending order. How much did the first customer on the ordered list spend on
food?
b. Sort the data only by the travel spending amount in descending order. Of the 10
customers who spend the most on traveling, how many of them are
homeowners? How many of them are both homeowners and car owners?
c. How many missing values are there in each variable?
d. How many customers are homeowners?
e. How many customers are homeowners but do not own a car?
14. FILE Demographics. The accompanying table shows a portion of data on
an individual’s income (Income in $1,000s), age, sex (F = female, M =
male), and marital status (Married; Y = yes, N = no).
Once we have inspected and explored data, we can start the data
preparation process. In this section, we examine two important data
preparation techniques: handling missing values and subsetting
data. As mentioned in Section 2.2, there may be missing values in
the key variables that are crucial for subsequent analysis. Moreover,
most data analysis projects focus only on a portion (subset) of the
data, rather than the entire data set; or sometimes the objective of
the analysis is to compare two subgroups of the data.
page 46
page 47
HANDLING MISSING VALUES
There are two common strategies for dealing with missing
values.
The omission strategy recommends that observations with
missing values be excluded from subsequent analysis.
The imputation strategy recommends that the missing values
be replaced with some reasonable imputed values. For
numerical variables, it is common to use mean imputation. For
categorical variables, it is common to impute the most
predominant category.
EXAMPLE 2.2
Sarah Johnson, the manager of a local restaurant, has
conducted a survey to gauge customers’ perception about the
eatery. Each customer rated the restaurant on its ambience,
cleanliness, service, and food using a scale of 1 (lowest) to 7
(highest). Table 2.4 displays a portion of the survey data.
Using R
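A minimal sketch of the two strategies in R, assuming the survey responses have been imported into a data frame labeled surveyData whose rating variables are named Ambience, Cleanliness, Service, and Food (hypothetical names based on the description above). The omission strategy keeps only the complete observations, while mean imputation replaces a missing rating with the average of the nonmissing ratings for that variable. Enter:
> # Omission: keep only observations with no missing values
> completeData <- na.omit(surveyData)
> nrow(completeData)
> # Mean imputation for one rating variable; repeat for the others
> surveyData$Ambience[is.na(surveyData$Ambience)] <-
   mean(surveyData$Ambience, na.rm = TRUE)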
Subsetting
The process of extracting portions of a data set that are relevant to
the analysis is called subsetting. It is commonly used to pre-
process the data prior to analysis. For example, a multinational
company has sales data for its global operations, and it creates a
subset of sales data by country and performs analysis accordingly.
For time series data, which are data indexed in time order, we may
choose to create subsets of recent observations and observations
from the distant past in order to analyze them separately. Subsetting
can also be used to eliminate unwanted data such as observations
that contain missing values, low-quality data, or outliers. Sometimes,
subsetting involves excluding variables instead of observations. For
example, we might remove variables that are irrelevant to the
problem, variables that contain redundant information (e.g., property
value and property tax or employee’s age and experience), or
variables with excessive amounts of missing values.
SUBSETTING
Subsetting is the process of extracting parts of a data set that are
of interest to the analytics professional.
Summary Measures Successful Treatments Unsuccessful Treatments
page 51
Note that the sex, education level (1: lowest; 5: highest), and
income of the patients differ considerably between the two subsets.
Not surprisingly, male patients, especially those with higher
education and income levels, have better success with tuberculosis
treatment than female patients with lower education and income
levels. This simple analysis highlights the importance of contributing
factors in tuberculosis control efforts.
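A comparison along these lines can be sketched in R, assuming the patient records are in a data frame labeled tbData with an Outcome variable coded as Successful or Unsuccessful and numerical Education and Income variables (all hypothetical names). We create one subset for each outcome and then compare the average income of the two groups; the same approach applies to the other summary measures. Enter:
> successful <- tbData[tbData$Outcome == 'Successful', ]
> unsuccessful <- tbData[tbData$Outcome == 'Unsuccessful', ]
> mean(successful$Income, na.rm = TRUE)
> mean(unsuccessful$Income, na.rm = TRUE)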
In Example 2.3, we demonstrate how to use subsetting functions
in Excel and R to select or exclude variables and/or observations
from the original data set.
FILE
Customers
EXAMPLE 2.3
In the introductory case, Catherine Hill wants to gain a better
understanding of Organic Food Superstore’s customers who
are college-educated millennials, born between 1982 and
2000. She feels that sex, household size, income, total
spending in 2018, total number of orders in the past 24 months,
and channel through which the customer was acquired are
useful for her to create a profile of these customers. Use Excel
and R to first identify college-educated millennial customers in
the Customers data file. Then, create subsets of female and
male college-educated millennial customers. The synopsis that
follows this example provides a summary of the results.
SOLUTION:
Using Excel
e. Select the entire filtered data that are left in the worksheet.
Copy and paste the filtered data to a new worksheet. Verify
that the new worksheet contains 59 observations of college-
educated millennials. Rename the new worksheet as College-
Educated Millennials.
f. We now exclude the variables that are not relevant
to the current analysis. In the College-Educated
Millennials worksheet, select cell A1 (CustID). From the menu
choose Home > Delete > Delete Sheet Columns to remove
the CustID column. Repeat this step for the Race, BirthDate,
College, ZipCode, Spending2017, DaysSinceLast, and
Satisfaction columns from the data set.
g. To subset the college-educated millennials data by sex, select
column A. From the menu choose Home > Sort & Filter >
Sort A to Z. If prompted, select Expand the selection in the
Sort Warning dialog box and click Sort. The observations are
now sorted by sex in alphabetic order. The female customer
records are followed by male customer records.
h. Create two new worksheets and assign the worksheet names
Female and Male. Copy and paste the female and male
customer records, including the column headings, to the new
Female and Male worksheets, respectively. Table 2.6 shows a
portion of the results.
Using R
a. Import the Customers data into a data frame (table) and label
it myData.
b. To select college-educated millennials, we first select all
customers with a college degree. Recall that the double equal
sign (==) is used to check whether the College value is “Yes.”
Enter:
> college <- myData[myData$College=='Yes', ]
c. We now use the BirthDate variable to select the millennials
who were born between 1982 and 2000. R usually imports the
date values as text characters, and, therefore, we first need to
convert the BirthDate variable into the date data type using the
as.Date function. The option format = “%m/%d/%Y” indicates
that the BirthDate variable is in the mm/dd/yyyy format. For
example, in order for R to read dates such as 01/13/1990,
enter:
> college$BirthDate <- as.Date(college$BirthDate, format =
"%m/%d/%Y")
Other common date formats include “%Y-%m-%d”, “%b %d,
%Y”, and “%B %d, %Y” that will read dates specified as 1990-
01-13, Jan 13, 1990, and January 13, 1990, respectively.
d. We also use the as.Date function to specify the page 53
cutoff dates, January 1, 1982, and December 31,
1999, before using them as selection criteria for selecting the
millennials in our data. Enter:
> cutoffdate1 <- as.Date("01/01/1982", format = "%m/%d/%Y")
> cutoffdate2 <- as.Date("12/31/1999", format = "%m/%d/%Y")
> millenials <- college[college$BirthDate >= cutoffdate1 &
college$BirthDate <= cutoffdate2, ]
Verify that the millennials data frame contains 59 college-
educated millennials.
e. To include only the Sex, HouseholdSize, Income,
Spending2018, NumOfOrders, and Channel variables in the
millenials data frame, we specify the column indices of these
variables using the c function. Enter:
> subset1 <- millenials[ , c(2,6,8,10,11,14)]
Alternately, we can create a new data frame by specifying the
names of the variables to include. Enter:
> subset2 <- millenials[ , c("Sex", "HouseholdSize", "Income",
"Spending2018", "NumOfOrders", "Channel")]
Note that subset1 and subset2 data are identical.
f. R imports non-numerical variables such as Sex and Channel
as text characters. Before further subsetting and examining
the data, we convert Sex and Channel into categorical
variables (called factors in R) by using the as.factor function.
Enter:
> subset1$Sex <- as.factor(subset1$Sex)
> subset1$Channel <- as.factor(subset1$Channel)
To verify that the Channel variable has been converted into a
factor or a categorical variable, enter:
> is.factor(subset1$Channel)
This command returns TRUE if the variable is a factor, and
FALSE otherwise.
g. To create two subsets of data based on Sex, we use the split
function. Enter:
> sex <- split(subset1, subset1$Sex)
The sex data frame contains two subsets: Female and Male.
We can now access and view the Female and Male subsets.
Enter:
> sex$Female
> sex$Male
> View(sex$Female)
> View(sex$Male)
Verify that there are 21 female college-educated millennials
and 38 male college-educated millennials. Your results should
be similar to Table 2.6.
h. In some situations, we might simply want to subset data based
on data ranges. For example, we use the following statement
to subset data to include observations 1 to 50 and
observations 101 to 200. Enter:
> dataRanges <- myData[c(1:50, 101:200),]
page 54
hbpictures/Shutterstock
The differences between the female and male customers have given Catherine
some ideas for the marketing campaign that she is about to design. For example,
the data show that an overwhelming portion of the male customers were acquired
through social media ads, while female customers tend to be enticed by web ads or
referrals. She plans to design and run a series of social media ads about the new
product line with content that targets male customers. For female customers,
Catherine wants to focus her marketing efforts on web banner ads and the
company’s referral program.
Furthermore, as the male customers seem to place more frequent but smaller
orders than female customers do, Catherine plans to work with her marketing team
to develop some cross-sell and upsell strategies that target male customers. Given
the fact that the company’s male college-educated millennial customers tend to be
high-income earners, Catherine is confident that with the right message and product
offerings, her marketing team will be able to develop strategies for increasing the
total spending of these customers.
EXERCISES 2.3
Mechanics
16. The following table contains three variables and five observations with some
missing values.
x1 x2 x3
248 3.5
124 3.8 55
150 74
196 4.5 32
6.2 63
a. Handle the missing values using the omission strategy. How many observations
remain in the data set and have complete cases?
b. Handle the missing values using the simple mean imputation strategy. How
many missing values are replaced? What are the means of x1, x2, and x3?
17. FILE Exercise_2.17. The accompanying data set contains four variables, x1, x2,
x3, and x4.
a. Subset the data set to include only observations that have a date on or after
May 1, 1975, for x3. How many observations are in the subset data?
b. Split the data set based on the binary values for x4. What are the average
values for x1 for the two subsets?
18. FILE Exercise_2.18. The accompanying data set contains five variables, x1, x2,
x3, x4, and x5.
a. Subset the data set to include only x2, x3, and x4. How many missing values
are there in the three remaining variables?
b. Remove all observations that have “Own” as the value for x2 and have values
lower than 150 for x3. How many observations remain in the data set? What are
the average values for x3 and x4?
19. FILE Exercise_2.19. The accompanying data set contains five variables, x1, x2,
x3, x4, and x5. There are missing values in the data set.
a. Which variables have missing values?
b. Which observations have missing values?
c. How many missing values are in the data set?
d. Handle the missing values using the omission strategy. How many observations
remain in the data set and have complete cases?
20. FILE Exercise_2.20. The accompanying data set contains five variables, x1, x2,
x3, x4, and x5. There are missing values in the data set. Handle the missing
values using the simple mean imputation strategy for numerical variables and the
predominant category strategy for categorical variables.
a. How many missing values are there for each variable?
b. What are the values for imputing missing values in x1, x2, x3,
x4, and x5?
21. FILE Exercise_2.21. The accompanying data set contains five variables, x1, x2,
x3, x4, and x5.
a. Are there missing values for x1? If so, impute the missing values using the
mean value of x1. After imputation, what is the mean of x1?
b. Are there missing values for x2? If so, impute the missing values using the
mean value of x2. After imputation, what is the mean of x2?
c. If there are missing values in x4, impute the missing values using the median
value of x4. (Hint: Use the MEDIAN(data range) function in Excel or the
median function in R.) After imputation, what is the median of x4?
22. FILE Exercise_2.22. The accompanying data set contains five variables, x1, x2,
x3, x4, and x5. There are missing values in the data set.
a. Which variables have missing values?
b. Which observations have missing values?
c. How many missing values are in the data set?
d. Handle the missing values using the omission strategy. How many observations
remain in the data set and have complete cases?
23. FILE Exercise_2.23. The accompanying data set contains four variables, x1, x2,
x3, and x4. There are missing values in the data set.
a. Subset the data set to include only x1, x2, and x3.
b. Which variables have missing values?
c. Which observations have missing values?
d. How many missing values are in the data set?
e. Handle the missing values using the omission strategy. How many observations
remain in the data set and have complete cases?
f. Split the data set based on the categories of x2. How many observations are in
each subset?
24. FILE Exercise_2.24. The accompanying data set contains seven variables, x1,
x2, x3, x4, x5, x6, and x7. There are missing values in the data set.
a. Remove variables x2, x6, and x7 from the data set. Which of the remaining
variables have missing values?
b. Which observations have missing values?
c. How many missing values are in the data set?
d. If there are missing values in x1, replace the missing values with “Unknown.”
How many missing values were replaced?
e. Handle the missing values for numerical variables using the imputation strategy.
If there are missing values in x3, impute the missing values using the mean
value of x3. If there are missing values in x4, impute the missing values using
the median value of x4. If there are missing values in x5, impute the missing
values using the mean value of x5. What are the average values of x3, x4, and
x5 after imputation?
f. Remove observations that have the value “F” for x1 and values lower than
1,020 for x4. How many observations remain in the data set?
Applications
25. FILE Population. The US Census Bureau records the population for the 50
states each year. The accompanying table shows a portion of these data for the
years 2010 to 2018.
a. Create two subsets of the state population data: one with 2018 population greater
than or equal to 5 million and one with 2018 population less than 5 million. How
many observations are in each subset?
b. In the subset of states with 5 million or more people, remove the states with
over 10 million people. How many states were removed?
26. FILE Travel_Plan. Jerry Stevenson is the manager of a travel agency. He wants
to build a model that can predict whether or not a customer will travel within the
next year. He has compiled a data set that contains the following variables:
whether the individual has a college degree (College), whether the individual has
a credit card (CreditCard), annual household spending on food (FoodSpend in $),
annual income (Income in $), and whether the customer has plans to travel within
the next year (TravelPlan, 1 = have travel plans; 0 = do not have travel plans). A
portion of the Travel_Plan data is shown in the accompanying table.
a. Are there any missing values in the data set? If there are, which variables have
missing values? How many missing values are there in the data set?
b. Use the omission strategy to handle missing values. How many observations
are removed due to missing values?
27. FILE Travel_Plan. Refer to the previous exercise for a description of the problem
and data set.
a. Based on his past experience, Jerry knows that whether the individual has a
credit card or not has nothing to do with his or her travel plans and would like to
remove this variable. Remove the variable CreditCard from the data set.
b. In order to better understand his customers with high incomes,
Jerry wants to create a subset of the data that only includes
customers with annual incomes higher than $75,000 and who plan to travel
within the next year. Subset the data to build the list of customers who meet
these criteria. How many observations are in this subset?
c. Return to the original data set. Use the imputation strategy to handle missing
values. If there are missing values for the FoodSpend variable, impute the
missing values using the mean of the variable. If there are missing values for
the Income variable, impute the missing values using the median of the
variable. What are the average values of FoodSpend and Income after
imputation?
28. FILE Football_Players. Denise Lau is an avid football fan and religiously follows
every game of the National Football League. During the 2017 season, she
meticulously keeps a record of how each quarterback has played throughout the
season. Denise is making a presentation at the local NFL fan club about these
quarterbacks. The accompanying table shows a portion of the data that Denise
has recorded, with the following variables: the player’s name (Player), team’s
name (Team), completed passes (Comp), attempted passes (Att), completion
percentage (Pct), total yards thrown (Yds), average yards per attempt (Avg),
yards thrown per game (Yds/G), number of touchdowns (TD), and number of
interceptions (Int).
a. Are there any missing values in the data set? If there are, which variables have
missing values? Which observations have missing values? How many missing
values are there in the data set?
b. Use the omission strategy to handle missing values. How many observations
are removed due to missing values?
29. FILE Football_Players. Refer to the previous exercise for a description of the
data set. Denise feels that, for her presentation, it would remove some biases if
the player names and team names are suppressed. Remove these variables from
the data set.
a. Denise also wants to remove outlier cases where the players have less than
five touchdowns or more than 20 interceptions. Remove these observations
from the data set. How many observations were removed from the data?
b. Return to the original data set. Use the imputation strategy to handle missing
values. If there are missing values for Comp, Att, Pct, Yds, Avg, or Yds/G,
impute the missing values using the mean of the variable. If there are missing
values for TD or Int, impute the missing values using the median of the variable.
What are the average values of Comp, Att, Pct, Yds, Avg, and Yds/G after
imputation?
30. FILE Salaries. Ian Stevens is a human resource analyst working for the city of
Seattle. He is performing a compensation analysis of city employees. The
accompanying data set contains three variables: Department, Job Title, and
Hourly Rate (in $). A few hourly rates are missing in the data.
⋮ ⋮ ⋮
a. Split the data set into a number of subsets based on Department. How many
subsets are created?
b. Which subset contains missing values? How many missing values are in that
data set?
c. Use the imputation strategy to replace the missing values with the mean of the
variable. What is the average hourly rate for each subset?
31. FILE Stocks. Investors usually consider a variety of information to make
investment decisions. The accompanying table displays a sample of large publicly
traded corporations and their financial information. Relevant information includes
stock price (Price), dividend as a percentage of share price (Dividend), price to
earnings ratio (PE), earnings per share (EPS), book value, lowest and highest
share prices within the past 52 weeks (52 wk low and 52 wk high), market value
of the company’s shares (Market cap), and earnings before interest, taxes,
depreciation, and amortization (EBITDA in $ billions).
DATA TRANSFORMATION
Data transformation is the data conversion process from one
format or structure to another.
page 58
Binning
Binning is the process of transforming numerical variables into
categorical variables by grouping the numerical values into a small
number of groups or bins. It is important that the bins are
consecutive and nonoverlapping so that each numerical value falls
into one, and only one, bin. For example, we might want to transform
income values into three groups: below $50,000, between $50,000
and $100,000, and above $100,000. The three income groups can
be labeled as “low,” “medium,” and “high” or “1,” “2,” and “3.” Binning
can be an effective way to reduce noise in the data if we believe that
all observations in the same bin tend to behave the same way. For
example, the transformation of the income values into three groups
makes sense when we are more interested in a person’s earning
power (low, medium, or high) rather than the actual income value.
BINNING
Binning is a common data transformation technique that
converts numerical variables into categorical variables by
grouping the numerical values into a small number of bins.
As noted above, binning reduces the noise in the data often due
to minor observation errors. For example, with binning, outliers in the
data (e.g., individuals with extremely high income, perhaps recorded
incorrectly) will be part of the last bin and, therefore, will not distort
subsequent data analysis. Binning is also useful in categorizing
observations and meeting the categorical data requirements of some
data mining analytics techniques such as naïve Bayes (discussed in
Chapter 9).
In addition to binning numerical values according to user-defined
boundaries, bins are also often created to have equal intervals. For
example, we can create bins that represent an interval of 10 degrees
in temperature or 10 years in age. We can also create bins of equal
counts, where individual bins have the same number of
observations. For example, by binning a class of 200 students into
10 equal-size groups based on their grades, we can find out the
relative academic standing of the students. Students in the bin with
the highest grades represent the top 10% of the class.
In Example 2.4, we demonstrate how to use Analytic Solver and
R to create bins with equal counts, equal intervals, and user-defined
intervals.
FILE
Customers
EXAMPLE 2.4
In order to better understand her customers, Catherine Hill
would like to perform the RFM analysis, a popular marketing
technique used to identify high-value customers. RFM stands
for recency, frequency, and monetary. The RFM ratings can be
created from the DaysSinceLast (recency), NumOfOrders
(frequency), and Spending2018 (monetary) variables.
Following the 80/20 business rule (i.e., 80% of your
business comes from 20% of your best customers), for each of
the three RFM variables, Catherine would like to bin customers
into five equal-size groups, with 20% of the customers included
in each group. Each group is also assigned a score from 1 to 5,
with 5 being the highest. Customers with the RFM rating of 555
are considered the most valuable customers to the company.
In addition to the RFM binning, Catherine would like to bin
the Income variable into five equal intervals. Finally, she would
like to start a tiered membership status where different services
and rewards are offered to customers depending on how much
they spent in 2018. She would like to assign the bronze
membership status to customers who spent less than $250,
silver membership status to those who spent $250 or more but
less than $1,000, and the gold membership status to those who
spent $1,000 or more.
page 59
Column S Column T
Spending Membership
0 Bronze
250 Silver
1000 Gold
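A minimal sketch of the three binning tasks in R, assuming the Customers data have been imported into a data frame labeled myData. The cut and quantile functions used here are one of several possible ways to form the bins; ties in the data may require adjusting the quantile breaks. For recency, fewer days since the last order is treated as better, so its labels are reversed. Enter:
> # Equal-count (quintile) bins for frequency and monetary; scores 1 to 5
> myData$F <- cut(myData$NumOfOrders,
   breaks = quantile(myData$NumOfOrders, probs = seq(0, 1, 0.2)),
   labels = 1:5, include.lowest = TRUE)
> myData$M <- cut(myData$Spending2018,
   breaks = quantile(myData$Spending2018, probs = seq(0, 1, 0.2)),
   labels = 1:5, include.lowest = TRUE)
> # Recency: reversed labels so that the most recent customers score 5
> myData$R <- cut(myData$DaysSinceLast,
   breaks = quantile(myData$DaysSinceLast, probs = seq(0, 1, 0.2)),
   labels = 5:1, include.lowest = TRUE)
> myData$RFM <- paste0(myData$R, myData$F, myData$M)
> # Five equal-interval bins for Income
> myData$IncomeGroup <- cut(myData$Income, breaks = 5, labels = 1:5)
> # User-defined bins for the membership tiers
> myData$Membership <- cut(myData$Spending2018,
   breaks = c(0, 250, 1000, Inf), labels = c('Bronze', 'Silver', 'Gold'),
   right = FALSE)
Customers whose recency, frequency, and monetary scores are all 5 will have the RFM value 555.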
Mathematical Transformations
As discussed earlier, data transformation is an important step in
bringing out the information in the data set, which can then be used
for further data analysis. In addition to binning, another common
approach is to create new variables through mathematical
transformations of existing variables. For example, to
analyze diabetes risk, doctors and dieticians often focus
on body mass index (BMI), which is calculated as weight in
kilograms divided by squared height in meters, rather than focusing
on either weight or height alone. Similarly, in order to analyze trend,
we often transform raw data values into percentages.
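For instance, assuming a data frame labeled patientData with hypothetical Weight (in kilograms) and Height (in meters) variables, this BMI transformation takes a single statement in R. Enter:
> patientData$BMI <- patientData$Weight / patientData$Height^2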
Sometimes data on variables such as income, firm size, and
house prices are highly skewed; skewness is discussed in Section
3.4 of Chapter 3. For example, according to a Federal Reserve
report, the richest 1% of families in the U.S. controlled 38.6% of the
country’s wealth in 2016 (CNN, September 27, 2017). The extremely
high (or low) values of skewed variables significantly inflate (or
deflate) the average for the entire data set, making it difficult to
detect meaningful relationships with skewed variables. A popular
mathematical transformation that reduces skewness in data is the
natural logarithm transformation. Another transformation to reduce
data skewness is the square root transformation.
Another common data transformation involves calendar dates.
Statistical software usually stores date values as numbers. For
example, in R, date objects are stored as the number of days since
January 1, 1970, using negative numbers for earlier dates. For
example, January 31, 1970, has a value of 30, and December 15,
1969, has a value of −17. Excel implements a similar approach to
store date values but uses a reference value of 1 for January 1,
1900. Transformation of date values is often performed to help bring
useful information out of the data. A retail company might convert
customers’ birthdates into ages in order to examine the differences
in purchase behaviors across age groups. Similarly, by subtracting
the airplane ticket booking date from the actual travel date, an airline
carrier can identify last-minute travelers, who may behave very
differently from early planners.
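As a quick illustration of this internal representation, converting a date into a number in R returns its day count relative to January 1, 1970; the two commands below return 30 and −17, respectively. Enter:
> as.numeric(as.Date('1970-01-31'))
> as.numeric(as.Date('1969-12-15'))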
Sometimes transforming date values into seasons helps enrich
the data set by creating relevant variables to support subsequent
analyses. For example, by extracting and focusing on the months in
which gym members first joined the health club, we may find that
members who joined during the summer months are more interested
in the aquatic exercise programs, whereas those who joined during
the winter months are more interested in the strength-training
programs. This insight can help fitness clubs adjust their marketing
strategies based on seasonality.
Example 2.5 demonstrates how to use Excel and R to perform
the following mathematical transformations: (1) compute the
percentage difference between two values, (2) perform a logarithm
transformation, and (3) extract information from date values.
FILE
Customers
EXAMPLE 2.5
After a closer review of her customers, Catherine Hill feels that
the difference and the percentage difference between a
customer’s 2017 and 2018 spending may be more useful to
understanding the customer’s spending patterns than the
yearly spending values. Therefore, Catherine wants to
generate two new variables that capture the year-to-year
difference and the percentage difference in spending. She also
notices that the income variable is highly skewed, with most
customers’ incomes falling between $40,000 and $100,000,
with only a few very-high-income earners. She has been
advised to transform the income variable into natural
logarithms, which will reduce the skewness of the data.
Catherine would also like to convert customer birthdates
into ages as of January 1, 2019, for exploring differences in
purchase behaviors of customers across age groups. Finally,
she would like to create a new variable that captures the birth
month of the customers so that seasonal products can be
marketed to these customers during their birth month.
Use Excel and R to transform variables according to
Catherine’s specifications.
page 64
SOLUTION:
Using Excel
Using R
a. Import the Customers data into a data frame (table) and label it myData.
b. We find the spending difference, and then use the head function to view the first
few observations. Enter:
> myData$SpendingDiff <- myData$Spending2018 - myData$Spending2017
> head(myData$SpendingDiff)
Verify that the first observation of the SpendingDiff variable is −46.
c. We create the percentage spending difference and round it to two
decimal places using the round function. We then place the “%”
sign using the paste function and use the head function to view the first few
observations. Enter:
> myData$PctSpendingDiff <- round((myData$SpendingDiff /
myData$Spending2017)*100, digits = 2)
> myData$PctSpendingDiff <- paste(myData$PctSpendingDiff, "%", sep = "")
> head(myData$PctSpendingDiff)
Verify that the first observation of the PctSpendingDiff variable is −16.03%.
d. We use the log function for the natural logarithm transformation, and then use the
head function to view the first few observations. Enter:
> myData$IncomeLn <- log(myData$Income)
> head(myData$IncomeLn)
Verify that the first observation of the IncomeLn variable is 10.87805. The
IncomeLn values are slightly different from those in Table 2.9 because Table 2.9
is formatted to show only four decimal places. For the base 10 logarithm
transformation, use the log10 function in place of the log function.
e. To calculate a customer’s age as of January 1, 2019, we first need to convert the
BirthDate variable into date values and create a new variable for the January
1, 2019, date. Enter:
> myData$BirthDate <- as.Date(myData$BirthDate, format = "%m/%d/%Y")
> endDate <- as.Date("01/01/2019", format = "%m/%d/%Y")
f. We use the difftime function to find out the number of days between the
customer’s birthdate and January 1, 2019. By dividing the difference in days by
365.25, we account for the leap years (by using 365.25 instead of 365) and obtain
the difference in years. We use the as.numeric function to ensure that the Age
variable has a numerical type. Finally, we use the floor function to remove the
decimal places so that the age of a customer is an integer and the head function
to view the first few observations. Enter:
> myData$Age <- difftime(endDate, myData$BirthDate)/365.25
> myData$Age <- as.numeric(myData$Age)
> myData$Age <- floor(myData$Age)
> head(myData$Age)
Verify that the first customer’s age as of January 1, 2019, is 32 years.
g. We use the months function to extract the month name from the BirthDate
variable, the match function to convert month names (January to December) to
numbers (1 to 12), and the head function to view the first few observations. Enter:
> myData$BirthMonth <- months(myData$BirthDate)
> myData$BirthMonth <- match(myData$BirthMonth, month.name)
> head(myData$BirthMonth)
Verify that the first customer’s birthday is in month 12 (December).
h. We use the View function to display a spreadsheet-style data. The output should
be consistent with Table 2.9. Enter:
> View(myData)
page 66
EXERCISES 2.4
Mechanics
35. FILE Exercise_2.35. The accompanying data contains three variables and six
observations.
x1 x2 x3
248 3.5 78
124 3.8 55
210 1.6 66
150 4.8 74
196 4.5 32
234 6.2 63
a. Bin the values of x1 into two equal-size groups. Label the groups with numbers
1 (lower values) and 2 (higher values). What is the average value of x1 for
group 1? (Hint: Sort the data by group number before calculating the average.)
b. Bin the values of x2 into three equal interval groups. Label the groups with
numbers 1 (lowest values) to 3 (highest values). How many observations are
assigned to group 1?
c. Bin the values of x3 into the following two groups: ≤ 50 and > 50. Label the
groups with numbers 1 (lower values) and 2 (higher values). How many
observations are assigned to group 2?
36. FILE Exercise_2.36. The accompanying data set contains three variables, x1,
x2, and x3.
a. Bin the values of x1 into three equal-size groups. Label the groups with
numbers 1 (lowest values) to 3 (highest values). How many observations are
assigned to group 1?
b. Bin the values of x2 into three equal-interval groups. Label the groups with
numbers 1 (lowest values) to 3 (highest values). How many observations are
assigned to group 2?
c. Bin the values of x3 into the following three groups: < 50,000, between 50,000
and 100,000, and > 100,000. Label the groups with numbers 1 (lowest values)
to 3 (highest values). How many observations are assigned to group 1?
37. FILE Exercise_2.37. The accompanying data set contains three variables, x1,
x2, and x3.
a. Bin the values of x1 into three equal-size groups. Label the groups with “low”
(lowest values), “medium,” and “high” (highest values). How many observations
are assigned to group medium?
b. Bin the values of x2 into three equal-interval groups. Label the groups with “low”
(lowest values), “medium,” and “high” (highest values). How many observations
are assigned to group high?
c. Bin the values of x3 into the following three groups: < 20, between 20 and 30,
and > 30. Label the groups with “low” (lowest values), “medium,” and “high”
(highest values). How many observations are assigned to group low?
38. FILE Exercise_2.38. The accompanying data set contains three variables, x1,
x2, and x3.
a. Bin the values of x1, x2, and x3 into five equal-size groups. Label the groups
with numbers 1 (lowest) to 5 (highest).
b. Combine the group labels of x1, x2, and x3 to create a score like the RFM score
described in Example 2.4. How many observations have the score “431”? How
many observations have the score “222”?
39. The following table contains two variables and five observations.
x1 x2
248 350
124 148
150 130
196 145
240 180
a. Create a new variable called “Sum” that contains the sum of the values of x1
and x2 for each observation. What is the average value of Sum?
b. Create a new variable called “Difference” that contains the absolute difference
between the values of x1 and x2 for each observation. What is the average
value of Difference?
page 67
40. FILE Exercise_2.40. The accompanying data set contains three
variables, x1, x2, and x3.
a. Create a new variable called “Difference” that contains the difference between
the values of x1 and x2 for each observation (i.e., x2 − x1). What is the average
difference?
b. Create a new variable called “PercentDifference” that contains the percent
difference between the values of x1 and x2 for each observation (i.e., (x2 −
x1)/x1). What is the average percent difference?
c. Create a new variable called “Log” that contains the natural logarithms for x3.
What is the average logarithm value?
41. FILE Exercise_2.41. The accompanying data set contains three variables, x1,
x2, and x3.
a. Create a new variable called “Difference” that contains the difference between
the values of x1 and x2 for each observation (i.e., x2 − x1). What is the average
difference?
b. Create a new variable called “PercentDifference” that contains the percent
difference between the values of x1 and x2 for each observation (i.e., (x2 −
x1)/x1). What is the average percent difference?
c. Create a new variable called “Log” that contains the natural logarithms for x3.
Bin the logarithm values into five equal-interval groups. Label the groups using
numbers 1 (lowest values) to 5 (highest values). How many observations are in
group 2?
42. FILE Exercise_2.42. The accompanying data set contains two variables, Date1
and Date2.
a. Create a new variable called “DifferenceInYear” that contains the difference
between Date1 and Date2 in year for each observation. What is the average
difference in year? (Hint: Use the YEARFRAC function if you are using Excel to
complete this problem.)
b. Create a new variable called “Month” that contains the month values extracted
from Date1. What is the average month value?
c. Bin the month values into four equal-interval groups. Label the groups using
numbers 1 (lowest values) to 4 (highest values). Which group has the highest
number of observations?
Applications
43. FILE Population. The U.S. Census Bureau records the population for the 50
states each year. The accompanying table shows a portion of these data for the
years 2010 to 2018.
a. Bin the 2017 population values into four equal-size groups. Label the groups
using numbers 1 (lowest values) to 4 (highest values). How many states are
assigned to group 4?
b. Bin the 2018 population values into four equal-interval groups. Label the groups
using numbers 1 (lowest values) to 4 (highest values). How many states are
assigned to group 2? Compare the groups in parts a and b. Which states are in
higher groups for the 2018 population than for the 2017 population?
c. Bin the 2018 population values into the following three groups: < 1,000,000,
between 1,000,000 and 5,000,000, and > 5,000,000. Label the groups using
numbers 1 (lowest values) to 3 (highest values). How many observations are
assigned to group 2?
44. FILE Population. Refer to the previous exercise for a description of the data set.
a. Create a new variable called “Difference” that contains the difference between
the 2018 population and the 2017 population for each state (i.e., 2018
population - 2017 population). What is the average difference?
b. Create a new variable called “PercentDifference” that contains the percent
difference between the 2017 and 2018 population values for the states (i.e.,
(2018 population - 2017 population)/2017 population). What is the average
percent difference?
c. Create a new variable called “Log” that contains the natural logarithms for the
2018 population values for the states. Bin the logarithm values into five equal-
interval groups. Label the groups using numbers 1 (lowest values) to 5 (highest
values). How many observations are in group 2?
d. Create a new variable called “SquareRoot” that contains the square root of the
2018 population values for the states. Bin the square root values into five equal-
interval groups. Label the groups using numbers 1 (lowest values) to 5 (highest
values). How many observations are in group 2?
e. Compare the groups in parts c and d. Are the groupings the same or different?
45. FILE Credit_Cards. Greg Metcalf works for a national credit card company, and
he is performing a customer value analysis on a subset of credit card customers.
In order to perform the RFM analysis on the customers, Greg has compiled a data
set that contains the dates of the last transaction (LastTransactionDate), total
number of transactions in the past two years (Frequency), and total spending
during the past two years (Spending). A portion of the data set is shown in the
accompanying table.
LastTransactionDate Frequency Spending
⋮ ⋮ ⋮
8/14/2017 49 27918
page 68
a. Greg wants to calculate the number of days between January 1,
2019, and the last transaction date. Create a new variable “DaysSinceLast” that
contains the number of days since the last transaction. (Hint: Use the DATEDIF
function if you are using Excel to complete this problem.) What is the average
number of days since the last purchase for all the customers?
b. Create the RFM scores for each customer. How many customers have an RFM
score of 555? What is their average spending?
c. Create a new variable called “LogSpending” that contains the natural logarithms
for the total spending during the past two years. Bin the logarithm values into
five equal-interval groups. Label the groups using numbers 1 (lowest values) to
5 (highest values). How many observations are in group 2?
d. Create a new variable called “AverageOrderSize” that contains the average
spending per order. This is calculated by dividing total spending (Spending) by
total number of transactions (Frequency) in the past two years. Bin the values
of AverageOrderSize into five equal-interval groups. Label the groups using
numbers 1 (lowest values) to 5 (highest values). How many observations are in
group 2?
e. Compare the groups in parts c and d. Are the groupings the same or different?
46. FILE Game_Players. TurboX is an online video game company that makes three
types of video games: action, role play, and sports. It is interested in
understanding its millennial customers. By combining the data from its customer
database and a customer survey, TurboX compiled a data set that has the
following variables: the player’s satisfaction with the online game purchase
experience (Satisfaction), the enjoyment level of the game played (Enjoyment),
whether the player will recommend the game to others (Recommend), which type
of game the player played (Type), total spending on games last year
(SpendingLastYear), total spending on games this year (SpendingThisYear), and
the date of birth of the player (BirthDate).
a. Bin the total spending on games last year into four equal-size groups. Label the
groups using numbers 1 (lowest values) to 4 (highest values). How many
customers are assigned to group 4?
b. Bin the total spending on games this year into four equal-interval groups. Label
the groups using numbers 1 (lowest values) to 4 (highest values). How many
customers are assigned to group 3?
c. Bin the total spending on games this year into the following three groups: < 250,
between 250 and 500, and > 500. Label the groups using numbers 1 (lowest
values) to 3 (highest values). How many observations are assigned to group 2?
d. Create a new variable called “Difference” that contains the difference between
this year’s and last year’s spending on games for the players (i.e.,
SpendingThisYear - SpendingLastYear). What is the average difference?
e. Create a new variable called “PercentDifference” that contains the percent
difference between this year’s and last year’s spending on games for the
players (i.e., (SpendingThisYear - SpendingLastYear)/SpendingLastYear). What
is the average percent difference?
f. Create a new variable “Age” that contains the players’ ages as of January 1,
2019. What is the average age of the players?
g. Create a new variable “BirthMonth” that contains the players’ birth month
extracted from their dates of birth. Which month is the most frequent birth
month?
47. FILE Engineers. Erin Thomas, an HR manager of an engineering firm, wants to
perform an analysis on the data about the company’s engineers. The variables
included in the data are date of birth (BirthDate), personality type according to the
Myers-Briggs Personality assessment (Personality), annual salary (Salary), level
of the position (Level), and number of professional certificates achieved
(Certificates). The accompanying table shows a portion of the data set.
a. Create a new variable “Age” that contains the engineers’ ages as of January 1,
2019. What is the average age of the engineers?
b. Bin the age values into three equal-size groups. Label the groups using
numbers 1 (lowest age values) to 3 (highest age values). How many
observations are in group 3?
c. Bin the annual salary values into four equal interval groups. Label the groups
using numbers 1 (lowest salary values) to 4 (highest salary values). How many
engineers are assigned to group 4?
d. Bin the number of professional certificates achieved into the following three
groups: < 2, between 2 and 4, and over 4. Label the groups “Low,” “Medium,”
and “High.” How many engineers are in the “High” group?
48. FILE Patients. Jerry Stevenson is the manager of a medical
clinic in Scottsdale, AZ. He wants to analyze patient data to identify high-risk
patients for cardiovascular diseases. From medical literature, he learned that the
risk of cardiovascular diseases is influenced by a patient’s age, body mass index
(BMI), amount of exercise, race, and education level. Jerry has compiled a data
set with the following variables for his clinic’s patients: race (Race), education
level (Education), body weight in kilograms (Weight), height in meters (Height),
date of birth (BirthDate), and number of minutes of exercise per week (Exercise).
The accompanying table shows a portion of the data set.
a. Create a new variable called “BMI” that contains the body mass index of the
patients. BMI is calculated as weight in kilograms/(height in meters)². What is
the average BMI of the patients?
b. Create a new variable “Age” that contains the patients’ ages as of January 1,
2019. What is the average age of the patients?
c. Bin the patients’ ages into five equal-size groups. Label the groups using
numbers 1 (youngest) to 5 (oldest). How many patients are in group 4?
d. Bin the patients’ total minutes of exercise per week into five equal-size groups.
Label the groups using numbers 1 (highest values) to 5 (lowest values). How
many patients are in group 5?
e. Bin the patients’ BMI into five equal-size groups. Label the groups using
numbers 1 (lowest values) to 5 (highest values). How many patients are in
group 1?
f. Create a risk score for each patient by concatenating the group numbers
obtained in parts c, d, and e. How many patients are in the risk group of 555?
Category Reduction
Sometimes nominal or ordinal variables come with too many
categories. This presents a number of potential problems. First,
variables with too many categories pull down model performance
because, unlike a single parameter of a numerical variable, several
parameters associated with the categories of a categorical variable
must be analyzed. Second, if a variable has some categories that
rarely occur, it is difficult to capture the impact of these categories
accurately. In addition, a relatively small sample may not contain any
observations in certain categories, creating errors when the
analytical model is later applied to a larger data set with observations
in all categories. Third, if one category clearly dominates
in terms of occurrence, the categorical variable will fail to
make a positive impact because modeling success is dependent on
being able to differentiate among the observations.
An effective strategy for dealing with these issues is category
reduction, where we collapse some of the categories to create fewer
nonoverlapping categories. Determining the appropriate number of
categories often depends on the data, context, and disciplinary
norms, but there are a few general guidelines.
The first guideline states that categories with very few
observations may be combined to create the “Other” category. For
example, in a data set that contains the demographic data about
potential customers, if many zip code categories only have a few
observations, it is recommended that an “Other” category be created
for these observations. The rationale behind this approach is that a
critical mass can be created for this “Other” category to help reveal
patterns and relationships in data.
Another guideline states that categories with a similar impact may
be combined. For example, when studying public transportation
ridership patterns, one tends to find that the ridership levels remain
relatively stable during the weekdays and then change drastically for
the weekends. Therefore, we may combine data from Monday
through Friday into the “Weekdays” category and Saturday and
Sunday into the “Weekends” category to simplify data from seven to
only two categories.
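As a minimal R sketch of this weekday/weekend collapse (the day values below are hypothetical and not from any data file), the ifelse function together with the %in% operator can be used:
> day <- c("Mon", "Tue", "Wed", "Sat", "Sun")   # hypothetical day values
> dayType <- ifelse(day %in% c("Sat", "Sun"), "Weekends", "Weekdays")
> table(dayType)   # two categories instead of seven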
Example 2.6 demonstrates how to use Excel with Analytic Solver
and R for category reduction.
FILE
Customers
EXAMPLE 2.6
After gaining some insights from the Customers data set,
Catherine would like to analyze race. However, in its current
form, the data set would limit her ability to do a meaningful
analysis given the large number of categories of the race
variable; plus some categories have very few observations. As
a result, she needs to perform a series of data transformations
to prepare the data for subsequent analysis. Use Excel with
Analytic Solver and R to create a new category called Other
that represents the two least-frequent categories.
SOLUTION:
Using Excel with Analytic Solver
Using R
a. Import the Customers data into a data frame (table) and label
it myData.
b. First, we inspect the frequency of each Race category to
identify the two least-frequent categories. Enter:
> table(myData$Race)
The table shows that American Indians and Pacific Islanders
are the two least-frequent categories with only five and three
observations, respectively.
c. We use the ifelse function to create a new variable
called NewRace that uses the Other category to
represent American Indians and Pacific Islanders. Enter:
> myData$NewRace <- ifelse(myData$Race %in%
c("American Indian", "Pacific Islander"), "Other",
myData$Race)
Note that the ifelse function evaluates the values in the Race
variable, and if the value is either American Indian or Pacific
Islander, it replaces it with Other; the original race value is
retained otherwise.
d. We use the table function again to verify that the Other
category has eight observations. Enter:
> table(myData$NewRace)
e. We use the View function to display spreadsheet-style data.
Enter:
> View(myData)
Verify that the 19th customer is the first in the Other category.
Dummy Variables
In many analytical models, such as regression models discussed in
later chapters, categorical variables must first be converted into
numerical variables. For other models, dealing with numerical data is
often easier than categorical data because it avoids the complexities
of the semantics pertaining to each category of the variable. A
dummy variable, also referred to as an indicator or a binary
variable, is commonly used to describe two categories of a variable.
It assumes a value of 1 for one of the categories and 0 for the other
category, referred to as the reference or the benchmark category.
For example, we can define a dummy variable to categorize a
person’s sex using 1 for male and 0 for female, using females as the
reference category. Dummy variables do not suggest any ranking of
the categories and, therefore, without any loss of generality, we can
define 1 for female and 0 for male, using males as the reference
category. All interpretation of the results is made in relation to the
reference category.
A DUMMY VARIABLE
A dummy variable, also referred to as an indicator or a binary
variable, takes on values of 1 or 0 to describe two categories of
a categorical variable.
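As a minimal R sketch (the sex values below are hypothetical), a dummy variable can be created with the ifelse function:
> sex <- c("Male", "Female", "Female", "Male")   # hypothetical values
> sexDummy <- ifelse(sex == "Male", 1, 0)   # 1 = male; 0 = female, the reference category
> sexDummy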
EXAMPLE 2.7
For the new Asian-inspired meal kits, Catherine feels that
understanding the channels through which customers were
acquired is important to predict customers’ future behaviors. In
order to include the Channel variable in her predictive model,
Catherine needs to convert the Channel categories into dummy
variables. Because web banner ads are probably the most
common marketing tools used by Organic Food Superstore,
she plans to use the Web channel as the reference category
and assess the effects of other channels in relation to the Web
channel. Use Excel with Analytic Solver and R to create the
relevant dummy variables for the Channel variable.
SOLUTION:
Using Excel with Analytic Solver
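As a rough, hedged R sketch (the Channel category names Referral and Social Media below are assumptions and may not match the actual data), dummy variables with Web as the reference category could be created as follows:
> channel <- c("Web", "Referral", "Social Media", "Web")   # hypothetical Channel values
> dummyReferral <- ifelse(channel == "Referral", 1, 0)
> dummySocial <- ifelse(channel == "Social Media", 1, 0)
> # No dummy is created for Web; it is the reference category (both dummies equal 0)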
Category Scores
Finally, another common transformation of categorical variables is to
create category scores. This approach is most appropriate if the data
are ordinal and have natural, ordered categories. For example, in
customer satisfaction surveys, we often use ordinal scales, such as
very dissatisfied, somewhat dissatisfied, neutral, somewhat satisfied,
and very satisfied, to indicate the level of satisfaction. While the
satisfaction variable is categorical, the categories are ordered. In
such cases, we can recode the categories numerically using
numbers 1 through 5, with 1 being very dissatisfied and 5 being very
satisfied. This transformation allows the categorical variable to be
treated as a numerical variable in certain analytical models. With this
transformation, we need not convert a categorical variable into
several dummy variables or reduce its categories. For an effective
transformation, however, we assume equal increments between the
category scores, which may not be appropriate in certain situations.
Example 2.8 shows how to convert a categorical variable into
category scores using Excel and R.
FILE
Customers
EXAMPLE 2.8
For the new Asian-inspired meal kits, Catherine wants to pay
attention to customer satisfaction. As the customer satisfaction
ratings represent ordinal data, she wants to convert them to
category scores ranging from 1 (Very Dissatisfied) to 5 (Very
Satisfied) to make the variable more readily usable in predictive
models. Use Excel and R to create category scores for the
Satisfaction variable.
SOLUTION:
Using Excel
Using R
a. Import the Customers data into a data frame (table) and label
it myData.
b. We use the ifelse function, in a nested format, to create
category scores for the Satisfaction variable. Enter:
> myData$Satisfaction_Score <- ifelse(myData$Satisfaction
== "Very Dissatisfied", 1, ifelse(myData$Satisfaction ==
"Somewhat Dissatisfied", 2, ifelse(myData$Satisfaction ==
"Neutral", 3, ifelse(myData$Satisfaction == "Somewhat
Satisfied", 4, 5))))
Note that the ifelse function evaluates the values in the
Satisfaction variable, and if the value is Very Dissatisfied, the
function assigns a 1 to the new Satisfaction_Score
variable. Because it is a nested format, if the value
is not Very Dissatisfied but is Somewhat Dissatisfied, the
function assigns a 2, and so on. If the values in the
Satisfaction variable are none of the first four scores, the
function assigns 5 to the Satisfaction_Score variable.
c. We use the View function to display spreadsheet-style data.
Enter:
> View(myData)
Verify that the first four satisfaction scores are 1, 3, 5, and 1,
respectively.
EXERCISES 2.5
Mechanics
49. The following table has three variables and six observations.
a. Convert Sex into dummy variables. Use the most frequent category as the
reference category. Which category is the reference category?
b. Transform Decision into dummy variables. Use the most frequent category as
the reference category. Which category is the reference category?
c. Transform the Decision values into category scores where Approve = 1, Reject
= 2, and Need More Information = 3. How many observations have a category
score of 2?
50. FILE Exercise_2.50. The accompanying data set contains three variables, x1,
x2, and x3.
a. The variable x1 contains six categories ranging from “A” to “F.” Reduce the
number of categories to five by combining the two least-frequent categories.
Name the new category “Other.” How many observations are in the “Other”
category?
b. The variable x2 contains six categories ranging from “A” to “F.” This variable is
ordinal, meaning that the categories are ordered. “A” represents the lowest
level, whereas “F” represents the highest level. Replace the category names
with category scores ranging from 1 (lowest) to 6 (highest). What is the average
category score for x2?
c. The variable x3 contains four unordered categories. To facilitate subsequent
analyses, we need to convert x3 into dummy variables. How many dummy
variables should be created? Create the dummy variables using Category1 as
the reference category.
51. FILE Exercise_2.51. The accompanying data set contains two variables,
Birthdate and LoanDecision.
a. LoanDecision contains three unordered categories. To facilitate subsequent
analyses, we need to convert LoanDecision into dummy variables. How many
dummy variables should be created? Create the dummy variables using “Need
more information” as the reference category.
b. Create a new variable based on LoanDecision. The new variable should have
only two categories: “Approve” and “Not approve.” The “Not approve” category
combines the “Reject” and “Need more information” categories. How many
observations are in the “Not approve” category?
52. FILE Exercise_2.52. The accompanying data set contains two variables, x1 and
x2.
a. The variable x1 contains three categories ranging from “Low” to “High.” Convert
the category names into category scores (i.e., 1 = “Low”, 2 = “Medium”, and 3 =
“High”). How many observations have a category score of 3?
b. Reduce the number of categories in x2 by combining the three least-frequent
categories. Name the new category “Other”. How many observations are in the
“Other” category?
c. Convert the new x2 into dummy variables. How many dummy variables should
be created? Create the dummy variables using the “Other” category as the
reference category.
53. FILE Exercise_2.53. The accompanying data set contains three variables, x1,
x2, and x3.
a. The variable x1 contains three categories: S, M, and L. Convert the category names
into category scores (i.e., S = 1, M = 2, and L = 3). How many observations
have a category score of 3?
b. Transform x2 into an appropriate number of dummy variables. How many
dummy variables should be created?
c. The variable x3 contains four categories: A, B, C, and D. Reduce the number of
categories by combining the two least-frequent categories into a new category
E. How many observations are in the E category?
Applications
54. FILE Home_Loan. Consider the following portion of data that includes
information about home loan applications. Variables on each application include
the application number (Application), whether the application is conventional or
subsidized by the federal housing administration (LoanType), whether the
property is a single-family or multifamily home (PropertyType), and whether the
application is for a first-time purchase or refinancing (Purpose).
Package Status Delivery Size
1 Delivered Standard XL
2 Damaged Express S
⋮ ⋮ ⋮ ⋮
a. Transform the Delivery variable into dummy variables. Use the most frequent
category as the reference category. How many dummy variables should be
created? Which category of Delivery is the reference category?
b. Combine the two least frequent categories in the Size variable into a new
category called Other. How many observations are there in the new category?
c. Replace the category names in the Status variable with scores 1 (Lost), 2
(Damaged), or 3 (Delivered). What is the average status score of the 75
packages?
56. FILE Technician. After each thunderstorm, a technician is assigned to do a
check on cellular towers in his or her service area. For each visit, the technician
records in a database the tower number (Tower), the unit model (Model: A or B),
and the extent of the damage to the unit (Damage: None, Minor, Partial, Severe,
and Total for a total loss). A portion of the data is shown in the accompanying
table.
Tower Model Damage
1 A Minor
2 A None
⋮ ⋮ ⋮
98 B Partial
a. Transform the Model variable into dummy variables. How many dummy
variables should be created?
b. Transform the Damage variable into category scores: 0 (None), 1 (Minor), 2
(Partial), 3 (Severe), and 4 (Total). What is the average damage
score of the cell towers?
57. FILE Game_Players. Refer to Exercise 2.46 for a description of the problem and
data set.
a. The variable Satisfaction contains five ordered categories: Very Dissatisfied,
Dissatisfied, Neutral, Satisfied, and Very Satisfied. Replace the category names
with scores ranging from 1 (Very Dissatisfied) to 5 (Very Satisfied). What is the
average satisfaction score of the players?
b. The variable Enjoyment contains five ordered categories: Very Low, Low,
Neutral, High, and Very High. Replace the category names with scores ranging
from 1 (Very Low) to 5 (Very High). What is the average enjoyment score of the
players?
c. The variable Recommend contains five ordered categories: Definitely Will Not,
Will Not, Neutral, Will, and Definitely Will. Replace the category names with
scores ranging from 1 (Definitely Will Not) to 5 (Definitely Will). What is the
average recommendation score of the players?
d. The variable Type contains three unordered game categories: Action, Role
Play, and Sports. To facilitate subsequent analyses, transform Type into dummy
variables. Use the least frequent category as the reference category. Which
category is the reference category? How many dummy variables should be
created?
58. FILE Engineers. Refer to Exercise 2.47 for a description of the problem and
data set.
a. The variable Personality contains four unordered personality types: Analyst,
Diplomat, Explorer, and Sentinel. To facilitate subsequent analyses, Erin needs
to convert this variable into dummy variables. How many dummy variables
should be created? Create the dummy variables using Analyst
type as the reference category. How many observations are in
the reference category?
b. The variable Level contains three ordered position levels: Engineer I (lowest),
Engineer II, and Senior Engineer (highest). Replace the level names with
scores ranging from 1 (lowest) to 3 (highest). What is the average level score of
the engineers?
59. FILE Patients. Refer to Exercise 2.48 for a description of the problem and data
set.
a. The variable Race contains five unordered categories: American Indian,
Asian/Pacific Islander, Hispanic, Non-Hispanic Black, and Non-Hispanic White.
Reduce the number of categories to four by combining the two least frequent
categories. Name the new category “Other.” How many observations are in the
“Other” category?
b. Transform the Race variable with the new categories into dummy variables. Use
the most frequent race category in the data as the reference category. Which
category is the reference category? How many dummy variables should be
created?
c. The variable Education contains four ordered categories: HS (lowest
educational attainment level), Some College, College, and Graduate (highest
educational attainment level). Replace the category names with category
scores ranging from 1 (lowest) to 4 (highest). What is the average category
score for Education?
2.6 WRITING WITH BIG DATA
FILE
TechSales_Reps
Case Study
Cassius Weatherby is a human resources manager at a major technology firm that
produces software and hardware products. He would like to analyze the net promoter
score (NPS) of sales professionals at the company. The NPS measures customer
satisfaction and loyalty by asking customers how likely they are to recommend the
company to others on a scale of 0 (unlikely) to 10 (very likely). This measure is an
especially important indicator for the company’s software business as a large
percentage of the sales leads come from customer referrals. Cassius wants to
identify relevant factors that are linked with the NPS that a sales professional
receives. These insights can help the company make better hiring decisions and
develop a more effective training program.
With the help of the company’s IT group, a data set with over 20,000 records of
sales professionals is extracted from the enterprise data warehouse. The relevant
variables include the product line to which the sales professional is assigned, age,
sex, the number of years with the company, whether the sales professional has a
college degree, personality type based on the Myers-Briggs Personality assessment,
the number of professional certificates acquired, the average score from the 360-
degree annual evaluation, base salary, and the average NPS received. Cassius is
tasked with inspecting and reviewing the data and preparing a report for the
company’s top management team.
From a total of about 20,000 records of sales professionals, we select only
the sales professionals in the software product group and divide them into
two categories: those with an average NPS below nine and those with an
average NPS of nine or ten. When a customer gives a sales professional an
NPS of nine or ten, the customer is considered “enthusiastically loyal,”
meaning that they are very likely to continue purchasing from us and refer
their colleagues to our company. We then further divide the sales
professionals into two groups: those with zero to three professional
certificates and those with four or more professional certificates.
Table 2.10 shows the results. Of the 12,130 sales professionals in the
software product group, we find that 65.57% have earned fewer than four
professional certificates, whereas 34.43% have earned four or more.
However, there appears to be a link between the number of professional
certificates and NPS values. For those who received an NPS of nine or ten,
we find that 62.60% have earned at least four professional certificates.
Similarly, for those who received an NPS below nine, we find that 73.00%
earned fewer than four professional certificates.
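As a hedged R sketch of how these percentages could be computed (the TechSales_Reps column names Business, NPS, and Certificates are assumptions), after importing the data into a data frame labeled myData:
> software <- myData[myData$Business == "Software", ]   # software product group
> npsGroup <- ifelse(software$NPS >= 9, "NPS 9 or 10", "NPS below 9")
> certGroup <- ifelse(software$Certificates >= 4, "Four or more", "Zero to three")
> prop.table(table(certGroup, npsGroup), margin = 2)   # column percentages by NPS group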
Report 2.1 FILE CA_Crash. Subset the data set based on the location, day of the
week, type of collision, and lighting condition. Compare these subsets of data to find
interesting patterns. Can you identify any links between crash fatality and the
aforementioned variables? Are there any missing values? Which strategy should you
use to handle the missing values? Because many of the variables are categorical,
you should consider transforming them into dummy variables prior to the analysis.
Report 2.2 FILE House_Price. Subset the data based on variables, such as
number of bedrooms, number of bathrooms, home square footage, lot square
footage, and age of the house. Which variables can be removed when predicting
house prices? Are there any variables that display a skewed distribution? If there are,
perform logarithm transformations for these variables. Would it make sense to
transform some of the numeric values into categorical values using the binning
strategy? Is equal size or equal interval binning strategy more appropriate in these
situations?
Report 2.3 FILE Longitudinal_Survey. Subset the data based on age, sex, or race.
Are there any missing values in the data? Which strategy should you use to handle
the missing values? Consider if any new variables can be created using the existing
variables. Explore the opportunities of transforming numeric variables through
binning and transforming categorical variables by creating dummy variables.
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 3.1 Visualize categorical and numerical
variables.
LO 3.2 Construct and interpret a contingency table
and a stacked bar chart.
LO 3.3 Construct and interpret a scatterplot.
LO 3.4 Construct and interpret a scatterplot with a
categorical variable, a bubble plot, a line
chart, and a heat map.
LO 3.5 Calculate and interpret summary measures.
LO 3.6 Use boxplots and z-scores to identify
outliers.
INTRODUCTORY CASE
Investment Decision
Dorothy Brennan works as a financial advisor at a large
investment firm. She meets with an inexperienced investor who
has some questions regarding two approaches to mutual fund
investing: growth investing versus value investing. The investor
has heard that growth funds invest in companies whose stock
prices are expected to grow at a faster rate, relative to the
overall stock market. Value funds, on the other hand, invest in
companies whose stock prices are below their true worth. The
investor has also heard that the main component of investment
return is through capital appreciation in growth funds and
through dividend income in value funds.
The investor shows Dorothy the annual return data for
Fidelity’s Growth Index mutual fund (Growth) and Fidelity’s
Value Index mutual fund (Value). Table 3.1 shows a portion of
the annual returns (in %) for these two mutual funds from 1984
to 2018.
FILE
Growth_Value
⋮ ⋮ ⋮
Employee Myers-Briggs Assessment
1 Diplomat
2 Diplomat
⋮ ⋮
1000 Explorer
As shown in Table 3.3, the categories of the variable form the first
column of a frequency distribution. We then record the number of
employees that fall into each category. We can readily see from Table
3.3 that the Explorer personality type occurs with the most frequency,
while the Analyst personality type occurs with the least frequency. In
some applications, especially when comparing data sets of differing
sizes, our needs may be better served by focusing on the relative
frequency for each category rather than its frequency. The relative
frequency for each category is calculated by dividing the frequency by
the sample size. We can easily convert relative frequencies into
percentages by multiplying by 100. Table 3.3 shows that 40.4% of the
employees fall into the Explorer personality type.
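As a minimal R sketch (the personality values below are hypothetical), frequencies and relative frequencies can be obtained with the table function:
> personality <- c("Explorer", "Diplomat", "Explorer", "Analyst", "Sentinel")   # hypothetical
> counts <- table(personality)   # frequency distribution
> counts / length(personality)   # relative frequencies
> 100 * counts / length(personality)   # percentages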
A Bar Chart
Next, we show a graphical representation of a frequency distribution.
We first construct a vertical bar chart, sometimes referred to as a
column chart. The height of each bar is equal to the frequency or the
relative frequency of the corresponding category. Figure 3.1 shows
the bar chart for the Myers-Briggs variable.
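As a minimal R sketch (the frequencies below are hypothetical, not those in Table 3.3), a vertical bar chart can be drawn with the barplot function:
> counts <- c(Analyst = 12, Diplomat = 28, Explorer = 40, Sentinel = 20)   # hypothetical counts
> barplot(counts, xlab = "Personality Type", ylab = "Frequency", col = "blue")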
FILE
Transit_Survey
EXAMPLE 3.1
Recently, an urban university conducted a transportation survey
as part of its commitment to reduce its carbon footprint and
comply with the federal Clean Air Act. The survey was
distributed to students, faculty, and staff members in order to
learn the patterns of their daily commute. One of the questions
asked: “During a typical school week, how do you commute from
home to school?” Possible responses included Drive_Alone,
Public_Transit, Bicycle, Walk, and Other. Six hundred people
responded to the survey. Table 3.4 shows a portion of the survey
results.
1 Bicycle
2 Public_Transit
⋮ ⋮
600 Walk
Mode of Transportation Number of Respondents
Drive_Alone 57
Public_Transit 273
Bicycle 111
Walk 141
Other 18
c. Select cells D2:E6. Choose Insert > Insert Bar Chart. Select
the option on the top left side. (If you are having trouble finding
this option after selecting Insert, look for the horizontal bars
above Charts.) Figure 3.2 shows the bar chart. Note that in this
instance we have constructed a horizontal bar chart. If you wish
to construct a vertical bar chart, then you would choose Insert
> Column Chart.
FIGURE 3.2 Bar chart for Transit_Survey
Interval Frequency Relative Frequency
25 < x ≤ 50 9 0.2571
50 < x ≤ 75 0 0.0000
A Histogram
Next, we show a graphical representation of a frequency distribution.
For numerical data, a histogram is essentially the counterpart to the
vertical bar chart that we use for categorical data.
When constructing a histogram, we typically mark off the interval
limits along the horizontal axis. The height of each bar represents
either the frequency or the relative frequency for each interval. No
gaps appear between the interval limits.
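As a minimal R sketch (the return values below are hypothetical), a histogram can be drawn with the hist function:
> returns <- c(-12.4, 3.1, 8.7, 22.5, -46.5, 44.1, 15.0, 5.2)   # hypothetical returns (in %)
> hist(returns, xlab = "Annual Return (%)", main = "Histogram of Annual Returns")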
EXAMPLE 3.2
The Value variable from the introductory case shows annual
returns (in %) for Fidelity’s Value mutual fund from 1984 through
2018. Construct a frequency distribution and a histogram using
Excel and R, and then summarize the results.
SOLUTION:
Before using Excel or R, we need to make some decisions
about the number of intervals, as well as the width of each
interval. For a variable with 35 observations, it would be
reasonable to use five intervals. We then find that the minimum
and the maximum observations for the Value variable are
−46.52 and 44.08, respectively. Using the formula to
approximate the interval width, we calculate (44.08 −(−46.52))/5
= 18.12. It would be perfectly acceptable to construct, for
instance, a frequency distribution with five intervals where each
interval has a width of 20, and the lower limit of the first interval
is −50. However, because one of our objectives is to compare
the Growth returns with the Value returns, we use the same
number of intervals, same width, and same lower limit as we did
when we constructed the frequency distribution for
Growth; that is, we use six intervals, each with a
width of 25, and the first interval has a lower limit of −50.
Using Excel
Interval Frequency Relative Frequency
25 < x ≤ 50 6 0.1714
50 < x ≤ 75 0 0.0000
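For the R portion, a minimal hedged sketch (assuming the Growth_Value data have been imported into a data frame labeled myData with a column named Value) is:
> intervals <- seq(-50, 100, by = 25)   # six intervals, each with a width of 25
> table(cut(myData$Value, breaks = intervals))   # frequency distribution
> hist(myData$Value, breaks = intervals, xlab = "Annual Return (%)", main = "")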
The simplest graph should be used for a given set of data. Strive for
clarity and avoid unnecessary adornments.
Axes should be clearly marked with the numbers of their respective
scales; each axis should be labeled.
When creating a bar chart or a histogram, each bar/rectangle
should be of the same width. Differing widths create distortions.
The vertical axis should not be given a very high value as an upper
limit. In these instances, the data may appear compressed so that
an increase (or decrease) of the data is not as apparent as it
perhaps should be. For example, Figure 3.7(a) plots the daily price
for a barrel of crude oil for the first quarter of the year. Due to Middle
East unrest, the price of crude oil rose from a low of $83.13 per
barrel to a high of $106.19 per barrel, or approximately 28%. However, because Figure 3.7(a) uses a high value as
an upper limit on the vertical axis ($325), the rise in price appears
dampened. Figure 3.7(b) shows a vertical axis with an upper limit of
$110; this value better reflects the upper limit observed during this
time period.
FIGURE 3.7 Misleading vertical axis: unreasonably high upper limit
FILE
Stockprice
The vertical axis should not be stretched so that an increase (or
decrease) of the data appears more pronounced than warranted.
For example, Figure 3.8(a) charts the daily closing stock price of a
large retailer for the week of April 4. It is true that the stock price
declined over the week from a high of $60.15 to a low of $59.46;
this amounts to a $0.69 decrease, or an approximate 1% decline.
However, because the vertical axis is stretched, the drop in stock
price appears more dramatic. Figure 3.8(b) shows a vertical axis
that has not been stretched.
Rating Frequency
1 4
2 10
3 14
4 18
5 4
a. How many of the rookies received a rating of 4 or better? How many of the
rookies received a rating of 2 or worse?
b. Construct the relative frequency distribution. What proportion received a rating of
5?
c. Construct a bar chart. Comment on the findings.
2. The following frequency distribution shows the counts of sales of men’s shirts at an
online retailer over the weekend.
Size Frequency
Small 80
Medium 175
Large 210
X-Large 115
a. Construct the relative frequency distribution. What proportion of
sales were for a medium-sized shirt?
b. Construct a bar chart. Comment on the findings.
3. The following frequency distribution summarizes the counts of purchases by day of
the week for a major domestic retailer.
Day Frequency
Mon 2,504
Tue 2,880
Wed 3,402
Thur 3,566
Fri 4,576
Sat 5,550
Sun 5,022
Northeast 6,373
Midwest 7,647
South 16,609
West 9,069
a. Construct the relative frequency distribution. What proportion of people who live
below the poverty level live in the Midwest?
b. Construct a bar chart. Comment on the findings.
5. A recent poll of 3,057 individuals asked: “What’s the longest vacation you plan to
take this summer?” The following relative frequency distribution summarizes the
results.
a. Construct the frequency distribution. How many people are going to take a one-
week vacation this summer?
b. Construct a bar chart. Comment on the findings.
6. FILE Dining. A local restaurant is committed to providing its patrons with the best
dining experience possible. On a recent survey, the restaurant asked patrons to
rate the quality of their entrées. The responses ranged from 1 to 5, where 1
indicated a disappointing entrée and 5 indicated an exceptional entrée. A portion of
the 200 responses is as follows:
Response Rating
1 3
2 5
⋮ ⋮
200 4
a. Construct the frequency distribution that summarizes the results from the survey.
Which rating appeared with the most frequency?
b. Construct a bar chart. Are the patrons generally satisfied with the quality of their
entrées? Explain.
7. FILE Health. Patients at North Shore Family Practice are required to fill out a
questionnaire that gives the doctor an overall idea of each patient’s health. The first
question is: “In general, what is the quality of your health?” The patient chooses
Excellent, Good, Fair, or Poor. A portion of the 150 responses is as follows:
Response Quality
1 Fair
2 Good
⋮ ⋮
150 Good
a. Construct the frequency distribution that summarizes the results from the
questionnaire. What is the most common response to the questionnaire?
b. Construct a bar chart. How would you characterize the health of patients at this
medical practice? Explain.
8. FILE Millennials. A 2014 Religious Landscape Study by the Pew Research
Center found that 35% of Millennials (Americans born between 1981 and 1996)
identified themselves as not religious. A researcher wonders if this finding is
consistent today. She surveys 600 Millennials and asks them to rate their faith.
Possible responses were Strongly Religious, Somewhat Religious, Slightly
Religious, and Not Religious. A portion of the 600 responses is as follows:
Response Faith
1 Slightly Religious
2 Slightly Religious
⋮ ⋮
a. Construct the frequency distribution that summarizes the results from the survey.
What is the most common response to the survey?
b. Construct a bar chart. Do the researcher’s results appear consistent with those
found by the Pew Research Center? Explain.
9. A researcher conducts a mileage economy test involving 80 cars. The frequency
distribution describing average miles per gallon (mpg) appears in the following
table.
Average MPG Frequency
15 ≤ x < 20 15
20 ≤ x < 25 30
25 ≤ x < 30 15
30 ≤ x < 35 10
35 ≤ x < 40 7
40 ≤ x < 45 3
a. Construct the relative frequency distribution. What proportion of the cars got at
least 20 mpg but less than 25 mpg? What proportion of the cars got less than 35
mpg? What proportion of the cars got 35 mpg or more?
b. Construct a histogram. Comment on the shape of the distribution.
10. Consider the following relative frequency distribution that
summarizes the returns (in %) for 500 small cap stocks.
Return (%) Relative Frequency
0 ≤ x < 10 0.42
10 ≤ x < 20 0.25
20 ≤ x < 30 0.04
a. Construct the frequency distribution. How many of the stocks had a return of at
least 10% but less than 20%?
b. Construct a histogram. Comment on the shape of the distribution.
11. The manager at a water park constructed the following frequency distribution to
summarize attendance in July and August.
Attendance Frequency
a. Construct the relative frequency distribution. What proportion of the time was
attendance at least 1,750 but less than 2,000? What proportion of the time was
attendance less than 1,750? What proportion of the time was attendance 1,750
or more?
b. Construct a histogram. Comment on the shape of the distribution.
12. Fifty cities provided information on vacancy rates (in %) in local apartments in the
following relative frequency distribution.
Vacancy Rate (%) Relative Frequency
0 ≤ x < 3 0.10
3 ≤ x < 6 0.20
6 ≤ x < 9 0.40
9 ≤ x < 12 0.20
12 ≤ x < 15 0.10
a. Construct the frequency distribution. How many of the cities had a vacancy rate
of at least 6% but less than 9%? How many of the cities had a vacancy rate of at
least 9%?
b. Construct a histogram. Comment on the shape of the distribution.
13. The following relative frequency histogram summarizes the median household
income for the 50 states as reported by the U.S. Census Bureau in 2010.
a. Is the distribution symmetric? If not, is it positively or negatively
skewed?
b. What percentage of the states had median household income between $45,000
and $55,000?
c. What percentage of the states had median household income between $35,000
and $55,000?
14. The following histogram summarizes Apple Inc.’s monthly stock price for the years
2014 through 2018.
a. Is the distribution symmetric? If not, is it positively or negatively skewed?
b. Over this five-year period, approximate the minimum monthly stock price and the
maximum monthly stock price.
c. Over this five-year period, which interval had the highest relative frequency?
15. The following histogram summarizes the salaries (in $100,000s) for the 30 highest-
paid portfolio managers at a large investment firm over the past year.
1 1272
2 1089
⋮ ⋮
100 1389
a. Construct the frequency distribution for Expenditures. Use six intervals with
widths of 400 < x ≤ 700; 700 < x ≤ 1,000; etc. How many customers spent
between $701 and $1,000?
b. How many customers spent $1,300 or less? How many customers spent more
than $1,300?
17. FILE Census. The following table lists a portion of median house values (in $) for
the 50 states as reported by the U.S. Census Bureau in 2010.
State Median House Value ($)
Alabama 117600
Alaska 229100
⋮ ⋮
Wyoming 174000
a. Construct the frequency distribution and the histogram for the median house
values. Use six intervals with widths of 0 < x ≤ 100,000; 100,000 < x ≤ 200,000;
etc. Which interval had the highest frequency? How many of the states had
median house values of $300,000 or less?
b. Is the distribution symmetric? If not, is it positively or negatively skewed?
18. FILE DJIA_2019. The accompanying table shows a portion of the daily price index
for the Dow Jones Industrial Average (DJIA) for the first half of 2019.
Day DJIA
⋮ ⋮
a. Construct the frequency distribution and the histogram for the DJIA. Use five
intervals with widths of 22,000 < x ≤ 23,000; 23,000 < x ≤ 24,000; etc. On how
many days during the first half of 2019 was the DJIA more than 26,000?
b. Is the distribution symmetric? If not, is it positively or negatively skewed?
19. FILE Gas_2019. The following table lists a portion of the average price (in $) for a
gallon of gas for the 50 states and the District of Columbia as reported by AAA Gas
Prices on January 2, 2019.
State Price
Alabama 1.94
Alaska 3.06
⋮ ⋮
Wyoming 2.59
a. Construct the frequency distribution and the histogram for the average price of
gas. Use six intervals with widths of 1.70 < x ≤ 2.00; 2.00 < x ≤ 2.30; etc. Which
interval had the highest frequency? How many of the states had average gas
prices greater than $2.60?
b. Is the distribution symmetric? If not, is it positively or negatively skewed?
20. The accompanying figure plots the monthly stock price of a large construction
company from July 2017 through March 2019. The stock has experienced
tremendous growth over this time period, almost tripling in price. Does the figure
reflect this growth? If not, why not?
21. Annual sales at a small pharmaceutical firm have been rather stagnant over the
most recent five-year period, exhibiting only 1.2% growth over this time frame. A
research analyst prepares the accompanying graph for inclusion in a sales report.
Does this graph accurately reflect what has happened to sales over the last five
years? If not, why not?
A Contingency Table
Recall the Myers_Briggs data set discussed in Section 3.1. If we
expand the data set to include another categorical variable, say one’s
sex, it would allow us to examine the relationship between personality
type and one’s sex. Perhaps we are interested in whether certain
personality types are more prevalent among males versus females.
Table 3.8 shows a portion of the expanded data set, Myers_Briggs2.
FILE
Myers_Briggs2
1 Diplomat Female
2 Diplomat Female
⋮ ⋮ ⋮
FIGURE 3.9 A stacked column chart for personality type and sex
1 West yes
2 Northeast yes
⋮ ⋮ ⋮
600 South no
Using R
a. Import the Promotion data into a data frame (table) and label it
myData.
b. In order to create a contingency table, labeled as myTable, we
use the table(row, column) function and specify the row and
column variables. If you retype myTable, you will see a
contingency table that resembles Table 3.11. If we use the
prop.table function, then R returns cell proportions that, when
converted to percentages, are the same as those that appear in
Table 3.12. Enter:
> myTable <- table(myData$Location, myData$Purchase)
> myTable
> prop.table(myTable)
c. To create a stacked column chart similar to Figure 3.12, we
need to first create a contingency table with the Purchase
variable in rows and the Location variable in columns. Enter:
> myNewTable <- table(myData$Purchase, myData$Location)
d. We use the barplot function to construct a column chart. As we
saw when constructing a bar chart and a histogram, R offers a
number of options for formatting. Here we use main to add a
title; col to define colors for the segments of the columns;
legend to create a legend; xlab and ylab to provide labels for
the x-axis and y-axis, respectively; and ylim to extend the
vertical axis units from 0 to 200. Enter:
> barplot(myNewTable, main=“Location and Purchase”,
col=c('blue','red'), legend=rownames(myNewTable),
xlab='Location', ylab='Count', ylim = c(0,200))
The resulting stacked column chart should look similar to
Figure 3.12.
Summary
Compared to Table 3.10 with just raw data, Table 3.11, Table
3.12, and Figure 3.12 present the results of the location and
purchase example in a much more informative format. We can
readily see that of the 600 e-mail recipients, 410 of them made a
purchase using the promotional discount. With a 68.33%
positive response rate, this marketing strategy seemed
successful. However, there do appear to be some differences
depending on location. Recipients residing in the South and
West were a lot more likely to make a purchase (130 out of 154
and 101 out of 119, respectively) compared to those residing in
the Midwest (77 out of 184). It would be wise for the retailer to
examine if there are other traits that the customers in the South
and West share (age, gender, etc.). That way, in the next
marketing campaign, the e-mails can be even more targeted.
EXAMPLE 3.4
Recall the Growth_Value data set that contains annual returns
for Fidelity’s Growth and Value mutual funds from 1984 to 2018.
Construct a scatterplot of Value against Growth using Excel and
R, and then summarize the results.
SOLUTION:
Using Excel
a. Open the Growth_Value data file.
b. When constructing a scatterplot, Excel places the variable that
appears in the first column on the x-axis and the variable that
appears in the second column on the y-axis. Because we want
Growth to be on the x-axis and Value on the y-axis, we do not
need to rearrange the order of the columns. We simultaneously
select the observations for the Growth and Value variables and
choose Insert > Insert Scatter or Bubble Chart > Scatter. (If
you are having trouble finding this option, look for the graph
with data points above Charts.) The resulting scatterplot should
be similar to Figure 3.14.
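For the R portion, a minimal hedged sketch (assuming the Growth_Value data have been imported into a data frame labeled myData with columns named Growth and Value) is:
> plot(myData$Value ~ myData$Growth, xlab = "Growth (%)", ylab = "Value (%)")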
EXERCISES 3.2
Applications
22. FILE Bar. At a local bar in a small Midwestern town, beer and wine are the only
two alcoholic options. The manager conducts a survey on the bar’s customers over
the past weekend. Customers are asked to identify their sex (defined as male or
female) and their drink choice (beer, wine, or soft drink). A portion of the responses
is shown in the accompanying table.
Customer Sex Drink
1 male beer
2 male beer
⋮ ⋮ ⋮
1 female yes
2 female yes
⋮ ⋮ ⋮
186 male no
a. Construct a contingency table that cross-classifies the data by Sex and Feasible.
How many of the students were female? How many of the students felt that it
was feasible for men and women to be just friends?
b. What is the likelihood that a male student feels that men and women can be just
friends? What is the likelihood that a female student feels that men and women
can be just friends?
c. Construct a stacked column chart. Do male and female students feel the same or
differently about this topic? Explain.
24. FILE Shift. Metalworks, a supplier of fabricated industrial parts, wonders if there is
any connection between when a component is constructed (Shift is equal to 1, 2, or
3) and whether or not it is defective (Defective is equal to Yes if the component is
defective, No otherwise). The supplier collects data on the construction of 300
components. A portion of the data is shown in the accompanying table.
Component Shift Defective
1 1 No
2 1 Yes
⋮ ⋮ ⋮
300 3 No
a. Construct a contingency table that cross-classifies the data by shift and whether
or not the component is defective. How many components constructed during
Shift 1 were defective? How many components constructed during Shift 2 were
not defective?
b. Given that the component was defective, what is the likelihood that it was
constructed during Shift 2? Given that the component was defective, what is the
likelihood that it was constructed during Shift 3? Does there seem to be any
connection between when a component is constructed and whether or not it is
defective? Explain.
c. Construct a stacked column chart. Are the defect rates consistent across all
shifts? Explain.
25. FILE Athletic. A researcher at a marketing firm examines whether the age of a
consumer matters when buying athletic clothing. Her initial feeling is that Brand A
attracts a younger customer, whereas the more established companies (Brands B
and C) draw an older clientele. For 600 recent purchases of athletic clothing, she
collects data on a customer’s age (Age equals 1 if the customer is under 35, 0
otherwise) and the brand name of the athletic clothing (A, B, or C). A portion of the
data is shown in the accompanying table.
Purchase Age Brand
1 1 A
2 1 A
⋮ ⋮ ⋮
600 0 C
a. Construct a contingency table that cross-classifies the data by Age and Brand.
How many of the purchases were for Brand A? How many of the purchases were
from customers under 35 years old?
b. Given that the purchase was made by a customer under 35 years old, what is the
likelihood that the customer purchased Brand A? Brand B? Brand C? Do the data
seem to support the researcher’s belief? Explain.
c. Construct a stacked column chart. Does there appear to be a
relationship between the age of the customer and the brand purchased?
26. FILE Study. A report suggests that business majors spend the least amount of
time on course work compared to all other college students (The Washington Post,
January 28, 2017). A provost of a university conducts a survey on 270 students.
Students are asked their major (business or nonbusiness) and if they study hard
(yes or no), where study hard is defined as spending at least 20 hours per week on
course work. A portion of the responses is shown in the accompanying table.
1 business yes
2 business yes
⋮ ⋮ ⋮
270 nonbusiness no
a. Construct a contingency table that cross-classifies the data by Major and Study
Hard. How many of the students are business majors? How many of the students
study hard?
b. Given that the student is a business major, what is the likelihood that the student
studies hard? Given that the student is a nonbusiness major, what is the
likelihood that the student studies hard? Do the data seem to support the findings
in the report? Explain.
c. Construct a stacked column chart. Comment on the findings.
27. FILE Test_Scores. The accompanying table shows a portion of midterm and final
grades for 32 students. Construct a scatterplot of Final against Midterm. Describe
the relationship.
Final Midterm
86 78
94 97
⋮ ⋮
91 47
28. FILE Life_Obesity. The accompanying table shows a portion of life expectancies
(in years) and obesity rates (in %) for the 50 states and the District of Columbia.
Construct a scatterplot of Life Expectancy against Obesity. Describe the
relationship.
State Life Expectancy Obesity
⋮ ⋮ ⋮
29. FILE Consumption. The accompanying table shows a portion of quarterly data for
average U.S. annual consumption (Consumption in $) and disposable income
(Income in $) for the years 2000–2016. Construct a scatterplot of Consumption
against Income. Describe the relationship.
⋮ ⋮ ⋮
30. FILE Return. In order to diversify risk, investors are often encouraged to invest in
assets whose returns have either a negative relationship or no relationship. The
accompanying table shows a portion of the annual return data (in %) on two assets.
Construct a scatterplot of Return B against Return A. In order to diversify risk,
would the investor be wise to include both of these assets in her portfolio? Explain.
Return A Return B
−20 2
−5 0
⋮ ⋮
10 2
31. FILE Healthy_Living. Healthy living has always been an important goal for any
society. Most would agree that a diet that is rich in fruits and vegetables (FV) and
regular exercise have a positive effect on health, while smoking has a negative
effect on health. The accompanying table shows a portion of the percentage of
these variables observed in various states in the United States.
EXAMPLE 3.5
The Birth_Life data file contains information on the following
variables for 10 countries in 2010: country name (Country
Name), life expectancy (Life Exp in years), birth rate (Birth Rate
in percent), GNI per capita (GNI in $), and level of development
(Development). A portion of the Birth_Life data set is shown in
Table 3.13.
TABLE 3.13 A Portion of the Birth_Life Data Set
Using R
a. Import the Birth_Life data into a data frame (table) and label it
myData.
b. To create a scatterplot that incorporates the Development
variable, we use the plot function. Here we use the options
main to add a title; xlab and ylab to provide labels for the x-axis
and y-axis, respectively; pch to define the shape of the markers;
and col to define the color of the markers. The shapes and colors of
the markers are based on the categories of the Development
variable. Enter:
> plot(myData$'Birth Rate'~myData$'Life Exp',
main="Scatterplot of Birth Rate against Life Expectancy", xlab =
"Life Expectancy (in years)", ylab = "Birth Rate (in %)", pch=16,
col=ifelse(myData$Development == "Developing", 20, 26))
c. We add a legend on the right side of the scatterplot using the
legend function. Enter:
> legend("right", legend=c("Developing", "Developed"), pch=16,
col=c(20, 26))
The resulting scatterplot should be similar to Figure 3.15.
Summary
From Figure 3.15, we see a negative linear relationship between
birth rate and life expectancy. That is, countries with lower birth
rates tend to have higher life expectancies. This relationship
holds true for both developing and developed countries. We also
see that, in general, developed countries have lower birth rates
and higher life expectancies as compared to developing
countries.
A Bubble Plot
A bubble plot shows the relationship between three numerical
variables. In a bubble plot, the third numerical variable is represented
by the size of the bubble. For instance, a bubble plot may plot a
college student’s study time against screen time and use the size of
the bubble to represent the student’s GPA. This bubble plot would
help us understand the relationships between study time, screen time,
and academic performance.
We illustrate the use of a bubble plot in Example 3.6.
FILE
Birth_Life
EXAMPLE 3.6
Revisit the Birth_Life data from Example 3.5. Use Excel and R
to construct a bubble plot of birth rate against life expectancy
that uses the GNI variable for the size of the bubbles.
Summarize the results.
SOLUTION:
Using Excel
Using R
a. Import the Birth_Life data into a data frame (table) and label it
myData.
b. We first create an empty plot by using the plot function (with
the appropriate x and y variables) and the option type = “n”,
which means no plotting. Enter:
> plot(myData$'Birth Rate'~myData$'Life Exp', type="n")
c. We then use the symbols function to plot the
bubbles representing the observations. We use the options
circles, inches, and bg to specify the radii, sizes, and color of
the bubbles, respectively. The bubbles are sized based on the
values of the GNI of the countries. As in Example 3.5, we also
use the options main, xlab, and ylab. Enter:
> symbols(myData$'Birth Rate'~myData$'Life Exp',
circles=myData$GNI, inches = 0.5, bg = 'blue', main="A bubble
plot of birth rate, life expectancy, and GNI", xlab = "Life
Expectancy (in years)", ylab = "Birth Rate (in %)")
The resulting bubble plot should be similar to Figure 3.16.
Summary
From Figure 3.16 we see that a country’s birth rate and its
average life expectancy display a negative linear relationship.
We also see that countries with low birth rates and high life
expectancies have higher GNI per capita, which is indicative of
developed countries.
A Line Chart
A line chart displays a numerical variable as a series of data points
connected by a line. A line chart is especially useful for tracking
changes or trends over time. For example, using a line chart that plots
the sales of Samsung’s smartphones over time, we can easily tell
whether the sales follow an upward, a downward, or a steady trend. It
is also easy for us to identify any major changes that happened in the
past on a line chart. For example, due to a major product recall,
Samsung experienced a significant drop in sales in late 2016. We can
easily identify the event from the line chart because the data point for
that year would dip dramatically.
When multiple lines are plotted in the same chart, we can compare
these observations on one or more dimensions. For example, if we
simultaneously plot the historical sales of Apple’s iPhones alongside
those of Samsung’s smartphones, we would be able to compare the
trends and the rates of change of the two companies. We may even
detect interesting patterns such as whether a drop in the sales of
Samsung’s smartphones coincides with a surge in the sales of
iPhones. In order to illustrate the use of line charts, consider Example
3.7.
FILE
Growth_Value
EXAMPLE 3.7
Recall the introductory case where data are provided on the
annual returns (in %) for Fidelity’s Growth and Value mutual
funds from 1984 to 2018. Use Excel and R to construct line
charts for Growth and Value. Summarize the results.
SOLUTION:
Using Excel
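For the R portion, a minimal hedged sketch (assuming the Growth_Value data have been imported into a data frame labeled myData; the column names Year, Growth, and Value are assumptions) is:
> plot(myData$Year, myData$Growth, type = "l", col = "blue",
xlab = "Year", ylab = "Annual Return (%)")
> lines(myData$Year, myData$Value, col = "red")   # add the Value series
> legend("topleft", legend = c("Growth", "Value"), col = c("blue", "red"), lty = 1)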
A Heat Map
A heat map is an important visualization tool that uses color or color
intensity to display relationships between variables. Heat maps are
especially useful to identify combinations of the categorical variables
that have economic significance. There are a number of ways to
display a heat map, but they all share one thing in common—they use
color to communicate the relationships between the variables that
would be harder to understand by simply inspecting the raw data. For
example, we can use a heat map to show which products
are the best- or worst-selling products at various stores or
show the most- or least-frequently downloaded music genres across
various music streaming platforms. Example 3.8 illustrates the use of
a heat map.
FILE
Bookstores
EXAMPLE 3.8
A national bookstore chain is trying to understand customer
preferences at various store locations. The marketing
department has acquired a list of 500 of the most recent
transactions from four of its stores. The data set includes the
record number (Record), which one of its four stores sold the
book (BookStore), and the type of book sold (BookType). The
marketing department wants to visualize the data using a heat
map to help it understand customer preferences at different
stores. A portion of the Bookstores data set is shown in Table
3.14.
1 Store2 Biography
⋮ ⋮ ⋮
Using R
a. Import the Bookstores data into a data frame (table) and label
it myData.
b. We first use the table function to create a contingency table,
labeled myTable, that summarizes the number of each type of
book sold at each store. We then use the rowSums function to
retrieve the total number of books sold at each store and divide
the values in the contingency table by those numbers to get a
modified contingency table that summarizes each type of book
sold as a percentage of all books sold at each store. Enter:
> myTable <-table(myData$BookStore, myData$BookType)
> myTable <- myTable/rowSums(myTable)
c. In order to construct a heat map in R, the data must be
converted into a data matrix, which is a two-dimensional data
structure whose columns must have the same data type and
length. We use the as.matrix function to make this conversion
and label the data matrix as myData.matrix. Enter:
> myData.matrix <- as.matrix(myTable)
d. We use the heatmap function to construct the heat map. We
choose the heat.colors palette with 256 colors using col =
heat.colors(256). By default, this palette uses warm colors
(e.g., orange and red) to show smaller values and lighter colors
(e.g., yellow and white) to show larger values. We use the scale
= “none” option to ensure that the data are not standardized
when used to plot the heat map. Finally, we use Rowv = NA
and Colv = NA to suppress a tree-like diagram called a
dendrogram. A dendrogram will be discussed later in Chapter
11. Enter:
> heatmap(myData.matrix, col = heat.colors(256), scale =
“none”, Rowv= NA, Colv = NA)
The resulting heat map is shown in Figure 3.19(a).
Summary
The heat maps in Figures 3.18 and 3.19 reveal that customers’
book preferences do differ across different store locations. For
example, romance fiction is the most popular book type sold at
Store2 but the least popular at Store3. Self-help books are the
least popular books at Store1. The management can use this
information to make decisions about how many copies of each
type of book to stock at each store.
page 113
EXERCISES 3.3
Applications
33. FILE InternetStocks. A financial analyst wants to compare the performance of the
stocks of two Internet companies, Amazon (AMZN) and Google (GOOG). She
records the average closing prices of the two stocks for the years 2010 through
2016. A portion of the data is shown in the accompanying table. Construct a line
chart that shows the movements of the two stocks over time using two lines each
with a unique color. Describe the overall trend of price movement for the two
stocks. Which stock shows the greater trajectory of price appreciation?
⋮ ⋮ ⋮
34. FILE India_China. It is believed that India will overtake China to become the
world’s most populous nation much sooner than previously thought (CNN, June 19,
2019). The accompanying data file, compiled by the World Bank, contains the
population data, in millions, for India and China from 1960 to 2017. Construct a line
chart that shows the changes in the two countries’ populations over time using two
lines each with a unique color. Describe the overall trend of population growth in
the two countries. Which country shows the faster population growth during the
past 40 years?
⋮ ⋮ ⋮
35. FILE HighSchool_SAT. The accompanying table shows a portion of the average
SAT math score (Math), the average SAT writing score (Writing), the number of test
takers (Test Taker), and whether the school is a private or public school (Type) for
25 high schools in a major metropolitan area.
a. Construct a bubble plot that shows the math score on the x-axis, the writing score
on the y-axis, and the number of test takers as the size of the bubble. Do math
score and writing score show a linear, nonlinear, or no relationship? If the
relationship is a linear relationship, is it a positive or negative relationship? Do
math score and the size of the school (using the number of test takers as a
proxy) show a linear, nonlinear, or no relationship?
b. Construct a scatterplot that shows the math score on the x-axis and the writing
score on the y-axis. Use different colors or symbols to show whether the high
school is a private or public school. Describe the relationships between math
score, writing score, and school type. Does the relationship between math score
and writing score hold true for both private and public schools?
36. FILE Car_Price. The accompanying table shows a portion of data consisting of
the selling price, the age, and the mileage for 20 used sedans.
13590 6 61485
13775 6 54344
⋮ ⋮ ⋮
11988 8 42408
page 114
a. Construct a bubble plot that shows price on the x-axis, age on
the y-axis, and mileage as the sizes of the bubbles. Describe the relationships
between price, age, and mileage of these used sedans.
b. Convert Mileage into a categorical variable, Mileage_Category, by assigning all
cars with less than 50,000 miles to the “Low_Mileage” category and the rest to
the “High_Mileage” category. How many cars are in the “High_Mileage”
category?
c. Construct a scatterplot using Price, Age, and Mileage_Category. Use different
colors or symbols to show cars that belong to the different mileage categories.
Describe the relationships between price, age, and mileage of these used
sedans. Does the relationship between price and age hold true for both mileage
categories?
37. FILE TShirts. A company that sells unisex t-shirts is interested in finding out the
color and size of its best-selling t-shirt. The accompanying data file contains the
size, color, and quantity of t-shirts that were ordered during the last 1,000
transactions. A portion of the data is shown in the accompanying table.
1 1 XL Purple
2 3 M Blue
⋮ ⋮ ⋮ ⋮
1000 1 S Red
a. Construct a contingency table that shows the total quantity sold for each color
and size combination. How many size M red t-shirts were sold? How many size
XL purple t-shirts were sold?
b. Construct a heat map that displays colors or color intensity based on the total
quantity sold. Which two color and size combinations are the most popular ones?
Which two are the least popular ones?
38. FILE Crime_Analysis. The local police department is performing a crime analysis
to find out which crimes are most likely to occur at which locations. The
accompanying data file contains the types of crimes (CrimeType) that occurred at
various locations (Location) in the city over the past five years. A portion of the data
is shown in the accompanying table.
1 Narcotics Street
2 Assault Residence
⋮ ⋮ ⋮
a. Construct a contingency table that shows the frequencies for CrimeType and
Location combinations. How many of the crimes were for burglary and happened
in a residence?
b. Construct a heat map that displays colors or color intensity based on the
frequencies. Which three crime type and location combinations are the most
frequent ones?
page 115
The Mean
The arithmetic mean is the primary measure of central location.
Generally, we refer to the arithmetic mean as simply the mean or the
average. In order to calculate the mean of a variable, we simply add
up all the observations and divide by the number of observations. The
only thing that differs between a population mean and a sample mean
is the notation. The population mean is referred to as μ, where μ is the
Greek letter mu (pronounced as “mew”). For observations x1, x2, . . . , xN,
the population mean is calculated as μ = (x1 + x2 + ⋯ + xN)/N = Σxi/N. The
sample mean, denoted x̄ (x-bar), is calculated in the same way using the n
observations in the sample: x̄ = Σxi/n.
The Median
The mean is used extensively in data analysis. However, it can give a
misleading description of the center of the distribution in the presence
of extremely small or large observations, also referred to as outliers.
Because the mean can be affected by outliers, we often also calculate
the median as a measure of central location. The median is the
middle value of a data set; that is, an equal number of observations lie
above and below the median. After arranging the data in ascending
order (smallest to largest), we calculate the median as (1) the middle
value if the number of observations is odd or (2) the average of the
two middle values if the number of observations is even.
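As a quick illustration with made-up numbers: for the sorted observations 3, 7, and 9, the median is the middle value, 7; for the sorted observations 3, 7, 9, and 12, the median is the average of the two middle values, (7 + 9)/2 = 8.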
Many government publications and other data sources publish
both the mean and the median in order to accurately portray a
variable’s typical value. If the mean and the median differ significantly,
then it is likely that the variable contains outliers. For instance, in 2017
the U.S. Census Bureau determined that the median income for
American households was $61,372; however, the mean income was
$86,220. It is well documented that a small number of households in
the United States have income that is considerably higher than the
typical American household income. As a result, these top-earning
households influence the mean by pushing its value significantly
above the value of the median.
The Mode
The mode of a variable is the observation that occurs most frequently.
A variable can have more than one mode, or even no mode. If a
variable has one mode, then we say it is unimodal. If it has two
modes, then it is common to call it bimodal. If two or more modes
exist, then the variable is multimodal. Generally, the mode’s
usefulness as a measure of central location tends to diminish for a
variable with more than three modes. If we want to summarize a
categorical variable, then the mode is the only meaningful measure of
central location.
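Base R does not include a built-in function for the mode, but it is easy to obtain by tabulating the values and picking the most frequent one. A minimal sketch with a made-up variable x follows; if two or more values tie for the highest frequency, this returns only the first of them. Enter:
> x <- c(2, 4, 4, 5, 7, 4, 2)
> names(which.max(table(x)))
And R returns: "4".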
page 116
FILE
Growth_Value
EXAMPLE 3.9
Using Excel and R, calculate the mean and the median for the
Growth and the Value variables from the introductory case.
Summarize the results.
SOLUTION:
Using Excel
I. Excel’s Formula Option Excel provides built-in formulas for
virtually every summary measure that we may need. To
illustrate, we follow these steps to calculate the mean and the
median for the Growth variable.
a. Open the Growth_Value data file.
b. Enter =AVERAGE(B2:B36). Verify that the output is 15.1074.
c. To calculate the median, enter =MEDIAN(B2:B36). Verify that
the output is 14.44.
If we want to calculate the mean return for the Value
variable, and because the data occupy cells C2 through C36 on
the spreadsheet, we enter =AVERAGE(C2:C36). Because the
Growth and Value variables each contain 35 unique
observations and no duplicates, it is not practical to compute the
mode. For other variables, we enter =MODE(array) to calculate
the mode, where the notation array specifies the range of cells
to be included in the calculation. When introducing new
functions later in this chapter and other chapters, we will follow
this format. The first and second columns of Table 3.15 show
various descriptive measures and
corresponding function names in Excel. We will
refer back to Table 3.15 on a few occasions in this chapter.
TABLE 3.15 Excel and R functions for descriptive measures (portion)
Descriptive Measure    Excel            R
Location
Multiple measures      NA               summary(df)
Dispersion
⋮ ⋮ ⋮
Shape
Skewness               =SKEW(array)     NA
Kurtosis               =KURT(array)     NA
Association
⋮ ⋮ ⋮
a The notation df refers to the data frame or file and the notation var refers to the
variable name. The variable name should be specified in single quotations if it
consists of more than one word or if it is a number.
b NA denotes that a simple function is not readily available.
d The range function in R returns the minimum and maximum values, so the range
can be calculated by taking the difference between the two values.
e The mad function calculates the median absolute deviation, rather than the mean
absolute deviation. Alternatively, we can install the ‘lsr’ package in R and use the aad
function, which computes the mean absolute deviation.
TABLE 3.16 Excel’s Descriptive Statistics output (portion)
Growth              Value
Count    35         Count    35
Using R
Like Excel, R has many built-in formulas or functions. In R, we
denote all function names in boldface and all options within a
function in italics. The first and third columns of Table 3.15 show
various descriptive measures and corresponding function names
in R.
a. Import the Growth_Value data into a data frame (table) and
label it myData.
b. The mean function will return the mean for a specified variable
in a data frame. In order to find the mean for the Growth
variable, enter:
> mean(myData$Growth)
And R returns: 15.10743.
page 118
c. The summary function will return the minimum,
first quartile, median, mean, third quartile, and maximum values
for each variable in a data frame. Enter:
> summary(myData)
Table 3.17 shows the R output using the summary function.
TABLE 3.17 R Output Using the summary Function
A Percentile
Recall that the median is the middle observation of a variable; that is,
half of the observations fall below this observation and half fall above
it. The median is also called the 50th percentile. In many instances,
we are interested in a percentile other than the 50th percentile. In
general, the pth percentile divides a variable into two parts:
Approximately p percent of the observations are less than the pth
percentile.
Approximately (100 − p) percent of the observations are greater
than the pth percentile.
page 119
A PERCENTILE
In general, the pth percentile divides a variable into two parts:
Approximately p percent of the observations are less than the
pth percentile.
Approximately (100 − p) percent of the observations are
greater than the pth percentile.
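In R, any percentile can be obtained with the quantile function. A minimal sketch for the Growth variable, assuming the data have been imported into a data frame labeled myData as in earlier examples, is:
> quantile(myData$Growth, probs = c(0.25, 0.50, 0.75))
The probs option can be set to any percentile of interest, for example probs = 0.90 for the 90th percentile. Because software packages use slightly different conventions for computing percentiles, the values may differ marginally from those produced by other programs.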
EXAMPLE 3.10
Using Table 3.17, interpret the first and the third quartiles for the
Growth variable from the introductory case.
SOLUTION:
The first quartile for the Growth variable is 2.13%. Approximately
25% of the returns are less than 2.13%, and approximately 75%
of the returns are greater than 2.13%. The third quartile for the
Growth variable is 32.00%. Approximately 75% of the returns
are less than 32.00%, and approximately 25% of the returns are
greater than 32.00%.
Measures of Dispersion
While measures of central location reflect the typical or central value
of a variable, they fail to describe the underlying dispersion of the
variable. We now discuss several measures of dispersion that gauge
the variability of a data set. Each measure is a numerical value that
equals zero if all observations are identical and increases as the
observations become more diverse.
The Range
The range is the simplest measure of dispersion; it is the difference
between the maximum and the minimum observations of a variable.
The range is not considered a good measure of dispersion because it
focuses solely on the extreme observations and ignores every other
observation of a variable.
page 120
The variance and the standard deviation are the most widely used
measures of dispersion. The sample variance, denoted s2, is an average
of the squared differences between the observations and the sample
mean. To compute the population variance, denoted σ2, we
substitute the population mean μ for the sample mean and the
population size N for the sample size n.
Whatever the units of the variable, the variance has squared units. In
order to return to the original units of measurement, we take the
positive square root of s2 or σ2, which gives us either the sample
standard deviation s or the population standard deviation σ.
Table 3.15 shows the function names for various measures of
dispersion in Excel and R. Recall too that Excel’s Descriptive
Statistics option using the Data Analysis ToolPak provides many
summary measures using a single command. For measures of
dispersion, Excel’s Descriptive Statistics option treats the data as a
sample and calculates the sample variance and the sample standard
deviation. These values for the Growth and the Value mutual funds
are shown in Table 3.16.
MEASURES OF DISPERSION
Measures of dispersion can be summarized as follows.
The range is the difference between the maximum and the
minimum observations. The main weakness of the range is
that it ignores all observations except the extremes.
The interquartile range (IQR) is the difference between the
third quartile and the first quartile. The measure does not rely
on the extreme observations; however, it does not incorporate
all observations.
The mean absolute deviation (MAD) is an average of the
absolute differences between the observations and the mean.
The variance is an average of the squared differences
between the observations and the mean. The standard
deviation is the positive square root of the variance.
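A minimal R sketch for these measures, using the Growth variable and assuming the myData data frame from earlier examples, is:
> max(myData$Growth) - min(myData$Growth)
> IQR(myData$Growth)
> var(myData$Growth)
> sd(myData$Growth)
The first line returns the range, and the remaining lines return the interquartile range, the sample variance, and the sample standard deviation. As noted in the footnotes to Table 3.15, the mad function returns the median, rather than the mean, absolute deviation.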
page 121
EXAMPLE 3.11
Use the information in Table 3.18 to
a. Compare the risk of investing in Growth versus Value using the
standard deviation.
b. Calculate and interpret the Sharpe ratios for Growth and Value.
Assume that the return on a 1-year T-bill is 2%.
SOLUTION:
We had earlier shown that the Growth mutual fund had a higher
return, which is good, along with a higher standard deviation,
which is bad. We can use the Sharpe ratio to make a valid
comparison between the funds. The Growth mutual fund
provides a higher Sharpe ratio than the Value mutual fund
(0.5502 > 0.5270); therefore, the Growth mutual fund offered
more reward per unit of risk.
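As a check on these figures, the Sharpe ratio is computed as the mean return minus the risk-free return, divided by the standard deviation. Using the rounded summary measures for the two funds, Growth gives (15.11 − 2)/23.82 ≈ 0.55 and Value gives (11.44 − 2)/17.92 ≈ 0.53, consistent with the more precise values of 0.5502 and 0.5270 reported in the solution.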
Measures of Shape
In this section, we examine measures of shape, namely, the
skewness coefficient and the kurtosis coefficient.
page 122
MEASURES OF SHAPE
Measures of shape can be summarized as follows.
The skewness coefficient measures the degree to which a
distribution is not symmetric about its mean. A symmetric
distribution has a skewness coefficient of 0. A positively
(negatively) skewed distribution has a positive (negative)
skewness coefficient.
The kurtosis coefficient measures whether the tails of a
distribution are more or less extreme than the normal
distribution. Because the normal distribution has a kurtosis
coefficient of 3, it is common to calculate the excess kurtosis of
a distribution as the kurtosis coefficient minus 3.
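Table 3.15 lists the Excel functions =SKEW(array) and =KURT(array) for these measures but shows NA for R because no simple base R function exists. One option, assuming the moments package (not used elsewhere in the text) has been installed with install.packages("moments"), is:
> library(moments)
> skewness(myData$Growth)
> kurtosis(myData$Growth)
The kurtosis function in this package returns the raw kurtosis coefficient, so excess kurtosis is obtained by subtracting 3. Because packages apply slightly different sample adjustments, the values may differ marginally from Excel’s.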
EXAMPLE 3.12
Interpret the skewness and the kurtosis coefficients in Table
3.16 for the Growth and Value variables from the introductory
case.
page 123
Measures of Association
In Section 3.2 we used a scatterplot to visually assess whether two
numerical variables had some type of systematic relationship. Here,
we present two numerical measures of association that quantify the
direction and strength of the linear relationship between two variables,
x and y. It is important to point out that these measures are not
appropriate when the underlying relationship between the variables is
nonlinear.
The Covariance
An objective numerical measure that reveals the direction of the linear
relationship between two variables is called the covariance. Like
variance, the formula for the covariance depends on whether we have
a sample or a population. The sample covariance, denoted as sxy, is
calculated as sxy = Σ(xi − x̄)(yi − ȳ)/(n − 1); for the population
covariance σxy, we substitute the population means μx and μy for the
sample means and the population size N for n − 1.
MEASURES OF ASSOCIATION
Measures of association can be summarized as follows.
The covariance between two variables x and y indicates
whether they have a negative linear relationship, a positive
linear relationship, or no linear relationship.
The correlation coefficient between two variables x and y
indicates the direction and the strength of the linear
relationship.
page 124
FILE
Growth_Value
EXAMPLE 3.13
Using Excel and R, calculate the correlation coefficient between
the Growth and the Value variables from the introductory case.
Then summarize the results.
SOLUTION:
Using Excel
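As a sketch of the commands involved: in Excel, with the Growth returns in cells B2:B36 and the Value returns in cells C2:C36 as before, enter =CORREL(B2:B36, C2:C36). In R, assuming the data have been imported into a data frame labeled myData, enter:
> cor(myData$Growth, myData$Value)
Both commands return the sample correlation coefficient between the two funds’ returns.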
An analysis of annual return data for Fidelity’s Growth and Value mutual funds for
the years 1984 through 2018 provides important information for an investor trying to
determine whether to invest in a growth mutual fund, a value mutual fund, or both
types of mutual funds. Over this period, the mean return for the Growth fund of
15.11% is greater than the mean return for the Value fund of 11.44%. While the mean
return typically represents the reward of investing, it does not incorporate the risk of
investing. Standard deviation tends to be the most common measure of risk with
financial data. Because the standard deviation for the Growth fund (23.82%) is greater
than the standard deviation for the Value fund (17.92%), the Growth fund is likelier to
have returns farther above and below its mean. Finally, given a risk-free rate of 2%,
the Sharpe ratio for the Growth fund is 0.5502, compared to that for the Value fund of
0.5270, indicating that the Growth fund provides more reward per unit of risk.
Assuming that the behavior of these returns will continue, the investor will favor
investing in Growth over Value. A commonly used disclaimer, however, states that
past performance is no guarantee of future results. Because the two styles often
complement each other, it might be advisable for the investor to add diversity to his
portfolio by using them together.
EXERCISES 3.4
Applications
39. FILE Corporations. Monthly stock prices (in $) for Corporation A and Corporation
B are collected for five years. A portion of the data is shown in the accompanying
table.
Date A B
⋮ ⋮ ⋮
a. Calculate the mean and the standard deviation for each corporation’s stock price.
b. Which corporation had the higher average stock price over the time period?
Which corporation’s stock price had greater dispersion as measured by the
standard deviation?
40. FILE HD_Lowe’s. Annual revenues (in $ millions) for Home Depot and Lowe’s
Corporation are collected for 13 years. A portion of the data is shown in the
accompanying table.
Year HD Lowe’s
⋮ ⋮ ⋮
a. For each company, calculate the mean and the median revenues for this time
period. Which company had higher average revenues?
b. For each company, calculate the variance and the standard deviation for this time
period. Which company’s revenues had more dispersion as measured by the
standard deviation?
41. FILE Prime. The accompanying table shows a portion of the annual expenditures
(in $) for 100 Amazon Prime customers.
Customer Expenditures
1 1272
2 1089
⋮ ⋮
100 1389
State Price
Alaska 3.06
Alabama 1.94
⋮ ⋮
Wyoming 2.59
page 126
43. FILE Rent. The following table shows a portion of the monthly
rent and square footage for 40 rentals in a large college town.
1 645 500
2 675 648
⋮ ⋮ ⋮
40 2400 2700
a. Calculate the mean and the standard deviation for monthly rent.
b. Calculate the mean and the standard deviation for square footage.
44. Refer to the previous exercise for a description of the data.
a. The skewness and (excess) kurtosis coefficients for monthly rent are 1.0198 and
0.4790, respectively. Interpret these values.
b. The skewness and (excess) kurtosis coefficients for square footage are 3.0573
and 12.3484, respectively. Interpret these values.
45. FILE Highway. Many environmental groups and politicians are suggesting a return
to the federal 55-mile-per-hour (mph) speed limit on America’s highways. They
argue that not only will a lower national speed limit reduce greenhouse emissions, it
will also increase traffic safety. A researcher believes that a lower speed limit will
not increase traffic safety because he feels that traffic safety is based on the
variability of the speeds with which people are driving, rather than the average
speed. The researcher gathers the speeds of 40 cars from a highway with a speed
limit of 55 mph (Highway 1) and the speeds of 40 cars from a highway with a speed
limit of 65 mph (Highway 2). A portion of the data is shown in the accompanying
table.
1 60 70
2 55 65
⋮ ⋮ ⋮
40 52 65
Date A B
⋮ ⋮ ⋮
⋮ ⋮ ⋮
a. Which fund had the higher reward over this time period? Explain.
b. Which fund was riskier over this time period? Explain.
c. Given a risk-free rate of 2%, which fund has the higher Sharpe ratio? What does
this ratio imply?
48. Refer to the previous exercise for a description of the data.
a. The skewness and (excess) kurtosis coefficients for the Latin America fund are
0.3215 and −0.7026, respectively. Interpret these values.
b. The skewness and (excess) kurtosis coefficients for the Canada fund are
−0.2531 and 0.0118, respectively. Interpret these values.
49. FILE Tech_Energy. The accompanying table shows a portion of the annual
returns (in %) for a technology mutual fund and an energy mutual fund from 1982
through 2018.
⋮ ⋮ ⋮
page 127
52. FILE Happiness_Age. Many attempts have been made to relate
happiness with various factors. A 2018 study in the Journal of Happiness relates
happiness with age and finds that holding everything else constant, people are
least happy when they are in their mid-40s. The accompanying data file shows a
respondent’s age and his/her perception of well-being on a scale from 0 to 100.
a. Calculate and interpret the correlation coefficient between age and happiness.
b. Construct a scatterplot to point out a flaw with the correlation analysis in part a.
A Boxplot
A common way to quickly summarize a variable is to use a five-
number summary. A five-number summary shows the minimum value,
the quartiles (Q1, Q2, and Q3), and the maximum value of the
variable. A boxplot, also referred to as a box-and-whisker plot, is a
convenient way to graphically display the five-number summary of a
variable. In general, a boxplot is constructed as follows:
Plot the five-number summary values in ascending order on the
horizontal axis.
Draw a box encompassing the first and third quartiles.
Draw a dashed vertical line in the box at the median.
Calculate the interquartile range (IQR). Recall that IQR = Q3 − Q1.
Draw a line (“whisker”) that extends from Q1 to the minimum value
that is not farther than 1.5 × IQR from Q1. Similarly, draw a line that
extends from Q3 to the maximum value that is not farther than 1.5 ×
IQR from Q3.
Use an asterisk (or other symbol) to indicate observations that are
farther than 1.5 × IQR from the box. These observations are
considered outliers.
Consider the boxplot in Figure 3.21. The left whisker extends from Q1
to the minimum value (Min) because Min is not farther than 1.5 × IQR
from Q1. The right whisker, on the other hand, does not extend from
Q3 to the maximum value because there is an observation that is
farther than 1.5 × IQR from Q3. The asterisk indicates that this
observation is considered an outlier.
FIGURE 3.21 An example of a boxplot
EXAMPLE 3.14
Use R to construct a boxplot for the Growth and Value variables
from the introductory case. Interpret the results.
SOLUTION:
Using R
a. Import the Growth_Value data into a data frame (table) and
label it myData.
b. We use the boxplot function. For options within the function,
we use main to provide a title, xlab to label the x-axis, names to
label each variable, horizontal to construct a horizontal boxplot
(as opposed to a vertical boxplot), and col to give color to the
IQR portion. Enter:
> boxplot(myData$Growth, myData$Value, main = "Boxplots for
Growth and Value", xlab = "Annual Returns, 1984-2018 (in
percent)", names = c("Growth", "Value"), horizontal = TRUE,
col = "gold")
Figure 3.22 shows the output that R returns.
page 129
c. To treat outliers, we first extract their values using the out
component of the object returned by the boxplot function and
store them in a new object. Enter:
> outliersGrowth <- boxplot(myData$Growth)$out
> outliersValue <- boxplot(myData$Value)$out
Verify that there is one outlier (79.48) for the Growth variable
and one outlier (−46.52) for the Value variable.
d. One approach to treat outliers in R is to replace them with NAs,
which represent missing values. The %in% operator is for
value matching, and we use it here to find the outliers in the
Growth and Value variables. We then use the ifelse function to
replace the outliers with NAs and store updated data in two
new variables, newGrowth and newValue. Enter:
> myData$newGrowth <- ifelse(myData$Growth %in%
outliersGrowth, NA, myData$Growth)
> myData$newValue <- ifelse(myData$Value %in%
outliersValue, NA, myData$Value)
Verify that the newGrowth observation for year 1999 and the
newValue observation for year 2008 are now NAs. Note: With
outliers replaced with NAs, we have the option to implement the
omission and the imputation strategies for treatment of missing
values described in Section 2.4.
e. Once the outliers are replaced with NAs, we can recalculate
summary measures. We use the summary function to compare
the means of the original variables with outliers and the new
variables without outliers. Enter:
> summary(myData)
Verify that the means of the newGrowth and newValue
variables are 13.210 and 13.149, respectively, and that they are
different from the means of the original Growth and Value
variables.
Summary
The median returns for the two mutual funds are indicated by
the bold and wider vertical lines in Figure 3.22. As we already
found, the median returns for the two funds are similar (15.09%
for Value versus 14.44% for Growth). However, Value has an
outlier on the left-hand side, as indicated by the circle, while
Growth has an outlier on the right-hand side.
For Value, the outlier on the left-hand side coupled with a
median that falls to the right of center in the IQR box suggests
that this distribution is negatively skewed. This is consistent with
the negative skewness coefficient that was calculated for this
variable in Section 3.4.
On the other hand, Growth has an outlier on the right-hand
side with a median that falls to the left of center in the IQR box.
The distribution of Growth is positively skewed. Again, this is
consistent with the positive skewness coefficient that was
calculated for this variable in Section 3.4.
z-Scores
The mean and the standard deviation are the most extensively used
measures of central location and dispersion, respectively. Unlike the
mean, it is not easy to interpret the standard deviation intuitively. All
we can say is that a low value for the standard deviation indicates that
the observations are close to the mean, while a high value for the
standard deviation indicates that the observations are more dispersed
from the mean.
We will first use the empirical rule to make precise statements
regarding the percentage of observations that fall within a specified
number of standard deviations from the mean. We
then compute a z-score that measures the relative
location of an observation and indicates whether it is an outlier.
EXAMPLE 3.15
A large lecture class has 280 students. The professor has
announced that the mean score on an exam is 74 with a
standard deviation of 8. The distribution of scores is bell-shaped.
a. Approximately how many students scored within 58 and 90?
b. Approximately how many students scored more than 90?
SOLUTION:
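A sketch of the reasoning, using the empirical rule (for a bell-shaped distribution, approximately 68%, 95%, and 99.7% of the observations fall within one, two, and three standard deviations of the mean, respectively):
a. The interval from 58 to 90 corresponds to 74 ± 2(8), that is, within two standard deviations of the mean, so approximately 95% of the scores fall in this interval: 0.95 × 280 = 266 students.
b. Approximately 5% of the scores fall outside this interval and, by symmetry, about half of them, or 2.5%, are above 90: 0.025 × 280 = 7 students.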
page 131
Calculating z-Scores
It is often instructive to use the mean and the standard deviation to
find the relative location of an observation. Suppose a student gets a
score of 90 on her accounting exam and a score of 90 on her
marketing exam. While the student’s scores are identical in both
classes, her relative position in these classes may be quite different.
What if the mean score was different in the classes? Even with the
same mean scores, what if the standard deviation was different in the
classes? Both the mean and the standard deviation are needed to
find the relative position of this student in both classes.
We use the z-score to find the relative position of an observation
by dividing the difference of the observation from the mean by the
standard deviation, or, equivalently, z = (x − x̄)/s. A z-score is a unitless
measure; it gives the distance of an observation from the mean in terms
of standard deviations.
EXAMPLE 3.16
The mean and the standard deviation of scores on an
accounting exam are 74 and eight, respectively. The mean and
the standard deviation of scores on a marketing exam are 78
and 10, respectively. Find the z-scores for a student who scores
90 in both classes.
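A sketch of the calculation: for the accounting exam, z = (90 − 74)/8 = 2, so the score is two standard deviations above the mean; for the marketing exam, z = (90 − 78)/10 = 1.2, so the score is only 1.2 standard deviations above the mean. Relative to her classmates, the student performed better on the accounting exam.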
EXAMPLE 3.17
Table 3.18 shows the minimum and maximum observations as
well as the means and standard deviations for the Growth and
Value variables from the introductory case. Calculate the z-
scores for the minimum and the maximum observations for each
variable. Are the results consistent with the boxplots constructed
in Figure 3.22? Explain.
TABLE 3.18 Summary Statistics for the Growth and Value Variables (in %)
page 132
IDENTIFYING OUTLIERS
A boxplot is a convenient way to graphically display the five-
number summary of a variable. If outliers are present, then
they are indicated as asterisks (or another symbol) that are
farther than 1.5 × IQR from the box.
The z-score measures the relative position of an observation
within a distribution and is calculated as z = (x − x̄)/s. If the
z-score of an observation is greater than 3 or less than −3, the
observation is generally considered an outlier.
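In R, z-scores for an entire variable can be computed with the scale function, which by default subtracts the mean and divides by the standard deviation. A minimal sketch for the Growth variable, assuming the myData data frame from earlier examples, is:
> zGrowth <- scale(myData$Growth)
> which(abs(zGrowth) > 3)
The second line returns the positions of any observations that lie more than three standard deviations from the mean.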
EXERCISES 3.5
Applications
53. Consider the following boxplot.
page 133
a. Interpret Q1 and Q3.
b. Calculate the interquartile range. Determine whether any outliers exist.
c. Is the distribution symmetric? If not, comment on its skewness.
56. Consider the following five-number summary for a variable that was obtained using
500 observations.
page 134
Case Study
An investor currently owns real estate in the college town of Blacksburg, Virginia—
home to the Virginia Tech Hokies. He would like to expand his holdings by purchasing
similar rental property in either Athens, Georgia, or Chapel Hill, North Carolina. As a
preliminary step, he would like information on house prices in these two areas. He is
interested in properties that have at least two bedrooms and that are listed for less
than $1,000,000. The following report will summarize previous sales that have
satisfied these criteria.
kali9/Getty Images
TABLE 3.19 Summary Measures for House Prices (in $) in Athens and
Chapel Hill
Table 3.20 shows the relative frequency distribution for the house prices in
both cities. The relative frequency distribution reinforces the findings from the
summary measures; that is, houses in Athens are more affordable than
houses in Chapel Hill. In Athens, 51% of the prices fell in the range $100,000
to $200,000, but only 9% of the prices in Chapel Hill fell in this range.
Moreover, 91% of houses in Athens versus only 51% of houses in Chapel Hill
sold for less than $400,000.
Finally, Figure 3.24 shows the boxplots of house prices for each city. The
boxplots reveal two more major points with respect to house prices in these
two cities:
page 136
Report 3.1 FILE House_Price. Perform a similar analysis to the one conducted in this
section, but choose two other college towns.
Report 3.2 FILE College_Admissions. Use tabular and graphical methods as well as
summary measures to examine the SAT scores of those students who were admitted
to the School of Business & Economics versus those students who were admitted to
the School of Arts & Letters.
Report 3.3 FILE Longitudinal_Survey. Use tabular and graphical methods as well as
summary measures to examine the weight of a respondent depending on whether or
not the respondent is an outgoing or a shy individual.
Report 3.4 FILE TechSales_Reps. Use tabular and graphical methods as well as
summary measures to examine the salaries of sales representatives depending on
their personality types and gender in the software and hardware groups.
page 137
page 138
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 4.1 Describe probability concepts and the
rules of probability.
LO 4.2 Apply the total probability rule and Bayes’
theorem.
LO 4.3 Describe a discrete random variable and
its probability distribution.
LO 4.4 Calculate probabilities for binomial and
Poisson distributions.
LO 4.5 Describe the normal distribution and
calculate its associated probabilities.
page 139
©Seastock/Shutterstock
INTRODUCTORY CASE
Linking Support for Legalizing Marijuana
with Age Group
Support for marijuana legalization in the United States has
grown remarkably over the past few decades. In 1969, when
the question was first presented, only 12% of Americans were
in favor of its legalization. This support increased to over 25%
by the late 1970s. While support was stagnant from 1981 to
1997, the turn of the century brought a renewed interest in its
legalization, with the percentage of Americans in favor
exceeding 30% by 2000 and 40% by 2009.
Alexis Lewis works for a drug policy institute that focuses on
science, health, and human rights. She is analyzing the
demographic breakdown of marijuana supporters. Using results
from a Pew Research Center survey conducted from August
23-September 2, 2016, she has found that support for
marijuana legalization varies considerably depending on a
person’s age group. Alexis compiles information on support
based on age group, as shown in Table 4.1.
page 140
Events
An event is a subset of the sample space. A simple event consists of
just one of the possible outcomes of an experiment. Getting an A in
a course is an example of a simple event. An event may also contain
several outcomes of an experiment. For example, we can define an
event as getting a passing grade in a course; this event is formed by
the subset of outcomes A, B, C, and D.
Events are considered exhaustive if they include all outcomes in
the sample space. In the grade-distribution example, the events of
getting grades A and B are not exhaustive events because they do
not include many feasible grades in the sample space. However, the
events P and F, defined as “pass” and “fail,” respectively, are
exhaustive.
Another important probability concept concerns mutually
exclusive events. For two mutually exclusive events, the occurrence
of one event precludes the occurrence of the other. Going back to
the grade-distribution example, while the events of getting grades A
and B are not exhaustive, they are mutually exclusive because you
cannot possibly get an A as well as a B in the same course. Grades
P and F, on the other hand, are both mutually exclusive and
exhaustive.
COMBINING EVENTS
The union of two events, denoted A ∪ B, is the event
consisting of all outcomes in A or B.
The intersection of two events, denoted A ∩ B, is the event
consisting of all outcomes in A and B.
The complement of event A, denoted Ac, is the event
consisting of all outcomes in the sample space S that are not
in A.
Assigning Probabilities
Now that we have described a valid sample space and the various
ways in which we can define events from that sample space, we are
ready to assign probabilities. When we arrive at a probability, we
generally are able to categorize the probability as a subjective
probability, an empirical probability, or a classical probability.
Regardless of the method used, there are two defining properties of
probability: (1) the probability of any event A is a value between 0
and 1, that is, 0 ≤ P(A) ≤ 1; and (2) the sum of the probabilities of
all simple events in the sample space equals 1.
page 142
CATEGORIZING PROBABILITIES
A subjective probability is calculated by drawing on personal
and subjective judgment.
An empirical probability is calculated as a relative frequency
of occurrence.
A classical probability is based on logical analysis rather than
on observation or personal judgment.
Since empirical and classical probabilities generally do not vary
from person to person, they are often grouped as objective
probabilities.
Rules of Probability
We will now present various rules used to combine probabilities of
events.
The complement rule follows from one of the defining properties
of probability: The sum of probabilities assigned to simple events in a
sample space must equal one. Therefore, for an event A and its
complement Ac, we get P(A) + P(Ac) = 1. Rearranging this equation,
we obtain the complement rule: P(Ac) = 1 − P(A).
page 143
EXAMPLE 4.1
A manager at Moksha Yoga Center believes that 37% of
female and 30% of male open house attendees she contacts
will purchase a membership.
a. What is the probability that a randomly selected female
contacted by the manager will not purchase a membership?
b. What is the probability that a randomly selected male
contacted by the manager will not purchase a membership?
SOLUTION:
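A sketch of the calculations using the complement rule:
a. For a randomly selected female attendee, the probability of not purchasing a membership is 1 − 0.37 = 0.63.
b. For a randomly selected male attendee, the probability of not purchasing a membership is 1 − 0.30 = 0.70.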
EXAMPLE 4.2
Anthony feels that he has a 75% chance of getting an A in
Statistics and a 55% chance of getting an A in Managerial
Economics. He also believes he has a 40% chance of getting
an A in both classes.
a. What is the probability that he gets an A in at least one of
these courses?
b. What is the probability that he does not get an A in either of
these courses?
SOLUTION:
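A sketch of part a, using the addition rule:
a. Let AS and AM denote the events of getting an A in Statistics and in Managerial Economics, respectively, so that P(AS) = 0.75, P(AM) = 0.55, and P(AS ∩ AM) = 0.40. The probability that Anthony gets an A in at least one of these courses is the probability of the union: P(AS ∪ AM) = P(AS) + P(AM) − P(AS ∩ AM) = 0.75 + 0.55 − 0.40 = 0.90.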
page 144
b. The probability that he does not receive an A in
either of these two courses is actually the complement of the
union of the two events; that is, P((AS ∪ AM)c). We calculated
the union in part a, so using the complement rule we have
P((AS ∪ AM)c) = 1 − P(AS ∪ AM) = 1 − 0.90 = 0.10.
Note that for mutually exclusive events A and B, the joint probability
is zero; that is, P(A ∩ B) = 0. We need not concern ourselves with
double-counting, and, therefore, the probability of the union is simply
the sum of the two probabilities.
In business applications, the probability of interest is often a
conditional probability. Examples include the probability that a
customer will make an online purchase conditional on receiving an e-
mail with a discount offer; the probability of making a six-figure salary
conditional on getting an MBA; and the probability that sales will
improve conditional on the firm launching a new innovative product.
Let’s use an example to illustrate the concept of conditional
probability. Suppose the probability that a recent business college
graduate finds a suitable job is 0.80. The probability of finding a
suitable job is 0.90 if the recent business college graduate has prior
work experience. Here, the probability of an event is conditional on
the occurrence of another event. If A represents “finding a job” and B
represents “prior work experience,” then P(A) = 0.80 and the
conditional probability is denoted as P(A ∣ B) = 0.90. In this example,
the probability of finding a suitable job increases from 0.80 to 0.90
when conditioned on prior work experience. In general, the
conditional probability, P(A ∣ B), is greater than the unconditional
probability, P(A), if B exerts a positive influence on A. Similarly, P(A
∣ B) is less than P(A) when B exerts a negative influence on A.
Finally, if B exerts no influence on A, then P(A ∣ B) equals P(A). It is
common to refer to “unconditional probability” simply as “probability.”
We rely on the Venn diagram in Figure 4.2 to explain the
conditional probability. Because P(A ∣ B) represents the probability of
A conditional on B (B has occurred), the original sample space S
reduces to B. The conditional probability P(A ∣ B) is based on the
portion of A that is included in B. It is derived as the ratio of the
probability of the intersection of A and B to the probability of B.
CONDITIONAL PROBABILITY
The probability that A occurs given that B has occurred is
derived as P(A ∣ B) = P(A ∩ B)/P(B).
EXAMPLE 4.3
Economic globalization is defined as the integration of national
economies into the international economy through trade,
foreign direct investment, capital flows, migration, and the
spread of technology. Although globalization is generally
viewed favorably, it also increases the vulnerability of a country
to economic conditions of other countries. An economist
predicts a 60% chance that country A will perform poorly and a
25% chance that country B will perform poorly. There is also a
16% chance that both countries will perform poorly.
a. What is the probability that country A performs poorly given
that country B performs poorly?
page 145
b. What is the probability that country B performs
poorly given that country A performs poorly?
c. Interpret your findings.
SOLUTION:
We first write down the available information in probability
terms. Defining A as “country A performing poorly” and B as
“country B performing poorly,” we have the following
information: P(A) = 0.60, P(B) = 0.25, and P(A ∩ B) = 0.16.
a. P(A ∣ B) = P(A ∩ B)/P(B) = 0.16/0.25 = 0.64.
b. P(B ∣ A) = P(A ∩ B)/P(A) = 0.16/0.60 = 0.2667.
c. The conditional probabilities are greater than the corresponding
unconditional probabilities (0.64 > 0.60 and 0.2667 > 0.25), so poor
performance in one country raises the likelihood of poor performance
in the other; the two economies do not appear to be independent.
EXAMPLE 4.4
A manager believes that 14% of consumers will respond
positively to the firm’s social media campaign. Also, 24% of
those who respond positively will become loyal customers. Find
the probability that the next recipient of their social media
campaign will react positively and will become a loyal
customer.
SOLUTION:
Let the event R represent a consumer who responds positively
to a social media campaign and the event L represent a loyal
customer. Therefore, P(R) = 0.14 and P(L ∣ R) = 0.24. We
calculate the probability that the next recipient of a social media
campaign will react positively and become a loyal customer as
P(R ∩ L) = P(L ∣ R)P(R) = 0.24 × 0.14 = 0.0336.
Of particular interest to researchers is whether or not two events
influence one another. Two events are independent if the
occurrence of one event does not affect the probability of the
occurrence of the other event. Similarly, events are considered
dependent if the occurrence of one is related to the probability of the
occurrence of the other. We generally test for the independence of
two events by comparing the conditional probability of one event, for
instance P(A ∣ B), to the probability, P(A). If these two probabilities
are the same, we say that the two events, A and B, are independent;
if the probabilities differ, the two events are dependent.
page 146
EXAMPLE 4.5
Suppose that for a given year there is a 2% chance that your
desktop computer will crash and a 6% chance that your laptop
computer will crash. Moreover, there is a 0.12% chance that
both computers will crash. Is the reliability of the two computers
independent of each other?
SOLUTION:
Let event D represent the outcome that your desktop crashes
and event L represent the outcome that your laptop crashes.
Therefore, P(D) = 0.02, P(L) = 0.06, and P(D ∩ L) = 0.0012.
The reliability of the two computers is independent because
P(D ∣ L) = P(D ∩ L)/P(L) = 0.0012/0.06 = 0.02 = P(D).
In other words, if your laptop crashes, it does not alter the
probability that your desktop also crashes. Equivalently, we
show that the events are independent because P(D ∩ L) =
P(D)P(L) = 0.0012.
EXERCISES 4.1
Applications
1. Consider the following scenarios to determine if the mentioned combination of
attributes represents a union or an intersection.
a. A marketing firm is looking for a candidate with a business degree and at least
five years of work experience.
b. A family has decided to purchase a Toyota or a Honda.
2. You apply for a position at two firms. Let event A represent the outcome of getting
an offer from the first firm and event B represent the outcome of getting an offer
from the second firm.
a. Explain why events A and B are not exhaustive.
b. Explain why events A and B are not mutually exclusive.
3. The probability that stock A will rise in price is 0.40, and the probability that stock B will rise in
price is 0.60. Further, if stock B rises in price, the probability that stock A will also
rise in price is 0.50.
a. What is the probability that at least one of the stocks will rise in price?
b. Are events A and B mutually exclusive? Explain.
c. Are events A and B independent? Explain.
4. Fraud detection has become an indispensable tool for banks and credit card
companies to combat fraudulent credit card transactions. A fraud detection firm
raises an alarm on 5% of all transactions and on 80% of fraudulent transactions.
What is the probability that the transaction is fraudulent if the firm does not raise
an alarm? Assume that 1% of all transactions are fraudulent.
5. Dr. Miriam Johnson has been teaching accounting for over 20 years. From her
experience, she knows that 60% of her students do homework regularly.
Moreover, 95% of the students who do their homework regularly pass the course.
She also knows that 85% of her students pass the course.
a. What is the probability that a student will do homework regularly and also pass
the course?
b. What is the probability that a student will neither do homework regularly nor
pass the course?
c. Are the events “pass the course” and “do homework regularly” mutually
exclusive? Explain.
d. Are the events “pass the course” and “do homework regularly” independent?
Explain.
6. Mike Danes has been delayed in going to the annual sales event at one of his
favorite apparel stores. His friend has just texted him that there are only 20 shirts
left, of which eight are in size M, 10 in size L, and two in size XL. Also nine of the
shirts are white, five are blue, and the rest are of mixed
colors. Mike is interested in getting a white or a blue shirt in size
L. Define the events A = Getting a white or a blue shirt and B = Getting a shirt in
size L.
a. Find P(A), P(Ac), and P(B).
b. Are the events A and B mutually exclusive? Explain.
c. Would you describe Mike’s preference by the events A ∪ B or A ∩ B?
7. An analyst estimates that the probability of default on a seven-year AA-rated bond
is 0.06, while that on a seven-year A-rated bond is 0.13. The probability that they
will both default is 0.04.
a. What is the probability that at least one of the bonds defaults?
b. What is the probability that neither the seven-year AA-rated bond nor the
seven-year A-rated bond defaults?
c. Given that the seven-year AA-rated bond defaults, what is the probability that
the seven-year A-rated bond also defaults?
8. A manufacturing firm just received a shipment of 20 assembly parts, of slightly
varied sizes, from a vendor. The manager knows that there are only 15 parts in
the shipment that would be suitable. He examines these parts one at a time.
a. Find the probability that the first part is suitable.
b. If the first part is suitable, find the probability that the second part is also
suitable.
c. If the first part is suitable, find the probability that the second part is not suitable.
9. Apple products have become a household name in America, with the average
household owning 2.6 Apple products (CNBC, October 10, 2017). Suppose that in
the Midwest, the likelihood of owning an Apple product is 61% for households with
kids and 48% for households without kids. Suppose there are 1,200 households
in a representative community, of which 820 are with kids and the rest are without
kids.
a. Are the events “household with kids” and “household without kids” mutually
exclusive and exhaustive? Explain.
b. What is the probability that a household is without kids?
c. What is the probability that a household is with kids and owns an Apple
product?
d. What is the probability that a household is without kids and does not own an
Apple product?
10. As part of the 2010 financial overhaul, bank regulators are renewing efforts to
require Wall Street executives to cut back on bonuses (The Wall Street Journal,
March 5, 2019). It is known that 10 out of 15 members of the board of directors of
a company are in favor of the bonus. Suppose two members were randomly
selected by the media.
a. What is the probability that both of them were in favor of the bonus?
b. What is the probability that neither of them was in favor of the bonus?
11. Christine Wong has asked Dave and Mike to help her move into a new apartment
on Sunday morning. She has asked them both, in case one of them does not
show up. From past experience, Christine knows that there is a 40% chance that
Dave will not show up and a 30% chance that Mike will not show up. Dave and
Mike do not know each other and their decisions can be assumed to be
independent.
a. What is the probability that both Dave and Mike will show up?
b. What is the probability that at least one of them will show up?
c. What is the probability that neither Dave nor Mike will show up?
12. It is reported that 85% of Asian, 78% of white, 70% of Hispanic, and 38% of black
children have two parents at home. Suppose there are 500 students in a
representative school, of which 280 are white, 50 are Asian, 100 are Hispanic,
and 70 are black.
a. Are the events “Asian” and “black” mutually exclusive and exhaustive? Explain.
b. What is the probability that a child is not white?
c. What is the probability that a child is white and has both parents at home?
d. What is the probability that a child is Asian and does not have both parents at
home?
13. Surgery for a painful, common back condition often results in significantly reduced
back pain and better physical function than treatment with drugs and physical
therapy. A researcher followed 803 patients, of whom 398 ended up getting
surgery. After two years, of those who had surgery, 63% said they had a major
improvement in their condition, compared with 29% among those who received
nonsurgical treatment.
a. What is the probability that a patient had surgery? What is the probability that a
patient did not have surgery?
b. What is the probability that a patient had surgery and experienced a major
improvement in his or her condition?
c. What is the probability that a patient received nonsurgical treatment and
experienced a major improvement in his or her condition?
14. Despite the availability of several modes of transportation, including metro and
ride-booking services, most people in the Washington, D.C., area continue to
drive their own cars to get around (The Washington Post, June 6, 2019).
According to a survey, 62% of area adults use their own cars daily. Suppose only
38% of the area’s adults under 35 use their own cars daily. It is known that 43% of
the area’s adults are under 35.
a. What is the probability that a Washington, D.C., area adult is under 35 and uses
his/her own car daily?
b. If a Washington, D.C., area adult uses his/her own car daily, what is the
probability that he/she is under 35?
page 148
FIGURE 4.3 The total probability rule: P(A) = P(A ∩ B) + P(A ∩ Bc)
page 149
EXAMPLE 4.6
In a lie-detector test, an individual is asked to answer a series
of questions while connected to a polygraph (lie detector). This
instrument measures and records several physiological
responses of the individual on the basis that false answers will
produce distinctive measurements. Assume that 99% of the
individuals who go in for a polygraph test tell the truth. These
tests are considered to be 95% reliable. In other words, there is
a 95% chance that the test will detect a lie if an individual
actually lies. Let there also be a 0.5% chance that the test
erroneously detects a lie even when the individual is telling the
truth. An individual has just taken a polygraph test and the test
has detected a lie. What is the probability that the individual
was actually telling the truth?
SOLUTION:
First we define some events and their associated probabilities.
Let D and T correspond to the events that the polygraph
detects a lie and that an individual is telling the truth,
respectively. We are given that P(T ) = 0.99, implying that P(T
c) = 1 − 0.99 = 0.01. In addition, we formulate P(D ∣ T c) = 0.95
and P(D ∣ T ) = 0.005.
page 150
The first column presents prior probabilities and the second
column shows related conditional probabilities. We first
compute the denominator of Bayes’ theorem by using the total
probability rule, P(D) = P(D ∩ T ) + P(D ∩ T c). Joint
probabilities are calculated as products of conditional
probabilities with their corresponding prior probabilities. For
instance, in Table 4.2, in order to obtain P(D ∩ T ), we multiply
P(D ∣ T ) with P(T ), which yields P(D ∩ T ) = 0.005 × 0.99 =
0.00495. Similarly, we find P(D ∩ T c) = 0.95 × 0.01 = 0.00950.
Thus, according to the total probability rule, P(D) = 0.00495 +
0.00950 = 0.01445. Finally, by Bayes’ theorem, the probability that
the individual was actually telling the truth given that the test
detected a lie is P(T ∣ D) = P(D ∩ T )/P(D) = 0.00495/0.01445 =
0.3426.
EXAMPLE 4.7
Scott Myers is a security analyst for a telecommunications firm
called Webtalk. Although he is optimistic about the firm’s future,
he is concerned that its stock price will be considerably
affected by the condition of credit flow in the economy. He
believes that the probability is 0.20 that credit flow will improve
significantly, 0.50 that it will improve only marginally, and 0.30
that it will not improve at all. He also estimates that the
probability that the stock price of Webtalk will go up is 0.90 with
significant improvement in credit flow in the economy, 0.40 with
marginal improvement in credit flow in the economy, and 0.10
with no improvement in credit flow in the economy.
a. Based on Scott’s estimates, what is the probability that the
stock price of Webtalk goes up?
b. If we know that the stock price of Webtalk has gone up, what
is the probability that credit flow in the economy has improved
significantly?
SOLUTION:
As always, we first define the relevant events and their
associated probabilities. Let S, M, and N denote significant,
marginal, and no improvement in credit flow, respectively. Then
P(S) = 0.20, P(M) = 0.50, and P(N) = 0.30. In addition, if we
allow G to denote an increase in stock price, we formulate P(G
∣ S) = 0.90, P(G ∣ M) = 0.40, and P(G ∣ N) = 0.10. We need to
calculate P(G) in part a and P(S ∣ G) in part b. Table 4.3 aids in
assigning probabilities.
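A sketch of the calculations based on these probabilities:
a. By the total probability rule, P(G) = P(G ∣ S)P(S) + P(G ∣ M)P(M) + P(G ∣ N)P(N) = 0.90 × 0.20 + 0.40 × 0.50 + 0.10 × 0.30 = 0.18 + 0.20 + 0.03 = 0.41.
b. By Bayes’ theorem, P(S ∣ G) = P(G ∣ S)P(S)/P(G) = 0.18/0.41 = 0.4390. Knowing that the stock price has gone up more than doubles the probability that credit flow has improved significantly, from 0.20 to roughly 0.44.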
page 151
EXAMPLE 4.8
We can now answer the questions posed by Alexis Lewis in the
introductory case of this chapter. Use the information gathered
by Alexis to
a. Formulate relevant conditional, unconditional, and joint
probabilities.
b. Calculate the probability of all Americans who support the
legalization of marijuana.
SOLUTION:
page 152
EXERCISES 4.2
Applications
15. Christine has always been weak in mathematics. Based on her performance prior
to the final exam in Calculus, there is a 40% chance that she will fail the course if
she does not have a tutor. With a tutor, her probability of failing decreases to 10%.
There is only a 50% chance that she will find a tutor at such short notice.
a. What is the probability that Christine fails the course?
b. Christine ends up failing the course. What is the probability that she had found
a tutor?
16. An analyst expects that 20% of all publicly traded companies will experience a
decline in earnings next year. The analyst has developed a ratio to help forecast
this decline. If the company is headed for a decline, there is a 70% chance that
this ratio will be negative. If the company is not headed for a decline, there is a
15% chance that the ratio will be negative. The analyst randomly selects a
company and its ratio is negative. What is the posterior probability that the
company will experience a decline?
17. The State Police are trying to crack down on speeding on a particular portion of
the Massachusetts Turnpike. To aid in this pursuit, they have purchased a new
radar gun that promises greater consistency and reliability. Specifically, the gun
advertises ± one-mile-per-hour accuracy 98% of the time; that is, there is a 0.98
probability that the gun will detect a speeder, if the driver is actually speeding.
Assume there is a 1% chance that the gun erroneously detects a speeder even
when the driver is below the speed limit. Suppose that 95% of the drivers drive
below the speed limit on this stretch of the Massachusetts Turnpike.
a. What is the probability that the gun detects speeding and the driver was
speeding?
b. What is the probability that the gun detects speeding and the driver was not
speeding?
c. Suppose the police stop a driver because the gun detects speeding. What is the
probability that the driver was actually driving below the speed limit?
18. According to data from the National Health and Nutrition Examination Survey,
33% of white, 49.6% of black, 43% of Hispanic, and 8.9% of Asian women are
obese. In a representative town, 48% of women are white, 19% are black, 26%
are Hispanic, and the remaining 7% are Asian.
a. Find the probability that a randomly selected woman in this town is obese.
b. Given that a woman is obese, what is the probability that she is white?
c. Given that a woman is obese, what is the probability that she is black?
d. Given that a woman is obese, what is the probability that she is Asian?
19. An analyst thinks that next year there is a 20% chance that the world economy
will be good, a 50% chance that it will be neutral, and a 30% chance that it will be
poor. She also predicts probabilities that the performance of a start-up firm,
Creative Ideas, will be good, neutral, or poor for each of the economic states of
the world economy. The following table presents probabilities for three states of
the world economy and the corresponding conditional probabilities for Creative
Ideas.
State of the World Economy | Probability of Economic State | Performance of Creative Ideas | Conditional Probability of Creative Ideas
(probability values not shown)
a. What is the probability that the performance of the world economy will be
neutral and that of Creative Ideas will be poor?
b. What is the probability that the performance of Creative Ideas will be poor?
c. The performance of Creative Ideas was poor. What is the probability that the
performance of the world economy had also been poor?
20. A crucial game of the Los Angeles Lakers basketball team depends on the health
of its key player. According to his doctor’s report, there is a 40% chance that he
will be fully fit to play, a 30% chance that he will be somewhat fit to play, and a
30% chance that he will not be able to play at all. The coach has estimated the
chances of winning at 80% if the player is fully fit, 60% if he is somewhat fit, and
40% if he is unable to play.
a. What is the probability that the Lakers will win the game?
b. You have just heard that the Lakers won the game. What is the probability that
the key player had been fully fit to play in the game?
21. A 2015 national survey by the Washington Post–Kaiser Family Foundation finds
that there is a big sex divide between Americans when identifying as feminist or
strong feminist. The results of the survey are shown in the following table. In
addition, per the 2010 U.S. Census Current Population Survey, 50.8% of the
American population is female and 49.2% is male.
Group          Feminist or Strong Feminist
Female         66%
Male           41%
Republican     41%
Democrat       66%
Independent    63%
b. The last column of Table 4.4 shows the calculation for the
variance. We first calculate each xi's squared difference from
the mean, (xi − μ)², weight each squared difference by the
appropriate probability, (xi − μ)²P(X = xi), and then sum these
weighted squared differences. Thus, as shown at the bottom of
the last column, Var(X) = σ² = Σ(xi − μ)²P(X = xi) = 9.97 (in
($1,000s)²). The standard deviation is the positive square root
of the variance.
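The same computation can be scripted for any discrete distribution; the following is a minimal R sketch using hypothetical values x and probabilities p (Table 4.4's actual values are not shown in this excerpt):

# Hypothetical values of X (in $1,000s) and probabilities, for illustration only
x <- c(5, 10, 15)
p <- c(0.3, 0.5, 0.2)
mu       <- sum(x * p)              # expected value E(X) = sum of x * P(X = x)
variance <- sum((x - mu)^2 * p)     # Var(X): weighted squared deviations from the mean
std_dev  <- sqrt(variance)          # standard deviation: positive square root of the variance
mu; variance; std_dev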
EXERCISES 4.3
Applications
23. Fifty percent of the customers who go to Auto Center for tires buy four tires and
30% buy two tires. Moreover, 18% buy fewer than two tires, with 5% buying none.
a. What is the probability that a customer buys three tires?
b. Construct a cumulative probability distribution for the number of tires bought.
24. Jane Wormley is a professor of management at a university. She expects to be
able to use her grant money to fund up to two students for research assistance.
While she realizes that there is a 5% chance that she may not be able to fund any
student, there is an 80% chance that she will be able to fund two students.
a. What is the probability that Jane will fund one student?
b. Construct a cumulative probability distribution of the random variable defined as
the number of students that Jane will be able to fund.
25. A marketing firm is considering making up to three new hires. Given its specific
needs, the management feels that there is a 60% chance of hiring at least two
candidates. There is only a 5% chance that it will not make any hires and a 10%
chance that it will make all three hires.
a. What is the probability that the firm will make at least one hire?
b. Find the expected value and the standard deviation of the number of hires.
26. An appliance store sells additional warranties on its refrigerators. Twenty percent
of the buyers buy the limited warranty for $100 and 5% buy the extended warranty
for $200. What is the expected revenue for the store from the warranty if it sells
120 refrigerators?
27. Organizers of an outdoor summer concert in Toronto are concerned about the
weather conditions on the day of the concert. They will make a profit of $25,000
on a clear day and $10,000 on a cloudy day. They will take a loss of $5,000 if it
rains. The weather channel has predicted a 60% chance of rain on the day of the
concert. Calculate the expected profit from the concert if the likelihood is 10% that
it will be sunny and 30% that it will be cloudy.
28. The manager of a publishing company plans to give a $20,000 bonus to the top
15%, $10,000 to the next 30%, and $5,000 to the next 10% of sales
representatives. If the publishing company has a total of 200 sales
representatives, what is the expected bonus that the company will pay?
29. You are considering buying insurance for your new laptop computer, which you
have recently bought for $1,500. The insurance premium for three years is $80.
Over the three-year period there is an 8% chance that your laptop computer will
require work worth $400, a 3% chance that it will require work worth $800, and a
2% chance that it will completely break down with a scrap value of $100. Should
you buy the insurance?
30. An investor considers investing $10,000 in the stock market. He believes that the
probability is 0.30 that the economy will improve, 0.40 that it will stay the same,
and 0.30 that it will deteriorate. Further, if the economy improves, he expects his
investment to grow to $15,000, but it can also go down to $8,000 if the economy
deteriorates. If the economy stays the same, his investment will stay at $10,000.
What is the expected value of his investment?
A BERNOULLI PROCESS
A Bernoulli process consists of a series of n independent and
identical trials of an experiment such that on each trial:
There are only two possible outcomes, conventionally labeled
success and failure; and
The probabilities of success and failure remain the same from
trial to trial.
For a binomial random variable X, the probability of x successes in n trials is
P(X = x) = [n!/(x!(n − x)!)] p^x (1 − p)^(n − x)
for x = 0, 1, 2, . . . , n. By definition, 0! = 1.
The formula consists of two parts, which we explain with a scenario
where historically 85% of the customers in a store make a purchase.
Suppose we want to compute the probability that exactly one of the
three customers in the store will make a purchase.
The first part of the formula, n!/(x!(n − x)!), counts the number of
sequences that contain x successes in n trials. Here, 3!/(1!2!) = 3
sequences contain exactly 1 success.
The second part of the formula, p^x(1 − p)^(n − x), represents the
probability of any particular sequence with x successes and n − x
failures. For example, we can obtain the probability of one success
in three trials as 0.85 × 0.15 × 0.15 = (0.85)^1 × (0.15)^2 = 0.0191. In
other words, each sequence consisting of 1 success in 3 trials has
a 1.91% chance of occurring.
We obtain the overall probability of getting 1 success in 3 trials as
P(X = 1) = 3 × 0.0191 = 0.0573.
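As a quick check, R's dbinom function (introduced formally in Example 4.12) returns the same binomial probability; a minimal sketch:

# P(X = 1) for n = 3 customers and p = 0.85: the 3 sequences, each with probability 0.0191
dbinom(1, size = 3, prob = 0.85)   # about 0.0574 (0.0573 above reflects intermediate rounding)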
EXAMPLE 4.10
In the United States, about 30% of adults have four-year
college degrees (US Census, July 31, 2018). Suppose five
adults are randomly selected.
a. What is the probability that none of the adults has a college
degree?
b. What is the probability that no more than two of the adults
have a college degree?
c. What is the probability that at least two of the adults have a
college degree?
d. Calculate the expected number of adults with a college
degree.
SOLUTION: First, this problem satisfies the conditions for a
Bernoulli process with a random selection of five adults, n = 5.
Here, an adult either has a college degree, with probability p =
0.30, or does not have a college degree, with probability 1 − p
= 1 − 0.30 = 0.70.
a. In order to find the probability that none of the adults has a
college degree, we let x = 0 and find P(X = 0) = (0.70)^5 ≈ 0.1681.
From a random sample of five adults, there is a
47.17% likelihood that at least two adults will have a college
degree.
d. We calculate the expected number of adults with a college
degree as E(X) = np = 5 × 0.30 = 1.5.
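The intermediate computations for parts a through d are not all reproduced in this excerpt; a minimal R sketch that generates them is:

# Binomial probabilities with n = 5 adults and p = 0.30
dbinom(0, 5, 0.30)        # a. P(X = 0), about 0.1681
pbinom(2, 5, 0.30)        # b. P(X <= 2), about 0.8369
1 - pbinom(1, 5, 0.30)    # c. P(X >= 2), about 0.4718, the 47.17% likelihood noted above
5 * 0.30                  # d. E(X) = np = 1.5 adults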
A POISSON PROCESS
An experiment satisfies a Poisson process if
The number of successes within a specified time or space
interval equals any integer between zero and infinity.
The numbers of successes counted in nonoverlapping intervals
are independent.
The probability of success in any interval is the same for all
intervals of equal size and is proportional to the size of the
interval.
EXAMPLE 4.11
Anne is concerned about staffing needs at the Starbucks that
she manages. She believes that the typical Starbucks
customer averages 18 visits to the store over a 30-day month.
a. How many visits should Anne expect in a 5-day period from a
typical Starbucks customer?
b. What is the probability that a customer visits the chain five
times in a 5-day period?
c. What is the probability that a customer visits the chain no
more than two times in a 5-day period?
d. What is the probability that a customer visits the chain at least
three times in a 5-day period?
SOLUTION: In applications of the Poisson distribution, we first
determine the mean number of successes in the relevant time
or space interval. We use the Poisson process condition that
the probability that success occurs in any interval is the same
for all intervals of equal size and is proportional to the size of
the interval. Here, the relevant mean will be based on the rate
of 18 visits over a 30-day month.
a. Given the rate of 18 visits over a 30-day month, we can write
the mean for the 30-day period as μ30 = 18. For this problem,
we compute the proportional mean for a 5-day period as μ5 =
(5/30) × 18 = 3 visits.
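Parts b through d are not reproduced in this excerpt; with μ5 = 3, they can be sketched with R's Poisson functions (introduced formally in Example 4.13):

# Poisson probabilities with a mean of 3 visits per 5-day period
dpois(5, 3)          # b. P(X = 5), about 0.1008
ppois(2, 3)          # c. P(X <= 2), about 0.4232
1 - ppois(2, 3)      # d. P(X >= 3), about 0.5768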
EXAMPLE 4.12
People turn to social media to stay in touch with friends and
family members, connect with old friends, catch the news, look
for employment, and be entertained. According to a recent
survey, 68% of all U.S. adults are Facebook users. Consider a
sample of 100 randomly selected American adults.
a. What is the probability that exactly 70 American adults are
Facebook users?
b. What is the probability that no more than 70 American adults
are Facebook users?
c. What is the probability that at least 70 American adults are
Facebook users?
SOLUTION:
We let X denote the number of American adults who are
Facebook users. We also know that p = 0.68 and n = 100.
Using Excel
We use Excel’s BINOM.DIST function to calculate binomial
probabilities. In order to find P(X = x), we enter
=BINOM.DIST(x, n, p, 0) where x is the number of successes,
n is the number of trials, and p is the probability of success. If
we enter a “1” for the last argument in the function, then Excel
returns P(X ≤ x).
a. In order to find the probability that exactly 70 American adults
are Facebook users, P(X = 70), we enter =BINOM.DIST(70,
100, 0.68, 0) and Excel returns 0.0791.
b. In order to find the probability that no more than 70 American
adults are Facebook users, P(X ≤ 70), we enter
=BINOM.DIST(70, 100, 0.68, 1) and Excel returns 0.7007.
c. In order to find the probability that at least 70 American adults
are Facebook users, P(X ≥ 70) = 1 − P(X ≤ 69), we enter =1-
BINOM.DIST(69, 100, 0.68, 1) and Excel returns 0.3784.
Using R
We use R’s dbinom and pbinom functions to calculate
binomial probabilities. In order to calculate P(X = x), we enter
dbinom(x, n, p) where x is the number of successes, n is the
number of trials, and p is the probability of success. In order to
calculate P(X ≤ x), we enter pbinom(x, n, p).
a. In order to find P(X = 70), we enter:
> dbinom(70, 100, 0.68)
And R returns: 0.07907911.
b. In order to find P(X ≤ 70), we enter:
> pbinom(70, 100, 0.68)
And R returns: 0.7006736.
c. In order to find P(X ≥ 70) = 1 − P(X ≤ 69), we enter:
> 1 - pbinom(69, 100, 0.68)
And R returns: 0.3784055.
EXAMPLE 4.13
The sales volume of craft beer continues to grow, amounting to
24% of the total beer market in the U.S. (USA Today, April 2,
2019). It has been estimated that 1.5 craft breweries open
every day. Assume this number represents an average that
remains constant over time.
a. What is the probability that no more than 10 craft breweries
open every week?
b. What is the probability that exactly 10 craft breweries open
every week?
SOLUTION: We let X denote the number of craft breweries that
open every week and compute the weekly mean, μ = 1.5 × 7 =
10.5.
Using Excel
We use Excel’s POISSON.DIST function to calculate Poisson
probabilities. In order to find P(X = x), we enter
=POISSON.DIST(x, μ, 0) where x is the number of successes
over some interval and μ is the mean over this interval. If we
enter a “1” for the last argument in the function, then Excel
returns P(X ≤ x).
a. In order to find the probability that no more than 10 craft
breweries open every week, P(X ≤ 10), we enter
=POISSON.DIST(10, 10.5, 1) and Excel returns 0.5207.
b. In order to find the probability that exactly 10 craft breweries
open every week, P(X = 10), we enter =POISSON.DIST(10,
10.5, 0) and Excel returns 0.1236.
Using R
We use R’s dpois and ppois functions to calculate Poisson
probabilities. In order to calculate P(X = x), we enter dpois(x, μ)
where x is the number of successes over some interval and μ
is the mean over this interval. In order to calculate P(X ≤ x), we
enter ppois(x, μ).
a. In order to find P(X ≤ 10), we enter:
> ppois(10, 10.5)
And R returns: 0.5207381.
b. In order to find P(X = 10), we enter:
> dpois(10, 10.5)
And R returns: 0.1236055.
EXERCISES 4.4
Applications
31. At a local community college, 40% of students who enter the college as freshmen
go on to graduate. Ten freshmen are randomly selected.
a. What is the probability that none of them graduates from the local community
college?
b. What is the probability that at most nine will graduate from the local community
college?
c. What is the expected number that will graduate?
32. As of 2018, 30% of Americans have confidence in U.S. banks, which is still below
the pre-recession level of 41% reported in June 2007 (www.gallup.com, June 28,
2018).
a. What is the probability that fewer than half of four Americans in 2018 have
confidence in U.S. banks?
b. What would have been the corresponding probability in 2007?
33. Approximately 45% of Baby Boomers—those born between 1946 and 1964—are
still in the workforce (www.pewresearch.org, May 11, 2015). Six Baby Boomers
are selected at random.
a. What is the probability that exactly one of the Baby Boomers is still in the
workforce?
b. What is the probability that at least five of the Baby Boomers
are still in the workforce?
c. What is the probability that less than two of the Baby Boomers are still in the
workforce?
d. What is the probability that more than the expected number of the Baby
Boomers are still in the workforce?
34. According to the Census Bureau projections, Hispanics will comprise 28.6% of the
total population in 2060 (CNN, March 6, 2019). For comparison, they comprised
just 18.1% of the total population in 2018.
a. What is the expected value and the standard deviation of Hispanics in a random
sample of 5,000 people in 2018?
b. What is the corresponding expected value and the standard deviation projected
for 2060?
35. The arduous task of combing and tying up long hair, together with a desire to
assimilate, has led approximately 25% of Sikh youths to give up wearing turbans.
a. What is the probability that exactly two in a random sample of five Sikh youths
wear a turban?
b. What is the probability that two or more in a random sample of five Sikh youths
wear a turban?
c. What is the probability that more than the expected number of Sikh youths wear
a turban in a random sample of five Sikh youths?
d. What is the probability that more than the expected number of Sikh youths wear
a turban in a random sample of 10 Sikh youths?
36. Researchers from leading universities have shown that divorces are contagious
(Chicago Tribune, August 16, 2018). A split-up between immediate friends
increases a person’s own chances of getting divorced from 36% to 63%, an
increase of 75%.
a. Compute the probability that more than half of four randomly selected
marriages will end in divorce if the couple’s immediate friends have split up.
b. Redo part a if it is known that none of the couple’s immediate friends has split
up.
37. Sixty percent of a firm’s employees are men. Suppose four of the firm’s
employees are randomly selected.
a. What is more likely, finding three men and one woman or two men and two
women?
b. Do you obtain the same answer as in part a if 70% of the firm’s employees had
been men?
38. The principal of an architecture firm tells her client that there is at least a 50%
chance of having an acceptable design by the end of the week. She knows that
there is only a 25% chance that any one designer would be able to do so by the
end of the week.
a. Would she be correct in her statement to the client if she asks two of her
designers to work on the design, independently?
b. If not, what if she asks three of her designers to work on the design,
independently?
39. Suppose 40% of recent college graduates plan on pursuing a graduate degree.
Fifteen recent college graduates are randomly selected.
a. What is the probability that no more than four of the college graduates plan to
pursue a graduate degree?
b. What is the probability that exactly seven of the college graduates plan to
pursue a graduate degree?
c. What is the probability that at least six but no more than nine of the college
graduates plan to pursue a graduate degree?
40. A manager at 24/7 Fitness Center is strategic about contacting open house
attendees. With her strategy, she believes that 40% of the attendees she contacts
will purchase a club membership. Suppose she contacts 20 open house
attendees.
a. What is the probability that exactly 10 of the attendees will purchase a club
membership?
b. What is the probability that no more than 10 of the attendees will purchase a
club membership?
c. What is the probability that at least 15 of the attendees will purchase a club
membership?
41. Fraud detection has become an indispensable tool for banks and credit card
companies to combat fraudulent credit card transactions. A fraud detection firm
has detected some form of fraudulent activity in 1.31% of transactions, and serious
fraudulent activity in 0.87% of transactions. Assume that these rates remain stable
over time.
a. What is the probability that fewer than 2 out of 100 transactions are fraudulent?
b. What is the probability that fewer than 2 out of 100 transactions are seriously
fraudulent?
42. New Age Solar installs solar panels for residential homes. Because of the
company’s personalized approach, it averages three home installations daily.
a. What is the probability that New Age Solar installs solar panels in at most four
homes in a day?
b. What is the probability that New Age Solar installs solar panels in at least three
homes in a day?
43. On average, there are 12 potholes per mile on a particular stretch of the state
highway. Suppose the potholes are distributed evenly on the highway.
a. Find the probability of finding fewer than two potholes in a quarter-mile stretch
of the highway.
b. Find the probability of finding more than one pothole in a quarter-mile stretch of
the highway.
44. A tollbooth operator has observed that cars arrive randomly at an average rate of
360 cars per hour.
a. Find the probability that two cars arrive during a specified one-minute period.
b. Find the probability that at least two cars arrive during a
specified one-minute period.
c. Find the probability that 40 cars arrive between 10:00 am and 10:10 am.
45. A textile manufacturing process finds that on average, two flaws occur per every
50 yards of material produced.
a. What is the probability of exactly two flaws in a 50-yard piece of material?
b. What is the probability of no more than two flaws in a 50-yard piece of material?
c. What is the probability of no flaws in a 25-yard piece of material?
46. Motorists arrive at a Gulf gas station at the rate of two per minute during morning
hours.
a. What is the probability that more than two motorists will arrive at the Gulf gas
station during a one-minute interval in the morning?
b. What is the probability that exactly six motorists will arrive at the Gulf gas
station during a five-minute interval in the morning?
c. How many motorists can an employee expect in her three-hour morning shift?
47. Airline travelers should be ready to be more flexible as airlines once again cancel
thousands of flights this summer. The Coalition for Airline Passengers Rights,
Health, and Safety averages 400 calls a day to help stranded travelers deal with
airlines. Suppose the hotline is staffed for 16 hours a day.
a. Calculate the average number of calls in a one-hour interval, 30-minute interval,
and 15-minute interval.
b. What is the probability of exactly six calls in a 15-minute interval?
c. What is the probability of no calls in a 15-minute interval?
d. What is the probability of at least two calls in a 15-minute interval?
48. According to the Centers for Disease Control and Prevention (CDC), the aging of
the U.S. population is translating into many more visits to doctors’ offices and
hospitals. It is estimated that an average person makes four visits a year to
doctors’ offices and hospitals.
a. What are the mean and the standard deviation of an average person’s number
of monthly visits to doctors’ offices and hospitals?
b. What is the probability that an average person does not make any monthly
visits to doctors’ offices and hospitals?
c. What is the probability that an average person makes at least one monthly visit
to doctors’ offices and hospitals?
49. Last year, there were 24,584 age-discrimination claims filed with the Equal
Employment Opportunity Commission. Assume there were 260 working days in
the fiscal year for which a worker could file a claim.
a. Calculate the average number of claims filed on a working day.
b. What is the probability that exactly 100 claims were filed on a working day?
c. What is the probability that no more than 100 claims were filed on a working
day?
50. American adults are watching significantly less television than they did in previous
decades. In 2016, Nielsen reported that American adults are watching an average
of five hours and four minutes, or 304 minutes, of television per day.
a. Find the probability that an average American adult watches more than 320
minutes of television per day.
b. Find the probability that an average American adult watches more than 2,200
minutes of television per week.
TABLE 4.5 Portion of the z table (the arrows point from 1.5 in the z column and 0.02 in the z row to the cumulative probability 0.9357)
The first column of the table, denoted as the z column, shows values
of z up to the tenths place, while the first row of the table,
denoted as the z row, shows the hundredths place. Thus, for z = 1.52,
we match 1.5 on the z column with 0.02 on the z row to find a
corresponding probability of 0.9357. The arrows in Table 4.5 indicate
that P(Z ≤ 1.52) = 0.9357. Note that the area to the right of 1.52 can
be computed as P(Z > 1.52) = 1 − P(Z ≤ 1.52) = 1 − 0.9357 =
0.0643.
Suppose we want to find P(Z ≤ −1.96). Because z is a negative
value, we can look up this probability from the left-hand page of the z
table in Appendix D to find P(Z ≤ −1.96) = 0.0250. As before, the
area to the right of −1.96 can be computed as P(Z > −1.96) = 1 −
P(Z ≤ −1.96) = 1 − 0.0250 = 0.9750.
So far we have computed probabilities for given z values. Now
we will evaluate z values for given cumulative probabilities;
noncumulative probabilities can be evaluated using symmetry.
Suppose we need to find z given P(Z ≤ z) = 0.6808.
Because the probability is already in a cumulative format—that is,
P(Z ≤ z) = 0.6808—we simply look up 0.6808 from the body of the
table (right-hand side) to find the corresponding z value from the
row/column of z. Table 4.6 shows the relevant portion of the z table.
Therefore, z = 0.47.
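These table lookups can also be reproduced in software; for instance, R's pnorm and qnorm functions, used later in this section, return the same values (a minimal sketch):

pnorm(1.52)      # P(Z <= 1.52), about 0.9357
pnorm(-1.96)     # P(Z <= -1.96), about 0.0250
qnorm(0.6808)    # z such that P(Z <= z) = 0.6808, about 0.47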
We can compute this probability using the z table as P(Z ≤ 1.5) − P(Z < −0.5) =
0.9332 − 0.3085 = 0.6247.
EXAMPLE 4.15
Scores on a management aptitude examination are normally
distributed with a mean of 72 and a standard deviation of 8.
a. What is the lowest score that will place a manager in the top
10% (90th percentile) of the distribution?
b. What is the highest score that will place a manager in the
bottom 25% (25th percentile) of the distribution?
SOLUTION: Let X represent scores on a management aptitude
examination with μ = 72 and σ = 8. We will use the inverse
transformation x = μ + zσ to solve these problems.
a. The 90th percentile is a numerical value x such that P(X < x) =
0.90. We look up 0.90 (or the closest value to 0.90) in the z
table (right-hand side) to get z = 1.28 and use the inverse
transformation to find x = 72 + 1.28(8) = 82.24. Therefore, a
score of 82.24 or higher will place a manager in the top 10%
of the distribution (see Figure 4.7).
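Part b is not shown in this excerpt; both percentiles can also be obtained directly with R's qnorm function (a minimal sketch):

qnorm(0.90, mean = 72, sd = 8)   # a. 90th percentile, about 82.25 (82.24 with the rounded z = 1.28)
qnorm(0.25, mean = 72, sd = 8)   # b. 25th percentile, about 66.60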
EXAMPLE 4.16
The Vanguard Balanced Index Fund seeks to maintain an
allocation of 60% to stocks and 40% to bonds. With low fees
and a consistent investment approach, this fund ranks fourth
out of 792 funds that allocate 50% to 70% to stocks (US News,
March 2017). Based on historical data, the expected return and
standard deviation of this fund are estimated as 7.49% and
6.41%, respectively. Assume that the fund returns are stable
and are normally distributed.
a. What is the probability that the fund will generate a return
between 5% and 10%?
b. What is the lowest return of the fund that will place it in the top
10% (90th percentile) of the distribution?
SOLUTION: We let X denote the return on the Vanguard
Balanced fund. We know that X is normally distributed with μ =
7.49 and σ = 6.41.
As mentioned in Chapters 2 and 3, due to different fonts
and type settings, copying and pasting Excel or R functions
from this text directly into Excel or R may cause errors. When
such errors occur, you may need to replace special characters
such as quotation marks and parentheses or delete extra
spaces in the functions.
Using Excel
We use Excel’s NORM.DIST and NORM.INV functions to solve
problems pertaining to the normal distribution. In order to find
P(X ≤ x), we enter =NORM.DIST(x, μ, σ, 1) where x is the
value for which we want to evaluate the cumulative probability,
μ is the mean of the distribution, and σ is the standard
deviation of the distribution. (If we enter “0” for the last
argument in the function, then Excel returns the height of the
normal distribution at the point x. This feature is useful if we
want to plot the normal distribution.) If we want to find a
particular x value for a given cumulative probability
(cumulprob), then we enter =NORM.INV(cumulprob, μ, σ).
a. In order to find the probability of a return between 5% and
10%, P(5 ≤ X ≤ 10), we enter =NORM.DIST(10, 7.49, 6.41, 1)
− NORM.DIST(5, 7.49, 6.41, 1). Excel returns 0.3035.
b. In order to find the lowest return that will place it in the top
10% (90th percentile) of the distribution, P(X > x) = 0.10, we
enter =NORM.INV(0.90, 7.49, 6.41). Excel returns 15.7047.
Using R
We use R’s pnorm and qnorm functions to solve problems
associated with the normal distribution. In order to find P(X ≤
x), we enter pnorm(x, μ, σ, lower.tail = TRUE) where x is the
value for which we want to evaluate the cumulative probability,
μ is the mean of the distribution, and σ is the standard
deviation of the distribution. If we enter “lower.tail=FALSE” for
the last argument in the function, then R returns P(X > x). If we
want to find a particular x value for a given cumulative
probability (cumulprob), then we enter qnorm(cumulprob, μ, σ).
a. In order to find P(5 ≤ X ≤ 10), we enter:
> pnorm(10, 7.49, 6.41, lower.tail=TRUE) - pnorm(5, 7.49,
6.41, lower.tail=TRUE)
And R returns: 0.3034746.
b. In order to solve for x to satisfy P(X > x) = 0.10,
we enter:
> qnorm(0.90, 7.49, 6.41)
And R returns: 15.70475.
EXERCISES 4.5
Applications
51. The historical returns on a balanced portfolio have had an average return of 8%
and a standard deviation of 12%. Assume that returns on this portfolio follow a
normal distribution.
a. What percentage of returns were greater than 20%?
b. What percentage of returns were below −16%?
52. A professional basketball team averages 105 points per game with a standard
deviation of 10 points. Assume points per game follow the normal distribution.
a. What is the probability that a game’s score is between 85 and 125 points?
b. What is the probability that a game’s score is more than 125 points? If there are
82 games in a regular season, in how many games will the team score more
than 125 points?
53. The average rent in a city is $1,500 per month with a standard deviation of $250.
Assume rent follows the normal distribution.
a. What percentage of rents are between $1,250 and $1,750?
b. What percentage of rents are less than $1,250?
c. What percentage of rents are greater than $2,000?
54. Suppose that the miles-per-gallon (mpg) rating of passenger cars is a normally
distributed random variable with a mean and a standard deviation of 33.8 mpg
and 3.5 mpg, respectively.
a. What is the probability that a randomly selected passenger car gets at least 40
mpg?
b. What is the probability that a randomly selected passenger car gets between 30
and 35 mpg?
c. An automobile manufacturer wants to build a new passenger car with an mpg
rating that improves upon 99% of existing cars. What is the minimum mpg that
would achieve this goal?
55. Business interns from a small private college earn an average annual salary of
$43,000. Let the salary be normally distributed with a standard deviation of
$18,000.
a. What percentage of business interns make between $40,000 and $50,000?
b. What percentage of business interns make more than $80,000?
56. A financial advisor informs a client that the expected return on a portfolio is 8%
with a standard deviation of 12%. There is a 15% chance that the return would be
above 16%. If the advisor is right about her assessment, is it reasonable to
assume that the underlying return distribution is normal?
57. According to a company’s website, the top 25% of the candidates who take the
entrance test will be called for an interview. You have just been called for an
interview. The reported mean and standard deviation of the test scores are 68 and
8, respectively. What is the possible range for your test score if you assume that
the scores are normally distributed?
58. The time required to assemble an electronic component is normally distributed
with a mean and a standard deviation of 16 minutes and 4 minutes, respectively.
a. Find the probability that a randomly picked assembly takes between 10 and 20
minutes.
b. It is unusual for the assembly time to be above 24 minutes or below 6 minutes.
What proportion of assembly times falls in these unusual categories?
59. According to the Mortgage Banking Association, loans that are 90 days or more
past due are considered seriously delinquent (housingwire.com, May 14, 2019). It
has been reported that the rate of seriously delinquent loans has an average of
9.1%. Let the rate of seriously delinquent loans follow a normal distribution with a
standard deviation of 0.80%.
a. What is the probability that the rate of seriously delinquent loans is above 8%?
b. What is the probability that the rate of seriously delinquent loans is between
9.5% and 10.5%?
60. The manager of a night club in Boston stated that 95% of the customers are
between the ages of 22 and 28 years. If the age of customers is normally
distributed with a mean of 25 years, calculate its standard deviation.
61. An estimated 1.8 million students take on student loans to pay ever-rising tuition
and room and board. It is also known that the average cumulative debt of recent
college graduates is about $22,500. Let the cumulative debt among recent college
graduates be normally distributed with a standard deviation of $7,000.
Approximately how many recent college graduates have accumulated student
loans of more than $30,000?
62. On average, an American professional football game lasts about three hours,
even though the ball is actually in play only 11 minutes (SB Nation, April 1, 2019).
Assume that game times are normally distributed with a standard deviation of 0.4
hour.
a. Find the probability that a game lasts less than 2.5 hours.
b. Find the probability that a game lasts either less than 2.5 hours or more than
3.5 hours.
c. Find the maximum value for the game time that will place it in
the bottom 1% of the distribution.
63. A young investment manager tells his client that the probability of making a
positive return with his suggested portfolio is 90%. If it is known that returns are
normally distributed with a mean of 5.6%, what is the risk, measured by standard
deviation, that this investment manager assumes in his calculation?
64. You are considering the risk-return profile of two mutual funds for investment. The
relatively risky fund promises an expected return of 8% with a standard deviation
of 14%. The relatively less risky fund promises an expected return and standard
deviation of 4% and 5%, respectively. Assume that the returns are approximately
normally distributed.
a. Which mutual fund will you pick if your objective is to minimize the probability of
earning a negative return?
b. Which mutual fund will you pick if your objective is to maximize the probability of
earning a return above 8%?
65. A new car battery is sold with a two-year warranty whereby the owner gets the
battery replaced free of cost if it breaks down during the warranty period. Suppose
an auto store makes a net profit of $20 on batteries that stay trouble-free during
the warranty period; it makes a net loss of $10 on batteries that break down. The
life of batteries is known to be normally distributed with a mean and a standard
deviation of 40 and 16 months, respectively.
a. What is the probability that a battery will break down during the warranty
period?
b. What is the expected profit of the auto store on a battery?
c. What is the expected monthly profit on batteries if the auto store sells an
average of 500 batteries a month?
Case Study
Professor Lang is a professor of economics at Salem State University. She has been
teaching a course in Principles of Economics for over 25 years. Professor Lang has
never graded on a curve because she believes that relative grading may unduly
penalize or benefit a student in an unusually strong or weak class. She always uses
an absolute scale for making grades, as shown in the two left columns of Table 4.7.
TABLE 4.7 Grading Scales with Absolute Grading versus Relative Grading
Grade   Absolute Grading (score)   Relative Grading (probability)
A       92 and above               0.10
B       78 up to 92                0.35
C       64 up to 78                0.40
D       58 up to 64                0.10
F       Below 58                   0.05
TABLE 4.8
Grade   Probability with Absolute Grading   Probability with Relative Grading
A       0.14                                0.10
B       0.38                                0.35
C       0.36                                0.40
D       0.07                                0.10
F       0.05                                0.05
The second column of Table 4.8 shows that 14% of students are expected
to receive A’s, 38% B’s, and so on. Although these numbers are generally
consistent with the relative scale restated in the third column of Table 4.8, it
appears that the relative scale makes it harder for students to get higher
grades. For instance, 14% get A’s with the absolute scale compared to only
10% with the relative scale.
Alternatively, we can compare the two grading methods on the basis of the
range of scores for various grades. The second column of Table 4.9 restates
the range of scores based on absolute grading. In order to obtain the range of
scores based on relative grading, it is once again necessary to apply
concepts from the normal distribution. For instance, the minimum score
required to earn an A with relative grading is derived by solving for x in P(X ≥
x) = 0.10. Because P(X ≥ x) = 0.10 is equivalent to P(Z ≥ z) = 0.10, it follows
that z = 1.28. Inserting the proper values of the mean, the standard deviation,
and z into x = μ + zσ yields a value of x equal to 94.47. Ranges for other
grades, derived similarly, are presented in the third column of Table 4.9.
TABLE 4.9
Grade   Absolute Grading   Relative Grading
A       92 and above       94.47 and above
B       78 up to 92        80.21 up to 94.47
C       64 up to 78        65.70 up to 80.21
D       58 up to 64        58.20 up to 65.70
F       Below 58           Below 58.20
Once again comparing the results in Table 4.9, the use of the relative scale
makes it harder for students to get higher grades in Professor Lang’s courses.
For instance, in order to receive an A with relative grading, a student must
have a score of at least 94.47 versus a score of at least 92 with absolute
grading. Both absolute and relative grading methods have their merits and
teachers often make the decision on the basis of their teaching philosophy.
However, if Professor Lang wants to keep the grades consistent with her
earlier absolute scale, she should base her relative scale on the probabilities
computed in the second column of Table 4.8.
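The cutoff calculations behind Table 4.9 can be scripted with qnorm. The following is a minimal R sketch; mu and sigma stand for the class mean and standard deviation used by Professor Lang, whose values are not shown in this excerpt and are set to placeholder numbers here purely for illustration:

# Hypothetical placeholders for the class mean and standard deviation (illustration only)
mu    <- 75
sigma <- 10
# Relative-grading cutoffs: the minimum scores for A, B, C, and D correspond to the
# upper 10%, 45%, 85%, and 95% of the normal distribution, per Table 4.8's relative scale
cutoffs <- qnorm(c(A = 0.90, B = 0.55, C = 0.15, D = 0.05), mean = mu, sd = sigma)
cutoffs   # each cutoff equals mu + z*sigma for the corresponding grade boundary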
Report 4.3. Akiko Hamaguchi is the manager of a small sushi restaurant called Little
Ginza in Phoenix, Arizona. As part of her job, Akiko has to purchase salmon every
day for the restaurant. For the sake of freshness, it is important that she buys the
right amount of salmon daily. Buying too much may result in
wastage and buying too little may disappoint some customers on
high-demand days.
Akiko has estimated that the daily consumption of salmon is normally distributed
with a mean of 12 pounds and a standard deviation of 3.2 pounds. She has always
bought 20 pounds of salmon every day. Lately, she has been criticized by the owners
because this amount of salmon too often resulted in wastage. As part of cost
cutting, Akiko is considering a new strategy. She will buy salmon that is sufficient to
meet the daily demand of customers on 90% of the days.
In a report, help Akiko use the above information to
Calculate the probability that the demand for salmon at Little Ginza is above 20
pounds.
Calculate the probability that the demand for salmon at Little Ginza is below 15
pounds.
Determine the amount of salmon that should be bought daily so that the restaurant
meets demand on 90% of the days.
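A minimal R sketch of the three calculations, using the stated mean of 12 pounds and standard deviation of 3.2 pounds, is shown below; the report itself should interpret these numbers in context:

1 - pnorm(20, mean = 12, sd = 3.2)   # P(demand > 20 pounds), about 0.006
pnorm(15, mean = 12, sd = 3.2)       # P(demand < 15 pounds), about 0.826
qnorm(0.90, mean = 12, sd = 3.2)     # purchase amount that meets demand on 90% of days, about 16.1 pounds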
5 Statistical Inference
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 5.1 Describe the sampling distribution of the
sample mean.
LO 5.2 Describe the sampling distribution of the
sample proportion.
LO 5.3 Construct a confidence interval for the
population mean.
LO 5.4 Construct a confidence interval for the
population proportion.
LO 5.5 Conduct a hypothesis test for the
population mean.
LO 5.6 Conduct a hypothesis test for the
population proportion.
INTRODUCTORY CASE
Undergraduate Study Habits
Are today’s college students studying hard or hardly studying?
A study asserts that, over the past six decades, the number of
hours that the average college student studies each week has
been steadily dropping (The Wall Street Journal, April 10,
2019). In 1961, students invested 24 hours per week in their
academic pursuits, whereas today’s students study an average
of 14 hours per week.
Susan Knight is a dean at a large university in California.
She wonders if the study trend is reflective of the students at
her university. She randomly selects 35 students and asks their
average study time per week (in hours). A portion of the
responses is shown in Table 5.1.
FILE
Study_Hours
25
19
⋮
16
LO 5.1
Describe the sampling distribution of the sample mean.
c. The expected value of the sample mean for both
sample sizes is identical to the expected value of the
individual pizza. However, the standard error of the sample
mean with n = 4 is lower than the one with n = 2. For both
sample sizes, the standard error of the sample mean is lower
than the standard deviation of the individual pizza. This result
confirms that averaging reduces variability.
EXAMPLE 5.2
Use the information in Example 5.1 to answer the following
questions:
a. What is the probability that a randomly selected pizza is less
than 15.5 inches?
b. What is the probability that the average of two randomly
selected pizzas is less than 15.5 inches?
c. What is the probability that the average of four randomly
selected pizzas is less than 15.5 inches?
d. Comment on the computed probabilities.
In a random sample
that of X.
EXAMPLE 5.3
For the month of May, a coffee chain advertised a Happy Hour
between the hours of 3 pm and 5 pm when customers could
enjoy a half-price iced coffee. The manager of one of the
chains wants to determine if the Happy Hour has had a
lingering effect on the amount of money customers now spend
on iced coffee.
Before the marketing campaign, customers spent an
average of $4.18 on iced coffee with a standard deviation of
$0.84. Based on 50 customers sampled after the marketing
campaign, the average amount spent is $4.26. If the coffee
chain chose not to pursue the marketing campaign, how likely
is it that customers will spend an average of $4.26 or more on
iced coffee?
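The worked solution is not reproduced in this excerpt; the probability in question follows from the sampling distribution of the sample mean and can be sketched in R:

# Standard error of the sample mean for n = 50 customers
se <- 0.84 / sqrt(50)                    # about 0.119
1 - pnorm(4.26, mean = 4.18, sd = se)    # P(sample mean >= 4.26), about 0.25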
Our discussion thus far has focused on the population mean, but in
many business applications, we are concerned with the population
proportion. For instance, a banker is interested in the default
probability of mortgage holders. Or an online retailer cares about the
proportion of customers who make a purchase after receiving a
promotional e-mail. In these examples, the parameter of interest is
the population proportion p.
As in the case of the population mean, we almost always make
inferences about the population proportion on the basis of sample
data. Here, the relevant statistic (estimator) is the sample proportion,
and a particular value of it is the estimate.
EXAMPLE 5.4
A study found that 55% of British firms experienced a cyber-
attack in the past year (BBC, April 23, 2019).
a. What are the expected value and the standard error of the
sample proportion derived from a random sample of 100
firms?
b. In a random sample of 100 firms, what is the probability that
the sample proportion is greater than 0.57?
SOLUTION:
a. Given that p = 0.55 and n = 100, the expected value of the
sample proportion equals p = 0.55, and its standard error equals
√(p(1 − p)/n) = √(0.55 × 0.45/100) ≈ 0.0497.
FIGURE 5.3 Finding the probability that the sample proportion exceeds 0.57
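Part b is not shown in this excerpt; it can be sketched in R using the normal approximation for the sample proportion:

se <- sqrt(0.55 * 0.45 / 100)            # standard error of the sample proportion, about 0.0497
1 - pnorm(0.57, mean = 0.55, sd = se)    # b. P(sample proportion > 0.57), about 0.34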
EXERCISES 5.1
Applications
1. According to a survey, high school girls average 100 text messages daily. Assume
the population standard deviation is 20 text messages. Suppose a random
sample of 50 high school girls is taken.
a. What is the probability that the sample mean is more than 105?
b. What is the probability that the sample mean is less than 95?
c. What is the probability that the sample mean is between 95 and 105?
2. Beer bottles are filled so that they contain an average of 330 ml of beer in each
bottle. Suppose that the amount of beer in a bottle is normally distributed with a
standard deviation of four ml.
a. What is the probability that a randomly selected bottle will have less than 325
ml of beer?
b. What is the probability that a randomly selected six-pack of beer will have a
mean amount less than 325 ml?
c. What is the probability that a randomly selected 12-pack of beer will have a
mean amount less than 325 ml?
d. Comment on the sample size and the corresponding probabilities.
3. The weight of people in a small town in Missouri is known to be normally
distributed with a mean of 180 pounds and a standard deviation of 28 pounds. On
a raft that takes people across the river, a sign states, “Maximum capacity 3,200
pounds or 16 persons.” What is the probability that a random sample of 16
persons will exceed the weight limit of 3,200 pounds?
4. Despite its nutritional value, seafood is only a tiny part of the American diet, with
the average American eating just 16 pounds of seafood per year. Janice and Nina
both work in the seafood industry and they decide to create their own random
samples and document the average seafood diet in their sample. Let the standard
deviation of the American seafood diet be seven pounds.
a. Janice samples 42 Americans and finds an average seafood consumption of 18
pounds. How likely is it to get an average of 18 pounds or more if she had a
representative sample?
b. Nina samples 90 Americans and finds an average seafood consumption of 17.5
pounds. How likely is it to get an average of 17.5 pounds or more if she had a
representative sample?
c. Which of the two women is likely to have used a more representative sample?
Explain.
5. A small hair salon in Denver, Colorado, averages about 30 customers on
weekdays with a standard deviation of six. It is safe to assume that the underlying
distribution is normal. In an attempt to increase the number of weekday
customers, the manager offers a $2 discount on five consecutive weekdays. She
reports that her strategy has worked because the sample mean of customers
during this five-weekday period jumps to 35.
a. How unusual would it be to get a sample average of 35 or more customers if
the manager had not offered the discount?
b. Do you feel confident that the manager’s discount strategy has worked?
Explain.
6. Suppose that the typical college student graduates with $28,650 in debt. Let debt
among recent college graduates be normally distributed with a standard deviation
of $7,000.
a. What is the probability that the average debt of four recent college graduates is
more than $25,000?
b. What is the probability that the average debt of four recent college graduates is
more than $30,000?
7. Forty families gathered for a fund-raising event. Suppose the individual
contribution for each family is normally distributed with a mean and a standard
deviation of $115 and $35, respectively. The organizers would call this event a
success if the total contributions exceed $5,000. What is the probability that this
fund-raising event is a success?
8. A doctor is getting sued for malpractice by four of her former patients. It is
believed that the amount that each patient will sue her for is normally distributed
with a mean of $800,000 and a standard deviation of $250,000.
a. What is the probability that a given patient sues the doctor for more than
$1,000,000?
b. If the four patients sue the doctor independently, what is the probability that the
total amount they sue for is over $4,000,000?
9. Suppose that the miles-per-gallon (mpg) rating of passenger cars is normally
distributed with a mean and a standard deviation of 33.8 and 3.5 mpg,
respectively.
a. What is the probability that a randomly selected passenger car gets more than
35 mpg?
b. What is the probability that the average mpg of four randomly selected
passenger cars is more than 35 mpg?
c. If four passenger cars are randomly selected, what is the probability that all of
the passenger cars get more than 35 mpg?
10. Suppose that IQ scores are normally distributed with a mean of 100 and a
standard deviation of 16.
a. What is the probability that a randomly selected person will have an IQ score of
less than 90?
b. What is the probability that the average IQ score of four randomly selected
people is less than 90?
c. If four people are randomly selected, what is the probability that all of them
have an IQ score of less than 90?
11. A 2019 Pew Research study finds that the number of undocumented immigrants
living in the United States has dropped to the level it was in 2004. While its share
has declined, California still accounts for approximately 23% of the nation’s
estimated 10.5 million undocumented immigrants.
a. In a sample of 50 undocumented immigrants, what is the probability that more
than 20% live in California?
b. In a sample of 200 undocumented immigrants, what is the probability that more
than 20% live in California?
c. Comment on the reason for the difference between the computed probabilities
in parts a and b.
12. Suppose that a study finds that 33% of teenagers text while driving. The study
was based on 100 teen drivers.
a. Discuss the sampling distribution of the sample proportion.
b. What is the probability that the sample proportion is less than 0.30?
c. What is the probability that the sample proportion is within ±0.02 of the
population proportion?
13. A car manufacturer is concerned about poor customer
satisfaction at one of its dealerships. The management decides to evaluate the
satisfaction surveys of its next 40 customers. The dealer will be fined if the
number of customers who report favorably is between 22 and 26. The dealership
will be dissolved if fewer than 22 customers report favorably. It is known that 70%
of the dealer’s customers report favorably on satisfaction surveys.
a. What is the probability that the dealer will be fined?
b. What is the probability that the dealership will be dissolved?
14. At a new exhibit in the Museum of Science, people are asked to choose between
50 or 100 random draws from a machine. The machine is known to have 60 green
balls and 40 red balls. After each draw, the color of the ball is noted and the ball is
put back for the next draw. You win a prize if more than 70% of the draws result in
a green ball. Would you choose 50 or 100 draws for the game? Explain.
15. Suppose that one in six smartphone users have fallen prey to cyber-attack.
a. Discuss the sampling distribution of the sample proportion based on a sample
of 200 smartphone users. Is it appropriate to use the normal distribution
approximation for the sample proportion?
b. What is the probability that more than 20% of smartphone users in the sample
have fallen prey to cyber-attack?
5.2 ESTIMATION
Given sample data, we use the sample statistics to make inferences
about the unknown population parameters, such as the population
mean and the population proportion. Two basic methodologies
emerge from the inferential branch of statistics: estimation and
hypothesis testing. Although the sample statistics are based on a
portion of the population, they contain useful information to estimate
the population parameters and to conduct tests regarding the
population parameters. In this section, we focus on estimation.
As mentioned earlier, when a statistic is used to estimate a
parameter, it is referred to as a point estimator, or simply an
estimator. A particular value of the estimator is called a point
estimate or an estimate. Recall that the sample mean is the
estimator of the population mean μ, and the sample proportion is
the estimator of the population proportion p.
Suppose in a sample of 25 ultra-green cars, we find that the
mean miles per gallon (mpg) of the cars is 96.52 mpg; similarly,
suppose we calculate the proportion of these cars that get an mpg
greater than 100 as 0.28. Thus, the estimate for the mean mpg of
all ultra-green cars is 96.52 mpg, and the estimate for the proportion
of all ultra-green cars with mpg greater than 100 is 0.28. It is
important to note that these estimates are based on a sample of 25
cars and, therefore, are likely to vary between samples. Often it is
more informative to provide a range of values—an interval—rather
than a single point estimate for the unknown population parameter.
This range of values is called a confidence interval, also referred to
as an interval estimate, for the population parameter.
CONFIDENCE INTERVAL
A confidence interval, or interval estimate, provides a range of
values that, with a certain level of confidence, contains the
population parameter of interest.
Another standardized statistic, which uses the estimator S in
place of σ, follows the tdf distribution with df = n − 1 degrees of freedom.
We use the notation tα,df to denote a value such that the area in
the upper tail equals α for a given df. In other words, for a random
variable Tdf, the notation tα,df represents a value such that P(Tdf ≥
tα,df ) = α. Similarly, tα/2,df represents a value such that P(Tdf ≥ tα/2,df)
= α/2. Figure 5.4 illustrates the notation.
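Such t values can also be obtained in R with the qt function, which the text uses in Example 5.6; a minimal sketch with df = 24:

qt(0.95, df = 24)     # t such that the upper-tail area is 0.05, i.e., t0.05,24 = 1.711
qt(0.975, df = 24)    # t0.025,24 = 2.064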
EXAMPLE 5.5
In a sample of 25 ultra-green cars, we find that the mean miles
per gallon (mpg) is with a standard deviation of s =
SOLUTION: The condition that the sample mean follows a normal distribution is
satisfied since we assumed that mpg is normally distributed.
We construct the confidence interval as x̄ ± tα/2,df(s/√n). For the 90%
confidence level and df = 25 − 1 = 24, we use t0.05,24 = 1.711.
The 90% confidence interval for the average mpg of all ultra-
green cars is between 92.86 mpg and 100.18 mpg.
EXAMPLE 5.6
Amazon Prime is a $119-per-year service that gives the
company’s customers free two-day shipping and discounted
rates on overnight delivery. Prime customers also get other
perks, such as free e-books. Table 5.3 shows a portion of the
annual expenditures (in $) for 100 Prime customers. Use Excel
and R to construct the 95% confidence interval for the average
annual expenditures of all Prime customers. Summarize the
results.
FILE
Prime
Customer Expenditures
1 1272
2 1089
⋮ ⋮
100 1389
Using Excel
a. Open the Prime data file. Note that the observations for the
Expenditures variable are in cells B2 through B101.
b. For a 95% confidence interval, with n = 100, we find t0.025,99
using =T.INV(0.975,99). Thus, in order to obtain the lower
limit, we enter =AVERAGE(B2:B101) − T.INV(0.975,99)
*STDEV.S(B2:B101)/SQRT(100). For the upper limit, we enter
=AVERAGE(B2:B101) + T.INV(0.975,99)
*STDEV.S(B2:B101)/SQRT(100).
Note: For a one-step approach to constructing a confidence
interval in Excel, we can use the Descriptive Statistics option
in its Analysis Toolpak which we discussed in Chapter 3. In the
Descriptive Statistics dialog box, we select Summary statistics
and Confidence Interval for Mean. (By default, the confidence
level is set at 95%, but you can easily enter another level.) In the
table that Excel returns, we find the mean and the margin of
error which is labeled Confidence Level(95.0%).
Using R
a. Import the Prime data into a data frame (table) and label it
myData.
b. For a 95% confidence interval, with n = 100, we find t0.025,99
using the qt function. Thus, in order to obtain the lower and
upper limits, we enter:
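The R commands themselves are not reproduced in this excerpt; a minimal sketch consistent with the Excel steps above, assuming the imported data frame is myData with an Expenditures column, would be:

# Lower and upper limits of the 95% confidence interval for mean expenditures
xbar <- mean(myData$Expenditures)
se   <- sd(myData$Expenditures) / sqrt(100)
xbar - qt(0.975, df = 99) * se    # lower limit
xbar + qt(0.975, df = 99) * se    # upper limit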
EXAMPLE 5.7
In a sample of 25 ultra-green cars, seven of the cars obtained
over 100 miles per gallon (mpg). Construct 90% and 99%
confidence intervals for the population proportion of all ultra-
green cars that obtain over 100 mpg.
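The worked solution is not reproduced in this excerpt; a minimal R sketch of both intervals, using the normal approximation for the sample proportion:

phat <- 7 / 25                           # sample proportion, 0.28
se   <- sqrt(phat * (1 - phat) / 25)     # standard error, about 0.0898
phat + c(-1, 1) * qnorm(0.95) * se       # 90% confidence interval, about (0.13, 0.43)
phat + c(-1, 1) * qnorm(0.995) * se      # 99% confidence interval, about (0.05, 0.51)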
EXERCISES 5.2
Applications
16. A popular weight loss program claims that with its recommended healthy diet
regimen, users lose significant weight within a month. In order to estimate the
mean weight loss of all customers, a nutritionist takes a sample of 18 dieters and
records their weight loss one month after joining the program. He computes the
mean and the standard deviation of weight loss as 12.5 pounds and 9.2 pounds,
respectively. He believes that weight loss is likely to be normally distributed.
a. Calculate the margin of error with 95% confidence.
b. Calculate the 95% confidence interval for the population mean.
c. How can the margin of error reported in part a be reduced?
17. FILE Customers. The manager of The Cheesecake Factory in Memphis reports
that on six randomly selected weekdays, the number of customers served was
120, 130, 100, 205, 185, and 220. She believes that the number of customers
served on weekdays follows a normal distribution.
a. Calculate the margin of error with 90% confidence.
b. Construct the 90% confidence interval for the average number of customers
served on weekdays.
c. How can the margin of error reported in part a be reduced?
18. According to a recent survey, high school girls average 100 text messages daily.
Assume that the survey was based on a random sample of 36 high school girls.
The sample standard deviation is computed as 10 text messages daily.
a. Calculate the margin of error with 99% confidence.
b. What is the 99% confidence interval for the population mean texts that all high
school girls send daily?
19. The Chartered Financial Analyst (CFA) designation is fast becoming a
requirement for serious investment professionals. Although it
requires a successful completion of three levels of grueling
exams, the designation often results in a promising career with a lucrative salary.
A student of finance is curious about the average salary of a CFA charterholder.
He takes a random sample of 36 recent charterholders and computes a mean
salary of $158,000 with a standard deviation of $36,000. Use this sample
information to determine the 95% confidence interval for the average salary of a
CFA charterholder.
20. The sudoku puzzle has become very popular all over the world. It is based on a 9
× 9 grid and the challenge is to fill in the grid so that every row, every column, and
every 3 × 3 box contains the digits 1 through 9. A researcher is interested in
estimating the average time taken by a college student to solve the puzzle. He
takes a random sample of eight college students and records their solving times
(in minutes) as 14, 7, 17, 20, 18, 15, 19, 28.
a. Construct the 99% confidence interval for the average time taken by a college
student to solve a sudoku puzzle.
b. What assumption is necessary to make this inference?
21. FILE Stock_Price. The monthly closing stock price for a large technology firm
for the first six months of the year are reported in the following table.
Month Price
January 71
February 73
March 76
April 78
May 81
June 75
City Debt
Seattle 1135
⋮ ⋮
Pittsburgh 763
Micro Macro
85 48
78 79
⋮ ⋮
75 74
a. Construct 95% confidence intervals for the mean score in microeconomics and
the mean score in macroeconomics.
b. Explain why the widths of the two intervals are different.
27. FILE Math_Scores. For decades, people have believed that boys are innately
more capable than girls in math. In other words, due to the intrinsic differences in
brains, boys are believed to be better suited for doing math than girls. Recent
research challenges this stereotype, arguing that gender differences in math
performance have more to do with culture than innate aptitude. Others argue,
however, that while the average may be the same, there is more variability in
math ability for boys than girls, resulting in some boys with soaring math skills. A
portion of sample data on math scores of boys and girls is shown
in the accompanying table.
page 193
Boys Girls
74 83
89 76
⋮ ⋮
66 74
a. Construct 95% confidence intervals for the mean scores of boys and the mean
scores of girls. Explain your assumptions.
b. Explain why the widths of the two intervals are different.
28. FILE Startups. Many of today’s leading companies, including Google, Microsoft,
and Facebook, are based on technologies developed within universities. Lisa
Fisher is a business school professor who believes that a university’s research
expenditure (Research in $ millions) and the age of its technology transfer office
(Duration in years) are major factors that enhance innovation. She wants to know
what the average values are for the Research and the Duration variables. She
collects data from 143 universities on these variables, a portion of which is shown
in the accompanying table.
Research Duration
145.52 23
237.52 23
⋮ ⋮
154.38 9
a. Construct and interpret the 95% confidence interval for the mean research
expenditure of all universities.
b. Construct and interpret the 95% confidence interval for the mean duration of all
universities.
29. In a survey, 1,026 people were asked what they would do with an
unexpected cash gift. Forty-seven percent responded that they would pay off
debts.
a. With 95% confidence, what is the margin of error?
b. Construct the 95% confidence interval for the population proportion of people
who would pay off debts with an unexpected cash gift.
30. A sample of 5,324 Americans was asked what matters most to them in a
place to live. Thirty-seven percent of the respondents felt job opportunities matter
most.
a. Construct the 90% confidence interval for the proportion of Americans who feel
that good job opportunities matter most in a place to live.
b. Construct the 99% confidence interval for the proportion of Americans who feel
that good job opportunities matter most in a place to live.
c. Which of the above two intervals has a higher margin of error? Explain why.
31. An economist reports that 560 out of a sample of 1,200 middle-income American
households actively participate in the stock market.
a. Construct the 90% confidence interval for the proportion of middle-income
Americans who actively participate in the stock market.
b. Can we conclude that the percentage of middle-income Americans who actively
participate in the stock market is not 50%?
32. In a survey of 1,116 adults, 47% approved of the job that President Trump was
doing in handling the economy (Opinion Today, July 1, 2019).
a. Compute the 90% confidence interval for the proportion of Americans who
approved of President Trump’s handling of the economy.
b. What is the resulting margin of error?
c. Compute the margin of error associated with the 99% confidence level.
33. In a recent poll of 760 homeowners in the United States, one in five homeowners
reports having a home equity loan that he or she is currently paying off. Using a
confidence coefficient of 0.90, construct the confidence interval for the proportion
of all homeowners in the United States that hold a home equity loan.
34. Obesity is generally defined as 30 or more pounds over a healthy weight. A recent
study of obesity reports 27.5% of a random sample of 400 adults in the United
States to be obese.
a. Use this sample information to compute the 90% confidence interval for the
adult obesity rate in the United States.
b. Is it reasonable to conclude with 90% confidence that the adult obesity rate in
the United States differs from 30%?
35. An accounting professor is notorious for being stingy in giving out good letter
grades. In a large section of 140 students in the fall semester, she gave out only
5% A’s, 23% B’s, 42% C’s, and 30% D’s and F’s. Assuming that this was a
representative class, compute the 95% confidence interval of the probability of
getting at least a B from this professor.
page 194
page 195
EXAMPLE 5.8
A trade group predicts that back-to-school spending will
average $606.40 per family this year. A different economic
model is needed if the prediction is wrong. Specify the null and
the alternative hypotheses to determine if a different economic
model is needed.
EXAMPLE 5.9
A television research analyst wishes to test a claim that more
than 50% of the households will tune in for a TV episode.
Specify the null and the alternative hypotheses to test the
claim.
The claim that more than 50% of the households will tune in for
a TV episode is valid only if the null hypothesis is rejected.
EXAMPLE 5.10
An online retailer is deciding whether or not to build a brick-
and-mortar store in a new marketplace. A market analysis
determines that the venture will be profitable if average
pedestrian traffic exceeds 500 people per day. The competing
hypotheses are specified as follows.
Discuss the consequences of a Type I error and a Type II error.
page 197
LO 5.5
Conduct a hypothesis test for the population mean.
page 198
Note that the value of the test statistic tdf is evaluated at μ = μ0,
which explains why we need some form of the equality sign in the
null hypothesis. Given that the population is normally distributed, and
with x̄ = 71, n = 25 (so df = 24), and s = 9, we compute the value of the test
statistic as t24 = (x̄ − μ0)/(s/√n) = (71 − 67)/(9/√25) = 2.22. We then approximate
the p-value, P(T24 ≥ 2.22), with the help
of the t table. Referencing Table 5.5 for df = 24, we find that the
exact probability P(T24 ≥ 2.22) cannot be determined. Because 2.22
lies between 2.064 and 2.492, this implies that the p-value is
between 0.01 and 0.025. Using Excel or R (instructions shown
shortly), we find that the exact p-value is 0.018. Figure 5.6 shows the
computed p-value.
FIGURE 5.6 The p-value for a right-tailed test with t24 = 2.22
Note that when the null hypothesis is true, there is only a 1.8%
chance that the sample mean will be 71 or more. This seems like a
very small chance, but is it small enough to allow us to reject the null
hypothesis in favor of the alternative hypothesis? Let’s see how we
define “small enough.”
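The exact p-value referenced above can be obtained directly from software. In R, the right-tail probability is returned by the pt function; in Excel, the T.DIST.RT function gives the same result.

> pt(2.22, df = 24, lower.tail = FALSE)    # P(T24 >= 2.22), approximately 0.018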
page 199
Alternative Hypothesis     p-value
HA: μ > μ0                 Right-tail probability: P(Tdf ≥ tdf)
HA: μ < μ0                 Left-tail probability: P(Tdf ≤ tdf)
HA: μ ≠ μ0                 Two-tail probability: 2P(Tdf ≥ tdf) if tdf > 0, or 2P(Tdf ≤ tdf) if tdf < 0
page 200
tailed test with the hypotheses specified as H0: μ ≤ 67 versus HA: μ > 67.
Step 3. Calculate the value of the test statistic and the p-value.
When testing the population mean μ, the value of the test
statistic is tdf = (x̄ − μ0)/(s/√n), where μ0 is the hypothesized value of the
population mean, x̄ is the sample mean, s is the sample standard deviation, and df = n − 1.
page 201
EXAMPLE 5.11
FILE
Study_Hours
EXAMPLE 5.12
FILE
Study_Hours
As the introductory case to this chapter mentions, research
finds that today’s undergraduates study an average of 14 hours
per week. Using the sample data from Table 5.1, the dean at a
large university in California would like to test if the mean study
time of students at her university differs from 14 hours per
week. At the 5% significance level, use Excel and R to find the
value of the test statistic and the p-value. Summarize the
results.
page 202
Using Excel
a. Open the Study_Hours data file.
b. To calculate the value of the test statistic, we enter
=(AVERAGE(A2:A36)−14)/(STDEV.S(A2:A36)/SQRT(35)). We
obtain t34 = 1.9444.
c. Even though Excel offers a number of functions that generate
p-values, we use the T.DIST.RT function. Here, for a two-tailed
test and a positive value for the test statistic, we enter
=2*T.DIST.RT(1.9444, 34) and Excel returns 0.0602.
Using R
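The R steps can be sketched with the t.test function. The sketch below assumes the study times are stored in a column named Hours; the actual column name in the Study_Hours data may differ.

> myData <- read.csv("Study_Hours.csv")    # import the Study_Hours data (CSV format is an assumption)
> t.test(myData$Hours, mu = 14)            # two-sided test of H0: mu = 14 by default

The reported test statistic, degrees of freedom, and p-value should match the Excel results above (t34 = 1.9444 and a p-value of 0.0602).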
Summary
Because the p-value of 0.0602 is not less than α = 0.05, we do
not reject the null hypothesis. At the 5% significance level, we
cannot conclude that the mean study time of students at this
large university in California differs from 14 hours per week.
page 203
Two hypothesis tests are conducted. The first test examines whether the mean
study time of students at this university is below the 1961 national average of 24
hours per week. At the 5% significance level, the sample data suggest that the mean
is less than 24 hours per week. The second test investigates whether the mean
study time of students at this university differs from today’s national average of 14
hours per week. At the 5% significance level, the results do not suggest that the
mean study time is different from 14 hours per week.
Thus, the sample results support the overall findings of the report:
undergraduates study, on average, 14 hours per week, far below the 1961 average
of 24 hours per week. The present analysis, however, does not explain why that
might be the case. For instance, it cannot be determined whether students have just
become lazier or if, with the advent of the computer, they can access information in
less time.
EXAMPLE 5.13
Driven by growing public support, the legalization of marijuana
in America has been moving at a very rapid rate. Today, 57% of
adults say the use of marijuana should be made legal
(www.pewresearch.org, October 12, 2016). A health
practitioner in Ohio collects data from 200 adults and finds that
102 of them favor marijuana legalization.
a. The health practitioner believes that the proportion of adults
who favor marijuana legalization in Ohio is not representative
of the national proportion. Specify the competing hypotheses
to test her claim.
b. Calculate the value of the test statistic and the p-value.
c. At the 10% significance level, do the sample data support the
health practitioner’s belief?
SOLUTION:
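The competing hypotheses are H0: p = 0.57 versus HA: p ≠ 0.57. A rough sketch of the calculation in R follows; it simply applies the test statistic formula for the population proportion to the figures given in the example.

> phat <- 102/200                                   # sample proportion
> z <- (phat - 0.57)/sqrt(0.57*(1 - 0.57)/200)      # value of the test statistic
> z; 2*pnorm(-abs(z))                               # test statistic and two-tailed p-value

The test statistic is roughly −1.71 and the two-tailed p-value is roughly 0.087. Because 0.087 is less than α = 0.10, the null hypothesis is rejected at the 10% significance level; the sample data support the health practitioner's belief.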
EXERCISES 5.3
Applications
36. Construct the null and the alternative hypotheses for the following tests:
a. Test if the mean weight of cereal in a cereal box differs from 18 ounces.
b. Test if the stock price increases on more than 60% of the trading days.
c. Test if Americans get an average of less than seven hours of sleep.
d. Define the consequences of Type I and Type II errors for part a.
37. Construct the null and the alternative hypotheses for the following claims:
a. “I am going to get the majority of the votes to win this election.”
b. “I suspect that your 10-inch pizzas are, on average, less than 10 inches in size.”
page 205
c. “I will have to fine the company because its tablets do not
contain an average of 250 mg of ibuprofen as advertised.”
d. Discuss the consequences of Type I and Type II errors for part a.
38. The manager of a large manufacturing firm is considering switching to new and
expensive software that promises to reduce its assembly costs. Before
purchasing the software, the manager wants to conduct a hypothesis test to
determine if the new software does reduce its assembly costs.
a. Would the manager of the manufacturing firm be more concerned about a Type
I error or a Type II error? Explain.
b. Would the software company be more concerned about a Type I error or a Type
II error? Explain.
39. A polygraph (lie detector) is an instrument used to determine if an individual is
telling the truth. These tests are considered to be 95% reliable. In other words, if
an individual lies, there is a 0.95 probability that the test will detect a lie. Let there
also be a 0.005 probability that the test erroneously detects a lie even when the
individual is actually telling the truth. Consider the null hypothesis, “the individual
is telling the truth,” to answer the following questions.
a. What is the probability of a Type I error?
b. What is the probability of a Type II error?
c. What are the consequences of Type I and Type II errors?
d. What is wrong with the statement, “I can prove that the individual is telling the
truth on the basis of the polygraph result”?
40. FILE Wait_Time. The manager of a small convenience store does not want her
customers standing in line for too long prior to a purchase. In particular, she is
willing to hire an employee for another cash register if the average wait time of the
customers is more than five minutes. She randomly observes the wait time (in
minutes) of customers during the day as:
a. Set up the null and the alternative hypotheses to determine if the manager
needs to hire another employee.
b. Calculate the value of the test statistic and the p-value. What assumption
regarding the population is necessary to implement this step?
c. Decide whether the manager needs to hire another employee at α = 0.10.
41. A machine that is programmed to package 1.20 pounds of cereal in each cereal
box is being tested for its accuracy. In a sample of 36 cereal boxes, the mean and
the standard deviation are calculated as 1.22 pounds and 0.06 pound,
respectively.
a. Set up the null and the alternative hypotheses to determine if the machine is
working improperly—that is, it is either underfilling or overfilling the cereal
boxes.
b. Calculate the value of the test statistic and the p-value.
c. At the 5% level of significance, can you conclude that the machine is working
improperly? Explain.
42. This past year, home prices in the Midwest increased by an average of 6.6%. A
Realtor collects data on 36 recent home sales in the West. He finds an average
increase in home prices of 7.5% with a standard deviation of 2%. Can he
conclude that the average increase in home prices in the West is greater than the
increase in the Midwest? Use a 5% significance level for the analysis.
43. Based on the average predictions of 45 economists, the U.S. gross domestic
product (GDP) will expand by 2.8% this year. Suppose the sample standard
deviation of their predictions was 1%. At the 5% significance level, test if the
mean forecast GDP of all economists is less than 1%.
44. FILE MPG. The data accompanying this exercise show miles per gallon (MPG)
for a sample of 50 hybrid cars.
a. State the null and the alternative hypotheses in order to test whether the
average MPG differs from 50.
b. Calculate the value of the test statistic and the p-value.
c. At α = 0.05, can you conclude that the average MPG differs from 50?
45. A car manufacturer is trying to develop a new sports car. Engineers are hoping
that the average amount of time that the car takes to go from 0 to 60 miles per
hour is below six seconds. The manufacturer tested 12 of the cars and clocked
their performance times. Three of the cars clocked in at 5.8 seconds, five cars at
5.9 seconds, three cars at 6.0 seconds, and 1 car at 6.1 seconds. At the 5% level
of significance, test if the new sports car is meeting its goal to go from 0 to 60
miles per hour in less than six seconds. Assume a normal distribution for the
analysis.
46. FILE Highway_Speeds. A police officer is concerned about speeds on a certain
section of Interstate 90. The data accompanying this exercise show the speeds of
40 cars on a Saturday afternoon.
a. The speed limit on this portion of Interstate 90 is 65 mph. Specify the competing
hypotheses in order to determine if the average speed of cars on Interstate 90
is greater than the speed limit.
b. Calculate the value of the test statistic and the p-value.
c. At α = 0.01, are the officer’s concerns warranted? Explain.
47. FILE Debt_Payments. The data accompanying this exercise show the average
monthly debt payments (Debt, in $) for residents of 26 metropolitan areas.
a. State the null and the alternative hypotheses in order to test whether average
monthly debt payments are greater than $900.
b. What assumption regarding the population is necessary to implement this step?
page 206
c. Calculate the value of the test statistic and the p-value.
d. At α = 0.05, are average monthly debt payments greater than $900? Explain.
48. Some first-time home buyers—especially millennials—are raiding their retirement
accounts to cover the down payment on a home. An economist is concerned that
the percentage of millennials who will dip into their retirement accounts to fund a
home now exceeds 20%. He randomly surveys 190 millennials with retirement
accounts and finds that 50 are borrowing against them.
a. Set up the null and the alternative hypotheses to test the economist’s concern.
b. Calculate the value of the test statistic and the p-value.
c. Determine if the economist’s concern is justifiable at α = 0.05.
49. FILE Lottery. A 2012 survey by Business Week found that Massachusetts
residents spent an average of $860.70 on the lottery, more than three times the
U.S. average. A researcher at a Boston think tank believes that Massachusetts
residents spend less than this amount. He surveys 100 Massachusetts residents
and asks them about their annual expenditures on the lottery.
a. Specify the competing hypotheses to test the researcher’s claim.
b. Calculate the value of the test statistic and the p-value.
c. At the 10% significance level, do the data support the researcher’s claim?
Explain.
50. The margarita is one of the most common tequila-based cocktails, made with
tequila mixed with triple sec and lime or lemon juice, often served with salt on the
glass rim. A common ratio for a margarita is 2:1:1, which includes 50% tequila,
25% triple sec, and 25% fresh lime or lemon juice. A manager at a local bar is
concerned that the bartender uses incorrect proportions in more than 50% of
margaritas. He secretly observes the bartender and finds that he used the correct
proportions in only 10 out of 30 margaritas. Test if the manager’s suspicion is
justified at α = 0.05.
51. A politician claims that he is supported by a clear majority of voters. In a recent
survey, 24 out of 40 randomly selected voters indicated that they would vote for
the politician. Is the politician’s claim justified at the 5% level of significance?
52. With increasing out-of-pocket healthcare costs, it is claimed that more than 60%
of senior citizens are likely to make serious adjustments to their lifestyle. Test this
claim at the 1% level of significance if in a survey of 140 senior citizens, 90
reported that they have made serious adjustments to their lifestyle.
53. A movie production company is releasing a movie with the hopes that many
viewers will return to see the movie in the theater for a second time. Their target is
to have 30 million viewers, and they want more than 30% of the viewers to want
to see the movie again. They show the movie to a test audience of 200 people,
and after the movie they asked them if they would see the movie in theaters
again. Of the test audience, 68 people said they would see the movie again.
a. At the 5% level of significance, test if more than 30% of the viewers will return
to see the movie again.
b. Repeat the analysis at the 10% level of significance.
54. FILE Silicon_Valley. According to a 2018 report by CNBC on workforce
diversity, about 60% of the employees in high-tech firms in Silicon Valley are white
and about 20% are Asian. Women, along with African Americans and Latinxs, are
highly underrepresented. Just about 30% of all employees are women, with
African Americans and Latinx accounting for only about 15% of the workforce.
Tara Jones is a recent college graduate working for a large high-tech firm in
Silicon Valley. She wants to determine if her firm faces the same diversity as in
the report. She collects sex and ethnicity information on 50 employees in her firm.
A portion of the data is shown in the accompanying table.
Sex Ethnicity
Woman White
Man White
⋮ ⋮
Man Nonwhite
page 207
gauge the variability in P̂. In both cases, the variability depends on the size of the
sample on which the value of the estimator is based. If the sample size is sufficiently
large, then the variability virtually disappears, or, equivalently, se(X̄) = s/√n and
se(P̂) = √(p̂(1 − p̂)/n) approach zero. Thus, with big data, it is not very meaningful to construct confidence
intervals for the population mean or the population proportion because the margin of
error also approaches zero; under these circumstances, when estimating μ or p, it is
sufficient to use the estimate of the relevant point estimator.
Recall too that when testing the population mean, the value of the test statistic is
calculated as tdf = (x̄ − μ0)/(s/√n), and when testing the population proportion, it is
calculated as z = (p̂ − p0)/√(p0(1 − p0)/n). As the sample size n becomes very large, the
value of the respective test statistic increases, leading to a small p-value, and thus
rejection of the null hypothesis in virtually any scenario.
Thus, if the sample size is sufficiently large, statistical inference is not very useful.
In this Writing with Data section, we focus on a case study where the sample size is
relatively small.
Case Study
According to a 2018 paper released by the Economic Policy Institute, a nonprofit,
nonpartisan think tank in Washington, D.C., income inequality continues to grow in
the United States. Over the years, the rich have become richer while working-class
wages have stagnated. A local Latino politician has been vocal regarding his concern
about the welfare of Latinx. In various speeches, he has stated that the mean salary
of Latinx households in his county has fallen below the 2017 mean of approximately
$50,000. He has also stated that the proportion of Latinx households making less
than $30,000 has risen above the 2017 level of 20%. Both of his statements are
based on income data for 36 Latinx households in the county. A portion of the data is
shown in Table 5.6.
FILE
Latinx_Income
Income
23
63
⋮
47
page 208
TABLE 5.7 Test Statistic Values and p-Values for Hypothesis Tests

Hypotheses: H0: μ ≥ 50,000 versus HA: μ < 50,000; test statistic value = −0.837; p-value = 0.204
Hypotheses: H0: p ≤ 0.20 versus HA: p > 0.20; p-value = 0.369
When testing whether the mean income of Latinx households has fallen
below the 2017 level of $50,000, a test statistic value of −0.837 is obtained.
Given a p-value of 0.204, the null hypothesis regarding the population mean,
specified in Table 5.7, cannot be rejected at any reasonable level of
significance. Similarly, given a p-value of 0.369, the null hypothesis regarding
the population proportion cannot be rejected. Therefore, sample evidence
does not support the claims that the mean income of Latinx households has
fallen below $50,000 or that the proportion of Latinx households making less
than $30,000 has risen above 20%. Perhaps the politician’s remarks were
based on a cursory look at the sample statistics and not on a thorough
statistical analysis.
Report 5.2 FILE Field_Choice. A 2018 Pew Research Center survey finds that
more than half of Americans (52%) believe that young people do not pursue a
degree in science, technology, engineering, or math (STEM)
because they feel that these subjects are too hard. Sixty college-
page 209
bound students are asked about the field they would like to pursue in college. The
choices offered in the questionnaire are STEM, Business, and Other; information on
a teenager’s sex (Male or Female) is also collected. The accompanying data file
contains the responses. In a report, use the sample information to
Compare the 95% confidence interval for the proportion of students who would like
to pursue STEM with the proportion who would like to pursue Business. Do the
results appear to support the survey’s finding? Explain.
Construct and interpret the 95% confidence interval for the proportion of female
students who are college bound.
6 Regression Analysis
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 6.1 Estimate and interpret a linear regression
model.
LO 6.2 Interpret goodness-of-fit measures.
LO 6.3 Conduct tests of significance.
LO 6.4 Address common violations of the OLS
assumptions.
page 211
©Rawpixel.com/Shutterstock
INTRODUCTORY CASE
College Scorecard
With college costs and student debt on the rise, the choices
that families make when searching for and selecting a college
have never been more important. Yet, students and parents
struggle to find clear, reliable data on critical questions of
college affordability and value. For these reasons, the
Department of Education (DOE) published a redesigned
College Scorecard that reports the most reliable national data
on college costs and students’ outcomes at specific colleges.
Fiona Schmidt, a college counselor, believes that the
information from the College Scorecard can help her as she
advises families. Fiona wonders what college factors influence
post-college earnings and wants answers to the following
questions: If a college costs more or has a higher graduation
rate, should a student expect to earn more after graduation? If
a greater percentage of the students are paying down debt
after college, does this somehow influence post-college
earnings? And finally, does the location of a college affect post-
college earnings?
To address these questions, Fiona gathers information from
116 colleges on annual post-college earnings (Earnings in $),
the average annual cost (Cost in $), the graduation rate (Grad
in %), the percentage of students paying down debt (Debt in
%), and whether or not a college is located in a city (City
equals 1 if a city location, 0 otherwise). Table 6.1 shows a
portion of the data.
page 212
page 213
page 214
Let b0, b1, b2, . . . , bk represent the estimates of β0, β1, β2, . . . , βk,
respectively. The sample regression equation is ŷ = b0 + b1x1 + b2x2 + . . . + bkxk,
and for each observation the residual is computed as e = y − ŷ.
A common approach to obtaining estimates for β0, β1, β2, . . . , βk
is to use the ordinary least squares (OLS) method. OLS
estimators have many desirable properties if certain assumptions
hold. (These assumptions are discussed in Section 6.4.) The OLS
method chooses the sample regression equation whereby the error
sum of squares, SSE, is minimized, where SSE = Σ(y − ŷ)2 = Σe2.
FILE
College
EXAMPLE 6.1
Using the data from Table 6.1, estimate the linear regression
model, Earnings = β0 + β1Cost + β2Grad + β3Debt + β4City + ɛ,
where Earnings is annual post-college earnings (in $), Cost is
the average annual cost (in $), Grad is the graduation rate (in
%), Debt is the percentage of students paying down debt (in
%), and City assumes a value of 1 if the college is located in a
city, 0 otherwise.
a. What is the sample regression equation?
b. Interpret the slope coefficients.
c. Predict annual post-college earnings if a college’s average
annual cost is $25,000, its graduation rate is 60%, its
percentage of students paying down debt is 80%, and it is
located in a city.
SOLUTION: Table 6.2 shows the Excel output from estimating this
model. We will provide Excel and R instructions for obtaining
regression output at the end of this section.
page 216
b. All coefficients are positive, suggesting a positive
influence of each predictor variable on the response variable.
Specifically:
Holding all other predictor variables constant, if average
annual costs increase by $1, then, on average, predicted
earnings are expected to increase by b1; that is, by about
$0.4349. This result suggests that spending more on college
pays off.
If the graduation rate increases by 1%, then, on average,
predicted earnings are expected to increase by $178.10,
holding the other predictor variables constant. Policymakers
often debate whether graduation rate can be used as a
proxy for a college’s academic quality. Perhaps it is not the
academic quality but the case that colleges with a higher
graduation rate have more motivated students, which
translates into higher earnings for these students.
A one percentage point increase in the percentage of
students paying down debt is expected to increase predicted
earnings by approximately $141.48, holding the other three
predictor variables constant. As it turns out, Debt is not
statistically significant at any reasonable level; we will
discuss such tests of significance in Section 6.3.
All else constant, predicted earnings are $2,526.79 higher
for graduates of colleges located in a city. This difference is
not surprising seeing that students who attend college in a
city likely have more internship opportunities, which then
often translate into higher-paying jobs after graduation.
c. If a college's average annual cost is $25,000, its graduation
rate is 60%, its percentage of students paying down debt is
80%, and it is located in a city, then, plugging these values into
the sample regression equation, predicted annual post-college
earnings for its students are approximately $45,409.
Estimating a Linear Regression Model with
Excel or R
Using Excel
In order to obtain the regression output in Table 6.2 using Excel, we
follow these steps.
FILE
College
Using R
In order to obtain the regression output in Table 6.2 using R, we
follow these steps.
A. Import the College data into a data frame (table) and label it myData.
B. By default, R will report the regression output using scientific notation. We opt to turn this
option off using the following command:
> options(scipen=999)
In order to turn scientific notation back on, we would enter options(scipen=0) at the
prompt.
page 217
C. Use the lm function to create a linear model, which we label Multiple.
Note that we use the ‘+’ sign to add predictor variables, even if we believe that a
negative relationship may exist between the response variable and the predictor
variables. You will not see output after you implement this step.
> Multiple <- lm(Earnings ~ Cost + Grad + Debt + City, data = myData)
D. Use the summary function to view the summary regression output. Figure 6.4 shows the
R regression output. We have put the intercept and the slope coefficients in boldface. As
expected, these values are identical to the ones obtained using Excel.
> summary(Multiple)
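As an additional, optional step not listed in the instructions above, the prediction asked for in part c of Example 6.1 can be obtained from the estimated model with the predict function.

> newCollege <- data.frame(Cost = 25000, Grad = 60, Debt = 80, City = 1)   # values from part c
> predict(Multiple, newdata = newCollege)                                  # predicted Earnings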
page 218
page 219
SOLUTION:
page 220
a. Interpret the slope coefficients.
b. Predict sales if a city has 1.5 million women over the age of 60 and their
average income is $44,000.
4. An executive researcher wants to better understand the factors that explain
differences in salaries for marketing majors. He decides to estimate two models: y
= β0 + β1d1 + ɛ (Model 1) and y = β0 + β1d1 + β2d2 + ɛ (Model 2). Here y
represents salary, d1 is a dummy variable that equals 1 for male employees, and
d2 is a dummy variable that equals 1 for employees with an MBA.
a. What is the reference group in Model 1?
b. What is the reference group in Model 2?
c. In these two models, would it matter if d1 equaled 1 for female employees?
5. House price y is estimated as a function of the square footage of a house x and a
dummy variable d that equals 1 if the house has ocean views. The estimated
a. Compute the predicted price of a house with ocean views and square footage of
2,000 and 3,000, respectively.
b. Compute the predicted price of a house without ocean views and square
footage of 2,000 and 3,000, respectively.
c. Discuss the impact of ocean views on the house price.
6. FILE GPA. The director of graduate admissions at a large university is analyzing
the relationship between scores on the math portion of the Graduate Record
Examination (GRE) and subsequent performance in graduate school, as
measured by a student’s grade point average (GPA). She uses a sample of 24
students who graduated within the past five years. A portion of the data is as
follows:
GPA GRE
3.0 700
3.5 720
⋮ ⋮
3.5 780
Salary Education
40 3
53 4
⋮ ⋮
38 0
a. Find the sample regression equation for the model: Salary = β0 + β1Education
+ ɛ.
b. Interpret the coefficient for Education.
c. What is the predicted salary for an individual who completed seven years of
higher education?
8. FILE Consumption. The consumption function, first developed by John Maynard
Keynes, captures one of the key relationships in economics. It expresses
consumption as a function of disposable income, where disposable income is
income after taxes. The accompanying table shows a portion of quarterly data for
average U.S. annual consumption (Consumption in $) and disposable income
(Income in $) for the years 2000–2016.
Price Age Mileage
13590 6 61485
13775 6 54344
⋮ ⋮ ⋮
11988 8 42408
a. Estimate the sample regression equation that enables us to predict the price of
a sedan on the basis of its age and mileage.
b. Interpret the slope coefficient of Age.
c. Predict the price of a five-year-old sedan with 65,000 miles.
10. FILE Engine. The maintenance manager at a trucking company wants to build a
regression model to forecast the time until the first engine overhaul (Time in
years) based on four predictor variables: (1) annual miles driven (Miles in 1,000s),
(2) average load weight (Load in tons), (3) average driving speed (Speed in mph),
and (4) oil change interval (Oil in 1,000s of miles). Based on driver logs and
onboard computers, data have been obtained for a sample of 25 trucks. A portion
of the data is shown in the accompanying table.
page 221
a. For each predictor variable, discuss whether it is likely to have
a positive or negative influence on time until the first engine overhaul.
b. Find the sample regression equation for the regression model (use all four
predictor variables).
c. Based on part a, are the signs of the regression coefficients logical?
d. Predict the time before the first engine overhaul for a particular truck driven
60,000 miles per year with an average load of 22 tons, an average driving
speed of 57 mph, and 18,000 miles between oil changes.
11. FILE MCAS. Education reform is one of the most hotly debated subjects on both
state and national policymakers’ list of socioeconomic topics. Consider a linear
regression model that relates school expenditures and family background to
student performance in Massachusetts using 224 school districts. The response
variable is the mean score on the MCAS (Massachusetts Comprehensive
Assessment System) exam given to 10th graders. Four predictor variables are
used: (1) STR is the student-to-teacher ratio in %, (2) TSAL is the average
teacher’s salary in $1,000s, (3) INC is the median household income in $1,000s,
and (4) SGL is the percentage of single-parent households. A portion of the data
is shown in the accompanying table.
24100 26 24 80
23700 32 21 73
⋮ ⋮ ⋮ ⋮
26000 39 22 69
⋮ ⋮ ⋮ ⋮
a. Estimate a linear regression model with Writing as the response variable and
GPA and Female as the predictor variables. Compute the predicted writing
score for a male student with a GPA of 3.5. Repeat the computation for a
female student.
b. Estimate a linear regression model with Math as the response variable and
GPA and Female as the predictor variables. Compute the predicted math score
for a male student with a GPA of 3.5. Repeat the computation for a female
student.
14. FILE Franchise. A president of a large chain of fast-food restaurants collects
data on 100 franchises.
a. Estimate the model: y = β0 + β1x1 + β2x2 + ɛ where y is net profit, x1 is counter
sales, and x2 is drive-through sales. All variables are measured in millions of
dollars.
b. Interpret the coefficient attached to drive-through sales.
c. Predict the net profit of a franchise that had counter sales of $6 million and
drive-through sales of $4 million.
page 222
15. FILE Arlington. A Realtor in Arlington, Massachusetts, is
analyzing the relationship between the sale price of a home (Price in $), its square
footage (Sqft), the number of bedrooms (Beds), the number of bathrooms (Baths),
and a Colonial dummy variable (Colonial equals 1 if a colonial-style home, 0
otherwise). She collects data on 36 sales in Arlington for the analysis. A portion of
the data is shown in the accompanying table.
1 237.52 16 23
⋮ ⋮ ⋮ ⋮
1 154.38 3 9
line, then each residual would be zero; in other words, there would
be no dispersion between the observed and the predicted values.
Because in practice we rarely, if ever, obtain this result, we evaluate
models on the basis of the relative magnitude of the residuals. The
sample regression equation provides a good fit when the dispersion
of the residuals is relatively small.
This value, called the total sum of squares, SST, can be broken
down into two components: explained variation and unexplained
variation. Figure 6.6 illustrates the decomposition of the total
variation of y into its two components for a simple linear regression
model.
This difference reflects the unexplained variation in y.
Now, we focus on the distance between point B and point C,
which is referred to as the explained difference, ŷ − ȳ. In other words, this
difference reflects the explained variation in y.
page 225
R2 = SSR/SST = 1 − SSE/SST, where SSR is the explained variation, SSE is the unexplained variation, and SST = SSR + SSE is the total variation.
The value of R2 falls between zero and one; the closer the
value is to one, the better the fit.
The Adjusted R2
Because R2 never decreases as we add more predictor variables to
the linear regression model, it is possible to increase its value
unintentionally by including a group of predictor variables that may
have no economic or intuitive foundation in the linear regression
model. This is true especially when the number of predictor variables
k is large relative to the sample size n. In order to avoid
the possibility of R2 creating a false impression, virtually
all software packages include adjusted R2. Unlike R2, adjusted R2
explicitly accounts for the sample size n and the number of predictor
variables k. It is common to use adjusted R2 for model selection
because it imposes a penalty for any additional predictor variable
that is included in the analysis. When comparing models with the
same response variable, we prefer the model with the higher
adjusted R2, implying that the model is able to explain more of the
sample variation in y.
ADJUSTED R2
The adjusted coefficient of determination is calculated as adjusted R2 = 1 − (1 − R2)(n − 1)/(n − k − 1), where n is the sample size and k is the number of predictor variables.
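As a quick numerical sketch of this formula, the following R lines compute adjusted R2 from R2, n, and k. The inputs are only approximate values from the College application discussed in this chapter (an R2 of roughly 0.43, n = 116 colleges, and k = 4 predictor variables).

> r2 <- 0.43; n <- 116; k <- 4              # approximate values, for illustration only
> 1 - (1 - r2)*(n - 1)/(n - k - 1)          # adjusted R2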
EXAMPLE 6.3
Table 6.5 provides goodness-of-fit measures from estimating
the following models:
EXERCISES 6.2
Applications
18. An analyst estimates the sales of a firm as a function of its advertising
expenditures using the model Sales = β0 + β1Advertising + ɛ. Using 20
observations, he finds that SSR = 199.93 and SST = 240.92.
a. What proportion of the sample variation in sales is explained by advertising
expenditures?
b. What proportion of the sample variation in sales is unexplained by advertising
expenditures?
19. FILE Test_Scores. The accompanying data file shows the midterm and final
scores for 32 students in a statistics course.
a. Estimate a student’s final score as a function of his/her midterm score.
b. Find the standard error of the estimate.
c. Find and interpret the coefficient of determination.
20. The director of college admissions at a local university is trying to determine
whether a student’s high school GPA or SAT score is a better predictor of the
student’s subsequent college GPA. She formulates two models:
She estimates these models and obtains the following goodness-of-fit measures.
Model 1 Model 2
R2 0.5595 0.5322
se 40.3684 41.6007
Which model provides a better fit for y? Justify your response with two goodness-
of-fit measures.
21. FILE Property_Taxes. The accompanying data file shows the square footage
and associated property taxes for 20 homes in an affluent suburb 30 miles outside
of New York City.
a. Estimate a home’s property taxes as a linear function of the size of the home
(measured by its square footage).
b. What proportion of the sample variation in property taxes is explained by the
home’s size?
c. What proportion of the sample variation in property taxes is unexplained by the
home’s size?
22. FILE Car_Prices. The accompanying data file shows the selling price of a used
sedan, its age, and its mileage. Estimate two models:
Which model provides a better fit for y? Justify your response with two goodness-
of-fit measures.
23. For a sample of 41 New England cities, a sociologist studies the crime rate in
each city (crimes per 100,000 residents) as a function of its poverty rate (in %)
and its median income (in $1,000s). He finds that SSE = 4,182,663 and SST =
7,732,451.
a. Calculate the standard error of the estimate.
b. What proportion of the sample variation in crime rate is explained by the
variability in the predictor variables? What proportion is unexplained?
page 228
24. For a large restaurant chain, a financial analyst uses the
following model to estimate a franchise’s net profit: y = β0 + β1x1 + β2x2 + ɛi,
where y is net profit, x1 is counter sales, and x2 is drive-through sales. For a
sample of 100 franchises, she finds that SSE = 4.4600 and SST = 18.3784.
a. Calculate the standard error of the estimate.
b. Calculate and interpret the coefficient of determination.
c. Calculate the adjusted R2.
25. FILE Football. Is it defense or offense that wins football games? Consider the
following portion of data, which includes a team’s winning record (Win in %), the
average number of yards gained, and the average number of yards allowed
during a recent NFL season.
⋮ ⋮ ⋮ ⋮
a. Compare two simple linear regression models where Model 1 predicts Win as a
function of Yards Gained and Model 2 predicts Win as a function of Yards
Allowed.
b. Estimate a linear regression model, Model 3, that applies both Yards Gained
and Yards Allowed to forecast Win. Is this model an improvement over the other
two models? Explain.
26. FILE Ownership. In order to determine if the homeownership rate in the U.S. is
linked with income, state-level data on the homeownership rate (Ownership in %)
and median household income (Income in $) were collected. A portion of the data
is shown in the accompanying table.
⋮ ⋮ ⋮
⋮ ⋮ ⋮ ⋮
⋮ ⋮ ⋮
page 229
page 230
FILE
College
EXAMPLE 6.4
Let’s revisit Model 3, from Section 6.2.
Model 3: Earnings = β0 + β1Cost + β2Grad + β3Debt + β4City
+ɛ
page 231
If, for example, the slope coefficient β1 equals zero, then the
predictor variable x1 basically drops out of the equation, implying that
x1 does not influence y. In other words, if β1 equals zero, then there
is no linear relationship between x1 and y. Conversely, if β1 does not
equal zero, then x1 influences y.
In general, when we want to test whether the population
coefficient βj is different from, greater than, or less than βj0, where βj0
is the hypothesized value of βj, then the competing hypotheses take
one of the following forms:
H0: βj = βj0 versus HA: βj ≠ βj0 (two-tailed test),
H0: βj ≤ βj0 versus HA: βj > βj0 (right-tailed test), or
H0: βj ≥ βj0 versus HA: βj < βj0 (left-tailed test).
page 232
The value of the test statistic is calculated as tdf = (bj − βj0)/se(bj), where se(bj) is the standard error of the estimator bj and df = n − k − 1.
EXAMPLE 6.5
Let’s again revisit Model 3.
Model 3: Earnings = β0 + β1Cost + β2Grad + β3Debt + β4City + ε
H0: β1 = 0
HA: β1 ≠ 0
We would also like to point out that for a one-tailed test with βj0 = 0,
there are rare instances when the computer-generated p-value is
invalid. This occurs when the sign of bj (and the value of the
accompanying test statistic) is consistent with the null hypothesis
rather than the alternative. For example, for a right-tailed test, H0: βj ≤ 0 and HA: βj
> 0, the null hypothesis cannot be rejected if the estimate bj (and the
value of the accompanying test statistic tdf) is negative. Similarly, no
further testing is necessary if bj > 0 (and thus tdf > 0) for a left-tailed
test. In these rare instances, the reported p-value is invalid.
page 234
FILE
J&J
EXAMPLE 6.6
Johnson & Johnson (J&J) was founded more than 120 years
ago on the premise that doctors and nurses should use sterile
products to treat people’s wounds. Since that time, J&J
products have become staples in most people’s homes.
Consider the CAPM where the J&J risk-adjusted stock return R
− Rf is used as the response variable and the risk-adjusted
market return RM − Rf is used as the predictor variable. A
portion of 60 months of data is shown in Table 6.9.
Month Year R - Rf RM - Rf
⋮ ⋮ ⋮ ⋮
a. The estimate for the beta coefficient is 0.7503 and its standard
error is 0.1391. In order to determine whether the beta
coefficient is significantly less than one, we formulate the
hypotheses as
H0: β ≥ 1
HA: β < 1
Given 60 data points, df = n − k − 1 = 60 − 1 − 1 = 58. We
cannot use the test statistic value or the p-value reported in
Table 6.10 because the hypothesized value of β is not zero.
Using the reported estimate and standard error, we find the value of the test
statistic as t58 = (0.7503 − 1)/0.1391 ≈ −1.80; the p-value is then the left-tail probability P(T58 ≤ t58).
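A minimal R sketch of this calculation, using the rounded estimate and standard error, is shown below; results based on unrounded figures may differ slightly.

> b <- 0.7503; se <- 0.1391       # estimated beta coefficient and its standard error
> t58 <- (b - 1)/se; t58          # value of the test statistic for H0: beta >= 1
> pt(t58, df = 58)                # left-tailed p-value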
page 235
City NA NA 2,526.7888*
(0.024)
NOTES: Parameter estimates are in the top half of the table with the p-values in
parentheses; * represents significance at the 5% level. NA denotes not applicable. The
lower part of the table contains goodness-of-fit measures.
Africa Studio/Shutterstock
This regression equation implies that if a college's average annual cost is $25,000,
its graduation rate is 60%, its percentage of students paying down debt is 80%, and
it is located in a city, then average post-college earnings for its students are
$45,409.
Further testing of this preferred model revealed that the four predictor variables
were jointly significant. Individual tests of significance showed that Cost, Grad, and
City were significant at the 5% level; Debt was not significant in explaining Earnings.
The coefficient of determination, or R2, revealed that approximately 43% of the
sample variability in annual post-college earnings is explained by the model. Thus,
57% of the sample variability in annual post-college earnings remains unexplained.
This is not entirely surprising because factors not included in the model, such as
field of emphasis, grade point average, and natural ability, influence annual post-
college earnings.
EXERCISES 6.3
Applications
29. In order to examine the relationship between the selling price of a used car and its
age, an analyst uses data from 20 recent transactions and estimates Price = β0 +
β1Age + ɛ. A portion of the regression results is shown in the accompanying table.
page 237
32. A researcher estimates the following model relating the return on
a firm’s stock as a function of its price-to-earnings ratio and its price-to-sales ratio:
Return = β0 + β1P/E + β2P/S + ɛ. A portion of the regression results is shown in
the accompanying table.
Month Year R - Rf RM - Rf
⋮ ⋮ ⋮ ⋮
a. Estimate the CAPM model for Caterpillar, Inc. Show the regression results in a
well-formatted table.
b. At the 5% significance level, determine if investment in Caterpillar is riskier than
the market (beta significantly greater than 1).
c. At the 5% significance level, is there evidence of abnormal returns?
38. FILE Arlington. A Realtor examines the factors that influence the price of a
house in Arlington, Massachusetts. He collects data on 36 house sales (Price in
$) and notes each house’s square footage (Sqft), the number of
bedrooms (Beds), the number of bathrooms (Baths), and a
page 238
Colonial dummy variable (Colonial equals 1 if house is colonial, 0 otherwise). A
portion of the data is shown in the accompanying table.
94 92 5
74 90 3
⋮ ⋮ ⋮
63 64 2
15.90 24 12 7
23.70 52 17 14
⋮ ⋮ ⋮ ⋮
26.70 43 2 2
⋮ ⋮ ⋮ ⋮
a. Estimate and interpret a sample regression equation using the 10-year yield as
the response variable and the three-month yield as the predictor variable.
b. Interpret the coefficient of determination.
c. At the 5% significance level, is the three-month yield significant in explaining
the 10-year yield?
d. Many wonder whether a change in the three-month yield implies the same
change in the 10-year yield. At the 5% significance level, is this belief supported
by the data?
42. FILE BMI. According to the World Health Organization, obesity has reached
epidemic proportions globally. While obesity has generally been linked with
chronic disease and disability, researchers argue that it may also affect salaries.
In other words, the body mass index (BMI) of an employee is a predictor for
salary. (A person is considered overweight if his/her BMI is at least 25 and obese
if BMI exceeds 30.) The accompanying table shows a portion of
salary data (in $1,000s) for 30 college-educated men with their
page 239
respective BMI and a dummy variable that equals 1 for a white man and 0
otherwise.
Salary BMI White
34 33 1
43 26 1
⋮ ⋮ ⋮
45 21 1
a. Estimate a model for Salary using BMI and White as the predictor variables.
Determine if BMI influences salary at the 5% level of significance.
b. What is the estimated salary of a white college-educated man with a BMI of 30?
Compute the corresponding salary of a nonwhite man.
43. FILE Wage. A researcher wonders whether males get paid more, on average,
than females at a large firm. She interviews 50 employees and collects data on
each employee’s hourly wage (Wage in $), years of higher education (Educ),
years of experience (Exper), age (Age), and a Male dummy variable that equals 1
if male, 0 otherwise. A portion of the data is shown in the accompanying table.
Production Time Machine Parts Manual Parts
9.1 275 14
10.8 446 12
⋮ ⋮ ⋮
15.5 618 16
376 75 0
433 78 0
⋮ ⋮ ⋮
401 68 0
a. Estimate the regression equation relating vehicles serviced to the four predictor
variables.
b. Interpret each of the slope coefficients.
c. At the 5% significance level, are the predictor variables jointly significant? Are
they individually significant? What about at the 10% significance level?
d. What proportion of the variability in vehicles serviced is explained by the four
predictor variables?
e. Predict vehicles serviced in a nonwinter month for a particular location with five
garage bays, a population of 40,000, and convenient interstate access.
47. FILE Industry. Consider a regression model that links a CEO’s compensation (in
$ millions) with the total assets of the firm (in $ millions) and the firm’s industry.
Dummy variables are used to represent four industries: Manufacturing Technology
d1, Manufacturing Other d2, Financial Services d3, and Nonfinancial Services d4.
A portion of the data for the 455 CEOs is shown in the accompanying table.
a. Estimate the model: y = β0 + β1x + β2d1 + β3d2 + β4d3 + ɛ,
where y and x denote compensation and assets,
respectively. Here the reference category is the nonfinancial
services industry.
b. Interpret the estimated coefficients.
c. Use a 5% level of significance to determine which industries,
relative to the nonfinancial services industry, have different
executive compensation.
d. Reformulate the model to determine, at the 5% significance
level, if compensation is higher in Manufacturing Other than
in Manufacturing Technology. Your model must account for
total assets and all industry types.
48. FILE Retail. A government researcher is analyzing the relationship between
retail sales (in $ millions) and the gross national product (GNP in $ billions). He
also wonders whether there are significant differences in retail sales related to the
quarters of the year. He collects 10 years of quarterly data. A portion is shown in
the accompanying table.
⋮ ⋮ ⋮ ⋮
19.00 64 0 1
19.30 43 1 3
⋮ ⋮ ⋮ ⋮
20.24 36 1 0
a. Use the data to model life expectancy at 65 on the basis of Income, Female,
and Drinks.
b. Conduct a one-tailed test at α = 0.01 to determine if females live longer than
males.
c. Predict the life expectancy at 65 of a male with an income of $40,000 and an
alcohol consumption of two drinks per day; repeat the prediction for a female.
page 241
50. FILE SAT_3. A researcher from the Center for Equal
Opportunity wants to determine if SAT scores of admitted students at a large state
university differed by ethnicity. She collects data on SAT scores and ethnic
background for 200 admitted students. A portion of the data is shown in the
accompanying table.
SAT Ethnicity
1515 White
1530 Latinx
⋮ ⋮
1614 White
a. Estimate the model y = β0 + β1d1 + β2d2 + β3d3 + ɛ, where y represents a
student’s SAT score; d1 equals 1 if the student is white, 0 otherwise; d2 equals
1 if the student is black, 0 otherwise; and d3 equals 1 if the student is Asian, 0
otherwise. Note that the reference category is Latinx. What is the predicted SAT
score for an Asian student? For a Latinx student?
b. At the 5% significance level, determine if the SAT scores of Asian students
differ from those of Latinx students.
c. Reformulate the model to determine if the SAT scores of white students are
lower than the SAT scores of Asian students at the 5% significance level. Your
model must account for all ethnic categories.
page 242
RESIDUAL PLOTS
For the regression model, y = β0 + β1x1 + β2x2 + . . . + βkxk + ɛ, a residual plot graphs the residuals e = y − ŷ against each predictor variable xj or against the predicted values ŷ.
Such plots are useful for detecting departures from linearity as well
as constant variability. If the regression is based on time series data,
we can plot the residuals sequentially to detect if the observations
are correlated.
Residual plots can also be used to detect outliers. Recall from
Chapter 3 that outliers are observations that stand out from the rest
of the data. For an outlier observation, the resulting residual will
appear distinct in a plot; it will stand out from the rest. While outliers
can greatly impact the estimates, it is not always clear what to do
with them. Outliers may indicate bad data due to incorrectly recorded
(or included) observations in the data set. In such cases, the relevant
observation should be corrected or simply deleted. Alternatively,
outliers may just be due to random variations, in which case the
relevant observations should remain. In any event, residual plots
help us identify potential outliers so that we can take corrective
actions, if needed.
In Figure 6.7, we present a hypothetical residual plot when none
of the assumptions has been violated. Note that all the points are
randomly dispersed around the 0 value of the residuals. Also, there
is no evidence of outliers because no residual stands out from the
rest. Any discernible pattern of the residuals indicates that one or
more assumptions have been violated.
page 243
Detection
We can use residual plots to identify nonlinear patterns. Linearity is
justified if the residuals are randomly dispersed across the values of
a predictor variable. A discernible trend in the residuals is indicative
of nonlinear patterns.
FILE
Happiness_Age
EXAMPLE 6.7
A sociologist wishes to study the relationship between age and
happiness. He interviews 24 individuals and collects data on
age and happiness, measured on a scale from 0 to 100. A
portion of the data is shown in Table 6.12. Examine the linearity
assumption in the regression model, Happiness = β0 + β1Age +
ɛ.
TABLE 6.12 Happiness and Age
Happiness Age
62 49
66 51
⋮ ⋮
72 69
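One way to examine the linearity assumption is to estimate the simple linear regression model and plot the residuals against Age. The sketch below assumes the data file is in CSV format with columns named Happiness and Age, matching Table 6.12.

> myData <- read.csv("Happiness_Age.csv")            # import the Happiness_Age data
> Simple <- lm(Happiness ~ Age, data = myData)       # estimate Happiness = b0 + b1*Age
> plot(myData$Age, resid(Simple), xlab = "Age", ylab = "Residuals")
> abline(h = 0)                                      # reference line at zero

A clear curve in this plot, rather than a random scatter around zero, would indicate a departure from linearity.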
Remedy
Linear regression models are often used as a first pass for most
empirical work. In many instances, they provide a very good
approximation for the actual relationship. However, if residual plots
exhibit strong nonlinear patterns, the inferences made by a linear
regression model can be quite misleading. In such instances, we
should employ nonlinear regression methods based on simple
transformations of the response and the predictor variables; these
methods are discussed in the next chapter.
Detection
The detection methods for multicollinearity are mostly informal. If we
find a high R2 and a significant F statistic coupled with individually
insignificant predictor variables, then multicollinearity may be an
issue. We can also examine the correlations between the predictor
variables to detect severe multicollinearity. One
guideline suggests that multicollinearity is severe if the
sample correlation coefficient between any two predictor variables is
more than 0.80 or less than −0.80. The variance inflation factor (VIF)
is another measure that can detect a high correlation between three
or more predictor variables even if no pair of predictor variables has
a particularly high correlation. The smallest possible value of VIF is
one (absence of multicollinearity). Multicollinearity may be a problem
if the VIF exceeds five or 10. A more detailed discussion of the VIF is
beyond the scope of this text.
FILE
Home_Values
EXAMPLE 6.8
Examine the multicollinearity issue in a linear regression model
that uses median home values (in $) as the response variable
and median household incomes (in $), per capita incomes (in
$), and the percentage of owner-occupied homes as the
predictor variables. A portion of 2010 data for all states in the
United States is shown in Table 6.13.
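One informal check is to compute the pairwise correlations among the predictor variables. The sketch below assumes the file is in CSV format and uses placeholder column names (Household_Income, PerCapita_Income, and Ownership); the actual names in the Home_Values data may differ.

> myData <- read.csv("Home_Values.csv")     # import the Home_Values data
> cor(myData[ , c("Household_Income", "PerCapita_Income", "Ownership")])   # correlation matrix of the predictors

A correlation above 0.80 (or below −0.80) between any two predictor variables would suggest severe multicollinearity.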
Remedy
Inexperienced researchers tend to include too many predictor
variables in their quest not to omit anything important and, in doing
so, may include redundant variables that essentially measure the
same thing. When confronted with multicollinearity, a good remedy is
to drop one of the collinear variables. The difficult part is to decide
which of the collinear variables is redundant and, therefore, can
safely be removed. Another option is to obtain more data because
the sample correlation may get weaker as we include more
observations. Sometimes it helps to express the predictor variables
differently so that they are not collinear. At times, the best approach
may be to do nothing when there is a justification to include all
predictor variables. This is especially so if the estimated model yields
a high R2, which implies that the estimated model is good for making
predictions.
Detection
We can use residual plots to gauge changing variability. The
residuals are generally plotted against each predictor variable xj; for
a multiple regression model, we can also plot them against the
predicted value . There is no violation if the residuals are randomly
Sales Sqft
140 1810
160 2500
⋮ ⋮
110 1470
page 247
Remedy
As mentioned earlier, in the presence of changing variability, the
OLS estimators are unbiased, but their estimated standard errors are
inappropriate. Therefore, OLS still provides reasonable coefficient
estimates, but the t and the F tests are no longer valid. This has
prompted some researchers to use the OLS estimates along with a
correction for the standard errors, often referred to as robust
standard errors. Unfortunately, the current version of Excel does not
include a correction for the standard errors. However, it is available
on many statistical computer packages, including R. At the end of
this section, we use R to make the necessary correction in Example
6.9. With robust standard errors, we can then perform legitimate t-
tests.
Detection
We can plot the residuals sequentially over time to look for
correlated observations. If there is no violation, then the residuals
should show no pattern around the horizontal axis. A violation is
indicated when a positive residual in one period is followed by
positive residuals in the next few periods, followed by
negative residuals for a few periods, then positive
residuals, and so on. Although not as common, a violation is also
indicated when a positive residual is followed by a negative residual,
then a positive residual, and so on.
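As a minimal R sketch of this sequential residual plot, assuming a hypothetical model estimated with time series data stored in time order in a data frame myData:
> TSModel <- lm(y ~ x1 + x2, data = myData)            # hypothetical time series regression
> plot(resid(TSModel), type = "b"); abline(h = 0)      # residuals plotted sequentially over time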
FILE
Sushi_Restaurant
EXAMPLE 6.10
Consider y = β0 + β1x1 + β2x2 + ɛ, where y represents sales (in
$1,000s) at a sushi restaurant and x1 and x2 represent
advertising costs (in $) and the unemployment rate (in %),
respectively. A portion of monthly data from January 2018 to
May 2019 is given in Table 6.16. Inspect the behavior of the
residuals to comment on serial correlation.
Remedy
It is important that we include all relevant predictor variables in the
regression model. An important first step before running a regression
model is to compile a comprehensive list of potential predictor
variables. We can then build down to perhaps a smaller list of
predictor variables using the adjusted R2 criterion. Sometimes, due
to data limitations, we are unable to include all relevant variables.
For example, innate ability may be an important predictor variable for
a model that explains salary, but we are unable to include it because
innate ability is not observable. In such instances, we use a
technique called the instrumental variable technique, which is
outside the scope of this text.
Summary
It takes practice to become an effective user of the regression
methodology. We should think of regression modeling as an iterative
process. We start with a clear understanding of what the regression
model is supposed to do. We define the relevant response variable
and compile a comprehensive list of potential predictor variables.
The emphasis should be to pick a model that makes economic and
intuitive sense and avoid predictor variables that more or less
measure the same thing, thus causing multicollinearity. We then
apply this model to data and refine and improve its fit. Specifically,
from the comprehensive list, we build down to perhaps a smaller list
of predictor variables using significance tests and goodness-of-fit
measures such as the standard error of the estimate and the
adjusted R2. It is important that we explore residual plots to look for
signs of changing variability and correlated observations in cross-
sectional and time series studies, respectively. If we identify any of
these two violations, we can still trust the point estimates of the
regression coefficients. However, we cannot place much faith in the
standard t or F tests of significance unless we employ the necessary
correction.
FILE
Convenience_Stores
(Intercept) Sqft
(Intercept) 91.45127988 -6.095285e-02
Sqft -0.06095285 4.149344e-05
The above output represents the variance-covariance matrix, where the diagonal
elements contain the variances and the off-diagonal elements contain the covariances of
the OLS estimators. Because we are interested in the standard errors, we simply take
the square roots of the diagonal values of the matrix (see values in boldface). In order to
find the standard errors, labeled as Simple_SE, enter:
> Simple_SE <- diag(vcovHC(Simple,type="HC1"))^0.5
> Simple_SE
(Intercept) Sqft
9.56301625 0.00644154
The corrected standard errors for the intercept and Sqft are 9.5630 and 0.0064,
respectively. Recall from Example 6.9 that the OLS-generated standard errors for the
intercept and Sqft were 10.4764 and 0.0057, respectively. So now we can easily
compute the t-test of significance using the OLS estimates with the corrected standard
errors.
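For readers replicating this correction from the start, a minimal sketch of the full workflow follows; the read.csv import step and file name are assumptions, while the model and variable names (Simple, Sales, Sqft) follow Example 6.9. The sandwich and lmtest packages are assumed to be installed.
> library(sandwich); library(lmtest)
> myData <- read.csv("Convenience_Stores.csv")             # assumed file name and format
> Simple <- lm(Sales ~ Sqft, data = myData)
> coeftest(Simple, vcov = vcovHC(Simple, type = "HC1"))    # t-tests with robust standard errors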
The output represents the variance-covariance matrix, where the diagonal elements
contain the variances and the off-diagonal elements contain the covariances of the OLS
estimators. Because we are interested in the standard errors, we simply take the square
roots of the diagonal values of the matrix (see values in boldface). In order to find the
standard errors, labeled as Multiple_SE, enter:
> Multiple_SE <- diag(NeweyWest(Multiple))^0.5
> Multiple_SE
(Intercept) AdsCost Unemp
4.773375748 0.006961572 0.354646071
The corrected standard errors for the intercept, AdsCost, and Unemp are 4.7734,
0.0070, and 0.3546, respectively. Recall from Example 6.10 that the OLS-generated
standard errors for the intercept, AdsCost, and Unemp were 3.9817, 0.0068, and 0.2997,
respectively. The corrected standard errors are all higher than the OLS estimates, which
is typically what we expect when correlated observations are an issue in a regression
model. We can now easily compute the t-test of significance using the OLS estimates
with the corrected standard errors.
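A corresponding sketch for the time series case follows; the read.csv import step, the file name, and the Sales column name are assumptions, while the model name and the AdsCost and Unemp variables follow Example 6.10.
> library(sandwich); library(lmtest)
> myData <- read.csv("Sushi_Restaurant.csv")                # assumed file name and format
> Multiple <- lm(Sales ~ AdsCost + Unemp, data = myData)    # Sales is an assumed column name
> coeftest(Multiple, vcov = NeweyWest(Multiple))            # t-tests with Newey-West corrected standard errors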
EXERCISES 6.4
Mechanics
51. Using 20 observations, the multiple regression model y = β0 + β1x1 + β2x2 + ɛ
was estimated. A portion of the regression results is as follows:
a. At the 5% significance level, are the predictor variables jointly significant?
b. At the 5% significance level, is each predictor variable individually significant?
c. What is the likely problem with this model?
52. A simple linear regression, y = β0 + β1x + ɛ, is estimated with cross-sectional
data. The resulting residuals e along with the values of the predictor variable x are
shown in the accompanying table.
a. Graph the residuals against the predictor variable x and look for any discernible pattern.
b. Which assumption is being violated? Discuss its consequences and suggest a
possible remedy.
Applications
54. FILE Television. Like books and stories, television not only entertains, it also
exposes a child to new information about the world. While watching too much
television is harmful, a little bit may actually help. Researcher Matt Castle gathers
information on the grade point average (GPA) of 28 middle-school children and
the number of hours of television they watched per week. Examine the linearity
assumption in the regression model GPA = β0 + β1Hours + ɛ.
55. FILE Delivery. Quick2U, a delivery company, would like to standardize its
delivery charge model for shipments (Charge in $) such that customers will better
understand their delivery costs. Three predictor variables are used: (1) distance
(in miles), (2) shipment weight (in lbs), and (3) number of boxes. A sample of 30
recent deliveries is collected; a portion of the data is shown in the accompanying
table.
Charge Distance Weight Boxes
92.50 29 183 1
157.60 96 135 3
⋮ ⋮ ⋮ ⋮
143.00 47 117 7
59. FILE Healthy_Living. Healthy living has always been an important goal for any
society. Consider a regression model that conjectures that fruits and vegetables
(FV) and regular exercising have a positive effect on health and smoking has a
negative effect on health. The sample consists of the percentage
of these variables observed in various states in the United
States. A portion of the data is shown in the accompanying table.
⋮ ⋮ ⋮
17235 33 15 5.0
19854 42 25 5.0
⋮ ⋮ ⋮ ⋮
22571 44 21 5.0
Therefore, if the sample size is sufficiently large, statistical significance does not
necessarily imply that a relationship is economically meaningful. It is for this reason
that when confronted with big data and assessing various models, we tend to rely on
economic intuition and model validation rather than tests of significance. In Chapters
7 through 10, we will rely on cross-validation techniques in order to assess various
models.
FILE
House_Price
Case Study
Develop a predictive model for the price of a house in the college town of Ames,
Iowa. Before evaluating various models, you first have to filter out the House_Price
data to get the appropriate subset of observations for selected variables. After you
have obtained the preferred model, summarize your findings as well as predict the
price of a house in Ames, Iowa, given typical values of the predictor variables.
Shutterstock/Dmytro Zinkevych
TABLE 6.17 The Mean (Median) of Variables for New, Old, and All
Houses
The average sale price for the newer houses is substantially more than
that for the older houses. For all houses, given that the mean is higher than
the median, the house price distribution is positively skewed, indicating that a
few expensive houses have pulled up the mean above the median. The
square footage and the lot size are also positively skewed. Finally, relatively
newer houses have more bedrooms, bathrooms, and square footage but a
smaller lot size. This is consistent with a 2017 article in Building magazine
that found that newer houses have become 24% bigger over the past 15
years, while lot sizes have shrunk 16%.
In order to analyze the factors that may influence the price of a house, the
following linear regression model with all the referenced predictor variables is
considered:
where New is a dummy variable that equals 1 if the house was built in 2000
or after, 0 otherwise. It is expected that Beds, Bath, Sqft, and Lsize will have
a positive relationship with Price; that is, a house with more bedrooms and
bathrooms is expected to obtain a higher price than one with fewer bedrooms
and bathrooms. Similarly, a bigger house, or one on a bigger lot, is expected
to obtain a higher price. A newer house, one with all the latest updates, is
expected to obtain a higher price as compared to an older house in need of
work. Column 2 of Table 6.18 shows the regression results from estimating
this complete model.
Beds 3,124.63 NA
(0.623)
se 64,612.08 64,492.0062
R2 0.6689 0.6685
Notes: Parameter estimates are in the top half of the table with the p-values
in parentheses; * represents significance at the 5% level. NA denotes not
applicable. The lower part of the table contains goodness-of-fit measures.
All predictor variables with the exception of Beds are correctly signed and
statistically significant. Perhaps the lack of significance of Beds is due to
multicollinearity because the number of bedrooms is likely to be correlated
with the number of bathrooms as well as square footage. An alternative
explanation might be that additional bedrooms add value only in houses with
large square footage. For comparison, a restricted model is estimated that
omits Beds from the list of predictor variables; see Column 3 of Table 6.18 for
the results.
The following observations are made:
Report 6.1 FILE House_Price. Choose two comparable college towns. Find the
model that best predicts the sale price of a house. Make sure to include a dummy
variable that accounts for the possibility of differences in the sale price due to the
location.
Report 6.2 FILE College_Admissions. Choose a college of interest and use the
sample of enrolled students to best predict a student’s college grade point average.
Use goodness-of-fit measures to find the best predictive model. In order to estimate
these models, you have to first filter the data to include only the enrolled students.
Report 6.3 FILE NBA. Develop and compare two models for predicting a player’s
salary based on offensive performance measures and defensive performance
measures, respectively. In order to estimate these models, you have to first filter the
data to include only career statistics based on regular seasons. Exclude players with
no information on salary.
Report 6.5 FILE TechSales_Reps. Develop a linear regression model for predicting
the salary of a sales representative for both software and hardware industries.
Create dummy variables to capture personality type and discuss their role, along with
the other relevant predictor variables, in predicting salary.
7 Advanced Regression
Analysis
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 7.1 Estimate and interpret regression models
with interaction variables.
LO 7.2 Estimate and interpret nonlinear
regression models.
LO 7.3 Estimate and interpret linear probability
and logistic regression models.
LO 7.4 Use cross-validation techniques to
evaluate regression models.
©JrCasas/Shutterstock
INTRODUCTORY CASE
Gender Gap in Manager Salaries
The salary difference between men and women has shrunk
over the years, particularly among younger workers, but it still
persists. According to a Pew Research Center analysis,
women earned 82% of what men earned in 2017
(http://www.pewresearch.org, April 9, 2018). Ara Lily is
completing her MBA degree from Bentley University, located
just outside Boston. She is upset that the gender gap in
salaries continues to exist in the American workplace. For a
class project, she decides to analyze the gender gap in salaries
of project managers. Ara gains access to the salary (in
$1,000s) for 200 project managers in small- to middle-sized
firms in the Boston area. In addition, she has data on the firm
size, the manager’s experience (in years), whether the
manager is a female (Female equals 1 if female, 0 otherwise),
and whether the manager has a graduate degree (Grad equals
1 if graduate degree, 0 otherwise). Table 7.1 shows a portion of
the data.
FILE
Gender_Gap
The partial effect of d1 on ŷ, given by b1 + b3d2, equals b1 if d2 = 0 and b1 + b3 if d2 = 1. Similarly, the partial effect of d2 on ŷ is b2 + b3d1, which depends on the value of d1. Models with the interaction
variables are easy to estimate. In addition, tests of significance can
be conducted on all variables, including the interaction variable.
Example 7.1 illustrates the interaction between two dummy
variables.
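As a minimal R sketch, a model with the interaction of two dummy variables can be estimated with a single lm call; the variable names anticipate Example 7.1, and the read.csv step and file name are assumptions.
> myData <- read.csv("Salary_MIS.csv")                                          # assumed file name
> Model <- lm(Salary ~ GPA + MIS + Statistics + MIS:Statistics, data = myData)
> summary(Model)     # the MIS:Statistics row reports the test of significance on the interaction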
EXAMPLE 7.1
According to the National Association of Colleges and
Employers, in 2016 the top-paid business graduates at the
bachelor’s degree level were expected to be those with a
concentration in management information systems (MIS). This
was due, in large part, to the concentration’s linkage with the
exploding field of data analytics. At a University of California
campus, data were collected on the starting salary of business
graduates (Salary in $1,000s) along with their cumulative GPA,
whether they have an MIS concentration (MIS = 1 if yes, 0
otherwise), and whether they have a statistics minor (Statistics
= 1 if yes, 0 otherwise). A portion of the data is shown in Table
7.2.
FILE
Salary_MIS
72 3.53 1 0
66 2.86 1 0
⋮ ⋮ ⋮ ⋮
66 3.65 0 0
EXAMPLE 7.2
Important risk factors for high blood pressure reported by the
National Institute of Health include weight and ethnicity. High
blood pressure is common in adults who are overweight and
are African American. According to the American Heart
Association, the systolic pressure (top number) should be
below 120. In a recent study, a public policy
researcher in Atlanta surveyed 110 adult men
about 5′10″ in height and in the 55–60 age group. Data were
collected on their systolic pressure, weight (in pounds), and
race (Black = 1 for African American, 0 otherwise); a portion of
the data is shown in Table 7.4.
FILE
BP_Race
TABLE 7.4 Systolic Pressure of Adult Men (n = 110)
196 254 1
151 148 0
⋮ ⋮ ⋮
170 228 0
estimated equation as
The interaction variable is negative and
statistically significant at the 5% level. The negative coefficient
is interesting as it implies that black men carry their weight
better in terms of the systolic pressure than their nonblack
counterparts. Model 2 is more suitable for prediction because
it has a higher value of the adjusted R2 than Model 1 (0.7175
> 0.7072). For a weight of 180 pounds, the predicted systolic
pressure of a black man is 70.8312 + 0.4362 × 180 + 30.2482
× 1 − 0.1118(180 × 1) = 159. The corresponding systolic
pressure for a nonblack man is 149.
EXAMPLE 7.3
The Master of Business Administration (MBA), once a
flagship program in business schools, has lost its appeal in
recent years (The Wall Street Journal, June 5, 2019). While
elite schools like Harvard, Wharton, and Chicago are still
attracting applicants, other schools are finding it much harder
to entice students. As a result, business schools are focusing
on specialized master’s programs to give graduates the extra
skills necessary to be career ready and successful in more
technically challenging fields.
An educational researcher is trying to analyze the
determinants of the applicant pool for the specialized Master of
Science in Accounting (MSA) program at medium-sized
universities in the United States. Two important determinants
are the marketing expense of the business school and the
percentage of the MSA alumni who were employed within three
months after graduation. Consider the data collected on the
number of applications received (Applicants), marketing
expense (Marketing, in $1,000s), and the percentage employed
within three months (Employed); a portion of the data is shown
in Table 7.6.
FILE
Marketing_MSA
60 173 61
71 116 83
⋮ ⋮ ⋮
69 70 92
Both predictor
Interestingly, the
EXAMPLE 7.4
The objective outlined in the introductory case is to analyze a
possible gender gap in the salaries of project managers. Use
the data in Table 7.1 for the following analysis.
a. Evaluate the determinants of a project manager’s salary.
b. Estimate and interpret a regression model with relevant
interaction variables.
c. Determine whether there is evidence of a gender gap in
salaries.
SOLUTION:
R2 0.7437
EXERCISES 7.1
Mechanics
1. Consider a linear regression model where y represents the response variable and
x, d1, and d2 are the predictor variables. Both d1 and d2 are dummy variables,
each assuming values 1 or 0. A regression model with x, d1, d2, and d1d2, where
2. Consider a linear regression model where y represents the response variable and
x and d are the predictor variables; d is a dummy variable assuming values 1 or 0.
A model with x, d, and the interaction variable xd is estimated as
3. Consider a linear regression model where y represents the response variable and
x1 and x2 are the predictor variables. A regression model with x1, x2, and x1x2,
and 1.
5. FILE Exercise_7.5. The accompanying data file contains 20 observations on the
response variable y along with the predictor variables x and d.
a. Estimate a regression model with the predictor variables x and d, and then
extend it to also include the interaction variable xd.
and 20.
Applications
7. FILE Overweight. According to the U.S. Department of Health and Human
Services, African American women have the highest rates of being overweight
compared to other groups in the United States. Individuals are
considered overweight if their body mass index (BMI) is 25 or
greater. Data are collected from 120 individuals. The following table shows a
portion of data on each individual’s BMI; a Female dummy variable that equals 1 if
the individual is female, 0 otherwise; and a Black dummy variable that equals 1 if
the individual is African American, 0 otherwise.
28.70 0 1
28.31 0 0
⋮ ⋮ ⋮
24.90 0 1
100 0 1
80 1 0
⋮ ⋮ ⋮
80 0 0
a. Estimate and interpret the effect of Female, Minority Male, and the interaction
between Female and Minority Male.
b. Predict the percentage of White and Asian men if the founding members
included a woman but no Black or Hispanic men. Repeat the analysis if the
founding members included a woman and a Hispanic man.
9. FILE IceCream. In a recent survey, ice cream truck drivers in Cincinnati, Ohio,
reported that they make about $280 in income on a typical summer day. The
income was generally higher on days with longer work hours, particularly hot
days, and on holidays. Irma follows an ice cream truck driver for five weeks. She
collects the information on the driver’s daily income (Income), number of hours on
the road (Hours), whether it was a particularly hot day (Hot = 1 if the high
temperature was above 85°F, 0 otherwise), and whether it was a Holiday (Holiday
= 1, 0 otherwise). A portion of the data is shown in the accompanying table.
196 5 1 0
282 8 0 0
⋮ ⋮ ⋮ ⋮
374 6 1 1
a. Estimate and interpret the effect of Hours, Hot, and Holidays on Income. Predict
the income of a driver working 6 hours on a hot holiday. What if it was not a
holiday?
b. Extend the above model to include the interaction between Hot and Holiday.
Predict the income of a driver working 6 hours on a hot holiday. What if it was
not a holiday?
10. FILE Mobile_Devices. Americans are spending an average of two hours and 37
minutes daily on smartphones and other mobile devices to connect to the world of
digital information (Business Insider, May 25, 2017). The usage of mobile devices
is especially high for affluent, college-educated urban/suburban dwellers. A
survey was conducted on customers in the 50-mile radius of Chicago. Participants
were asked the average daily time they spent on mobile devices (Usage, in
minutes), their household income (Income, in $1,000s), if they lived in a rural area
(Rural =1 if rural, 0 otherwise), and if they had a college degree (College = 1 if
college graduate, 0 otherwise). A portion of the data is shown in the
accompanying table.
172 146 0 0
165 198 0 1
⋮ ⋮ ⋮ ⋮
110 31 0 1
a. Estimate and interpret a regression model for the mobile device usage based
on Income, Rural, College, and the interaction between Rural and College.
Explain the rationale for using the interaction variable.
b. Predict the mobile device usage for a college-educated person with a
household income of $120,000 and living in a rural area. What would be the
corresponding usage for someone living in an urban/suburban area?
c. Discuss the impact of a college degree on mobile device usage.
11. FILE Urban. A sociologist is looking at the relationship between consumption
expenditures of families in the United States (Consumption in $), family income
(Income in $), and whether or not the family lives in an urban or rural community
(Urban = 1 if urban, 0 otherwise). She collects data on 50
families across the United States, a portion of which is shown in
the accompanying table.
62336 87534 0
60076 94796 1
⋮ ⋮ ⋮
59055 100908 1
a. Estimate Consumption = β0 + β1Income + ɛ. Compute the predicted
consumption expenditures of a family with income of $75,000.
b. Include a dummy variable Urban to predict consumption for a family with
income of $75,000 in urban and rural communities.
c. Include a dummy variable Urban and an interaction variable (Income × Urban)
to predict consumption for a family with income of $75,000 in urban and rural
communities.
d. Which of the preceding models is most suitable for the data? Explain.
12. FILE Pick_Errors. The distribution center for an online retailer has been
experiencing quite a few “pick errors” (i.e., retrieving the wrong item). Although
the warehouse manager thinks most errors are due to inexperienced workers, she
believes that a training program also may help to reduce them. Before sending all
employees to training, she examines data from a pilot study of 30 employees.
Information is collected on the employee’s annual pick errors (Errors), experience
(Exper in years), and whether or not the employee attended training (Train equals
1 if the employee attended training, 0 otherwise). A portion of the data is shown in
the accompanying table.
13 9 0
3 27 0
⋮ ⋮ ⋮
4 24 1
b. Which model provides a better fit in terms of adjusted R2 and the significance of
the predictor variables at the 5% level?
c. Use the chosen model to predict the number of pick errors for an employee with
10 years of experience who attended the training program, and for an employee
with 20 years of experience who did not attend the training program.
d. Give a practical interpretation for the positive interaction coefficient.
13. FILE BMI. According to the World Health Organization, obesity has reached
epidemic proportions globally. While obesity has generally been linked with
chronic disease and disability, researchers argue that it may also affect wages. In
other words, the body mass index (BMI) of an employee is a predictor for salary.
(A person is considered overweight if his/her BMI is at least 25 and obese if BMI
exceeds 30.) The accompanying table shows a portion of salary data (in $1,000s)
for 30 college-educated men with their respective BMI and a dummy variable that
represents 1 for a white man and 0 otherwise.
34 33 1
43 26 1
⋮ ⋮ ⋮
45 21 1
a. Estimate a model for Salary with BMI and White as the predictor variables.
b. Reestimate the model with BMI, White, and a product of BMI and White as the
predictor variables.
c. Which of the models is more suitable? Explain. Use this model to estimate the
salary for a white college-educated man with a BMI of 30. Compute the
corresponding salary for a nonwhite man.
14. FILE Compensation. To encourage performance, loyalty, and continuing
education, the human resources department at a large company wants to develop
a regression-based compensation model (Comp in $ per year) for midlevel
managers based on three variables: (1) business-unit profitability (Profit in $1000s
per year), (2) years with the company (Years), and (3) whether or not the manager
has a graduate degree (Grad equals 1 if graduate degree, 0 otherwise). The
accompanying table shows a portion of data collected for 36 managers.
118100 4500 37 1
90800 5400 5 1
⋮ ⋮ ⋮ ⋮
85000 4200 29 0
33.93 7.14 0
18.68 −26.39 0
⋮ ⋮ ⋮
0.08 −29.41 1
a. Estimate a model with the initial return as the response variable and the price
revision and the high-tech dummy variable as the predictor variables.
b. Reestimate the model with price revision along with the dummy variable and the
product of the dummy variable and the price revision.
c. Which of these models is the preferred model? Explain. Use this model to
estimate the initial return for a high-tech firm with a 15% price revision.
Compute the corresponding initial return for a firm that is not high tech.
16. FILE GPA_College. It is often claimed that the SAT is an important indicator of
the grades students will earn in the first year of college. The admission officer at a
state university wants to analyze the relationship between the first-year GPA in
college (College GPA) and the SAT score of the student (SAT), the unweighted
high school GPA (HS GPA), and the student’s race (White equals 1 if the student
is white, 0 otherwise). A portion of the data is shown in the accompanying table.
⋮ ⋮ ⋮ ⋮
52 58 80 0
55 68 43 0
⋮ ⋮ ⋮ ⋮
58 90 35 0
a. Estimate and interpret a regression model for Health using Social, Income, and
College as the predictor variables. Predict the health rating of a college-
educated person given Social = 80 and Income = 100. What if the person is not
college-educated?
b. Estimate and interpret an extended model that includes the interactions of
Social with Income and Social with College. Predict the health rating of a
college-educated person given Social = 80 and Income = 100. What if the
person is not college-educated?
c. Explain which of the above two models is preferred for making predictions.
18. FILE College. With college costs and student debt on the rise, the choices that
families make when searching for and selecting a college have never been more
important. Consider the information from 116 colleges on annual post-college
earnings of graduates (Earnings in $), the average annual cost (Cost in $), the
graduation rate (Grad in %), the percentage of students paying down debt (Debt
in %), and whether or not a college is located in a city (City equals 1 if a city
location, 0 otherwise). A portion of the data is shown in the accompanying table.
a. Estimate a regression model for the annual post-college earnings on the basis
of Cost, Grad, Debt, City, and the interaction between Cost and Grad. Interpret
the estimated coefficient of the interaction variable.
b. Predict the annual post-college earnings of graduates of a college in a city with
a graduation rate of 60%, Debt of 80% and the average annual cost of $20,000,
$30,000, and $40,000.
c. Repeat the above analysis with a graduation rate of 80%.
19. FILE Rental. Jonah has worked as an agent for rental properties in Cincinnati,
Ohio, for almost 10 years. He understands that the premium for additional
bedrooms is more in larger as compared to smaller homes. Consider the monthly
data on rent (Rent, in $) along with square footage (Sqft), number of bedrooms
(Bed), and number of bathrooms (Bath). A portion of the data is shown in the
accompanying table.
Rent Sqft Bed Bath
2300 2050 3 3
1240 680 2 1
⋮ ⋮ ⋮ ⋮
2800 2370 4 3
a. Estimate a regression model for monthly rent on the basis of Bed, Bath, Sqft,
and the interaction between Bed and Sqft. Interpret the estimated coefficient of
the interaction variable.
b. Predict the monthly rent of homes with a square footage of 1,600, 2 baths, and
the number of bedrooms equal to 2, 3, and 4. Compute the incremental
predicted rent as the number of bedrooms increases from 2 to 3 and 3 to 4.
c. Repeat the above analysis with a square footage of 2,400.
7.2 REGRESSION MODELS FOR
NONLINEAR RELATIONSHIPS
LO 7.2
Estimate and interpret nonlinear regression models.
FIGURE 7.3 Scatterplot of y against x with trendline generated from estimating the
quadratic regression model
It is important to be able to determine whether a quadratic regression
model provides a better fit than the linear regression model. Recall
from Chapter 6 that we cannot compare these models on the basis
of their respective R2 values because the quadratic regression
model uses one more parameter than the linear regression model.
For comparison purposes, we use adjusted R2, which imposes a
penalty for the additional parameter.
In order to estimate the quadratic regression model y = β0 + β1x + β2x2 + ε, we have to first create a variable x2 that contains the squared values of x. The quadratic model is estimated in the usual way as ŷ = b0 + b1x + b2x2, where b1 and b2 are the estimates of β1 and β2, respectively.
The predicted value ŷ reaches a maximum (b2 < 0) or minimum (b2 > 0) when the partial effect of x on ŷ, given by b1 + 2b2x, equals zero. The value of x when this happens is x = −b1/(2b2).
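For illustration, a minimal R sketch of these steps follows; the variable names (AC, Output) anticipate Example 7.5, and the data frame myData is assumed to already contain the data.
> myData$Outputsq <- myData$Output^2                   # create the squared term
> Quad <- lm(AC ~ Output + Outputsq, data = myData)
> b <- coef(Quad)
> -b["Output"]/(2*b["Outputsq"])                       # value of Output at the minimum (b2 > 0) or maximum (b2 < 0)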
EXAMPLE 7.5
Consider the quadratic regression of the average cost (AC, in
$) on annual output (Output, in millions of units). The estimated
regression equation using the predictor variables Output and
Output2 is given by It is
found that both predictor variables are statistically
significant at the 5% level, thus confirming the
quadratic effect. Use the estimates to answer the following
questions.
a. What is the change in average cost going from an output level
of four million units to five million units?
b. What is the change in average cost going from an output level
of eight million units to nine million units? Compare this result
to the result found in part a.
c. What is the output level that minimizes average cost?
SOLUTION:
The predicted average cost for a firm that produces five million
units is:
EXAMPLE 7.6
In the United States, age discrimination is illegal, but its
occurrence is hard to prove (The New York Times, August 7,
2017). Even without discrimination, it is widely believed that
wages of workers decline as they get older. A young worker
can expect wages to rise with age only up to a certain point,
beyond which wages begin to fall. Ioannes Papadopoulos
works in the human resources department of a large
manufacturing firm and is examining the relationship between
wages (in $), years of education, and age. Specifically, he
wants to verify the quadratic effect of age on wages. He
gathers data on 80 workers in his firm with information on their
hourly wage, education, and age. A portion of the data is
shown in Table 7.9.
FILE
Wages
TABLE 7.9 Hourly Wage of Americans (n = 80)
17.54 12 76
20.93 10 61
⋮ ⋮ ⋮
23.66 12 49
a. Plot Wage against Age and evaluate whether the linear or the
quadratic regression model better captures the relationship.
Verify your choice by using the appropriate goodness-of-fit
measure.
b. Use the appropriate model to predict hourly wages for
someone with 16 years of education and age equal to 30, 50,
or 70.
c. According to the model, at what age will someone with 16
years of education attain the highest wages?
SOLUTION:
FIGURE 7.6 Scatterplot of y against x with trendline generated from estimating the
log-log regression model
FIGURE 7.7 Scatterplot of y against x with trendline generated from estimating the
logarithmic regression model
FIGURE 7.8 Scatterplot of y against x with trendline generated from estimating the
exponential regression model
For the exponential model, β1 × 100 measures the approximate
percentage change in E(y) when x increases by one unit; the exact
percentage change can be calculated as (exp(β1) − 1) × 100. For
example, a value of β1 = 0.05 implies that a one-unit increase in x
leads to an approximate 5% (= 0.05 × 100) increase in E(y), or more
precisely, a 5.1271% (= (exp(0.05) − 1) × 100) increase in E(y). In
applied work, we often see this model used to describe the rate of
growth of certain economic variables, such as population,
employment, salaries, and sales. As in the case of the log-log
regression model, we make a correction for making predictions
because the response variable is measured in logs.
[Summary table: for each of the linear, logarithmic, exponential, and log-log models, the interpretation of the estimated slope coefficient and the form of the predicted value. For the linear model y = β0 + β1x + ε, b1 measures the change in ŷ when x increases by one unit.]
It is advisable to use unrounded coefficients for making predictions.
FILE
AnnArbor
EXAMPLE 7.7
Real estate investment in college towns promises good returns
(The College Investor, August 22, 2017). First, students offer a
steady stream of rental demand as cash-strapped public
universities are unable to house their students beyond
freshman year. Second, this demand is projected to grow as
more children of baby boomers head to college. Table 7.12
shows a portion of rental data for Ann Arbor, Michigan, which is
home to the main campus of the University of Michigan. The
data include the monthly rent (Rent, in $), the number of
bedrooms (Beds), the number of bathrooms (Baths), and the
square footage (Sqft) for 40 rentals.
Rent Beds Baths Sqft
645 1 1 500
675 1 1 648
⋮ ⋮ ⋮ ⋮
FIGURE 7.9 Comparing Rent against (a) Beds and (b) Baths
In Models 1 and 2, we use Rent as the response
variable with Beds and Baths, along with Sqft in Model 1 and
ln(Sqft) in Model 2, as the predictor variables. Similarly, in
Models 3 and 4, we use ln(Rent) as the response variable with
Beds and Baths, along with Sqft in Model 3 and ln(Sqft) in
Model 4, as the predictor variables. Model estimates are
summarized in Table 7.14.
y and
Example 7.8 elaborates on the method with the use of Excel and
R.
precise estimate, you can use the unrounded value for se from
the regression output.) The third column of Table 7.15 shows a
portion of the results.
TABLE 7.15 Excel-Produced Predicted Values for Model 4
Rent
⋮ ⋮ ⋮
a. Import the AnnArbor data into a data frame (table) and label it
myData.
b. As shown in Chapter 6, we use the lm function to create a
regression model or, in R terminology, an object. We label this
object as Model4. We also use the log function for the natural
log transformation of the variables Rent and Sqft. Enter:
> Model4 <- lm(log(Rent) ~ Beds + Baths + log(Sqft), data = myData)
c. We use the predict function to compute the predicted values of ln(Rent). Enter:
> Pred_lnRent<-predict(Model4)
d. Next, we want to calculate the predicted values for Rent:
> SE<-summary(Model4)$sigma
> Pred_Rent<-exp(Pred_lnRent+SE^2/2)
e. Finally, we use R's cor function to calculate the correlation between Rent and Pred_Rent. We square this value to find the goodness-of-fit measure in terms of Rent, which allows a fair comparison with the models that use Rent as the response variable.
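Putting steps a through e together, a minimal end-to-end sketch would be as follows (the read.csv import and file name are assumptions about the data format):
> myData <- read.csv("AnnArbor.csv")                                 # assumed file name and format
> Model4 <- lm(log(Rent) ~ Beds + Baths + log(Sqft), data = myData)
> Pred_lnRent <- predict(Model4)
> SE <- summary(Model4)$sigma
> Pred_Rent <- exp(Pred_lnRent + SE^2/2)                             # correction because the response is in logs
> cor(myData$Rent, Pred_Rent)^2                                      # goodness of fit in terms of Rent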
EXERCISES 7.2
Mechanics
20. Consider the estimated quadratic model
Linear Quadratic
x 0.3392 4.0966
x2 NA −0.2528
R2 0.1317 0.5844
a. Use the appropriate goodness-of-fit measure to justify which model fits the data
better.
b. Given the best-fitting model, predict y for x = 4, 8, and 12.
22. Consider the sample regressions for the linear, the logarithmic, the exponential,
and the log-log models. For each of the estimated models, predict y when x
equals 50.
23. Consider the following sample regressions for the linear and the logarithmic
models.
Linear Logarithmic
x 1.0607 NA
ln(x) NA 10.5447*
se 2.4935 1.5231
R2 0.8233 0.9341
Log-Log Exponential
x NA 0.0513
ln(x) 0.3663 NA
se 0.3508 0.2922
R2 0.5187 0.6660
Applications
25. FILE Television. Numerous studies have shown that watching too much
television hurts school grades. Others have argued that television is not
necessarily a bad thing for children (Psychology Today, October 22, 2012). Like
books and stories, television not only entertains, it also exposes a child to new
information about the world. While watching too much television is harmful, a little
bit may actually help. Researcher Matt Castle gathers information on the grade
point average (GPA) of 28 middle school children and the number of hours of
television they watched per week. A portion of the data is shown in the
accompanying table.
GPA Hours
3.24 19
3.10 21
⋮ ⋮
3.31 4
a. Estimate a quadratic regression model where the GPA of middle school children
is regressed on hours and hours-squared.
b. Is the quadratic term in this model justified? Explain.
c. Find the optimal number of weekly hours of TV for middle school children.
26. FILE Sales_Reps. Brendan Connolly manages the human resource division of a
high-tech company. He has access to the salary information of 300 sales reps
along with their age, sex (Female), and the net promoter score (NPS) that
indicates customer satisfaction. A portion of the data is shown in the
accompanying table.
Salary Age Female NPS
97000 44 0 9
50000 34 0 4
⋮ ⋮ ⋮ ⋮
88000 36 0 10
a. Estimate and interpret a quadratic model using the natural log of salary as the
response variable and Age, Age2, Female, and NPS as the predictor variables.
b. Determine the optimal level of age at which the natural log of salary is
maximized.
c. At the optimal age, predict the salary of male and female sales reps with NPS =
8.
27. FILE Fertilizer2. A horticulturist is studying the relationship between tomato
plant height and fertilizer amount. Thirty tomato plants grown in similar conditions
were subjected to various amounts of fertilizer (in ounces) over a four-month
period, and then their heights (in inches) were measured. A portion of the data is
shown in the accompanying table.
Height Fertilizer
20.4 1.9
29.1 5.0
⋮ ⋮
36.4 3.1
Time Parts
30.8 62
9.8 32
⋮ ⋮
29.8 60
a. Estimate the linear regression model to predict time as a function of the number
of parts (Parts). Then estimate the quadratic regression model to predict time
as a function of Parts and Parts squared.
b. Evaluate the two models in terms of variable significance (α = 0.05) and
adjusted R2.
c. Use the best-fitting model to predict how long it would take to build a circuit
board consisting of 48 parts.
29. FILE Inventory_Cost. The inventory manager at a warehouse distributor wants
to predict inventory cost (Cost in $) based on order quantity (Quantity in units).
She thinks it may be a nonlinear relationship because its two primary components
move in opposite directions: (1) order processing cost (costs of procurement
personnel, shipping, transportation), which decreases as order quantity increases
(due to fewer orders needed), and (2) holding cost (costs of capital, facility,
warehouse personnel, equipment), which increases as order quantity increases
(due to more inventory held). She has collected monthly inventory costs and order
quantities for the past 36 months. A portion of the data is shown in the
accompanying table.
Cost Quantity
844 54.4
503 52.1
⋮ ⋮
870 55.5
28 503 1
33 534 1
⋮ ⋮ ⋮
24 518 1
a. Estimate the linear model y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ɛ. What is the
predicted price if x1 = 16, x2 = 600, x3 = 120, and x4 = 20?
b. Estimate the exponential model ln(y) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ɛ.
What is the predicted price if x1 = 16, x2 = 600, x3 = 120, and x4 = 20?
16747 46 22 75
7901 31 24 98
⋮ ⋮ ⋮ ⋮
11380 56 28 84
2950 4 4 1453
2400 4 2 1476
⋮ ⋮ ⋮ ⋮
744 2 1 930
a. Estimate the linear model that uses Rent as the response variable. Estimate the
exponential model that uses log of Rent as the response variable.
b. Compute the predicted rent for a 1,500-square-foot house with 3 bedrooms and
2 bathrooms for the linear and the exponential models (ignore the significance
tests).
c. Use R2 to select the appropriate model for prediction.
34. FILE Savings_Rate. The accompanying table shows a portion of the monthly
data on the personal savings rate (Savings in %) and personal disposable income
(Income in $ billions) in the U.S. from January 2007 to November 2010.
⋮ ⋮ ⋮
a. Estimate the linear model Savings = β0 + β1Income + ɛ and the log-log model,
ln(Savings) = β0 + β1ln(Income) + ɛ. For each model, predict Savings if Income
= $10,500.
b. Which is the preferred model? Explain.
35. FILE Learning_Curve. Learning curves are used in production operations to
estimate the time required to complete a repetitive task as an operator gains
experience. Suppose a production manager has compiled 30 time values (in
minutes) for a particular operator as she progressed down the learning curve
during the first 100 units. A portion of this data is shown in the accompanying
table.
Time per Unit Unit Number
18.30 3
17.50 5
⋮ ⋮
5.60 100
a. Create a scatterplot of time per unit against units built. Superimpose the linear
trendline and the logarithmic trendline to determine visually the best-fitting
model.
b. Estimate the simple linear regression model and the logarithmic regression
model for predicting time per unit using unit number as the predictor variable.
c. Based on R2, use the best-fitting model to predict the time that was required for
the operator to build Unit 50.
36. FILE Happiness. Numerous attempts have been made to understand
happiness. Because there is no unique way to quantify it, researchers often rely
on surveys to capture a subjective assessment of well-being. One study finds that
holding everything else constant, people seem to be least happy when they are in
their 40s (Psychology Today, April 27, 2018). Another study suggests that money
does buy happiness, but its effect diminishes as incomes rise above $75,000 a
year (Money Magazine, February 14, 2018). Consider survey data of 100 working
adults’ self-assessed happiness on a scale of 0 to 100, along with their age and
annual income. A portion of the data is shown in the accompanying table.
Happiness Age Income
69 49 52000
83 47 123000
⋮ ⋮ ⋮
79 31 105000
a. Estimate and interpret a regression model for Happiness based on Age, Age2,
and ln(Income).
b. Predict happiness with Income equal to $80,000 and Age equal to 30, 45, and
60 years.
c. Predict happiness with Age equal to 60 and Income equal to $25,000, $75,000,
$125,000.
37. FILE Production_Function. Economists often examine the relationship between
the inputs of a production function and the resulting output. A common way of
modeling this relationship is referred to as the Cobb-Douglas production function.
This function can be expressed as ln(Q) = β0 + β1ln(L) + β2ln(K) + ɛ, where Q
stands for output, L for labor, and K for capital. The accompanying table lists a
portion of data relating to the U.S. agricultural industry in the year 2004.
⋮ ⋮ ⋮ ⋮
⋮ ⋮ ⋮ ⋮
probability of success.
EXAMPLE 7.9
FILE
Mortgage
y x1 x2
1 16.35 49.94
1 34.43 56.16
⋮ ⋮ ⋮
0 17.85 26.86
Furthermore, for any given slope, we can find some value of x for
which the predicted probability is outside the [0,1] interval. For
meaningful analysis, we would like a nonlinear specification that
constrains the predicted probability between 0 and 1.
Consider the following logistic specification: P = exp(β0 + β1x)/(1 + exp(β0 + β1x)), where P denotes the probability of success, P(y = 1). Figure 7.11 compares the predicted probabilities implied by the linear probability model and the logistic regression model, given b1 > 0. Note that in
the linear probability model, the probability falls below 0 for small
values of x and exceeds 1 for large values of x. The probabilities
implied by the logistic regression model, however, are always
constrained in the [0,1] interval. (For ease of exposition, we use the
same notation to refer to the coefficients in the linear probability
model and the logistic regression model. We note, however, that
these coefficients and their estimates have a different meaning
depending on which model we are referencing.)
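In R, the two competing models can be estimated as follows; the read.csv import step and file name are assumptions, and the variable names (y, x1, x2) follow Example 7.9.
> myData <- read.csv("Mortgage.csv")                                   # assumed file name
> LPM <- lm(y ~ x1 + x2, data = myData)                                # linear probability model
> Logit <- glm(y ~ x1 + x2, data = myData, family = binomial)          # logistic regression model
> predict(Logit, data.frame(x1 = 30, x2 = 30), type = "response")      # predicted probability, always within [0, 1]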
FIGURE 7.11 Predicted probabilities for linear probability and logistic regression
models, with b1 > 0
will not be the same if x increases from 20 to 21. We can show that
EXAMPLE 7.10
Let’s revisit Example 7.9.
a. Estimate and interpret the logistic regression model for the
loan approval outcome y based on the applicant’s percentage
of down payment x1 and the income-to-loan ratio x2.
b. For an applicant with a 30% income-to-loan ratio, predict loan
approval probabilities with down payments of 20% and 30%.
c. Compare the predicted probabilities based on the estimated
logistic regression model with those from the estimated linear
probability model in Example 7.9.
SOLUTION:
x1 x2 LPM Logistic
4 30 −0.02 0.03
20 30 0.28 0.21
30 30 0.47 0.51
50 30 0.85 0.94
60 30 1.04 0.98
As discussed earlier, with the linear probability model, the
predicted probabilities can be negative or greater than one.
The probabilities based on the logistic regression model stay
between zero and one for all possible values of the predictor
variables. Therefore, whenever possible, it is preferable to use
the logistic regression model over the linear probability model
in binary choice models.
= 1.
It is common to assess the performance of linear probability and
logistic regression models on the basis of the accuracy rates defined
as the percentage of correctly classified observations. Using a
default cutoff of 0.5, we compare the binary values of the response variable y with the binary predicted values that equal one if the predicted probability is 0.5 or more, and zero otherwise.
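A minimal R sketch of this computation, reusing the illustrative Logit object and myData data frame from the earlier sketch, might be:
> phat <- predict(Logit, type = "response")        # predicted probabilities for the sample observations
> yhat <- ifelse(phat >= 0.5, 1, 0)                # binary predictions with the 0.5 cutoff
> mean(yhat == myData$y)                           # accuracy rate: proportion correctly classified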
EXAMPLE 7.11
For the mortgage approval example, compare the accuracy
rates of the estimated linear probability model (LPM) with the
estimated logistic regression model.
SOLUTION:
In order to compute the accuracy rates, we first find the
predicted approval probabilities given the sample values of the
predictor variables. For the first sample observation, we find
the predicted approval as:
Because the predicted values for both models are greater than
0.5, their corresponding binary predicted values are one.
Predictions for other sample observations are computed
similarly; see Table 7.20 for a portion of these predictions.
. See Figure 7.12. Click on the ellipsis next to the Data range and
highlight cells A1:C31. Make sure that the box preceding First Row
Contains Headers is checked. The Variables in Input Data box will
populate. Select and move variables x1 and x2 to Selected
Variables box and y to Output Variable box. Accept other defaults
and click Next.
Note that the coefficient estimates are the same as
those reported in Table 7.18. In order to find the standard errors, we
take the positive square root of the diagonal elements of the
variance-covariance matrix. For example, the standard error of
which is the same as in Table 7.18. The z-
value is calculated as
EXERCISES 7.3
Mechanics
39. Consider a binary response variable y and a predictor variable x that varies
between 0 and 4. The linear probability model is estimated as
x 0.05 0.26
(0.06) (0.02)
a. Test for the significance of the intercept and the slope coefficients at the 5%
level in both models.
b. What is the predicted probability implied by the linear probability model for x =
20 and x = 30?
c. What is the predicted probability implied by the logistic regression model for x =
20 and x = 30?
41. Consider a binary response variable y and two predictor variables x1 and x2. The
following table contains the parameter estimates of the linear probability model
(LPM) and the logistic regression model, with the associated p-values shown in
parentheses.
x1 0.32 0.98
(0.04) (0.06)
x2 −0.04 −0.20
(0.01) (0.01)
Applications
45. FILE Purchase. Annabel, a retail analyst, has been following Under Armour,
Inc., the pioneer in the compression-gear market. Compression garments are
meant to keep moisture away from a wearer’s body during athletic activities in
warm and cool weather. Annabel believes that the Under Armour brand attracts a
younger customer, whereas the more established companies, Nike and Adidas,
draw an older clientele. In order to test her belief, she collects data on the age of
the customers and whether or not they purchased Under Armour (Purchase; 1 for
purchase, 0 otherwise). A portion of the data is shown in the accompanying table.
Purchase Age
1 30
0 19
⋮ ⋮
1 24
a. Estimate the linear probability model using Under Armour as the response
variable and Age as the predictor variable.
b. Compute the predicted probability of an Under Armour purchase for a 20-year-
old customer and a 30-year-old customer.
c. Test Annabel’s belief that the Under Armour brand attracts a younger customer,
at the 5% level.
46. FILE Purchase. Refer to the previous exercise for a description of the data set.
Estimate the logistic regression model where the Under Armour purchase
depends on age.
a. Compute the predicted probability of an Under Armour purchase for a 20-year-
old customer and a 30-year-old customer.
b. Test Annabel’s belief that the Under Armour brand attracts a
younger customer, at the 5% level.
47. FILE Parole. More and more parole boards are using risk assessment tools
when trying to determine an individual’s likelihood of returning to crime (Prison
Legal News, February 2, 2016). Most of these models are based on a range of
character traits and biographical facts about an individual. Many studies have
found that older people are less likely to re-offend than younger ones. In addition,
once released on parole, women are not likely to re-offend. A sociologist collects
data on 20 individuals who were released on parole two years ago. She notes if
the parolee committed another crime over the last two years (Crime equals 1 if
crime committed, 0 otherwise), the parolee’s age at the time of release, and the
parolee’s sex (Male equals 1 if male, 0 otherwise). The accompanying table
shows a portion of the data.
1 25 1
0 42 1
⋮ ⋮ ⋮
0 30 1
a. Estimate the linear probability model where crime depends on age and the
parolee’s sex.
b. Are the results consistent with the claims of other studies with respect to age
and the parolee’s sex?
c. Predict the probability of a 25-year-old male parolee committing another crime;
repeat the prediction for a 25-year-old female parolee.
48. FILE Parole. Refer to the previous exercise for a description of the data set.
a. Estimate the logistic regression model where crime depends on age and the
parolee’s sex.
b. Are the results consistent with the claims of other studies with respect to age
and the parolee’s sex?
c. Predict the probability of a 25-year-old male parolee committing another crime;
repeat the prediction for a 25-year-old female parolee.
49. FILE Health_Insurance. According to the 2017 census, just over 90% of
Americans have health insurance (CNBC, May 22, 2018). However, a higher
percentage of Americans on the lower end of the economic spectrum are still
without coverage. Consider a portion of data in the following table relating to
insurance coverage (1 for coverage, 0 for no coverage) for 30 working individuals
in Atlanta, Georgia. Also included in the table is the percentage of the premium
paid by the employer and the individual’s income (in $1,000s).
Insurance Premium Percentage Income
1 0 88
0 0 60
⋮ ⋮ ⋮
0 60 60
a. Analyze the linear probability model for insurance coverage with premium
percentage and income used as the predictor variables.
b. Consider an individual with an income of $60,000. What is the probability that
she has insurance coverage if her employer contributes 50% of the premium?
What if her employer contributes 75% of the premium?
50. FILE Health_Insurance. Refer to the previous exercise for a description of the
data set. Estimate the logistic regression model where insurance coverage
depends on premium percentage and income. Consider an individual with an
income of $60,000. What is the probability that she has insurance coverage if her
employer contributes 50% of the premium? What if her employer contributes 75%
of the premium?
51. FILE Assembly. Because assembly line work can be tedious and repetitive, it is
not suited for everybody. Consequently, a production manager is developing a
binary choice regression model to predict whether a newly hired worker will stay
in the job for at least one year (Stay equals 1 if a new hire stays for at least one
year, 0 otherwise). Three predictor variables will be used: (1) Age; (2) a Female
dummy variable that equals 1 if the new hire is female, 0 otherwise; and (3) an
Assembly dummy variable that equals 1 if the new hire has worked on an
assembly line before, 0 otherwise. The accompanying table shows a portion of
data for 32 assembly line workers.
Stay Age Female Assembly
0 35 1 0
0 26 1 0
⋮ ⋮ ⋮ ⋮
1 38 0 1
a. Estimate and interpret the linear probability model and the logistic regression
model where being on the job one year later depends on Age, Female, and
Assembly.
b. Compute the accuracy rates of both models.
c. Use the preferred model to predict the probability that a 45-year-old female who
has not worked on an assembly line before will still be on the job one year later.
What if she has worked on an assembly line before?
52. FILE CFA. The Chartered Financial Analyst (CFA) designation is the de facto
professional certification for the financial industry. Employers
encourage their prospective employees to complete the CFA
exam. Daniella Campos, an HR manager at SolidRock Investment, is reviewing
10 job applications. Given the low pass rate for the CFA Level 1 exam, Daniella
wants to know whether or not the 10 prospective employees will be able to pass
it. Historically, the pass rate is higher for those with work experience and a good
college GPA. With this insight, she compiles the information on 263 current
employees who took the CFA Level I exam last year, including the employee’s
success on the exam (1 for pass, 0 for fail), the employee’s college GPA, and
years of work experience. A portion of the data is shown in the accompanying
table.
1 3.75 18
0 2.62 17
⋮ ⋮ ⋮
0 2.54 4
a. Estimate the linear probability model to predict the probability of passing the
CFA Level I exam for a candidate with a college GPA of 3.80 and five years of
experience.
b. Estimate the logistic regression model to predict the probability of passing the
CFA Level I exam for a candidate with a college GPA of 3.80 and five years of
experience.
53. FILE Admit. Unlike small selective colleges that pay close attention to personal
statements, teacher recommendations, etc., large, public state university systems
primarily rely on a student’s grade point average (GPA) and scores on the SAT or
ACT for the college admission decisions. Data were collected for 120 applicants
on college admission (Admit equals 1 if admitted, 0 otherwise) along with the
student’s GPA and SAT scores. A portion of the data is shown in the
accompanying table.
1 3.10 1550
0 2.70 1360
⋮ ⋮ ⋮
1 4.40 1320
a. Estimate and interpret the appropriate linear probability and the logistic
regression models.
b. Compute the accuracy rates of both models.
c. Use the preferred model to predict the probability of admission for a college
student with GPA = 3.0 and SAT = 1400. What if GPA = 4.0?
54. FILE Divorce. Divorce has become an increasingly prevalent part of American
society. According to a 2019 Gallup poll, 77% of U.S. adults say divorce is morally
acceptable, which is a 17-point increase since 2001 (Gallup, May 29, 2019). In
general, the acceptability is higher for younger adults who are not very religious. A
sociologist conducts a survey in a small Midwestern town where 200 American
adults are asked about their opinion on divorce (Acceptable equals 1 if morally
acceptable, 0 otherwise), religiosity (Religious equals 1 if very religious, 0
otherwise), and their age. A portion of the data is shown in the accompanying
table.
1 78 0
1 20 0
⋮ ⋮ ⋮
1 22 0
a. Estimate and interpret the appropriate linear probability and the logistic
regression models.
b. Compute the accuracy rates of both models.
c. Use the preferred model to predict the probability that a 40-year-old, very
religious adult will find divorce morally acceptable. What if the adult is not very
religious?
55. FILE STEM. Several studies have reported lower participation in the science,
technology, engineering, and mathematics (STEM) careers by female and
minority students. A high school counselor surveys 240 college-bound students,
collecting information on whether the student has applied to a STEM field (1 if
STEM, 0 otherwise), whether or not the student is female (1 if female, 0
otherwise), white (1 if white, 0 otherwise), and Asian (1 if Asian, 0 otherwise). Also
included in the survey is the information on the student’s high school GPA and the
SAT scores. A portion of the data is shown in the accompanying table.
a. Estimate and interpret the logistic regression model using STEM as the
response variable, and GPA, SAT, White, Female, and Asian as the predictor
variables.
b. Find the predicted probability that a white male student will apply to a STEM
field with GPA = 3.4 and SAT = 1400. Find the corresponding probabilities for
an Asian male and a male who is neither white nor Asian.
c. Find the predicted probability that a white female student will apply to a STEM
field with GPA = 3.4 and SAT = 1400. Find the corresponding probabilities for
an Asian female and a female who is neither white nor Asian.
page 300
OVERFITTING
Overfitting occurs when a regression model is made overly
complex to fit the quirks of given sample data. By making the
model conform too closely to the sample data, its predictive
power is compromised.
A useful method to assess the predictive power of a model is to
test it on a data set not used in estimation. Cross-validation is a
technique that evaluates predictive models by partitioning the
original sample into a training set to build (train) the model and a
validation set to evaluate (validate) it. Although training set
performance can be assessed, it may result in overly optimistic
estimates because it is based on the same data as were used to
build the model. Therefore, a validation set is used to provide an
independent performance assessment by exposing the model to
unseen data. Sometimes, the data are partitioned into an optional
third set called a test data set; further detail is provided in Chapters
8, 9, and 10.
CROSS-VALIDATION
Cross-validation is a technique in which the sample is
partitioned into a training set to estimate the model and a
validation set to assess how well the estimated model predicts
with unseen data.
To assess linear regression models, we compute the root mean square error (RMSE) in the validation set. Recall that RMSE is the square root of mean squared error
(MSE). In addition to RMSE, other important performance measures
include the mean absolute deviation (MAD) and the mean absolute
percentage error (MAPE); these measures are discussed in Chapter
8. To assess binary choice models, we use the accuracy rate
computed as the percentage of correctly classified observations in
the validation set; other performance measures are discussed in
Chapter 8.
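The holdout calculation itself is short in R. The following is a minimal sketch for a linear regression model, assuming a data frame myData with a numerical response y and predictors x1 and x2 (all names here are placeholders, not from a specific data file), and the fixed first-150/last-50 partition used in this chapter:
> TData <- myData[1:150, ]                     # training set
> VData <- myData[151:200, ]                   # validation set
> Model <- lm(y ~ x1 + x2, data = TData)       # estimate with the training set only
> Pred <- predict(Model, VData)                # predict the validation set
> sqrt(mean((VData$y - Pred)^2))               # RMSE in the validation set
A competing model is assessed the same way, and the model with the lower validation RMSE is preferred.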
page 301
EXAMPLE 7.12
The objective outlined in the introductory case is to analyze the
gender gap in salaries of project managers. Consider two
models to analyze a project manager’s salary. For predictor
variables, Model 1 uses Size, Experience, Female, and Grad,
whereas Model 2 also includes the interactions between
Female with Experience, Female with Grad, and Size with
Experience. Use the holdout method to compare the
predictability of both models using the first 150 observations for
training and the remaining 50 observations for validation.
SOLUTION:
We use the training set with 150 observations to estimate
Model 1 and Model 2. The estimates are presented in Table
7.21.
page 302
FILE
AnnArbor
EXAMPLE 7.13
In Example 7.8, we used in-sample measures to compare two
models, where Model 1 used Rent and Model 2 used ln(Rent)
as the response variable. Recall that these models were
referred to as Models 2 and 4, respectively, and the predictor
variables for both models included Beds, Baths, and ln(Sqft).
Use the holdout method to compare the predictability of the two
competing models, using the first 30 observations for training
and the remaining 10 observations for validation.
SOLUTION:
We use the training set with 30 observations to estimate Model
1 and Model 2. The estimates are presented in Table 7.23.
page 303
se 104.2082 0.0854
Notes: Standard error of the estimate se is used for making predictions with Model
2.
We use the estimated models to predict the response variable in the validation set.
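Because Model 2 uses ln(Rent) as the response variable, its predictions must be converted from logs back to dollars before computing validation errors; as the note to Table 7.23 indicates, the standard error of the estimate se enters this conversion. A minimal R sketch of the holdout comparison, assuming the AnnArbor columns are named Rent, Beds, Baths, and Sqft and using the exp(prediction + se²/2) correction commonly applied to log-linear models:
> TData <- myData[1:30, ]                           # training set
> VData <- myData[31:40, ]                          # validation set
> Model1 <- lm(Rent ~ Beds + Baths + log(Sqft), data = TData)
> Model2 <- lm(log(Rent) ~ Beds + Baths + log(Sqft), data = TData)
> se <- summary(Model2)$sigma                       # standard error of the estimate
> Pred1 <- predict(Model1, VData)
> Pred2 <- exp(predict(Model2, VData) + se^2/2)     # convert log predictions back to $
> sqrt(mean((VData$Rent - Pred1)^2))                # validation RMSE, Model 1
> sqrt(mean((VData$Rent - Pred2)^2))                # validation RMSE, Model 2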
page 305
Analytic Solver and R are easy to use for partitioning the data,
estimating models with the training set, and deriving the necessary
cross-validation measures. As mentioned earlier, we generally use
random draws for partitioning the sample data into the training and
validation sets. Here, for replicating purposes, we will continue to
use the latter part of the sample data for the validation set. We use
the Spam data to replicate the results in Example 7.14 for Model 1;
results for Model 2 can be derived similarly.
page 306
Using R
A. Import the Spam data into a data frame (table) and label it myData.
B. We partition the sample into training and validation sets, labeled
TData and VData, respectively. Enter:
> TData <- myData[1:375,]
> VData <- myData[376:500,]
C. We use the training set, TData, to estimate Model 1. Enter:
> Model1 <- glm(Spam ~ Recipients + Hyperlinks + Characters,
family = binomial(link = logit), data = TData)
D. We use the estimates to make predictions for VData and then
convert them into a binary prediction. Finally, we compute the
accuracy rate in the validation set. Enter:
> Pred1 <- predict(Model1, VData, type = "response")
> Binary1 <- round(Pred1)
> 100*mean(VData$Spam == Binary1)
R returns: [1] 68
This is the same as derived for Model 1 in Example 7.14.
FILE
Gender_Gap
EXAMPLE 7.15
In Example 7.12, we used the holdout method to assess two
models for analyzing a project manager’s salary. For
predictors, Model 1 used Size, Experience, Female, and Grad.
Model 2 extends the model by also including the interactions
between Female with Experience, Female with Grad, and Size
with Experience. Use the k-fold cross-validation method to
compare the predictability of the models using k = 4.
SOLUTION:
We assess both models four times with the validation set
formed by the observations 151–200, 101–150, 51–100, and
1–50, respectively. Each time the training set includes the
remaining observations for estimating the regression models.
TABLE 7.28 RMSE for the k-Fold Cross-Validation Method with k = 4, Example
7.15
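The fixed folds used here can be implemented directly in R without additional packages. The following is a minimal sketch for Model 1, assuming the Gender_Gap columns are named Salary, Size, Experience, Female, and Grad (column names are assumptions); Model 2 is handled the same way after adding the interaction terms to the formula:
> folds <- list(151:200, 101:150, 51:100, 1:50)     # the four validation folds
> rmse1 <- sapply(folds, function(v) {
    M <- lm(Salary ~ Size + Experience + Female + Grad, data = myData[-v, ])
    sqrt(mean((myData$Salary[v] - predict(M, myData[v, ]))^2))
  })
> rmse1            # RMSE for each fold, Model 1
> mean(rmse1)      # average RMSE across the four folds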
page 308
EXERCISES 7.4
Mechanics
56. FILE Exercise_7.56. The accompanying data file contains 40 observations on
the response variable y along with the predictor variables x and d. Consider two
linear regression models where Model 1 uses the variables x and d and Model 2
extends the model by including the interaction variable xd. Use the holdout
method to compare the predictability of the models using the first 30 observations
for training and the remaining 10 observations for validation.
57. FILE Exercise_7.57. The accompanying data file contains 40 observations on
the response variable y along with the predictor variables x1 and x2. Use the
holdout method to compare the predictability of the linear model with the
exponential model using the first 30 observations for training and the remaining
10 observations for validation.
58. FILE Exercise_7.58. The accompanying data file contains 40 observations on
the binary response variable y along with the predictor variables x1 and x2. Use
the holdout method to compare the accuracy rates of the linear probability model
with the logistic regression model using the first 30 observations for training and
the remaining 10 observations for validation.
Applications
59. FILE IceCream. The accompanying data file contains 35 observations for an ice
cream truck driver’s daily income (Income in $), number of hours on the road
(Hours), whether it was a particularly hot day (Hot = 1 if the high temperature was
above 85°F, 0 otherwise), and whether it was a holiday (Holiday = 1 if holiday, 0 otherwise).
Consider two models where Model 1 predicts Income on the basis of Hours, Hot,
and Holiday and Model 2 also includes the interaction between Hot and Holiday.
Use the holdout method to compare the predictability of the models using the first
24 observations for training and the remaining 11 observations for validation.
60. FILE Mobile_Devices. The accompanying data file contains survey data for 80
participants with information on average daily time spent on mobile devices
(Usage, in minutes), household income (Income, in $1,000s), lived in a rural area
(Rural = 1 if rural, 0 otherwise), and had a college degree (College = 1 if college
graduate, 0 otherwise). Consider two predictive models for mobile device usage
where Model 1 is based on Income, Rural, and College and Model 2 also includes
the interaction between Rural and College.
a. Use the holdout method to compare the predictability of the models using the
first 60 observations for training and the remaining 20 observations for
validation.
b. Use the k-fold cross-validation method to compare the predictability of the
models using k = 4.
61. FILE BMI. The accompanying data file contains salary data (in $1,000s) for 30
college-educated men with their respective BMI and a dummy variable that
represents 1 for a white man and 0 otherwise. Model 1 predicts Salary using BMI
and White as predictor variables. Model 2 includes BMI and White along with an
interaction between the two.
a. Use the holdout method to compare the predictability of the models using the
first 20 observations for training and the remaining 10 observations for
validation.
b. Use the k-fold cross-validation method to compare the predictability of the
models using k = 3.
62. FILE Pick_Errors. The accompanying data file contains information on 30
employee’s annual pick errors (Errors), experience (Exper in years), and whether
or not the employee attended training (Train equals 1 if the employee attended
training, 0 otherwise). Model 1 predicts Errors using Exper and Train as predictor
variables. Model 2 includes Exper and Train along with an interaction between the
two.
a. Use the holdout method to compare the predictability of the models using the
first 20 observations for training and the remaining 10 observations for
validation.
b. Use the k-fold cross-validation method to compare the predictability of the
models using k = 3.
63. FILE Health_Factors. The accompanying data file contains survey information
on 120 American adults who rated their health (Health) and social connections
(Social) on a scale of 1 to 100. The file also contains information on their
household income (Income, in $1,000s) and college education (College equals 1
if they have completed a bachelor’s degree, 0 otherwise). Consider two linear
regression models for Health. Model 1 uses Social, Income, and College as the
predictor variables, whereas Model 2 also includes the interactions of Social with
Income and Social with College.
a. Use the holdout method to compare the predictability of the models using the
first 90 observations for training and the remaining 30 observations for
validation.
b. Use the k-fold cross-validation method to compare the predictability of the
models using k = 4.
64. FILE Rental. The accompanying data file contains monthly data on rent (Rent, in
$) along with square footage (Sqft), number of bedrooms (Bed), and number of
bathrooms (Bath) of 80 rental units. Consider two linear regression models for
Rent. Model 1 uses Bed, Bath, and Sqft, whereas Model 2 also allows the
interaction between Bed and Sqft.
a. Use the holdout method to compare the predictability of the models using the
first 60 observations for training and the remaining 20 observations for
validation.
b. Use the k-fold cross-validation method to compare the predictability of the
models using k = 4.
page 309
65. FILE Crew_Size. The accompanying data file contains weekly
data on crew size (the number of workers) and productivity (jobs/week) over the
past 27 weeks. A linear regression model and a quadratic regression model are
considered for predicting productivity on the basis of crew size.
a. Use the holdout method to compare the predictability of the models using the
first 18 observations for training and the remaining 9 observations for validation.
b. Use the k-fold cross-validation method to compare the predictability of the
models using k = 3.
66. FILE Happiness. The accompanying data file contains information on 100
working adults’ self-assessed happiness on a scale of 0 to 100, along with their
age and annual income. Consider two quadratic models for happiness. Model 1 is
based on age, age², and income. Model 2 is based on age, age², and ln(income).
a. Use the holdout method to compare the predictability of the models using the
first 75 observations for training and the remaining 25 observations for
validation.
b. Use the k-fold cross-validation method to compare the predictability of the
models using k = 4.
67. FILE Electricity. Consider a regression model for predicting monthly electricity
cost (Cost in $) on the basis of average outdoor temperature (Temp in °F),
working days per month (Days), and tons of product produced (Tons). The
accompanying data file contains data on 80 observations. Use the holdout
method to compare the predictability of the linear and the exponential regression
models using the first 60 observations for training and the remaining 20
observations for validation.
68. FILE Arlington_Homes. The accompanying data file contains information on the
sale price (in $) for 36 single-family homes in Arlington, Massachusetts. In order
to analyze house price, the predictor variables include the house’s square footage
(Sqft), the number of bedrooms (Beds), the number of bathrooms (Baths), and
whether or not it is a colonial (Col = 1 if colonial, 0 otherwise).
a. Use the holdout method to compare the predictability of the linear and the
exponential models using the first 24 observations for training and the
remaining 12 observations for validation.
b. Use the k-fold cross-validation method to compare the predictability of the
models using k = 3.
69. FILE Purchase. Consider the accompanying data to predict an Under Armour
purchase (Purchase; 1 for purchase, 0 otherwise) on the basis of customer age.
a. Use the holdout method to compare the accuracy rates of the linear probability
model (Model 1) and the logistic regression model (Model 2) using the first 20
observations for training and the remaining 10 observations for validation.
b. Use the k-fold cross-validation method to compare the accuracy of the models
using k = 3.
70. FILE Divorce. Consider the accompanying data to analyze how people view
divorce (Acceptable equals 1 if morally acceptable, 0 otherwise) on the basis of
age and religiosity (Religious equals 1 if very religious, 0 otherwise).
a. Use the holdout method to compare the accuracy rates of two competing
logistic models for divorce, using the first 150 observations for training and the
remaining 50 observations for validation. Model 1 uses Age and Religious as
predictor variables, whereas Model 2 also includes the interaction between Age
and Religious.
b. Use the k-fold cross-validation method to compare the accuracy of the models
using k = 4.
As mentioned in Chapter 6, when using regression analysis with big data, the
emphasis is placed on the model’s predictability and not necessarily on tests of
significance. In this section, we will use logistic regression models for predicting
college admission and enrollment decisions. Before running the models, we have to
first filter out the College_Admission data to get the appropriate subset of
observations for selected variables. We encourage you to replicate the results in the
report.
Case Study
Create a sample report to analyze admission and enrollment decisions at the school
of arts & letters in a selective four-year college in North America. For predictor
variables, include the applicant’s sex, ethnicity, grade point average, and SAT scores.
Make predictions for the admission probability and the enrollment probability using
typical values of the predictor variables.
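Before these models can be estimated in R, the data must be filtered to the relevant applicants and the two logistic regressions specified. The following is a minimal sketch under assumed column names (School, Admit, Enroll, Male, White, Asian, GPA, SAT), which may differ from the names in the actual College_Admission file:
> myData <- subset(collegeData, School == "Arts & Letters")       # applicants to this school only
> Admit_Model <- glm(Admit ~ Male + White + Asian + GPA + SAT,
    family = binomial(link = logit), data = myData)
> Enroll_Model <- glm(Enroll ~ Male + White + Asian + GPA + SAT,
    family = binomial(link = logit), data = subset(myData, Admit == 1))
> newObs <- data.frame(Male = 1, White = 1, Asian = 0, GPA = 3.8,
    SAT = seq(1000, 1600, by = 100))
> predict(Admit_Model, newObs, type = "response")                 # predicted admission probabilities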
page 310
Sample Report—College Admission and
Enrollment
College admission can be stressful for both students and parents as there is
no magic formula when it comes to admission decisions. Two important
factors considered for admission are the student’s high school record and
performance on standardized tests. According to the National Association for
College Admission Counseling (NACAC), a student’s high school record
carries more weight than standardized test scores.
Of the 6,964 students who applied to the school of arts & letters, 30.76%
were males; in addition, the percentages of white and Asian applicants were
55.59% and 12.42%, respectively, with about 32.00% from other ethnicities.
The average applicant had a GPA of 3.50 and an SAT score of 1146. Table
7.29 also shows that 1,739 (or 24.97%) applicants were granted admission,
of which 401 (23.06%) decided to enroll. As expected, the average GPA and SAT scores of admitted applicants are higher than those of all applicants; the same holds for enrolled students, but to a lesser extent.
Two logistic regression models are estimated using the same predictor
variables, one for predicting the admission probability and the other for
predicting the enrollment probability. The entire pool of 6,964 applicants is
used for the first regression, whereas 1,739 admitted applicants are used for
the second regression. The results are presented in Table 7.30.
Accuracy (%) 81 77
Notes: Parameter estimates are in the top half of the table with the z-statistics
given in parentheses; * represents significance at the 5% level. Accuracy (%)
measures the percentage of correctly classified observations.
With accuracy rates of 81% and 77%, respectively, both models do a good
job with predicting probabilities. It seems that the sex of the applicant plays
no role in the admission or enrollment decisions.
Interestingly, both white and Asian applicants have a
lower probability of admission than those from other ethnicities. Perhaps this
is due to affirmative action, whereby colleges admit a proportionally higher
percentage of underrepresented applicants. As expected, quality applicants,
in terms of both GPA and SAT, are pursued for admission.
On the enrollment side, admitted applicants who are white are more likely
to enroll than all other admitted applicants. Finally, admitted applicants with
high GPA and high SAT scores are less likely to enroll at this college. This is
not surprising because academically strong applicants will have many offers,
which lowers the probability that an applicant will accept the admission offer
of a particular college.
In order to further interpret the influence of SAT scores on college
admission and enrollment, we compute predicted admission and enrollment
probabilities for representative males from all ethnicities with a GPA of 3.8
and SAT scores varying between 1000 and 1600. The results are shown in
Figures 7.18 and 7.19.
Report 7.1 FILE College_Admissions. Choose a college of interest and use the
sample of enrolled students to best predict a student’s college grade point average.
Explore interactions of the relevant predictor variables and use cross validation to
select the best predictive model. In order to estimate these models, you have to first
filter the data to include only the enrolled students.
Report 7.2 FILE House_Price. Choose two comparable college towns. Develop a
predictive model for the sale price of a house for each college town. Explore log-
linear transformations, dummy variables, and interactions of the relevant predictor
variables. Use cross validation to select the best predictive model.
Report 7.3 FILE NBA. Develop a model for predicting a player’s salary. For
predictor variables, consider age (quadratic effect), height, weight, and relevant
performance measures. Use cross validation to assess the choice of predictor
variables as well as the functional form (linear or exponential). In order to estimate
these models, you will have to first filter the data to include only career statistics
based on regular seasons. Exclude players with no information on salary.
Report 7.5 FILE TechSales_Reps. The net promoter score (NPS) is a key indicator
of customer satisfaction and loyalty. Use data on employees in the software product
group with a college degree to develop the logistic regression model for predicting if
a sales rep will score an NPS of 9 or more. Use cross validation to select the
appropriate predictor variables. In order to estimate this model, you have to first
construct the (dummy) target variable, representing NPS ≥ 9 and subset the data to
include only the employees who work in the software product group with a college
degree.
page 313
For illustration, we use the Spam data that was used to assess
Model 1 and Model 2 in Example 7.14. The first three steps are
similar to those used when assessing the linear regression model.
page 314
A. Import the Spam data into a data frame (table) and label it
myData.
B. Install and load the caret package. Enter:
> install.packages("caret")
> library(caret)
C. We use the trainControl function to specify a 4-fold cross-validation. Enter:
> myControl <- trainControl(method = "cv", number = 4)
D. Before estimating the logistic regression model, we must convert
y from numeric type into factor type so that R treats it as a
categorical variable with two classes; in other words, a dummy
variable. We use the as.factor function to accomplish this task.
Enter:
> myData$Spam <- as.factor(myData$Spam)
> Model1 <- train(Spam ~ Recipients + Hyperlinks + Characters,
data = myData, trControl = myControl, method = "glm", family
= binomial(link = logit), metric = "Accuracy")
> Model1
> Model2 <- train(Spam ~ Recipients + Hyperlinks, data =
myData, trControl = myControl, method = "glm", family =
binomial(link = logit), metric = "Accuracy")
> Model2
page 315
page 316
8 Introduction to Data
Mining
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 8.1 Describe the data mining process.
LO 8.2 Implement similarity measures.
LO 8.3 Assess the predictive performance of data
mining models.
LO 8.4 Conduct principal component analysis.
page 317
INTRODUCTORY CASE
Social Media Marketing
Alissa Bridges is the marketing director of FashionTech, an
online apparel retailer that specializes in activewear for both
men and women. The target market for FashionTech includes
individuals between the ages of 18 and 35 with an active
and/or outdoors lifestyle who look for both fashion and value in
their apparel purchase. The company markets its products via
a variety of media channels including TV ads, quarterly
catalogs, product placements, search engines, and social
media. Alissa has hired a social media marketing firm,
MarketWiz, to develop predictive models that would help
FashionTech acquire new customers as well as increase sales
from existing customers. Using FashionTech’s historical social
media marketing and sales data, MarketWiz develops two
types of predictive models.
A classification model that predicts potential customers’
purchase probability from FashionTech within 30 days of
receiving a promotional message in their social media
account.
Two prediction models that predict the one-year purchase
amounts of customers acquired through social media
channels.
In order to assess the performance of the predictive models,
Alissa’s team would like to use the validation data set to:
1. Evaluate how accurately the classification model classifies
potential customers into the purchase and no-purchase
classes.
2. Compare the performance of prediction models that estimate
the one-year purchase amounts of customers acquired
through social media channels.
A synopsis of this introductory case is presented at the end of
section 8.3.
page 318
CRISP-DM
CRISP-DM was developed in the 1990s by a group of five
companies: SPSS, Teradata, Daimler AG, NCR, and OHRA. CRISP-
DM consists of six major phases: business understanding, data
understanding, data preparation, modeling, evaluation, and
deployment. The six phases can be summarized as follows:
page 320
The next few sections of this chapter explore three key concepts
relevant to data mining: similarity measures, performance evaluation,
and dimension reduction techniques. These concepts provide the prerequisite
knowledge for understanding the data mining techniques discussed
in Chapters 9, 10, and 11.
page 323
LO 8.2
Implement similarity measures.
Observation x1 x2
1 3 4
2 4 5
3 10 1
The Euclidean distance is the straight-line distance between two observations. For a data set with k variables, the Euclidean distance between the ith and jth observations is calculated as d(i, j) = sqrt((x1i − x1j)² + (x2i − x2j)² + … + (xki − xkj)²), where xki and xkj represent the ith and jth observations for the kth variable.
The Manhattan distance is the shortest distance between two
observations if you are only allowed to move horizontally or
vertically. Referring to the city of Manhattan, which is laid out in
square blocks, the Manhattan distance between two points is the
shortest path for a vehicle to travel from one point to another in the
city. For a data set with k variables, the Manhattan distance between
the ith and jth observations is calculated as d(i, j) = |x1i − x1j| + |x2i − x2j| + … + |xki − xkj|, where xki and xkj represent the ith and jth observations for the kth variable.
page 324
EXAMPLE 8.1
Refer back to Table 8.1, where three observations are reported
for variables x1 and x2.
a. Calculate and interpret the Euclidean distance between all
pairwise observations.
b. Calculate and interpret the Manhattan distance between all
pairwise observations.
SOLUTION:
a. The Euclidean distances between all pairwise observations are calculated as:
Observations 1 and 2: d(1, 2) = sqrt((3 − 4)² + (4 − 5)²) = 1.41.
Observations 1 and 3: d(1, 3) = sqrt((3 − 10)² + (4 − 1)²) = 7.62.
Observations 2 and 3: d(2, 3) = sqrt((4 − 10)² + (5 − 1)²) = 7.21.
Using the Euclidean measure to gauge similarity, observations 1 and 2 are the most similar with the shortest Euclidean distance of 1.41. Observations 1 and 3 are the most dissimilar given the longest Euclidean distance of 7.62. This conclusion is reinforced by the plot in Figure 8.3.
b. The Manhattan distances between all pairwise observations are calculated as:
Observations 1 and 2: d(1, 2) = |3 − 4| + |4 − 5| = 2.
Observations 1 and 3: d(1, 3) = |3 − 10| + |4 − 1| = 10.
Observations 2 and 3: d(2, 3) = |4 − 10| + |5 − 1| = 10.
Using the Manhattan measure, observations 1 and 2 are again the most similar, whereas observations 1 and 3 and observations 2 and 3 are equally dissimilar with a distance of 10.
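These distances can also be verified in R with the dist function; a quick sketch using the three observations from Table 8.1:
> x <- data.frame(x1 = c(3, 4, 10), x2 = c(4, 5, 1))   # observations from Table 8.1
> dist(x, method = "euclidean")                        # returns 1.41, 7.62, and 7.21
> dist(x, method = "manhattan")                        # returns 2, 10, and 10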
page 325
A common transformation method is z-score standardization, which converts the ith observation of the kth variable into zki = (xki − x̄k)/sk, where x̄k and sk are the mean and the standard deviation of the kth variable, respectively.
Another widely used transformation method is min-max
normalization, which subtracts the minimum value from the
observation and then divides the difference by the range (maximum
value − minimum value). This approach rescales each value to be
between 0 and 1. In order to find the min-max normalized value for
the ith observation of the kth variable, denoted qki, we calculate qki = (xki − mink)/rangek, where xki is the ith observation of the kth variable, and mink and rangek are the minimum value and the range of the kth variable, respectively.
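Both transformations are easy to apply in R before computing distances. A minimal sketch, assuming myData is a data frame containing only numerical variables:
> z <- scale(myData)                                    # z-score standardization
> dist(z)                                               # distances on standardized values
> minmax <- function(x) (x - min(x)) / (max(x) - min(x))
> q <- apply(myData, 2, minmax)                         # min-max normalization
> dist(q)                                               # distances on normalized values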
EXAMPLE 8.2
Consider a sample of five consumers with their annual income
(Income, in $) and hours spent online per week (Hours Spent)
shown in columns 1–3 of Table 8.2. The distance measures
calculated from raw data can be distorted because the annual
income values are much larger than the values of hours spent
online per week.
Matching Coefficient
The matching coefficient for a categorical variable is based on
matching values to determine similarity among observations
(records). The matching coefficient between two observations is the number of variables with matching values divided by the total number of variables being compared.
The higher the value of the matching coefficient, the more similar
the two observations are. A matching coefficient value of one implies
a perfect match.
EXAMPLE 8.3
Consider the list of college students in Table 8.3. Each record
shows the student’s major, field, sex, and whether or not the
student is on the Dean’s List. Compute the matching coefficient
between all pairs of students and determine how similar or
dissimilar they are from each other.
TABLE 8.3 Characteristics of College Students
SOLUTION:
Comparing students 1 and 2, only one of the four variables
(i.e., Dean’s List) has a matching value. Therefore, the
matching coefficient equals 1/4 = 0.25. Comparing students 1
and 3, two of the four variables (i.e., Major and Sex) have a
matching value. Therefore, the matching coefficient equals 2/4
= 0.50. Comparing students 2 and 3, none of the four variables
has a matching value. Therefore, the matching coefficient
equals 0/4 = 0.
Based on the matching coefficients for these three students,
students 1 and 3 are the most similar to each other, whereas
students 2 and 3 are the most dissimilar.
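In R, the matching coefficient between two records is simply the proportion of variables with equal values. A quick sketch for students 1 and 3, using hypothetical values (the actual entries in Table 8.3 may differ, but the coefficient of 0.50 matches the example):
> student1 <- c(Major = "Business", Field = "Accounting", Sex = "Female", DeansList = "Yes")
> student3 <- c(Major = "Business", Field = "Finance", Sex = "Female", DeansList = "No")
> mean(student1 == student3)       # matching coefficient = 2/4 = 0.50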
Jaccard’s Coefficient
The matching coefficient makes no distinction between positive
outcomes (e.g., made a purchase) and negative outcomes (e.g., did
not make a purchase) and, therefore may provide a page 328
misleading measure of similarity. Consider, for example,
consumer purchasing behavior at a grocery store where thousands
of products are available. Intuitively, two consumers purchasing the
same products exhibit similar behavior. However, if one consumer
purchases only a Hostess cupcake, and the other purchases only
organic apples, most people would conclude that these two
consumers are not similar. However, a simple matching coefficient
will account for the fact that neither of the customers purchased
thousands of other products that are available and, therefore,
erroneously produce a very high value of the matching coefficient.
Jaccard’s coefficient, named after French scientist Paul
Jaccard, is a more appropriate measure of similarity in situations
where negative outcomes are not as important as positive outcomes.
It is computed as the number of variables with matching positive outcomes divided by the total number of variables excluding those with matching negative outcomes.
EXAMPLE 8.4
A retail store collects point of sales information that shows
whether or not five products were included in each sales
transaction. This information is captured in Table 8.4. A ‘yes’
indicates a positive outcome, while a ‘no’ indicates a negative
outcome. For example, Transaction 1 indicates that a
consumer purchases a computer keyboard, a mouse, and a
headphone. Compute and compare matching coefficients and
Jaccard’s coefficients for all sets of pairwise transactions.
SOLUTION:
We demonstrate how to calculate the matching coefficient and
the Jaccard coefficient between transactions 1 and 2. For the
matching coefficient, transactions 1 and 2 have three matching
values: keyboard, mouse, and USB drive. Given that there are
five products (so five variables), the matching coefficient is
calculated as 3/5 = 0.6. For the Jaccard’s coefficient, the two
transactions have two matching positive outcomes (keyboard
and mouse) and one matching negative outcome (USB drive).
Therefore, the Jaccard’s coefficient for the pair is calculated as
2/(5 − 1) = 0.50. The coefficients for the remaining pairs are
calculated similarly and are shown in Table 8.5.
TABLE 8.5 Matching and Jaccard's Coefficients for All Pairwise Transactions
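A short R sketch reproduces both coefficients for transactions 1 and 2. The 0/1 vectors below are consistent with the description in Example 8.4, although the pattern for the fifth product is an assumption:
> t1 <- c(1, 1, 1, 0, 0)     # transaction 1: keyboard, mouse, and headphone purchased
> t2 <- c(1, 1, 0, 0, 1)     # transaction 2 (illustrative)
> mean(t1 == t2)                                      # matching coefficient = 3/5 = 0.60
> sum(t1 == 1 & t2 == 1) / sum(t1 == 1 | t2 == 1)     # Jaccard's coefficient = 2/4 = 0.50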
EXERCISES 8.2
Note: For exercises in this section, it is advisable to use unrounded numbers in the
calculations.
Mechanics
1. FILE Exercise_8.1. The accompanying data file contains 10 observations with
two variables, x1 and x2.
a. Using the original values, compute the Euclidean distance between the first two
observations.
b. Using the original values, compute the Manhattan distance between the first
two observations.
c. Based on the entire data set, calculate the sample mean and standard deviation
for x1 and x2. Use z-scores to standardize the values, and then compute the
Euclidean distance between the first two observations.
d. Based on the entire data set, find the minimum and maximum values for x1 and
x2. Use the min-max transformation to normalize the values, and then compute
the Euclidean distance between the first two observations.
2. FILE Exercise_8.2. The accompanying data file contains 10 observations with
three variables, x1, x2, and x3.
a. Using the original values, compute the Euclidean distance between the first two
observations.
b. Using the original values, compute the Manhattan distance between the first
two observations.
c. Based on the entire data set, calculate the sample mean and standard deviation
for the three variables. Use z-scores to standardize the values, and then
compute the Euclidean distance between the first two observations.
d. Based on the entire data set, find the minimum and maximum values for the
three variables. Use the min-max transformation to normalize the values, and
then compute the Euclidean distance between the first two
observations.
page 330
3. FILE Exercise_8.3. The accompanying data file contains 28 observations with
three variables, x1, x2, and x3.
a. Using the original values, compute the Euclidean distance for all possible pairs
of the first three observations.
b. Based on the entire data set, calculate the sample mean and standard deviation
for the three variables. Use z-scores to standardize the values, and then
compute the Euclidean distance for all possible pairs of the first three
observations. Compare the results with part a.
c. Using the original values, and then the z-score standardized values, compute
the Manhattan distance for all possible pairs of the first three observations.
4. FILE Exercise_8.4. The accompanying data file contains 19 observations with
two variables, x1 and x2.
a. Using the original values, compute the Euclidean distance for all possible pairs
of the first three observations.
b. Based on the entire data set, find the minimum and maximum values for the
two variables. Use the min-max transformation to normalize the values, and
then compute the Euclidean distance for all possible pairs of the first three
observations.
c. Using the original values, and then the min-max normalized values, compute
the Manhattan distance for all possible pairs of the first three observations.
5. FILE Exercise_8.5. The accompanying data file contains 10 observations with
two variables, x1 and x2.
a. Based on the entire data set, calculate the sample mean and standard deviation
of the two variables. Using the original values, and then the z-score
standardized values, compute the Euclidean distance for all possible pairs of
the first three observations.
b. Using the original values, and then the z-score standardized values, compute
the Manhattan distance for all possible pairs of the first three observations.
6. FILE Exercise_8.6. The accompanying data file contains 12 observations with
three variables, x1, x2, and x3.
a. Based on the entire data set, calculate the sample mean and standard deviation
for the three variables. Using the original values, and then the z-score
standardized values, compute the Euclidean distance for all possible pairs of
the first three observations.
b. Based on the entire data set, find the minimum and maximum values of the
three variables. Using the original values, and then the min-max normalized
values, compute the Manhattan distance for all possible pairs of the first three
observations.
7. FILE Exercise_8.7. The accompanying data file contains five observations with
three categorical variables, x1, x2, and x3.
a. Compute the matching coefficient for all pairwise observations.
b. Identify the observations that are most and least similar to each other.
8. FILE Exercise_8.8. The accompanying file contains four observations with four
categorical variables, x1, x2, x3, and x4.
a. Compute the matching coefficient for all pairwise observations.
b. Identify the pair of observations that are most and least similar to each other.
9. FILE Exercise_8.9. The accompanying file contains six observations with four
binary variables, x1, x2, x3, and x4.
a. Compute the matching coefficient for all pairwise observations. Identify the pair
of observations that are most and least similar to each other.
b. Compute the Jaccard’s coefficient for all pairwise observations. Identify the pair
of observations that are most and least similar to each other.
Applications
10. FILE Employees. Consider the following portion of data that lists the starting
salaries (in $1,000) of newly hired employees and their college GPAs:
Employee  Salary (in $1,000s)  GPA
1  72  3.53
2  66  2.86
3  72  3.69
⋮  ⋮  ⋮
11  59  3.49
a. Without transforming the values, compute the Euclidean distance for all
possible pairs of the first three employees. Make sure to exclude the employee
numbers in the calculations.
b. Compute the z-score standardized values for salary and GPA, and then
compute the Euclidean distance for all possible pairs of the first three
employees. Discuss the impact that standardization has on the similarity
distance.
c. Using the z-score standardized values, compute the Manhattan distance for all
possible pairs of the first three employees. Discuss the differences between the
Euclidean and Manhattan distances.
11. FILE Online_Retailers. Consider the following portion of data that lists
information about an online retailer’s annual revenues (Revenue in $), the number
of products available on the retailer’s website (SKUs), and the number of visits to
the website per day (Visits):
⋮ ⋮ ⋮ ⋮
page 331
a. Without transforming the values, compute the Euclidean
distance for all pairwise observations of websites 001, 003, and 005 based on
annual revenues, SKUs, and visits per day. Make sure to exclude the Website
IDs from your calculation.
b. Compute the min-max normalized values for annual revenues, SKUs, and visits
per day, and then compute the Euclidean distance for all pairwise observations
of websites 001, 003, and 005. Discuss the differences in parts a and b. Make
sure to exclude the Website IDs from your calculation.
c. Using the min-max normalized values, compute the Manhattan distance for all
pairwise observations of websites 001, 003, and 005. Discuss the difference
between the Euclidean and Manhattan distances. Make sure to exclude the
Website IDs from your calculation.
12. FILE Vehicles. Consider the following portion of data that shows vehicle
information based on three variables: Type of vehicle, whether or not the vehicle
has all-wheel drive (AWD), and whether the vehicle’s transmission is automatic or
manual.
Vehicle  Type  AWD  Transmission
2  Sedan  No  Manual
⋮  ⋮  ⋮  ⋮
56  SUV  No  Automatic
a. Using the first five vehicles, compute the matching coefficient for all pairwise
observations based on Type, AWD, and Transmission.
b. Identify and describe the vehicles that are most and least similar to each other.
Explain the characteristics of these vehicles.
13. FILE Home_Loan. Consider the following portion of data that includes
information about home loan applications. Variables on each application include
(1) whether the application is conventional or subsidized by the federal housing
administration (LoanType), (2) whether the property is a single-family or multi-
family home (PropertyType), and (3) whether the application is for a first-time
purchase or refinancing (Purpose).
⋮ ⋮ ⋮ ⋮
a. Using the first five applications, compute the matching coefficients for all
pairwise observations based on LoanType, PropertyType, and Purpose.
b. Identify and describe the loan applications that are most and least similar to
each other. Explain the characteristics of these loan applications.
14. FILE University_Students. Information is collected on university students. The
variables of interest include (1) whether a student is an undergraduate or
graduate student (Level of education), (2) whether or not the student is a math
major, (3) whether or not the student is a statistics minor, and (4) whether a
student is male or female. A portion of the data is shown in the accompanying
table.
a. Using the first five students, compute the matching coefficients for all pairwise
observations based on the four variables.
b. Identify and describe students who are most and least similar to each other.
15. FILE Bookstore. Consider the following data that shows a portion of point of
sales transactions from a local bookstore. Each binary variable indicates whether
or not a particular book genre is purchased within a transaction.
a. Using the first five sales transactions, compute the matching coefficients for all
pairwise observations for the five binary variables.
b. Using the first five sales transactions, compute Jaccard’s coefficients for all
pairwise observations for the five binary variables.
c. Based on the results from parts a and b, identify and describe the transactions
that are most and least similar to each other. Discuss the difference between
the results from parts a and b.
16. FILE Grocery_Store. Consider the following data that show a portion of the
point of sales transactions from a grocery store. Each binary variable indicates
whether or not a product is purchased as part of the transaction.
a. Using the first five sales transactions, compute the matching coefficients for all
pairwise observations for the four binary variables.
page 332
b. Using the first five sales transactions, compute Jaccard’s
coefficients for all pairwise observations for the four binary variables.
c. Based on the results from parts a and b, identify and describe the transactions
that are most and least similar to each other. Discuss the difference between
the results from parts a and b.
17. FILE Fast_Food. Consider the following data that show a portion of the sales
transactions from a local fast-food restaurant. Each binary variable indicates
whether or not a product is purchased as part of the sales transaction.
Transaction Hamburgers Fries Soda
1 0 1 0
2 1 1 0
⋮ ⋮ ⋮ ⋮
238 0 1 1
a. Using the first five sales transactions, compute the matching coefficients for all
pairwise observations for the three binary variables.
b. Using the first five sales transactions, compute Jaccard’s coefficients for all
pairwise observations for the three binary variables.
c. Based on the results from parts a and b, identify and describe the transactions
that are most and least similar to each other. Discuss the difference between
the results from parts a and b.
Data Partitioning
Data partitioning is the process of dividing a data set into a training,
a validation, and, in some situations, an optional test data set. A
common data partitioning practice is a two-way random partitioning
of the data to generate a training data set and a validation data set.
We use random partitioning, as opposed to fixed draws discussed in
Chapter 7, to avoid any bias in the selection of the training and
validation data sets. The training data set, which often contains the
larger portion of the data, “trains” data mining algorithms to identify
the relationship between the predictor variables and the target
(response) variable. The validation data set, which is not involved in
model building, is used to provide an unbiased assessment of the
predictive performance of data mining models.
The model built from the training data set is used to predict the
values of the target variable in the validation data set. These
predicted values are then compared to the actual values of the target
variable of the validation data set to evaluate the performance of the
model. The performance measures obtained from this process can
be used to help assess model performance, fine-tune the model, or
compare the performance of competing models. A common practice is to partition 60% of the data into the training data set and 40% of the data into the validation data set. However, a 70%/30% training/validation partitioning is sometimes used if the model-building process can benefit from a larger training
data set. As explained in Chapter 7, we can implement cross-
validation with the holdout as well as k-fold methods.
In supervised data mining, we sometimes use three-way random
partitioning of the data to generate training, validation, and test data
sets. The third data partition, called a test data set, which is not
involved in either model building or model selection, is created to
evaluate how well the final model would perform on a new data set
that it has never seen before. As in the case of two-way partitioning,
data are usually partitioned randomly to create unbiased training,
validation, and test data sets. In the three-way random partitioning
method, 50% training/30% validation/20% test partitioning is often
recommended. The three-way random partitioning is usually
implemented with the holdout method, as opposed to k-fold cross-
validation.
DATA PARTITIONING
Data partitioning is the process of dividing a data set into a
training, a validation, and an optional test data set. The training
data set is used to generate one or more models. The
validation data set is used to fine-tune or compare the
performance of competing models. The optional test data set is
used to assess the performance of the final model on a new
data set.
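Random two-way partitioning is straightforward in R; the following is a minimal sketch for a 60%/40% split, assuming the full sample is stored in a data frame myData. A three-way 50%/30%/20% partition can be created the same way by drawing a second random sample from the remaining observations.
> set.seed(1)                                                   # for a reproducible partition
> trainIndex <- sample(nrow(myData), size = round(0.6 * nrow(myData)))
> TData <- myData[trainIndex, ]                                 # 60% training set
> VData <- myData[-trainIndex, ]                                # remaining 40% validation set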
Oversampling
Sometimes the target class that we wish to classify is very rare,
which reduces the usefulness of classification models. For example,
in order to build a classification model that informs an online retailer
about which prospective customers are likely to respond to a
promotional e-mail, the retailer conducts an experiment by sending
promotional e-mail messages to 10,000 potential customers of which
only 200 customers respond, resulting in a response rate of 2%.
The resulting data from this experiment would consist mostly of non-
responders with little information to distinguish them from the target
class of responders. If this retailer attempts to build classification
models, the best-performing model might be the one that suggests
no prospective customers will respond to the promotional e-mail
message as the model would have an extremely high accuracy rate
of 98%. Nevertheless, the best-performing model in this application
might be useless to the retailer.
A common solution to this problem is called oversampling. The
oversampling technique involves intentionally selecting more
samples from one class than from the other class or classes in order
to adjust the class distribution of a data set. Rare target class cases
will be better represented in the data set if they are oversampled.
This would lead to predictive models that are more useful in
predicting the target class cases.
In the promotional e-mail example, the retailer may choose to
oversample the rare target class cases by including most, if not all,
responders and reduce the number of non-responders so that the
resulting data set constitutes a healthy combination of responders
and non-responders.
OVERSAMPLING
The oversampling procedure overweights the rare class relative
to the other class or classes in order to adjust the class
distribution of a data set.
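For the promotional e-mail example, oversampling might be implemented in R as follows. The column name Response and the number of retained non-responders are assumptions for illustration:
> responders <- subset(myData, Response == 1)                   # all 200 responders
> nonresponders <- subset(myData, Response == 0)                # 9,800 non-responders
> set.seed(1)
> keep <- nonresponders[sample(nrow(nonresponders), 800), ]     # keep only 800 non-responders
> overData <- rbind(responders, keep)                           # responders now make up 20% of the data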
page 334
OVERFITTING
Overfitting occurs when a predictive model is made overly
complex to fit the quirks of given sample data. By making the
model conform too closely to the sample data, its predictive
power is compromised.
FIGURE 8.4 The relationship between error rate and model complexity
We will now discuss the assessment of classification models
where the target variable is categorical, and prediction models where
the target variable is numerical. Performance measures used in
classification and prediction models are different; therefore, they are
discussed separately in the following subsections.
page 335
Actual Class  Predicted Class 1  Predicted Class 0
Class 1  True Positive (TP)  False Negative (FN)
Class 0  False Positive (FP)  True Negative (TN)
EXAMPLE 8.5
Recall the FashionTech example from the introductory case
where Class 1 (target) is the group of social media users who
purchase at least one product within 30 days of receiving a
promotional message and Class 0 (nontarget) is the group of
social media users who do not make a purchase after receiving
a promotional message. Table 8.7 shows the confusion matrix
that is obtained after applying the classification model to the
validation data set of 200 observations. [Shortly, we show how
we create the confusion matrix in Excel.]
Actual Class  Predicted Class 1  Predicted Class 0
Class 1  29  19
Class 0  19  133
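Assuming the usual definitions of the performance measures referenced in the exercises (accuracy, misclassification rate, sensitivity, specificity, and precision), the entries in Table 8.7 can be turned into these measures with a few lines of R:
> TP <- 29; FN <- 19; FP <- 19; TN <- 133     # counts from Table 8.7
> (TP + TN) / (TP + TN + FP + FN)             # accuracy rate = 0.81
> (FP + FN) / (TP + TN + FP + FN)             # misclassification rate = 0.19
> TP / (TP + FN)                              # sensitivity = 0.6042
> TN / (TN + FP)                              # specificity = 0.8750
> TP / (TP + FP)                              # precision = 0.6042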
ActualClass TargetProb
1 0.77585787
0 0.04709009
⋮ ⋮
0 0.38111670
Actual Class  Predicted Class 1  Predicted Class 0
EXAMPLE 8.6
Table 8.10 shows the actual class and the predicted target
class probability for 10 observations of a categorical variable.
The table also includes the predicted target class memberships
based on five different cutoff values.
TABLE 8.10 Actual and Predicted Class Memberships Given Various Cutoffs
a. Calculate the misclassification rate with a cutoff value of 0.25
and a cutoff value of 0.75.
b. Compute the sensitivity and precision with a cutoff value of
0.25 and a cutoff value of 0.75.
c. Report and interpret performance measures with all five cutoff
values.
SOLUTION:
page 340
The blue upper curve in Figure 8.5 is called the lift curve, which
represents the ability of the classification model to identify Class 1
cases (for example, buyers). To construct a lift curve, the
observations in the validation data set are sorted in descending
order based on their predicted probability of belonging to the target
class (i.e., Class 1). The lift curve in Figure 8.5 shows that if we
select the top 2,000 observations with the highest predicted
probability of belonging to Class 1, we will capture 465 Class 1
observations. If we randomly select 2,000 observations, we will likely
capture only 190 Class 1 observations. The ratio of the number of
Class 1 observations captured by the model to the number captured
by random selection is called the lift.
The cumulative lift chart in Figure 8.5 shows that when we select
2,000 observations, the model provides a lift of 465/190 or 2.447.
This means that among the top 2,000 observations selected by the
model, we are able to capture 2.447 times as many Class 1
observations as compared to the 2,000 observations selected
randomly. Therefore, the cumulative lift chart shows how well a
predictive model can capture the target class cases, compared to
random selection. A high lift number is desirable in most applications
ranging from marketing to fraud detection as it means that we are
able to capture a large proportion of the target class cases by only
focusing on a small portion of cases with a high predicted probability.
A lift curve that is above the baseline is indicative of good predictive
performance of the model. The further above the baseline the lift
curve lies, the better the model’s ability to identify target class cases.
Figure 8.6 shows that the lift for the first 10% of the observations
(first bar) is about 7.1, which means that the top 10% of the
observations selected by the model contain 7.1 times as many Class
1 cases as the 10% of the observations that are randomly selected.
It also allows us to determine the point at which the model’s
predictions become less useful. For example, after the second
decile, the lift values dip below one, suggesting that the model will
capture fewer target class cases than random selection in the
subsequent deciles. Therefore, in this case, we want to focus on the
first 20% of the observations that have the highest predicted
probabilities of belonging to the target class if our goal is to capture
the target class cases (e.g., in a marketing campaign or when
detecting fraudulent credit card transactions).
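With the ActualClass and TargetProb columns of the Class_Prob data, the cumulative lift calculation can be sketched in R as follows (the baseline is the number of Class 1 cases expected under random selection):
> myData <- myData[order(-myData$TargetProb), ]       # sort by predicted probability, descending
> captured <- cumsum(myData$ActualClass)              # Class 1 cases captured by the model
> expected <- sum(myData$ActualClass) * seq_along(captured) / nrow(myData)
> lift <- captured / expected                         # cumulative lift
> lift[20]                                            # lift for the top 20 cases (first decile of 200)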
page 342
EXAMPLE 8.7
Let’s go back to the introductory case where Alissa and her
team at FashionTech want to assess the performance of
predictive models developed by MarketWiz. Alissa and her
team constructed a cumulative lift chart, a decile-wise lift chart,
and a ROC curve; see Figure 8.8. Interpret these performance
charts. [Shortly, we show how to create these graphs in Excel.]
page 343
We now use Excel to replicate the graphs shown in Figure 8.8. The
graphs are based on the Class_Prob data file that we had earlier
used to create the confusion matrix in Excel.
. Open the original Class_Prob data file and sort the data as
described in part B for the cumulative lift chart.
. We need to compute the ratio of the number of Class 1 cases
identified in each decile (200 × 0.10 = 20 cases in each decile) to
the number of Class 1 cases that we would have identified if 10% of
the cases were selected randomly. First type the
column heading “# of Class 1 Cases in Each Decile” in
D1 and “Decile-Wise Lift” in E1. In order to find the number of Class
1 cases identified in the first decile—the numerator of the ratio for
the first decile—type =SUM(A2:A21) in D2. Similarly, to find the
number of Class 1 cases identified in the second decile, type
=SUM(A22:A41) in D3. The other eight deciles are found in a like
manner. Because there are a total of 48 Class 1 cases, the number
of Class 1 cases among the randomly selected 10% of the cases
would be 48 × 0.10 = 4.8 cases; this is the denominator of the ratio
for all deciles. Enter the formula =D2/4.8 in cell E2 to compute the
lift for the first decile. Fill the range E3:E11 with the formula in E2.
This computes the lift values for the rest of the 9 deciles.
. Select the range E2:E11. Choose Insert > Insert Column or Bar
Chart > Clustered Columns to create a column chart. [Formatting
(regarding the chart title, axis titles, etc.) can be done by selecting
Format > Add Chart Element from the menu.]
Cutoff  Specificity  Sensitivity
0  0  1
⋮  ⋮  ⋮
1  1  0
. Open the Spec_Sens data file.
. Select column C. Right-click and choose Insert to insert a column.
Type the column heading “1 − Specificity” in the new column. Insert
the formula = 1−B2 in cell C2. Fill the range C3:C22 with the
formula in C2. The new column computes the 1 − Specificity values
for all the cutoff values in column A.
. Select the range C1:D22. Choose Insert > Scatter with Smooth
Lines to create a scatter chart that connects all the data points in a
line. Change the chart title to “ROC Curve”. Change the x-axis title
to “1 − Specificity”. Change the y-axis title to “Sensitivity”.
. To compare the ROC curve of the classification model to that of the
baseline model, you will add a diagonal line that connects points
(0,0) and (1,1). Right-click the chart area and choose Select Data.
Click the Add button in the Select Data Source dialog box. Enter
“Baseline Model” in the Series name: box. Enter “0, 1” in the Series
X values: box. Enter “0, 1” in the Series Y values: box. Click OK
twice.
page 346
Note that prediction errors are calculated in the same way as the
residuals, but they are calculated using the validation data set rather
than the training data set. As with regression models, low values for
the prediction error e are preferred. The following definition box
highlights five performance measures: (1) the root mean square
error (RMSE), (2) the mean error (ME) or the average error (AE), (3)
the mean absolute deviation (MAD) or mean absolute error (MAE),
(4) the mean percentage error (MPE), and (5) the mean absolute
percentage error (MAPE).
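Using the ActVal and PredVal1 columns of the Prediction data described shortly in Example 8.8, these measures can be computed in R as follows (PredVal2 is handled the same way):
> e <- myData$ActVal - myData$PredVal1        # prediction errors for Model 1
> sqrt(mean(e^2))                             # RMSE
> mean(e)                                     # ME
> mean(abs(e))                                # MAD
> 100 * mean(e / myData$ActVal)               # MPE (in percent)
> 100 * mean(abs(e) / myData$ActVal)          # MAPE (in percent)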
page 347
EXAMPLE 8.8
FILE
Prediction
Recall from the introductory case that Alissa and her team at
FashionTech want to assess the performance of predictive
models developed by MarketWiz. Two prediction models were
designed to predict the one-year purchase amounts for
customers acquired through social media channels. After
applying the prediction models to the validation data set of 200
observations, Alissa and her team generated the Prediction
data set, which contains the predicted one-year purchase
amounts for the 200 observations. Each row of the data set
includes the actual target value (ActVal) and the corresponding
predicted value for two models (PredVal1 and PredVal2). A
portion of the data is shown in Table 8.13.
The five performance measures are computed for both models; for example, the ME values are 11.2530 for Model 1 and 12.0480 for Model 2.
SOLUTION:
For both models, ME and MPE, as measures of prediction bias,
offer conflicting results. The positive values of ME suggest that
predictions for both models tend to underestimate the purchase
amounts, while negative values of MPE suggest that
predictions tend to slightly overestimate the purchase amounts
percentwise. Because Model 1 exhibits lower values of all
performance measures, we conclude that Model 1 has the
lower prediction bias as well as lower magnitude of the errors.
Therefore, Model 1 is the preferred model to predict the one-
year purchase amounts for customers acquired through social
media channels.
RMSE =SQRT(SUM(E2:E201)/200)
ME =SUM(D2:D201)/200
MAD =SUM(F2:F201)/200
MPE =SUM(G2:G201)/200
MAPE =SUM(H2:H201)/200
20. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity for the following confusion matrix.
21. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity for the following confusion matrix.
22. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity for the following confusion matrix.
23. FILE Exercise_8.23. Answer the following questions using the accompanying
data set that lists the actual class memberships and predicted Class 1 (target
class) probabilities for 10 observations.
a. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity using the cutoff value of 0.5.
b. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity using the cutoff value of 0.25.
c. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity using the cutoff value of 0.75.
24. Use the data in the preceding exercise.
a. What is the lift that the classification model provides if 5 observations are
selected by the model compared to randomly selecting 5 observations?
b. What is the lift that the classification model provides if 8 observations are
selected by the model compared to randomly selecting 8 observations?
25. FILE Exercise_8.25. Answer the following questions using the accompanying
data set that lists the actual class memberships and predicted Class 0 (nontarget
class) probabilities for 10 observations.
a. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity using the cutoff value of 0.5.
b. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity using the cutoff value of 0.25.
c. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity using the cutoff value of 0.75.
26. Use the data in the preceding exercise.
a. What is the lift that the classification model provides if 5 observations are
selected by the model compared to randomly selecting 5 observations?
b. What is the lift that the classification model provides if 8 observations are
selected by the model compared to randomly selecting 8 observations?
27. FILE Exercise_8.27. Compute the RMSE, ME, MAD, MPE, and MAPE using the
accompanying data set that lists the actual and predicted values for 10
observations.
28. FILE Exercise_8.28. Compute the RMSE, ME, MAD, MPE, and MAPE using the
accompanying data set that lists the actual and predicted values for 10
observations.
Applications
29. A mail-order catalog company has developed a classification model to predict
whether a prospective customer will place an order if he or she receives a catalog
in the mail. If the prospective customer is predicted to place an order, then he or
she is classified in Class 1; otherwise, he or she is classified in Class 0. The
validation data set results in the following confusion matrix.
                 Predicted Class 1   Predicted Class 0
Actual Class 1          75                  25
Actual Class 0          84                 816
Observation   Actual Class   Class 1 Probability
1             1              0.22
2             0              0.54
⋮             ⋮              ⋮
100           1              0.5
a. Specify the predicted class membership for the validation data set using the
cutoff values of 0.25, 0.5, and 0.75. Produce a confusion matrix in Excel based
on the classification results from each cutoff value.
b. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity of the classification model for each of the three cutoff values
specified in part a.
c. Create a cumulative lift chart and a decile-wise lift chart for the classification
model.
d. What is the lift that the classification model provides if 20% of the observations
are selected by the model compared to randomly selecting 20% of the
observations?
e. What is the lift that the classification model provides if 50% of the observations
are selected by the model compared to randomly selecting 50% of the
observations?
32. FILE IT_Professionals. Kenzi Williams is the Director of Human Resources of a
high-tech company. In order to manage the high turnover rate of the IT
professionals in her company, she has developed a predictive model for
identifying software engineers who are more likely to leave the company within
the first year. If the software engineer is predicted to leave the company within a
year, he or she is classified into Class 1; otherwise, he or she is classified into
Class 0. Applying the model to the validation data set generated a table that lists
the actual class membership and predicted Class 1 probability of the 100
observations in the validation data set. A portion of the table is shown below.
Observation   Actual Class   Class 1 Probability
1             0              0.09
2             0              0.22
⋮             ⋮              ⋮
100           1              0.41
a. Specify the predicted class membership for the validation data set using the
cutoff values of 0.25, 0.5, and 0.75. Produce a confusion
matrix based on the classification results from each cutoff value.
b. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity of the classification model from each of the three cutoff values
specified in part a.
c. Create a cumulative lift chart and a decile-wise lift chart for the classification
model.
d. What is the lift that the classification model provides if 20% of the observations
are selected by the model compared to randomly selecting 20% of the
observations?
e. What is the lift that the classification model provides if 50% of the observations
are selected by the model compared to randomly selecting 50% of the
observations?
33. FILE Gamers. Monstermash, an online game app development company, has
built a predictive model to identify gamers who are likely to make in-app
purchases. The model classifies gamers who are likely to make in-app purchases
in Class 1 and gamers who are unlikely to make in-app purchases in Class 0.
Applying the model on the validation data set generated a table that lists the
actual class and Class 1 probability of the gamers in the validation data set. A
portion of the table is shown below.
Observation   Actual Class   Class 1 Probability
1             0              0.022666
2             0              0.032561
⋮             ⋮              ⋮
920           0              0.040652
a. Specify the predicted class membership for the validation data set using the
cutoff values of 0.25, 0.5, and 0.75. Produce a confusion matrix based on the
classification results from each cutoff value.
b. Compute the misclassification rate, accuracy rate, sensitivity, precision, and
specificity of the classification model from each of the three cutoff values
specified in part a.
c. Create the cumulative lift chart, decile-wise lift chart, and ROC curve for the
classification model.
d. What is the lift that the classification model provides if 20% of the observations
are selected by the model compared to randomly selecting 20% of the
observations?
e. What is the lift that the classification model provides if 50% of the observations
are selected by the model compared to randomly selecting 50% of the
observations?
34. FILE House_Prices. A real estate company has built two predictive models for
estimating the selling price of a house. Using a small test data set of 10
observations, it tries to assess how the prediction models would perform on a new
data set. The following table lists a portion of the actual prices and predicted
prices generated by the two predictive models.
a. Compute the ME, RMSE, MAD, MPE, and MAPE for the two predictive models.
b. Are the predictive models over- or underestimating the actual selling price on
average?
c. Compare the predictive models to a base model where every house is predicted
to be sold at the average price of all the houses in the training data set, which is
$260,500. Do the predictive models built by the real estate company outperform
the base model in terms of RMSE?
d. Which predictive model is the better-performing model?
35. FILE Customer_Spending. Mary Grant, a marketing manager of an online
retailer, has built two predictive models for estimating the annual spending of new
customers. Applying the models to the 100 observations in the validation data set
generates a table that lists the customers’ actual spending and predicted
spending. A portion of the data set is shown below.
Customer   Actual Spending   Predicted Spending 1   Predicted Spending 2
⋮          ⋮                 ⋮                      ⋮
a. Compute the ME, RMSE, MAD, MPE, and MAPE for the two predictive models.
b. Are the predictive models over- or underestimating the actual spending on
average?
c. Compare the RMSE and MAD of the two predictive models. Which predictive
model performs better?
d. Compare the better-predictive model to a base model where every customer is
predicted to have the average spending of the cases in the training data set,
which is $290. Does the better-predictive model built by Mary outperform the
base model in terms of RMSE?
[Figure 8.11b. Variance Distribution across PC1 and PC2]
The PC scores shown in the last two columns of Figure 8.11c are
based on the weights shown in Figure 8.11a. The following equation
is used to compute the mth principal component score for the ith
observation, denoted as PCm,i:
PCm,i = wm1z1i + wm2z2i + ⋯ + wmkzki,
where wmk is the weight for the kth variable of the mth principal
component and zki is the standardized value for the ith observation
of the kth variable.
For example, let’s calculate the PC1 score for the first candy bar,
three Musketeers, denoted PC1,3M. Figure 8.11a shows that the
weights for fat and protein are both 0.707107, so w11 = 0.707107
and w12 = 0.707107. Figure 8.11c shows that the standardized
values for three Musketeers' fat and protein content are −0.7860 and
−0.8982, respectively, so z11 = −0.7860 and z21 = −0.8982. We
calculate PC1,3M as PC1,3M = 0.707107 × (−0.7860) + 0.707107 × (−0.8982) ≈ −1.1909.
The PC1 scores for each candy bar appear in the second-to-last
column in Figure 8.11c. Similarly, the PC2 score for three
Musketeers, denoted PC2,3M, is calculated as:
The PC2 scores for each candy bar appear in the last column in
Figure 8.11c.
Figure 8.11b shows the distribution of variance across all the
principal components. PC1 accounts for 89.22% of the total variance
while PC2 only accounts for 10.78% of the variance. PCA has
redistributed the variance in the data so that using PC1 alone will
result in very little loss of information.
So far, we have worked with two-dimensional data with which we
can construct at most two principal components. If we had k original
variables, x1, x2, . . . , xk, then up to k principal components could be
constructed. For example, with the CandyBars data on calories, fat,
protein, and carb, we could have constructed up to four principal
components for the analysis. The amount of variance explained
decreases from the first principal component to the last principal
component. Therefore, the first few principal components often
represent a very large portion of the total variance (or information) in
the data. This allows us to reduce the number of dimensions
significantly, by including only the first few principal components
instead of the entire list of original variables, without losing much
information. In practice, a predetermined percentage of variance
explained (e.g., 80%) is used as a threshold to determine the
number of principal components to retain. In addition, the principal
components are uncorrelated; therefore, multicollinearity would not
be an issue in subsequent data mining analysis.
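As an illustration of the threshold idea, the following R sketch computes the proportion of variance explained by each principal component and finds how many components are needed to reach a chosen threshold; here pca is assumed to be the object returned by prcomp applied to standardized data, and 0.80 is used as the illustrative threshold.
> pve <- pca$sdev^2 / sum(pca$sdev^2)     # proportion of variance explained by each component
> cumsum(pve)                             # cumulative proportion of variance explained
> which(cumsum(pve) >= 0.80)[1]           # smallest number of components reaching 80%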
EXAMPLE 8.9
Consider country-level health and population measures for 38
countries from the World Bank's 2000 Health Nutrition and
Population Statistics database. For each country, the measures
include death rate per 1,000 people (Death Rate, in %), health
expenditure per capita (Health Expend, in US$), life
expectancy at birth (Life Exp, in years), male adult mortality
rate per 1,000 male adults (Male Mortality), female adult
mortality rate per 1,000 female adults (Female Mortality),
annual population growth (Population Growth, in %), female
population (Female Pop, in %), male population (Male Pop, in
%), total population (Total Pop), size of labor force (Labor
Force), births per woman (Fertility Rate), and birth rate per
1,000 people (Birth Rate). A portion of the data is shown in
Table 8.17.
FILE
Health
SOLUTION:
As mentioned earlier, PCA is especially effective when the
variables are highly correlated. The correlation analysis on the
Health data reveals strong correlation between many of the
variables. For example, the correlation coefficient between
male and female mortality rates is 0.95, indicating a strong
positive correlation between the two variables. The correlation
measure tells us that some of the information in male mortality
of a country in the form of variance is duplicated in female
mortality. However, if we remove one of the variables from the
data set, we may lose a sizable portion of the variance
(information) in the data. Applied to a data set with a large
number of correlated variables, PCA can significantly reduce
data dimensions while retaining almost all of the information in
the data.
Next, we demonstrate how to use Analytic Solver and R to
perform PCA on the Health and Population data set. We also
summarize the findings.
Using Analytic Solver
d. See Figure 8.13. Analytic Solver offers you two options for the
number of principal components to display in the results. The
first option is to display a fixed number of principal components.
The total number of principal components that can be calculated
equals the number of variables. Therefore, the maximum
number of principal components to display in this example is
12. The other option allows you to display the smallest number
of components that collectively explain at least a certain
percentage of the total variance in the data, where the default
threshold is 95%. Choose the option to display all 12
components.
Analytic Solver also offers you the option of choosing one
of the two methods for performing PCA: using the covariance
matrix or the correlation matrix. Recall that the correlation
coefficients are unit-free and, therefore, choosing the Use
Correlation Matrix option is equivalent to using standardized
data for PCA. As mentioned earlier, it is recommended that
data be standardized (converted to z-scores) prior to
performing PCA when the scales are different. The scales of
the data in the Health data set vary significantly; therefore, we
will choose the Use Correlation Matrix option to remove the
impact of data scale on the results. Click Next.
e. In the final dialog box, make sure the box next to Show
principal components score is checked. Click Finish.
Analytic Solver inserts two new worksheets (PCA_Output
and PCA_Scores). Figure 8.14c shows a portion of the
principal component scores for the 38 countries from the
PCA_Scores worksheet. Here, Record 1 refers to Argentina,
Record 2 refers to Austria, and so on. Recall that the principal
component scores are the weighted linear combinations of the
original variables. Figures 8.14a and 8.14b summarize the
results from the PCA_Output worksheet, showing the weights
used to calculate the principal component scores and the
amount of variance accounted for by each principal
component. We see that the first principal component accounts
for a very large portion, 46.6189%, of the total variance in the
data. Among the original variables, Life Exp has the most
weight, in absolute terms, for the first principal component,
followed by Birth Rate. Cumulatively, the first four principal
components account for 93.3486% of the total variance in the
data, suggesting that if we reduce the dimension of the data
from the original 12 variables to the four newly calculated
principal components, we would still be able to retain almost all
the information in the data.
Using R
The following instructions are based on R version 3.5.3. Also,
the results from R may be slightly different from the Analytic
Solver results due to the differences in the algorithms
employed by the two software packages. In fact, the sign of some of the
PC scores in R may be the opposite of the ones computed in
Analytic Solver. These sign differences, however, will have no
impact on the resulting data mining techniques.
a. Import the Health data into a data frame (table) and label it
myData.
b. Display a few observations in the myData data frame to
inspect the data using the head function. Enter:
> head(myData)
c. We exclude the Country Name variable (1st variable) from the
analysis, standardize the rest of the variables, and store their
standardized values in a data frame called myData.st. Use the
scale function to standardize the data. Enter:
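A minimal sketch of the commands for this step and for running the analysis, assuming the 12 measures occupy columns 2 through 13, is shown below; standardizing the data first (or, equivalently, calling prcomp with scale. = TRUE) corresponds to performing PCA on the correlation matrix.
> myData.st <- data.frame(scale(myData[ , 2:13]))   # drop Country Name (column 1) and standardize
> pca <- prcomp(myData.st)                          # run PCA on the standardized variables
> summary(pca)                                      # proportion of variance explained by each component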
> pca$rotation
Figure 8.15b shows a portion of the weights. Again, the
weights are very similar to the ones produced by Analytic
Solver.
f. To review the principal component scores, we
display the x property of the PCA results. Enter:
> pca$x
Figure 8.15c displays a portion of the principal component
scores of the observations. In general, the results are
consistent with those produced by Analytic Solver.
g. In order to list countries (1st variable) along with the principal
component scores, we first use the data.frame statement to
combine the myData data frame with the principal component
scores. We then remove columns 2 to 13, which contain the
original variables in the myData data frame. We display a few
observations in the new data frame using the head function.
Enter:
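A minimal sketch of the commands for this step, using a new data frame labeled myScores (a hypothetical name), is:
> myScores <- data.frame(myData, pca$x)   # append the principal component scores to the original data
> myScores <- myScores[ , -(2:13)]        # remove the original variables in columns 2 to 13
> head(myScores)                          # display a few observations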
EXERCISES 8.4
Note: Software icons indicate whether the exercises must be solved using Analytic Solver or R.
Mechanics
39. FILE Exercise_8.39. Standardize the accompanying data set, and then
40. FILE Exercise_8.40. Standardize the accompanying data set, and then
Weight   PC1      PC2
x1      −0.892    0.452
x2      −0.452   −0.892
x1   x2
a. The mean and standard deviation for x1 are 3.5 and 1.2, respectively. The
mean and standard deviation for x2 are 11.75 and 3.8, respectively. Compute
the z-scores for the x1 and x2 values for the two observations.
b. Compute the first principal component score for observation 1.
c. Compute the second principal component score for observation 2.
43. The following table displays the weights for computing the principal components
and the data for two observations.
Weight   PC1     PC2
x1      −0.81    0.59
x2      −0.59   −0.81
x1   x2
a. The mean and standard deviation for x1 are 5.8 and 2.4, respectively. The
mean and standard deviation for x2 are 380.5 and 123.4, respectively. Compute
the z-scores for the x1 and x2 values for the two observations.
b. Compute the first principal component score for observation 1.
c. Compute the second principal component score for observation 2.
44. The following table displays the weights for computing the principal components
and the standardized data (z-scores) for three observations.
z1 z2 z3
b. Compute the second principal component score for
observation 2.
c. Compute the third principal component score for observation 3.
45. The following Analytic Solver results were generated by a principal component
analysis.
a. How many original variables are in the data set?
b. What percent of variance does the third principal component account for?
c. How many principal components need to be retained in order to account for at
least 95% of the total variance in the data?
46. The following R results were generated by a principal component analysis.
Applications
47. FILE Development. The accompanying data set contains some economic
indicators for 11 African countries collected by the World Bank in 2015. The
economic indicators include annual % growth in agriculture (Agriculture), annual
% growth in exports (Exports), annual % growth in final consumption (Final
consumption), annual % growth in GDP (GDP growth), and annual % growth in
GDP per capita (GDP per capita growth). A portion of the data set is shown in the
accompanying table.
a. Perform correlation analysis on the data. Discuss the correlation between the
variables in the data set. Which variable pair has the highest correlation?
b. Conduct principal component analysis on all the variables in the data set.
Standardize the data prior to the analysis or use the correlation matrix if you are
using Analytic Solver.
c. Allow the maximum number of principal components to be calculated by the
software. How many principal components are computed? What percent of the
total variability is accounted for by the first three principal components? How
many principal components must be retained in order to account for at least
80% of the total variance in the data?
d. Which original variable is given the highest weight to compute the first principal
component? Which original variable is given the highest weight to compute the
second principal component?
e. Create a new data set that contains the country names and the principal
components that account for at least 80% of the total variance in the data.
northern California who loves watching football. She keeps track of football results
and statistics of the quarterbacks of each high school team. The accompanying
table shows a portion of the data that Beza has recorded, with the following
variables: the player’s number that Beza assigns to each quarterback (Player),
team’s mascot (Team), completed passes (Comp), attempted pass (Att),
completion percentage (Pct), total yards thrown (Yds), average yards per attempt
(Avg), yards thrown per game (Yds/G), number of touchdowns (TD), and number
of interceptions (Int).
c. Which original variable is given the highest weight to compute
the first principal component? Which original variable is given the highest
weight to compute the second principal component?
d. What is the principal component 1 score for the first record (Player 1)?
49. FILE Baseball_Players. Ben Derby is a scout for a college baseball team.
He attends many high school games and practices each week in order to evaluate
potential players to recruit during each college recruitment season. He also keeps
detailed records about each prospective player. His college team particularly
needs to add another hitter to its roster next season. Luckily, Ben has information
on 144 high school players who have played at least 100 games during the
previous season. A portion of the data is shown in the accompanying table that
includes the following variables: Player’s ID (Player), the number of games played
(G), at bats (AB), runs (R), hits (H), homeruns (HR), runs batted in (RBI), batting
average (AVG), on base percentage (OBP), and slugging percentage (SLG).
a. Conduct principal component analysis on all the variables, except the player’s
ID, in the data set. Standardize the data prior to the analysis.
b. How many principal components are computed? How many principal
components must be retained to account for at least 80% of the total variance in
the data?
c. Display the weights used to compute the first principal component scores.
d. Create a new data set that contains the players’ names and the principal
components that account for at least 80% of the total variance in the data.
50. FILE Internet_Addiction. Internet addiction has been found to be a widespread
problem among university students. A small liberal arts college in Colorado
conducted a survey of Internet addiction among its students using the Internet
Addiction Test (IAT) developed by Dr. Kimberly Young. The IAT contains 20
questions that measure various aspects of Internet addiction. Individuals who
score high on the IAT are more likely to have problematic Internet use. Also
included in the data are the sex of the students (Sex, 1 for female and 0 for male)
and whether or not they study at the graduate level (1 for Graduate and 0 for
Undergraduate). A portion of the data set is shown in the accompanying table.
a. Conduct principal component analysis on all the variables except the Record
Number. Should you standardize the data? Explain.
b. How many principal components are computed? What percent of the total
variability is accounted for by the first principal component? How many principal
components must be retained in order to account for at least 80% of the total
variance in the data?
c. Which original variable is given the highest weight to compute the first principal
component? Which original variable is given the highest weight to compute the
second principal component?
d. What is the principal component 1 score for the first record?
Hillside College, is conducting research on various common food items and their
nutritional facts. She compiled a data set that contains the nutrition facts on 30
common food items. Jenny feels that many of the nutritional facts of these food
items are highly correlated and wonders if there is a way to reduce the
dimensionality of the data set by creating principal components. A portion of the
data set is shown in the accompanying table. (Values are based on 100 grams of
the food items.)
a. Conduct principal component analysis on all the variables except the Name
variable. Should you standardize the data? Explain.
b. How many principal components are computed? What percent of the total
variability is accounted for by the first principal component? How many principal
components must be retained in order to account for at least 80% of the total
variance in the data?
c. Display the weights used to compute the first principal component scores.
d. Create a new data set that contains the names of the food items and the
principal components that account for at least 80% of the total variance in the
data.
52. FILE Happiness. Since 2014, the United Nations has
conducted annual studies that measure the level of happiness among its member
countries. Experts in social science and psychology are commissioned to collect
relevant data and define measurements related to happiness. Happiness
measurements are based on survey questions such as how people feel about
their life (i.e., life ladder), levels of positive and negative emotion, freedom to
make choices (Life choices), and aggregate indicators such as social support, life
expectancy, and relative household income. These data are converted into
numerical scores for each member country. The accompanying table shows a
portion of the United Nations' happiness data.
a. Conduct principal component analysis on all the variables except the Country
variable. Should you standardize the data? Explain.
b. What percent of the total variability is accounted for by the first principal
component? How many principal components must be retained in order to
account for at least 80% of the total variance in the data?
c. Display the weights used to compute the first principal component scores.
Which original variable is given the highest weight to compute the second
principal component?
d. What is the principal component 1 score for the first record (Albania)?
53. FILE Tennis. In tennis, how well a player serves and returns serves often
determines the outcome of the game. Coaches and players track these numbers
and work tirelessly to make improvements. The accompanying table shows a
sample of data that includes local youth tennis players. The relevant variables
include the number of aces (Aces), number of double faults (DF), first serve
percentage (1ST SRV), first serve win percentage (1ST SRV WIN), break point
saved percentage (BP SVD), percentage of service games won (SRV WIN), first
serve return win percentage (1ST RTN WIN), percentage of return games won
(RTN WON), and break point conversion percentage (BP CONV).
a. Conduct principal component analysis on all variables except the Player
variable and display the weights used to compute the first principal component
scores.
b. Which original variable is given the highest weight to compute the first principal
component? Which original variable is given the highest weight to compute the
second principal component?
c. Create a new data set that contains the Player column and the principal
components that account for at least 90% of the total variance in the data.
54. FILE Stocks. Investors usually consider a variety of information to make
investment decisions. The accompanying table displays a sample of large publicly
traded corporations and their financial information. Relevant information includes
stock price (Price), dividend as a percentage of share price (Dividend), price to
earnings ratio (PE), earnings per share (EPS), book value, lowest and highest
share prices within the past 52 weeks (52 wk low and 52 wk high), market value
of the company’s shares (Market cap), and earnings before interest, taxes,
depreciation, and amortization (EBITDA in $billions).
a. Conduct principal component analysis on all the variables except the Name
variable. Should you standardize the data? Explain.
b. What percent of the total variability is accounted for by the first principal
component? How many principal components must be retained in order to
account for at least 80% of the total variance in the data?
c. Which original variable is given the highest weight to compute the first principal
component? Which original variable is given the highest weight to compute the
second principal component?
d. What is the principal component 1 score for the first record (3M)?
Case Study
FILE
NBA
Merrick Stevens is a sports analyst working for ACE Sports Management, a sports
agency that represents over 200 athletes. Merrick is tasked with analyzing sports-
related data and developing a predictive model for the National Basketball
Association (NBA). He uses the NBA data set that contains information on 30
competing NBA teams and 455 players. The player statistics are for several seasons
as well as for their career. Because a player’s salary is based on his performance
over multiple seasons, Merrick decides to only look at the career regular season data
rather than data of a particular season.
Given the large number of predictor variables that may explain a player’s salary,
Merrick decides to investigate whether principal component analysis may be
advantageous as a first step in model building.
Report 8.1 FILE Longitudinal_Information. Subset the data to select only those
individuals who lived in an urban area. Reduce the dimensionality of the data by
converting numerical variables such as age, height, weight, number of years of
education, number of siblings, family size, number of weeks employed, self-esteem
scale, and income into a smaller set of principal components that retain at least 90%
of the information in the original data. Note: You will need to standardize the data
prior to PCA as the scales of the variables are different.
Report 8.2 FILE College_Admissions. Select the data for one of the three
colleges. Reduce the dimensionality of the data by converting numerical variables
such as high school GPAs, MCA GPAs, SAT scores, and ACT scores into a smaller
set of principal components that retain at least 90% of the information in the original
data. Then use the principal components as predictor variables for building models
for predicting admission.
Report 8.3 FILE House_Price. Choose one of the college towns in the data set.
Reduce the dimensionality of the data by converting numerical variables such as
number of bedrooms, number of bathrooms, home square footage, lot square
footage, and age of the house into a smaller set of principal components that retain
at least 90% of the information in the original data. Then use the principal
components as predictor variables for building models for predicting sale prices of
houses.
Report 8.4 FILE TechSales_Reps. Subset the data to include only the employees
who work in the hardware division and have a college degree. Reduce the
dimensionality of the data by converting numerical variables such as age, years with
the company, number of certifications, feedback scores, and salary into a smaller set
of principal components that retain at least 85% of the information in the original
data. Then use the principal components as predictor variables for building models
for predicting the net promoter score (NPS).
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
INTRODUCTORY CASE
24/7 Fitness Center Annual Membership
24/7 Fitness Center is a high-end full-service gym and recruits
its members through advertisements and monthly open house
events. Each open house attendee is given a tour and a one-
day pass. Potential members register for the open house event
by answering a few questions about themselves and their
exercise routine. The fitness center staff places a follow-up
phone call with the potential member and sends information to
open house attendees by mail in the hopes of signing the
potential member up for an annual membership.
Janet Williams, a manager at 24/7 Fitness Center, wants to
develop a data-driven strategy for selecting which new open
house attendees to contact. She has compiled information from
1,000 past open house attendees in the Gym_Data worksheet
of the Gym data file. The data include whether or not the
attendee purchases a club membership (Enroll equals 1 if
purchase, 0 otherwise), the age and the annual income of the
attendee, and the average number of hours that the attendee
exercises per week. Janet also collects the age, income, and
number of hours spent on weekly exercise from 23 new open
house attendees and maintains a separate worksheet called
Gym_Score in the Gym data file. Because these are new open
house attendees, there is no enrollment information on this
worksheet. A portion of the two worksheets is shown in Table
9.1.
FILE
Gym
a. The Gym_Data Worksheet
Enroll   Age   Income   Hours
1        26    18000    14
0        43    13000     9
⋮        ⋮     ⋮        ⋮
0        48    67000    18
b. The Gym_Score Worksheet
Age   Income   Hours
22    33000     5
23    65000     9
⋮     ⋮        ⋮
51    88000     6
SCORING A RECORD
A process for classifying or predicting the value of the target
variable of a new record for given values of predictor variables.
As discussed in Sections 7.4 and 8.3, the problem of overfitting
occurs when a predictive model is made overly complex to fit the
quirks of given sample data. It is possible for a model to perform
really well with the data set used for estimation, but then perform
miserably once a new data set is used. To address this issue, we
partition the data into training, validation, and optional test data sets.
The training data set is used to generate one or more models. The
validation data set is used to evaluate the performance of the models
and select the best model. The optional test data set is used to
assess the performance of the final model on a new data set. Hands-
on examples for data partitioning and cross-validation in Analytic
Solver and R will be provided later in this chapter.
We also discussed several performance measures to assess
predictive models in Section 8.3. For classification, a confusion
matrix is used to summarize all outcomes obtained from the
validation or test data set. These outcomes include true positive (TP)
and true negative (TN), where the observations are classified
correctly by the model. Similarly, false positive (FP) and false
negative (FN) outcomes imply that the observation is incorrectly
classified by the model. Based on these outcomes, performance
measures such as the accuracy rate, the sensitivity, and the
specificity are computed. We also use graphical representations
such as the cumulative lift chart, the decile-wise lift chart, and the
receiver operating characteristic (ROC) curve to assess model
performance. For prediction models, we considered the prediction
error, defined as the difference between the actual and the predicted
value of the numerical target variable. We discussed several
performance measures that capture the magnitude of the prediction
errors, including the root mean square error (RMSE), the mean error
(ME), the mean absolute deviation (MAD) or mean absolute error
(MAE), the mean percentage error (MPE), and the mean absolute
percentage error (MAPE). In this chapter, we will simply interpret
predictive models and assess their performance using Analytic
Solver and R.
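As a quick refresher on how the classification measures are computed from the four confusion-matrix counts, the following R sketch uses hypothetical counts for TP, TN, FP, and FN:
> TP <- 75; TN <- 816; FP <- 84; FN <- 25        # hypothetical counts
> accuracy <- (TP + TN) / (TP + TN + FP + FN)
> misclassification <- 1 - accuracy
> sensitivity <- TP / (TP + FN)                   # true positive rate
> specificity <- TN / (TN + FP)                   # true negative rate
> precision <- TP / (TP + FP)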
Customer ID   Income   Age   Default
001           65       35    Yes
002           59       34    No
003           71       40    No
004           67       32    No
005           60       31    Yes
006           59       37    No
007           72       42    No
008           63       34    No
009           64       38    No
010           71       34    Yes
As you can see from this example, the choice of k will result in
different classification results. For example, if k = 1, then we use the
nearest neighbor (the red marker) to determine the class
membership of the new applicant. Now, the probability that the new
applicant will default becomes 1/1 = 1.
Because the KNN method is a supervised data mining technique,
data partitioning is used to determine its performance and to
optimize model complexity. The nearest neighbor observations are
identified in the training data set. Most practitioners try to minimize
the error rate in the KNN method by experimenting with a range of k
values and their corresponding performance measures from the
validation data set in order to determine the optimal k value. In some
software applications such as Analytic Solver, the test data set is
used to assess the performance of the final predictive model and the
optimal k value. The KNN algorithm can also classify new
observations based on the existing observations that are most
similar to the new observation. As mentioned earlier, this process is
called scoring a new record based on the values of the predictor
variables. As presented in Figure 9.1, the KNN algorithm first
searches the entire training data set for the most similar
observations, and then provides a score for a new observation.
Unlike a traditional analysis with linear and logistic regression
models, the KNN algorithm does not assume any underlying
distribution in the data for making predictions. This nonparametric
approach makes the KNN method very useful in a practical situation
because real-world data do not always follow theoretical
assumptions. Despite its simplicity, the KNN method often produces
highly accurate results. Because the KNN algorithm searches the
entire training data set for close neighbors, the process can be quite
time-consuming and computationally intensive, especially with a
large amount of data. Fortunately, several computer software
programs include the KNN algorithm. Example 9.1 illustrates how to
perform KNN classification using Analytic Solver and R.
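To make the mechanics concrete, the following R sketch scores one new observation by computing Euclidean distances to a small training set and taking the proportion of target-class cases among its k nearest neighbors; the data values are hypothetical and the two predictors are assumed to be already standardized.
> train_x <- data.frame(income = c(-0.5, 0.8, 1.2, -1.1, 0.2), age = c(-0.3, 1.0, -0.7, 0.6, -1.4))
> train_y <- c(1, 0, 0, 1, 0)                  # known class memberships
> new_x <- c(income = -0.4, age = -0.5)        # new record to score
> d <- sqrt((train_x$income - new_x["income"])^2 + (train_x$age - new_x["age"])^2)   # Euclidean distances
> k <- 3
> nearest <- order(d)[1:k]                     # indices of the k nearest neighbors
> mean(train_y[nearest])                       # estimated probability of Class 1 for the new record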
FILE
Gym
EXAMPLE 9.1
Recall from the introductory case that Janet Williams, the
membership manager at 24/7 Fitness Center, has collected
data from 1,000 past open house attendees (Gym_Data
worksheet) and 23 new open house attendees (Gym_Score
worksheet). The data include the age and the annual income of
the attendee, the average number of hours that the attendee
exercises per week, and for the Gym_Data worksheet, whether
or not the attendee purchases a membership. Janet is
interested in a model that will help her identify attendees who
are likely to purchase a membership. Perform KNN analysis to
estimate a classification model and evaluate its performance.
Report the relevant performance measures and score new
records.
SOLUTION:
Using Analytic Solver
As discussed in Chapters 7 and 8, to develop and evaluate a
classification model, we generally perform cross-validation
using either the holdout method or the k-fold method. Analytic
Solver provides a procedure for the holdout method and allows
for partitioning a data set into training, validation, and test data
sets. In this chapter, we partition our data set in Analytic Solver
as follows: 50% for training, 30% for validation, and 20% for
test. Here, an independent assessment of the predictive
performance of the KNN model is conducted with the test data
set that is not used in the model development.
a. Open the Gym data file and go to the Gym_Data worksheet.
b. Choose Data Mining > Partition > Standard Partition.
c. See Figure 9.2. Click on the ellipsis button next to
Data Range and highlight cells $A$1:$D$1001. Make sure that
the box First Row Contains Headers is checked. The
Variables in Input Data box will populate. Select and move all
the variables to the Selected Variables box. Choose Specify
percentages with 50% for Training Set, 30% for Validation Set,
and 20% for Test Set. Accept other defaults. Click OK.
g. See Figure 9.5. In the Scoring tab, check the
Summary Report boxes for Score Training Data, Score
Validation Data, and Score Test Data. Also check the Lift
Charts box for Score Test Data. Check the In Worksheet
checkbox under the Score New Data section. Under the New
Data (WS) tab, select Gym_Score in the Worksheet box and
enter $A$1:$C$24 in the Data Range. These 23 records of
potential members will be scored by the KNN classification
algorithm. Make sure the First Row Contains Headers box is
checked and click the Match By Name button. Click Finish.
Using R
As mentioned earlier, in order to evaluate the accuracy of the
KNN algorithm as well as determine the appropriate value of k,
we generally perform cross-validation using either the holdout
method or the k-fold method, both discussed in Chapter 7.
Most data mining packages in R allow for the k-fold cross-
validation method, which, when compared with the holdout
method, is less sensitive to how the data are partitioned.
Therefore, in R we only need to divide the data set into two
partitions, training and validation, and implement the k-fold
cross-validation. Depending on the amount of data available,
when cross-validation is performed, most practitioners partition
data into either 60% training/40% validation or 70%
training/30% validation. We divide the data in the Gym_Data
worksheet into two partitions, 60% for training and 40% for
validation, and then implement a 10-fold cross-validation
technique.
a. Import the data from the Gym_Data worksheet of the Gym
data file into a data frame (table) and label it myData.
The following instructions and results are based on R version
3.5.3. To replicate the results with newer versions of R, enter:
> suppressWarnings(RNGversion("3.5.3"))
b. For KNN estimation and the resulting performance measures
and diagrams, install and load the caret, gains, and pROC
packages. Enter:
> install.packages(c("caret", "gains", "pROC"))
> library(caret)
> library(gains)
> library(pROC)
On some computers, you might also need to
install other packages that support the caret package using
the command > install.packages("caret", dependencies =
c("Depends", "Suggests")). Also, if prompted by R Studio,
install and load the car package.
c. We use the scale function to standardize the Age, Income,
and Hours variables; store the standardized values in a new
data frame called myData1; and append the original Enroll
variable back to myData1. We use the as.factor function to
convert the target variable (Enroll) into a categorical data type.
To simplify the R code, we use the colnames function to
rename myData1$myData.Enroll (in column 4) to
myData1$Enroll. Enter:
> myData1 <- scale(myData[2:4])
> myData1 <- data.frame(myData1, myData$Enroll)
> colnames(myData1)[4] <- 'Enroll'
> myData1$Enroll <- as.factor(myData1$Enroll)
d. To partition the data into 60% training and 40% validation sets,
we use the createDataPartition function and specify Enroll as
the target variable. To ensure consistency, we use the
set.seed function to set the random seed to 1. Enter:
> set.seed(1)
> myIndex <- createDataPartition(myData1$Enroll, p=0.6, list
= FALSE)
> trainSet <- myData1[myIndex,]
> validationSet <- myData1[-myIndex,]
The data set is partitioned into 60% for training and 40% for
validation with the option p = 0.6. The 60% training set is
assigned to an object called trainSet and the other 40% is
assigned to the validationSet object. In order to maintain the
same ratio of target and nontarget class cases for both the
training and validation data sets, R partitions 601 cases into
the training data set and 399 cases into the validation set.
e. We use the trainControl function to implement a 10-fold
cross-validation by setting the option method equal to “cv” and
the option number equal to 10. Enter:
> myCtrl <- trainControl(method = "cv", number = 10)
f. We use the expand.grid function to specify possible k values
from 1 to 10 and store the results in an object called myGrid.
The optimal k value is determined based on accuracy. The
possible range of k values may vary; you may experiment with
a different range by changing the numbers in the statement.
Enter:
> myGrid <- expand.grid(.k=c(1:10))
g. To implement the KNN method with the training data set with
option values specified in steps e and f, we use the train
function and store the results in an object called KNN_fit. To
ensure consistency of the cross-validation results, we again
use the set.seed function to fix a random seed. Enter:
> set.seed(1)
> KNN_fit <- train(Enroll ~., data = trainSet, method = "knn",
trControl = myCtrl, tuneGrid = myGrid)
> KNN_fit
Note that the Enroll variable is specified as a target variable
and “knn” is specified as the classification method. The KNN
results are shown in Figure 9.9. The value k = 6
yields the highest accuracy rate (0.9102695) and
will be used in subsequent steps.
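A minimal sketch of the scoring commands that apply the fitted model to the validation data set, using the caret functions and the KNN_fit and validationSet objects defined earlier and producing the KNN_Class_prob object referenced in the next step, is:
> KNN_Class <- predict(KNN_fit, newdata = validationSet)                      # predicted class memberships
> confusionMatrix(KNN_Class, validationSet$Enroll, positive = "1")            # confusion matrix and related measures
> KNN_Class_prob <- predict(KNN_fit, newdata = validationSet, type = "prob")  # predicted class probabilities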
o. We use the roc function, which produces a roc
object that can be used to plot the ROC curve. We then use
the plot.roc function to plot the ROC curve and the auc
function to retrieve the AUC value. Enter:
> roc_object <- roc(validationSet$Enroll, KNN_Class_prob[,2])
> plot.roc(roc_object)
> auc(roc_object)
The area under the ROC curve, or AUC, is very high (0.9532),
indicating that the KNN model performs extremely well in
predicting the gym enrollment among potential members.
Figure 9.15 displays the ROC curve. The KNN model
performs better than the baseline model (shown as the green
diagonal line) in terms of sensitivity and specificity across all
cutoff values.
EXERCISES 9.2
Note: These exercises can be solved using Analytic Solver and/or R. The answers,
however, will depend on the software package used. All answers in R are based on
version 3.5.3. To replicate the results with newer versions of R, execute the following
line of code at the beginning of the R session:
suppressWarnings(RNGversion("3.5.3")). For Analytic Solver, partition data sets into
50% training, 30% validation, and 20% test and use 12345 as the default random
seed. For R, partition data sets into 60% training and 40% validation and implement
the 10-fold cross-validation. Use the statement set.seed(1) to specify the random
seed for data partitioning and cross-validation. When searching for the optimal value
of k, search within possible k values from 1 to 10 for both Analytic Solver and R.
Some data files have two worksheets (for example, Exercise_9.3_Data and
Exercise_9.3_Score worksheets) for model development and scoring new records.
Mechanics
1. FILE Exercise_9.1. The accompanying file contains 60 observations with the
binary target variable y along with the predictor variables x1 and x2.
a. Perform KNN analysis. What is the optimal value of k?
b. Report the accuracy, specificity, sensitivity, and precision rates for the test data
set (for Analytic Solver) or validation data set (for R).
2. FILE Exercise_9.2. The accompanying file contains 111
observations with the binary target variable y along with the predictor variables x1,
x2, x3, and x4.
a. Perform KNN analysis on the data set. What is the optimal value of k?
b. Report the accuracy, specificity, sensitivity, and precision rates for the test data
set (for Analytic Solver) or validation data set (for R).
c. Comment on the performance of the KNN classification model.
3. FILE Exercise_9.3. The accompanying file contains 80 observations with the
binary target variable y along with the predictor variables x1, x2, x3, and x4.
a. Perform KNN analysis on the data set. What is the optimal value of k?
b. What is the misclassification rate for the optimal k?
c. Report the accuracy, specificity, sensitivity, and precision rates for the test data
set (for Analytic Solver) or validation data set (for R).
d. What is the area under the ROC curve (or the AUC value)?
e. Score the new observations in the Exercise_9.3_Score worksheet. What is the
predicted response value of the first observation?
4. FILE Exercise_9.4. The accompanying file contains 200 observations with the
binary target variable y along with the predictor variables x1, x2, x3, x4, and x5.
a. Perform KNN analysis on the data set. What is the optimal value of k?
b. What is the misclassification rate for the optimal k?
c. Report the accuracy, specificity, sensitivity, and precision rates for the test data
set (for Analytic Solver) or validation data set (for R).
d. Change the cutoff value to 0.2. Report the accuracy, specificity, sensitivity, and
precision rates for the test data set (for Analytic Solver) or the validation data
set (for R).
e. Comment on the performance of the KNN classification model.
5. FILE Exercise_9.5. The accompanying file contains 1,000 observations with the
binary target variable y along with the predictor variables x1, x2, and x3.
a. Perform KNN analysis on the data set. What is the optimal value of k?
b. What is the misclassification rate for the optimal k?
c. Report the accuracy, specificity, sensitivity, and precision rates and the AUC
value for the test data set (for Analytic Solver) or validation data set (for R).
d. Comment on the performance of the KNN classification model.
6. FILE Exercise_9.6. The accompanying file contains 2,000 observations with the
binary target variable y along with the predictor variables x1, x2, and x3.
a. Perform KNN analysis on the data set. What is the optimal value of k?
b. Report the accuracy, specificity, sensitivity, and precision rates and the AUC
value for the test data set (for Analytic Solver) or validation data set (for R).
c. Compare the specificity and sensitivity rates and comment on the performance
of the KNN classification model.
d. Display the cumulative lift chart, decile-wise lift chart, and ROC curve.
7. FILE Exercise_9.7. The accompanying file contains 400 observations with the
binary response variable y along with the predictor variables x1, x2, x3, and x4.
a. Perform KNN analysis on the data set. What is the optimal value of k?
b. Report the accuracy, specificity, sensitivity, and precision rates and the AUC
value for the test data set (for Analytic Solver) or validation data set (for R).
c. What is the lift value of the leftmost bar shown in the decile-wise lift chart?
d. Score the new observations in the Exercise_9.7_Score worksheet. What are
the predicted response values of the first two new observations?
Applications
8. FILE Admit. Universities often rely on a high school student’s grade point
average (GPA) and scores on the SAT or ACT for the college admission
decisions. Consider the data for 120 applicants on college admission (Admit
equals 1 if admitted, 0 otherwise) along with the student’s GPA and SAT scores. A
portion of the Admit_Data worksheet is shown in the accompanying table.
Admit   GPA    SAT
1       3.10   1550
0       2.70   1360
⋮       ⋮      ⋮
1       4.40   1320
9. FILE CFA. The Chartered Financial Analyst (CFA) designation
is the de facto professional certification for the financial industry. Employers
encourage their prospective employees to complete the CFA exam. Daniella
Campos, an HR manager at SolidRock Investment, is reviewing 10 job
applications. Given the low pass rate for CFA Level 1, Daniella wants to know
whether or not the 10 prospective employees will be able to pass the CFA Level 1
exam. Historically, the pass rate is higher for those with work experience and a
good GPA in college. With this insight, she compiles the information on 63 current
employees who took the CFA Level I exam last year, including the employee’s
success on the exam (1 for pass, 0 for fail), the employee’s college GPA, and
years of work experience. A portion of the CFA_Data worksheet is shown in the
accompanying table.
Pass   College GPA   Years of Experience
1      3.20          11
1      3.75          15
⋮      ⋮             ⋮
1      3.21           6
a. Perform KNN analysis to estimate a classification model for the CFA Level 1
exam using the CFA_Data worksheet and score the 10 job applicants in the
CFA_Score worksheet. What are the misclassification rates for k = 3, 4, and 5?
b. What is the optimal value of k? Report the overall accuracy, specificity,
sensitivity, and precision rates for the test data set (for Analytic Solver) or
validation data set (for R).
c. What is the predicted CFA Level 1 outcome for the first job applicant?
10. FILE SocialMedia. A social media marketing company is conducting consumer
research to see how the income level and age might correspond to whether or not
consumers respond positively to a social media campaign. Aliyah Turner, a new
college intern, is assigned to collect data from the past marketing campaigns. She
compiled data on 284 consumers who participated in the marketing campaigns in
the past, including income (in $1,000s), age, and whether or not each individual
responded to the campaign (1 if yes, 0 otherwise). A portion of the
SocialMedia_Data worksheet is shown in the accompanying table.
Response   Income (in $1,000s)   Age
0          103.3                 67
0           61.4                 34
⋮           ⋮                    ⋮
1           91.3                 40
a. Perform KNN analysis to estimate a classification model for the social media
campaign using the SocialMedia_Data worksheet and score new consumer
records in the SocialMedia_Score worksheet. What is the optimal value of k?
b. Report and interpret the overall accuracy, specificity, sensitivity, and precision
rates for the test data set (for Analytic Solver) or validation data set (for R).
c. What is the area under the ROC curve (or the AUC value)? Comment on the
performance of the KNN classification model.
d. What is the predicted outcome for the first new consumer record?
e. Change the cutoff value to 0.3. Report the accuracy, specificity, sensitivity, and
precision rates for the test data set (for Analytic Solver) or validation data set
(for R).
11. FILE Spam. Peter Derby works as a cyber security analyst at a private equity
firm. His colleagues at the firm have been inundated by a large number of spam
e-mails. Peter has been asked to implement a spam detection system on the
company’s e-mail server. He reviewed a sample of 500 spam and legitimate e-
mails with relevant variables: spam (1 if spam, 0 otherwise), the number of
recipients, the number of hyperlinks, and the number of characters in the
message. A portion of the Spam_Data worksheet is shown in the accompanying
table.
Spam   Recipients   Hyperlinks   Characters
0      19           1            47
0      15           1            58
⋮      ⋮            ⋮            ⋮
1      13           2            32
a. Perform KNN analysis to estimate a classification model for spam detection
using the Spam_Data worksheet and score new e-mails in the Spam_Score
worksheet. What is the optimal value of k?
b. Report the overall accuracy, specificity, sensitivity, and precision rates for the
test data set (for Analytic Solver) or validation data set (for R).
c. What is the area under the ROC curve (or AUC value)?
d. What is the predicted outcome for the first new e-mail?
12. FILE Security. Law enforcement agencies monitor social media sites on a
regular basis, as a way to identify and assess potential crimes and terrorism
activities. For example, certain keywords on Facebook pages are tracked, and the
data are compiled into a data mining model to determine whether or not the
Facebook page is a potential threat. Officer Matthew Osorio is assigned to explore
data mining techniques that can be used for this purpose. He starts by
experimenting with KNN algorithms to monitor and assess social media sites with
war-related terms such as “warfare,” “bomb,” and “attack” as well as suspicious
keywords such as “extremist,” “radical,” and “conspiracy.” He collects a data set
with 300 observations, a portion of which is shown in the accompanying table.
Each record in the data set includes the following variables: Threat (1 if yes, 0
otherwise), the number of suspicious words (WarTerms and Keywords), and the
number of hyperlinks to, or mentions of, suspicious sites (Links).
Threat WarTerms Keywords Links
0 6 5 5
0 3 5 8
⋮ ⋮ ⋮ ⋮
1 4 4 2
a. Perform KNN analysis on the data set. What is the optimal value of k?
b. What is the misclassification rate for the optimal k?
c. Report and interpret the accuracy, specificity, sensitivity, and precision rates for
the test data set (for Analytic Solver) or validation data set (for R).
d. Generate the cumulative lift chart. Does the entire lift curve lie above the
baseline?
e. Generate the decile-wise lift chart. What is the lift value of the leftmost bar?
f. Generate the ROC curve. What is the area under the ROC curve (or the AUC
value)?
g. Comment on the performance of the KNN classification model.
13. FILE HR. Daniel Lara, a human resources manager at a large tech consulting
firm, has been reading about using analytics to predict the success of new
employees. With the fast-changing nature of the tech industry, some employees
have had difficulties staying current in their field and have missed the opportunity
to be promoted into a management position. Daniel is particularly interested in
whether or not a new employee is likely to be promoted into a management role
after 10 years with the company. He gathers information on 300 current
employees who have worked for the firm for at least 10 years. The information
was based on the job application that the employees provided when they
originally applied for a job at the firm. For each employee, the following variables
are listed: Promoted (1 if promoted within 10 years, 0 otherwise), GPA (college
GPA at graduation), Sports (number of athletic activities during college), and
Leadership (number of leadership roles in student organizations). A portion of the
HR_Data worksheet is shown in the accompanying table.
Promoted   GPA    Sports   Leadership
0          3.28   0        2
1          3.93   6        3
⋮          ⋮      ⋮        ⋮
0          3.54   5        0
a. Use the HR_Data worksheet to help Daniel perform KNN analysis to determine
whether or not an employee is likely to be promoted into a management role
after 10 years with the company. Score the records of the 10 new employees in
the HR_Score worksheet. What is the optimal k? What is the predicted outcome
for the first new employee?
b. What is the misclassification rate for the optimal k?
c. Report the accuracy, specificity, sensitivity, and precision rates for the test data
set (for Analytic Solver) or validation data set (for R).
d. Display the cumulative lift chart, decile-wise chart, and ROC curve.
e. Comment on the performance of the KNN classification model. Is the KNN
classification an effective way to predict an employee’s success?
14. FILE Heart. In recent years, medical research has incorporated the use of data
analytics to find new ways to detect heart disease in its early stage. Medical
doctors are particularly interested in accurately identifying high-risk patients so
that preventive care and intervention can be administered in a timely manner. The data set contains readily available information such as the patient's age (Age), blood pressure (BP Systolic and BP Diastolic), and BMI, along with an indicator of whether or not the patient has heart disease (Disease = 1 if heart disease, 0 otherwise). A portion of the data set is provided in the accompanying table.
a. Perform KNN analysis on the data set. What is the optimal value of k?
b. What is the misclassification rate for the optimal k?
c. Report the accuracy, specificity, sensitivity, and precision rates for the test data
set (for Analytic Solver) or validation data set (for R).
d. Display the cumulative lift chart, decile-wise chart, and ROC curve.
e. Comment on the performance of the KNN classification model. Is the KNN
classification an effective way to predict heart disease?
15. FILE Retail. Online retailers often use a recommendation system to suggest new
products to consumers. Consumers are compared to others with similar
characteristics such as past purchases, age, income, and education level. A data
set, such as the one shown in the accompanying table, is often used as part of a
product recommendation system in the retail industry. The variables used in the
system include whether or not the consumer eventually purchases the suggested
item (Purchase = 1 if purchased, 0 otherwise), the consumer’s age (Age in years),
income (Income, in $1,000s), and number of similar items previously purchased
(PastPurchase).
page 389
Purchase Age Income PastPurchase
1 48 99 21
1 47 32 0
⋮ ⋮ ⋮ ⋮
0 34 110 2
115 45 N
68 31 Y
⋮ ⋮ ⋮
73 34 N
9.36 79.99 0
10.37 61.89 0
⋮ ⋮ ⋮
15.02 38.02 1
a. Perform KNN analysis on the data set. What is the optimal value of k?
b. What is the misclassification rate for the optimal k?
c. Report the accuracy, specificity, sensitivity, and precision rates for the test data
set (for Analytic Solver) or validation data set (for R).
d. What is the area under the ROC curve (or the AUC value)?
e. Based on your answers in part c and d, is KNN an effective way to classify
potential customers?
f. Change the cutoff value to 0.3. Report the accuracy, specificity, sensitivity, and
precision rates for the test data set (for Analytic Solver) or validation data set
(for R).
page 390
Male (x = 1) Female (x = 0)
Purchase (y = 1) 30 50
No Purchase (y = 0) 20 100
page 392
EXAMPLE 9.2
An institute for public policy in Washington, D.C., hires a
number of college interns every summer. This year, Sara
Anderson, a third-year Economics major from Massachusetts,
is selected as one of the research interns. Her first assignment
is to conduct data analysis to help congressional offices gain a
better understanding of U.S. residents whose incomes are
below the poverty level. To complete her assignment, Sara
extracts a relevant data set maintained by the U.S. Census
Bureau. The data set has 9,980 observations and is stored in
the Census_Data worksheet of the Census data file. Each
observation contains an individual’s marital status (Married;
yes/no), sex (Female; yes/no), ethnicity (White; yes/no), age
groups (Age: 1 for [18, 25), 2 for [25, 35), 3 for [35, 45), 4 for
[45, 55), and 5 for 55 years and older), whether or not the
individual receives college-level education (Edu; yes/no), and
whether or not the individual’s income is below the poverty
level (Poverty = 1 if living in poverty, 0 otherwise). In addition,
she keeps records of 66 new individuals with the predictor
variables in the Census_Score worksheet of the Census data
file for scoring based on the naïve Bayes classifier. A portion of
the two worksheets is shown in Table 9.7.
FIGURE 9.16 Analytic Solver’s classification summary tables for naïve Bayes
page 394
Using R
As before, we will use the caret package for classification with k-fold cross-validation, so we divide the data into only two partitions: training and validation.
a. Import the data from the Census_Data worksheet of the
Census data file into a data frame (table) and label it myData.
The following instructions and results are based on R version
3.5.3. To replicate the results with newer versions of R, enter:
> suppressWarnings(RNGversion("3.5.3"))
page 395
b. The caret and klaR packages contain necessary
functions for partitioning the data, k-fold cross-validation, and
naïve Bayes classification. Also, similar to the KNN example,
a cumulative lift chart, decile-wise chart, and ROC curve can
be generated for visual inspection of the model performance
using the gains and pROC packages. Install and load all
packages, if you have not already done so. Enter:
> install.packages("caret")
> install.packages("klaR")
> install.packages("gains")
> install.packages("pROC")
> library(caret)
> library(klaR)
> library(gains)
> library(pROC)
c. We use the as.factor command to convert the Poverty variable into a categorical type. Enter:
> myData$Poverty <- as.factor(myData$Poverty)
d. As in the KNN example, we use the set.seed command to set
the random seed to one for consistency. We use the
createDataPartition function to partition the data into training
(60%) and validation (40%). Enter:
> set.seed(1)
> myIndex <- createDataPartition(myData$Poverty, p=0.6, list
= FALSE)
> trainSet <- myData[myIndex,]
> validationSet <- myData[-myIndex,]
e. We use the trainControl function to specify a 10-fold cross-
validation process. On the training data set, we use the train
function and set the method option equal to “nb”, which stands
for naïve Bayes. The Poverty variable is identified as the
target variable. To ensure consistency of the cross-validation
results, we again use the set.seed function to fix a random
seed. Enter:
> myCtrl <- trainControl(method='cv', number=10)
> set.seed(1)
> nb_fit <- train(Poverty ~., data = trainSet, method = "nb",
trControl=myCtrl)
> nb_fit
Figure 9.18 displays the naïve Bayes results. The naïve Bayes procedure reports accuracy rates for two candidate models that make different assumptions about the underlying distributions of the predictor variables and retains the option that yields the higher accuracy (77.29% in this case) in the final model.
page 396
f. We use the predict and confusionMatrix
functions to validate the model on the validation data set and
produce a confusion matrix. Enter:
> nb_class <- predict(nb_fit, newdata = validationSet)
> confusionMatrix(nb_class, validationSet$Poverty, positive =
'1')
Note that in the confusionMatrix statement, we specify the value '1' of the Poverty variable as the positive or success class. The accuracy value (75.65%) remains consistent with the results from the training data set. Unlike in Analytic Solver, the naïve Bayes model in R yields sensitivity and specificity values that are close to each other (76.73% and 71.80%, respectively). Again, the algorithms in the two software packages are not the same and yield different results. A portion of the R results is shown in Figure 9.19.
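The plotting commands that follow rely on two objects not created in the steps above: the predicted class probabilities (nb_class_prob) and a gains table (gains_table). A minimal sketch of how these objects are typically created, using caret's predict function with type = "prob" and the gains function from the gains package, is shown below; Poverty is converted back to a numeric 0/1 variable because the lift-chart commands sum and average that column. Enter:
> nb_class_prob <- predict(nb_fit, newdata = validationSet, type = "prob")  # predicted class probabilities
> validationSet$Poverty <- as.numeric(as.character(validationSet$Poverty))  # back to numeric 0/1 values
> gains_table <- gains(validationSet$Poverty, nb_class_prob[,2])  # gains table used by the lift charts
The plot and lines commands that follow draw the cumulative lift chart and its dashed baseline.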
> plot(c(0, gains_table$cume.pct.of.total*sum(validationSet$Poverty)) ~
c(0, gains_table$cume.obs), xlab = "# of cases",
ylab = "Cumulative", type = "l")
> lines(c(0, sum(validationSet$Poverty)) ~ c(0,
dim(validationSet)[1]), col = "red", lty = 2)
i. We use the barplot function to create a decile-wise chart as
shown in Figure 9.22. Enter:
> barplot(gains_table$mean.resp/mean(validationSet$Poverty),
names.arg=gains_table$depth, xlab="Percentile", ylab="Lift",
ylim = c(0,1.5), main = "Decile-Wise Lift Chart")
j. We use the roc, plot.roc, and auc functions to create the
ROC curve and compute the area under the curve. Enter:
> roc_object<- roc(validationSet$Poverty, nb_class_prob[,2])
> plot.roc(roc_object)
> auc(roc_object)
The area under the ROC curve, or AUC, is 0.8437, indicating that the naïve Bayes model performs well in predicting whether or not an individual is in poverty. Figure 9.23 displays the ROC curve, which shows that the model performs better than the baseline model (shown as the green diagonal line) in terms of sensitivity and specificity across all cutoff values.
page 398
k. Import the new observations from the
Census_Score worksheet of the Census data file into a data
frame (table) and label it myScoreData. We then use the
predict function to score the 66 new records and append the
classification results back to the original data. Enter:
> nb_class_score <- predict(nb_fit, newdata = myScoreData)
> myScoreData <- data.frame(myScoreData, nb_class_score)
Table 9.9 shows R's scoring results. Similar to the Analytic Solver results, the naïve Bayes model in R tends to classify new records of married males with a college education as unlikely to live in poverty.
Using the data from the Gym_Data worksheet, we will outline the
procedure for binning the Age variable; other variables can be
binned similarly. Note that the values of the Age variable range from
21 to 68. Suppose we want to bin Age into five categories.
page 399
In Analytic Solver, go to Data Mining > Data Analysis >
Transform > Transform Continuous Data > Bin. Click on the
ellipsis next to Data range and highlight cells $A$1:$D$1001.
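In R, a comparable transformation can be carried out with the cut function. The following is a minimal sketch, assuming the Gym_Data worksheet has been imported into a data frame named myData and that five equal-width bins are acceptable; adjust the breaks argument if a different binning scheme is required. Enter:
> myData$Age <- cut(myData$Age, breaks = 5, labels = FALSE)  # recode Age into five equal-width bins, numbered 1 to 5
> table(myData$Age)  # inspect how many observations fall into each bin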
EXERCISES 9.3
Note: These exercises can be solved using Analytic Solver
and/or R. The answers, however, will depend on the software
package used. All answers in R are based on version 3.5.3. To
replicate the results with newer versions of R, execute the
following line of code at the beginning of the R session:
suppressWarnings(RNGversion("3.5.3")). For Analytic Solver,
partition the data into 60% training and 40% validation and use
12345 as the default random seed. For R, partition the data into
60% training and 40% validation and implement the 10-fold
cross-validation. Use the statement set.seed(1) to set the random seed to one for data partitioning and cross-validation.
Some data files have two worksheets (e.g.,
Exercise_9.18_Data and Exercise_9.18_Score worksheets) for
model development and scoring new records.
Mechanics
18. FILE Exercise_9.18. The accompanying data set contains three predictor
variables (x1, x2, and x3) and the target variable (y). Partition the data in the
Exercise_9.18_Data worksheet to develop a naïve Bayes classification model
where “Yes” denotes the positive or success class for y. Score the 10 new
observations on the Exercise_9.18_Score worksheet.
a. Report the accuracy, sensitivity, and specificity rates for the validation data set.
b. Generate the decile-wise lift chart. What is the lift value of the leftmost bar?
What does this value imply?
page 400
c. Generate the ROC curve. What is the area under the ROC
curve (or the AUC value)?
d. Report the scoring results for the first three new observations.
19. FILE Exercise_9.19. The accompanying data set contains three predictor
variables (x1, x2, and x3) and the target variable (y). Partition the data in the
Exercise_9.19_Data worksheet to develop a naïve Bayes classification model
where “Y” denotes the positive or success class for y. Score the five new
observations on the Exercise_9.19_Score worksheet.
a. Report the accuracy, sensitivity, and specificity rates for the validation data set.
b. What is the area under the ROC curve (or AUC value)?
c. Report the scoring results for the five new observations.
20. FILE Exercise_9.20. The accompanying data set contains four predictor
variables (x1, x2, x3, and x4) and the target variable (y). Partition the data in the
Exercise_9.20_Data worksheet to develop a naïve Bayes classification model
where “1” denotes the positive or success class for y. Score the five new
observations on the Exercise_9.20_Score worksheet.
a. Report the accuracy, sensitivity, and specificity rates for the validation data set.
b. Generate the cumulative lift chart. Does the entire lift curve lie above the
baseline?
c. Generate the ROC curve. What is the area under the ROC curve (or the AUC
value)?
d. Report the scoring results for the five new observations.
e. Develop the naïve Bayes model with only x1, x2, and y in the naïve Bayes
model. Repeat parts a through c and compare the results.
page 401
21. FILE Exercise_9.21. The accompanying data set contains
three predictor variables (x1, x2, and x3) and the target variable (y). Partition the
data to develop a naïve Bayes classification model where “1” denotes the positive
or success class for y.
a. Report the accuracy, sensitivity, and specificity rates for the validation data set.
b. Generate the decile-wise lift chart. What is the lift value of the leftmost bar?
What does this value imply?
c. Generate the ROC curve. What is the area under the ROC curve (or the AUC
value)?
d. Can the naïve Bayes model be used to effectively classify the data? Explain
your answer.
22. FILE Exercise_9.22. The accompanying data set contains three predictor
variables (x1, x2, and x3) and the target variable (y). Partition the data to develop
a naïve Bayes classification model where “1” denotes the positive or success
class for y.
a. Report the accuracy, sensitivity, specificity, and precision rates for the validation
data set.
b. Generate the cumulative lift chart. Does the entire lift curve lie above the
baseline?
c. Generate the ROC curve. What is the area under the ROC curve (or the AUC
value)?
d. Can the naïve Bayes model be used to effectively classify the data? Explain
your answer.
23. FILE Exercise_9.23. The accompanying data set contains two predictor
variables (x1 and x2) and the target variable (y). Partition the data to develop a
naïve Bayes classification model where “1” denotes the positive or success class
for y.
a. Report the accuracy, sensitivity, specificity, and precision rates for the validation
data set.
b. Generate the decile-wise lift chart. What is the lift value of the leftmost bar?
c. Generate the ROC curve. What is the area under the ROC curve (or the AUC
value)?
d. Can the naïve Bayes model be used to effectively classify the data? Explain.
24. FILE Exercise_9.24. The accompanying data set contains three predictor
variables (x1, x2, and x3) and the target variable (y).
a. Bin predictor variables x1, x2, and x3. For Analytic Solver, choose the Equal
count option and three bins for each of the three variables. For R, bin x1 into [0,
6), [6, 14), and [14, 30); x2 into [0, 10), [10, 20), and [20, 61); and x3 into [0, 3),
[3, 5), and [5, 10). What are the bin numbers for the variables of the first two
observations?
b. Partition the transformed data to develop a naïve Bayes classification model
where “1” denotes the positive or success class for y. Report the accuracy,
sensitivity, specificity, and precision rates for the validation data set.
c. Generate the ROC curve. What is the area under the ROC curve (or the AUC
value)?
d. Change the cutoff value to 0.2. Report the accuracy, sensitivity, specificity, and
precision rates for the validation data set.
25. FILE Exercise_9.25. The accompanying data set contains three predictor
variables (x1, x2, and x3) and the target variable (y).
a. Bin predictor variables x1 and x2. For Analytic Solver, choose the Equal count option and three bins for each of the two variables. For R, bin x1 into [0, 60),
[60, 400), and [400, 30000); and x2 into [0, 160), [160, 400), and [400, 800).
What are the bin numbers for the variables of the first two observations?
b. Partition the transformed data to develop a naïve Bayes classification model
where “1” denotes the positive or success class for y. Report the accuracy,
sensitivity, specificity, and precision rates for the validation data set.
c. Generate the ROC curve. What is the area under the ROC curve (or the AUC
value)?
26. FILE Exercise_9.26. The accompanying data set contains three predictor
variables (x1, x2, and x3) and the target variable (y).
a. Bin predictor variables x1, x2, and x3. For Analytic Solver, choose the Equal
interval option and 2 bins for each of the three variables. For R, bin x1 into [0,
125) and [125, 250); x2 into [0, 30) and [30, 60); and x3 into [0, 30) and [30,
60). What are the bin numbers for the variables of the first two observations?
b. Partition the transformed data to develop a naïve Bayes classification model
where “1” denotes the positive or success class for y. Report the accuracy,
sensitivity, specificity, and precision rates for the validation data set.
c. Generate the ROC curve. What is the area under the ROC curve (or the AUC
value)?
d. Change the cutoff value to 0.4. Report the accuracy, sensitivity, specificity, and
precision rates for the validation data set.
27. FILE Exercise_9.27. The accompanying data set contains four predictor
variables (x1, x2, x3, and x4) and the target variable (y).
a. Bin predictor variables x1, x2, x3, and x4. For Analytic Solver, choose the Equal
interval option and 2 bins for each of the four variables. For R, bin x1 into [0,
40000) and [40000, 80000); x2 into [0, 50) and [50, 100); x3 into [50, 75) and
[75, 100); and x4 into [0, 20000) and [20000, 40000). What are the bin numbers
for the variables of the first two observations?
b. Partition the transformed data to develop a naïve Bayes classification model
where “1” denotes the positive or success class for y. Report the accuracy,
sensitivity, specificity, and precision rates for the validation data set.
c. Generate the ROC curve. What is the area under the ROC curve (or the AUC
value)?
d. Can the naïve Bayes model be used to effectively classify the data? Explain
your answer.
Applications
28. FILE International. Every year, hundreds of thousands of international students
apply to graduate programs in the United States. Two of the most important
admissions criteria are undergraduate GPAs and TOEFL scores. An English
language preparation school in Santiago, Chile, wants to examine the acceptance
records of its former students who had applied to graduate school in the United
States during the past two years. The results will be used to help advise new
students about their chance of acceptance to their first choice of graduate
programs. A portion of the data set is shown in the accompanying table with the
following variables: Accept (1 if accepted, 0 otherwise); GPA (1 for below 3.00, 2
for 3.00–3.49, 3 for 3.50 and above); and TOEFL (H for 80 or above, L for below
80).
Accept GPA TOEFL
1 1 H
0 1 H
⋮ ⋮ ⋮
0 1 L
a. Partition the data to develop a naïve Bayes classification model. Report the
accuracy, sensitivity, specificity, and precision rates for the validation data set.
b. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
c. Can the naïve Bayes model be used to effectively classify the data? Explain
your answer.
29. FILE OnlineRetail. An online retailer is offering a new line of running shoes. The
retailer plans to send out an e-mail with a discount offer to some of its existing
customers and wants to know if it can use data mining analysis to predict whether
or not a customer might respond to its e-mail offer. The retailer prepares a data
set of 170 existing customers who had received online promotions in the past,
which includes the following variables: Purchase (1 if purchase, 0 otherwise); Age
(1 for 20 years and younger, 2 for 21 to 30 years, 3 for 31 to 40 years, 4 for 41 to
50 years, and 5 for 51 and older); Income (1 for $0 to $50K, 2 for $51K to $80K, 3
for $81K to $100K, 4 for $100K+); and PastPurchase (1 for no past purchase, 2
for 1 or 2 past purchases, 3 for 3 to 6 past purchases, 4 for 7 or more past
purchases). A portion of the data set is shown in the accompanying table.
Purchase Age Income PastPurchase
1 4 3 4
1 4 1 1
⋮ ⋮ ⋮ ⋮
1 3 4 3
a. Partition the data to develop a naïve Bayes classification model. Report the
accuracy, sensitivity, specificity, and precision rates for the validation data set.
b. Generate the decile-wise lift chart. What is the lift value of the leftmost bar?
What does this value imply?
c. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
d. Can the naïve Bayes model be used to effectively classify the data? Explain
your answer.
30. FILE MedSchool. Admission to medical schools in the United States is highly
competitive. The acceptance rate to the top medical schools could be as low as
2% or 3%. With such a low acceptance rate, medical school admissions
consulting has become a growing business in many cities. In order to better serve
his clients, Paul Foster, a medical school admissions consultant, wants to build a
data-driven model to predict whether or not a new applicant is likely to get
accepted into one of the top 10 medical schools. He collected a database of 1,992
past applicants to the top 10 medical schools with the following information: Sex
(F = female, M = male), CollegeParent (1 if parents with college degrees, 0
otherwise), GPA (1 if undergraduate GPA of 3.50 or higher, 0 otherwise), Med (1 if
accepted to one of the top 10 medical schools, 0 otherwise). A portion of the data set is
shown in the accompanying table.
Sex CollegeParent GPA Med
F 1 1 1
M 1 0 1
⋮ ⋮ ⋮ ⋮
M 0 0 0
a. Partition the data to develop a naïve Bayes classification model. Report the
accuracy, sensitivity, specificity, and precision rates for the validation data set.
b. Generate the decile-wise lift chart. What is the lift value of the leftmost bar?
What does this value imply?
c. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
d. Can the naïve Bayes model be used to effectively classify the medical school
applicant data? Explain your answer.
31. FILE CreditCard. A home improvement retail store is offering its customers
store-branded credit cards that come with a deep discount when used to
purchase in-store home improvement products. To maintain the profitability of this
marketing campaign, the store manager would like to make these offers only to
the customers who are likely to carry a high monthly balance on the credit card. A
data set obtained from a nationwide association of home improvement stores
contains records of 500 consumers who carry similar credit cards offered by other
home improvement stores. Relevant variables include Sex (Female or Male),
Education (1 if did not finish college, 2 if undergraduate degree, 3 if graduate
degree), Children (1 if have children, 0 otherwise), Age (1 if below 20 years old, 2
if 20–29 years, 3 if 30–39 years, 4 if 40–49 years, 5 if 50–59 years, 6 if 60 years
and older). Balance is the target variable, where 1 indicates that the customer maintains a high monthly balance and 0 otherwise. A portion of the
CreditCard_Data worksheet is shown in the accompanying table.
a. Partition the data to develop a naïve Bayes classification model. Report the
accuracy, sensitivity, specificity, and precision rates for the validation data set.
b. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
c. Interpret the performance measures and evaluate the
effectiveness of the naïve Bayes model.
page 402
d. Score the new customer records in the CreditCard_Score worksheet. What is
the scoring result of the first customer record?
e. Change the cutoff value to 0.3. Report the accuracy, sensitivity, specificity, and
precision rates for the validation data set.
32. FILE Volunteer. A community center is launching a campaign to recruit local
residents to help maintain a protected nature preserve area that encompasses
extensive walking trails, bird watching blinds, wild flowers, and animals. The
community center wants to send out a mail invitation to selected residents and
invite them to volunteer their time to help but does not have the financial
resources to launch a large mailing campaign. As a result, they solicit help from
the town mayor to analyze a data set of 5,000 local residents and their past
volunteer activity, stored in the Volunteer_Data worksheet. The data include Sex
(F/M), Married (Y = married, N = not married), College (1 if college degree, 0
otherwise), Income (1 if annual income of $50K and above, 0 otherwise), and
Volunteer (1 if participated in volunteer activities, 0 otherwise). They want to use
the analysis results to help select potential residents who are likely to accept the
invitation to volunteer. A portion of the data set from the mayor is shown in the
accompanying table.
a. Partition the data to develop a naïve Bayes classification model. Report the
accuracy, sensitivity, specificity, and precision rates for the validation data set.
b. Generate the ROC curve and decile-wise lift chart. What is the area under the
ROC curve (or AUC value)? What is the lift of the leftmost bar of the decile-wise
lift chart?
c. Score the new volunteer records in the Volunteer_Score worksheet. What is the
scoring result of the first customer record?
33. FILE Vacation. Nora Jackson owns a number of vacation homes on a beach.
She works with a consortium of rental home owners to gather a data set to build a
classification model to predict the likelihood of potential customers renting a
beachfront home during holidays. A portion of the data set is shown in the
accompanying table with the following variables: whether the potential customer
owns a home (Own = 1 if yes, 0 otherwise), whether the customer has children
(Children = 1 if yes, 0 otherwise), the customer’s age in years (Age), annual
income (Income), and whether or not the customer has previously rented a
beachfront house (Rental = 1 if yes, 0 otherwise).
a. Bin the Age and Income variables as follows. For Analytic Solver, choose the
Equal count option and two bins for each of the two variables. For R, bin Age
into [22, 45) and [45, 85) and Income into [0, 85000) and [85000, 300000).
What are the bin numbers for Age and Income of the first two observations?
page 403
b. Partition the transformed data to develop a naïve Bayes
classification model. Report the accuracy, sensitivity, specificity, and precision
rates for the validation data set.
c. Generate the decile-wise lift chart. What is the lift value of the leftmost bar?
d. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
e. Interpret the performance measures and evaluate the effectiveness of the naïve
Bayes model.
34. FILE InGame. A mobile gaming company wants to study the in-game purchasing behavior of a group of its existing customers. A data set, a portion of which is shown
in the accompanying table, is extracted and includes how old the customer is
(Age), Sex (1 if female, 0 otherwise), the amount of weekly play time in hours
(Hours), whether or not the customer’s mobile phone is linked to a Facebook
account (Facebook = 1 if yes, 0 otherwise), and whether or not the customer has
made an in-game purchase (Buy = 1 if yes, 0 otherwise).
a. Bin the Age and Hours variables as follows. For Analytic Solver, choose the
Equal interval option and two bins for each of the two variables. For R, bin Age
into [15, 40) and [40, 65) and Hours into [0, 20) and [20, 40). What are the bin
numbers for Age and Hours for the first two observations?
b. Partition the transformed data to develop a naïve Bayes classification model.
Report the accuracy, sensitivity, specificity, and precision rates for the validation
data set.
c. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
d. Interpret the performance measures and evaluate the effectiveness of the naïve
Bayes model.
35. FILE Grit. Forbes magazine published an article that studied career
accomplishments and factors that might contribute to career success (August 30,
2018). It turns out that career success has less to do with talents and is not
necessarily influenced by test scores or IQ scores. Rather, “grit,” or a combination
of persistence and passion, was found to be a good indicator of a person’s career
success. Tom Weyerhaeuser, an HR manager at an investment bank, is
conducting a campus recruitment and wants to know how he might be able to
measure “grit.” He thought that he could ask each job candidate about his or her
GPA, athletic activities, leadership roles in college, study abroad experience, and
employment during college as a way to gauge whether or not each candidate has
the persistence and passion to succeed in the investment banking industry. He
extracts data from the corporate HR database on 157 current employees with
variables indicating whether or not they currently hold an upper-management
position (Success = 1 if yes, 0 otherwise), had a part-time job in college (Job = 1 if
yes, 0 otherwise), graduated with a GPA higher than 3.5 (GPA = 1 if yes, 0
otherwise), represented their university in athletic activities (Sports = 1 if yes, 0
otherwise), and had a leadership role in college organizations (Leadership = 1 if
yes, 0 otherwise). A portion of the Grit_Data worksheet is shown in the
accompanying table.
a. Partition the data to develop a naïve Bayes classification model. Report the
accuracy, sensitivity, specificity, and precision rates for the validation data set.
b. Generate and display the cumulative lift chart, the decile-wise lift chart, and the
ROC curve.
c. What is the area under the ROC curve (or AUC value)?
d. Interpret the performance measures and evaluate the effectiveness of the naïve
Bayes model.
e. Score the new job applicants in the Grit_Score worksheet. What is the predicted
outcome of the first applicant?
36. FILE Graduation. Predicting whether or not an entering freshman student will
drop out of college has been a challenge for many higher education institutions.
Nelson Touré, a senior student success adviser at an Ivy League university, has
been asked to investigate possible indicators that might allow the university to be
more proactive to provide support for at-risk students. Nelson reviews a data set
of 200 former students and selects the following variables to include in his study:
Graduate (Graduate = 1 if graduated, 0 otherwise); whether or not the student
received a passing grade in his or her first calculus, statistics, or math course
(Math = 1 if yes, 0 otherwise); whether or not the student received a passing
grade in his or her first English or communications course (Language = 1 if yes, 0
otherwise); whether or not the student had any contact with the advising center
during his or her first semester (Advise = 1 if yes, 0 otherwise); and whether or not
the student lived on campus during his or her first year at college (Dorm = 1 if yes,
0 otherwise). A portion of the data is shown in the accompanying table.
a. Partition the data to develop a naïve Bayes classification model. Report the
accuracy, sensitivity, specificity, and precision rates for the validation data set.
b. Generate the cumulative lift chart. Does the entire lift curve lie above the
baseline?
c. Generate the ROC curve. What is the area under the ROC curve (or the AUC
value)?
d. Interpret the results and evaluate the effectiveness of the naïve Bayes model.
37. FILE Fraud. Credit card fraud is becoming a serious problem for the financial
industry and can pose a considerable cost to banks, credit card issuers, and
consumers. Fraud detection using data mining techniques has become an
indispensable tool for banks and credit card companies to combat fraudulent
transactions. A sample credit card data set contains the following variables: Fraud
(1 if fraudulent activities, 0 otherwise), Amount (1 if low, 2 if medium, 3 if high),
Online (1 if online transactions, 0 otherwise), and Prior (1 if products that the card
holder previously purchased, 0 otherwise). A portion of the data set is shown in
the accompanying table.
Fraud Amount Online Prior
0 2 0 1
0 3 0 0
⋮ ⋮ ⋮ ⋮
0 2 0 1
a. Partition the data to develop a naïve Bayes classification model. Report the
accuracy, sensitivity, specificity, and precision rates for the validation data set.
b. Generate the decile-wise lift chart. What is the lift value of the leftmost bar? What does this value imply?
page 404
c. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
d. Interpret the performance measures and evaluate the effectiveness of the naïve
Bayes model.
e. Change the cutoff value to 0.1. Report the accuracy, sensitivity, specificity, and
precision rates for the validation data set.
38. FILE Insurance. Insurance companies use a number of factors to help
determine the premium amount for car insurance coverage. Discounts or a lower
premium may be given based on factors including credit scores, history of at-fault
accidents, age, and sex. Consider the insurance discount data set from 200
existing drivers. The following variables are included in the data set: Discount (1 if
yes, 0 otherwise), Female (1 if female, 0 otherwise), Credit (1 if low, 2 if medium,
3 if high scores), AtFault (1 if history of at-fault accidents, 0 otherwise), and Age
(1 if 25 years and older, 0 otherwise). A portion of the data is shown in the
accompanying table.
a. Partition the data to develop a naïve Bayes classification model. Report the
accuracy, sensitivity, specificity, and precision rates for the validation data set.
b. Display the cumulative lift chart, the decile-wise lift chart, and the ROC curve.
c. What is the area under the ROC curve (or AUC value)?
d. Interpret the performance measures and evaluate the effectiveness of the naïve
Bayes model.
39. FILE Solar. Refer to Exercise 9.16 for the description of a solar panel company
called New Age Solar and the Solar_Data worksheet.
a. Bin the Age and Income variables in the Solar_Data worksheet as follows. For
Analytic Solver, choose the Equal count option and two bins for each of the two
variables. For R, bin Age into [30, 50) and [50, 90) and Income into [30, 85) and
[85, 140). What are the bin numbers for Age and Income of the first two
observations?
b. Partition the transformed data and develop a naïve Bayes classification model.
Report the accuracy, sensitivity, specificity, and precision rates for the validation
data set.
c. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
d. Interpret the results and evaluate the effectiveness of the naïve Bayes model.
40. FILE Depression. Michelle McGrath is a college student working to complete an
undergraduate research project to fulfill her psychology degree requirements. She
is interested in how physical and behavioral factors might be used to predict an
individual’s risk of having depression. After receiving an approval from her
adviser, she sends out a survey to local residents asking for their age (Age, in
years), years of education (Education), the number of hours per month they
engaged in moderate or vigorous physical activities (Hours), and whether or not
they have experienced depression (Depression: Y/N). A portion of the data from
261 respondents is shown in the accompanying table.
Age Education Hours Depression
44 12 20 Y
49 9 30 Y
⋮ ⋮ ⋮ ⋮
69 15 34 Y
a. Bin the Age, Education, and Hours variables as follows. For Analytic Solver,
choose the Equal interval option and three bins for each of the variables. For R,
bin Age into [20, 40), [40,60), and [60, 81), Education into [9, 12), [12, 16), and
[16, 20), and Hours into [0, 55), [55, 105), and [105, 150). What are the bin
numbers for the variables of the first two observations?
b. Partition the transformed data and develop a naïve Bayes classification model.
Report the accuracy, sensitivity, specificity, and precision rates for the validation
data set.
c. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
d. Interpret the results and evaluate the effectiveness of the naïve Bayes model.
9.4 Writing With Big Data
FILE
College_Admission
In this chapter, we discussed two well-known supervised data mining techniques for
classification problems: the KNN method and the naïve Bayes method. In the
introductory case, we used the KNN method to analyze gym membership data that
have numerical predictor variables to predict whether or not a potential member will
purchase a gym membership. The naïve Bayes method, however, requires
categorical predictor variables. In the Big Data case presented next, we will
transform numerical predictor variables into categorical variables and use the naïve
Bayes method to predict the college admissions of high school students. Details on
data wrangling and transformation are discussed in Chapter 2.
page 405
Case Study
Every year, millions of high school students apply and vie for acceptance to a college
of their choice. For many students and their parents, this requires years of
preparation, especially for those wishing to attend a top-ranked college. In high
schools, students usually work with college advisors to research different colleges
and navigate the admissions process.
Elena Sheridan, a college counselor at Beachside High School, is working with 14
students who are interested in applying to the same selective four-year college. She
is asked by her school principal to prepare a report that analyzes the chances of the
14 students getting accepted into one of the three academic programs. In a database
of past college applicants available to counselors at Beachside High, predictor
variables include the student’s high school GPA, SAT score, and the Male, White,
and Asian dummy variables that capture the student’s sex and ethnicity. Elena also
wants to know whether or not the parents’ education can be a predictor of a student’s
college acceptance and plans to include the education level of both parents in her
analysis.
Based on her conversation with college counselors at other high schools, she
believes that high school students with a GPA of 3.5 or above have a much higher
chance of getting accepted into a selective college. She also thinks that SAT scores
of at least 1,200 substantially increase the chance of acceptance. To test these
anecdotal assumptions, Elena wants to convert the GPAs and SAT scores into the
categories corresponding to these thresholds. In addition, the database has a target
variable indicating whether or not the past applicant was accepted to the college.
Develop the naïve Bayes classification model and create a report that presents
an analysis of the factors that may influence whether or not a high school student is
admitted to a selective four-year college. Predictor variables should include the
applicant’s sex, ethnicity, parents’ education levels, GPA, and SAT scores. Transform
the GPAs and SAT scores into appropriate categorical variables. Make predictions
whether or not each of the 14 high school students at Beachside High in the
College_Admission_Score worksheet will be admitted.
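Before developing the model, the GPA and SAT thresholds that Elena describes need to be converted into categorical predictors. The following is a minimal sketch of one way to do this in R, assuming the College_Admission data have been imported into a data frame named myData and that the relevant columns are named GPA and SAT; the new variable names HighGPA and HighSAT are placeholders, and the actual column names should be verified in the data file:
> myData$HighGPA <- as.factor(ifelse(myData$GPA >= 3.5, 1, 0))   # 1 if high school GPA is 3.50 or above
> myData$HighSAT <- as.factor(ifelse(myData$SAT >= 1200, 1, 0))  # 1 if SAT score is 1,200 or above
The same naïve Bayes steps shown earlier in the chapter (partitioning, training with method = "nb", and scoring) can then be applied to the transformed data.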
page 407
Table 9.11 presents the scoring results of the 14 current students. In
general, the only students who are likely to be admitted into any of the three
academic programs are those who maintain a GPA of at least 3.50 and score
1,200 or above on the SAT exam. Out of the 14 students, only six of them
meet both criteria. A high GPA or a high SAT score alone is not likely to result
in an acceptance to the college.
These data-driven results confirm the anecdotal intuition that a GPA of
3.50 and an SAT score of 1,200 are the minimum thresholds that students at
Beachside High need to achieve in order to be admitted into a more selective
college or university. Moreover, the School of Arts and Letters appears to be
more selective than the other two schools. Only four of the six students with a
GPA above 3.50 and an SAT score above 1,200 are likely to get accepted
into this program. With this information, it is advised that some of the 14
students wishing to attend the School of Arts and Letters apply to additional
colleges and universities with a similar degree program as a back-up plan.
Report 9.1 FILE Longitudinal_Survey. Subset the data to include only those
individuals who lived in an urban area. Predict whether or not an individual’s
marriage will end up in a divorce, a separation, or a remarriage using predictor
variables such as sex, parents’ education, height, weight, number of years of
education, self-esteem scale, and whether the person is outgoing as a kid and/or
adult. Note: You may need to remove observations with missing values prior to the
analysis.
Report 9.2 FILE Longitudinal_Survey. Subset the data to include only those
individuals who lived in a non-urban area. Predict whether or not an individual will
have full-time employment (i.e., work for 52 weeks) using predictor variables such as
parents’ education, race, self-esteem score, and whether the person is outgoing as a
kid. Note: You may need to remove observations with missing values prior to the
analysis.
Report 9.3 FILE TechSales_Reps. Subset the data set to include only college-
educated sales professionals in the software product group. Predict whether the
sales professional will receive a high (9 or 10) net promoter score
(NPS) or not using predictor variables such as age, sex, tenure with
page 408
the company, number of professional certificates acquired, annual evaluation score,
and personality type. Note: You may need to perform data transformation in order to
meet the requirements of the analytics technique.
Report 9.4 FILE College_Admissions. Subset the data to include only one of the
colleges. Predict whether an admitted applicant will eventually decide to enroll at the
college using predictor variables such as gender, race, high school GPA, SAT/ACT
score, and parents’ education level.
Report 9.5 FILE Car_Crash. Subset the data to include only the accidents that
occurred in one city or during one month. Predict whether or not a car crash will result
in fatality or severe injuries by using predictor variables such as the weather
condition, amount of daylight, and whether or not the accident takes place on a
highway.
page 409
page 410
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 10.1 Apply classification trees to classify new
records.
LO 10.2 Apply regression trees to predict new
records.
LO 10.3 Apply ensemble tree models to classify
new records.
page 411
Age Sex Income HELOC
30 Female 101000 0
25 Male 86000 0
⋮ ⋮ ⋮ ⋮
47 Male 33000 1
Age Sex Income
25 Female 45000
23 Male 22000
⋮ ⋮ ⋮
51 Male 43000
page 412
10.1 INTRODUCTION TO
CLASSIFICATION AND
REGRESSION TREES (CART)
Decision trees are a popular supervised data mining technique with
a wide range of applications. Their appeal lies in the fact that the
output is displayed as one or more upside-down trees that are easy
to interpret. In general, a decision tree represents a set of If-Then
rules, whereby answers to these statements will eventually lead to a
solution to the application at hand.
Imagine a business decision such as the one in the introductory
case where a bank manager tries to decide which customers are
likely to respond to the bank’s home equity line of credit (HELOC)
offer. Figure 10.1 shows a hypothetical decision tree applied to the
HELOC example. The top node of the decision tree is called the root
node. The root node is the first variable to which a split value is
applied. The left branch from the root node represents the cases
whose variable values are less than (<) the split value, and the right
branch represents the cases whose variable values are greater than
or equal to (≥) the split value. Figure 10.1 shows that the root node is
the Age variable. Customers who are younger than 50 years old are
placed in the left branch, and customers who are at least 50 years
old are placed in the right branch. These branches often lead to
interior nodes where more decision rules are applied.
FIGURE 10.1 A simplified decision tree
PURE SUBSETS
A pure subset is a subset in which all cases have the same value of the target variable. Pure subsets form leaf nodes, and there is no need to split them further.
page 415
Age Income HELOC
44 99000 0
63 158000 1
⋮ ⋮ ⋮
40 97000 0
page 416
EXAMPLE 10.1
What are the possible split points on the Income variable for
the 20 cases in Table 10.2?
The Gini impurity index for a subset is computed as $\text{Gini} = 1 - \sum_{i} p_i^{2}$, where $p_i$ is the proportion of cases in the subset that belong to class $i$.
page 417
EXAMPLE 10.2
Let’s return to the HELOC example with the 20 randomly
selected customers. Figure 10.2 shows a plot of Income
against Age for these 20 cases. The 15 blue dots represent
customers who did not respond to the HELOC offer (Class 0),
whereas the five red dots represent customers who responded
to the HELOC offer (Class 1).
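From these class counts (5 of the 20 cases in Class 1 and 15 in Class 0), the Gini index of the unsplit data, which Examples 10.3 and 10.4 use as a benchmark, works out to

$$\text{Gini} = 1 - \left(\tfrac{5}{20}\right)^{2} - \left(\tfrac{15}{20}\right)^{2} = 1 - 0.0625 - 0.5625 = 0.375.$$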
EXAMPLE 10.3
We continue with the HELOC example as applied to the 20
randomly selected cases from Table 10.2. Earlier, we
calculated the possible split points for the Age variable. One of
the possible split points is when Age = 49. Figure 10.3 shows a
black vertical line at Age = 49. The subset on the left includes
all the cases whose ages are less than 49, and the subset on
the right contains all the cases whose ages are greater than or
equal to 49. Use Figure 10.3 to compute the Gini index given
an Age split of 49.
SOLUTION: For the 16 cases in the left subset (Age < 49), two
belong to Class 1 and 14 belong to Class 0. For the four cases
in the right subset (Age ≥ 49), one belongs to Class 1 and three
belong to Class 0. Therefore, the Gini indexes for the two
subsets are:
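$$\text{Gini}_{\text{left}} = 1 - \left(\tfrac{2}{16}\right)^{2} - \left(\tfrac{14}{16}\right)^{2} = 0.2188 \qquad \text{and} \qquad \text{Gini}_{\text{right}} = 1 - \left(\tfrac{1}{4}\right)^{2} - \left(\tfrac{3}{4}\right)^{2} = 0.3750.$$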
page 418
To compute the overall Gini index for the split, we use the
weighted combination of the Gini indexes using the percentage
of cases in each partition as the weight:
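$$\text{Gini}_{\text{split}} = \tfrac{16}{20}(0.2188) + \tfrac{4}{20}(0.3750) = 0.175 + 0.075 = 0.25.$$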
EXAMPLE 10.4
Recall from Example 10.1 that one of the possible split points
on the Income variable is when Income = 64,500. Figure 10.4
shows a horizontal black line at Income = 64,500. This split
creates top and bottom partitions. Compute the overall Gini
index given an Income split of 64,500.
page 419
It can be shown that the split with Income = 64,500 produces the
smallest Gini index across all possible Income splits. Note that the
splits at Age = 49 and Income = 64,500 (Examples 10.3 and 10.4)
produce a Gini index that is better than the original Gini index
computed in Example 10.2. Next, we will start constructing a
decision tree by considering the Gini index values from the two Age
and Income splits.
page 420
Age Sex Income HELOC
50 Female 35000 0
23 Male 60000 0
56 Male 87000 1
42 Female 90000 0
62 Male 71000 1
Unlike the full-grown tree, the tree obtained from the pruning
process is much simpler. The following three rules can be derived
from the optimized tree:
If Age < 49, then the customer will not respond.
If Age ≥ 49 and Income < $39,500, then the customer will not
respond.
If Age ≥ 49 and Income ≥ $39,500, then the customer will respond.
The optimized tree shows that individuals whose age is 49 or older
and income is $39,500 or higher are likely to respond to a HELOC
offer from the bank. As this example demonstrates, pruning is
effective in determining the optimal complexity of a decision tree and
reducing the chance of overfitting.
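To make the If-Then logic concrete, the three rules can be encoded directly as a small scoring function in R. The sketch below is for illustration only; the function name score_heloc and the example inputs are hypothetical and not part of the original example:
> score_heloc <- function(Age, Income) ifelse(Age < 49, 0, ifelse(Income < 39500, 0, 1))  # returns 1 if the rules predict a response
> score_heloc(c(45, 52, 60), c(80000, 30000, 45000))  # predicts 0, 0, 1 for three hypothetical customers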
page 422
Using Analytic Solver and R to Develop a
Classification Tree
We now demonstrate how to use Analytic Solver and R to build
classification trees, using the complete data set. Consider Example
10.5.
FILE
HELOC
EXAMPLE 10.5
Recall from the introductory case that Hayden Sellar, the bank
manager of Sunnyville Bank, wants to use historical data from
500 bank customers to develop a classification model for
determining whether or not new bank customers will respond to
a HELOC offer. He also thinks that Sunnyville Bank is not
targeting the right customers when marketing its HELOC
products and hopes that the classification model will offer
actionable insights for his company. Hayden plans to assess
the performance of the model and then classify 20 new bank
customers as likely or unlikely to respond to a HELOC offer.
Use Analytic Solver and R to build the preferred classification
tree, and then use this tree to score the new cases.
SOLUTION:
Using Analytic Solver
page 423
The CT_Output worksheet shows a prune log that reports
the error rate depending on the number of decision nodes.
Table 10.4 shows a portion of the prune log listing error rates
against decision nodes. The simplest tree with the minimum
validation error rate of 0.16 occurs at three decision nodes.
Decision Nodes Error Rate
0 0.2267
1 0.2267
2 0.2267
3 0.1600
4 0.2000
5 0.2000
6 0.2000
Decision Node:
Go left if Sex is from set {Female}
Go right if Sex is not from set {Female}
A number of If-Then rules can be derived from the
classification tree. For example, one rule is that if the customer
is a male who is younger than 32.5 years old, then he will not
respond to a HELOC offer. Another rule suggests that if a
customer is a male who is 32.5 years or older with an income
greater than $26,000, then he will respond to a HELOC offer.
Finally, and perhaps more importantly, the classification tree
indicates that female customers are not likely to respond to the
HELOC offer, confirming the bank manager’s intuition that the
current marketing campaign is not targeting all the right
customers.
Because the validation data were used for pruning, the
performance of the classification tree on new data should be
assessed using the test data set. To assess the performance of
the resulting classification tree, we turn to Testing:
Classification Summary in the CT_TestScore worksheet, a
portion of which is shown in Figure 10.10.
page 425
Finally, the ROC curve (see Figure 10.11c) shows that the
classification tree performs much better than the baseline
model in terms of sensitivity and specificity across all cutoff
values. The area under the curve (AUC) is 0.7710, which is
closer to the optimum classifier (AUC = 1) as compared to the
random classifier (AUC = 0.5).
The CT_NewScore worksheet shows the classification
predictions for the 20 records of new customers. The results
are summarized in Table 10.5.
page 426
As shown in the Prediction: HELOC column, the first two
potential customers are classified as not likely to respond to a
HELOC offer, while the last customer is likely to respond. The
PostProb:0 and PostProb:1 columns provide the predicted
probabilities of the customer belonging to Class 0 and Class 1,
respectively.
Using R
The most popular algorithms for building decision trees in R
can be found in the rpart package. As in the case of the KNN
method discussed in Chapter 9, while Analytic Solver uses
three-way partitioning (i.e., train, validation, and test data) for
building decision trees, R implements a k-fold cross-validation
process for pruning decision trees using two-way partitioning
(i.e., only train and validation data).
a. Import the data from the HELOC_Data worksheet of the
HELOC data file into a data frame (table) and label it myData.
The following instructions and results are based on R version
3.5.3. To replicate the results with newer versions of R, enter:
>suppressWarnings(RNGversion("3.5.3"))
b. Install and load the caret, gains, rpart, rpart.plot, and pROC
packages using the following commands. Enter:
>install.packages("caret", dependencies = c("Depends",
"Suggests"))
>install.packages("gains")
>install.packages("rpart")
>install.packages("rpart.plot")
>install.packages("pROC")
>library(caret)
>library(gains)
>library(rpart)
>library(rpart.plot)
>library(pROC)
c. For constructing a classification tree model, R requires that
the target variable, HELOC, be a factor variable, a categorical
data type. We use the as.factor command to convert the
HELOC variable into a categorical type. Enter:
>myData$HELOC <- as.factor(myData$HELOC)
d. We use the set.seed command to set the random seed to 1,
thus generating the same partitions as in this example. We
use the createDataPartition function to partition the data into
training (70%) and validation (30%). Enter:
>set.seed(1)
>myIndex <- createDataPartition(myData$HELOC, p=0.7, list
= FALSE)
>trainSet <- myData[myIndex,]
>validationSet <- myData[-myIndex,]
e. We use the rpart function to generate the default classification
tree, default_tree. Within the rpart function, we specify the
model structure, data source, and method. The method option
is set to “class” for developing a classification tree. To view the
details about the default tree, use the summary function.
Because R uses the cross-validation method for pruning the
tree, to ensure consistency of the cross-validation results, we
use the set.seed function to set a random seed of 1. Enter:
>set.seed(1)
>default_tree <- rpart(HELOC ~ ., data=trainSet,
method="class")
>summary(default_tree)
f. To view the classification tree visually, use the prp function.
The type option is set equal to 1 so that all nodes except the
leaf nodes are labeled in the tree diagram. The extra option is
set equal to 1 so that the number of observations that fall into
each node are displayed. The under option is set equal to
TRUE to put the number of cases under each decision node in
the diagram. Enter:
>prp(default_tree, type=1, extra=1, under = TRUE)
page 427
Figure 10.12 shows the default classification tree.
The first decision node is on the Sex variable, followed by Age
and Income splits. Note that R presents decision trees in a
slightly different format than Analytic Solver does. The root node
provides information about how to interpret the tree. For
example, in Figure 10.12, the root node shows that if Sex is
“Female” then go to the left branch, otherwise go to the right
branch. The subsequent decision nodes follow the same
format. For example, the second decision node suggests that
if Age is less than 25 then go to the left branch, otherwise go
to the right branch.
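The remaining steps, growing a fully grown tree, inspecting the cp table, and pruning, follow the same pattern used in the exercises at the end of this section. A minimal sketch with the rpart package is shown below; the object names full_tree and pruned_tree are placeholders, and the cp value passed to prune should be replaced with the value read from your own cp table:
> set.seed(1)
> full_tree <- rpart(HELOC ~ ., data = trainSet, method = "class", cp = 0, minsplit = 2, minbucket = 1)  # fully grown tree
> printcp(full_tree)  # cp table: cross-validation error (xerror) by number of splits
> pruned_tree <- prune(full_tree, cp = 0.01)  # replace 0.01 with the cp of the chosen tree
> prp(pruned_tree, type = 1, extra = 1, under = TRUE)  # display the pruned tree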
page 434
A closer study of the tree diagram also reveals a rather troubling issue for
Hayden and his team. The decision tree predicts that female customers are not
likely to respond to the HELOC offer. At the same time, Hayden knows that female
customers account for a substantial portion of Sunnyville customers. Could the
alarmingly low percentage of female customers who are interested in HELOC
products be due to the unconscious bias in the bank’s marketing materials? To
answer this question, Hayden and his team decide to review the previous ads of
HELOC products. As they expected, all previous HELOC ads indeed seem to have a
bias toward male customers. This confirms Hayden’s intuition that Sunnyville’s
current marketing strategy is not targeting the right customers. As a result, he has scheduled a meeting with the bank's marketing team to eliminate any unconscious bias in its marketing materials and to focus more on female customers.
EXERCISES 10.2
Mechanics
1. FILE Exercise_10.1. The accompanying data set contains two predictor
variables, age and income, and one binary target variable, newspaper
subscription (subscribe), indicating whether or not the person subscribes to a
newspaper. A media company wants to create a decision tree for predicting
whether or not a person will subscribe to the newspaper.
a. List the possible split values for age in ascending order.
b. List the possible split values for income in ascending order.
2. FILE Exercise_10.2. The accompanying data set contains two predictor
variables, average annual number of sunny days (days) and average annual
precipitation (precipitation), and one numeric target variable, average annual crop
yield in bushels per acre (yield). An agricultural researcher wants to create a
decision tree for predicting the annual crop yield in bushels per acre for various
areas.
a. List the possible split values for days in ascending order.
b. List the possible split values for precipitation in ascending order.
3. After a data set was partitioned, the first partition contains 43 cases that belong to
Class 1 and 12 cases that belong to Class 0, and the second partition contains 24
cases that belong to Class 1 and 121 cases that belong to Class 0.
a. Compute the Gini impurity index for the root node.
b. Compute the Gini impurity index for partition 1.
c. Compute the Gini impurity index for partition 2.
d. Compute the Gini impurity index for the split.
4. A data set was partitioned using the split value of 45.5 for age. The age < 45.5 partition contains 22 patients with a diabetes diagnosis and 178 patients
without a diabetes diagnosis, and the age ≥ 45.5 partition contains 48 patients
page 435
with a diabetes diagnosis and 152 patients without a diabetes
diagnosis.
a. Compute the Gini impurity index for the root node.
b. Compute the Gini impurity index for the age < 45.5 partition.
c. Compute the Gini impurity index for the age ≥ 45.5 partition.
d. Compute the Gini impurity index for the split.
e. State the rules generated from this split.
5. FILE Exercise_10.5. Use the accompanying data set to answer the following
questions.
a. Which split value for age would best separate the newspaper subscribers from
nonsubscribers based on the Gini impurity index?
b. Which split value for income would best separate the newspaper subscribers
from nonsubscribers based on the Gini impurity index?
c. Between the best split values for age and income, which one should be chosen
as the first split for a decision tree that predicts whether someone is a
newspaper subscriber or not?
d. State the rules generated from this split.
6. The classification tree below relates type of wine (A, B, or C) to alcohol content,
flavonoids, malic acid, and magnesium. Classify each of the following wines of
unknown class.
a. Wine with alcohol = 13.3, flavonoids = 1.95, malic acid = 1.03, and magnesium = 114
b. Wine with alcohol = 11.8, flavonoids = 2.44, malic acid = 0.97, and magnesium = 103
c. Wine with alcohol = 12.9, flavonoids = 2.21, malic acid = 1.22, and magnesium = 99
7. FILE Exercise_10.7. The accompanying data set in the Exercise_10.7_Data
worksheet contains four predictor variables (x1 to x4) and one binary target
variable (y). Select the best-pruned tree for scoring and display the full-grown,
best-pruned, and minimum error trees.
a. What is the minimum error in the prune log for the validation data? How many
decision nodes are associated with the minimum error?
b. Display the best-pruned tree. How many leaf nodes are in the best-pruned tree?
How many leaf nodes are in the minimum error tree?
c. What are the predictor variable and split value for the first split of the best-
pruned tree?
d. What are the classification accuracy rate, sensitivity, specificity, and precision of
the best-pruned tree on the test data?
e. Generate the decile-wise lift chart of the best-pruned tree on the test data. What
is the lift value of the leftmost bar of the decile-wise lift chart? What does this
value imply?
f. Generate the ROC curve of the best-pruned tree on the test data. What is the
area under the ROC curve (or AUC value)?
g. Score the new observations in the Exercise_10.7_Score worksheet using the
best-pruned tree. What is the predicted response value of the new
observations? What is the Class 1 probability of the first observation? Round
your answers to four decimal places.
8. FILE Exercise_10.8. The accompanying data set in the Exercise_10.8_Data
worksheet contains four predictor variables (x1 to x4) and one binary target
variable (y). Select the best-pruned tree for scoring and display the full-grown,
best-pruned, and minimum error trees.
a. What is the minimum error in the prune log for the validation data? How many
decision nodes are associated with the minimum error?
b. Display the best-pruned tree. How many leaf nodes are in the best-pruned tree?
How many leaf nodes are in the minimum error tree?
c. What are the predictor variable and split value for the first split of the best-
pruned tree?
d. What are the classification accuracy rate, sensitivity, specificity, and precision of
the best-pruned tree on the test data?
e. Generate the lift chart of the best-pruned tree on the test data. Does the entire
lift curve lie above the baseline?
f. Generate the ROC curve of the best-pruned tree on the test data. What is the
area under the ROC curve (or AUC value)?
g. Score the new observations in the Exercise_10.8_Score worksheet using the
best-pruned tree. What is the predicted response value of the new
observations? What is the Class 1 probability of the first observation? Round
your answers to four decimal places.
9. FILE Exercise_10.9. The accompanying data set contains five predictor
variables (x1 to x5) and one binary target variable (y). Follow the instructions
below to create classification trees using the Exercise_10.9_Data worksheet.
page 436
a. Use the rpart function to build a default classification tree.
Display the default classification tree. How many leaf nodes are in the default
classification tree? What are the predictor variable and split value for the first
split of the default classification tree (root node)?
b. Use the rpart function to build a fully-grown classification tree. Display the cp
table. What is the cp value associated with the lowest cross-validation error?
How many splits are in the minimum error tree?
c. Is there a simpler tree within one standard error of the cross-validation error of
the minimum error tree? If there is, then what is the cp value associated with
the best-pruned tree?
d. Use the prune function to prune the full tree to the minimum error tree. Display
the minimum error tree. How many leaf nodes are in the minimum error tree?
e. Assign Class 1 to be the positive class. What are the accuracy, sensitivity,
specificity, and precision of the minimum error tree on the validation data?
f. Generate the cumulative lift chart and decile-wise lift chart of the minimum error
tree on the validation data. Does the entire lift curve lie above the baseline?
What is the lift value of the leftmost bar of the decile-wise lift chart? What does
this value imply?
g. Generate the ROC curve of the minimum error tree on the validation data. What
is the area under the ROC curve (or AUC value)?
h. Score the new observations in the Exercise_10.9_Score worksheet. What is the
predicted response value of the new observations? What is the Class 1
probability of the first observation? Round your answers to four decimal places.
10. FILE Exercise_10.10. The accompanying data set contains four predictor
variables (x1 to x4) and one binary target variable (y). Follow the instructions
below to create classification trees using the Exercise_10.10_Data worksheet.
a. Use the rpart function to build a default classification tree. Display the default
classification tree. How many leaf nodes are in the default classification tree?
What are the predictor variable and split value for the first split of the default
classification tree (root node)?
b. Use the rpart function to build a fully-grown classification tree. Display the cp
table. What is the cp value associated with the lowest cross-validation error?
How many splits are in the minimum error tree?
c. Is there a simpler tree within one standard error of the cross-validation error of
the minimum error tree? If there is, then what is the cp value associated with
the best-pruned tree?
d. Use the prune function to prune the full tree to the best-pruned tree or minimum
error tree if the answer to part c is “No.” Display the pruned tree. How many leaf
nodes are in the pruned tree?
e. Assign Class 1 to be the positive class. What are the accuracy, sensitivity,
specificity, and precision of the pruned tree on the validation data?
f. Generate the decile-wise lift chart of the pruned tree on the validation data.
What is the lift value of the leftmost bar of the decile-wise lift chart? What does
this value imply?
g. Generate the ROC curve of the pruned tree on the validation data. What is the
area under the ROC curve (or AUC value)?
h. Score the new observations in the Exercise_10.10_Score worksheet using the
pruned tree. What is the predicted response value of the new observations?
What is the Class 1 probability of the first observation? Round your answers to
four decimal places.
Applications
11. FILE Travel_Plan. Jerry Stevenson is the manager of a travel agency. He
wants to build a model that can predict whether or not a customer will travel within
the next year. He has compiled a data set that contains the following variables:
whether the individual has a college degree (College), whether the individual has
credit card debt (CreditCard), annual household spending on food (FoodSpend),
annual income (Income), and whether the customer has plans to travel within the
next year (TravelPlan, 1 = has travel plans, 0 = does not have travel plans). A
portion of the Travel_Plan_Data worksheet is shown in the accompanying table.
Create a classification tree model for predicting whether or not the customer will
travel within the next year (TravelPlan). Select the best-pruned tree for scoring
and display the full-grown, best-pruned, and minimum error trees.
a. How many leaf nodes are in the best-pruned tree and minimum error tree?
What are the predictor variable and split value for the root node of the best-
pruned tree?
b. What are the classification accuracy rate, sensitivity, and specificity of the best-
pruned tree on the test data?
c. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
d. Score the two new customers in the Travel_Plan_Score worksheet using the
best-pruned tree. What is the probability of the first customer having plans to
travel within the next year? What is the probability for the second customer?
page 437
12. FILE Travel_Plan. Refer to the previous exercise for a description of the
problem and data set. Build a default classification tree to
predict whether a customer has plans to travel within the next year. Display the
default classification tree.
a. How many leaf nodes are in the tree? What are the predictor variable and the
split value for the first split of the default classification tree?
b. Build a full-grown tree. Which cp value is associated with the lowest cross-
validation error? How many splits are in the minimum error tree?
c. Is there a simpler tree with a cross-validation error that is within one standard
error of the minimum cross-validation error? If there is, then which cp value is
associated with the best-pruned tree?
d. Prune the full tree to the best-pruned tree or the minimum error tree if the
answer to part c is “No.” Display the tree. How many leaf nodes are in the
pruned tree?
e. Create a confusion matrix and display the various performance measures.
Assign Class 1 to be the positive class. What are the accuracy, sensitivity,
specificity, and precision of the pruned tree on the validation data?
f. Display the cumulative lift chart, the decile-wise lift chart, and the ROC curve of
the minimum error tree on the validation data. Comment on the performance of
the classification tree.
g. Score the two new customers in the Travel_Plan_Score worksheet using the
pruned tree. What is the probability of the first customer having plans to travel
within the next 12 months? What is the probability for the second customer?
Round your answers to four decimal places.
13. FILE Continue_Edu. Build a default classification tree to predict whether a
community member is likely to enroll in summer courses (ContinueEdu). Display
the default classification tree.
a. How many leaf nodes are in the tree? What are the predictor variable and the
split value for the first split of the default classification tree?
b. Build a full-grown tree. Which cp value is associated with the lowest cross-
validation error? How many splits are in the minimum error tree?
c. Is there a simpler tree with a cross-validation error that is within one standard
error of the minimum cross-validation error? If there is, then which cp value is
associated with the best-pruned tree?
d. Prune the full tree to the best-pruned tree or the minimum error tree if the
answer to part c is “No.” Display the tree. How many leaf nodes are in the
pruned tree?
e. Create a confusion matrix and display the various performance measures.
Assign Class 1 to be the positive class. What are the accuracy, sensitivity,
specificity, and precision of the pruned tree on the validation data?
f. Change the cutoff value to 0.1. Report the accuracy, sensitivity, specificity, and
precision rates of the pruned tree on the validation data.
g. Generate the decile-wise lift chart. What is the lift value of the leftmost bar of
the decile-wise lift chart?
h. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
i. Score the two new individuals in the Continue_Edu_Score worksheet using the
pruned tree. What is the probability of the first community member enrolling in
summer courses? What is the probability for the second community member?
Round your answers to four decimal places.
14. FILE Continue_Edu. Refer to the previous exercise for a description of the
problem and data set. Create a classification tree model for predicting whether the
community member is likely to enroll in summer courses (ContinueEdu). Select
the best-pruned tree for scoring and display the full-grown, best-pruned, and
minimum error trees.
a. How many leaf nodes are in the best-pruned tree and minimum error tree?
What are the predictor variable and split value for the root node of the best-
pruned tree?
b. What are the accuracy rate, sensitivity, and specificity of the best-pruned tree
on the test data?
c. Display the cumulative lift chart, the decile-wise lift chart, and the ROC curve.
Comment on the performance of the classification model.
d. Score the two new cases in the Continue_Edu_Score worksheet using the best-
pruned tree. What is the probability of the first community member enrolling in
summer classes? What is the probability for the second community member?
15. FILE Church. The data set in the Church_Data worksheet is used to predict
whether an individual is likely to attend church. Create a classification tree model
for predicting church attendance. Select the best-pruned tree for scoring and
display the full-grown, best-pruned, and minimum error trees.
a. How many leaf nodes are in the best-pruned tree and minimum error tree?
What are the rules that can be derived from the best-pruned tree?
b. What are the accuracy rate, sensitivity, specificity, and precision of the best-
pruned tree on the test data?
c. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
d. Score the cases in the Church_Score worksheet using the best-pruned tree.
What percentage of the individuals in the score data set are likely to go to
church based on a cutoff probability value of 0.5?
16. FILE Church. Refer to the previous exercise for a description of the problem
and data set. Build a default classification tree to predict whether an individual is
likely to attend church. Display the default classification tree.
a. How many leaf nodes are in the tree? What are the predictor variable and the
split value for the first split of the default classification tree? What are the rules
that can be derived from the root node?
b. Build a full-grown tree. Which cp value is associated with the lowest cross-
validation error?
c. Is there a simpler tree with a cross-validation error that is within one standard
error of the minimum cross-validation error? If there is, then which cp value is
associated with the best-pruned tree? How many splits are in
the best-pruned tree?
d. Prune the full tree to the best-pruned tree or the minimum error tree if the
answer to part c is “No.” Create a confusion matrix and display the various
performance measures. Assign Class 1 to be the positive class. What are the
accuracy, sensitivity, specificity, and precision of the pruned tree on the
validation data?
e. Display the cumulative lift chart, the decile-wise lift chart, and the ROC curve of
the minimum error tree on the validation data. Comment on the performance of
the classification tree.
f. Score the cases in the Church_Score worksheet using the best-pruned tree.
What percentage of the individuals in the score data set are likely to go to
church based on a cutoff probability value of 0.5?
17. FILE Mobile_Banking. A bank wants to identify which of its customers
may be interested in its new mobile banking app. The worksheet called
Mobile_Banking_Data contains 500 customer records collected from a previous
marketing campaign for the bank’s mobile banking app. Each observation in the
data set contains the customer’s age (Age), sex (Male/Female), education level
(Edu, ranging from one to three), income (Income in $1,000s), whether the
customer has a certificate of deposit account (CD), and whether the customer
downloaded the mobile banking app (App equals 1 if downloaded, 0 otherwise). A
portion of the data set is shown in the accompanying table. Create a classification
tree model for predicting whether a customer will download the mobile banking
app. Assign 1 as the success class because we are more interested in identifying
customers who download the app. Select the best-pruned tree for scoring and
display the full-grown, best-pruned, and minimum error trees.
page 440
a. How many leaf nodes are in the best-pruned tree and minimum error tree?
What are the predictor variable and split value for the root node of the best-
pruned tree?
b. What are the accuracy rate, sensitivity, specificity, and precision of the best-
pruned tree on the test data?
c. Generate the decile-wise lift chart. What is the lift value of the leftmost bar of
the decile-wise lift chart? What does it imply?
d. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
e. Score the 20 customers in the Mobile_Banking_Score worksheet using the
best-pruned tree. How many of the 20 new customers will likely download the
mobile banking app based on your classification model? What is the probability
of the first new customer downloading the app?
18. FILE Mobile_Banking. Refer to the previous exercise for a description of
the problem and data set. Build a default classification tree to predict whether a
customer will download the mobile banking app. Display the default classification
tree.
a. How many leaf nodes are in the tree? What are the predictor variable and split
value for the first split of the default classification tree? State the rule that can
be derived from the first leaf node from the top of the tree diagram.
b. Build a full-grown tree. Which cp value is associated with the lowest cross-
validation error?
c. Is there a simpler tree with a cross-validation error that is within one standard
error of the minimum cross-validation error? If there is, then which cp value is
associated with the best-pruned tree? How many splits are in the best-pruned
tree?
d. Prune the full tree to the best-pruned tree or the minimum error tree if the
answer to part c is “No.” Display the tree. Create a confusion matrix and display
the various performance measures. Assign Class 1 to be the positive class.
What are the accuracy, sensitivity, specificity, and precision of the pruned tree
on the validation data?
e. Generate the decile-wise lift chart. What is the lift value of the leftmost bar of
the decile-wise lift chart?
f. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
g. Score the 20 new customers in the Mobile_Banking_Score worksheet using the
pruned tree. How many customers will likely download the mobile banking app
based on your classification model? What is the probability of the first customer
downloading the mobile banking app? Round your answer to four decimal places.
19. FILE Graduate. Dereck works at a university. The university has set a goal
to increase the number of students who
graduate within four years by 20% in five years. Dereck is asked by his boss to
create a model that would flag any student who has a high likelihood of not being
able to graduate within four years. He has compiled a data set of 2,000 previous
students of the university that contains the following variables: sex (M/F), whether
the student is Caucasian (White), high school GPA (HS GPA), SAT score (SAT),
College GPA (GPA), whether the student’s parents are college educated (College
Parent), and whether the student graduated within four years (Grad). A portion of
the Graduate_Data worksheet is shown in the accompanying table. Build a default
classification tree to predict whether the student will be able to graduate within
four years (Grad). Display the classification tree.
a. What are the predictor variable and the split value for the first split of the default
classification tree? State the rules that can be derived from the root node.
b. Build a full-grown tree. Which cp value is associated with the lowest cross-
validation error?
c. Is there a simpler tree with a cross-validation error that is within one standard
error of the minimum cross-validation error? If there is, then which cp value is
associated with the best-pruned tree?
d. Prune the full tree to the best-pruned tree or the minimum error tree if the
answer to part c is “No.” Display the tree. Is the pruned tree the same tree as
the default tree created in part a?
e. Create a confusion matrix and display the various performance measures.
Assign Class 1 to be the positive class. What are the accuracy, sensitivity,
specificity, and precision of the pruned tree on the validation data?
f. Generate the cumulative lift chart. Does the lift curve lie above the baseline?
What does this mean?
g. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)? What does the AUC value imply?
h. Score the three university students in the Graduate_Score worksheet using the
pruned tree. How many of these three students will be able to graduate within
four years according to your model?
20. FILE Graduate. Refer to the previous exercise for a description of the
problem and data set. Create a classification tree model for predicting whether the
student will be able to graduate within four years (Grad). Assign 0 as the success
class because we are more interested in identifying students who are at risk of not
being able to graduate within four years. Select the best-pruned tree for scoring
and display the full-grown, best-pruned, and minimum error trees.
a. How many leaf nodes are in the best-pruned tree and minimum error tree?
What are the rules that can be derived from the best-pruned tree?
b. What are the accuracy rate, sensitivity, specificity, and precision of the best-
pruned tree on the test data?
c. Generate the cumulative lift chart. Does the lift curve lie above the baseline?
What does this mean?
d. Generate the ROC curve. What is the area under the ROC curve (or AUC
value)?
e. Score the three university students in the Graduate_Score worksheet using the
best-pruned tree. Will these three students be able to graduate within four years
according to your model?
21. FILE In_App_Pur. The manager of an online gaming company wants to
be able to predict which gamers are likely to make in-app
purchases. Ranon Weatherby, the company’s data analyst, has compiled a data
set about customers that contains the following variables: customer age (Age),
sex (1 if male, 0 otherwise), household income (Income in $1,000s), the number
of years playing online games (Years), the number of hours playing online games
per week (Hours), whether the customer has a credit card (CreditCard), whether
the customer has a Facebook profile (Facebook), and whether the customer has
made in-app purchases before (Buy). A portion of the In_App_Pur_Data
worksheet is shown in the accompanying table. Create a classification tree model
for predicting whether the customer will make in-app purchases (Buy). Select the
best-pruned tree for scoring and display the full-grown, best-pruned, and
minimum error trees.
a. How many leaf nodes are in the best-pruned tree and minimum error tree?
What are the predictor variable and split value for the root node of the best-
pruned tree?
b. Describe the rules produced by the best-pruned tree.
c. What are the accuracy rate, sensitivity, specificity, and precision of the best-
pruned tree on the test data?
d. Display the cumulative lift chart, the decile-wise lift chart, and the ROC curve.
Does the classification model outperform the baseline model? What is the area
under the ROC curve (or AUC value)?
e. Score the two new customers in the In_App_Pur_Score worksheet using the
best-pruned tree. Will the first new customer make in-app purchases according
to your model? What about the second new customer? What are the
probabilities of the two new customers making in-app purchases?
22. FILE In_App_Pur. Refer to the previous exercise for a description of the
problem and data set. Build a default classification tree to predict whether the
gamer will make in-app purchases. Display the classification tree.
a. What are the predictor variable and the split value for the first split of the default
classification tree?
b. Build a full-grown tree. Which cp value is associated with the lowest cross-
validation error?
c. Is there a simpler tree with a cross-validation error that is within one standard
error of the minimum cross-validation error? If there is, then which cp value is
associated with the best-pruned tree?
d. Prune the full tree to the best-pruned tree or the minimum error tree if the
answer to part c is “No.” Create a confusion matrix and display the various
performance measures. Assign Class 1 to be the positive class. What are the
accuracy, sensitivity, and specificity of the minimum error tree on the validation
data?
e. Display the cumulative lift chart, the decile-wise lift chart, and the ROC curve of
the minimum error tree on the validation data. Comment on the performance of
the classification tree.
f. Score the two gamers in the In_App_Pur_Score worksheet using the pruned
tree. What is the probability of the first gamer making in-app purchases
according to your classification model? What is the probability for the second
gamer?
page 441
Age Sex Income Balance
35 Female 65000
56 Male 160000
⋮ ⋮ ⋮ ⋮
52 Female 155000
EXAMPLE 10.6
For illustration, we use only a small data set of 20 randomly
selected bank customers from the Balance_Data worksheet.
Table 10.8 shows a portion of the data, which is stored in the
Balance_Data_20 worksheet. Select the optimal split for the
Age variable based on the MSE impurity measure.
Age Balance
43 1775
34 3675
⋮ ⋮
26 10761
page 442
page 443
The MSE value of the Age = 30 split is slightly smaller than that
of the Age = 32 split, suggesting that the Age = 30 split
generates a lower level of impurity; therefore, Age = 30 is a
better split for constructing the regression tree. In fact, if we
compute the MSE values for all the possible splits for Age in
our example, we would find that the split at Age = 30 produces
the smallest MSE. Therefore, the Age = 30 split is chosen as
the best split for Age. Figure 10.20 shows the regression tree
when Age is the only predictor variable used in the analysis.
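Readers who want to verify such computations directly can do so with a few lines
of R. The following is a minimal sketch, not part of the worked solution: it
evaluates the Age = 30 split on a handful of illustrative records. Only the first
three Age/Balance pairs come from Table 10.8; the remaining values and the
helper function mse are assumptions used for demonstration only.
># illustrative records; first three pairs are from Table 10.8, the rest are assumed
>age <- c(43, 34, 26, 51, 29)
>balance <- c(1775, 3675, 10761, 5200, 8900)
># within-partition MSE: average squared deviation from the partition mean
>mse <- function(y) mean((y - mean(y))^2)
># weighted MSE of the Age = 30 split (Age < 30 versus Age >= 30)
>left <- balance[age < 30]
>right <- balance[age >= 30]
>(length(left)*mse(left) + length(right)*mse(right))/length(balance)
Replacing 30 with 32 in the last three commands gives the MSE of the Age = 32
split, so the two candidate splits can be compared in the same way as in the
example.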
EXAMPLE 10.7
Hayden wants to use historical data from 500 bank customers
to develop a prediction model for predicting a customer’s
account balance. He also plans to assess the performance of
the regression tree and then score 20 new customers. Use
Analytic Solver and R to build the preferred prediction tree, and
then use this tree to score the new cases.
SOLUTION:
Using Analytic Solver
a. Open the Balance_Data worksheet of the Balance data file.
b. Choose Data Mining > Partition (under the Data Mining
group) > Standard Partition.
c. In the Standard Data Partition dialog box, we select the data
range $A$1:$D$501 and move Age, Sex, Income, and
Balance to the Selected Variables box. Select Pick up rows
randomly and check the Set seed checkbox to use the default
random seed of 12345. Select Specify percentages to allocate
50%, 30%, and 20% of the data to training, validation, and test
sets, respectively. Click OK.
d. Analytic Solver will create a new worksheet called
STDPartition with the partitioned data. With the STDPartition
worksheet active, choose Data Mining > Predict >
Regression Tree to open the Regression Tree dialog box. In
the Data tab, move the predictor variables Age and Income to the Selected
Variables box and
Sex to the Categorical Variables box. Select and move the
target variable, Balance, to the Output Variable box. Accept
other defaults. Click Next.
e. In the Parameters tab, check the Prune (Using Validation Set)
checkbox and click on the Tree for Scoring button. Select Best
Pruned, then click Done. This selection tells Analytic Solver to
score cases using the best-pruned tree. Click on the Trees to
Display button. Check the checkboxes next to Fully Grown,
Best Pruned, and Minimum Error, then click Done. Accept
other defaults. Click Next.
f. In the Scoring tab, check the Summary Report boxes for
Score Training Data, Score Validation Data, and Score Test
Data. Check the In Worksheet check box under the Score
New Data section. Under the New Data (WS) tab, select
Balance_Score in the Worksheet box. These 20 records of
new bank customers will be scored by the regression tree.
Make sure the First Row Contains Headers box is checked
and click the Match by Name button. Accept other defaults.
Click Finish.
Analytic Solver creates a number of worksheets. Table 10.9
shows a portion of the prune log, which is found in the
RT_Output worksheet. The prune log shows that the simplest
tree with the minimum validation MSE of 34,625,847.56 occurs
at two decision nodes. Because no smaller tree has a
validation MSE within one standard error of this value, the
minimum error tree and best-pruned tree coincide in this
example.
0 10010714.87 47757025.42
1 8713455.60 39391178.20
2 7016220.13 34625847.56
3 7091273.75 35733534.52
4 6440499.50 35411119.80
5 5154901.31 35536360.84
6 5764821.26 36632166.92
The RT_FullTree worksheet shows the full tree. The full tree is
quite complex with many decision nodes. Due to its size, the
full tree is not displayed here. The RT_BestTree and
RT_MinErrorTree worksheets show the best-pruned and
minimum error trees, respectively. As mentioned earlier, the
two trees are identical in this example. Figure 10.21 displays
the tree. It is relatively simple with two decision nodes and five
total nodes if the terminal nodes are included. If-Then rules can be derived from
the two splits shown in Figure 10.21. The accompanying performance measures
include RMSE = 6,151.07 and MAD = 4,326.83.
page 446
Using R
As mentioned earlier in the construction of classification trees,
R’s built-in cross-validation process means that there is no
need to separate the data into three partitions as we did when
using Analytic Solver. Here, we separate the data into two
partitions: training and validation data sets.
a. Import the data from the Balance_Data worksheet of the
Balance data file into a data frame (table) and label it myData.
The following instructions and results are based on R version
3.5.3. To replicate the results with newer versions of R, enter:
>suppressWarnings(RNGversion("3.5.3"))
b. Install and load the caret, rpart, rpart.plot, and forecast
packages using the following command if you have not
already done so. Enter:
>install.packages("caret", dependencies = c("Depends",
"Suggests"))
>install.packages("rpart")
>install.packages("rpart.plot")
>install.packages("forecast")
>library(caret)
>library(rpart)
>library(rpart.plot)
>library(forecast)
c. By setting the random seed to 1, we will generate the same
partitions as shown in this example. As the construction of a
decision tree is a data-driven process and can benefit from a
large training set, we use the createDataPartition function to
randomly allocate 70% of the data into the training data set
and 30% into the validation data set. Enter:
>set.seed(1)
>myIndex <- createDataPartition(myData$Balance, p=0.7, list
= FALSE)
>trainSet <- myData[myIndex,]
>validationSet <- myData[-myIndex,]
d. We use the rpart function to generate the default regression
tree labeled default_tree. Within the rpart function, we specify
the model structure, data source, and method. The method =
“anova” option tells the function to build a regression tree to
estimate a numerical target value. To view the details of the
default tree, use the summary function. To ensure
consistency of the cross-validation results, we use the
set.seed function to fix the random seed to 1. Enter:
>set.seed(1)
>default_tree <- rpart(Balance ~ ., data=trainSet,
method="anova")
>summary(default_tree)
e. To view the regression tree visually, we use the prp function.
The type option is set equal to 1 so that all nodes except the
leaf nodes are labeled in the tree diagram. The extra option is
set equal to 1 so that the number of observations that fall into
each node is displayed. The under option is set equal to
TRUE in order to put the number of cases under each
decision node in the diagram. Enter:
>prp(default_tree, type=1, extra=1, under = TRUE)
Figure 10.22 shows the default regression tree for predicting
average balance.
page 447
As discussed earlier, to find the optimal decision
tree, a common practice is to grow the full tree and then prune
it to a less-complex tree based on the prediction errors
produced by the cross-validation process of the rpart function.
By identifying the value of the complexity parameter (cp)
associated with the smallest cross-validated prediction error,
we can create the minimum error tree. Next, we demonstrate
the pruning process to optimize the complexity of the tree.
f. We first grow the full tree by using the rpart function. We set
the options cp equal to 0, minsplit equal to 2, and minbucket
equal to 1. As discussed in the classification tree section,
these settings ensure that the largest possible tree will be
produced. We plot the full tree using the prp function. Again,
to ensure consistency of the cross-validation results, we set
the random seed to 1. Enter:
>set.seed(1)
>full_tree <- rpart(Balance ~ ., data= trainSet,
method="anova", cp=0, minsplit=2, minbucket=1)
>prp(full_tree, type=1, extra=1, under = TRUE)
The full tree in this case is very complex so it is not displayed
here.
page 448
g. To identify the value of cp that is associated with
the smallest cross-validated prediction error, we
use the printcp function. Enter:
>printcp(full_tree)
Due to the complexity of the full tree, 273 subtree options are
displayed in the cp table. Figure 10.23 displays a portion of
the complexity parameter table showing the first eight
candidate trees with increasing complexity. The nsplit column
shows the number of splits for each tree. The rel error column
shows the prediction error for each tree, relative to the
prediction error of the root node if all cases are given the
predicted balance that equals the average of all balances. The
prediction performance of the trees needs to be evaluated by
inspecting the cross-validation errors associated with each
tree; see the xerror column. The third tree, with two splits, has
the lowest cross-validation error (xerror = 0.72172); therefore,
it is the minimum error tree. The xstd column (standard error)
can be used to identify the best-pruned tree, which is the
smallest tree whose cross-validation error falls within one
standard error of the minimum error tree (0.72172 + 0.10521 =
0.82693). In this case, the second tree, with just one split, has
a cross-validation error of 0.78354, which is within the range;
hence, the best-pruned tree is the second tree. Next, we use
the value of cp associated with the second tree to produce our
best-pruned tree.
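The commands for this final pruning step are not reproduced on this page. A
minimal sketch is shown below; it assumes that the cp value of the second
candidate tree is read directly from the cp table (row 2 of full_tree$cptable) and
that the pruned tree is then evaluated on the validation data with the accuracy
function of the forecast package, consistent with the error measures requested in
the exercises.
>pruned_tree <- prune(full_tree, cp = full_tree$cptable[2, "CP"])
>prp(pruned_tree, type=1, extra=1, under = TRUE)
># evaluate the pruned regression tree on the validation data
>predicted_value <- predict(pruned_tree, validationSet)
>accuracy(predicted_value, validationSet$Balance)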
page 450
EXERCISES 10.3
Mechanics
23. FILE Exercise_10.23. The accompanying data set contains two predictor
variables, x1 and x2, and one numerical target variable, y. A regression tree will
be constructed using the data set.
a. List the possible split values for x1 in ascending order.
b. List the possible split values for x2 in ascending order.
c. Compute the MSE of the partition x1 = 131.
d. Compute the MSE of the partition x2 = 105.
e. Compare and interpret the results from parts c and d.
24. FILE Exercise_10.24. The accompanying data set contains three predictor
variables, x1, x2, and x3, and one numerical target variable, y. A regression tree
will be constructed using the data set.
a. List the possible split values for x1 in ascending order.
b. List the possible split values for x2 in ascending order.
c. List the possible split values for x3 in ascending order.
d. Compute the MSE of the partition x1 = 252.
e. Compute the MSE of the partition x2 = 92.5.
f. Compute the MSE of the partition x3 = 14.25.
g. Compare and interpret the results from parts d, e, and f.
25. FILE Exercise_10.25. The accompanying data set contains two predictor
variables, x1 and x2, and one numerical target variable, y. A regression tree will
be constructed using the data set.
a. Which split on x1 will generate the smallest MSE?
b. Which split on x2 will generate the smallest MSE?
c. Which variable and split value should be used to create the root node if a
regression tree is constructed using the accompanying data set?
d. State the rules generated from this split.
26. FILE Exercise_10.26. The accompanying data set contains three predictor
variables, x1, x2, and x3, and one numerical target variable, y. A regression tree
will be constructed using the data set.
a. Which split on x1 will generate the smallest MSE?
b. Which split on x2 will generate the smallest MSE?
c. Which split on x3 will generate the smallest MSE?
d. Which variable and split value should be used to create the root node if a
regression tree is constructed using the accompanying data set?
e. State the rules generated from this split.
27. The regression tree below relates credit score to number of defaults (NUM DEF),
revolving balance (REV BAL), and years of credit history (YRS HIST). Predict the
credit score of each of the following individuals.
page 451
29. FILE Exercise_10.29. Create a regression tree using the
accompanying data set (predictor variables: x1 to x4; target: y). Select the best-
pruned tree for scoring and display the full-grown, best-pruned and minimum error
trees.
a. What is the minimum validation MSE in the prune log? How many decision
nodes are associated with the minimum error?
b. How many leaf nodes are in the best-pruned and minimum error trees?
c. Display the best-pruned tree. What are the predictor variable and split value for
the first split (root node) of the best-pruned tree? What are the rules that can be
derived from the root node?
d. What are the RMSE and MAD of the best-pruned tree on the test data?
e. According to the best-pruned tree, what is the predicted y for a new observation
with the following values: x1 = 20; x2 = 40; x3 = 36; x4 = 8.3?
31. FILE Exercise_10.31. Create a regression tree using the accompanying
data set in the Exercise_10.31 worksheet (predictor variables: x1 to x4; target: y).
a. Use the rpart function to build a default regression tree. Display the default
regression tree. How many leaf nodes are in the default regression tree? What
are the predictor variable and split value for the first split of the default
regression tree?
b. Use the rpart function to build a fully-grown regression tree. What is the cp
value that is associated with the lowest cross-validation error? How many splits
are in the minimum error tree?
c. Is there a simpler tree with a cross-validation error that is within one standard
error of the minimum cross-validation error? What is the cp value associated
with the best-pruned tree? How many splits are in the best-pruned tree?
d. Prune the full tree to the best-pruned tree or minimum error tree if the answer to
part c is “No.” Display the pruned tree. What are the ME, RMSE, MAE, MPE,
and MAPE measures of the pruned tree on the validation data?
e. Comment on the performance of the pruned regression tree.
data set (predictor variables: x1 to x4; target: y). Select the best-pruned tree for
scoring and display the full-grown, best-pruned, and minimum error trees.
a. What is the minimum validation MSE in the prune log? How many decision
nodes are associated with the minimum error?
b. Display the best-pruned tree. How many leaf nodes are in the best-pruned and
minimum error trees?
c. What are the predictor variable and split value for the first split (root node) of the
best-pruned tree? What are the rules that can be derived from the root node?
d. What are the RMSE and MAD of the best-pruned tree?
page 452
Applications
34. FILE Travel. Jerry Stevenson is the manager of a travel agency. He wants to
build a model that can predict customers’ annual spending on travel products. He
has compiled a data set that contains the following variables: whether the
individual has a college degree (College), whether the individual has credit card
debt (CreditCard), annual household spending on food (FoodSpend), annual
income (Income), and annual household spending on travel products
(TravelSpend). A portion of the Travel_Data worksheet is shown in the
accompanying table. Create a regression tree model for predicting the customer’s
annual household spending on travel products (TravelSpend). Select the best-
pruned tree for scoring and display the full-grown, best-pruned, and minimum
error trees.
a. How many leaf nodes are in the best-pruned tree and the minimum error tree?
What are the rules that can be derived from the root node of the best-pruned
tree?
b. What are the RMSE and MAD of the best-pruned tree on the test data?
c. Score the two new customers in the Travel_Score worksheet using the best-
pruned tree. What are their predicted annual spending amounts on travel
products?
35. FILE Travel. Refer to the previous exercise for a description of the data set.
36. FILE Houses. Melissa Hill is a real estate agent in Berkeley, California. She
wants to build a predictive model that can help her price a house more accurately.
Melissa has compiled a data set in the House_Data worksheet that contains the
information about the houses sold in the past year. The data set contains the
following variables: number of bedrooms (BM), number of bathrooms (Bath),
square footage of the property (SQFT), lot size (Lot_Size), type of property
(Type), age of the property (Age), and price sold (Price). A portion of the data set
is shown in the accompanying table. Build a default regression tree to predict
house prices (Price). Display the regression tree.
a. What are the predictor variable and split value for the first split of the default
regression tree? What are the rules that can be derived from the root node?
b. Build a full-grown tree. Which cp value is associated with the lowest cross-
validation error? How many splits are in the minimum-error tree?
c. Is there a simpler tree with a cross-validation error that is within one standard
error of the minimum error? If there is, then which cp value is associated with
the best-pruned tree?
d. Prune the full tree to the best-pruned tree or the minimum error tree if the
answer to part c is “No.” Display the tree. How many leaf nodes are in the
pruned tree?
e. What are the ME, RMSE, MAE, MPE, and MAPE of the pruned tree on the
validation data? On average, does the regression tree over- or under-predict
prices of houses? Is the regression tree model effective in predicting prices of
houses?
f. Score the two new houses on the market in the Houses_Score worksheet using
the pruned tree. What are their predicted prices?
page 453
37. FILE Houses. Refer to the previous exercise for a description of the data
set. Create a regression tree model for predicting house prices (Price). Select the
best-pruned tree for scoring and display the full-grown, best-pruned, and
minimum error trees.
a. Display the best-pruned tree. How many leaf nodes are in the best-pruned tree?
What are the predictor variable and split value of the root node of the best-
pruned tree?
b. What are the RMSE and MAD of the best-pruned tree on the test data? On
average, does the regression tree over- or under-predict prices of houses? Is
the regression tree model effective in predicting prices of houses?
c. Score the two new houses on the market in the Houses_Score worksheet using
the best-pruned tree. What are their predicted prices according to your model?
38. FILE E_Retailer. An online retailer wants to predict its customers’
spending in the first three months of the year. Brian Duffy, the marketing analyst of
the company, has compiled a data set on 200 existing customers that includes
sex (Female: 1 = Female, 0 otherwise), annual income in 1,000s (Income), age
(Age, in years), and total spending in the first three months of the year
(Spending). A portion of the E-Retailer_Data worksheet is shown in the
accompanying table. Create a regression tree model for predicting customer
spending during the first three months of the year (Spending). Select the best-
pruned tree for scoring and display the full-grown, best-pruned, and minimum
error trees.
Female Income Age Spending
0 87.5 52 156.88
1 66.5 43 275.16
⋮ ⋮ ⋮ ⋮
0 51.9 61 159.51
a. Display the best-pruned tree. How many leaf nodes are in the best-pruned tree?
b. What are the RMSE and MAD of the best-pruned tree on the test data?
c. Score the 10 new customers in the E-Retailer_Score worksheet using the best-
pruned tree. What are the mean and median values of the predicted spending
amounts according to your model?
39. FILE E_Retailer. Refer to the previous exercise for a description of the data
set. Build a default regression tree to predict the customer’s spending during the
first three months of the year (Spending). Display the regression tree.
a. What are the rules that can be derived from the default regression tree?
b. Build a full-grown tree. Which cp value is associated with the lowest cross-
validation error? How many leaf nodes are in the minimum-error tree?
c. Is there a simpler tree with a cross-validation error that is within one standard
error of the minimum error? If there is, then which cp value is associated with
the best-pruned tree?
d. Prune the full tree to the best-pruned tree or the minimum error tree if the
answer to part c is “No.” Display the tree. How many leaf nodes are in the
pruned tree?
e. What are the ME, RMSE, MAE, MPE, and MAPE of the pruned tree on the
validation data?
f. Score the 10 new customers in the E-Retailer_Score worksheet using the
pruned tree. What are the mean and median values of the predicted spending
amounts during the first three months?
40. FILE Electricity. Kyle Robson, an energy researcher for the U.S. Energy
Information Administration, has compiled state-level data on electricity price
(Price), electricity generation (Generation), and income (Income) for predicting
per capita electricity retail sales (Sales). Build a default regression tree to predict
per capita electricity retail sales (Sales). Display the regression tree.
a. What are the predictor variable and split value for the first split of the default
regression tree? What are the rules that can be derived from the default
regression tree?
b. Build a full-grown tree. Which cp value is associated with the lowest cross-
validation error? How many leaf nodes are in the minimum-error tree?
c. Is there a simpler tree with a cross-validation error that is within one standard
error of the minimum error? If there is, then which cp value is associated with
the best-pruned tree?
d. Prune the full tree to the best-pruned tree or the minimum error tree if the
answer to part c is “No.” Display the tree. How many leaf nodes are in the
pruned tree?
e. What are the ME, RMSE, MAE, MPE, and MAPE of the pruned tree on the
validation data?
f. What is the predicted per capita electricity retail sales for a state with the
following values: Price = 11, Generation = 25, and Income = 65,000?
41. FILE Electricity. Refer to the previous exercise for a description of the data
set. Create a regression tree model for predicting per capita electricity retail sales
(Sales). Select the best-pruned tree for scoring and display the full-grown, best-
pruned, and minimum error trees.
a. How many leaf nodes are in the best-pruned tree and minimum error tree?
b. What are the predictor variable and split value for the first split of the best-
pruned tree? What are the rules that can be derived from the root node?
c. What are the RMSE and MAD of the best-pruned tree on the test data?
d. What is the predicted per capita electricity retail sales for a state with the
following values: Price = 11, Generation = 25, and Income = 65,000?
page 454
42. FILE NBA. Merrick Stevens is a sports analyst working for ACE Sports
Management. He wants to build a model for predicting an NBA player’s salary
(salary). Create a regression tree model for predicting salary. Select the best-
pruned tree for scoring and display the full-grown, best-pruned, and minimum
error trees.
a. How many leaf nodes are in the best-pruned tree and minimum error tree?
b. Display the best-pruned tree. What are the predictor variable and split value for
the first split of the best-pruned tree?
c. What are the RMSE and MAD of the best-pruned tree on the test data?
d. Score the three NBA players Merrick is trying to sign as ACE Sports
Management clients in the NBA_Score worksheet using the pruned tree. What
is the average predicted salary of the three players?
43. FILE NBA. Refer to the previous exercise for a description of the data set.
Build a default regression tree to predict an NBA player’s salary (salary). Display
the regression tree.
a. What are the predictor variable and split value for the first split of the default
regression tree?
b. Build a full-grown tree. Which cp value is associated with the lowest cross-
validation error? How many leaf nodes are in the minimum-error tree?
c. Is there a simpler tree with a cross-validation error that is within one standard
error of the minimum error? If there is, then which cp value is associated with
the best-pruned tree?
d. Prune the full tree to the best-pruned tree or the minimum error tree if the
answer to part c is “No.” Display the tree. What are the rules that can be derived
from the pruned tree?
e. What are the ME, RMSE, MAE, MPE, and MAPE of the pruned tree on the
validation data?
f. Score the three NBA players Merrick is trying to sign as ACE Sports
Management clients in the NBA_Score worksheet using the pruned tree. What
is the average predicted salary of the three players?
page 456
EXAMPLE 10.8
Use Analytic Solver and R to develop ensemble classification
tree models using the HELOC data set used in Example 10.5.
Compare the model performance of the ensemble tree model
with the single-tree model developed in Example 10.5.
Feature Importance
Age 3.048356023
Income 1.874119349
Sex 0.285117909
You may have noticed that this result contradicts the
single-tree model developed in Example 10.5, which identifies
Sex as the most important variable for distinguishing responders
from nonresponders. The reason for the discrepancy is that the
student edition of Analytic Solver only allows you to create 10
weak learners, each of which randomly selects only two
predictor variables to construct the tree. The predictor variable
Sex may not have been sufficiently represented in the very
limited number of weak learners, resulting in a biased feature
importance assessment. As we will see in the R example, by
constructing an ensemble tree model with 100 weak learners,
we will be able to conduct a more reliable feature importance
analysis.
Using R
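The earlier steps of this R example (importing the HELOC data, partitioning it,
and fitting the bagging model referred to below as bagging_tree) are not
reproduced on this page. A minimal sketch of how such a model might be fitted
with the randomForest package is given below; the settings ntree = 100 and
mtry = 3 (all three predictors, which is what makes the ensemble a bagging
model) follow the surrounding discussion, but the exact commands are
assumptions.
>install.packages("randomForest")
>library(randomForest)
># HELOC is assumed to be a factor variable so that a classification forest is grown
>set.seed(1)
>bagging_tree <- randomForest(HELOC ~ ., data=trainSet, ntree=100,
mtry=3, importance=TRUE)
The variable importance plot referred to in the text is then produced with the
varImpPlot function: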
>varImpPlot(bagging_tree, type=1)
g. The following commands create the confusion matrix by
comparing the predicted class memberships and actual class
memberships of the validation data set. Enter:
>predicted_class <- predict(bagging_tree, validationSet)
>confusionMatrix(predicted_class, validationSet$HELOC,
positive = "1")
page 461
The confusion matrix is displayed in Figure 10.29.
The performance measures show that the bagging ensemble
tree model has an overall accuracy rate of 80% on the
validation data. As shown earlier, the model is much better at
classifying Class 0 cases correctly (specificity = 0.8851) than
at classifying Class 1 cases correctly (sensitivity = 0.5577).
page 463
To create an ensemble tree model using the
random forest strategy, simply replace the R commands in
step e with the following commands:
>set.seed(1)
>randomforest_tree <- randomForest(HELOC ~ .,
data=trainSet, ntree= 100, mtry = 2, importance = TRUE)
As you can see, the only difference between the commands
for bagging and random forest is the number you specify for
the mtry option in the randomForest function. Because
random forest selects a subset of the predictor variables for
building each single-tree model, we specify the number of
predictor variables to include in the subset. In this case, we ask
the function to randomly select two predictor variables for
building individual single-tree models.
R uses the boosting function of the adabag package to
create ensemble tree models that implement the boosting
strategy. Due to the technical requirements of the adabag
package, we create the boosting tree model slightly differently
as shown in the example below:
a. Import the data from the HELOC_Data worksheet of the
HELOC data file into a data frame (table) and label it myData.
b. Install and load the caret, gains, pROC, and page 464
adabag packages. Enter:
>install.packages("caret", dependencies = c("Depends",
"Suggests"))
>install.packages("gains")
>install.packages("pROC")
>install.packages("adabag")
>library(caret)
>library(gains)
>library(pROC)
>library(adabag)
c. Because the adabag package requires the use of data frame
class objects, we convert myData to a data frame class object
using the data.frame function. We convert categorical
variables (i.e., Sex and HELOC) to factor variables as required
by the software. Enter:
>myData <- data.frame(myData)
>myData$HELOC <- as.factor(myData$HELOC)
>myData$Sex <- as.factor(myData$Sex)
d. We set the random seed to 1 and partition the data into
training (60%) and validation (40%) data sets. Enter:
>set.seed(1)
>myIndex <- createDataPartition(myData$HELOC, p=0.6, list
= FALSE)
>trainSet <- myData[myIndex,]
>validationSet <- myData[-myIndex,]
e. We use the boosting function to construct the ensemble tree
model using the boosting strategy. By setting the random seed
to 1, you will get the same results as in this example. The
mfinal option specifies the number of weak learners (single-
tree models) to create. Enter:
>set.seed(1)
>boosting_tree <- boosting(HELOC ~ ., data=trainSet, mfinal
= 100)
f. The following commands create the confusion matrix by
comparing the predicted class memberships and actual class
memberships of the validation data set. For boosting tree
models, the predict function produces a list (named prediction
here), which includes predicted class memberships and
probabilities. You can access the predicted class
memberships and predicted probabilities using
prediction$class and prediction$prob, respectively.
>prediction <- predict(boosting_tree, validationSet)
>confusionMatrix(as.factor(prediction$class),
validationSet$HELOC, positive = "1")
Verify that the accuracy rate, sensitivity, and specificity are
0.8, 0.5577, and 0.8851, respectively. Using the default cutoff
value of 0.5, the boosting ensemble tree shows the same
predictive performance as the bagging ensemble tree does.
g. The following commands create the cumulative lift chart, the
decile-wise lift chart, and the ROC curve. As the syntax for
creating these graphs has been discussed in previous
sections, we will not repeat it here. Note that we access the
predicted probability of a validation case belonging to Class 1
(the target class) using prediction$prob[, 2] because the Class 1
probabilities are listed in the second column of the object.
Enter:
>validationSet$HELOC <-
as.numeric(as.character(validationSet$HELOC))
>gains_table <- gains(validationSet$HELOC,
prediction$prob[,2])
page 465
>gains_table
>#cumulative lift chart
>plot(c(0,
gains_table$cume.pct.of.total*sum(validationSet$HELOC)) ~
c(0, gains_table$cume.obs), xlab = "# cases",
ylab = "Cumulative", type = "l")
>lines(c(0, sum(validationSet$HELOC)) ~ c(0,
dim(validationSet)[1]), col = "red", lty = 2)
># decile-wise lift chart
>barplot(gains_table$mean.resp/mean(validationSet$HELOC),
names.arg = gains_table$depth, xlab = "Percentile", ylab = "Lift",
ylim = c(0, 3), main = "Decile-Wise Lift Chart")
># ROC curve
>roc_object<- roc(validationSet$HELOC, prediction$prob[,2])
>plot.roc(roc_object)
># compute auc
>auc(roc_object)
Verify that the AUC value of the ROC curve is 0.8209, which is
slightly lower than the AUC value of the ROC curve derived
from the bagging ensemble tree.
h. Finally, to score the 20 new cases, we import the data from
the HELOC_Score worksheet of the HELOC data file into a
data frame (table) and label it myScoreData. Convert
myScoreData to a data frame class object and predictor
variable Sex to a factor variable as required by the software.
Enter:
>myScoreData <- data.frame(myScoreData)
>myScoreData$Sex <- as.factor(myScoreData$Sex)
>predicted_class_score <- predict(boosting_tree,
myScoreData)
>predicted_class_score$class
>predicted_class_score$prob
The scoring results of the boosting ensemble model are
slightly different from those of the bagging ensemble model;
the 20th case is classified as a Class 0 case by the boosting
model, whereas it is classified as a Class 1 case by the
bagging model. As discussed before, unlike the bagging and
random forest models, the boosting model does not provide
variable importance information.
EXERCISES 10.4
Note: These exercises can be solved using Analytic Solver
and/or R. The answers, however, will depend on the software
package used. For Analytic Solver, partition data sets into 60%
training and 40% validation, and use 12345 as the default
random seed. Create 10 weak learners when constructing the
ensemble model. For R, partition data sets into 60% training
and 40% validation, and use the statement set.seed(1) to
specify the random seed for data partitioning and constructing
the ensemble tree models. Create 100 weak learners when
constructing the ensemble model. All answers in R are based
on version 3.5.3. To replicate the results with newer versions of
R, execute the following line of code at the beginning of the R
session: suppressWarnings(RNGversion("3.5.3")). If the
predictor variable values are in the character format, then treat
the predictor variable as a categorical variable. Otherwise, treat
the predictor variable as a numerical variable.
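For the R-based exercises, the setup described in this note might look like the
following sketch; myData and the target variable y are placeholders for the
relevant data frame and target variable in each exercise.
>suppressWarnings(RNGversion("3.5.3"))
>library(caret)
>set.seed(1)
>myIndex <- createDataPartition(myData$y, p=0.6, list = FALSE)
>trainSet <- myData[myIndex,]
>validationSet <- myData[-myIndex,]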
Mechanics
44. FILE Exercise_10.44. Create a bagging ensemble classification tree model
using the accompanying data set (predictor variables: x1 to x4; target: y).
a. What are the overall accuracy rate, sensitivity, and specificity of the model on
the validation data?
b. What is the AUC value of the model? page 466
c. Score a new record with the following values: x1 = 3.45, x2 =
1, x3 = 18, x4 = 5.80. How would the record be classified using the bagging
ensemble tree model? What is the probability of the record belonging to Class
1?
45. FILE Exercise_10.45. Create a boosting ensemble classification tree model
using the accompanying data set (predictor variables: x1 to x4; target: y).
a. What are the overall accuracy rate, sensitivity, and specificity of the model on
the validation data?
b. What is the AUC value of the model?
c. Score a new record with the following values: x1 = 3.45, x2 = 1, x3 = 18, x4 =
5.80. How would the record be classified using the boosting ensemble tree
model? What is the probability of the record belonging to Class 1?
46. FILE Exercise_10.46. Create a random forest ensemble classification tree
model using the accompanying data set (predictor variables: x1 to x4; target: y).
Select two predictor variables randomly to construct each weak learner.
a. What are the overall accuracy rate, sensitivity, and specificity of the model on
the validation data?
b. What is the AUC value of the model?
c. Which is the most important predictor variable?
d. Score a new record with the following values: x1 = 3.45, x2 = 1, x3 = 18, x4 =
5.80. How would the record be classified using the random forest ensemble
model? What is the probability of the record belonging to Class 1?
47. FILE Exercise_10.47. Create a bagging ensemble classification tree model
using the accompanying data set (predictor variables: x1 to x5; target: y).
a. What are the overall accuracy rate, sensitivity, and specificity of the model on
the validation data?
b. What is the lift value of the leftmost bar of the decile-wise lift chart?
c. Score a new record with the following values: x1 = 52.8, x2 = 230.50, x3 = 1, x4
= 144, x5 = 6.23. How would the record be classified using the bagging
ensemble tree model? What is the probability of the record belonging to Class
1?
48. FILE Exercise_10.48. Create a boosting ensemble classification tree model
using the accompanying data set (predictor variables: x1 to x5; target: y).
a. What are the overall accuracy rate, sensitivity, and specificity of the model on
the validation data?
b. What is the lift value of the leftmost bar of the decile-wise lift chart?
c. Score a new record with the following values: x1 = 52.8, x2 = 230.50, x3 = 1, x4
= 144, x5 = 6.23. How would the record be classified using the boosting
ensemble tree model? What is the probability of the record belonging to Class
1?
49. FILE Exercise_10.49. Create a random forest ensemble classification tree
model using the accompanying data set (predictor variables: x1 to x5; target: y).
Select two predictor variables randomly to construct each weak learner.
a. What are the overall accuracy rate, sensitivity, and specificity of the model on
the validation data?
b. What is the lift value of the leftmost bar of the decile-wise lift chart?
c. Which is the most important predictor variable?
d. Score a new record with the following values: x1 = 52.8, x2 = 230.50, x3 = 1, x4
= 144, x5 = 6.23. How would the record be classified using the random forest
ensemble model? What is the probability of the record belonging to Class 1?
Applications
50. FILE Spam. Mateo Derby works as a cyber security analyst at a private equity
firm. His colleagues at the firm have been inundated by a large number of spam
e-mails. Mateo has been asked to implement a spam detection system on the
company’s e-mail server. He reviewed a sample of 500 spam and legitimate e-
mails with relevant variables: spam (1 if spam, 0 otherwise), the number of
recipients, the number of hyperlinks, and the number of characters in the
message. A portion of the Spam_Data worksheet is shown in the accompanying
table.
Spam Recipients Hyperlinks Characters
0 19 1 47
0 15 1 58
⋮ ⋮ ⋮ ⋮
1 13 2 32
a. Create a bagging ensemble classification tree model to
determine whether a future e-mail is spam. What are the overall accuracy rate,
sensitivity, and specificity of the model on the validation data? What is the AUC
value of the model?
b. Create a random forest ensemble classification tree model. Select two predictor
variables randomly to construct each weak learner. What are the overall
accuracy rate, sensitivity, and specificity of the model on the validation data?
What is the AUC value of the model? Which is the most important predictor
variable?
c. Score the new cases in the Spam_Score worksheet using the bagging
ensemble classification tree model. What percentage of the e-mails is spam?
51. FILE HR. Daniella Lara, a human resources manager at a large tech consulting
firm, has been reading about using analytics to predict the success of new
employees. With the fast-changing nature of the tech industry, some employees
have had difficulties staying current in their field and have
missed the opportunity to be promoted into a management
position. Daniella is particularly interested in whether or not a new employee is
likely to be promoted into a management role after 10 years with the company.
She gathers information on current employees who have worked for the firm for at
least 10 years. The information is based on the job application that the employees
provided when they originally applied for a job at the firm. For each employee, the
following variables are listed: Promoted (1 if promoted within 10 years, 0
otherwise), GPA (college GPA at graduation), Sports (number of athletic activities
during college), and Leadership (number of leadership roles in student
organizations). A portion of the HR_Data worksheet is shown in the
accompanying table.
Promoted GPA Sports Leadership
0 3.28 0 2
1 3.93 6 3
⋮ ⋮ ⋮ ⋮
0 3.54 5 0
Case Study
As millions of people in the U.S. are crippled by student loan debt and high
unemployment, policymakers are raising the question of whether college is even a
good investment. Richard Clancy, a sociology graduate student, is interested in
developing a model for predicting an individual’s income for his master’s thesis. He
found a rich data set maintained by the U.S. Bureau of Labor Statistics, called the
National Longitudinal Surveys (NLS), that follows over 12,000 individuals in the
United States over time. The data set focuses on labor force activities of these
individuals, but also includes information on a wide range of variables, such as
income, education, sex, race, personality trait, health, and marital history.
Years of education
Mother’s years of education
Father’s years of education
Urban (equals 1 if the individual lived in an urban area
at the age of 14, 0 otherwise)
Black (equals 1 if the individual is black, 0 otherwise)
Hispanic (equals 1 if the individual is Hispanic, 0
otherwise)
White (equals 1 if the individual is white, 0 otherwise)
Male (equals 1 if the individual is male, 0 otherwise)
Self-Esteem (the individual’s self-esteem using the
Rosenberg Self-Esteem Scale; a higher score indicates
a higher self-esteem.)
Outgoing kid (equals 1 if the individual was outgoing at
the age of six, 0 otherwise)
Outgoing adult (equals 1 if the individual is outgoing as
an adult, 0 otherwise)
As the objective is to determine whether an individual has an income that is at
or above the median personal income rather than an individual’s actual
income, the target variable income is converted into a categorical variable
that assumes the value 1 if income is greater than or equal to $29,998 and
0 otherwise. The final data set includes 5,821 observations, after
observations with missing values were removed from the analysis.
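As a point of reference, this recoding can be done in R with a single
ifelse statement. The sketch below is only illustrative; the column
names Income and IncomeClass are assumptions and may differ from the
variable names in the actual NLS data file.
> # illustrative column names; recode income into a binary target variable
> myData$IncomeClass <- ifelse(myData$Income >= 29998, 1, 0)
> myData$IncomeClass <- as.factor(myData$IncomeClass)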
To build a predictive model and assess the model’s performance, the data
are partitioned into training (60%) and validation (40%) sets. As the predictor
variables include both numerical and categorical data types, the decision tree
methodology is a suitable technique to build a classification model for this
application. Figure 10.31 shows the best-pruned classification tree with eight
decision nodes and nine leaf nodes. A number of conclusions can be drawn
from the classification tree, which shows that personal income level can be
predicted using an individual's sex, education, race, and mother's
education. For example, if an individual is male and has more than 16 years
of education, he has a high probability of earning an income that is at or
above the U.S. median income. If a person is female and has fewer than 16
years of education, she has a high probability of earning an income that is
below the U.S. median income.
Report 10.1 FILE Car_Crash. Subset the data to include only the accidents that
occurred in one city or during one month. Develop a decision tree model that predicts
whether an automobile accident results in fatal or severe injuries using predictor
variables, such as traffic violation category, weather condition, type of collision,
location of the accident, and lighting condition. Note: Many of the predictor variables
are categorical; therefore, you will need to convert them to the appropriate data form
prior to the analysis.
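For the conversion mentioned in the note, a minimal R sketch is shown
below; the column names Weather and Collision are placeholders and may
not match the actual variable names in the Car_Crash file.
> # placeholder column names; convert character predictors to factors
> myData$Weather <- as.factor(myData$Weather)
> myData$Collision <- as.factor(myData$Collision)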
Report 10.2 FILE Longitudinal_Survey. Subset the data to include only those
individuals who lived in an urban area. Develop a decision tree model to predict an
individual’s body mass index (BMI) or whether an individual is going to be overweight
or not using predictor variables, such as sex, number of years of education, parents’
years of education, self-esteem scale, and whether the person is
outgoing as a kid and/or adult. Note: The Centers for Disease
Control and Prevention (CDC) defines overweight as someone whose BMI is equal
to or greater than 25.
Report 10.3 FILE TechSales_Reps. Subset the data to include only college-
educated sales professionals in the software product group. Develop a decision tree
model to predict whether or not the sales professional will receive a high (9 or 10) net
promoter score (NPS), using predictor variables, such as age, sex, tenure with the
company, number of professional certificates acquired, annual evaluation score, and
personality type. Note: You may need to perform data transformation in order to meet
the requirements of the analytics technique.
Report 10.4 FILE College_Admissions. Subset the data to include only one of the
three colleges. Develop a series of decision tree models to predict which college is
most likely to accept a given university applicant based on the applicant’s sex, race,
high school GPA, SAT/ACT score, and parents’ years of education. Note: You will
construct a decision tree model for each college. The college whose model produces
the highest probability of acceptance is the one that is most likely to accept the
applicant.
Report 10.5 FILE House_Price. Develop a decision tree model to predict the sale
price of a house by using predictor variables, such as number of bedrooms, number
of bathrooms, home square footage, lot square footage, and age of the house. Note:
As real estate values can differ dramatically from region to region, you may want to
develop a model for each geographic location in the data set. Compare the
performance of the decision tree model with predictive models that use other
supervised learning techniques discussed in Chapter 9.
11 Unsupervised Data
Mining
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 11.1 Conduct hierarchical cluster analysis.
LO 11.2 Conduct k-means cluster analysis.
LO 11.3 Conduct association rule analysis.
FILE
Candy_Bars
INTRODUCTORY CASE
Nutritional Facts of Candy Bars
Aliyah Williams is an honors student at a prestigious business
school in Southern California. She is also a fledgling
entrepreneur and owns a vending machine business. Most of
her vending machines are located on campus and in the
downtown area, and they are stocked with a variety of snacks,
including a large selection of candy bars. Aliyah is aware that
California consumers are becoming increasingly health-conscious
when it comes to food purchases. She has learned
from her consumer research class that the U.S. Department of
Agriculture (USDA) maintains a website with a database of
nutritional facts about food products, including candy bars.
Aliyah wants to come up with a better selection of candy
bars and strategically group and display them in her vending
machines. She also wants to feature certain products more
prominently in different locations based on the type of
consumers that tend to frequent these locations. Table 11.1
shows a portion of the data that Aliyah has downloaded from
the USDA website. Listed for each candy bar are the calories
per serving (Calories), total fat (Fat in grams), protein (Protein
in grams), and carbohydrates (Carb in grams).
CLUSTER ANALYSIS
Cluster analysis is an unsupervised data mining technique that
groups data into categories that share some similar
characteristic or trait.
HIERARCHICAL CLUSTERING
Hierarchical clustering is a technique that uses an iterative
process to group data into a hierarchy of clusters. Common
strategies of hierarchical clustering usually follow one of two
methods: agglomerative clustering or divisive clustering.
Agglomerative clustering is a “bottom-up” approach that
starts with each observation being its own cluster, with the
algorithm iteratively merging clusters that are similar to each
other as one moves up the hierarchy.
Divisive clustering is a “top-down” approach that starts by
assigning all observations to one cluster, with the algorithms
iteratively separating the most dissimilar observations as one
moves down the hierarchy.
Dendrogram
Once the AGNES algorithm completes its clustering process, data
are usually represented in a treelike structure where each
observation can be thought of as a “leaf” on the tree. The treelike
structure is called a dendrogram. It allows users to visually inspect
the clustering result and determine the appropriate number of
clusters in the data. Determining the right number of clusters is
somewhat subjective, but we can visually inspect a dendrogram to
help guide our decision. Figure 11.2 shows an example of a
dendrogram.
A DENDROGRAM
A dendrogram is a graphical representation of data as a treelike
structure, where each branch can be considered a cluster of
observations.
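As a quick illustration of what a dendrogram looks like in R, the
following sketch clusters a few variables from a built-in data set with
base R's hclust function. It is independent of the examples that follow
and uses base R only.
> # toy example with a built-in data set; not the chapter's data
> d_toy <- dist(scale(mtcars[ , c("mpg", "hp", "wt")]))
> hc_toy <- hclust(d_toy, method = "ward.D2")
> plot(hc_toy)   # the resulting tree is a dendrogram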
SOLUTION:
Using Analytic Solver
Data range and select cells $A$1:$D$42. Make sure that the
box First Row Contains Headers is checked. The Variables in
Input Data box will populate. Select and move variables
Crime, Poverty, and Income to the Selected Variables box.
Accept other defaults and click Next.
TABLE 11.3 Updated Cities Data using Analytic Solver for Example 11.1
We can now compute summary statistics for each cluster to
review the cluster characteristics. Table 11.4 shows these
summary statistics, with the average crime rate, poverty rate,
and income for each cluster. We identify Cluster 1 as a group
of 11 cities with the lowest crime rate, the lowest poverty rate,
and the highest median income among the three clusters. On
the other hand, Cluster 2 represents a group of 15 cities with
the highest crime rate, highest poverty rate, and lowest median
income among the three clusters. Cluster 3 has a crime rate
closer to Cluster 1, a medium poverty rate, and a median
income closer to Cluster 2. Policymakers may find this type of
information useful when they make funding decisions or try to
understand the varying effect of an economic policy on different
cities.
TABLE 11.4 Hierarchical Clustering Results using Analytic Solver for Example
11.1
a. Import the Cities data into a data frame (table) and label it
myData.
b. Install and load the cluster package. Enter:
> install.packages("cluster")
> library(cluster)
c. We exclude the City variable from the analysis; standardize
the Crime, Poverty, and Income variables; and store their
standardized values in a data frame called myData1. We use
the scale function to standardize the values of the three
variables. Enter:
> myData1 <- scale(myData[ , 2:4])
If no standardization is needed, we simply select the three
numerical variables and our R code would be > myData1 <-
myData[ , 2:4].
d. We use the dist function to determine similarity among
observations, and store the distance values in a new variable
called d. For options within the dist function, we use method
to specify the distance calculation. Options for the distance
calculation include "euclidean", "manhattan", "binary" (for
Jaccard's coefficient), "maximum", and "minkowski". [The last
two options are beyond the scope of this text.] We specify
"euclidean" to use the Euclidean distance as the similarity
measure. Enter:
> d <- dist(myData1, method = "euclidean")
e. We use the agnes function to perform agglomerative
clustering and label the results as aResult. For options within
the agnes function, we use method to specify the clustering
method. Options for the clustering method include "single"
(single linkage), "complete" (complete linkage), "average",
"ward" (Ward's method), "weighted", and "flexible". We specify
"ward". The diss option (meaning dissimilarity) is set equal to
TRUE to indicate that variable d contains a distance matrix
instead of the original variables. Enter "aResult" to obtain the
clustering results including an agglomerative coefficient, which
measures the strength of the clustering structure and whether
or not there is a natural clustering structure in the data.
Generally, a coefficient value of 0.75 or greater is indicative of
the existence of a good and natural clustering structure. Enter:
> aResult <- agnes(d, diss = TRUE, method = "ward")
> aResult
R reports the following portion of results:
Call: agnes(x = d, diss = TRUE, method = "ward")
Agglomerative coefficient: 0.942097
The agglomerative coefficient (0.942097) suggests that a
good and natural clustering exists in the data.
f. We use the plot function to produce the dendrogram as well
as a banner plot. Enter:
> plot(aResult)
Hit <Return> to see next plot:
You will be prompted to press <Return> (the
“Enter” key) to display the plots. If you press <Return> once,
the banner plot appears.
Figure 11.5(a) shows the banner plot. A banner plot is an
alternative to the dendrogram. The red bars in the banner plot
represent observations, and each empty gap between the red
bars represents potential clusters. As with the dendrogram, we
can cut the banner to obtain a desired number of clusters. In
this example, if we cut the banner where the Height value = 7
(shown as a blue line), we will obtain a result with three
clusters.
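To turn the cut just described into actual cluster assignments, one
approach (a sketch, assuming the aResult object from step e is still in
memory) is to coerce the agnes result to an hclust object and apply the
cutree function:
> # assign each city to one of three clusters and tabulate cluster sizes
> memb <- cutree(as.hclust(aResult), k = 3)
> table(memb)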
EXAMPLE 11.2
A national phone carrier conducted a socio-demographic study
of their current mobile phone subscribers. Subscribers were
asked to fill out survey questions about their current annual
salaries (Salary), whether or not they live in a city (City equals
1 if living in a city, 0 otherwise), and socio-demographic
information such as marital status (Married equals 1 if married,
0 otherwise), sex (Sex equals 1 if male, 0 otherwise), and
whether or not they have completed a college degree (College
equals 1 if college degree, 0 otherwise). Table 11.7 shows a
portion of the survey data collected from 196 subscribers.
EXERCISES 11.1
Mechanics
Applications
9. FILE Colleges. Peter Lara, an aspiring college student, met with his high school
college advisor to discuss potential colleges to which he might apply. He was
advised to consult with the College Scorecard information on the Department of
Education website. After talking to his family, he downloaded a list of 116 colleges
and information on annual post-college earnings (Earnings in $), the average
annual cost (Cost in $), the graduation rate (Grad in %), the percentage of
students paying down debt (Debt in %), and whether or not a college is located in
a city (1 denotes a city location, 0 otherwise). Peter and his family want to group
these colleges based on the available information to help narrow down their
choices. The accompanying table shows a portion of the data.
0 1 1 1
1 0 1 0
⋮ ⋮ ⋮ ⋮
1 0 0 1
a. Perform agglomerative clustering to group the students in Anne’s data set. Use
Jaccard’s coefficients for the similarity measure and the complete linkage
clustering method. Inspect the dendrogram. How many clusters are generated if
the minimum distance between clusters is 0.8? How many transfer students are
in the largest cluster?
b. Repeat part a but choose average linkage clustering method. How many
students are on the Dean’s list in the largest cluster?
11. FILE Football_Players. Denise Lau is an avid football fan and religiously follows
every college football game. During the current season, she meticulously keeps a
record of how each quarterback has played throughout the season. Denise is
making a presentation at a local college football fan club about these
quarterbacks. The accompanying table shows a portion of the data that she has
recorded. Variables include the player number (Player), completed passes
(Comp), attempted passes (Att), completion percentage (Pct), total yards thrown
(Yds), average yards per attempt (Avg), yards thrown per game (Yds/G), number
of touchdowns (TD), and number of interceptions (Int).
a. Perform agglomerative clustering to group the quarterbacks according to their
performance to help Denise prepare for this presentation. Standardize the
variables and use the Euclidean distance and the Ward clustering method.
Cluster the data into three clusters. How many players are in the largest
cluster? What is the average number of touchdowns of the largest cluster?
b. Select only the quarterbacks with at least 150 attempted passes and repeat the
cluster analysis performed in part a. Cluster the data into two clusters. How
many players are in the larger cluster? What is the average number of
touchdowns of the larger cluster? Note: For Analytic Solver, use the Filter option
in Excel to select the observations. Copy the observations to a new worksheet.
Do not alter the order of the observations.
12. FILE Baseball_Players. Ben Derby is a highly paid scout for a professional
baseball team. He attends at least five or six Major League Baseball games a
week and watches as many recorded games as he can in order to evaluate
potential players for his team. He also keeps detailed records about each
prospective player. His team is now seeking to add another hitter to its roster next
season. Luckily, Ben has information on 144 hitters in the league who have played
at least 100 games during the last season. The accompanying table shows a
portion of the data. Variables include the player number (Player), the number of
games played (G), at bats (AB), runs (R), hits (H), homeruns
(HR), runs batted in (RBI), batting average (AVG), on base
percentage (OBP), and slugging percentage (SLG).
Use agglomerative clustering to group the 144 players into three clusters using all
variables except Player. Standardize the variables and use the Euclidean distance
and single linkage clustering method. How many players are in each cluster?
Describe the characteristics of each cluster using the average values of the
variables.
13. FILE Internet_Addiction. Internet addiction has been found to be a widespread
problem among university students. A small liberal arts college in Colorado
conducted a survey of Internet addiction among its students using the Internet
Addiction Test (IAT) developed by Dr. Kimberly Young. The IAT contains 20
questions that measure three underlying psychometric factors. Questions 1
through 9 measure emotional/psychological conflicts, which refer to the degree to
which the individual uses the Internet as a means to avoid interactions with
friends and family. Questions 10 through 14 measure time management issues,
which refer to the degree to which the individual chooses to spend time online at
the expense of other responsibilities. Finally, Questions 15 through 20 measure
mood modification, which refers to the degree to which the individual’s Internet
dependence is motivated by the need for mood improvement. The accompanying
table shows a portion of the responses from 350 students. Students respond on a
scale of 1 to 5, where higher values are indicative of possible problematic Internet
use.
16. FILE Pizza_Customers. A local pizza store wants to get a better sense of
who its customers are. The accompanying table shows a portion of data that it
collected on 30 randomly selected customers. Variables include
age, female (1 if female, 0 otherwise), annual income, married (1
if married, 0 otherwise), own (1 if own residence, 0 otherwise), college (1 if
completed college degree, 0 otherwise), household size (Size), and annual store
spending (Spending).
1 2.53 17
2 2.54 4
⋮ ⋮ ⋮
10 3.09 0.5
21. FILE Internet_Addiction2. Internet addiction has been
data from the National Longitudinal Survey (NLS), which follows over 12,000
individuals in the United States over time. Variables in this analysis include Urban
(1 if lives in urban area, 0 otherwise), Siblings (number of siblings), White (1 if
white, 0 otherwise), Christian (1 if Christian, 0 otherwise), FamilySize, Height,
Weight (in pounds), and Income (in $).
FILE
CandyBars
EXAMPLE 11.3
Recall that the objective outlined in the introductory case is to
help Aliyah Williams group candy bars into meaningful clusters
and improve product selection and display for her vending
machines. Perform k-means clustering on the data assuming
four clusters. Interpret the results.
SOLUTION:
Using Analytic Solver
Data range and select cells $A$1:$E$38. Make sure that the
box First Row Contains Headers is checked. The Variables in
Input Data box will populate. Select and move variables
Calories, Fat, Protein, and Carb to the Selected Variables box.
Click Next.
a. Import the CandyBars data into a data frame (table) and label
it myData.
The following instructions and results are based on R version
3.5.3. To replicate the results with newer versions of R, enter:
> suppressWarnings(RNGversion("3.5.3"))
b. Install and load the cluster package. Enter:
> install.packages("cluster")
> library(cluster)
c. We exclude the Brand variable from the analysis and
standardize the other variables using the scale function. The
new data frame is called myData1. Enter:
> myData1 <- scale(myData[ , 2:5])
d. We use the set.seed function to set the random seed and the
pam function to perform the k-means clustering. (The pam
function implements k-medoids, a robust variant of k-means that
uses observations themselves as cluster centers.) Within the
pam function, we set the option k to 4 because we have
preselected 4 clusters. We store the clustering results in a
variable called kResult. We use the summary function to view
the results. Enter:
> set.seed(1)
> kResult <- pam(myData1, k = 4)
> summary(kResult)
Table 11.9 shows a portion of the R results
that we have rearranged for presentation purposes. Although
the results are quite different from the ones that we obtained
with Analytic Solver, there are some similarities.
For example, Cluster 2 in R, like Cluster 1 in Analytic Solver,
includes three candy bars with a relatively high amount of
carbohydrates (z-score of 2.0860) and a low amount of fat (z-
score of −1.9482). Similarly, Cluster 4 in R is like Cluster 2 in
Analytic Solver and Cluster 1 in R is like Cluster 3 in Analytic
Solver.
FIGURE 11.10 R’s cluster plot and silhouette plot for Example 11.3
The silhouette plot in Figure 11.10(b) shows how close each
observation in one cluster is to observations in the other
clusters. It also allows us to visually determine the appropriate
number of clusters. The silhouette width for each observation,
si, ranges from −1 to +1, where a value closer to +1 indicates
that the observation is well matched to its own cluster and
poorly matched to neighboring clusters. If most observations
have values close to +1, then the clustering configuration is
appropriate. If many points have a low or negative value, then
the clustering configuration may have too many or too few
clusters. The results in Figure 11.10(b) seem to suggest that
the results are good, but not great. Clusters 2 and 4 seem well-
configured. Two of the observations in Cluster 1 and one
observation in Cluster 3 have silhouette widths that are
negative. The average silhouette width is 0.32, suggesting a
reasonable clustering configuration overall. It is not clear if a
different configuration would necessarily improve the results.
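Because the average silhouette width helps guide the choice of k, a
short sketch (assuming the standardized data frame myData1 from step c)
loops over several candidate values of k and reports the average width
stored in the silinfo component of each pam result:
> # compare average silhouette widths for k = 2 through 6
> for (k in 2:6) {
+   fit <- pam(myData1, k = k)
+   cat("k =", k, "avg silhouette width =", fit$silinfo$avg.width, "\n")
+ }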
The defining characteristics of the two remaining groups are less obvious.
However, Aliyah observes that one of the remaining two clusters includes candy
bars, such as Reese’s Pieces, that have a high content of protein and calories.
Finally, the last cluster tends to include candy bars that contain a moderate amount
of the four nutrients.
Given the increase in health-conscious consumers in the market, Aliyah feels
that her vending machines would best meet customer needs by increasing the
proportion of candy bars that are low in calories and fat, found in the first two
clusters. These candy bars would likely attract customers who crave something
sweet but at the same time do not want to add too many calories or too much fat to their diet.
The fruit-flavored candies would likely cater to customers who are on a low-fat, high-
carb diet. She will also place the healthier candy bars in a more prominent place in
her vending machines located close to healthcare and fitness facilities and display a
sign listing the nutritional facts of each candy bar.
Aliyah also decides to increase the number of selected candy bars from the third
cluster for customers who are looking to consume a little more protein in their diet.
Finally, to maintain product variety, she will also choose an assortment of popular
candy bars from the remaining cluster to include in her vending machines.
EXERCISES 11.2
Mechanics
23. FILE Exercise_11.23. Perform k-means clustering on the accompanying data
set.
a. Use all variables in the analysis. Do not standardize the variables. Set the
number of clusters to 3. What are the size and average distance for the largest
cluster?
b. Specify the same settings as in part a, but standardize the variables. What are
the size and average distance for the largest cluster?
24. FILE Exercise_11.24. Perform k-means clustering on both variables in the
accompanying data set. Standardize the variables. Experiment with the k values
of 2, 3, and 4. Compare the number of observations and distance statistics of the
largest cluster for each k value.
25. FILE Exercise_11.25. Perform k-means clustering on the accompanying data
set.
a. Use variables x1, x2, and x3 in the analysis. Standardize the data. Specify the k
value as 2. What are the cluster center values for the larger cluster?
b. Specify the k value as 3. What are the cluster center values for the smallest
cluster?
26. FILE Exercise_11.26. Perform k-means clustering on the accompanying data set. Use variables x4, x5, x6, and x7, standardized to z-scores, in the
analysis.
a. Specify the k value as 2 and plot the cluster membership using the cluster and
silhouette plots. What is the average silhouette width?
b. Specify the k value as 3 and plot the cluster membership using the cluster and
silhouette plots. What is the average silhouette width?
c. Specify the k value as 4 and plot the cluster membership using the cluster and
silhouette plots. What is the average silhouette width?
27. FILE Exercise_11.27. Perform k-means clustering on the accompanying data
set. Use variables x1, x3, and x5 in the analysis. Do not standardize the variables.
a. Set the number of clusters to 3. What are the size and cluster center values for
the largest cluster?
b. Perform the same analysis as in part a, but with variables standardized to z-
scores. What are the size and cluster center values for the largest cluster?
28. FILE Exercise_11.28. Perform k-means clustering on all the variables in the
30. FILE Exercise_11.30. Perform k-means clustering on all the variables in the
Applications
31. FILE Iris. British biologist Ronald Fisher studied iris flowers and classified them
according to the width and length of the flower’s petals and sepals (a small, green
leafy part below the petal). The accompanying table shows a portion of the data
that Fisher used in his study.
a. Perform k-means clustering using k = 4. What are the size and cluster center
values of the largest cluster?
b. Experiment with k = 3 and k = 5. What are the size and cluster center values of
the largest cluster in each case?
32. FILE Football_Players. Denise Lau is an avid football fan and religiously follows
every college football game. During the current season, she
meticulously keeps a record of how each quarterback has played
throughout the season. Denise is making a presentation at the local college
football fan club about these quarterbacks. The accompanying table shows a
portion of the data that Denise has recorded, with the following variables: the
player number (Player), completed passes (Comp), attempted passes (Att),
completion percentage (Pct), total yards thrown (Yds), average yards per attempt
(Avg), yards thrown per game (Yds/G), number of touchdowns (TD), and number
of interceptions (Int).
a. Perform k-means clustering using k = 3 on all variables except the player
number. Standardize the variables. What are the size and cluster center values
of the largest cluster?
b. Select only the quarterbacks with at least 150 attempted passes and repeat part
a. Note: For Analytic Solver, use the Filter option in Excel to select the
observations. Copy the observations to a new worksheet. Do not alter the order
of the observations.
c. Comment on the differences in the clustering results of parts a and b.
33. FILE Napa. Jennifer Gomez is moving to a small town in Napa Valley, California,
and has been house hunting for her new home. Her Realtor has given her a list of
35 homes with at least 2 bedrooms that were recently sold. Jennifer wants to see
if she can group them in some meaningful ways to help her narrow down her
options. A portion of the data is shown in the accompanying table with the
following variables: the sale price of a home (Price, in $), the number of bedrooms
(Beds), the number of bathrooms (Baths), and the square footage (Sqft).
Price Beds Baths Sqft
799,000 4 3 2,689
⋮ ⋮ ⋮ ⋮
327,900 3 1 1,459
a. Perform k-means clustering to group the 144 players into three clusters.
Standardize and include all the variables except the player number in the
analysis. What is the size of the largest cluster? Which cluster has the highest
average number of homeruns?
b. Select only the players with at least 500 at bats (AB ≥ 500). Perform k-means
clustering with k = 3. What is the size of the largest cluster? Which cluster has
the highest average number of homeruns? Note: For Analytic Solver, use the
Filter option in Excel to select the observations. Copy the observations to a new
worksheet. Do not alter the order of the observations.
level health and population measures for 38 countries from the World Bank’s 2000
Health Nutrition and Population Statistics database. For each country, the
measures include death rate per 1,000 people (Death Rate, in %), health
expenditure per capita (Health Expend, in US$), life expectancy at birth (Life Exp,
in years), male adult mortality rate per 1,000 male adults (Male Mortality), female
adult mortality rate per 1,000 female adults (Female Mortality), annual population
growth (Population Growth, in %), female population (Female Pop, in %), male
population (Male Pop, in %), total population (Total Pop), size of labor force
(Labor Force), births per woman (Fertility Rate), birth rate per 1,000 people (Birth
Rate), and gross national income per capita (GNI, in US$). The accompanying
table shows a portion of the data.
a. Perform k-means clustering to group the 38 countries into four
clusters according to their health measures (i.e., Death Rate, Health Expend,
Life Exp, Male Mortality, and Female Mortality) only. Is data standardization
necessary in this case?
b. What are the size and the average GNI per capita for the largest cluster of
countries?
studies the relationship between the education level and the median income of a
community. The accompanying table shows a portion of the data that he has
collected on the educational attainment and the median income for 77 areas in
the city of Chicago. Sanjay plans to cluster the areas using the educational
attainment data and compare the average median incomes of the clusters. For
each community area, the measures include total number of residents 25 years
and over (25 or Over), number of residents with less than a high school education
(Less than HS), number of residents with a high school education (HS), number of
residents with some college (SC), number of residents with a Bachelor’s degree
or higher (Bachelor), and median household income (Income, in $). The
accompanying table shows a portion of the data.
a. Does Sanjay need to standardize the data before performing cluster analysis?
Explain.
b. Perform k-means clustering to group the community areas into three clusters
based on the variables related to educational attainment of the population (i.e.,
Less than HS, HS, SC, and Bachelor). Plot the three clusters using the cluster
and silhouette plots. What is the average silhouette width? What are the size
and cluster center values of the largest cluster? Which cluster of community
areas has the highest average median household income?
38. FILE Nutritional_Facts. The accompanying data set contains the nutrition
facts on 30 common food items; a portion of the data is shown. The values are
based on 100 grams of the food items. Perform k-means clustering using k = 3 on
the nutritional facts of the food items. Standardize the variables. Describe the
characteristics of each cluster.
a. Perform k-means clustering to group the competitors into five clusters using all
four performance measures. Does the data set need to be standardized prior to
cluster analysis?
b. Plot the cluster membership using the cluster and silhouette plots. What is the
average silhouette width? Describe the characteristics of each cluster. What are
the size and cluster center values of the largest cluster of competitors?
a. Perform k-means clustering to group the countries into three clusters using all
the development indicators. Is data standardization necessary in this case?
b. What is the size of the largest cluster? Which cluster has the highest average
growth in GDP per capita?
c. Compare the cluster membership between parts a and b.
41. FILE SAT_NYC. The accompanying data set contains the school level
average SAT critical reading (CR), math (M), and writing (W) scores for the
graduating seniors from 100 high schools in New York City. The data set also
records the number of SAT test takers (Test Takers) from each school.
a. Perform k-means clustering to group the 100 high schools into four clusters on
the critical reading, math, and writing scores. Is data standardization necessary
in this case? What are the size and cluster center values of the largest cluster?
Which cluster has the lowest average number of test takers?
b. Select only the schools with at least 200 SAT test takers and repeat the cluster
analysis performed in part a. What are the size and cluster center values of the
largest cluster? Note: Use the Filter option in Excel to select the observations.
Copy the observations to a new worksheet. Do not alter the order of the
observations.
42. FILE Telecom. A telecommunications company wants to identify customers who
are likely to unsubscribe from the telephone service. The company collects the
following information from 100 customers: customer ID (ID), age (Age), annual
income (Income), monthly usage (Usage, in minutes), tenure (Tenure, in months),
and whether the customer has unsubscribed from the telephone service
(Unsubscribe). A portion of the data set is shown in the accompanying table.
a. Perform k-means clustering to group the 100 customers into four clusters based
on age, income, monthly usage, and tenure. Describe the characteristics of
each cluster.
b. Compute the percentage of customers who have unsubscribed from the telephone
service in each cluster. Which cluster has the highest percentage of customers
who have unsubscribed?
43. FILE Information_Technology. A country’s information technology use has
been linked to economic and societal development. A non-profit organization
collects data that measure the use and impact of information technology in over
100 countries annually. The accompanying table shows a portion of the data
collected, with the following variables: country number (Country), hardware
industry (Hw), software industry (Sw), telecommunications industry (Tele),
individual usage (IU), business usage (BU), government usage (GU), diffusion of
e-business (EB), and diffusion of e-payment (EP). Each value in the table
represents the score a country has achieved for a given category. The range of
the score is from 1 (lowest) to 7 (highest).
a. Perform k-means clustering to group the countries into three clusters using the
following variables: individual usage, business usage, and government usage.
Is data standardization necessary in this case?
b. What are the size and cluster center values of the largest cluster? Which cluster
has the highest average diffusion of e-business? Which cluster has the highest
average diffusion of e-payment?
c. Use k-means clustering to group the countries using all the variables except the
country number into the same number of clusters as in part a. What are the size
and cluster center values of the largest cluster?
EXAMPLE 11.4
Consider the 10 retail transactions of cosmetics in Table 11.10.
For example, the first transaction includes a lipstick, mascara,
and an eye liner, while the second transaction includes an eye
shadow and a mascara.
Transaction Items
Thus, with a lift ratio greater than one, a strong and positive
association between the purchase of mascara and eye liner
appears to exist, relative to having no rule at all. The lift ratio of
1.19 implies that identifying a customer who purchased a
mascara as one who also purchased an eye liner is 19% better
than just guessing that a random customer purchased an
eye liner.
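To make the arithmetic behind these measures concrete, the short R
sketch below computes confidence and lift for a one-item rule from
transaction counts. The counts used here are made up for illustration
and are not the Table 11.10 data.
> # illustrative counts only; not the actual Table 11.10 data
> n    <- 10   # total transactions
> n_x  <- 6    # transactions containing the antecedent (mascara)
> n_y  <- 5    # transactions containing the consequent (eye liner)
> n_xy <- 4    # transactions containing both items
> confidence <- n_xy / n_x        # estimate of P(consequent | antecedent)
> lift <- confidence / (n_y / n)  # confidence divided by the consequent's support
> lift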
Using Analytic Solver and R to Perform
Association Rule Analysis
With a small data set, it is relatively easy to manually describe and
assess association rules with one item in the antecedent and
consequent. Discovering more complex associations with multiple
items in the antecedents and/or consequents in a larger data set will
require computer applications. The following example demonstrates
how to use association rule analysis in Analytic Solver and R.
EXAMPLE 11.5
The store manager at an electronics store collects data on the
last 100 transactions. Five possible products were purchased:
a keyboard, an SD card, a mouse, a USB drive, and/or a
headphone. Table 11.12 shows a portion of the data. For
example, transaction 1 shows that the customer purchased a
keyboard, a mouse, and a headphone. This type of data is
called a “basket” of products, which is often used in a market
basket analysis.
a. Open the Transaction data file. [Note: The data are already in
a comma separated format.]
b. From the menu, choose Data Mining > Associate >
Association Rules.
a. Install and load the arules and the arulesViz packages. Enter:
> install.packages("arules")
> install.packages("arulesViz")
> library(arules)
> library(arulesViz)
b. We use the read.transactions function to read the data and
store the data in a new variable called myData. This example
assumes that the Transaction data file is saved in the C:
folder (or the root directory of the C: drive). You will need to
make changes according to the location of the file on your
computer. Make sure to specify the format option as "basket"
and the sep option as ",". Enter:
> myData <- read.transactions("C:/Transaction.csv", format = "basket", sep = ",")
c. We use the inspect function to display the first five
transactions. Enter:
> inspect(myData[1:5])
Figure 11.13 shows the results.
d. We use the itemFrequency and the
itemFrequencyPlot functions to check the frequency of the
items. Enter:
> itemFrequency(myData)
> itemFrequencyPlot(myData)
Figure 11.14 shows the item frequency table and plot. The
results indicate that headphones appear in 32% of the
transactions (support of {headphone} = 0.32), while USB
drives appear in only 19% of the transactions (support of {USB
drive} = 0.19).
FIGURE 11.14 R’s item frequency table and item frequency
plot
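The step that typically follows the frequency check is generating the
rules themselves with the apriori function from the arules package. The
sketch below is only illustrative; the support and confidence thresholds
are placeholders rather than the values used in the text.
> # illustrative thresholds; generate, sort, and inspect the rules
> rules <- apriori(myData, parameter = list(support = 0.1, confidence = 0.5))
> inspect(sort(rules, by = "lift"))   # rules ordered by lift ratio
> plot(rules)                         # scatterplot of the rules (arulesViz)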
Mechanics
of 25 transactions.
Transaction 1 a, i, j, k
Transaction 2 c, g, i, k
⋮ ⋮
Transaction 25 a, e, f, j
a. Generate association rules with a minimum support of 10 transactions and
minimum confidence of 75%. Sort the rules by lift ratio. What is the lift ratio for
the top rule?
b. Interpret the support count of the top rule.
c. Generate association rules with a minimum support of 5 transactions and
minimum confidence of 50%. What is the lift ratio for the top rule?
d. Interpret the confidence of the top rule.
45. FILE Exercise_11.45.csv The data are the same as in the preceding
exercise. Read the data file into R using the read.transactions function and
perform the following tasks.
a. Produce an item frequency plot and a frequency table. Which item is the most
frequent item?
b. Generate association rules with a minimum support of 0.25 and minimum
confidence of 0.75. Sort the rules by lift ratio. What is the lift ratio for the top
rule?
c. Interpret the support proportion of the rule with the largest lift
ratio in part b.
d. Generate association rules with a minimum support of 0.15 and minimum
confidence of 0.60. Sort the rules by lift ratio. What is the lift ratio for the top
rule?
e. Interpret the confidence of the rule with the largest lift ratio in part d.
f. Generate and interpret the scatterplot for the rules generated in part d.
of 40 transactions. Read the data file into R using the read.transactions function
and perform the following tasks.
Transaction 1 a, d
Transaction 2 a, b
⋮ ⋮
Transaction 40 b, d, f
a. Produce an item frequency plot and frequency table. Which item is the least
frequent item?
b. Generate association rules with a minimum support of 0.25 and minimum
confidence of 0.50. Sort the rules by lift ratio. What is the lift ratio for the top
rule?
c. Generate association rules with a minimum support of 0.10 and minimum
confidence of 0.50. Sort the rules by lift ratio. What is the lift ratio for the top
rule?
47. FILE Exercise_11.47.csv The data are the same as in the previous
of 100 transactions.
Transaction 1 a, c, d
Transaction 2 c, e, g
⋮ ⋮
Transaction 100 c, d, e, g
49. FILE Exercise_11.49.csv The data are the same as in the previous
exercise. Read the data file using the read.transactions function and perform the
following tasks.
a. Produce an item frequency plot and frequency table. Which item is the most
frequent item?
b. Generate association rules with a minimum support of 0.1 and minimum
confidence of 0.75. Sort the rules by the lift ratio. Report and interpret the lift
ratio for the top rule.
c. Generate association rules with a minimum support of 0.25 and minimum
confidence of 0.75. Sort the rules by the lift ratio. Report and interpret the lift
ratio for the top rule.
d. Generate a scatterplot to display the rules obtained in part c.
of 41 transactions. Read the data file using the read.transactions function and
perform the following tasks.
Transaction 1 a, b, c, e, f
Transaction 2 a, b, c, d, e
⋮ ⋮
Transaction 41 a, b, c, e, f, g
a. Produce an item frequency plot and frequency table. Which item is the least
frequent item?
b. Generate association rules with a minimum support of 0.25 and minimum
confidence of 0.7. Sort the rules by lift ratio. Report and interpret the lift ratio for
the top rule.
c. Generate a scatterplot comparing the rules obtained in part b.
51. FILE Exercise_11.51.csv The data are the same as in the previous
Applications
consumer study to find out the movie genres that its customers watch. Eighty-
eight households volunteered to participate in the study and
allow the company to track the genres of movies they watch over
a one-week period. A portion of the data is shown in the accompanying table.
Record 1 shows that members of that household watched action, romance, and
drama movies during the week, while the second household watched only action
movies during the same week.
Record 2 action
⋮ ⋮
53. FILE Movies.csv Use the movie data set from the previous exercise and R
to perform association rule analysis. Make sure to read the data file using the
read.transactions function first.
a. Explore the data using an item frequency plot and a frequency table. Which
genre of movie is the most frequently watched?
b. Generate association rules with a minimum support of 0.1 and minimum
confidence of 0.5. How many rules are generated?
c. Sort the rules by lift ratio. What’s the lift ratio of the top rule? What does it
mean?
54. FILE Fruits.csv A local grocery store keeps track of individual products that
customers purchase. Natalie Jackson, the manager in charge of the fresh fruits
and produce section, wants to learn more about the customer purchasing patterns
of apples, bananas, cherries, oranges, and watermelons, the five most frequently
purchased fruit items in the store. She gathers data on the last 50 transactions, a
portion of which is shown in the accompanying table. Transaction 1 shows that
the customer purchased apples, bananas, and watermelon, while the second
customer purchased bananas, cherries, and oranges. Find association rules for
this consumer study. Use a minimum support of 10 transactions and a minimum
confidence of 50%. Report and interpret the lift ratio of the top three rules.
out which popular Beatles songs are frequently downloaded together by its users.
The service collects the download logs for 100 users over the past month, where
the download logs show which songs were downloaded during the same session.
A portion of the data is shown in the accompanying table.
User 1 Yesterday, All You Need Is Love . . . Here Comes the Sun
⋮ ⋮
User 100 Hey Jude, Come Together . . . Here Comes the Sun
56. FILE Beatles_Songs.csv. Use the online music data set from the previous
exercise to perform association rule analysis. Make sure to read the data file into
R using the read.transactions function first.
a. Explore the data using an item frequency plot and a frequency table. What is
the most frequently downloaded song?
b. Generate association rules with a minimum support of 0.1 and minimum
confidence of 0.5. Sort the rules by lift ratio. Report and interpret the lift ratio of
the top rule.
c. Display a scatterplot comparing the rules obtained in part b. How many rules
have at least 0.75 confidence, 0.15 support, and 1.2 lift ratio?
Make sure to read the data file into R using the
read.transactions function first.
a. Generate association rules with a minimum support of 0.1 and minimum
confidence of 0.5. Sort the rules by lift ratio. Report and interpret the lift ratio of
the top three association rules.
b. Display a scatterplot comparing the rules obtained in part a. How many rules
have at least 0.75 confidence, 0.15 support, and 1.2 lift ratio?
analyze the city’s historic crime data in order to better allocate police resources in
the future. He collects data over the past two years. Each record in the data
shows the type of crime reported and the location of the crime. Todd is interested
in answering the following question: Which types of crimes are often associated
with which locations?
59. FILE Crime_Analysis.csv Use the crime data from the previous exercise to
perform association rule analysis. Make sure to read the data file using the
read.transactions function first.
a. Explore the data using an item frequency plot and a frequency table. What type
of crime occurred most frequently? What location is associated with crimes
most frequently?
b. Generate association rules with a minimum support of 0.02 and minimum
confidence of 0.30. How many rules are generated?
c. Based on the rules, which type of crime is most likely to be committed in a
department store? Which type of crime is most likely to be committed on the
sidewalk? Which type of crime is most likely to be committed in apartments?
media usage patterns. In his research, he noticed that people tend to use multiple
social media applications, and he wants to find out which popular social media
applications are often used together by the same user. He surveyed 100 users
about which social media applications they use on a regular basis. A sample of
the data set is shown in the accompanying table.
city government of New York City (NYC). She is given an assignment to find out
which types of business are most likely to open in which area of NYC. She
located a data file that contains the business license types and location of over
70,000 NYC businesses. A portion of the data is shown in the accompanying
table. Make sure to read the data file into R using the read.transactions function
first.
Customer 1 HELOC,IRA,Checking
Customer 2 Mortgage,CD,Checking,CD
⋮ ⋮
Customer 81 Checking,Savings
63. FILE Bank_Accounts.csv Use the bank account data set
from the previous exercise to perform association rule analysis. Make sure to read
the data file into R using the read.transactions function first.
a. Explore the data using an item frequency plot and a frequency table. Which
account type is the most frequent item?
b. Generate association rules with a minimum support of 0.1 and minimum
confidence of 50%. Sort the rules by lift ratio. How many rules are generated?
Report and interpret the lift ratio of the top rule.
Case Study
FILE Car_Crash. Ramona Kim is a California Highway Patrol (CHP) officer who
works in the city of San Diego. Having lost her own uncle in a car accident, she is
particularly interested in educating local drivers about driver safety. After discussing
this idea with her commanding officer, she learns that since 2005 the CHP
headquarters has received traffic-related collision information from local and state
agencies and has made it publicly available on its website. Her commanding officer
recommends that she focus on car accidents that result in deaths and severe injuries
and encourages Officer Kim to share her findings with her colleagues as well as
make presentations at local community meetings to raise awareness about car
accidents among local drivers.
Relevant San Diego traffic-accident data are extracted from the Statewide
Integrated Traffic Records System database from January 1, 2013, through
February 28, 2013. It is the rainy season in California during this time period;
thus, drivers often have to travel in heavy rain and other bad weather
conditions. Table 11.13 shows a portion of the data from the 785 accidents
that occurred in the city of San Diego over this time
period. The five variables of interest for the analysis
include the day of the week (WEEKEND = 1 if weekend, 0 otherwise), crash
severity (CRASHSEV = 1 if severe injury or fatality, 0 otherwise), whether or
not there was inclement weather (WEATHER = 1 if inclement, 0 otherwise),
whether or not the accident occurred on a highway (HIGHWAY = 1 if highway,
0 otherwise), and whether or not there was daylight (LIGHTING = 1 if
daylight, 0 otherwise). For example, the first observation represents an
accident that occurred on a weekday that resulted in a severe injury or fatality.
At the time of the accident, the weather was not inclement, but there was no
daylight. And finally, the accident did not occur on a highway.
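If the analysis is carried out with association rules in R, the 0/1
columns must first be coerced into a transactions object. One way to do
this (a sketch, assuming the data have been read into a data frame named
crashData; the name is an assumption) is shown below.
> # coerce the 0/1 columns to a logical matrix, then to transactions;
> # each item then represents the "1" category of a variable
> library(arules)
> crashLogical <- as.matrix(crashData) == 1
> crashTrans <- as(crashLogical, "transactions")
> summary(crashTrans)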
Report 11.1 FILE College_Admissions. Subset the data to include only one of the
three colleges. Group college applicants into clusters based on categorical variables
of your choice (e.g., sex, parent’s education). Determine the appropriate number of
clusters. Report the characteristics of each cluster and explain how they are different
from each other. Note: You will need to transform the categorical values into binary
variables in order to perform cluster analysis. See Chapter 2 for procedures used to
transform categorical variables.
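For the transformation mentioned in the note, one option is base R's
model.matrix function, sketched below; the variable name Sex is an
assumption about the College_Admissions file and should be replaced with
the categorical variables actually chosen.
> # placeholder variable name; create one binary (dummy) column per category
> myData$Sex <- as.factor(myData$Sex)
> dummies <- model.matrix(~ Sex - 1, data = myData)
> myData <- cbind(myData, dummies)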
Report 11.2 FILE College_Admissions. Subset the data to include only one of the
three colleges. Find a combination of numerical and categorical variables to include
in cluster analysis. Determine the appropriate number of clusters and write a report
to explain your decision and describe how each cluster is different
from other clusters. Note: Cluster analysis with mixed data can only
be performed in R.
Report 11.3 FILE TechSales_Reps. Subset the data set to include only sales
professionals in one of the two product groups. Cluster the sales professionals based
on numerical variables of your choice (e.g., age, salary). Determine the appropriate
number of clusters. Describe each cluster and compare the average net promoter
scores of the clusters.
Report 11.4 FILE NBA. Group NBA players into clusters based on either their
career performance or performance statistics from a particular season (e.g., 2013–
2014). Determine an appropriate number of clusters, and write a report based on the
clustering results.
Report 11.5 FILE Longitudinal_Survey. Subset the data set to include only those
individuals who lived in an urban area. Cluster the individuals using a combination of
numerical and categorical variables. Determine the appropriate number of clusters
and write a report to describe the differences between clusters. Note: Cluster
analysis with mixed data can only be performed in R.
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 12.1 Describe the time series forecasting
process.
LO 12.2 Use smoothing techniques to make
forecasts.
LO 12.3 Use linear regression models to make
forecasts.
LO 12.4 Use nonlinear regression models to make
forecasts.
LO 12.5 Apply cross-validation techniques for
model selection.
LO 12.6 Use advanced smoothing methods to
make forecasts.
Observations of any variable recorded over time in sequential order
are considered a time series. Forecasting with time series is an
important aspect of analytics, providing guidance for decisions
in all areas of business. In fact, the success of any business
depends on the ability to accurately forecast vital variables. Sound
forecasts not only improve the quality of business plans, but also
help identify and evaluate potential risks. Examples include
forecasting product sales, product defects, the inflation rate, cyber
attacks, or a company’s cash flows.
In this chapter, we focus on the trend, the seasonal, and the
random components of a time series. Several models are introduced
that capture one or more of these components. In particular, we use
simple smoothing techniques for making forecasts when short-term
fluctuations in the data represent random departures from the overall
pattern with no discernible trend or seasonal fluctuations.
Forecasting models based on regression and advanced smoothing
are introduced when trend and seasonal fluctuations are present in
the time series. Because we are unlikely to know a priori which of
the competing models will provide the best forecast, we apply
in-sample and out-of-sample criteria to select the preferred model for
forecasting.
page 519
©icon Stocker/Shutterstock
INTRODUCTORY CASE
Apple Revenue Forecast
On August 2, 2018, Apple Inc. reported its fourth consecutive
quarter of record revenue and became the first publicly traded
American company to surpass $1 trillion in market value. Its
explosive growth has played a big role in the technology
industry’s ascent to the forefront of the global market economy
(The Wall Street Journal, Aug. 2, 2018). Although the company
designs, develops, and sells consumer electronics, computer
software, and online services, the iPhones segment continues
to be the company’s core source of revenue.
Cadence Johnson, a research analyst at a small investment
firm, is evaluating Apple’s performance by analyzing the firm’s
revenue. She is aware that Apple could be seeing some
resistance to its newly revamped and high-priced line of
iPhones, stoking fears among investors that demand for
iPhones is waning. Cadence hopes that Apple’s past
performance will aid in predicting its future performance. She
collects quarterly data on Apple’s revenue for the fiscal years
2010 through 2018, with the fiscal year concluding at the end of
September. A portion of the data is shown in Table 12.1.
FILE
Revenue_Apple
Year Quarter Revenue
2010 1 15,683
2010 2 13,499
⋮ ⋮ ⋮
2018 4 62,900
TIME SERIES
A time series is a set of sequential observations of a variable
over time. It is generally characterized by the trend, the
seasonal, the cyclical, and the random components.
FILE
Revenue_Apple
Forecasting Methods
Forecasting methods are broadly classified as quantitative or
qualitative. Qualitative forecasting methods are based on the
judgment of the forecaster, who uses prior experience and expertise
to make forecasts. On the other hand, quantitative forecasting
methods use a formal model along with historical data for the
variable of interest.
Qualitative forecasting is especially attractive when historical data
are not available. For instance, a manager may use qualitative
forecasts when she attempts to project sales for a new product.
Similarly, we rely on qualitative forecasts when future results are
suspected to depart markedly from results in prior periods, and,
therefore, cannot be based on historical data. For example, major
changes in market conditions or government policies will render the
analysis from historical data misleading.
Although attractive in certain scenarios, qualitative forecasts are
often criticized on the grounds that they are prone to some well-
documented biases such as optimism and overconfidence.
Decisions based on the judgment of an overly optimistic manager
may prove costly to the business. Furthermore, qualitative
forecasting is difficult to document, and its quality is totally
dependent on the judgment and skill of the forecaster. Two people
with access to similar information may offer different qualitative
forecasts.
Formal quantitative models have been used extensively to
forecast variables, such as product sales, product defects, house
prices, inflation, stock prices, and cash flows.
FORECASTING METHODS
Forecasting methods are broadly classified as quantitative or
qualitative. Qualitative methods are based on the judgment of
the forecaster, whereas quantitative methods use a formal
model to project historical data.
page 522
EXAMPLE 12.1
In preparation for staffing during the upcoming summer
months, an online retailer reviews the number of customer
service calls received over the past three weeks (21 days).
Table 12.2 shows a portion of the time series.
a. Construct a 3-period moving average series for the data.
b. Plot the time series and its corresponding 3-period moving
average, and comment on any differences.
c. Using the 3-period moving average series, forecast the
number of customer service calls for the 22nd day.
d. Calculate MSE, MAD, and MAPE.
FILE
Service_Calls
Day Calls
1 309
2 292
3 284
4 294
5 292
⋮ ⋮
19 326
20 327
21 309
page 524
SOLUTION:
page 525
c. As mentioned earlier, if the series exhibits primarily random
variations, we can use moving averages to generate forecasts.
Because the 3-period moving average for day 21 represents the
average of the three most recent observations (days 19, 20, and 21),
it serves as the forecast for day 22: (326 + 327 + 309)/3 ≈ 320.67,
or roughly 321 calls.
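As a rough R sketch of parts a, c, and d (not the text's worked solution), the following computes the 3-period moving average series, the forecast for day 22, and the accuracy measures; the file path and column name are assumptions.
# A minimal sketch for the Service_Calls data; file path and column
# name are hypothetical.
calls <- read.csv("Service_Calls.csv")
y <- calls$Calls
n <- length(y)
# Trailing 3-period moving average: the average of periods t-2, t-1, t
ma3 <- stats::filter(y, rep(1/3, 3), sides = 1)
forecast_22 <- ma3[n]                       # forecast for day 22
# One-step-ahead errors: ma3[t-1] forecasts y[t], for t = 4, ..., 21
e <- y[4:n] - ma3[3:(n - 1)]
MSE  <- mean(e^2)
MAD  <- mean(abs(e))
MAPE <- mean(abs(e) / y[4:n]) * 100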
page 526
EXAMPLE 12.2
Revisit the Service_Calls data from Example 12.1.
a. Construct the simple exponentially smoothed series with α =
0.20 and L1 = y1.
b. Plot the time series and its corresponding exponentially smoothed
series against days. Comment on any differences.
c. Using the exponentially smoothed series, forecast the number
of customer service calls for the 22nd day.
d. Calculate MSE, MAD, and MAPE. Compare these values with
those obtained using the 3-period moving average technique
in Example 12.1.
SOLUTION: Again, the calculations are based on unrounded
values even though we show rounded values in the text.
a. In Column 3 of Table 12.4, we present sequential estimates of
Lt with the initial value L1 = y1 = 309. We use Lt = αyt + (1 − α)Lt−1
to continuously update the level with α = 0.2. For instance, for
periods 2 and 3 we calculate L2 = 0.2(292) + 0.8(309) = 305.60 and
L3 = 0.2(284) + 0.8(305.60) = 301.28.
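A minimal R sketch of the same updating equation, with α = 0.20 and L1 = y1, might look as follows; the file path and column name are assumptions.
# Simple exponential smoothing for the Service_Calls data (sketch).
calls <- read.csv("Service_Calls.csv")    # hypothetical file path
y <- calls$Calls
n <- length(y)
alpha <- 0.20
L <- numeric(n)
L[1] <- y[1]
for (t in 2:n) L[t] <- alpha * y[t] + (1 - alpha) * L[t - 1]
forecast_22 <- L[n]                        # level at day 21 forecasts day 22
e <- y[2:n] - L[1:(n - 1)]                 # one-step-ahead errors
MSE  <- mean(e^2)
MAD  <- mean(abs(e))
MAPE <- mean(abs(e) / y[2:n]) * 100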
EXERCISES 12.2
Applications
1. FILE Convenience_Store. The owner of a convenience store near Salt Lake
City in Utah has been tabulating weekly sales at the store, excluding gas. The
accompanying table shows a portion of the sales for 30 weeks.
Week Sales
1 5387
2 5522
⋮ ⋮
30 5206
a. Use the 3-period moving average to forecast sales for the 31st week.
b. Use simple exponential smoothing with α = 0.3 to forecast sales for the 31st
week.
c. Which is the preferred technique for making the forecast based on MSE, MAD,
and MAPE?
2. FILE Spotify. Spotify is a music streaming platform that gives access to songs
from artists all over the world. On February 28, 2018, Spotify filed for an initial
public offering (IPO) on the New York Stock Exchange. The accompanying table
shows a portion of the adjusted monthly stock price of Spotify from April 1, 2018,
to February 1, 2019.
Date Stock Price
Apr-18 161.67
May-18 157.71
⋮ ⋮
Feb-19 134.71
page 529
a. Use the 3-period moving average to forecast Spotify’s stock
price for March 2019.
b. Use simple exponential smoothing with α = 0.2 to forecast Spotify’s stock price
for March 2019.
c. Which is the preferred technique for making the forecast based on MSE, MAD,
and MAPE?
3. FILE FoodTruck. Food trucks have become a common sight on American
campuses. They serve scores of hungry students strolling through campus and
looking for trendy food served fast. The owner of a food truck collects data on the
number of students he serves on weekdays on a small campus in California. A
portion of the data is shown in the accompanying table.
Weekday Students
1 84
2 66
⋮ ⋮
40 166
a. Use the 3-period moving average to make a forecast for Weekday 41.
b. Use the 5-period moving average to make a forecast for Weekday 41.
c. Which is the preferred technique for making the forecast based on MSE, MAD,
and MAPE?
4. FILE Exchange_Rate. Consider the exchange rate of the $ (USD) with € (Euro) and $
(USD) with £ (Pound). The accompanying table shows a portion of the exchange
rates from January 2017 to January 2019.
Date Euro Pound
⋮ ⋮ ⋮
a. Find the 3-period and the 5-period moving averages for Euro. Based on MSE,
MAD, and MAPE, use the preferred model to forecast Euro for February 2019.
b. Find the simple exponential smoothing series for Pound with possible α values
of 0.2, 0.4, 0.6. Based on MSE, MAD, and MAPE, use the preferred model to
forecast Pound for February 2019.
5. FILE Downtown_Cafe. The manager of a trendy downtown café in Columbus,
Ohio, collects weekly data on the number of customers it serves. A portion of the
data is shown in the accompanying table.
Week Customers
1 944
2 997
⋮ ⋮
52 1365
⋮ ⋮ ⋮
c. Find the 3-period and the 5-period moving averages for gas prices in New
England. Based on MSE, MAD, and MAPE, use the preferred model to forecast
gas prices for the first week of February 2019.
d. Find the simple exponential smoothing series with possible α values of 0.2, 0.4,
0.6 for gas prices on the West Coast. Based on MSE, MAD, and MAPE, use the
preferred model to forecast gas prices for the first week of February 2019.
page 530
EXAMPLE 12.3
A local organic food store carries several food products for
health-conscious consumers. The store has witnessed a
steady growth in the sale of chef-designed meals, which are
especially popular with college-educated millennials. For
planning purposes, the manager of the store would like to
extract useful information from the weekly sales of chef-
designed meals for the past year, a portion of which is shown in
Table 12.6.
FILE
Organic
Week Sales
1 1925
2 2978
⋮ ⋮
52 6281
For quarterly data, a linear trend model with seasonal dummy
variables is specified as yt = β0 + β1t + β2d1 + β3d2 + β4d3 + εt,
where d1, d2, and d3 are the dummy variables representing the
first three quarters. Forecasts based on the estimated model are
made as ŷt = b0 + b1t + b2d1 + b3d2 + b4d3, with the dummy
variables set to match the quarter being forecast.
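As a sketch of how such a model might be estimated in R with lm, the following assumes a quarterly data frame df with a numeric column y; all object names are illustrative.
# Linear trend model with quarterly seasonal dummy variables (sketch).
n   <- nrow(df)
t   <- 1:n
qtr <- rep(1:4, length.out = n)            # quarter of each observation
d1  <- as.numeric(qtr == 1)
d2  <- as.numeric(qtr == 2)
d3  <- as.numeric(qtr == 3)
fit <- lm(y ~ t + d1 + d2 + d3, data = data.frame(y = df$y, t, d1, d2, d3))
summary(fit)
# Forecast period n + 1; set the dummies to match its quarter
# (here, for illustration, the second quarter).
predict(fit, newdata = data.frame(t = n + 1, d1 = 0, d2 = 1, d3 = 0))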
EXAMPLE 12.4
With Amazon.com at the lead, e-commerce retail sales have
increased substantially over the last decade. Consider
quarterly data on e-commerce retail sales in the U.S. from the
first quarter of 2010 to the first quarter of 2019, a portion of
which is shown in Table 12.7.
page 532
FILE
ECommerce
Year Quarter Sales
2010 1 37059
2010 2 38467
⋮ ⋮ ⋮
2019 1 127265
page 533
b. The estimated linear trend model with seasonal
dummy variables is:
> install.packages("forecast")
> library(forecast)
page 534
Note that in addition to the known future values of t, d1, d2, and d3,
this approach will work only if we also know, or can predict, the
future value of the variable x. In other words, we cannot forecast
product sales if the advertising budget in the future is not known.
Sometimes, we can justify the use of lagged values of causal
variables for making forecasts. In the product sales example, the
relationship between advertising budget and sales may not be
contemporaneous. Therefore, we can specify a model where product
sales is related to the trend term, seasonal dummy variables, and
the value of the advertising budget in the previous period. Further
discussion of lagged regression models is beyond the scope of this
text.
EXERCISES 12.3
Applications
7. FILE Inquiries. Morgan Bank has been encouraging its customers to use its new
mobile banking app. While this may be good for business, the bank has to deal
with a number of inquiries it receives about the new app. The following table
contains a portion of weekly inquiries the bank has received over the past 30
weeks.
Week Inquiries
1 286
2 331
⋮ ⋮
30 219
Estimate the linear trend model to forecast the number of inquiries over the next
two weeks.
8. FILE Apple_Price. Apple Inc. has performed extremely well in the last decade.
After its stock price dropped to below 90 in May 2016, it made a tremendous
comeback to reach about 146 by May 2017 (SeekingAlpha.com, May 1, 2017). An
investor seeking to gain from the positive momentum of Apple’s stock price
analyzes 53 weeks of stock price data from 5/30/16 to 5/26/17. A portion of the
data is shown in the accompanying table.
Date Price
5/30/2016 97.92
6/6/2016 98.83
⋮ ⋮
5/26/2017 153.57
9. FILE Tax_Revenue. The accompanying table shows a portion of monthly tax
revenue from medical and retail marijuana tax and fee collections.
Date Revenue
Feb-14 3,519,756
Mar-14 4,092,575
⋮ ⋮
Oct-18 22,589,679
Use the linear trend model (no seasonality) to forecast the tax revenue for
November and December of 2018.
10. FILE Revenue_Lowes. Lowe’s Companies, Inc., is a home improvement
company offering a range of products for maintenance, repair, remodeling, and
decorating. During the recovery phase since the financial crisis of 2008, Lowe’s
has enjoyed a steady growth in revenue. The following table contains a portion of
quarterly data on Lowe’s revenue (in $ millions) with its fiscal year concluding at
the end of January.
Year Quarter Revenue
2010 1 12,388
2010 2 14,361
⋮ ⋮ ⋮
2018 3 17,415
a. Estimate and interpret the linear trend model with seasonal dummy variables.
b. Use the estimated model to forecast Lowe’s revenue for the fourth quarter of
2018.
page 535
11. FILE Vacation. Vacation destinations often run on a seasonal
basis, depending on the primary activities in that location. Amanda Wang is the
owner of a travel agency in Cincinnati, Ohio. She has built a database of the
number of vacation packages (Vacation) that she has sold over the last twelve
years. The following table contains a portion of quarterly data on the number of
vacation packages sold.
Year Quarter Vacation
2008 1 500
2008 2 147
⋮ ⋮ ⋮
2019 4 923
a. Estimate the linear regression models using seasonal dummy variables with
and without the trend term.
b. Determine the preferred model and use it to forecast the quarterly number of
vacation packages sold in the first two quarters of 2020.
12. FILE Consumer_Sentiment. The following table lists a portion of the University
of Michigan’s Consumer Sentiment index. This index is normalized to have a
value of 100 in 1966 and is used to record changes in consumer morale.
Date Consumer Sentiment
Jan-10 74.4
Feb-10 73.6
⋮ ⋮
Nov-18 97.5
a. Estimate and interpret the linear trend model with seasonal dummy variables.
b. Use the estimated model to make a consumer sentiment index forecast for
December 2018.
13. FILE UsedCars. Used car dealerships generally have sales quotas that they
strive to hit each month, quarter, and calendar year. Consequently, buying a used
car toward the end of those periods presents a great opportunity to get a good
deal on the car. A local dealership has compiled monthly sales data for used cars
(Cars) from 2014-2019, a portion of which is shown in the accompanying table.
Date Cars
Jan-2014 138
Feb-2014 179
⋮ ⋮
Dec-2019 195
a. Estimate the linear regression models using seasonal dummy variables with
and without the trend term.
b. Determine the preferred model and use it to forecast used cars sales in the first
two months of 2020.
page 536
FIGURE 12.6 Scatterplot with superimposed linear and exponential trend lines
Recall from Chapter 7 that we specify an exponential model as
ln(yt) = β0 + β1t + εt. In order to estimate this model, we first generate
the series in natural logs, ln(yt), and then run a regression of ln(yt) on
t. Because, in the exponential model, the response variable is
measured in logs, we make forecasts in regular units as
ŷt = exp(b0 + b1t + se²/2), where se is the standard error of the estimate.
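A minimal R sketch of the exponential trend model, assuming a data frame df with a numeric column y, is shown below; it regresses ln(y) on t and converts the forecast back to regular units.
# Exponential trend model (sketch): ln(y) regressed on t.
n   <- nrow(df)
t   <- 1:n
fit <- lm(log(df$y) ~ t)
b   <- coef(fit)
se  <- summary(fit)$sigma                 # standard error of the estimate
# Forecast in regular units for period n + 1, with the log correction
y_hat <- exp(b[1] + b[2] * (n + 1) + se^2 / 2)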
EXAMPLE 12.5
According to data compiled by the World Bank, the world
population has increased from 3.03 billion in 1960 to 7.53
billion in 2017. This rapid increase concerns environmentalists,
who believe that our natural resources may not be able to
support the ever-increasing population. Additionally, most of
the rapid population growth has been in 34 low-income
countries, many of which are located in Africa. Consider the
population data, in millions, for low-income countries from 1960
to 2017, a portion of which is shown in Table 12.9.
FILE
Population_LowInc
Year Population
1960 166.5028
1961 170.2108
⋮ ⋮
2017 732.4486
page 537
FIGURE 12.8 Scatterplots with superimposed linear and quadratic trend lines
page 539
The cubic trend model allows for two changes in the direction of a
series. In the cubic trend model, we generate two additional
variables, t² and t³, for the regression. A multiple regression model
is run that uses y as the response variable and t, t², and t³ as the
predictor variables. The estimated model is used to make forecasts
as ŷt = b0 + b1t + b2t² + b3t³.
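A brief R sketch of the quadratic and cubic trend models, again assuming a data frame df with a numeric column y:
# Quadratic and cubic trend models (sketch).
dat   <- data.frame(y = df$y, t = 1:nrow(df))
quad  <- lm(y ~ t + I(t^2), data = dat)
cubic <- lm(y ~ t + I(t^2) + I(t^3), data = dat)
# Forecast the next period from each model
new   <- data.frame(t = nrow(df) + 1)
predict(quad,  newdata = new)              # quadratic trend forecast
predict(cubic, newdata = new)              # cubic trend forecast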
page 540
EXAMPLE 12.6
FILE
Revenue_Apple
page 541
The estimated trend models are:
The coefficients for the seasonal dummy variables indicate
that the revenue is about $19,757 million, or $19.76 billion,
higher in the first quarter as compared to the fourth quarter.
The results also suggest that compared to the fourth quarter,
the revenue is somewhat higher in the second quarter and
lower in the third quarter. The positive coefficient for the trend
variable t in the linear model indicates an upward movement
of the revenue. The positive coefficient for t along with a
negative coefficient for t² in the quadratic model captures the
inverted U-shape of the series. Given the coefficients, holding
seasonality constant, the revenue reaches its maximum at t = −b1/(2b2),
which corresponds to the fourth quarter of 2018 and suggests that
Apple's revenues may have peaked.
For several years, Apple’s smartphone segment has been the company’s core
source of revenue, resulting in record revenue. A scatterplot of Apple’s quarterly
revenue for the fiscal years 2010 through 2018 highlights some important
characteristics. First, there is a persistent upward movement with the revenue
plateauing near the end of the observation period. Second, a seasonal pattern
repeats itself. For each year, the revenue is the highest in the first quarter (October–
December) followed by the second (January–March), fourth (July–September), and
third (April–June) quarters.
The coefficients of the estimated quadratic trend model with seasonal dummy
variables suggest that the revenue is about $20 billion higher in the first quarter as
compared to the fourth quarter. This is not surprising because given Apple’s fiscal
calendar, the first quarter encompasses the holiday period with usual strong sales.
The positive coefficient for the trend variable t along with a negative coefficient for t2
captures the plateauing of the series. In fact, given the coefficients, holding
seasonality constant, the revenue reaches its maximum in the fourth quarter of
2018. This finding is consistent with the concern that while Apple is clearly doing
well for now, its future growth may be murky partly because, in the smartphone
market, it sells only in the somewhat saturated mid-to-high-end segment of the
market. The quarterly revenue forecasts for 2019 are $76.28, $60.38, $53.62, and
$55.95 billion, respectively, resulting in a whopping $246 billion in revenue for fiscal
year 2019.
page 542
EXERCISES 12.4
Applications
14. FILE Whites. In 2016, demographers reported that deaths outnumbered births
among white Americans in more than half the states in the U.S. (The New York
Times, June 20, 2018). Consider the white American population, in millions, from
2005 through 2017; a portion of the data is shown in the accompanying table.
Year Whites
2005 215.33
2006 221.33
⋮ ⋮
2017 235.51
a. Use the scatterplot to explore linear and quadratic trends; the cubic trend is not
considered. Which trend model do you think describes the data better?
b. Validate your intuition by comparing adjusted R2 of the two models. Use the
preferred model to forecast the white population in 2018 and 2019.
15. FILE TrueCar. Investors are always reviewing past pricing history and using it to
influence their future investment decisions. On May 16, 2014, online car buying
system TrueCar launched its initial public offering (IPO), raising $70 million in the
stock offering. An investor, looking for a promising return, analyzes the monthly
stock price data of TrueCar from June 2014 to May 2017. A portion of the data is
shown in the accompanying table.
Date Price
Jun-14 14.78
Jul-14 13.57
⋮ ⋮
May-17 17.51
a. Estimate the linear, the quadratic, and the cubic trend models.
b. Determine the preferred model and use it to make a forecast for June 2017.
16. FILE Miles_Traveled. The number of cars sold in the United States in 2016
reached a record high for the seventh year in a row (CNNMoney, January 4,
2017). Consider monthly total miles traveled (in billions) in the United States from
January 2010 to December 2016. A portion of the data is shown in the
accompanying table.
Date Miles
Jan-10 2953.305
Feb-10 2946.689
⋮ ⋮
Dec-16 3169.501
17. FILE Cafe_Sales. The accompanying table shows a portion of daily sales (in $)
at Café Venetian for 100 days.
Day Sales
1 263
2 215
⋮ ⋮
100 2020
Estimate the exponential trend model to forecast sales for the 101st day.
18. FILE Population_Japan. For several years, Japan’s declining population has led
experts and lawmakers to consider its economic and social repercussions (NPR,
December 21, 2018). Consider the population data, in millions, for Japan from
1960 to 2017; a portion of the data is shown in the accompanying table.
Year Population
1960 92.50
1961 94.94
⋮ ⋮
2017 126.79
Jan-16 177.412
Feb-16 177.828
⋮ ⋮
Nov-18 206.263
page 543
a. Estimate the linear and the exponential trend models and
calculate their MSE, MAD, and MAPE.
b. Use the preferred model to forecast the Case-Shiller index for December 2018.
20. FILE Expenses. The controller of a small construction company is attempting to
forecast expenses for the next year. He collects quarterly data on expenses (in
$1,000s) over the past five years, a portion of which is shown in the
accompanying table.
Year Quarter Expenses
2008 1 96.50
2008 2 54.00
⋮ ⋮ ⋮
2017 4 22335.30
a. Estimate and interpret the exponential trend model with seasonal dummy
variables.
b. Use the estimated model to forecast expenses for the first two quarters of 2018.
21. FILE Treasury_Securities. Treasury securities are bonds issued by the U.S.
government. Consider a portion of quarterly data on treasury securities, measured
in millions of U.S. dollars.
2010 1 927527
2010 2 1038881
⋮ ⋮ ⋮
2018 3 2284572
Estimate the exponential trend model with seasonal dummy variables to make a
forecast for the fourth quarter of 2018.
22. FILE House_Price. The West Census region for the U.S. includes Montana,
Wyoming, Colorado, New Mexico, Idaho, Utah, Arizona, Nevada, California,
Oregon, and Washington. Consider the median house prices in the West Census
region from 2010:01 through 2018:03, a portion of which is shown in the
accompanying table.
Year Quarter Price
2010 1 263600
2010 2 264100
⋮ ⋮ ⋮
2018 3 404300
a. Estimate and interpret the quadratic trend model with seasonal dummy
variables.
b. Use the estimated model to forecast the median house price in the West
Census region for the fourth quarter of 2018.
23. FILE Vehicle_Miles. The United States economy picked up speed in 2012 as
businesses substantially built up their inventories and consumers increased their
spending. This also led to an increase in domestic travel. Consider the vehicle
miles traveled in the U.S. (in millions) from January 2012 through September
2018, a portion of which is shown in the accompanying table.
Date Miles
Jan-12 227527
Feb-12 218196
⋮ ⋮
Sep-18 260555
a. Estimate the linear and the exponential trend models with seasonal dummy
variables for vehicle miles and calculate their MSE, MAD, and MAPE.
b. Use the preferred model to forecast vehicle miles for the last three months of
2018.
24. FILE Housing_Starts. Housing starts are the number of new residential
construction projects that have begun during any given month. It is considered to
be a leading indicator of economic strength. The following table contains a portion
of monthly data on housing starts (in 1,000s) in the U.S. from Jan-11 to Nov-18.
Jan-11 40.2
Feb-11 35.4
⋮ ⋮
Nov-18 95.9
a. Estimate and interpret the exponential seasonal trend model.
b. Use the estimated model to forecast housing starts for December 2018.
25. FILE Weekly_Earnings. Data on weekly earnings are collected as part of the
Current Population Survey, a nationwide sample survey of households in which
respondents are asked how much each worker usually earns. The following table
contains a portion of quarterly data on weekly earnings (Earnings, adjusted for
inflation) in the U.S. from 2010–2017.
page 544
Year Quarter Earnings
2010 1 347
2010 2 340
⋮ ⋮ ⋮
2017 4 347
a. Estimate the linear and the quadratic trend models with seasonal dummy
variables.
b. Determine the preferred model and use it to forecast earnings for the first two
quarters of 2018.
yt and ŷt denote the value of the series and its forecast, respectively,
page 545
EXAMPLE 12.7
FILE
Population_LowInc
The derivations for the resulting MSE, MAD, and MAPE in the
validation set are:
EXAMPLE 12.8
FILE
Revenue_Apple
page 546
page 547
We use the entire data, which combine the training and the
validation sets, to re-estimate the preferred linear model for
forecasting Apple's revenue for fiscal year 2019. Enter:
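The R commands that follow are not reproduced in this excerpt. As a hedged sketch of the general cross-validation workflow described in this section, the code below fits a trend model on a training set, computes MSE, MAD, and MAPE on a validation set, and then re-estimates the model on the full data; the data frame, column names, and split point are assumptions.
# Holdout cross-validation for a trend model (sketch).
n     <- nrow(df)
t     <- 1:n
split <- floor(0.8 * n)                    # illustrative 80/20 split
train <- data.frame(y = df$y[1:split], t = t[1:split])
valid <- data.frame(y = df$y[(split + 1):n], t = t[(split + 1):n])
fit   <- lm(y ~ t, data = train)           # linear trend on training set
pred  <- predict(fit, newdata = valid)
e     <- valid$y - pred
MSE  <- mean(e^2)
MAD  <- mean(abs(e))
MAPE <- mean(abs(e) / valid$y) * 100
# Re-estimate the preferred model on the entire data before forecasting
final <- lm(y ~ t, data = data.frame(y = df$y, t = t))
predict(final, newdata = data.frame(t = n + 1))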
EXERCISES 12.5
Applications
26. FILE Population_Japan. The accompanying data file contains annual
population data (in millions) for Japan from 1960 to 2017. For cross-validation, let
the training and the validation sets comprise the periods from 1960 to 2005 and
2006 to 2017, respectively.
a. Use the training set to estimate the linear, the quadratic, and the cubic trend
models and compute the resulting MSE, MAD, and MAPE for the validation set.
b. Determine the preferred model and reestimate it with the entire data set to
forecast the population in Japan for 2018.
27. FILE Tax_Revenue. The accompanying data file contains 57 months of tax
revenue from medical and retail marijuana tax and fee collections. For cross-
validation, let the training and the validation sets comprise the first 45 months and
the last 12 months, respectively.
page 548
a. Use the training set to estimate the linear, the quadratic, and
the cubic trend models and compute the resulting MSE, MAD, and MAPE for
the validation set.
b. Determine the preferred model and reestimate it with the entire data set to
forecast tax revenue for the 58th month.
28. FILE Cafe_Sales. The accompanying data file contains daily sales (in $) at Café
Venetian for 100 days. For cross-validation, let the training and the validation sets
comprise the first 80 days and the last 20 days, respectively.
a. Use the training set to estimate the linear and the exponential trend models and
compute the resulting MSE, MAD, and MAPE for the validation set.
b. Determine the preferred model and reestimate it with the entire data set to
forecast sales for the 101st day.
29. FILE Apple_Price. The accompanying data file contains 53 weeks of Apple’s
stock price data. For cross-validation, let the training and the validation sets
comprise the first 40 weeks and the last 13 weeks, respectively.
a. Use the training set to estimate the linear and the exponential trend models and
compute the resulting MSE, MAD, and MAPE for the validation set.
b. Determine the preferred model and reestimate it with the entire data set to
forecast Apple’s stock price for the 54th week.
30. FILE Expenses. The accompanying data file contains quarterly data on
expenses (in $1,000s) over five years. For cross-validation, let the training and
the validation sets comprise the periods from 2008:01 to 2015:04 and 2016:01 to
2017:04, respectively.
a. Use the training set to estimate the linear and the exponential trend models with
seasonal dummy variables and compute the resulting MSE, MAD, and MAPE
for the validation set.
b. Determine the preferred model and reestimate it with the entire data set to
forecast expenses for the first quarter of 2018.
31. FILE House_Price. The accompanying data file lists quarterly data on median
house prices in the West Census region. For cross-validation, let the training and
the validation sets comprise the periods from 2010:01 to 2016:04 and 2017:01 to
2018:03, respectively.
a. Use the training set to estimate the linear and the quadratic trend models with
seasonal dummy variables and compute the resulting MSE, MAD, and MAPE
for the validation set.
b. Determine the preferred model and reestimate it with the entire data set to
forecast house price for the fourth quarter of 2018.
32. FILE Vehicle_Miles. The accompanying data file lists monthly data on vehicle
miles traveled in the U.S. (in millions). For cross-validation, let the training and the
validation sets comprise the periods from Jan-12 to Dec-16 and Jan-17 to Sep-18,
respectively.
a. Use the training set to estimate the linear and the exponential trend models with
seasonal dummy variables and compute the resulting MSE, MAD, and MAPE
for the validation set.
b. Determine the preferred model and reestimate it with the entire data set to
forecast vehicle miles for October 2018.
33. FILE Weekly_Earnings. The accompanying data file contains quarterly data on
weekly earnings (Earnings, adjusted for inflation) in the U.S. For cross-validation,
let the training and the validation sets comprise the periods from 2010:01 to
2015:04 and 2016:01 to 2017:04, respectively.
a. Use the training set to estimate the linear and the quadratic trend models with
seasonal dummy variables and compute the resulting MSE, MAD, and MAPE
for the validation set.
b. Determine the preferred model and reestimate it with the entire data set to
forecast earnings for the first quarter of 2018.
34. FILE Housing_Starts. The accompanying data file contains monthly data on
housing starts (in 1,000s) in the U.S. For cross-validation, let the training and the
validation sets comprise the period from Jan-11 to Dec-16 and Jan-17 to Nov-18,
respectively.
a. Use the training set to estimate the linear and the exponential trend models with
seasonal dummy variables and compute the resulting MSE, MAD, and MAPE
for the validation set.
b. Determine the preferred model and reestimate it with the entire data set to
forecast housing starts for December of 2018.
EXAMPLE 12.9
FILE
Population_LowInc
Parameter User-Supplied Computer-Generated
α 0.20 0.9137
β 0.15 0.4158
> install.packages("forecast")
> library(forecast)
page 550
c. We use the ts function to create a time series
object and call it newData. Within ts, we specify the start and
end periods as well as frequency, denoting the number of
seasons in a year. Enter:
Parameter User-Supplied Computer-Generated
α 0.20 0.9999
β 0.15 0.9999
page 551
Again, computer-generated smoothing
parameters are preferred as they result in lower MSE, MAD,
and MAPE values.
g. We use the entire data, which combine the training and the
validation sets, to reimplement the preferred model for making
forecasts. Enter:
> HFinal <- ets(newData, model = "AAN")
> forecast(HFinal, h=1)
In Example 12.10, we will use Analytic Solver and R for the Holt-
Winters exponential smoothing method using computer-generated
smoothing parameters. As noted earlier, results may differ between
the software packages and their different versions.
page 552
EXAMPLE 12.10
FILE
Revenue_Apple
α 0.4202 0.6746
β 0.0323 0.0321
γ 0.9155 0.9156
The additive method with lower MSE, MAD, and MAPE values
is preferred.
e. From the original data worksheet, choose Time Series >
Smoothing > Holt-Winters > Additive. Select and move
Revenue to Selected Variable. Input 4 in the Period box and
check Optimize. Also, check Produce forecast and input 4 in
the # Forecasts box. The revenue forecasts (in $ millions) are
$95,206.29 for quarter 1, $67,774.63 for quarter 2, $60,275.86
for quarter 3, and $69,155.45 for quarter 4. The quarterly
forecasts result in a sum of $292,412.23 million in Apple’s
revenue for fiscal year 2019.
Using R
e. We use the ets function, with model inputs “AAA” for additive
seasonality and “AAM” for multiplicative seasonality. Also, we
enter FALSE in restrict; sometimes the default setting of TRUE
results in an error. Finally, we use the summary function to
view the results. Enter:
α 0.3967 0.9995
β 0.0001 0.0857
γ 0.6033 0.0005
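Consistent with the ets commands shown for Examples 12.9 and 12.10, a compact R sketch of the Holt-Winters workflow might look as follows; the start period and the quarterly frequency are placeholders.
# Holt-Winters exponential smoothing with the forecast package (sketch).
library(forecast)
# Quarterly series; the start year and frequency = 4 are placeholders.
newData <- ts(df$y, start = c(2010, 1), frequency = 4)
add  <- ets(newData, model = "AAA", restrict = FALSE)   # additive seasonality
mult <- ets(newData, model = "AAM", restrict = FALSE)   # multiplicative seasonality
summary(add)
summary(mult)
# Forecast the next four quarters with the preferred model
forecast(add, h = 4)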
EXERCISES 12.6
Applications
Note: These exercises can be solved using Analytic Solver and/or R. The answers,
however, will depend on the software package used. In R, use AAN for the Holt
method, and use AAA and AAM for the Holt-Winters method with additive and
multiplicative seasonality, respectively.
35. FILE Population_Japan. The accompanying data file contains annual
population data (in millions) for Japan from 1960 to 2017. For cross-validation, let
the training and the validation sets comprise the periods from 1960 to 2005 and
2006 to 2017, respectively.
a. Use the training set to implement the Holt exponential smoothing method with
user-supplied (α = 0.20 and β = 0.15), as well as the computer-generated,
smoothing parameters and compute the resulting MSE, MAD, and MAPE for
the validation set.
b. Determine the preferred model and reimplement it with the entire data set to
forecast the population in Japan for 2018.
36. FILE Tax_Revenue. The accompanying data file contains 57 months of tax
revenue from medical and retail marijuana tax and fee collections. For cross-
validation, let the training and the validation sets comprise the first 45 months and
the last 12 months, respectively.
a. Use the training set to implement the Holt exponential smoothing method with
user-supplied (α = 0.30 and β = 0.20), as well as the computer-generated,
smoothing parameters and compute the resulting MSE, MAD, and MAPE for
the validation set.
b. Determine the preferred model and reimplement it with the entire data set to
forecast tax revenue for the 58th month.
37. FILE Cafe_Sales. The accompanying data file contains daily sales (in $) at Café
Venetian for 100 days. For cross-validation, let the training and the validation sets
comprise the first 80 days and the last 20 days, respectively.
a. Use the training set to implement the Holt exponential smoothing method with
two sets of user-supplied smoothing parameters, α = 0.20 and β = 0.10 and α =
0.30 and β = 0.20. Compute the resulting MSE, MAD, and MAPE for the
validation set.
b. Determine the preferred model and reimplement it with the entire data set to
forecast sales for the 101st day.
38. FILE Apple_Price. The accompanying data file contains 53 weeks of Apple’s
stock price data. For cross-validation, let the training and the validation sets
comprise the first 40 weeks and the last 13 weeks, respectively.
a. Use the training set to implement the Holt exponential smoothing method with
user-supplied (α = 0.20 and β = 0.10), as well as the computer-generated,
smoothing parameters and compute the resulting MSE, MAD, and MAPE for
the validation set.
b. Determine the preferred model and reimplement it with the entire data set to
forecast Apple’s stock price for the 54th week.
39. FILE Expenses. The accompanying data file contains quarterly data on
expenses (in $1,000s) over five years. For cross-validation, let the training and
the validation sets comprise the periods from 2008:01 to 2015:04 and 2016:01 to
2017:04, respectively.
a. Use the training set to implement the Holt-Winters exponential smoothing
method with additive and multiplicative seasonality and compute the resulting
MSE, MAD, and MAPE for the validation set.
b. Determine the preferred model and reimplement it with the entire data set to
forecast expenses for the first quarter of 2018.
40. FILE House_Price. The accompanying data file lists quarterly data on median
house prices in the West Census region. For cross-validation, let the training and
the validation sets comprise the periods from 2010:01 to 2016:04 and 2017:01 to
2018:03, respectively.
a. Use the training set to implement the Holt-Winters exponential smoothing
method with additive and multiplicative seasonality and compute the resulting
MSE, MAD, and MAPE for the validation set.
b. Determine the preferred model and reimplement it with the entire data set to
forecast house price for the fourth quarter of 2018.
page 555
41. FILE Vehicle_Miles. The accompanying data file lists monthly
data on vehicle miles traveled in the U.S. (in millions). For cross-validation, let the
training and the validation sets comprise the periods from Jan-12 to Dec-16 and
Jan-17 to Sep-18, respectively.
a. Use the training set to implement the Holt-Winters exponential smoothing
method with additive and multiplicative seasonality and compute the resulting
MSE, MAD, and MAPE for the validation set.
b. Determine the preferred model and reimplement it with the entire data set to
forecast vehicle miles for the last three months of 2018.
42. FILE Housing_Starts. The accompanying data file contains monthly data on
housing starts (in 1,000s) in the U.S. For cross-validation, let the training and the
validation sets comprise the period from Jan-11 to Dec-16 and Jan-17 to Nov-18,
respectively.
a. Use the training set to implement the Holt-Winters exponential smoothing
method with additive and multiplicative seasonality and compute the resulting
MSE, MAD, and MAPE for the validation set.
b. Determine the preferred model and reimplement it with the entire data set to
forecast housing starts for December of 2018.
page 556
Case Study
Leading economic indicators, such as the stock market or the housing market, often
change prior to large economic adjustments. For example, a rise in stock prices often
means that investors are more confident of future growth in the economy. Or, a fall in
building permits is likely a signal that the housing market is weakening – which is
often a sign that other sectors of the economy are on the downturn.
feverpitched/123RF
Consider what happened prior to the 2008 recession. As early as October 2006,
building permits for new homes were down 28% from October 2005. Analysts use
economic indicators to predict future trends and gauge where the economy is
heading. The information provided by economic indicators helps firms implement or
alter business strategies.
Pooja Nanda is an analyst for a large investment firm in Chicago. She covers the
construction industry and has been given the challenging task of forecasting housing
starts for June 2019. She has access to seasonally adjusted monthly housing starts
in the United States from January 2016 to May 2019. A portion of the data is shown
in Table 12.19.
FILE
Starts
Jan-16 1114
Feb-16 1208
⋮ ⋮
May-19 1269
Pooja would like to use the sample information to identify the best-fitting model to
forecast housing starts for June 2019.
page 557
Given the findings from Figure 12.9, three trend models are estimated.
1. The three-period moving average model.
2. The simple exponential smoothing model with various values for the speed
of decline α.
3. The simple linear trend model, yt = β0 + β1Timet + εt, where yt
represents housing starts.
Three performance measures are used for model selection: mean square
error (MSE), mean absolute deviation (MAD), and the mean absolute
percentage error (MAPE). The preferred model will have the lowest MSE,
MAD, or MAPE. Table 12.20 shows the values of these three performance
measures for the models.
*For the exponential smoothing model, α = 0.20 provides the lowest values
for MSE, MAD, and MAPE.
The linear trend model provides the best sample fit, as it has the lowest
values for MSE, MAD, and MAPE. Therefore, the estimated linear trend
model is used to derive the forecast for June 2019 as
Housing starts play a key role in determining the health of the economy and
are, therefore, always under scrutiny. The U.S. housing market seems to be on
solid ground even though there has been a slowdown from its peak in early
2018. In this report, we employ simple time series models to project historical
data on housing starts.
page 558
Suggested Case Studies
Report 12.1 FILE Fried_Dough. Fried dough is a popular North American food
associated with outdoor food stands at carnivals, amusement parks, fairs, festivals,
and so on. Usually dusted with powdered sugar and drenched in oil, it is not
particularly healthy, but it sure is tasty! Jose Sanchez owns a small stall at Boston
Commons in Boston, Massachusetts, where he sells fried dough and soft drinks.
Although business is good, he is apprehensive about the variation in sales for no
apparent reason. The accompanying data file contains information on the number of
plates of fried dough and soft drinks that he sold over the last 30 days. In a report,
use the sample information to
Explore forecasting models, including moving averages and the simple exponential
method, to smooth the time series for fried dough and soft drinks.
Use the preferred method to forecast sales of fried dough and soft drinks for the
31st day.
Report 12.2 FILE India_China. According to United Nations estimates, more than
half of the world population live in just seven countries, with China, closely followed
by India, leading the pack. The other five countries on the list include the United
States, Indonesia, Brazil, Pakistan, and Nigeria. It is believed that India will overtake
China to become the world’s most populous nation much sooner than previously
thought (CNN, June 2019). The accompanying data file, compiled by the World
Bank, contains the population data, in millions, for India and China from 1960 to
2017. In a report, use the sample information to
Explore forecasting models, including trend regression models and the Holt
exponential smoothing model, to capture the population trend for both China and
India.
Use the preferred model to forecast the population of China and India from 2018–
2020.
13 Introduction to
Prescriptive Analytics
page 561
LEARNING OBJECTIVES
After reading this chapter, you should be able to:
LO 13.1 Generate values for random variables.
LO 13.2 Develop and apply Monte Carlo
simulation models.
LO 13.3 Formulate a linear programming model.
LO 13.4 Solve and interpret a linear programming
model.
LO 13.5 Formulate and solve a linear integer
programming model.
page 562
page 563
LO 13.1
Generate values for random variables.
page 565
page 568
EXAMPLE 13.2
From the introductory case, the material cost of a red pullover
jacket sold by FashionTech is normally distributed with a mean
of $12 per unit and a standard deviation of $1.37. Generate the
material cost for 100 randomly selected jackets.
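A minimal R sketch of this step, using set.seed(1) for reproducibility as recommended in the exercises below:
# Generate the material cost for 100 jackets from a normal distribution
# with mean $12 and standard deviation $1.37 (sketch).
set.seed(1)
cost <- rnorm(100, mean = 12, sd = 1.37)
mean(cost)
sd(cost)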
LO 13.2
Develop and apply Monte Carlo simulation models.
EXAMPLE 13.3
From the introductory case, Abigail wants to develop an
analytical model to analyze the demand and production
information for a red pullover jacket for women, the top-selling
product for FashionTech. She obtains relevant demand and
cost information from her corporate database, which is
summarized in Table 13.1. Create a Monte Carlo simulation to
help Abigail examine the profit at each staffing level. Also, use
the simulation results to determine whether or not FashionTech
should consider increasing the automation in its production
process.
Description Data
Using Excel
There are many ways to develop a Monte Carlo simulation in
Excel; however, for consistency, we recommend that students
use a template provided in the Jacket worksheet for this
example. Also, to obtain consistent results, we use the random
number generation feature in Excel’s Analysis ToolPak and set
the random seed equal to 1. We will repeat the simulation for
100 iterations and summarize the results on the Jacket
worksheet.
a. Open the Jacket worksheet. Use Data > Data Analysis >
Random Number Generation to generate 100 random
numbers for the weekly demand in cells L11:L110, the
production rate in cells M11:M110, and the material cost in
cells N11:N110. Figure 13.2 shows the Random Number
Generation options for the three random variables; note that
the random seed is set equal to 1 in all simulations.
page 571
FIGURE 13.2 Random number generations for Demand,
Production Rate, and Material Cost
Source: Microsoft Excel
c. Enter the selling price, $50, in cell B2; weekly overhead cost,
$800, in cell B3; and the tailor’s weekly wage, $700, in cell B4.
Label these input parameters, as shown in Figure 13.3. Recall
that Abigail wants to examine the possibility of increasing the
automation in her production process. Storing the input
parameters in their own cells will allow us to later change
these values and obtain results from different scenarios.
d. Recall that the profit is calculated as revenue minus cost. To
compute the revenue, we multiply the selling price by the
number of jackets sold. Note that the number of jackets sold is
the lower value between the demand and the number of
jackets produced, and we use the MIN function in Excel to
determine the number of jackets sold. The total cost is an
aggregate of the overhead cost, the tailors’ wages, and the
material cost. To calculate the weekly profit for hiring one tailor
in cell E11, we enter the following formula.
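The Excel formula itself is not reproduced in this excerpt. As a rough R counterpart to the profit logic just described, the sketch below uses the parameters stated in the example; the demand and production-rate distributions are placeholders because Table 13.1 is not shown here.
# Weekly-profit simulation for one tailor (sketch).
set.seed(1)
trials   <- 100
price    <- 50                    # selling price per jacket
overhead <- 800                   # weekly overhead cost
wage     <- 700                   # weekly wage per tailor
tailors  <- 1
demand   <- round(rnorm(trials, mean = 20, sd = 5))     # placeholder distribution
prodrate <- round(runif(trials, min = 15, max = 25))    # placeholder distribution
matcost  <- rnorm(trials, mean = 12, sd = 1.37)         # material cost per unit
produced <- tailors * prodrate
sold     <- pmin(demand, produced)     # cannot sell more than demand or output
profit   <- price * sold - (overhead + wage * tailors + matcost * produced)
mean(profit)
sd(profit)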
EXERCISES 13.2
For exercises in this section, set the random seed at 1 for the
Random Number Generation feature in Excel’s Analysis
ToolPak. In R, use the statement set.seed(1). This ensures
consistency of the results.
Mechanics
1. Use Excel’s Analysis ToolPak or R, both with a seed of 1, to simulate 25 random
observations based on a binomial distribution with five trials and p = 0.2. What are
the mean and standard deviation of the 25 observations?
page 576
2. Use Excel’s Analysis ToolPak or R, both with a seed of 1, to
simulate 50 random observations based on a continuous uniform distribution over
the interval [23, 37]. What is the range of simulated observations?
3. Use Excel’s Analysis ToolPak or R, both with a seed of 1, to simulate 200 random
observations based on a Poisson distribution with λ = 2. What are the mean and
standard deviation of the 200 observations?
4. Use Excel’s Analysis ToolPak or R, both with a seed of 1, to simulate 120 random
observations of a continuous uniform random variable over the interval [10, 75].
What are the mean, the standard deviation, and the range of the 120
observations? How many observations are greater than 65?
5. Use Excel’s Analysis ToolPak or R, both with a seed of 1, to simulate 1,000
random observations of a normally distributed random variable with μ = 9.23 and
σ = 0.87. Report the mean and standard deviation of the 1,000 observations. How
many of the 1,000 observations have a value less than 8?
Applications
6. A manager at a local grocery store learns that 75% of her customers will use a
credit card when making a purchase. Use Excel’s Analysis ToolPak or R, both
with a seed of 1, to simulate a sample of 12 customers waiting in line at the
grocery store to make a purchase and repeat the simulation 500 times. Based on
the 500 simulations, report the average number of customers who use a credit
card to make a purchase. What is the range of the number of customers using a
credit card?
7. On a given day, about 85% of Internet users in the U.S. visit a social media site.
Use Analysis ToolPak or R, both with a seed of 1, to generate 100 simulations of
a sample of 10 Internet users and report the mean and the standard deviation of
the number of users in the sample who visit a social media site.
8. At a highly selective 4-year college, only 25% of transfer students go on to
graduate on time. Use Analysis ToolPak or R, both with a seed of 1, to generate
400 simulations where each simulation contains a group of 10 transfer students.
Report the mean, the standard deviation, and the range of the number of transfer
students with on-time graduation.
9. A local grocery store observes that on average 4 customers enter the store every
5 minutes during the hour between 5:30 pm and 6:30 pm each day. Use Analysis
ToolPak or R, both with a seed of 1, to generate a simulation for a period of 50
days. Report the mean and the standard deviation from the 50 simulations.
10. Monthly demand for a 60" TV at a local appliance store is normally distributed with
a mean of 11 units and standard deviation of 4.17 units. Use Analysis ToolPak or
R, both with a seed of 1, to develop a simulation for 300 months and report the
mean and the range of the demand for the TV.
11. Peter plans to invest $25,000 in a mutual fund whose annual returns are normally
distributed with a mean of 7.4% and standard deviation of 2.65%. Use Analysis
ToolPak or R, both with a seed of 1, to generate 100 trials to estimate the return
on Peter’s investment after one year. What are the mean and the range of the
investment returns?
12. Every Saturday morning between 9:00 am and 10:00 am, customers, on average,
arrive at a local coffee shop every 1.2 minutes, and the customer arrival follows
an exponential distribution. Use Analysis ToolPak or R, both with a seed of 1, to
generate 500 trials to simulate the time between customer arrivals and report the
sample mean and the standard deviation. Repeat the simulation with 1,000 trials
and compare the new sample mean and the standard deviation with the
theoretical values.
13. On average, a smartphone battery lasts about 6.5 hours with normal usage. The
smartphone battery life follows an exponential distribution. Use Analysis ToolPak
or R, both with a seed of 1, to generate 50 smartphone battery life simulations,
and report the sample mean and the standard deviation. If the number of
simulations is increased to 500, compare the sample mean and the standard
deviation to the theoretical values.
14. A manufacturer of a smartphone battery estimates that monthly demand follows a
normal distribution with a mean of 400 units and standard deviation of 26. Material
cost is uniformly distributed between $7.00 and $8.50. Fixed costs are $2,700 per
month, regardless of the production rate. The selling price is $15 per unit.
a. Use Analysis ToolPak or R, both with a seed of 1, to simulate 1,000 trials to
estimate the expected monthly profit and standard deviation. Demand values
need to be rounded to integers, and use two decimal places for the material
cost.
b. What are the best and worst profit scenarios for the company?
15. Peter has $30,000 to invest in a mutual fund whose annual returns are normally
distributed with a mean of 5% and standard deviation of 4.2%.
a. Use Analysis ToolPak or R, both with a seed of 1, to simulate 5,000 trials to
estimate the mean balance after one year.
b. What is the probability of a balance of $32,000 or more?
c. Compare your results to another investment option at a fixed annual return of
3% per year.
16. Each week, a grocery store purchases eggs from a local ranch for $1.99 for each
12-egg carton and sells it for $3.89. Any cartons not sold within a week will be on
“manager’s special” sales or sold to a low-cost outlet for $1.25. If the eggs sell out
before the end of the week, an estimated opportunity cost of not meeting demand
is $1.75 per carton. The demand distribution is normally distributed with a mean of
75 cartons and a standard deviation of 12.5 cartons. Use Analysis ToolPak or R,
both with a seed of 1, to develop a Monte Carlo simulation for 500 weeks to
answer parts a and b.
page 577
a. If the store has been ordering exactly 75 cartons per week
from the rancher, what are the likelihood and opportunity cost of not meeting the
weekly demand?
b. If the store increases the weekly order to 85 cartons, what is the estimated cost
of having too many eggs in store?
17. A local coffee shop observes that, on average, four customers enter the store
every 5 minutes during the rush hour between 6:30 am and 7:30 am each day.
The number of customers arriving at the coffee shop follows a Poisson
distribution. Each barista can serve 2 or 3 customers every 8 minutes, a pattern
that follows a uniform distribution. The shop owner staffs her coffee shop with two
baristas. During the rush hour in the morning, customers are in a hurry to get to
work or school and will balk when there is a line. The opportunity cost when a
customer balks is $4.25 per customer. The profit generated from each customer is
normally distributed with a mean of $6.50 and standard deviation of $2.37. Each
barista is paid $20 per hour.
a. Use Analysis ToolPak or R, both with a seed of 1, to develop a Monte Carlo
simulation with 500 trials to examine the current staffing level and report the
average profit or loss during the rush hour.
b. If the owner hires a third barista, what is the impact of the new hire on the
profit?
18. Hoping to increase its sales, a pizzeria wants to start a new marketing campaign
promising its customers that if their order does not get delivered within an hour,
the pizzas are free. Historically, the probability of on-time pizza delivery follows a
binomial distribution with n = 50 and p = 0.88. The order amount follows a normal
distribution with a mean of $35 and a standard deviation of $11.
a. Use Analysis ToolPak or R, both with a seed of 1, to simulate 1,000 pizza
orders. What is the average loss of revenue based on the 1,000 simulations?
b. In order to break even, how many new orders does the marketing campaign
need to generate?
19. An investor wants to invest $300,000 in a portfolio of three mutual funds. The
annual fund returns are normally distributed with a mean of 2.00% and standard
deviation of 0.30% for the short-term investment fund, a mean of 5.00% and
standard deviation of 2.50% for the intermediate-term fund, and a mean of 6.25%
and standard deviation of 5.50% for the long-term fund. An initial plan for the
investment allocation is 45% in the short-term fund, 35% in the intermediate-term
fund, and 20% in the long-term fund.
a. Use Analysis ToolPak or R, both with a seed of 1, to simulate 100 trials to
estimate the mean ending balance after the first year and assess the risk of this
investment.
b. If the allocation is changed to 30% short-term, 55% intermediate-term, and 15%
long-term, estimate the ending balance after the first year and the risk of the
investment.
20. At a local appliance store, weekly demand for a large screen TV during the week
prior to the Super Bowl is normally distributed with a mean of 35 units and
standard deviation of 12 units. The store usually keeps 40 units in its inventory.
Use Analysis ToolPak or R, both with a seed of 1, to simulate 500 trials to
estimate the likelihood of overstocking and understocking.
21. Following an exponential distribution, the average lifespan of a smartphone
battery is 2.3 years. The battery manufacturer wants to offer a warranty for its
customers to receive a free replacement if the battery fails during the first year.
Each battery generates a profit of $10.85, and the replacement cost is $6.35.
Develop a Monte Carlo simulation using Analysis ToolPak or R, both with a seed
of 1, for 100 battery units sold.
a. What is the expected total cost of this warranty program?
b. In order to cover the cost of the warranty program, how many additional battery
units does the company need to sell?
22. An amusement park has a roller coaster that can accommodate up to 45 park
goers per ride, and each ride lasts about 20 minutes. For each hour, the park
gives out 200 tickets for the ride, and on average, only 60% of the ticket holders
come to ride on the roller coaster. The number of roller coaster riders follows a
normal distribution with a mean of 120 riders per hour and standard deviation of
35 riders. Develop a Monte Carlo simulation using Analysis ToolPak or R, both
with a seed of 1, of 400 trials. What is the average number of riders unable to get
on the roller coaster?
page 579
EXAMPLE 13.4
After obtaining useful results and actionable insights from the
Monte Carlo simulation, Abigail Kwan wants to continue using
prescriptive analytics to solve other business problems for
FashionTech. One of the difficult decisions that she has to
make every winter season is deciding how many parkas and
winter jackets for women to produce. Both products use the
same materials and require very similar sewing and stitching
skills and manufacturing steps, and, therefore, compete for the
same resources. The main difference between the two
products is the length—jackets are generally shorter and end at
the waist or just below, while parkas are longer fitting, and
therefore offer more warmth. Due to their length, parkas have a
higher manufacturing cost, but also have a higher selling price
and generate a greater profit per unit.
Gordana Sermek/Shutterstock
page 580
JulijaDmitrijeva/Shutterstock
page 581
Subject to:
LO 13.4
Solve and interpret a linear programming model.
EXAMPLE 13.6
Recall from Example 13.4 that Abigail Kwan, the operations
manager at FashionTech, needs to decide how many winter
jackets and parkas to produce in order to maximize profits. Use
Excel’s Solver and R to maximize the objective function, given
the five constraints. Summarize the results.
SOLUTION:
Using Excel
Excel has an add-in feature called Solver that implements the
Simplex method and performs necessary computations to
solve LP problems. There are many ways to structure an Excel
spreadsheet for an LP problem. For consistency, we will use
the Parkas worksheet as a template to demonstrate how to
use Solver. Similar to the Monte Carlo simulation example,
creating an Excel worksheet with input parameters in separate
cells, as in the Parkas worksheet, will allow us to easily change
the parameter values in the LP model and evaluate alternative
scenarios.
a. Open the Parkas worksheet and make sure Solver is
activated by going to File > Options > Add-ins > Go (next to
Manage Excel Add-ins). Check the Solver checkbox and click
OK. Verify that a button labeled Solver is in the Analyze
group under the Data tab.
b. Navigate to the Parameters section of the Parkas worksheet
and, in cells B5 through B8, enter the values 8.5, 1.5, 2, and 9,
which correspond to the amount of fabric, the machine time, the amount of labor,
and the per-unit profit that are associated with the production of one winter
jacket. In cells C5 through C8, enter the corresponding values
for the production of one parka.
c. Enter the amount of fabric available to FashionTech each
month (4,000 feet) in cell D14 and enter the remaining
applicable information in cells D15 through D19. Your Parkas
worksheet should look similar to Figure 13.11.
page 586
g. Launch Solver by navigating to the Data tab and
clicking the Solver button in the Analyze group.
h. In the Solver Parameters dialog box, enter E4 for the Set
Objective box. Make sure to choose the Max option to
maximize the profit value in cell E4.
i. Enter B11:C11 in the By Changing Variable Cells box. Recall
that these two cells represent the decision variables.
j. Click the Add button to add the constraints for the amount of
fabric, total machine time, and labor available to FashionTech.
In the Change Constraint dialog box, enter B14:B16 for Cell
Reference, choose <= from the drop-down list of operators,
and enter D14:D16 for the Constraint option; see Figure
13.13. Click Add.
k. Click the Add button again, and in the next Change Constraint
dialog box, enter B18:B19 in the Cell Reference box for the
demand constraints, choose the <= option, and enter
D18:D19 for the Constraint option. Click OK.
l. Make sure the option Make Unconstrained Variables Non-
Negative is checked to enforce the non-negativity constraints.
For the Select a Solving Method option, choose Simplex LP.
Your Solver dialog box should look similar to Figure 13.14.
Click Solve.
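As a rough sketch only (not the text's own R instructions), the same model might be specified with the lpSolve package by mirroring the structure used in Example 13.7. The winter jacket column repeats the values entered in step b and the 4,000 feet of fabric from step c; the parka column and the remaining right-hand-side limits below are placeholders rather than the actual values in the Parkas worksheet.
> lp.objective <- c(9, 15)  # per-unit profit; the parka value is a placeholder
> lp.constraints <- matrix(c(8.5, 10, 1.5, 2, 2, 2.5, 1, 0, 0, 1),
nrow=5, byrow=TRUE)  # rows: fabric, machine time, labor, jacket demand, parka demand; parka values are placeholders
> lp.directions <- c("<=", "<=", "<=", "<=", "<=")
> lp.rhs <- c(4000, 1000, 1200, 300, 250)  # only the 4,000 feet of fabric comes from step c; the rest are placeholders
> lp.output <- lp("max", lp.objective, lp.constraints, lp.directions, lp.rhs, compute.sens = TRUE)
> lp.output
> lp.output$solution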
page 589
FILE
Chai
EXAMPLE 13.7
Recall from Example 13.5 that Anika Patel, the general
manager at Yerba Buena Tea, needs to decide how many
hours to operate each of the three production facilities. This
decision must be made to ensure that all orders for the spiced
chai powder mix and the chai tea concentrate are filled, while
the production cost is kept at a minimum. Use Excel’s Solver and
R to minimize the objective function given the four constraints.
Summarize the results.
SOLUTION:
Using Excel
In this example, we use a template in the Chai worksheet for
consistency. If you have not already done so, activate the
Solver software by launching Excel and go to File > Options >
Add-ins > Go. Check the Solver checkbox and click OK.
Because many of the steps are analogous to those that we
outlined in Example 13.6, we are brief. Because Excel and R
produce identical results, we summarize the results after the R
instructions.
page 591
page 592
Using R
Because many of the steps are analogous to those that we
outlined in Example 13.6, we are brief.
a. If you have not already done so, install and load the lpSolve
package.
b. We specify the three objective function coefficients, the
parameters in the three constraints, the ≥ and ≤ signs, and the
amounts on the right-hand side of the three constraints. Note
that the constraint for the order amount for the tea concentrate
(260x1 + 375x2 ≥ 4,000) only has two decision variables, so
the parameter value for x3 is specified as 0. Similarly, the limit
on the operating hours for the old facility (x1 ≤ 8) has only one
decision variable, so the parameter values for x2 and x3 are
both 0. As a result, the last four parameters in the matrix
function are 0, 1, 0, 0. Enter:
> lp.objective <- c(190, 260, 150)
> lp.constraints <- matrix(c(295, 385, 350, 260, 375, 0, 1, 0,
0), nrow=3, byrow=TRUE)
> lp.directions <- c(">=", ">=", "<=")
> lp.rhs <- c(5500, 4000, 8)
c. In order to solve the LP minimization model, we use the lp
function with the min option. Enter:
> lp.output <- lp("min", lp.objective, lp.constraints,
lp.directions, lp.rhs, compute.sens = TRUE)
d. In order to display the LP solution and relevant output, enter:
> lp.output
> lp.output$solution
> lp.output$sens.coef.from
> lp.output$sens.coef.to
> lp.output$duals
> lp.output$duals.from
> lp.output$duals.to
Summary of the Results
Based on the optimal solution given by Excel’s Solver and R,
Anika should operate the older facility for 8 hours (x1 = 8), the
newer facility for 5.12 hours (x2 = 5.12), and the single-purpose
facility for approximately 3.34 hours (x3 = 3.34). This plan
would produce 5,500 ounces of the spiced chai powder mix
and 4,000 ounces of the spiced chai tea concentrate, which are
precisely the total amounts ordered. This optimal solution will
incur a total production cost of $3,352.11, which is the lowest
cost for the current parameters and constraints. All three
constraints in the LP model are binding, with no surplus. Based
on the range of optimality, the older facility should be used for 8
hours unless its operating cost increases beyond $192.30 per
hour. Reducing the operating cost of this facility does not have
any impact on the operating hours. For the newer facility, if its
operating cost can be reduced below $256.69 per hour, Anika
should consider using this facility for longer hours. For the
operating cost of the single-purpose facility, the range of
optimality is between $121.38 and $236.36 per hour.
The shadow price for the chai powder mix constraint is
about $0.43, which means that if the order amount for the
powder mix increases by one ounce, the total production cost
will increase by $0.43. This shadow price remains the same as
long as the order amount for the chai powder mix is not less
than 4,331.20 ounces. The shadow price for the tea
concentrate is about $0.25, which indicates that if the order
amount for the tea concentrate increases by one ounce, the
total production cost will increase by $0.25. The range of
feasibility for this shadow price is between 2,080.00 ounces
and 5,138.44 ounces.
page 593
EXERCISES 13.3
Mechanics
23. Consider the following LP problem.
Subject to:
where x1 and x2 represent the decision variables. Solve the LP problem to
answer the following questions.
a. What are the values of x1 and x2 at the optimal solution? What is the maximum
value of z?
b. Identify the binding and nonbinding constraints and report the slack value, as
applicable.
c. Report the shadow price and the range of feasibility of each binding constraint.
Interpret the results.
d. What is the range of optimality for the two objective function coefficients?
Interpret the results.
25. Consider the following LP problem.
Subject to:
page 594
27. Consider the following LP problem.
Subject to:
where x1, x2, and x3 represent the decision variables. Solve the LP problem to
answer the following questions.
a. What are the values of x1, x2, and x3 at the optimal solution? What is the
maximum value of z?
b. Identify the binding and nonbinding constraints and report the slack value, as
appropriate.
c. Report the shadow price and range of feasibility of each binding constraint.
Interpret the results.
d. What is the range of optimality for the three objective function coefficients?
Interpret the results.
28. Consider the following LP problem.
Subject to:
where x1, x2, and x3 represent the decision variables. Solve the LP problem to
answer the following questions.
a. What are the values of x1, x2, and x3 at the optimal solution? What is the
minimum value of z?
b. Identify the binding and nonbinding constraints and report the surplus value, as
appropriate.
c. Report the values and ranges of feasibility of the shadow price of each binding
constraint. Interpret the results.
d. What is the range of optimality for the three objective function coefficients?
Interpret the results.
Applications
29. Big Sur Taffy Company makes two types of candies: salt water taffy and special
home-recipe taffy. Big Sur wants to use a more quantitative approach to decide
how much salt water and special taffy to make each day. Molasses, honey, and
butter are the main ingredients that Big Sur uses to make taffy candies. For a
pound of salt water taffy, Big Sur uses 8 cups of molasses, 4 cups of honey, and
0.7 cup of butter, and the selling price is $7.50/lb. For a pound of special taffy, Big
Sur uses 6 cups of molasses, 6 cups of honey, and 0.3 cup of butter, and the
selling price is $9.25/lb. Taffy candies are made fresh at dawn each morning, and
Big Sur uses ingredients from a very exclusive supplier who delivers 400 cups of
molasses, 300 cups of honey, and 32 cups of butter once a day before sunrise.
a. Formulate and solve the LP model that maximizes revenue given the
constraints. What is the maximum revenue that Big Sur can generate? How
much salt water and special taffy does Big Sur make each day?
b. Identify the binding and nonbinding constraints and report the slack value, as
appropriate.
c. Report the shadow price and the range of feasibility of each binding constraint.
d. What is the range of optimality for the objective function coefficients?
30. A French vineyard in the Chablis region uses Chardonnay grapes to make
Chardonnay wine and Blanc de Blancs blended champagne. To produce 1 liter of
wine, the vineyard uses about 8 kilograms of Chardonnay grapes, and the
winemakers usually spend about 2.5 hours in the winemaking process, including
pressing, blending, and processing. To produce 1 liter of Blanc de Blancs
champagne, about 6 kilograms of Chardonnay grapes are used, with the
pressing, blending, and processing time of 3 hours. A liter of Chardonnay from the
vineyard is usually sold at $55, and each liter of champagne is sold at $45. Each
week, 400 kilograms of Chardonnay grapes and 150 hours for pressing, blending,
and processing are available.
a. Formulate and solve the LP model that maximizes revenue given the
constraints. What is the maximum revenue that the vineyard can make from
selling Chardonnay wine and champagne? How many liters of Chardonnay
wine and champagne should the vineyard make each week?
b. If the amount of Chardonnay grapes available to the vineyard increases to 500
kilograms per week, how much wine and champagne should be made? Explain.
c. If the price of champagne drops to $30 per liter, how much wine and
champagne should the vineyard make? Explain and discuss your answer.
31. A consumer product company makes two types of dishwasher detergents: regular
and concentrate. The company has two manufacturing facilities for making
detergent products. The first facility has an operating cost of $120 an hour and
can produce 300 ounces of regular detergent products per hour and 220 ounces
of concentrate detergent per hour. The second facility has an hourly operating
cost of $220 and produces 350 ounces of regular detergent and 450 ounces of
concentrate detergent per hour. The company received wholesale orders totaling
4,500 ounces of regular detergent and 5,200 ounces of concentrate detergent.
a. Formulate and solve the LP model that minimizes costs given the constraints.
How many hours should the company operate each facility in order to fulfill the
orders and minimize the operating cost?
b. Identify each binding constraint and report the shadow price and the range of
feasibility.
page 595
c. If the hourly operating cost of the second facility is reduced to
$200, how many hours should the company operate each facility? What is the
minimum operating cost in this scenario?
32. Calcium and vitamin D are some of the most essential nutrients for bone health.
According to the Institute of Medicine, an average adult should have in his or her
daily diet 1,000 mg of calcium and 600 IU of vitamin D. These two nutrients are
found in milk and cold cereal, which many people eat for breakfast. One cup of
whole milk has 270 mg of calcium and 124 IU of vitamin D. A cup of a popular
whole grain cereal brand contains 150 mg of calcium and 120 IU of vitamin D.
One gallon of whole milk contains about 16 cups and costs $4.89. A box of the
popular cereal also contains about 16 cups and costs $3.19. Suppose an adult
relies on getting these two nutrients only from breakfast. Formulate and solve the
LP model that minimizes the cost of milk and cereal but satisfies the daily
requirement of calcium and vitamin D. How much milk and cereal should an adult
consume? What is the minimum cost?
33. A renowned chocolatier, Francesco Schröeder, makes three kinds of chocolate
confectionery: artisanal truffles, handcrafted chocolate nuggets, and premium
gourmet chocolate bars. He uses the highest quality of cacao butter, dairy cream,
and honey as the main ingredients. Francesco makes his chocolates each
morning, and they are usually sold out by the early afternoon. For a pound of
artisanal truffles, Francesco uses 1 cup of cacao butter, 1 cup of honey, and ½
cup of cream. The handcrafted nuggets are milk chocolate and take ½ cup of
cacao butter, ⅔ cup of honey, and ⅔ cup of cream for each pound. Each pound of the
chocolate bars uses 1 cup of cacao butter, ½ cup of honey, and ½ cup of cream.
One pound of truffles, nuggets, and chocolate bars can be purchased for $35,
$25, and $20, respectively. A local store places a daily order of 10 pounds of
chocolate nuggets, which means that Francesco needs to make at least 10
pounds of the chocolate nuggets each day. Before sunrise each day, Francesco
receives a delivery of 50 cups of cacao butter, 50 cups of honey, and 30 cups of
dairy cream.
a. Formulate and solve the LP model that maximizes revenue given the
constraints. How much of each chocolate product should Francesco make each
morning? What is the maximum daily revenue that he can make?
b. Report the shadow price and the range of feasibility of each binding constraint.
c. If the local store increases the daily order to 25 pounds of chocolate nuggets,
how much of each product should Francesco make?
34. CaseInPoint is a start-up company that makes ultra-slim, protective cases in
stylish bold colors for a highly popular smartphone with two models: a large MÒR
model and a compact BEAG model. CaseInPoint uses thermoplastic
polyurethane, or TPU, to make all of its phone cases. The MÒR case is larger but
is mostly solid color in design. To produce a MÒR case, 30 grams of TPU are
used with 35 minutes of manufacturing time. The BEAG model is smaller but
more elaborate in design and requires 22 grams of TPU and 53 minutes of
manufacturing time. For each production period, 700 grams of TPU and 1,500
minutes of machine time are available. The profits from the MÒR and BEAG
cases are $9 and $7.50, respectively.
a. Formulate and solve the LP model that maximizes revenue given the
constraints. How many MÒR and BEAG cases should CaseInPoint make for
each production period?
b. If the amount of TPU available increases to 1,000 grams for each production
period, report the new LP solution and discuss your answer.
35. Many vitamin C products are made from ascorbic acid, which is derived from corn.
A small dietary supplement company makes two types of vitamin C supplements:
capsules and chewable tablets. To produce 1 kilogram of vitamin C capsules, 30
kilograms of corn, 2.3 hours of manufacturing time, and 2 hours of packaging time
are required. To produce 1 kilogram of vitamin C tablets, 40 kilograms of corn, 4.2
hours of manufacturing time, and 1 hour of packaging time are required. Each
week, the company has 2,000 kilograms of corn from its suppliers and allocates
180 hours of equipment time for manufacturing and 110 hours of packaging
equipment time for vitamin C supplements. One kilogram of vitamin C capsules
generates a profit of $9.50, and one kilogram of chewable tablets generates a
profit of $12.
a. Formulate and solve an LP model to maximize the profit contribution from the
two vitamin C products. How many kilograms of vitamin C capsules and tablets
should the company produce each week? What is the maximum profit the
company can make?
b. If the per-kilogram profits of vitamin C capsules and tablets increase to $11.50
and $16, respectively, how much of each product should the company make?
What is the corresponding profit amount?
36. A rancher has a six-year old pony that weighs about 180 pounds. A pony of this
age and size needs about 6.2 Mcal of digestible energy, 260 grams of protein, and
9,700 IU of vitamin A in her daily diet. EquineHealth and PonyEssentials are two
popular brands of horse feed. One serving of the EquineHealth feed costs $10.50
and provides 1.5 Mcal, 52 grams of protein, and 1,800 IU of vitamin A. One
serving of the PonyEssentials feed costs $12 and provides 1.8 Mcal, 58 grams of
protein, and 2,200 IU of vitamin A. If the rancher were to blend the two feed
products to make horse feed for his pony, how much of each brand should he use
each week to minimize the cost? Explain and discuss your answer.
page 596
Capital Budgeting
In a typical capital-budgeting problem, a decision maker tries to
choose from a number of potential projects. Examples include
choosing among possible manufacturing plant locations or among
new drug development projects. In these situations, it does not make
sense to partially fund individual projects. There are no real benefits
from building just 45% of a manufacturing plant or partially
developing a new drug. In Example 13.8, we demonstrate how to
mathematically formulate a capital budgeting problem as an IP
model and use Excel’s Solver and R to find an optimal solution.
EXAMPLE 13.8
North Star Biotech is exploring six new drug development
projects. Each project has a five-year cash investment estimate
and an expected return on investment as shown in Table 13.2.
For example, to develop Drug 1, North Star has to make a cash
investment of $2 million in Year 1 and $1 million in Years 2, 3,
and 4, and no cash investment is needed in Year 5. Once
completed, the company expects a return of $12.5 million. To
develop Drug 2, the company needs to make a $2 million cash
investment in Years 1 and 2, and $1 million in Years 3, 4, and
5, and expects a return of $13.5 million.
page 597
SOLUTION:
Integer Programming Formulation
North Star needs to decide whether or not to fund each project;
therefore, each decision variable xi, where i = 1, 2, . . . , 6, has
binary values. For example, x1 = 1 if North Star invests in
developing Drug 1, and 0 otherwise. We also define ci as the
expected return on investment for Drug i. For example, the
expected return for Drug 1 is $12.5 million. For brevity, we
show the financial amounts in millions of dollars, and, hence, c1
= 12.5. We formulate the objective function as follows.
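In general form, with the expected returns ci taken from Table 13.2 (only c1 = 12.5 and c2 = 13.5 are quoted above), the objective is to maximize z = c1x1 + c2x2 + c3x3 + c4x4 + c5x5 + c6x6 = 12.5x1 + 13.5x2 + ⋯ + c6x6, subject to the requirement that, in each of the five years, the total cash invested in the funded projects does not exceed the cash available that year, with each xi equal to 0 or 1.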
a. Open the Biotech worksheet and make sure the Solver add-in
is activated. Note that cells B10:G10 represent the decision
variables that will have a value of either 0 or 1 (i.e., whether or
not each drug development project is funded). To help verify
our Excel formula, we initially have 1 in cell B10 and 0
elsewhere.
page 598
b. Cells B2:G6 store the annual cash investments
required for each project. To calculate the total cash
investment spent in Year 1, enter =SUMPRODUCT(B2:G2,
B$10:G$10) in cell H2. Verify that the value in cell H2 is 2.
Copy and paste the formula to cells H3:H6.
c. Cells B8:G8 contain the expected returns from the six
projects. To calculate the total return, enter
=SUMPRODUCT(B8:G8, B10:G10) in cell I8 (the objective
function for the IP problem). Verify that the result in cell I8 is
12.5.
d. Navigate to the Data tab and click on Solver. Enter cell I8 in
the Set Objective box. Make sure to choose the Max option to
maximize the total return value in cell I8. See Figure 13.20 for
all solver parameters.
page 599
Using R
a. If you have not already done so, install and load the lpSolve
package.
b. As in the previous examples, we specify the six objective
function coefficients, the parameters in the five annual cash
constraints, the ≤ signs, and the available annual cash
amounts on the right-hand side of the constraints. Enter:
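As a sketch only, such a specification might look like the following; the first two columns use the Drug 1 and Drug 2 cash flows quoted at the start of this example, the annual cash limits are inferred from the Summary of the Results (investment plus reported excess cash), and the Drug 3 through Drug 6 figures are placeholders rather than the values in Table 13.2.
> lp.objective <- c(12.5, 13.5, 11, 10.5, 12, 9)  # expected returns; the last four are placeholders
> lp.constraints <- matrix(c(2, 2, 1, 1, 1, 1,  # Year 1 cash required by each project
1, 2, 1, 1, 1, 1,  # Year 2
1, 1, 1, 1, 1, 1,  # Year 3
1, 1, 1, 1, 1, 1,  # Year 4
0, 1, 1, 1, 1, 1),  # Year 5; Drug 3-6 entries throughout are placeholders
nrow=5, byrow=TRUE)
> lp.directions <- c("<=", "<=", "<=", "<=", "<=")
> lp.rhs <- c(10, 8, 7, 5, 3)  # annual cash available, inferred from the summary below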
c. We use the lp function with the max option. We also add the
all.bin = TRUE parameter to specify that the decision variables
are binary. (For other types of problems where the decision
variables can take on any integer values, we would use the
all.int = TRUE parameter instead.) Enter:
> lp.output <- lp(“max”, lp.objective, lp.constraints,
lp.directions, lp.rhs, all.bin = TRUE)
d. Display the integer programming solution. Enter:
> lp.output
> lp.output$solution
Summary of the Results
Both the Excel Solver and R outputs indicate that North Star
Biotech should invest in Projects 2, 3, 5, and 6, and not in
Projects 1 and 4. Therefore, the company would make the
annual cash investments of $9, $5, $7, $5, and $2 million over
the five-year period. This investment plan results in $1 million
of excess cash in Year 1, $3 million in Year 2, and $1 million in
Year 5. Recall that for a maximization problem, these are called
slack in the linear and integer programming constraints. Based
on this recommendation, North Star Biotech would maximize its
expected total return at $57 million.
Transportation Problem
A classic example of a transportation problem involves a manager
who needs to decide which of the warehouses to use in order to
deliver products to retail stores or customers. The manager’s goal
(or the objective function) is usually to minimize the overall shipment
cost while meeting the demand at each retail store or from individual
customers. Intuitively, a transportation problem is feasible only when
the total supply (i.e., the combined capacity of all warehouses) is at
least equal to the total demand (i.e., the total amount being ordered).
In many situations, the transportation model deals with a large
bundle of goods, such as a pallet of 1,600 bottles of water or a
bundle of 300 2 × 4 wood planks. As such, it is impractical to
transport these bundles of goods in fractional units, and we use the
IP technique to solve a transportation model to obtain integer
solutions, instead of using the LP technique and rounding the
solutions to the nearest integers. Example 13.9 demonstrates how to
formulate an IP model for a simple transportation problem and use
Excel’s Solver and R to find an optimal solution.
page 600
EXAMPLE 13.9
Rainier Spring Water has two warehouses that service three
retail stores. Per week, the first warehouse can supply up to
160 pallets of bottled water and the second warehouse can
supply up to 155 pallets. The manager at Rainier Spring Water
receives a weekly order from the three retail stores for 85, 125,
and 100 pallets, respectively. Note that the combined capacity
of the two warehouses is larger than the total number of pallets
of bottled water ordered by the three stores, a required
condition for a feasible solution. The shipping cost for each
pallet varies among the warehouses and retail stores,
depending on the distance, traffic, and road conditions. As
shown in Table 13.3, to ship one pallet of bottled water from
Warehouse 1 to the three stores, it would cost $4.15, $5.95,
and $6.25, respectively. Likewise, to ship one pallet from
Warehouse 2 to the three stores, it would cost $3.75, $4.25,
and $8.25, respectively.
page 601
Using Excel Solver
a. Open the Rainier worksheet and make sure the Solver add-in
is activated. Note that cells B5:D6 represent the decision
variables and will have only integer values (i.e., the number of
pallets of bottled water to ship from a warehouse to a retail
store). Initially these cells are blank.
b. We use cells E5 and E6 to show the total number of pallets
shipped from the two warehouses, respectively. Enter
=SUM(B5:D5) in cell E5. Copy and paste the formula to cell
E6.
c. We use cells B7:D7 to show the total number of pallets
shipped to the three stores, respectively. Enter =SUM(B5:B6)
in cell B7, and copy and paste it to cells C7 and D7.
d. We use cell F13 to calculate the total shipping cost (the
objective function), based on the per-unit cost in cells B13:D14
and the number of units shipped in cells B5:D6. Enter
=SUMPRODUCT(B13:D14, B5:D6) in cell F13.
e. Navigate to the Data tab and click on Solver. Enter cell F13 in
the Set Objective box. Make sure to choose the Min option to
minimize the total shipping cost. See Figure 13.22 for all
solver parameters.
Using R
a. If you have not already done so, install and load the lpSolve
package. The lpSolve package includes the lp.transport
function developed especially for solving transportation
problems.
b. We store the per-pallet shipping costs (shown in Table 13.3) in
an object called unit.costs, using the matrix function. Enter:
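The matrix itself is not listed above; a sketch consistent with the per-pallet costs quoted from Table 13.3, with one row per warehouse and one column per retail store, is:
> unit.costs <- matrix(c(4.15, 5.95, 6.25,  # Warehouse 1 to Stores 1, 2, 3
3.75, 4.25, 8.25), nrow=2, byrow=TRUE)  # Warehouse 2 to Stores 1, 2, 3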
c. We specify the signs and order amount from the retail stores
for the three demand constraints. Enter:
> order.signs <- c(“>=”, “>=”, “>=”)
> order.amount <- c(85, 125, 100)
d. We specify the signs and the capacity of the two warehouses
for the two capacity constraints. Enter:
> capacity.signs <- c(“<=”, “<=”)
> capacity.limits <- c(160, 155)
e. As mentioned above, we use the lp.transport function to
obtain the optimal solution for transportation problems. By
default, the lp.transport function assumes that all decision
variables are integers. We use the min option to minimize the
total shipping cost and include the shipping cost matrix and all
the constraints. Enter:
> lp.output <- lp.transport(unit.costs, “min”, capacity.signs,
capacity.limits, order.signs, order.amount)
f. To display the optimal shipping cost and solution to the
transportation model, we use the following statements. Enter:
> lp.output
> lp.output$solution
Summary of the Results
The Excel Solver and R outputs indicate that from Warehouse
1, the manager at Rainier Spring Water should ship 55 pallets
of bottled water to Retail Store 1 and 100 pallets to Retail Store
3. From Warehouse 2, the manager should ship 30 pallets to
Retail Store 1 and 125 pallets to Retail Store 2.
This shipment schedule would fulfill the orders
from all three stores. Warehouse 1 would deliver 155 out of the
160 pallets available in its weekly inventory, whereas
Warehouse 2 would use up all of its weekly bottled water
inventory. The five remaining pallets in Warehouse 1 are
available for shipping to other stores. Recall that in a
minimization problem, the excess quantity at Warehouse 1 is
called a surplus. With this shipment plan, Rainier Spring Water
will incur the minimum overall shipment cost of $1,497 per
week.
EXERCISES 13.4
Mechanics
37. Consider the following IP problem.
Subject to:
where x1, x2, and x3 represent the decision variables. Solve the IP problem to
answer the following questions.
a. What are the values of x1, x2, and x3 at the optimal solution?
b. What is the maximum value of z?
38. Consider the following IP problem.
Subject to:
where xij represent the decision variables. Solve the IP problem to answer the
following questions.
a. What are the values of xij at the optimal solution?
b. What is the minimum value of z?
Applications
39. An online apparel retailer runs regular marketing campaigns on social media
channels. The retailer is considering four social media marketing campaigns on
Facebook, Instagram, Pinterest, and Twitter for the four weeks prior to the
December holiday season. However, due to its limited marketing budget, the
retailer cannot run all four proposed campaigns. The following table describes the
weekly cost (in $1,000s) and the expected numbers of consumers each campaign
will reach (in 1,000s).
a. Formulate and solve the IP model to determine which of the four marketing
campaigns the retailer should run in order to maximize the number of
consumers reached, without going over its budget. What is the maximum
number of consumers that the retailer can reach?
b. Which social media channels should the retailer use in order to maximize the
number of consumers reached?
c. Will the retailer use up all of its marketing budget each week?
40. Akkadian Capital, a real estate investment firm, owns five apartment buildings
around a four-year college campus. The five buildings need significant repairs and
renovation, and each renovation project will take about three months to complete.
The company expects each renovated apartment building to generate a
substantial rental income from college students. However, Akkadian Capital has a
limited budget during the next three months and cannot repair and remodel all five
buildings. The remodeling cost and the expected annual income after expenses
from each apartment are shown in the following table. Also included in the table is
Akkadian’s budget for the next three months.
Case Study
Demand for organic food has grown steadily over the years. Organic strawberry
farming is one of the segments in the food and agriculture industry that is projected
to continue expanding in the foreseeable future. Estuary Organic, a fruit farming
company on the coast of California, struggles to reduce its production cost while
staying true to its mission of offering fully certified organic produce to its customers.
To address the cost issue, Estuary Organic hires Professor Tom Richards as a
consultant to review different types of organic fertilizers that the company has used.
His task is to find a combination of organic fertilizers that can meet the three primary
nutrient requirements in the soil, nitrogen (N), phosphate (P2O5), and potassium
(K2O). At the same time, Professor Richards needs to take into consideration the
cost of certified organic fertilizers in his analysis.
For each strawberry planting season, growers need to have the right combination
of nitrogen, phosphate, and potassium in the soil. Adding too much or too little of
these nutrients is detrimental to the plant growth and its ability to bear fruit. This
combination of N, P2O5, and K2O nutrients is often called the NPK value. Generally,
the minimum amount of NPK per acre is 100 lbs, 50 lbs, and 50 lbs, respectively.
Based on the current soil condition and soil testing results, Professor Richards
also determines that the maximum amount of NPK per acre is 125 lbs, 55 lbs, and 55
lbs, respectively. Some of the fertilizers commonly used in the local area and by
Estuary Organic include alfalfa meal (typically made from fermented alfalfa), soybean
meal, fish meal (typically pollock, mackerel, and/or anchovies), and compost made
from animal manure. The cost and NPK composition of each fertilizer is listed in
Table 13.4.
TABLE 13.4 Organic Fertilizers, Their Cost, and Nutrient Contents
page 605
The goal of this consulting engagement is to provide Estuary Organic a
recommended combination of fertilizers that meet the NPK requirements at a minimal
cost. After analyzing relevant information, Professor Richards submits the following
report to the Estuary Organic owner.
Subject to:
The four decision variables, x1, x2, x3, and x4, represent the amount of
alfalfa meal, soybean meal, fish meal, and compost (in pounds) to be mixed
into the fertilizer compound. We also convert the cost of each fertilizer into a
per-pound basis (e.g., $0.28 per pound for the alfalfa meal, $0.46 per pound
for the soybean meal, and so on). The cost-per-pound values are used as the
objective function coefficients.
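A minimal lpSolve sketch of the structure described in the report follows. The minimum and maximum NPK amounts per acre and the first two cost coefficients come from the case; the per-pound nutrient fractions and the last two cost coefficients are placeholders because Table 13.4 is not reproduced here.
> lp.objective <- c(0.28, 0.46, 0.60, 0.05)  # cost per pound; the last two values are placeholders
> lp.constraints <- matrix(c(0.025, 0.070, 0.100, 0.010,  # N per pound of each fertilizer (placeholders)
0.010, 0.020, 0.060, 0.010,  # P2O5 per pound (placeholders)
0.020, 0.010, 0.020, 0.010,  # K2O per pound (placeholders)
0.025, 0.070, 0.100, 0.010,  # N again, for the maximum constraints
0.010, 0.020, 0.060, 0.010,  # P2O5 again
0.020, 0.010, 0.020, 0.010),  # K2O again
nrow=6, byrow=TRUE)
> lp.directions <- c(">=", ">=", ">=", "<=", "<=", "<=")
> lp.rhs <- c(100, 50, 50, 125, 55, 55)  # minimum and maximum NPK per acre (from the case)
> lp.output <- lp("min", lp.objective, lp.constraints, lp.directions, lp.rhs)
> lp.output
> lp.output$solution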
Based on the results obtained from the linear programming formulation,
we recommend the mix of organic fertilizers for each acre of land as shown in
Table 13.5.
Organic Fertilizers | Recommended Amount (lbs) | Cost
Report 13.1. A grocery store manager wants to analyze customer spending data by
product categories: fresh baked goods, meat and dairy, produce, and frozen food.
For each shopping trip, about 30% of shoppers purchase baked good items, and the
spending in this category tends to follow a continuous uniform distribution between
$3 and $19. For the meat and dairy products, 70% of shoppers make regular
purchases from this category, and their spending is normally distributed with a mean
of $21 and standard deviation of $5.27. Eighty percent of the shoppers spend an
average of $15 on produce, and their spending follows a normal distribution with a
standard deviation of $2.31. Sixty-five percent of shoppers purchase at least an item
from the frozen food aisles; the spending amount in this category follows a uniform
distribution between $7.25 and $28.50. On average, approximately 220 customers
make a trip to the grocery store each day. Develop a Monte Carlo simulation to
analyze the customer spending and the fluctuation in daily revenue.
Report 13.2. An oil and gas company has two refineries that produce light and heavy
crude oil. The first refinery can produce 500 barrels of light crude oil and 300 barrels
of heavy crude oil per day, and the second refinery can produce 600 barrels of light
crude oil and 450 barrels of heavy crude oil per day. The daily operating cost of each
refinery is $15,000 and $20,000, respectively. The company has to fulfill production
orders totaling 3,200 barrels of light crude oil and 2,100 barrels of heavy crude oil.
Analyze the information and recommend an appropriate production plan.
Report 13.3. South Bay Candles makes two types of scented candles: 10-inch pillars
and 2-inch decorative gel. The production requirements for each product and
available resources are shown in Table 13.6. The sales price for each pillar and gel
candle is $2.15 and $3.55, respectively. Analyze the data provided and create a
report to recommend how many pillar and gel candles South Bay should make.
TABLE 13.6 Product Information for South Bay Candles
page 607
page 608
APPENDIX A
Variable Name  Description or Possible Values
Weekday 1 – Monday
2 – Tuesday
3 – Wednesday
4 – Thursday
5 – Friday
6 – Saturday
7 – Sunday
Month 1 – January
2 – February
3 – March
4 – April
5 – May
6 – June
7 – July
8 – August
9 – September
10 – October
11 – November
12 – December
CrashType A – Head-On
B – Sideswipe
C – Rear End
D – Broadside
E – Hit Object
F – Overturned
G – Vehicle/Pedestrian
Highway 1 – Highway
0 – Not highway
Daylight 1 – Daylight
0 – Not daylight
page 609
Variable Name  Description or Possible Values
Gender M – Male
F – Female
Variable Name  Description or Possible Values
White 1 – White
0 – Not White
Asian 1 – Asian
0 – Not Asian
page 610
page 611
Black 1 – Black
0 – Non-Black
Hispanic 1 – Hispanic
0 – Non-Hispanic
White 1 – White
0 – Non-White
Christian As of 1979,
1 – Christian
0 – Non-Christian
Male 1 – Male
0 – Female
page 612
Position C – Center
PF – Power Forward
SF – Small Forward
PG – Point Guard
SG – Shooting Guard
page 613
Variable Name  Description or Possible Values
Female 1 – female
0 – otherwise
APPENDIX B
Formulas
In Excel, we use formulas to perform basic calculations. When we
enter a formula in a cell, Excel carries out the specified calculation
and returns the result in the same cell. We also use formulas to
manipulate the cell content such as rounding a number. A formula in
Excel always starts with an equal sign (=) and usually includes cell
addresses. A cell address or cell reference consists of a column
name and a row number. For example, cell reference A1 refers to the
top and leftmost cell in column A and row 1. Basic calculations can be
performed using arithmetic operations such as addition ( + ),
subtraction ( − ), multiplication ( * ), division ( / ), and exponentiation (
^ ). For example, we select an empty cell and use the formula
=A1+B1+C1 to add values from cells A1, B1, and C1, and the formula
=A1^2 to square the value in cell A1.
page 615
Functions
Functions in Excel are predefined formulas. Like a formula, a function
always begins with an equal sign (=) and must be written with the
correct syntax enclosed within parentheses. Most functions require at
least one argument. Arguments are the values that Excel uses to
perform calculations. For example, the COUNT function is used to
count the number of cells that contain numerical values and has the
syntax =COUNT(A1:A10), where A1:A10 is the argument indicating
the array of cells to be counted. Table B.2 provides a summary of
some of the most basic descriptive functions in Excel. The notation
array in the function’s argument in the table specifies the range of cell
addresses to be included in the calculation.
page 616
Function and Syntax  Description  Example
page 617
EXAMPLE B.1
FILE
Consulting
page 618
FIGURE B.1 Salary and Tax Information of Employees
page 620
Analytic Solver
Analytic Solver (formerly known as XLMiner) is an add-in software
that runs on Microsoft Excel. It offers a comprehensive suite of
statistics and data mining tools, including data wrangling, data
partitioning, and supervised and unsupervised data mining methods.
The Interface
Once Analytic Solver is installed, launch Excel and verify that you can
see the Analytic Solver, Data Mining, and Solver Home tabs. In this
text, we focus primarily on the features in the Data Mining tab. Figure
B.3 shows the features on the Data Mining tab. Examples and
exercise problems in this text are developed based on Analytic Solver
2019. Other versions of Analytic Solver may have a different user
interface and display the output in a different format.
FIGURE B.3 Data Mining Tab in Analytic Solver 2019 for Excel
APPENDIX C
What is RStudio?
RStudio is a program that makes R easier to use. On its own, R acts
like a programming language and, as such, comes with a minimal
user interface. As standalone software, R shows a single prompt for
you to enter commands; this is called the Console. While everything
we will ever need from R can be done by combining Console
commands with other programs, things can quickly get messy. To
make coding in R easier, we use an integrated development
environment (IDE). IDEs are programs that combine in one place
many common features needed in programming and give them a
graphical user interface.1 In this text, we use an open source version
of an IDE called RStudio, which is very popular among students,
professionals, and researchers who use R.
Installation
Installation of both R and RStudio is straightforward and requires no
special modifications to your system. However, it should be noted
that RStudio does not come with R; therefore, both pieces of
software need to be installed separately. Many R functions for data
analysis are included in free libraries of code written by R users,
called packages. R packages are installed separately and will be
discussed at the end of this Appendix.
page 622
Installing RStudio
A. Navigate to https://www.rstudio.com/products/rstudio/.
B. Select RStudio Desktop, then select Download RStudio Desktop.
C. Scroll down to the Installers for Supported Platforms section, select
the link that corresponds to your operating system, and then select
Open or Run.
D. Select Yes when asked about verifying the software publisher.
E. Follow the instructions in the RStudio Setup window.
The Interface
Installation should now be complete. You can close all windows and
then double-click on the RStudio icon.
The RStudio interface consists of several panes. By default, three
panes are visible. We will refer to these by the names of the default
tab shown in each: Console, Environment, and Help. We will also
briefly discuss the Source pane, which is hidden until you open it.
Figure C.1 shows what you should see when you open RStudio for
the first time.
Console pane: The Console pane is the primary way that you
interact with R. It is here that you input commands (at the >
prompt) and then view most of your output.
Environment pane: Two relevant tabs in the Environment pane
are: Environment and History. A common feature between them is
the broom icon, which clears the content of each tab. The
Environment tab shows the data, objects, and variables in the
current R session. The History tab provides a list of all console
commands issued in the session.
Help pane (or Files pane): The help section has five tabs. We
discuss two of these here: Help and Plots.
The Help tab is where you can view R documentation (help files).
For example, to learn about the print function, select the Help
tab and then enter print next to the magnifying glass icon. (You
can also view R documentation by entering a question mark
followed immediately by the topic of interest in the Console pane;
so, for this example, you would enter ?print after the prompt.)
The Plots tab is where you can see all graphs and charts. Any
graph or chart can be cleared with the broom icon.
The Source pane: The Source pane is hidden by default in R.
This is where you can write your own scripts. As you will see, most
of what we do in this text can be accomplished by importing a data
set and then using a single command in the Console. Nonetheless,
here is an example of how you would write a simple script:
A. From the menu, select File > New File > R Script
B. In the new window, enter the following:
page 623
print("This is my first script.")
print("This is easy!")
C. Save the script with File > Save As. Name your script Script1.
Figure C.2 shows what you should see in the Source pane.
You should see Example_1 listed in the Environment pane. You can
view the data in the Console pane by entering Example_1 after the
prompt. Additionally, you can use the View function and the data will
appear in the Source pane. (Note that R is case sensitive.) We enter:
View(Example_1)
page 624
> mean(Example_1)
And R returns: −0.6666667.
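The command that created Example_1 is not shown here. As an illustration only, any numeric vector with that mean would behave the same way; for instance:
> Example_1 <- c(-3, 1, 0)  # a hypothetical three-value vector whose mean is -0.6666667
> mean(Example_1)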
page 625
See Figure C.5. We select the Browse button and then navigate
to the Admission data in the Data folder. Once we select the
Admission data, we should see the data in the Data Preview dialog
box. In the R instructions in this text, we label all data files as
myData for simplicity and consistency. Because of this, in the Import
Options dialog box, replace Admission with myData. Once you select
the Import button (see the bottom of Figure C.5), you have
successfully imported the data. You can verify this in a number of
ways. For instance, you should now see myData in the Environment
pane under Data, or you can enter View(myData) in the Console pane
and a portion of the data will appear in the Source pane.
Note: For an Excel file with multiple worksheets, select the appropriate worksheet from the
Sheet drop-down option.
> mean(myData$SAT)
> summary(myData[,c(3,5)])
> rm(myData)
You will find that myData no longer appears under Data in the
Environment pane.
page 626
Entry 2:
Packages
Part of what makes R so powerful is its large collection of packages,
or collections of objects not included in the base version. Packages
greatly expand what can be done with R by adding custom functions
and data structures.
To use a package, you must install it and then load it. We use the
caret package, which stands for Classification and Regression
Training, to demonstrate how this is done:
> install.packages("caret")
> library(caret)
Several packages can also be installed with a single statement by supplying a
character vector of package names. Enter:
> install.packages(c("caret", "gains", "pROC"))
page 627
However, you still need to use the library function separately for
each package as follows:
> library(caret)
> library(gains)
> library(pROC)
APPENDIX D
Statistical Tables
TABLE D.1 Standard Normal Curve Areas
Entries in this table provide cumulative probabilities, that is, the area under the curve to the
left of −z. For example, P(Z ≤ −1.52) = 0.0643.
Source: Probabilities calculated with Excel.
page 629
Entries in this table provide cumulative probabilities, that is, the area under the curve to the
left of z. For example, P(Z ≤ 1.52) = 0.9357.
Source: Probabilities calculated with Excel.
page 630
APPENDIX E
Chapter 2
2.4 Choice c is the correct definition of a foreign key.
2.6 Choices b, c, and d correctly describe SQL. Choice a describes NoSQL.
2.12 a. 1 of the 10 highest income earners is married and always exercises.
b. 9 individuals are married, exercise often, and earn more than $110,000 per year.
c. 5 values are missing for Exercise, 2 for Marriage, and 3 for Income.
d. 281 individuals are married, and 134 are not.
e. 69 married individuals always exercise, and 74 unmarried individuals never
exercise.
2.18 a. There are no missing values in x2, x3, and x4.
b. The remaining data set has 35 observations. The average values for x3 and x4 are
170.6571 and 21.2286 respectively.
2.24 a. Variables x1, x3, x4, and x5 all have at least one missing value.
b. Observations 15, 24, 33, 34, 43, 63, 66, 78, and 95 have missing values.
c. There are 9 missing values in the data set.
d. 2 missing values of x2 were replaced.
e. After replacing missing values of x3, x4, and x5, their means and median in the new
data set are 3.8832, 1260, and 3.1330 respectively.
f. 43 observations remain in the data set.
2.26 a. There are missing values in the variable Travel Plan. All other variables are
complete.
b. 300 observations were removed.
2.28 a. There are 3 missing values in the data set. In the Yards variable, the missing value
is for observation 25. In the Attempts variable, the missing value is for observation
28. In the Interceptions variable, the missing value is for observation 30. All other
variables and observations are complete.
b. Using the omission strategy, 3 observations were removed.
2.32 a. All variables other than Name and Market Cap have missing values. Observation
20 is missing for the variable Price; observation 173 is missing for the variable
Dividend; observation 38 is missing for the variable PE; observation 26 is missing
for the variable EPS; observation 98 is missing for the variable Book Value;
observations 26 and 154 are missing for the variable 52 week low; observation 46 is
missing for the variable 52 week high; and observation 51 is missing for the variable
EBITDA. There are 9 missing values in the data set.
b. 8 observations were removed.
c. Missing values for variable PE, EPS, and EBITDA were replaced with 21.915,
3.255, and 1.845, respectively. There were no missing values for Market Cap,
whose median value is 21.17.
2.34 a. Observation 26 has missing values for the Siblings variable, observation 13 is
missing for the variable Height, observation 47 is missing for the variable Weight,
and observations 17 and 51 have missing values for the Income variable. All other
variables are complete. In the data set, there are 5 missing values.
b. 3 observations were removed.
c. Observation 13 has a missing value for the Height variable, and observation 47 has
a missing value for the Weight variable. The missing value for Height was replaced
with its mean, 67.3088, and the missing value for Weight was replaced with its
mean, 150.8971. The missing value for Siblings was replaced with its median, 2,
and the missing values for Income were replaced with its median, 33,361. There
were no missing values for FamilySize, whose median is 4.
page 633
2.38 a. There are 41 observations in the data set. After binning, four of the five bins should
have 8 observations, and the remaining bin should have 9 observations.
b. The number of observations with a score of 431 is: 0. The number of observations
with a score of 222 is: 2.
2.40 a. The average of Difference is −1.91.
b. The average PercentDifference is 0.2273.
c. The average of Log is 12.84.
2.42 a. The average difference in year is 29.01.
b. The average month value is 6.
c. Group 3 has the highest number of observations.
2.46 a. Number of customers assigned to Group 4 is 26.
b. Number of customers assigned to Group 3 is 21.
c. Number of customers assigned to Group 2 is 36.
d. The average difference is 124.7.
e. The average percentage difference is 0.5642.
f. The average age of the players is 28.70 years.
g. September is the most frequent birth month.
2.50 a. There are 23 observations in the “Other” category.
b. The average category score for x2 is 3.6200.
c. Three dummy variables should be created.
2.52 a. 15 observations have a category score of 3.
b. There are 3 observations in the “Other” category.
c. Four dummy variables should be created.
2.54 a. The variables LoanType, PropertyType, and Purpose are nominal data because
they do not have naturally ordered categories.
b. Conventional is the most frequent category for LoanType. Multi-family is the most
frequent category for PropertyType. Refinancing is the most frequent category for
Purpose.
c. Three dummy variables should be created. Conventional should be the reference
category of LoanType. Multi-family should be the reference category of
PropertyType. Refinancing should be the reference category of Purpose.
2.56 a. One dummy variable should be created.
b. The average damage score of the cell towers is 0.7959.
Chapter 3
3.2 a. The proportion of the sales for medium-sized shirts was 0.302.
b. Sales of large-sized shirts had the highest frequency and sales of small-sized shirts
had the lowest frequency.
3.4 a. 19.3% of people in the Midwest are living below the poverty level.
b. The South has the highest relative frequency as compared to the other three
regions, which are roughly equal.
3.6 a. A rating of 5 has the highest frequency.
b. The higher ratings have the higher frequencies.
3.8 a. Not Religious is the most common response.
b. About 35% responded “Not Religious” which is consistent with the Pew Research
survey.
3.10 a. 125 stocks had returns of at least 10% but less than 20%.
b. The distribution is symmetric.
3.14 a. No. The distribution is not symmetric. It is positively skewed.
b. Over this time period, the stock price was between $50 and $250.
c. The $100 up to $150 interval has the highest relative frequency, which is about
0.44.
3.16 a. 14 customers spent between $700 and $999.
b. 52 customers spent $1,300 or less; 48 customers spent more than $1,300.
3.18 a. The DJIA was more than 26,000 on 44 days in the first half of 2019.
b. The distribution is not symmetric; it is negatively skewed.
3.20 It does not; by using a relatively high value as an upper limit on the vertical axis
($500), the rise in stock price appears dampened.
3.22 a. 202 of the customers were male; 60 of the customers drank wine.
b. 142/202 = 0.7030; 38/68 = 0.5588.
c. Beer is the most popular drink at this bar, followed by wine, and then soft drinks. Both
men and women are more likely to choose beer over the other two options.
3.24 a. 5 components from shift 1 were defective; 94 components from shift 2 were not
defective.
b. 6/24 = 0.2500; 13/24 = 0.5417; components constructed during shift 3 seem to be
defective at a higher rate.
c. No, defective rates are not consistent over the shifts.
3.26 a. 120 students are business majors; 68 students study hard.
b. 20/120 = 0.1667; 48/150 = 0.3200; the data suggest that nonbusiness majors are
more likely to study hard, which supports the report.
c. The majority of both business and nonbusiness students do not study hard, but
nonbusiness students are more likely to study hard.
3.28 Negative relationship between obesity and life expectancy.
3.30 No relationship between the returns of A and B, so investing in both would diversify
risk.
3.32 a. Negative relationship between the price of a car and its age.
b. Negative relationship between the price of a car and its mileage.
3.34 Both countries have an upward trend, but China’s begins to stall slightly around 2000.
3.36 a. Age is negatively correlated with price and positively correlated with mileage. Price
is negatively correlated with mileage.
b. Seven cars have mileage greater than 50,000.
c. The negative relationship between age and price is consistent for cars of both
mileage categories.
3.38 a. There are 511 cases of a burglary in a residence.
b. Theft on the street is the most common crime, followed by narcotics on a sidewalk,
then motor vehicle theft on the street.
page 634
3.40
a. HD median = 77,349
b. s = 124.6086. The smallest and the largest observations for the Debt
Since the absolute value of both z-scores is less than 3, we conclude that there are
no outliers for the Debt variable. This is consistent with the boxplot, which showed
no outliers.
3.64 a. The Technology boxplot suggests that there are outliers in the upper part of the
distribution.
Since the z-score for the largest observation is greater than 3, we conclude that
there are outliers for Technology. This is consistent with the boxplot, which showed
outliers in the upper part of the distribution.
c. The Energy boxplot suggests that there is an outlier in the lower part of the
distribution.
Since the absolute value of both z-scores is less than 3, we conclude that there are
no outliers for Energy. This is not consistent with the boxplot, which showed an
outlier in the lower part of the distribution. Since the boxplot indicates that Energy is
not symmetric, we are better served identifying outliers in this case with a boxplot.
e.
Chapter 4
4.2 a. Not exhaustive because you may not get any offer.
b. Not mutually exclusive because you may get both offers.
4.4 Let event A correspond to “Firm raising an alarm”, and event F to “Fraudulent
Transaction”. We have P(A) = 0.05, P(A|F) = 0.80, and P(F) = 0.01.
P(Ac ∩ F) = 0.002, and therefore, P(F|Ac) = 0.0021
4.6 a. P(A) = 0.70, P(Ac) = 0.30, P(B) = 0.50
b. Not mutually exclusive
c. A ∩ B
4.10 Let event Ai be “the i-th selected member is in favor of the bonus”.
a. P(A1 ∩ A2) = 0.4286
b.
page 635
a. P(C ∩ U) = 0.1634
b. P(U|C) = 0.2635
4.16 Let event D be “Experience a decline”, and event N be “Ratio is negative”. We have
P(D) = 0.20, P(N|D) = 0.70, and P(N|Dc) = 0.15.
P(N) = 0.26, and therefore, P(D|N) = 0.54
4.18 Let event O correspond to “obese”, W to “white”, B to “black”, H to “Hispanic”, and A
to “Asian”. We have P(O|W) = 0.33, P(O|B) = 0.496, P(O|H) = 0.43, P(O|A) = 0.089,
P(W) = 0.48, P(B) = 0.19, P(H) = 0.26, and P(A) = 0.07.
a. P(O) = 0.3707
b. P(W|O) = 0.4273
c. P(B|O) = 0.2542
d. P(A|O) = 0.0168
4.20 Let F = “Player is fully fit to play”, S = “Player is somewhat fit to play”, N = “Player is
not able to play”, and W = “The Lakers win the game”. We have P(F) = 0.40, P(S) =
0.30, P(N) = 0.30, P(W|F) = 0.80, P(W|S) = 0.60, and P(W|N) = 0.40.
a. P(W) = 0.62
b. P(F|W) = 0.52
4.22 Let event R correspond to “Republican”, D to “Democrat”, I to “Independent”, and S to
“Support marijuana legalization”. We have P(R) = 0.27, P(D) = 0.30, P(I) = 0.43,
P(S|R) = 0.41, P(S|D) = 0.66, and P(S|I) = 0.63.
a. P(S ∩ R) = 0.1107
b. P(S ∩ D) = 0.1980
c. P(S ∩ I) = 0.2709
d. P(S) = 0.5796
e. P(R|S) = 0.1910
4.26 Let X be the amount spent on a warranty and Y be the revenue earned by the store.
E(X) = $30
E(Y) = $3,600
4.32 a. P(X < 2) = 0.6517
b. P(X < 2) = 0.4580
4.34 a. E(X) = 905; SD(X) = 27.22
b. E(X) = 1430; SD(X) = 31.95
4.36 a. P(X > 2) = 0.5276
b. P(X > 2) = 0.1362
4.38 Let X be the number of designers who show the acceptable design.
a. P(X ≥ 1) = 0.4375 < 0.50; statement is not correct
b. P(X ≥ 1) = 0.5781 > 0.50; statement is correct
4.40 Let X be the number of attendees whom the manager contacts.
a. P(X = 10) = 0.1171
b. P(X ≤ 10) = 0.8725
c. P(X ≥ 15) = 0.0016
4.42 a. P(X ≤ 4) = 0.8153
b. P(X ≥ 3) = 0.5768
4.44 a. μ1 = 6; P(X = 2) = 0.0446
b. μ1 = 6; P(X ≥ 2) = 0.9826
c. μ10 = 60; P(X = 40) = 0.001
4.46 a. μ1 = 2; P(X > 2) = 0.3233
b. μ5 = 10; P(X = 6) = 0.0631
c. μ180 = 360
4.50 a. μ = 304; P(X > 320) = 0.1717
b. μ = 2128; P(X > 2200) = 0.0586
4.52 Let X equal points scored in a game.
a. P(85 < X < 125) = 0.9544
b. P(X > 125) = 0.0228; approximately 2 games (82 × 0.0228 = 1.87)
4.54 Let X represent the mpg rating of passenger cars.
a. P(X ≥ 40) = 0.0384
b. P(30 ≤ X ≤ 35) = 0.4952
c. x = 33.8 + 2.326(3.5) = 41.94
4.56 Let X represent the return on a portfolio.
P(X > 16) = 0.2514 ≠ 0.15; not normal.
4.62 Let X equal the length of time of a football game.
a. P(X < 2.5) = 0.1056
b. P(X < 2.5) + P(X > 3.5) = 0.2112
c. x = 3 − 2.326(0.4) = 2.07
Chapter 5
5.4
a.
b.
c.
For n = 100,
You would choose 50 balls because with larger sample sizes the standard deviation
of the sample proportion is reduced. The probability of getting 70% green balls is slightly higher with a
page 636
5.26 a. microeconomics: [68.74, 74.91]; macroeconomics: [66.16, 74.64]
b. The widths are different because the sample standard deviations for
microeconomics and macroeconomics are different.
5.28 a. For research expenditure: [231.44, 373.49]
b. For duration: [18.40, 22.60]
5.30 a. 0.37 ± 0.011 or [0.359, 0.381]
b. 0.37 ± 0.017 or [0.353, 0.387]
c. The margin of error in part b is greater because it uses a higher confidence level.
5.32 a. 0.47 ± 0.025 or [0.445, 0.495].
b. 0.025
c. 0.038
5.34 a. 0.275 ± 0.037 or [0.238, 0.312]
b. No, because the value 0.30 falls in the interval.
5.38 a. Type I error; the new software is purchased even though it does not reduce
assembly costs.
b. Type II error; the new software is not purchased even though it reduces assembly
costs.
5.40 a. H0: μ ≤ 5; HA: μ > 5
b. t6 = 0.643; p-value = 0.272. It is necessary to assume that the population is
normally distributed.
c. Because 0.272 > 0.10, do not reject H0. The average waiting time is not more than
5 minutes at the 10% level.
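A one-line R check of the p-value in part b:
pt(0.643, df = 6, lower.tail = FALSE)   # right-tailed p-value, approximately 0.272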
5.42 H0: μ ≤ 6.6; HA: μ > 6.6; t35 = 2.7; p-value = 0.0053. Because 0.0053 < 0.05, reject
H0. At the 5% significance level, he can conclude that the mean increase in home
prices in the West is greater than the increase in the Midwest.
5.44 a. H0: μ = 50; HA: μ ≠ 50
b. t49 = −2.324; p-value = 0.024
c. Because 0.024 < 0.05, reject H0. At the 5% significance level, we conclude that the average differs from 50.
5.46 a. H0: μ ≤ 65; HA: μ > 65
b. t39 = 2.105; p-value = 0.021.
c. Because 0.021 > 0.01, do not reject H0. At the 1% significance level, we cannot conclude that the average speed is greater than the stated speed limit of 65 mph.
5.52 H0: p ≤ 0.60; HA: p > 0.60; z = 1.04; p-value = 0.1492. Because 0.1492 > 0.01, do not
reject H0. At the 1% significance level, we cannot conclude that more than 60% of
seniors have made serious adjustments to their lifestyle.
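A one-line R check of the p-value:
pnorm(1.04, lower.tail = FALSE)   # right-tailed p-value = 0.1492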
Chapter 6
6.2 a. The positive sign for the Poverty coefficient is as expected; the slope coefficient for
Income is not as expected.
b.
b. 3.35
6.8
a.
b.
c. $28,279.05.
6.14
a.
b. For every additional $1,000,000 in drive-through sales, net profits are predicted to
increase by $112,300, holding counter sales constant.
c. 0.7647; or net profits are predicted to be $764,700 when counter sales are
$6,000,000 and drive-through sales are $4,000,000.
6.16
a.
b. 1.48 startups
c. A $1 million increase in research expenditure results in a predicted increase in the
number of startups by 0.0087, holding everything else constant. Approximately $114.94 million is needed to have 1 additional predicted startup, everything else being the same.
Note that $114.94 × 0.0087 equals (approximately) 1.
6.18
a. R2 = 0.8299
b. 0.1701
6.22 Model 2 is a better fit since it has a smaller standard error and a higher adjusted R2.
We cannot use R2 for comparison because the models have different numbers of
predictor variables.
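Adjusted R2 penalizes additional predictors, which is why it, and not R2, is used to compare models with different numbers of predictor variables. A small R sketch of the formula with purely illustrative values:
# Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where k is the number of predictors
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)
adj_r2(r2 = 0.80, n = 50, k = 3)   # hypothetical values, for illustration only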
6.24 a. se = 0.2144.
b. 0.7573; 75.73% of the variability in net profits is explained by the model.
c.
6.32
a.
b. Because 0.077
< 0.10, reject H0. At the 10% significance level, the two explanatory variables are
jointly significant.
c. H0: β1 = 0; HA: β1 ≠ 0; p-value = 0.738. Because 0.738 > 0.10, do not reject H0. At
the 10% significance level, P/E is not significant in explaining Return.
H0 : β2 = 0; HA: β2 ≠ 0; p-value = 0.025. Because 0.025 < 0.10, reject H0. At the
10% significance level, P/S is significant in explaining Return.
6.34
a.
b.
page 637
Because 0.006 < 0.05, reject H0. At the 5% significance level, an
extra second of Time decreases Watches by more than 0.02.
6.36
a.
significance level, the predictor variables are jointly significant in explaining the
electricity costs.
c. At the 10% significance level, the average temperature is significant in explaining
the electricity costs, the number of work days is not significant, and the tons
produced is not significant.
6.42
a. At the 5% significance level, BMI
influences salary.
b. White college-educated man: 38.01; non-white college-educated man: 33.52.
6.44
a.
b. 14.13 minutes
c. 85.30%
d. At the 5% significance level, the predictor variables are jointly significant. At the 5%
significance level, each predictor variable is individually significant.
6.46
a.
b. As the number of bays increases by 1 bay, the number of vehicles served per month
is predicted to increase by 23.5056, holding other variables constant; as the
population in a 5-mile radius increases by 1,000 people, the number of vehicles
served per month is predicted to increase by 0.5955, holding other variables
constant. When changing to convenient interstate access, the number of vehicles
served per month is predicted to increase by 84.5998, holding other variables
constant. When changing to the winter, the number of vehicles served per month is
predicted to increase by 77.4646, holding other variables constant.
c. At the 5% (and 10%) significance level, the predictor variables are jointly significant
in explaining the number of vehicles served per month. At the 5% significance level,
Access and Winter are significant, but Garage Bays and Population are not
significant. At the 10% significance level, Access, Winter, and Garage Bays are
significant, but Population is not significant.
d. 0.8739
e. 361.34 vehicles
6.48
a.
b. $1,057,485.47
c. At the 5% significance level, all the quarters are significantly different from the
fourth quarter.
d. At the 5% significance level, we cannot conclude that the sales differ between
quarters 2 and 3.
6.50 Create dummy variables for White, Black, Asian, and Hispanic students using the
Ethnicity variable.
a.
Chapter 7
7.8
a.
level.
b. 80% (Just Woman); 55% (Woman and Hispanic Man)
7.10
a. Rural + 20.4745 College + 49.8701
(Rural × College); predictor variables are jointly and, except for College, individually
significant at the 5% level.
b. 160.64 (Rural); 153.10 (Urban)
page 638
7.12
a.
Train + 0.9785 (Exper × Train)
b. The model with the interaction term is preferred because it has a higher adjusted
R2 and the interaction variable is significant at the 10% level.
c. 12.46 (10 years); 8.79 (20 years)
d. Less experienced employees benefit more from the training program (reduced pick
errors) than more experienced employees.
7.14
a. 1227.6866 Years + 36655.1363 Grad − 0.5227
× HSGPA) − 0.0010 (SAT × White); predictor variables are jointly and, except for
the second interaction term, individually significant at the 5% level.
b. White: 3.23 (SAT = 1200); 3.41 (SAT = 1300); 3.59 (SAT = 1400)
Non-White: 2.81 (SAT = 1200); 3.10 (SAT = 1300); 3.38 (SAT = 1400)
7.18
a. 319.0062 Grad + 219.1327 Debt +
2643.2320 City + 0.0185 (Cost × Grad); the partial effect of Cost increases with
Grad and the partial effect of Grad increases with Cost.
b. $42,666 (Cost = $20,000); $46,398 (Cost = $30,000); $50,130 (Cost = $40,000)
c. $43,706 (Cost = $20,000); $51,148 (Cost = $30,000); $58,589 (Cost = $40,000)
7.26
a. 0.0004 Agesq − 0.0810 Female + 0.0788
NPS + 0.22142/2)
b. 47.91
c. $91,903 (Males); $84,750 (Females)
7.28
a.
(Quadratic)
For every 50 second increase in the time, there is a decrease of about one watch.
b. 5.7493 Male
For every 10% increase in the time, there is a decrease of about 1.55 watches.
b. $15,912.73
b. At the 10% level, Age is significant, but Gender is not; neither is significant at the
5% level.
c. 0.8410 (Male); 0.5970 (Female)
7.52
a. 0.4343
b. 0.4282
7.54
a. Linear: Age − 0.2019 Religious
Logistic:
Chapter 8
8.2 a. The Euclidean distance between observations 1 and 2 = 421.0001
b. The Manhattan distance between observations 1 and 2 = 421.3300
page 639
c. s1 = 2369.6638; s2 =
0.3958; s3 = 0.2211.
The normalized Euclidean distance between observations 1 and 2 = 0.7902
d. x1 min = 1346, x2 min = 2.67, x3 min = 0.01, Range1 = 7587, Range2 = 1.16,
Range3 = 0.61.
The min-max standardized Euclidean distance between observations 1 and 2 =
0.269.
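A minimal R sketch of the four distance calculations, applied to two hypothetical observations (not the exercise data) together with the standard deviations and ranges reported in parts c and d:
# Two hypothetical observations measured on three variables
obs1 <- c(1500, 3.1, 0.25); obs2 <- c(1800, 2.8, 0.40)
sqrt(sum((obs1 - obs2)^2))           # Euclidean distance
sum(abs(obs1 - obs2))                # Manhattan distance
s <- c(2369.6638, 0.3958, 0.2211)    # sample standard deviations (part c)
sqrt(sum(((obs1 - obs2) / s)^2))     # z-score standardized (normalized) Euclidean distance
rng <- c(7587, 1.16, 0.61)           # variable ranges (part d)
sqrt(sum(((obs1 - obs2) / rng)^2))   # min-max standardized Euclidean distance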
8.6 a. The Euclidean distance between observations 1 and 2 = 0.469
The Euclidean distance between observations 1 and 3 = 0.8307
The Euclidean distance between observations 2 and 3 = 1.1705
page 640
Observations 2 and 4 = 2/4
Observations 2 and 5 = 0/4
Observations 3 and 4 = 1/3
Observations 3 and 5 = 0/3
Observations 4 and 5 = 1/3
c. Most similar: The pairs of observations 1 and 2, 1 and 4, and 2 and 4 each have a
matching coefficient of 2/4 and Jaccard’s coefficients of 2/4.
Least similar: The pairs of observations 1 and 3, and 2 and 5 have matching
coefficients of 0, and Jaccard’s coefficients of 0.
Comparing Matching and Jaccard’s coefficients: Jaccard’s coefficients are all less
than or equal to their corresponding matching coefficients.
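A minimal R sketch of the matching and Jaccard's coefficients for two hypothetical binary observations (not the exercise data):
a <- c(1, 0, 1, 0); b <- c(1, 1, 0, 0)
mean(a == b)                                   # matching coefficient: matches / all variables
sum(a == 1 & b == 1) / sum(a == 1 | b == 1)    # Jaccard's coefficient: 1-1 matches / non 0-0 pairs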
8.18
Actual Class   Predicted Class 1   Predicted Class 0
Class 1                3                   2
Class 0                2                   3
Actual Class   Predicted Class 1   Predicted Class 0
Class 1               18                   1
Class 0               45                  36
Actual Class   Predicted Class 1   Predicted Class 0
Class 1               14                   5
Class 0               12                  69
Actual Class   Predicted Class 1   Predicted Class 0
Class 1                7                  12
Class 0                2                  79
page 641
c. We need to retain 3 principal components to account for at least 85% of the total variance in the data.
Chapter 9
9.2 Analytic Solver:
a. Optimal value of k: 5
b. Accuracy = 50%; Specificity = 0.3846; Sensitivity = 0.6667; Precision = 0.4286.
c. The model is not effective due to low overall predictive accuracy and low AUC
value (0.5085).
R:
a. Optimal value of k: 4
b. Accuracy = 0.5814; Specificity = 0.6190; Sensitivity = 0.5455; Precision = 0.6.
c. The model is not effective due to low overall predictive accuracy and low AUC
value (0.5411).
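All four measures follow directly from the confusion matrix. For example, the R answers in part b are reproduced by a matrix with TP = 12, FN = 10, FP = 8, and TN = 13 (a matrix inferred to be consistent with those values, shown here only to illustrate the formulas):
TP <- 12; FN <- 10; FP <- 8; TN <- 13
(TP + TN) / (TP + TN + FP + FN)   # accuracy    = 0.5814
TN / (TN + FP)                    # specificity = 0.6190
TP / (TP + FN)                    # sensitivity = 0.5455
TP / (TP + FP)                    # precision   = 0.6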
9.4 Analytic Solver:
a. Optimal value of k: 9
b. Misclassification rate: 18.33%
c. Accuracy = 67.5%; Specificity = 0.8438; Sensitivity = 0; Precision = 0.
d. Accuracy = 77.5%; Specificity = 0.9375; Sensitivity = 0.125; Precision = 0.333.
e. The model is only slightly effective due to relatively low overall predictive accuracy
and AUC value (0.5313).
R:
a. Optimal value of k: 6
b. Misclassification rate: 0.2163
c. Accuracy = 0.75; Specificity = 0.9375; Sensitivity = 0; Precision = 0.
d. Accuracy = 0.6; Specificity = 0.625; Sensitivity = 0.5; Precision = 0.25.
e. The model is only slightly effective due to relatively low overall predictive accuracy
and AUC value (0.6406).
9.6 Analytic Solver:
a. Optimal value of k: 9
b. Accuracy = 91.5%; Specificity = 0.9564; Sensitivity = 0.4545; Precision = 0.4839;
AUC = 0.8839.
c. The KNN classifier is better at classifying non-target class cases than classifying
target class cases.
d. The plots show that the KNN classifier performs better than the baseline model.
Both the lift curve and ROC curve lie above the diagonal line. The AUC value is
0.8839 suggesting an effective model.
R:
a. Optimal value of k: 10
b. Accuracy = 0.9036; Specificity = 0.9623; Sensitivity = 0.3902; Precision = 0.5424;
AUC = 0.9118.
c. The KNN classifier is better at classifying non-target class cases than classifying
target class cases.
d. The plots show that the KNN classifier performs better than the baseline model.
Both the lift curve and ROC curve lie above the diagonal line. The AUC value is
0.9118 suggesting an effective model.
9.8 Analytic Solver:
a. Optimal value of k: 9
b. Accuracy = 79.17%; Specificity = 0.8889; Sensitivity = 0.7333; Precision = 0.9167.
c. AUC = 0.8037; The predictive performance measures and graphs suggest that the
KNN classifier is effective in classifying the data.
d. Predicted admission outcome for the first new applicant: 1(Admit)
R:
a. Optimal value of k: 7
b. Accuracy = 0.8298; Specificity = 0.8421; Sensitivity = 0.8214; Precision = 0.8846.
c. AUC = 0.828; The predictive performance measures and graphs suggest that the
KNN classifier is effective in classifying the data.
d. Predicted admission outcome for the first new applicant: 1(Admit)
9.10 Analytic Solver:
a. Optimal value of k: 3
b. Accuracy = 56.14%; Specificity = 0.625; Sensitivity = 0.48; Precision = 0.5.
c. AUC = 0.6031; The predictive performance measures and graphs suggest that the
KNN classifier is slightly effective in classifying the data.
d. Predicted outcome for the first new consumer: 0 (Do not respond)
e. Accuracy = 54.39%; Specificity = 0.5313; Sensitivity = 0.56; Precision = 0.4828.
R:
a. Optimal value of k: 6
b. Accuracy = 0.6814; Specificity = 0.7681; Sensitivity = 0.5455; Precision = 0.6.
c. AUC = 0.7085; The predictive performance measures and graphs suggest that the
KNN classifier is moderately effective in classifying the data.
d. Predicted outcome for the first new consumer: 0 (Do not respond)
e. Accuracy = 0.6106; Specificity = 0.4203; Sensitivity = 0.9091; Precision = 0.5.
9.12 Analytic Solver:
a. Optimal value of k: 3
b. Misclassification rate: 33.33%
c. Accuracy = 58.33%; Specificity = 0.2857; Sensitivity = 0.8438; Precision = 0.5745.
d. No. Part of the lift curve lies below the baseline.
e. Lift value of the leftmost bar of the decile-wise lift chart: 1.25
f. AUC = 0.6908
g. The predictive performance measures and graphs suggest that the KNN classifier
is slightly effective in classifying the data.
page 642
R:
a. Optimal value of k: 9
b. Misclassification rate: 0.3072
c. Accuracy = 0.6167; Specificity = 0.6154; Sensitivity = 0.6176; Precision = 0.6774.
d. Yes. The entire lift curve lies above the baseline.
e. Lift value of the leftmost bar of the decile-wise lift chart: 1.5
f. AUC = 0.7216
g. The predictive performance measures and graphs suggest that the KNN classifier
is moderately effective in classifying the data.
9.14 Analytic Solver:
a. Optimal value of k: 9
b. Misclassification rate: 27.78%
c. Accuracy = 63.89%; Specificity = 0.6154; Sensitivity = 0.6522; Precision = 0.75.
d. The plots show that the KNN classifier performs better than the baseline model.
Both the lift curve and ROC curve lie above the diagonal line. The AUC value is
0.7508.
e. The predictive performance measures and graphs suggest that the KNN classifier
is moderately effective in classifying the data.
R:
a. Optimal value of k: 9
b. Misclassification rate: 0.2459
c. Accuracy = 0.7042; Specificity = 0.5862; Sensitivity = 0.7857; Precision = 0.7333.
d. The plots show that the KNN classifier performs better than the baseline model.
Both the lift curve and ROC curve lie above the diagonal line. The AUC value is
0.7697.
e. The predictive performance measures and graphs suggest that the KNN classifier
is moderately effective in classifying the data.
9.16 Analytic Solver:
a. Optimal value of k: 7; Predicted outcome for the first new consumer: N (Do not
install solar panels)
b. Misclassification rate: 36.55%
c. Accuracy = 66.30%; Specificity = 0.7371; Sensitivity = 0.5548; Precision = 0.5912.
d. The plots show that the KNN classifier performs better than the baseline model.
Both the lift curve and ROC curve lie above the diagonal line. The AUC value is
0.6929.
e. The predictive performance measures and graphs suggest that the KNN classifier
is moderately effective in classifying the data.
R:
a. Optimal value of k: 8; Predicted outcome for the first new consumer: N (Do not
install solar panels)
b. Misclassification rate: 0.3818
c. Accuracy = 0.6569; Specificity = 0.7535; Sensitivity = 0.5088; Precision = 0.5737.
d. The plots show that the KNN classifier performs better than the baseline model.
Both the lift curve and ROC curve lie above the diagonal line. The AUC value is
0.7022.
e. The predictive performance measures and graphs suggest that the KNN classifier
is moderately effective in classifying the data.
9.18 Analytic Solver:
a. Accuracy = 75%; Specificity = 0.7; Sensitivity = 0.7917; Precision = 0.76.
b. Lift value of the leftmost bar of the decile-wise lift chart: 0.9167; This implies that by
selecting the top 10% of the validation cases with the highest predicted probability
of belonging to the target class, the naïve Bayes model would identify 0.9167 times
as many target class cases as if the cases are randomly selected.
c. AUC = 0.725
d. Predicted outcome for the first three new observations: Yes, No, Yes
R:
a. Accuracy = 0.7674; Specificity = 0.6; Sensitivity = 0.913; Precision = 0.7241.
b. Lift value of the leftmost bar of the decile-wise lift chart: 1.17; This implies that by
selecting the top 19% of the validation cases with the highest predicted probability
of belonging to the target class, the naïve Bayes model would identify 1.17 times as
many target class cases as if the cases are randomly selected.
c. AUC = 0.7196
d. Predicted outcome for the first three new observations: Yes, No, Yes
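A minimal R sketch of the kind of naïve Bayes workflow behind these answers, using the e1071 package; the data frames train and validate, the target y, and the class labels Yes/No are hypothetical:
library(e1071)
fit  <- naiveBayes(y ~ ., data = train)
prob <- predict(fit, newdata = validate, type = "raw")[, "Yes"]   # posterior probabilities
pred <- ifelse(prob > 0.5, "Yes", "No")                           # default 0.5 cutoff
table(Actual = validate$y, Predicted = pred)                      # confusion matrix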
9.20 Analytic Solver:
a. Accuracy = 69.23%; Specificity = 0.25; Sensitivity = 0.8889; Precision = 0.7273.
b. No, the lift curve does not lie above the baseline entirely.
c. AUC = 0.5417
d. Predicted outcome of the five new observations: 1, 0, 0, 1, 1
e. Accuracy = 46.15%; Specificity = 0.25; Sensitivity = 0.5556; Precision = 0.625; No,
the lift curve does not lie above the baseline entirely; AUC = 0.4236; The naïve
Bayes model that uses all four predictors performs better.
R:
a. Accuracy = 0.5; Specificity = 0.1; Sensitivity = 0.75; Precision = 0.5714.
b. No, the lift curve does not lie above the baseline entirely.
c. AUC = 0.6188
d. Predicted outcome of the five new observations: 1, 1, 1, 0, 1
e. Accuracy = 0.6154; Specificity = 0.1; Sensitivity = 0.9375; Precision = 0.625; No,
the lift curve does not lie above the baseline entirely; AUC = 0.6344; The naïve
Bayes model that uses only x1 and x2 performs better.
9.22 Analytic Solver:
a. Accuracy = 67.09%; Specificity = 0.7143; Sensitivity = 0.6; Precision = 0.5625.
b. Yes, the entire lift curve lies above the baseline.
c. AUC = 0.8051
page 643
d. The predictive performance measures and graphs suggest that
the naïve Bayes classifier is effective in classifying the data.
R:
a. Accuracy = 0.7051; Specificity = 0.7273; Sensitivity = 0.6765; Precision = 0.6571.
b. Yes, the entire lift curve lies above the baseline.
c. AUC = 0.7871
d. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is moderately effective in classifying the data.
9.24 Analytic Solver:
a. Observation 1: x1 = 1, x2 = 2, x3 = 1; Observation 2: x1 = 1, x2 = 2, x3 = 1.
b. Accuracy = 68.75%; Specificity = 0.7826; Sensitivity = 0.4444; Precision = 0.4444.
c. AUC = 0.6787
d. Accuracy = 37.5%; Specificity = 0.1304; Sensitivity = 1; Precision = 0.3103.
R:
a. Observation 1: x1 = 1, x2 = 2, x3 = 1; Observation 2: x1 = 1, x2 = 2, x3 = 1.
b. Accuracy = 0.6774; Specificity = 0.8571; Sensitivity = 0.3; Precision = 0.5.
c. AUC = 0.4548
d. Accuracy = 0.2903; Specificity = 0.1429; Sensitivity = 0.6; Precision = 0.25.
9.26 Analytic Solver:
a. Observation 1: x1 = 1, x2 = 1, x3 = 2; Observation 2: x1 = 1, x2 = 1, x3 = 1.
b. Accuracy = 48.57%; Specificity = 0.8421; Sensitivity = 0.0625; Precision = 0.25.
c. AUC = 0.3664
d. Accuracy = 41.43%; Specificity = 0.1316; Sensitivity = 0.75; Precision = 0.4211.
R:
a. Observation 1: x1 = 1, x2 = 1, x3 = 2; Observation 2: x1 = 1, x2 = 1, x3 = 1.
b. Accuracy = 0.5797; Specificity = 1; Sensitivity = 0; Precision = NaN.
c. AUC = 0.5384
d. Accuracy = 0.5507; Specificity = 0.4750; Sensitivity = 0.6552; Precision = 0.4750.
9.28 Analytic Solver:
a. Accuracy = 75%; Specificity = 0.5263; Sensitivity = 0.92; Precision = 0.7188.
b. AUC = 0.8189
c. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is effective in classifying the data.
R:
a. Accuracy = 0.7907; Specificity = 0.5882; Sensitivity = 0.9231; Precision = 0.7742.
b. AUC = 0.759
c. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is moderately effective in classifying the data.
9.30 Analytic Solver:
a. Accuracy = 70.39%; Specificity = 0.1639; Sensitivity = 0.9421; Precision = 0.7186.
b. Lift value of the leftmost bar of the decile-wise lift chart: 1.22. This implies that by
selecting the top 10% of the validation cases with the highest predicted probability
of belonging to the target class, the naïve Bayes model would identify 1.22 times as
many target class cases as if the cases are randomly selected.
c. AUC = 0.6223
d. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is only slightly effective in classifying the data.
R:
a. Accuracy = 69.35%; Specificity = 0; Sensitivity = 1; Precision = 0.6935.
b. Lift value of the leftmost bar of the decile-wise lift chart: 1.20. This implies that by
selecting the top 11% of the validation cases with the highest predicted probability
of belonging to the target class, the naïve Bayes model would identify 1.20 times as
many target class cases as if the cases are randomly selected.
c. AUC = 0.6549
d. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is only slightly effective in classifying the data.
9.32 Analytic Solver:
a. Accuracy = 81.4%; Specificity = 0.9551; Sensitivity = 0.6593; Precision = 0.9305.
b. Lift value of the leftmost bar of the decile-wise lift chart: 1.92; AUC = 0.8298.
c. Predicted outcome for the first three individuals: 0 (Do not volunteer), 0, 1
(Volunteer)
R:
a. Accuracy = 0.8184; Specificity = 0.9712; Sensitivity = 0.6517; Precision = 0.9541.
b. Lift value of the leftmost bar of the decile-wise lift chart: 1.98; AUC = 0.8481.
c. Predicted outcome for the first three individuals: 0 (Do not volunteer), 0, 1
(Volunteer)
9.34 Analytic Solver:
a. Observation 1: Age = 1, Hours = 2; Observation 2: Age = 1, Hours = 1.
b. Accuracy = 48.75%; Specificity = 0.4079; Sensitivity = 0.5595; Precision = 0.5109.
c. AUC = 0.4865
d. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is not effective in classifying the data.
R:
a. Observation 1: Age = 1, Hours = 2; Observation 2: Age = 1, Hours = 1.
b. Accuracy = 0.566; Specificity = 0.28; Sensitivity = 0.8214; Precision = 0.5610.
c. AUC = 0.5562
d. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is not effective in classifying the data.
9.36 Analytic Solver:
a. Accuracy = 66.25%; Specificity = 0.375; Sensitivity = 0.95; Precision = 0.6032.
b. Yes, the entire lift curve lies above the baseline.
page 644
c. AUC = 0.8009
d. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is effective in classifying the data.
R:
a. Accuracy = 0.6835; Specificity = 0.6944; Sensitivity = 0.6744; Precision = 0.725.
b. Yes, the entire lift curve lies above the baseline.
c. AUC = 0.7565
d. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is moderately effective in classifying the data.
9.38 Analytic Solver:
a. Accuracy = 71.25%; Specificity = 0.75; Sensitivity = 0.6923; Precision = 0.8372.
b. The plots show that the naïve Bayes classifier performs better than the baseline
model. Both the lift curve and ROC curve lie above the diagonal line.
c. AUC = 0.8310
d. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is effective in classifying the data.
R:
a. Accuracy = 0.7975; Specificity = 0.5667; Sensitivity = 0.9388; Precision = 0.7797.
b. The plots show that the naïve Bayes classifier performs better than the baseline
model. Both the lift curve and ROC curve lie above the diagonal line.
c. AUC = 0.802
d. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is effective in classifying the data.
9.40 Analytic Solver:
a. Observation 1: Age = 2, Education = 2, Hours = 1; Observation 2: Age = 2,
Education = 1, Hours = 1.
b. Accuracy = 87.5%; Specificity = 0.854; Sensitivity = 0.8929; Precision = 0.8772.
c. AUC = 0.8531
d. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is effective in classifying the data.
R:
a. Observation 1: Age = 2, Education = 2, Hours = 1; Observation 2: Age = 2,
Education = 1, Hours = 1.
b. Accuracy = 0.7864; Specificity = 0.7556; Sensitivity = 0.8103; Precision = 0.8103.
c. AUC = 0.7395
d. The predictive performance measures and graphs suggest that the naïve Bayes
classifier is moderately effective in classifying the data.
Chapter 10
10.2 a. Possible split points for days: {191, 208.5, 217.5, 229, 249.5, 287.5}
b. Possible split points for precipitation: {24.9, 35.55, 38, 40.95, 48.25, 58.2}
10.4 a. Gini index for root node: 0.2888
b. Gini index for age < 45.5: 0.1958
c. Gini index for age ≥ 45.5: 0.3648
d. Gini index for the split: 0.2803
e. Rules generated from this split: If age < 45.5, then the probability of a diabetes diagnosis is 11%. If age ≥ 45.5, then the probability of a diabetes diagnosis is 24%.
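For a binary outcome, the Gini impurity of a node is 1 − p² − (1 − p)², where p is the node's proportion of target-class cases. A quick R check of parts b and c using the class proportions from part e:
gini <- function(p) 1 - p^2 - (1 - p)^2
gini(0.11)   # 0.1958, the age < 45.5 node (part b)
gini(0.24)   # 0.3648, the age >= 45.5 node (part c)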
10.6 a. C
b. B
c. A
10.8 a. Minimum error: 0.1917; Number of decision nodes: 2.
b. Number of leaf nodes in the best-pruned tree: 3; Number of leaf nodes in the
minimum error tree: 3.
c. Predictor variable: x2; Split value: 63.5.
d. Accuracy = 83.75%; Sensitivity = 0.9762; Specificity = 0.6842; Precision = 0.7736.
e. Yes. The entire lift curve lies above the baseline.
f. AUC = 0.8365
g. Predicted response values of the new observations: 1, 1, 0, 1, 1; Class 1 probability
of the first observation: 0.7887.
10.10a. Number of leaf nodes: 8; Predictor variable: x2; Split value: 62
b. cp value = 0.0075188; Number of decision nodes: 2
c. No
d. Number of leaf nodes: 3
e. Accuracy = 0.8235; Sensitivity = 1; Specificity = 0.625; Precision = 0.75
f. Lift value of the leftmost bar of the decile-wise lift chart: 1.42. This implies that by
selecting the top 71% of the validation cases with the highest predicted probability
of belonging to the target class, the model would identify 1.42 times as many target
class cases as if the cases are randomly selected.
g. AUC = 0.8125
h. Predicted response values of the new observations: 1, 1, 0, 1, 1; Class 1
probability of the first observation: 0.7604
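A minimal R sketch of the typical rpart workflow behind answers like this one: grow a large tree, inspect the cp table, and prune at the cp value with the smallest cross-validation error. The data frame train and target y are hypothetical:
library(rpart)
full <- rpart(y ~ ., data = train, method = "class", cp = 0)   # grow a large tree
printcp(full)                                                  # cross-validation error by cp
best_cp <- full$cptable[which.min(full$cptable[, "xerror"]), "CP"]
pruned  <- prune(full, cp = best_cp)                           # minimum-error tree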
10.12a. Number of leaf nodes: 6; Predictor variable: CreditCard; Split value: “Yes” (whether
or not the customer has credit card debt).
b. cp value = 0.0357143; Number of splits: 2.
c. No
d. Number of leaf nodes: 3
e. Accuracy = 0.7119; Sensitivity = 0.8696; Specificity = 0.6111; Precision = 0.5882.
f. The model is moderately effective because 1) the lift value of the leftmost bar of the
decile-wise lift chart = 1.51, 2) the lift curve lies above the diagonal line, and 3) AUC
= 0.7506
g. Probability of the first new customer: 0.0714; Probability of the second new
customer: 0.2593
10.14a. Number of leaf nodes of the best-pruned tree: 3; Predictive variable: Income; Split
value: 135,000; Number of leaf nodes of the minimum error tree: 5.
b. Accuracy = 97.25%; Sensitivity = 0.8182; Specificity = 0.9864.
page 645
c. The model is effective because 1) the lift value of the leftmost bar
of the decile-wise lift chart = 8.18, 2) the lift curve lies above the diagonal line, and
3) AUC = 0.9352.
d. Probability of the first community member: 0.016; Probability of the second
community member: 0.9589.
10.16a. Number of leaf nodes: 3; Predictor variable: Age; Split value: 35; Rule: If Age is at
least 35, then the individual is more likely to go to church.
b. cp value = 0.00233508; Number of splits: 27; Number of leaf nodes: 28.
c. Yes; Number of splits: 18; cp value = 0.00291886.
d. Accuracy = 0.5317; Sensitivity = 0.4482; Specificity = 0.6118; Precision = 0.5256.
e. AUC = 0.5467; The model is not effective because 1) the lift value of the leftmost
bar of the decile-wise lift chart = 1.05, 2) the lift curve lies only slightly above the
diagonal line, and 3) AUC = 0.5467.
f. 50%
10.18a. Number of leaf nodes: 14; Rule: If the individual is 44 or older with an income of
less than 15000, the individual is likely to download the mobile banking app.
b. cp value = 0.0085714; Number of splits: 4.
c. No
d. Accuracy = 0.5503; Sensitivity = 0.68; Specificity = 0.4189; Precision = 0.5426.
e. Lift value of the leftmost bar of the decile-wise lift chart: 1.02. This implies that by
selecting the top 30% of the validation cases with the highest predicted probability
of belonging to the target class, the model would identify 1.02 times as many target
class cases as if the cases are randomly selected.
f. AUC = 0.5359
g. Number of new customers who are likely to download: 12; Probability of the first
customer downloading: 0.6125.
10.20a. Number of leaf nodes in the best-pruned tree: 2; Number of leaf nodes in the
minimum error tree: 2; Rules: If GPA is less than 2.65, then the student is not likely
to graduate within 4 years. If GPA is at least 2.65, then the student is likely to
graduate within 4 years.
b. Accuracy = 90.25%; Sensitivity = 1; Specificity = 0.7045; Precision = 0.8730.
c. The lift curve lies above the baseline.
d. AUC = 0.8523
e. Students 2 and 3
10.22a. Predictor variable: Income; Split value: 64.
b. cp value = 0.0122180; Number of splits: 2.
c. No
d. Accuracy = 0.8151; Sensitivity = 1; Specificity = 0.6071; Precision = 0.7412.
e. The model is effective because 1) the lift value of the leftmost bar of the decile-wise
lift chart = 1.4, 2) the lift curve lies above the diagonal line, and 3) AUC = 0.8036.
f. Probability of the first gamer: 0.7725; Probability of the second gamer: 0.0714.
10.24a. Possible split values for x1: {179.5, 252, 289}
b. Possible split values for x2: {66, 92.5, 105.5}
c. Possible split values for x3: {6.25, 9.75, 14.25}
d. MSE of the partition x1 = 252: 48.625
e. MSE of the partition on x2: 50.625
f. MSE of the partition x3 = 14.25: 51.1667
g. Because split x1 = 252 generates the lowest MSE, the best split is on x1, and the
best split point is x1 = 252.
10.26a. The split x1 = 7.5 will generate the least MSE: 400.1667
b. The split x2 = 22.5 will generate the least MSE: 400.1667
d. Because the split x2 = 22.5 has the lowest MSE, the best split is on x2, and the
best split value is 22.5.
e. Rules: If x2 < 22.5, then y = 25; if x2 ≥ 22.5, then y = 49.5.
10.28a. Minimum MSE: 27.8093; Number of decision nodes: 2
b. Number of leaf nodes in the best-pruned tree: 3; Number of leaf nodes in the
minimum error tree: 3.
c. Predictor variable: x5; Split value: 7.58; Rules: If x5 < 7.58, then y = 36.13; If x5 ≥
7.58, then y = 19.57.
d. RMSE = 5.8746; MAD = 4.5801.
e. Predicted value of the first observation: 15.47
f. Minimum = 15.47; Maximum = 36.1267; Average = 21.15.
10.30a. Number of leaf nodes: 8
b. Predictor variable: x5; Split value: 11; Rules: If x5 < 11, then y = 32; If x5 ≥ 11, then
y = 19.
c. Tree with the lowest cross-validation error: 21st tree; Number of splits: 28.
d. Yes; cp value = 0.027056.
e. Number of leaf nodes: 7
f. ME = −0.0074; RMSE = 5.6235; MAE = 3.9277; MPE = −3.9765; MAPE = 18.8877.
page 646
g. Minimum = 13.47; Maximum = 27.25; Average = 18.90.
10.32a. Minimum MSE: 46.1824; Number of decision nodes: 2.
b. Number of leaf nodes of the best-pruned tree: 2; Number of leaf nodes of the
minimum error tree: 3
c. Predictor variable: x3; Split value: 102.53; Rules: If x3 < 102.53, then y = 299.18; If
x3 ≥ 102. 53, then y = 312.81.
d. RMSE = 7.9182; MAD = 6.7371.
10.34a. Number of leaf nodes of the best-pruned tree: 7; Number of leaf nodes of the
minimum error tree: 7; Rules: If CreditCard = No, then TravelSpend = 2764.35; If
CreditCard = Yes, then TravelSpend = 1502.48.
b. RMSE = 1011.3150; MAD = 740.4973
c. Predicted annual travel spending of the first customer: $1,747.83; Predicted annual
travel spending of the second customer: $2,128.47
10.36a. Predictor variable: SQFT; Split value: 2221; Rules: If SQFT < 2221, then Price =
970,000; If SQFT ≥ 2221, then Price = 1,700,000.
b. cp value = 0.01878; Number of splits: 9.
c. Yes; cp value = 0.023827.
d. Number of leaf nodes: 6
e. ME = −51189.13; RMSE = 467558.6; MAE = 332553; MPE = −34.90362; MAPE =
53.6871; the model tends to over-predict because of the negative ME; the model is not very effective
due to the large prediction errors.
f. Predicted price of the first house: $829,762.6; Predicted price of the second house:
$1,190,534.5.
10.38a. Number of leaf nodes: 4
b. RMSE = 27.2155; MAD = 21.4680.
c. Mean = 221.947; Median = 234.857.
10.40a. Predictor variable: Generation; Split value: 17; Rules: If Generation < 17, Price ≥
12, then Sales = 7.7; If Generation < 17, Price < 12, then Sales = 13; If Generation ≥
17, then Sales = 18.
b. cp value = 0.018209; Minimum validation error = 0.27671; Number of leaf nodes: 5.
c. Yes; cp value = 0.019442.
d. Number of leaf nodes: 4
e. ME = 0.0523; RMSE = 3.1054; MAE = 2.4323; MPE = −4.4364; MAPE = 17.9348.
f. Predicted per capita electricity sales: 16.6075
10.42a. Number of leaf nodes in the best-pruned tree: 11; Number of leaf nodes in the
minimum error tree: 11.
b. Predictor variable: games_started; Split value: 186.
c. RMSE = 4,636,301.156; MAD = 2,856,216.137.
d. Average predicted salary = 8,174,731.90
10.44 Analytic Solver:
a. Accuracy = 84.375%; Sensitivity = 1; Specificity = 0.8214.
b. AUC = 0.9375
c. Predicted value: 1; Class 1 probability: 0.5.
R:
a. Accuracy = 0.7742; Sensitivity = 0.5; Specificity = 0.84.
b. AUC = 0.87
c. Predicted value: 1; Class 1 probability: 0.56.
10.46 Analytic Solver:
a. Accuracy = 84.375%; Sensitivity = 1; Specificity = 0.8214.
b. AUC = 0.9241
c. Most important predictor: x1
d. Predicted value: 1; Class 1 probability: 0.5.
R:
a. Accuracy = 0.8387; Sensitivity = 0.5; Specificity = 0.92.
b. AUC = 0.9
c. Most important predictor: x4
d. Predicted value: 0; Class 1 probability: 0.43.
10.48 Analytic Solver:
a. Accuracy = 73.75%; Sensitivity = 0.7674; Specificity = 0.7072.
b. Lift value of the leftmost bar of the decile-wise lift chart: 1.8605
c. Predicted value: 1; Class 1 probability: 1.
R:
a. Accuracy = 0.8608; Sensitivity = 0.8250; Specificity = 0.8974.
b. Lift value of the leftmost bar of the decile-wise lift chart: 1.98
c. Predicted value: 1; Class 1 probability: 1.
10.50 Analytic Solver:
a. Accuracy = 84.50%; Sensitivity = 0.8144; Specificity = 0.8738; AUC = 0.8864.
b. Accuracy = 83%; Sensitivity = 0.8351; Specificity = 0.8252; AUC = 0.8865; Most
important predictor: Characters.
c. 53.85% of the cases are predicted as spams.
R:
a. Accuracy = 0.8291; Sensitivity = 0.7864; Specificity = 0.8750; AUC = 0.8924.
b. Accuracy = 0.8442; Sensitivity = 0.8252; Specificity = 0.8646; AUC = 0.8912; Most
important predictor: Hyperlinks.
c. 53.85% of the cases are predicted as spams.
10.52 Analytic Solver:
a. Accuracy = 66.67%; Sensitivity = 0.6341; Specificity = 0.7097.
b. Accuracy = 62.50%; Sensitivity = 0.6098; Specificity = 0.6452.
c. The bagging tree shows more robust performance because of higher AUC value
(0.7352 vs. 0.6943).
R:
a. Accuracy = 0.7324; Sensitivity = 0.7857; Specificity = 0.6552.
b. Accuracy = 0.7042; Sensitivity = 0.8095; Specificity = 0.5517.
c. The bagging tree shows more robust performance because of higher AUC value
(0.7878 vs. 0.757).
page 647
10.54 Analytic Solver:
a. Accuracy = 70.39%; Sensitivity = 0.9421; Specificity = 0.1639; AUC = 0.5560.
b. Accuracy = 70.39%; Sensitivity = 0.9421; Specificity = 0.1639; AUC = 0.5560; Most
important predictor: CollegeParent.
c. Both models have the same performance.
R:
a. Accuracy = 0.7186; Sensitivity = 0.9402; Specificity = 0.2172; AUC = 0.5968.
b. Accuracy = 0.7186; Sensitivity = 0.9402; Specificity = 0.2172; AUC = 0.5968; Most
important predictor: GPA.
c. Both models have the same performance.
10.56 Analytic Solver:
a. Accuracy = 71.25%; Sensitivity = 0.6562; Specificity = 0.75; AUC = 0.6911.
b. The single-tree model is more robust due to higher accuracy, sensitivity, and AUC
values.
c. Probability of the first customer having plans to travel: 0.1; Probability of the second
customer having plans to travel: 0.
R:
a. Accuracy = 0.7215; Sensitivity = 0.5806; Specificity = 0.8125; AUC = 0.7547.
b. The bagging ensemble model is more robust due to higher accuracy, specificity,
and AUC values.
c. Probability of the first customer having plans to travel: 0.01; Probability of the
second customer having plans to travel: 0.14.
10.58 Analytic Solver:
a. Accuracy = 54.95%; Sensitivity = 0.7996; Specificity = 0.3078; AUC = 0.5613; Most
important predictor: Income.
b. The random trees ensemble model is slightly more robust due to higher accuracy,
sensitivity, and AUC values.
c. 100% of the individuals in the data set are likely to go to church.
R:
a. Accuracy = 0.5223; Sensitivity = 0.4632; Specificity = 0.5788; AUC = 0.5382; Most
important predictor: Age.
b. The random trees ensemble model is slightly more robust due to higher accuracy,
sensitivity, specificity, and AUC values.
c. 33.33% of the individuals in the data set are likely to go to church.
10.60 Analytic Solver:
a. Accuracy = 84.38%; Sensitivity = 0.8523; Specificity = 0.8333; AUC = 0.9123.
b. The boosting ensemble model is more robust due to higher AUC value.
c. Probability of the first gamer making in-app purchases: 0.3945; Probability of the
second gamer making in-app purchases: 0.2082.
R:
a. Accuracy = 0.8428; Sensitivity = 0.8571; Specificity = 0.8267; AUC = 0.9135.
b. The boosting ensemble model is more robust due to higher accuracy, specificity,
and AUC values.
c. Probability of the first gamer making in-app purchases: 0.4908; Probability of the
second gamer making in-app purchases: 0.4199.
Chapter 11
11.2 Analytic Solver: At most 6 clusters if the minimum distance between clusters is 5; At
most 10 clusters if the minimum distance between clusters is 3.
R: At most 6 clusters if the minimum distance between clusters is 5; At most 10
clusters if the minimum distance between clusters is 3.
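A minimal R sketch of counting the clusters that remain when a dendrogram is cut at a given minimum distance between clusters; the numeric data frame df is hypothetical:
d  <- dist(df)                      # Euclidean distances
hc <- hclust(d, method = "average") # agglomerative clustering, average linkage
plot(hc)                            # dendrogram
table(cutree(hc, h = 5))            # cluster sizes when cutting at height (distance) 5
length(unique(cutree(hc, h = 5)))   # number of clusters at that height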
11.4 Analytic Solver:
a. Number of observations in the largest cluster (Cluster 1): 42; Average value of x4 of
the largest cluster (Cluster 1): 141.1191.
b. Number of observations in the largest cluster (Cluster 2): 22; Average value of x4 of
the largest cluster (Cluster 2): 157.3636.
c. Number of observations in the largest cluster (Cluster 2): 21; Average value of x4 of
the largest cluster (Cluster 2): 158.1905.
R:
a. Number of observations in the largest cluster (Cluster 1): 42; Average value of x4 of
the largest cluster (Cluster 1): 141.1191.
b. Number of observations in the largest cluster (Cluster 2): 22; Average value of x4 of
the largest cluster (Cluster 2): 157.3636.
c. Number of observations in the largest cluster (Cluster 2): 21; Average value of x4 of
the largest cluster (Cluster 2): 158.1905.
11.6 Analytic Solver: At most 3 clusters if the minimum distance between clusters is 0.8.
R: At most 3 clusters if the minimum distance between clusters is 0.8.
11.8 At most 3 clusters if the minimum distance between clusters is 0.8. The average value
of x5 is 13.14.
11.10 Analytic Solver:
a. At most 10 clusters if the minimum distance between clusters is 0.8; Number of
transfers in the largest cluster (Cluster 1): 30.
b. At most 10 clusters if the minimum distance between clusters is 0.8; Number of
students on the Dean’s list in the largest cluster (Cluster 1): 35.
R:
a. At most 5 clusters if the minimum distance between clusters is 0.8; Number of
transfers in the largest cluster (Cluster 3): 12.
b. At most 3 clusters if the minimum distance between clusters is 0.8; Number of
students on the Dean’s list in the largest cluster (Cluster 1): 31.
11.12 Cluster 1: 142 players; Cluster 2: 1 player; Cluster 3: 1 player.
11.14 Analytic Solver:
a. Yes. Variables are measured using different scales.
page 648
b. Cluster 1, which includes 31 countries, has the highest average
GNI per capita (19439.6774), Cluster 3, which includes 3 countries, has the 2nd
highest average GNI per capita (1106.6667), and Cluster 2, which includes 4
countries, has the lowest average GNI per capita (392.5).
R:
a. Yes. Variables are measured using different scales.
b. Cluster 1, which includes 31 countries, has the highest average GNI per capita
(19439.6774), Cluster 3, which includes 3 countries, has the 2nd highest average
GNI per capita (1106.6667), and Cluster 2, which includes 4 countries, has the
lowest average GNI per capita (392.5).
11.16a. Compared to Cluster 1, customers in Cluster 2 tend to be older (average age of
41.4706 vs. 24.5385), have higher income (average income of 73117.6471 vs.
28307.6923), have larger families (household size of 3.8824 vs. 2.0769), and spend
more on pizzas annually (average spending of 1011.5882 vs. 290.4615).
b. Compared to Cluster 1, Customers in Cluster 2 are more likely to be married (15 vs.
6) and own houses (15 vs. 0).
11.18 Analytic Solver:
a. Number of cities in the largest cluster (Cluster 1): 28.
b. Average January average temperature: 31.6286; Average April average
temperature: 51.7429; Average July average temperature: 73.8357; Average
October average temperature: 55.1821
R:
a. Number of cities in the largest cluster (Cluster 1): 33.
b. Average January average temperature: 33.1455; Average April average
temperature: 53.1030; Average July average temperature: 74.7727; Average
October average temperature: 56.3939.
11.20 Analytic Solver:
a. Yes. Variables are measured using different scales.
b. At most five clusters are generated if the minimum distance between clusters is 10.
c. Number of community areas in the largest cluster: 33; Average median household
income of the largest cluster (Cluster 2): 45228.0606.
R:
a. Yes. Variables are measured using different scales.
b. At most two clusters are generated if the minimum distance between clusters is 10.
c. Number of community areas in the largest cluster: 43; Average median household
income of the largest cluster (Cluster 2): 45449.44.
11.22a. Number of individuals in the largest cluster (Cluster 1): 28.
b. Cluster 1: low weight, low income; Cluster 2: non-Christian; Cluster 3: high income;
Cluster 4: non-white, large family.
11.24 Analytic Solver:
With k = 2, size of the larger cluster (Cluster 1): 23; Average distance for the larger
cluster (Cluster 1): 0.7895.
With k = 3, size of the largest cluster (Cluster 3): 24; Average distance for the
largest cluster (Cluster 3): 0.8002.
With k = 4, size of the largest cluster (Cluster 4): 21; Average distance for the
largest cluster (Cluster 4): 0.6147.
R:
With k = 2, size of the larger cluster (Cluster 2): 27; Average distance for the larger
cluster (Cluster 2): 0.8629.
With k = 3, size of the largest cluster (Cluster 3): 19; Average distance for the
largest cluster (Cluster 3): 0.5980.
With k = 4, size of the largest cluster (Cluster 3): 16; Average distance for the
largest cluster (Cluster 4): 0.5807.
11.26a. Average silhouette width: 0.54
b. Average silhouette width: 0.34
c. Average silhouette width: 0.33
11.28a. Average silhouette width: 0.66
b. Average silhouette width: 0.54
c. Average silhouette width: 0.58
11.30a. Average silhouette width: 0.26
b. Average silhouette width: 0.39
c. Average silhouette width: 0.46
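A minimal R sketch of computing the average silhouette width for a k-means solution with the cluster package; the scaled data frame df_z and the choice of k are hypothetical:
library(cluster)
km  <- kmeans(df_z, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(df_z))
mean(sil[, "sil_width"])   # average silhouette width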
11.32 Analytic Solver:
a. Size of the largest cluster (Cluster 2): 27; Cluster center values of the largest
cluster: Comp = 0.6922, Att = 0.7019, Pct = 0.3269, Yds = 0.6784, Avg = 0.2571,
Yds/G = 0.3333, TD = 0.5974, Int = 0.4804.
b. Size of the largest cluster (Cluster 1): 14; Cluster center values of the largest
cluster: Comp = 0.6531, Att = 0.5442, Pct = 0.7095, Yds = 0.8063, Avg = 0.9606,
Yds/G = 0.8719, TD = 0.8139, Int = −0.2428.
c. The clustering structures are similar for players in both data sets. The largest
clusters in the two clustering structures contain better performing players with more
touchdowns.
R:
a. Size of the largest cluster (Cluster 1): 16; Cluster center values of the largest
cluster: Comp = −1.3499, Att = −1.3296, Pct = −1.1689, Yds = −1.3121, Avg =
−0.7293, Yds/G = −0.9251, TD = −1.1840, Int = −0.8474.
b. Size of the largest cluster (Cluster 3): 13; Cluster center values of the largest
cluster: Comp = 0.3080, Att = 0.3344, Pct = 0.1511, Yds = 0.1018, Avg = −0.3667,
Yds/G = −0.6465, TD = 0.2621, Int = 0.6625.
c. The clustering structures are similar for players in both data sets. The first cluster
includes the lowest performing quarterbacks, the second cluster includes the
highest performing quarterbacks, and the third cluster includes the medium
performing quarterbacks.
page 649
11.34 Analytic Solver:
a. Size of the largest cluster (Cluster 2): 56; Cluster with the highest number of
homeruns: Cluster 3.
b. Size of the largest cluster (Cluster 1): 30; Cluster with the highest number of
homeruns: Cluster 3.
R:
a. Size of the largest cluster (Cluster 1): 63; Cluster with the highest number of
homeruns: Cluster 1.
b. Size of the largest cluster (Cluster 2): 22; Cluster with the highest number of
homeruns: Cluster 1.
11.36a. Yes. Variables are measured using different scales.
b. Cluster characteristics and GNI:
Cluster 1: High population growth, low % of female population, high % of male
population. GNI = 17930.
Cluster 2: Relatively high population growth, high fertility rate, high birth rate. GNI
= 270.
Cluster 3: Medium on all indicators. GNI = 18663.2258.
Cluster 4: Low population growth, high % of female population, low % of male
population, low fertility rate, low birth rate. GNI = 4975.
11.38 Cluster characteristics:
Cluster 1: Food items that are high in calories, total fat, potassium, carbohydrate,
dietary fiber.
Cluster 2: Food items that are high in saturated fat, cholesterol, sodium, and
protein, but low in carbohydrate, dietary fiber, and sugar.
Cluster 3: Food items that are high in sugar but low in calories, total fat, saturated
fat, potassium, and protein.
11.40a. No. All measures are percentage growth.
b. Size of the largest cluster (Cluster 1): 6; The cluster that has the highest average
growth in GDP per capita: Cluster 2.
11.42 Analytic Solver:
a. Cluster characteristics:
Cluster 1: Medium on all indicators.
Cluster 2: Youngest, lowest income, lowest usage, and shortest tenure.
Cluster 3: Older, medium income, high usage, and medium tenure.
Cluster 4: Medium age, high income, high usage, and longer tenure.
b. Percent of unsubscribers: Cluster 1 (43.90%), Cluster 2 (93.94%), Cluster 3 (0%),
Cluster 4 (0%); Cluster with the highest percent of unsubscribers: Cluster 2.
R:
a. Cluster characteristics:
Cluster 1: Relatively younger, lower income, lower usage, and shorter tenure.
Cluster 2: Youngest, lowest income, lowest usage, and shortest tenure.
Cluster 3: Oldest, highest income, highest usage, and longest tenure.
Cluster 4: Relatively older, higher income, higher usage, and longer tenure.
b. Percent of unsubscribers: Cluster 1 (46.15%), Cluster 2 (93.94%), Cluster 3 (0%),
Cluster 4 (0%); Cluster with the highest percent of unsubscribers: Cluster 2.
11.44a. Lift ratio of the top rule ({b} => {c}): 1.2277
b. The support count of 11 implies that 11 of the transactions include both {b} and {c}.
c. Lift ratio of the top rule ({g,e} => {f}): 2.5
d. The confidence of 100 implies that 100% of the transactions that include {g, e} also
include {f}.
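The lift ratio is the rule's confidence divided by the support of the consequent. A quick R illustration with purely hypothetical confidence and support values, followed by the usual arules calls for generating and ranking rules:
conf_rule <- 0.60; supp_consequent <- 0.40
conf_rule / supp_consequent   # lift ratio = 1.5 (hypothetical numbers)
# library(arules)
# rules <- apriori(trans, parameter = list(supp = 0.1, conf = 0.5))
# inspect(sort(rules, by = "lift"))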
11.46a. The least frequent item: d
b. Lift ratio of the top rule ({e} => {c}): 1.3333
c. Lift ratio of the top rule ({c,d} => {e}): 2.6667
11.48a. Number of rules generated: 10; Top rule: {c,g} => {e}; Lift ratio of the top rule:
2.3602
b. Number of rules generated: 2; Top rule: {e} => {c}; Lift ratio of the top rule: 1.5633
11.50a. The least frequent item: e
b. Lift ratio of the top rule ({e} => {c}): 1.4006; The lift ratio implies that identifying a
transaction with item set {e} as one which also contains item set {c} is 40.06%
better than just guessing that a random transaction contains {c}.
11.52a. Lift ratio of the top rule ({horror} => {action}): 1.6369; The lift ratio implies that
identifying someone who watches a horror movie as one who also is going to watch
an action movie is 63.69% better than just guessing that a random individual is
going to watch an action movie.
b. Lift ratio of the top rule ({drama, horror} => {action, comedy}: 2.6190; The lift ratio
implies that identifying someone who watches drama and horror movies as one
who also is going to watch action and comedy movies is 161.90% better than just
guessing that a random individual is going to watch action and comedy movies.
11.54 Lift ratio of the top rule ({watermelon} => {cherry}): 1.3285; The lift ratio implies that
identifying someone who purchases watermelon as one who also is going to
purchase cherry is 32.85% better than just guessing that a random individual is
going to purchase cherry.
Lift ratio of the 2nd top rule ({watermelon} => {banana}): 1.1752; The lift ratio
implies that identifying someone who purchases watermelon as one who also is
going to purchase banana is 17.52% better than just guessing that a random
individual is going to purchase banana.
Lift ratio of the third top rule ({orange} => {apple}): 1.0417; The lift ratio implies that
identifying someone who purchases orange as one who also is going to purchase
apple is 4.17% better than just guessing that a random individual is going to
purchase apple.
11.56a. Most frequently downloaded song: Here Comes the Sun
b. Lift ratio of the top rule ({All You Need Is Love, Here Comes the Sun, Yellow
Submarine} => {A Day in the Life}): 1.7730; The lift ratio implies that identifying
someone who downloads All You Need Is Love, Here Comes the Sun, and Yellow
Submarine as one who also is going to download A Day in the Life is 77.30% better
than just guessing that a random individual is going to download A Day in the Life.
page 650
11.58a. Number of rules generated: 11
b. Crime most likely to be committed in the department store: Theft; Crime most likely
to be committed on the sidewalk: Narcotics; Crime most likely to be committed in
apartments: Battery.
11.60a. Number of rules generated: 8
b. Lift ratio of the top rule ({Snapchat} => {Instagram}): 1.2766; The lift ratio implies that identifying someone who uses Snapchat as one who also uses Instagram is 27.66% better than just guessing that a random individual uses Instagram.
11.62a. Number of rules: 4
b. Lift ratio of the top rule ({HELOC} => {Checking}): 1.2214; The lift ratio implies that
identifying someone who has a HELOC account as one who also has a checking
account is 22.14% better than just guessing that a random individual has a
checking account.
Chapter 12
12.2 a. 127.8867
b. 146.0634
c. The exponential smoothing model is preferred because it leads to the smallest
value for MSE (551.54 < 649.95), MAD (19.94 < 20.40), and MAPE (14.03 <
14.46).
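Comparisons like this one rest on the usual error measures applied to each model's in-sample forecast errors. A minimal R sketch of those measures, where e = actual − forecast:
mse  <- function(e) mean(e^2)
mad  <- function(e) mean(abs(e))
mape <- function(e, actual) mean(abs(e / actual)) * 100
# Compute each measure on the two models' errors and prefer the model with the smaller values.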
12.4 a. The 3-period moving average model is preferred because it leads to the smallest
b. The α = 0.6 model is preferred because it leads to the smallest values for MSE =
0.0007, MAD = 0.0213, and MAPE = 1.6115%;
12.8
a.
b.
12.10
a. 1124.83d3 + 189.4201t; positive
trend; the revenue is consistently higher in the first three quarters, especially the 2nd
quarter.
b.
12.12
a. 0.2330d3 + 0.0549d4 + 1.3323d5 + 0.5542d6
b.
b.
The quadratic model is preferred because it has the higher adjusted R2 (0.9550 >
0.8349)
12.16
a.
0.0007t3
b. The cubic model is preferred because it has the higher adjusted R2 (0.9886 >
0.9741); 3,197.89
12.20
a.
b.
12.22
a. − 10589.2436d3 + 4219.4278t +
27.9857t2
b.
12.26
a.
b. The quadratic trend model is preferred because it leads to the smallest value for
MSE, MAD, and 0.0164 × 592 = 126.07
12.28
a.
b. The exponential trend model is preferred because it leads to the smaller value for
12.30
a. Linear: 924.0796d2 − 671.7664d3 + 229.3586t
b. The exponential trend model is preferred because it leads to the smallest value for
MSE, MAD, and MAPE;
12.32
a. 30045.4236d2 + 9114.5188d3 +
b. The exponential trend model is preferred because it leads to the smallest value for
MSE, MAD, and MAPE;
Analytic Solver:
Chapter 13
13.2 Excel: The range of simulated observations is 13.8218.
R: The range of simulated observations is 13.69922.
13.6 Excel: Mean = 8.9700, Standard deviation = 1.3961, Maximum = 12, Minimum = 4
R: Mean = 8.9980, Standard deviation = 1.4609, Maximum = 12, Minimum = 5
13.10 Excel: Average demand = 11.0259 units, Standard deviation = 4.2025
R: Average demand = 11.1400 units, Standard deviation = 4.0186
13.16 Excel:
a. The likelihood of not meeting a weekly demand is 50.2000%. On average, the
opportunity cost is $8.5890 per week.
b. On average, the cost of having too many eggs is $8.5958 per week.
R:
a. The likelihood of not meeting a weekly demand is 46.60%. On average, the
opportunity cost is $8.9915 per week.
b. On average, the cost of having too many eggs is $8.39456 per week.
13.20 Excel: The likelihood of overstocking is 66.20% (331 out of 500 simulations), and the
likelihood of understocking is 33.80% (169 out of 500 simulations).
R: The likelihood of overstocking is 65.40% (327 out of 500 simulations), and the
likelihood of understocking is 34.60% (173 out of 500 simulations).
13.22 Excel: The average number of riders unable to get on the roller coaster is 7.06.
R: The average number of riders unable to get on the roller coaster is 7.755.
13.24a. Optimal solution: x1 = 20, x2 = 5, and z = 55
b. Both constraints are binding, and there is no slack.
c. Constraint 1: Shadow price = 0.3846, Range of feasibility = 93.3333 to 245.0000
Constraint 2: Shadow price = 0.1538, Range of feasibility = 32.8571 to 86.2500
d. Range of optimality: x1 = 1.7143 to 4.5000, x2 = 1.3333 to 3.5000
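A minimal R sketch of solving a small linear program and reading the shadow prices with the lpSolve package; the objective and constraints below are hypothetical, not the exercise's model:
library(lpSolve)
obj <- c(2, 3)                     # maximize 2 x1 + 3 x2 (hypothetical)
A   <- rbind(c(1, 2), c(3, 1))     # constraint coefficients (hypothetical)
rhs <- c(14, 15)
sol <- lp("max", obj, A, rep("<=", 2), rhs, compute.sens = TRUE)
sol$solution   # optimal x1, x2
sol$objval     # optimal objective value
sol$duals      # shadow prices of the constraints, followed by reduced costs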
13.26a. Optimal solution: x1 = 3.9290, x2 = 6.4286, and z = 111.4286
b. Both constraints are binding, and there is no surplus.
c. Constraint 1: Shadow price = 1.4286, Range of feasibility = 48 to 160
Constraint 2: Shadow price = 0.14286, Range of feasibility = 35.0000 to 116.6667
d. Range of optimality: x1 = 4.0000 to 13.3333, x2 = 9 to 30.
13.30a. Maximum revenue = $2,833.3333 by making 33.3333 liters of Chardonnay wine
and 22.2222 liters of Blanc de Blancs champagne.
b. Both grapes and processing time constraints are binding.
c. Grapes constraint: Shadow price = $5.8333, Range of feasibility = 300 to 480 kgs
Processing time constraint: Shadow price = $3.3333, Range of feasibility = 125 to
200 hours
d. Range of optimality: Chardonnay wine = $37.50 to $60.00, Blanc de Blancs
champagne = $41.25 to $66.00.
13.32 Consuming 2.1739 cups of milk and 2.7536 cups of cereal will meet the daily
requirements of calcium and vitamin D, at a minimum cost of $1.2134.
13.38a. x11 = 5, x12 = 60, x21 = 45, x22 = 0
b. z = 115.
13.42a. The lowest shipping cost is $238,000.
b. Manufacturer 1 should ship 6,600 batteries to Plant 2 and 3,400 batteries to Plant
3. Manufacturer 2 should ship 5,000 batteries to Plant 1 and 3,000 batteries to
Plant 3.
page 652
INDEX
A
Absolute references, Excel, 614
Accuracy rate
binary choice (classification) models, 293–294
classification performance measure, 335, 336
Addition rule, 143
Adidas, 297
Adjusted R2, 225–227
Categorical variable
multiple categories, linear regression and, 216–217
transforming numerical to, 398–399
Agglomerative clustering (AGNES algorithm), 478–480
defined, 477
dendrogram and, 479–480
Gower’s coefficient and, 485–487
linkage methods of, 478
mixed data, 485–488
numerical/categorical variables, 478–485
using Analytic Solver/R, 481–485
AGNES algorithm. See Agglomerative clustering
Airbnb, 535
Airline Passengers Rights, Health, and Safety, 164
Alpha (stock), 233
Alternative hypothesis, 194–196
Amazon (AMZN), 113, 535
Amazon.com, 531–532
Amazon Prime, 95, 133
American Heart Association, 262
Analysis of variance (ANOVA). See ANOVA
Analysis ToolPak, Excel. See Microsoft Excel Analysis ToolPak
ANOVA
linear regression model and, 224
table for regression, 230
Antecedent, association rule analysis, 502
Apple, Inc., 107, 147, 519–521, 534, 540, 541, 545, 548, 553
A priori method, association rule analysis, 503
Arithmetic mean, central location measure, 115
Artificial intelligence (AI), data mining and, 318
Ashenfelter, Orley, 287
Assigning probabilities, 141–142
Association, measures of
correlation coefficient, 123–124
covariance, 123
Association rule analysis
antecedent/consequent, If-Then, 502
A priori method, 503
assessing rules, 504–505
confidence of the association rule, 503–504
defined/uses of, 502
If-Then logical statements, 502
items/item set, 502
lift ratio, 504
perform, Analytic Solver/R, 506–508
support of the association rule, 503
Assumptions
common violation of OLS, 243–249. See also Linear regression model, detect/remedy
common violations of
independent, 390
linear regression model and, 241–243
Average, central location measure, 115
Average linkage method, agglomerative clustering and, 478
Average Man, The (Daniels), 564
B
Bagging, ensemble tree models, 454, 455, 456
Banerjee, Sunil, 25
Bankrate.com, 205
Bar chart, frequency distribution, categorical variable, 83–85
Bayes’ Theorem, total probability rule and, 148–151
BellKor, 6
Bell-shaped distribution, 129–133
Benchmark category, 213
Bernoulli process, 156–157
Best-pruned tree, 414
Beta (stock), 233
Big data
characteristics/veracity of, 12
defined, 12
structured/unstructured v., 13
writing with, 26–27, 77–78, 133–135, 172–173, 207–208, 253–254, 309, 364, 404–405,
469–471, 514, 556–557
See also Data; Visualize data
Billboard’s Hot 100, 26–27
Binary choice (classification) models, 288–289
accuracy rate of, 293–294
linear probability regression model, 289–290
logistic regression model, 290–293
Binary variable, 72–73
Binding constraints, 583
Binning, transforming numerical data, 58–62
Binomial distribution, 156–159, 564
Binomial random variable, 157
Blackboard, 14
Blacklists, 369
Bloomberg, 269
Boosting, ensemble tree models, 455, 456
Boston Red Sox, 25
Box-and-whisker plot, 127–129
Boxplot, detecting outliers, 127–129
Bubble plot, visualize data, 107–108, 112
Bureau of Economic Analysis, 20
Bureau of Labor Statistics (BLS), 9, 20
Business analytics
defining/importance, 4
fashion retailers and, 6
gaming industry and, 6
healthcare and, 7
online subscriptions and, 6
sports and, 6–7
stages of, 4–5
Business intelligence (BI), 5
Business Week, 21
C
Cambridge Analytica, 7
Capital budgeting, integer programming optimization, 596–599
Car crash data set, 608–609
Case Study
analyze admission/enrollment decisions, predictor variables, 310–312
Billboard’s Hot 100, 26–27
car accidents that result in deaths, 514–516
cost, and optimization, 604–606
evaluate forecasting models, housing starts, 556–557
grading scales, 172–173
housing prices, comparing two areas, 134–135
income inequality, 207–208
performance measurement, 364–365
predicting college acceptance, naïve Bayes’ method, 405–407
predicting personal income, 469–471
predictive model for price of house, 254–256
Castle, Matt, 285
Categorical data, 15, 16
matching coefficient/Jaccard’s coefficient, similarity measures, 327–329
transforming, 69–75
Categorical variable, visualize relationships
bar chart, 83–85
contingency table, 96–97
frequency distribution for, 82–83
scatterplots, 105–106
Categorizing probabilities, 141–142
Category reduction, 69–72
Category scores, create, 74–75
Caterpillar, Inc., 237
Causal models, time series forecasting and, 533–534
Centers for Disease Control and Prevention (CDC), 164, 240
Central limit theorem (CLT)
for the sample mean, 181–182
for the sample proportion, 183
Central location, measures of, 115
defined, 114
mean, types of, 115
median, 115
mode, 115
percentile, 118–119
Centroid method, agglomerative clustering, 478
Chance magazine, 287
Changing variability, detect/remedy, 246–247
Chartered Financial Analyst (CFA), 191–192, 298–299, 386–387
Charts, visualize data, guidelines, 91–92. See also Visualize data
Cheesecake Factory, 191
Chicago Tribune, 163
Children’s Online Privacy Protection Act, 174
CineMatch, 6
Classical linear regression model, 241–243
Classical probability, 142
Classification and regression trees (CART)
advantages/limitations, 370
analysis applications, 368
defined/uses, 412
full tree, 413–415
minimum error tree/best-pruned tree, 414
nodes of, 412
pure subset, 413
Classification models
k-nearest neighbor (KNN), 370–385
outcomes, summarizing, 368
performance charts for, 340–344
performance evaluation, 335–337
performance measures for, 335–336
Classification performance diagrams
cumulative lift chart, 340–341
decile-wise lift chart, 341
receiver operating characteristic (ROC) curve, 342–344
Classification summary table, k-nearest neighbor (KNN) and, 376
Classification trees
decision tree, building, 419–421
develop, Analytic Solver/R, 422–433
Gini impurity index and, 416–419
pruning, 421
split points, identify possible, 415–416
Cloud computing, 113
Cluster analysis
defined, 477
k-means, 492–493
Cluster plot, 496
CNBC, 147, 207, 298
CNN, 113, 163
CNNMoney, 542
Cobb-Douglas production function, 288
Coefficient of determination, R2, 224–225, 283
Coefficients, quadratic regression model, 273
College admission data set, 609–610
College Investor, The, 280
College Scorecard, 211, 235–236
Color, visualizing data and, 109–112
Combining events, 141
Competing hypothesis, 195
Complement of event, 141
Complement rule, 142–143
Complete linkage method, agglomerative clustering, 478
Composite primary key, 34
Computer-generated test statistic and p-value, 233
Conditional probability, 144–145
Confidence interval
population mean μ, 186–189, 188
population proportion p, 190–191
tdf values and probabilities, locating, 186–188
Confidence of the association rule, analysis, 503–504
Confusion matrix
computing performance measures from, 335
k-nearest neighbor (KNN), 380
use Excel to obtain, 337–338
Consequent, association rule analysis, 502
Constraints
linear programming (LP), optimization with, 583–584
optimization and, 578, 579
Consumer information, data mining and, 368
Contingency table, categorical variable, relationship between two, 96–100
Continuous random variable, 154
Continuous uniform distribution, 165
Continuous variable, 15
Correlated observations, detect/remedy, 247–249
Correlation coefficient, 123–124
Counting, sorting data, 39–44
Covariance, measure of association, 123
CRISP-DM, data mining process, 319–320
Cross-Industry Standard Process for Data Mining (CRISP-DM), 320. See CRISP-DM
Cross-sectional data, 9
Cross-validation methods, regression models
holdout method, 301–304
k-fold method, 300, 306–307
overfitting and, 300–301
using R, 547
Cubic trend model, 538–539
Cumulative distribution function, 154
Cumulative lift chart, classification, 340–341, 344
Cut-off values, selecting, 338–340
Cyclical component, time series forecasting process, 520
D
Daimler AG, 319
Daniels, Gilbert S., 564
Data
cross-sectional, 9
defining, 8
file formats, 20–24
measurement scales and, 16–19
sample/population, 8–9
sources of, 20–21
structured/unstructured, 10–12
time series, 10
variables and, 15–16
writing with, 604–606
See also Big data; Data entries; Visualize data
Database
defined, 32
NoSQL and, 37
SQL, retrieving data and, 35–36
Database management system (DBMS), 32
Data breach, Marriott International, 7
Data inspection, 39–44
Data management
defining, 32
software systems, 32
wrangling, 32
Data mart, 37–38
Data mining
consumer information and, 368
CRISP-DM process for, 319–320
data partitioning, 332–333
defining/methods of, 318
performance evaluation, supervised, 334
SEMMA methodology, 320–321
supervised, 321. See also Supervised data mining
unsupervised, 322. See also Unsupervised data mining
Data modeling, entity relationship diagram (ERD) and, 33–35
Data partitioning, 332–333
Gini impurity index and, 417–418
k-nearest neighbor (KNN) method and, 372–378
naïve Bayes’ method and, 391
recursive, 420
Data partitioning, model selection and, 544–547
cross-validation with time series, 544
defined/uses for, 544–546
Data preparation
missing values, handling, 46–50
subsetting, 50–53
Data retrieval, 35–36
Data sets (exercises)
car crash, 608–609
college admissions, 609–610
house price, 610
longitudinal survey data, 610–611
NBA, 612–613
tech sales reps, 613
Data transformation, categorical, 69–75
category reduction, 69–72
category scores, 74–75
dummy variables, 72–73
Data transformation, numerical, 57–66
binning, 58–62
rescaling, 66
Data warehouse, 36–38
Data wrangling, 32, 79
DBMS packages, 36
Decile-wise lift chart, classification, 341, 344–345
Decision tree
building, 419–421
minimum error/best-pruned, 414
nodes of, 412
pruning, 421
See also Classification and regression trees (CART)
Decision variables, linear programming (LP) and, 578, 579
Definite events, 140
Degrees of freedom, 186
Delimited file format, 21, 22
Dendrogram, agglomerative clustering, 479–480
Dependent events, independent events v., 145
Descriptive analytics, 4, 5
Deterministic relationship, variables, 212
DIANA divisive analysis. See Divisive clustering, DIANA
Dimension reduction, unsupervised learning, 322
Dimension table, 37–38
Discrete probability distribution, 154
generate random observations of, 564–567
Monte Carlo simulation and, 564
Poisson distribution, 159–160
Discrete random variable, 154–155
Discrete uniform distribution, 156
Discrete variable, 15
Discriminant analysis
advantages/limitations, 370
defined, 369
Dispersion, measures of
defined, 120–121
interquartile range (IQR), 120
mean absolute deviation (MAD), 120
range, defined, 119
Sharpe ratio, 120
variance/standard deviation, 120
Divisive clustering, DIANA, 477
Dow Jones Industrial Average (DJIA), 16, 17, 20, 95
Dual price, 583
Dummy variables
avoiding trap of, 216
category reduction and, 72–73
exponential trend model with seasonal, 539
interaction of two, regression model, 260–262
linear trend model with seasonal, 531–533
numerical variable, interaction between, 262–264
predictor variable and, 212–213
seasonal, quadratic trend model and, 540
Duracell, 8
E
Economic Policy Institute, 207
Economist, The, 21
E-mail filters, 369
Empirical probability, 142
Empirical rule, z-scores and, 130
Ensemble tree models
bagging, 454, 455, 456
boosting, 455, 456
defined/uses for, 454
develop using Analytic Solver/R, 456–465
random forest, 455, 456
Ensemble tree models, advantages/limitations, 370
Entity-relationship diagram (ERD), 33–35
Equal Employment Opportunity Commission (EEOC), 164
Error rate, classification performance measure, 335
Error sum of squares (ESS), 478
Error sum of squares (SSE), 214–215
Estimate, 178
competing models, 302, 303
linear probability/logistic regression models, 288–296
linear regression model, 215–216
maximum likelihood estimation (MLE) and, 290
nonlinear relationship, regression models for, 272–284. See also Regression models,
nonlinear relationships
OLS method, 214–215
regression models, interaction variables and, 260–267. See also Regression model,
interaction variables
standard error of, 223–224
Estimation
confidence interval for μ and p, 186
confidence interval/interval estimate, 185–186
population mean μ, confidence interval, 186–189, 188
population proportion p, confidence interval, 190–191
Estimator, 178
Euclidean distance, measure similarities with numerical data, 323–324
Events, probability and, 140–141
Excel. See Microsoft Excel; Excel Analysis ToolPak; Microsoft Excel formula option;
Microsoft Excel Solver, Simplex method
Excess kurtosis, 122
Excluded variables, detect/remedy, 249
Exhaustive events, 140
Expected value
standard error of the sample mean, 179
standard error of the sample proportion, 182
Experiment, 140
Exponential regression model, 278–282
Exponential smoothing, time series forecasting
advanced methods, 548–555
Holt exponential smoothing method, 549–552
Holt-Winters, 552–555
moving average, 528
simple, 525–528
Exponential trend model, nonlinear regression, 535–539
eXtensible Markup Language (XML), 21–24
Extensions of total probability rule, Bayes’ Theorem and, 150–151
F
Facebook, 7, 10, 12, 15, 193, 387, 439
Fact table, 37–38
Fahrenheit scale, 18
False negative outcome (FN), 368
False positive outcome (FP), 368
Fashion retailers, data analytics and, 6
Federal Reserve Bank, 238, 542
Federal Reserve Economic Data, 20–21
Fidelity Gold Fund, 208
Fidelity Growth and Value mutual funds, 125
Fidelity Growth Index, 81, 86
Fidelity Select Automotive Fund, 208
Fidelity’s Technology and Energy mutual funds, 133
File formats, data, 20–21
delimited, 22
eXtensible Markup Language (XML), 22–23, 24
fixed width, 21
HyperText Markup Language (HTML), 23, 24
JavaScript Object Notation (JSON), 23–24
markup languages, 23–24
summary, 24
Financial data, 11
Fisher, Ronald, 498
Fixed-width format, 21
Flaw of Averages, The (Savage), 563
Forbes, 21, 25, 184, 403
Forecasting process, time series
causal models and, 533–534
cross-validation with, 544
defining/components and uses for, 520–521
exponential smoothing, moving average, 528
model selection criteria, 521–522
moving average smoothing, 523–525
nonlinear regression models, trend/seasonality, 535–541. See also Nonlinear regression
models, trend/seasonality and
performance measures, 522
qualitative/quantitative methods of, 521
simple exponential smoothing, 525–528
smoothing techniques for, 522–523
trend/seasonality, linear regression models for, 529–534. See also Linear regression
models
Foreign key (FK), 33
Formulas, Excel, 614–616
Fortune, 21
Frequency distribution
categorical variable, 82–83
numerical variable, 86–87
Full tree, 413
Functions
Analytic Solver (Excel/R), 46–50
Cobb-Douglas production, 288
cumulative distribution, 154
MS Excel, 615
R/RStudio, entering/importing data, 623–625
software. See Microsoft Excel entries; R; R lpSolve package; RStudio for program functions in
bold print
G
Gallup, 299
Gallup-Healthways Well-Being Index, 209
Gaming industry, data analytics and, 6
Gap, Inc., 6
Gartner.com, 12
General Electric (GE), 10
ggplot2 package, R and, 113
Gini impurity index, 416–419
Goodness-of-fit measures
adjusted R2, 225–227
coefficient of determination R2, 224–225
models for, 226–227
standard error of the estimate and, 223–224
validation and, 227
Google, 7, 113, 193, 369
Google Chrome, 23
Google Docs, 22
Gower, John C., 485
Gower’s coefficient, agglomerative clustering and, 485–487
Grammar of Graphics, 113
Graphs, visualize data, guidelines, 91–92. See also Visualize data
Great Depression, 208
Great Recession, 208, 289
Grouping object, cluster analysis and, 476
H
Happiness Index Report (UN), 27
Harper, Bryce, 25
Harrah’s Entertainment, 6
Healthcare, business analytics and, 7
Heat map, visualize data, 109–112
Hierarchical cluster analysis, 485–488
agglomerative clustering, numerical/categorical variables, 478–485
defined, 477
dendrogram and, 479–480
divisive, 477
hidden patterns in, 480
Histogram, numerical variable, visualize, 87–91
H&M, 6
Holt exponential smoothing method, recursive equations and, 548–549, 552
Holdout method
cross-validation, 301–304
defined, 300
linear regression model, Analytic Solver/R, 305–306
Holt exponential smoothing, 548–552
Holt-Winters exponential smoothing, 552–555
Home Depot, 15, 125
House price data set, 610
HTML tags, 23
Human-generated data, 11–12
HyperText Markup Language (HTML), 21, 23, 24
Hypothesis testing
competing hypothesis, 195
defining/types of, 194
four-step procedure, p-value approach, 200
null/alternative, 194–196
one-tailed/two-tailed, 195
population mean μ, 197–200
population proportion p, 203–204
test statistic for μ, 198
Type I/Type II errors, 196–197
I
IBM, 318
IBM DB2, 32
If-Then rules
association rule analysis, 502
classification and regression trees (CART), 412
classification tree, 424
decision nodes and, 444
decision tree methodology, 414
support of the association rule, 503
Impossible events, 140
Independent assumption, 390
Independent events, dependent events v., 146
Indicator variable, 72–73
Inflation, data on, 21
Information, data as, 8
Input parameters, linear programming (LP), 578, 579
Inspecting data, 39–44
Instagram, 15
Integer programming, optimization with
capital budgeting, 596–599
transportation problems, 599–603
Interaction effect, regression model, 260
Interaction variables, regression model
dummy variable/numerical variable, 262–264
two dummy variables, 260–262
two numerical variables, 264–267
Interior node, decision tree, 412
Internet Addiction Test (IAT), 362
Interquartile range (IQR), 120
Intersection of two events, 141
Interval estimate, 185–186
Interval scale, 18
Inverse transformation, 168
iPhone, 14, 107
Items, association rule analysis, 502
Item set, association rule analysis, 502
J
Jaccard’s coefficient, categorical data similarity measures, 327–329
JavaScript, 24
JavaScript Object Notation (JSON), 21, 23–24
Johnson & Johnson (J&J), 234, 253
Joint probability, 143–144
Jones, Linda, 25
Journal of Happiness, 127, 244
K
Keyword filtering, 369
k-fold cross-validation method, 300, 306–307
k-means cluster analysis
defining/uses of, 492–493
perform with Analytic Solver, 494–495
perform with R, 495–497
k-nearest neighbors (KNN) method
advantages/limitations, 370
confusion matrix, 380
cumulative gains chart, 382
cumulative lift chart, 383
data partitioning, Analytic Solver, 372–378
decile-wise lift chart, 383
defined, 368
memory-based-reasoning, 371
observations and, 371–372
performance charts, 377
predicted probabilities, 381
ROC curve, 384
scoring results, 385
similarity measures and, 322
10-fold cross-validation, R, 378–385
underlying distribution and, 372–373
uses for, 370
Knowledge, data as, 8
Kurtosis coefficient, 122
L
Laplace smoothing, 390
Leaf node, decision tree, 412, 413
Left-tailed test, 199–200
Leptokurtic, 122
Lift ratio, association rule analysis, 504
Linear probability regression model, 289–290
Linear programming (LP), optimization with
binding constraints, 584
constraints in, 583
defined/uses, 577–578
dual price, 584
formulating, 579–581
four components of, 579
nonbinding constraint, 583
objective function and, 578, 579
shadow price, 584
solve/interpret problems, 581–593
Linear regression model
adjusted R2, 225–227
ANOVA and, 224, 230
assumptions, 241–243
categorical variables, multiple categories, 216–217
coefficient of determination, R2, 224–225
components of, 212–216
defined, 214
dummy variable, avoiding trap of, 216
estimate, 215–216
estimating with Excel or R, 218–219
goodness-of-fit measures and, 223–224
log-transformed v., 282–284
multiple, 213–214
OLS estimators, 214–215
regression analysis and, 212
reporting regression results, 235
simple, examples of, 213
tests of significance and, 229–234. See also Tests of significance
trend forecasting, 530–531
Linear regression model, detect/remedy common violations of
changing variability, 246–247
correlated observations, 247–249
excluded variables, 249
multicollinearity, 244–246
nonlinear patterns, 243–244
Linear relationship, visualize, 101
Linear trend model
estimating with seasonality, R, 533
seasonality and, 531–533
time series forecasting, 530–531
Line chart
defined, 112
numerical variables, 108–109
LinkedIn, 10
Logarithmic regression model, 278–279
Logarithmic transformation, 276–277
Logistic regression model, 290–293
advantages/limitations, 370
college admissions/enrollment, 310–312
estimate, 294–296
Log-log regression model, 277–278
Log-transformed regression model, comparing linear and, 282–284
Longitudinal survey data set, 610–611
Los Angeles Angels, 25
Los Angeles Lakers, 152
Lowe’s Companies, Inc., 125, 534
lpSolve package, R. See R lpSolve package
M
Machine-learning, data mining and, 318
Machine-structured data, 11–12
Macintosh, 22
Major League Baseball, 7, 362, 499
Management information systems (MIS), 261
Manhattan distance, measure similarities with numerical data, 323–324
Manhattan Project at Los Alamos National Laboratory, 563
Marginal effect, quadratic regression model (x on y), 273
Margin of error, 186
Markup languages, 22–24
Marriott International, 7
Matching coefficient, categorical data similarity measures, 327–329
Mathematical transformations, 62–66
Maximization function, 578
Maximum likelihood estimation (MLE), 290
Mean, as central location measure, 115
Mean absolute deviation (MAD), 120, 300, 346, 368, 522
Mean absolute error (MAE), 346
Mean absolute percentage error (MAPE), 300, 346, 522
Mean error (ME), 346
Mean percentage error (MPE), 346
Mean squared error (MSE), 300, 522
Measurement scales
nominal, 16–17
ordinal, 17–18
types of, 18
Measures of association, 123
Measures of dispersion, 120–121. See also Dispersion, measures of
Measures of shape, 121–122
Median, as central location measure, 115
Memory-based reasoning, 371
Microsoft Access, 32, 36
Microsoft Corp., 193
Microsoft Excel (exercise functions)
add-ins, 620
Analysis ToolPak/Solver add-ins, 620
Analytic Solver, 620
Analytic Solver, R and, functions, 46–50
bar chart, frequency distribution, 84–85
binning, transforming numerical data (Analytical Solver, R and), 58–62
binomial probabilities, 161
bubble plot, 107
category reduction (Analytical Solver, R and), 70–72
category scores, 74–75
confidence interval for μ, 189
confusion matrix/performance measures, 337–338
construct residual plots with, 250–251
correlation coefficient, 124
counting/sorting data, functions, 40–43
cumulative lift chart, 344
decile-wise lift chart, 344–345
dummy variable (Analytic Solver, R and), 73
formulas, 614–616
generate random observations, continuous probability distribution, 568
heat map, 110–111
histogram, numerical variable, 89–90
linear regression model, estimating, 218
line chart, numerical variable, 108–109
mathematical transformation, functions, 64–65
Monte Carlo simulation, formulate/develop, 570–573
moving averages, exponential smoothing, 528
normal distribution, 170
numerical variable, visualize relationship, 102
percentile, quartile/summary, 119
Poisson probabilities, 162
prediction performance measures, 347–348
random observations, discrete probability distribution, 565
regression models, compare linear and log-transformed, 283–284
relative/absolute/mixed references, 614–615
ROC curve, 345
scatterplot, categorical variable, 105–106
spreadsheet model, developing, 617–619
stacked column chart/contingency table, pivot table fields, 91–100
summary data, functions, 51–53
testing for μ, 202
Microsoft Excel Analysis ToolPak (exercise functions)
generate random observations, continuous probability distribution, 568
how to add-in, 620
random observations, discrete probability distribution, 565–566
random seed, number generation, 570
summary measures, 117
Microsoft Excel Analytic Solver (exercise functions)
agglomerative clustering, 481–483
association rule analysis, 506–508
binning, transforming numerical data, 58–62
category reduction, R and, 70–72
classification using five predictor variables, naïve Bayes’ method, 392–394
decision tree, develop, 422–425
defining, 620
develop prediction tree, 443–445
dummy variable, create (R and), 73
ensemble tree models, develop, 456–459
estimate logistic regression model, 294–296
holdout method, logistic regression model, 305
Holt exponential smoothing method, 550
Holt-Winters exponential smoothing, 553
how to add-in, 620
interface of, 620
k-means analysis, 494–495
k-nearest neighbors (KNN) method, 373–378
missing values, R and, 46–50
principal component analysis, perform, 355–359
scoring options, 375
standard data partition, 374
Microsoft Excel formula option, calculate mean/median, 116–117
Microsoft Excel Solver, Simplex method (exercise functions)
capital budgeting, integer programming optimization, 597–598
formulate/solve, linear programming problem, 584–589, 590–591
integer programming optimization, transportation problems, 601–602
Microsoft Excel XLMiner, 620
Microsoft Windows, 22
Microsoft Word, 22
Minimization function, 578
Minimum error tree, 414
Misclassification rate
classification performance measure, 335, 336
k-nearest neighbor (KNN) and, 376
Missing values, data preparation, 46–50
Mixed data, agglomerative clustering and, 485–488
Mixed references, Excel, 614–615
Modeling data, 33–35
Model selection
adjusted R2, 225–227
coefficient of determination, 224–225
data partitioning and, 544–547
standard error of the estimate, 223–224
time series forecasting, 521–522
Moneyball, 7
Money Magazine, 288
Monte Carlo simulation
defined/uses of, 563–564
discrete uniform distribution and, 564
formulate/develop, 569–575
generate random observations, continuous probability distribution, 567–569
Poisson distribution and, 564
random observations, discrete probability distribution, 564–567
Mortgage Banking Association, 171
Moving average smoothing
exponential, 528
time series forecasting, 523–525
Multicollinearity, linear relationship detect/remedy, 244–246
Multiple linear regression model, 213–214, 232
Multiplication rule, probability, 145
Mutually exclusive events, 140
Myers-Briggs Personality assessment, 77, 82, 96–97
MySQL, 32
N
Naïve Bayes’ Theorem
advantages/limitations, 370
applications of, 369–370
classification using five predictor variables, Analytic Solver, 392–394
defining/uses of, 390–391
partitioning data, R, 394–398
predicting college acceptance, 405–407
scoring results, 394
smoothing/Laplace smoothing, 390
transforming numerical into categorical values, 398–399
National Association of Business Economists (NABE), 205
National Association of Colleges and Employees, 261
National Association of Securities Dealers Automated Quotations (NASDAQ), 16, 17
National Basketball Association (NBA), 9, 364–365, 612
National Football League, 361, 499
National Health and Nutrition Examination Study, 152
National Institute of Health, 262
National Longitudinal Surveys (NLS), 469–470, 610–611
National Public Radio (NPR), 228, 542
NBA data set, 612–613
NBCnews.com, 206
NCR, 319
Negative linear relationship, 213
Negatively skewed distribution, 88
Netflix, 6, 535
Net promoter score (NPS), 77–78
Neural networks, 369–370
New York City Open Data, 501, 513
New Yorker, The, 240, 312
New York Stock Exchange (NYSE), 16, 17, 528
New York Times, The, 6, 15, 21, 240, 274, 542
Nielsen ratings, 164
Nike, 297
No linear relationship, 213
Nominal scale, 16–17
Noncausal models, time series forecasting, 533–534. See also Forecasting process, time
series
Nonlinear patterns, detection/remedy, 243–244
Nonlinear regression models, trend/seasonality and
exponential trend model, 535–538
polynomial trend model, 538–539
seasonality, 539–541
Nonlinear relationship
regression models for, 272–284. See also Regression models, nonlinear relationships
visualize, 101
Nonzero slope coefficient, test for, 233–234
Normal distribution
defined, 165
standard, 165–167
transformation, normal random variables, 167–169
Normalization, numerical data, 325–327
Normally distributed population, sampling from, 180–181
Normal random variables, transformation, 167–169
NoSQL, 37
Notepad, 22
Null hypothesis, 194–196
Numerical data, 15, 16
binning, transforming, 58–62
Euclidean/Manhattan distance, measure similarities with, 323–324
standardizing/normalizing similarity measures with, 325–327
transforming, 57–66
transforming, categorical values, 398–399
Numerical variable
agglomerative clustering and, 478–485
interaction of two, regression model and, 264–267
regression model, numerical variable interaction, 262–264
Numerical variable, visualize relationships
bubble plot, between three, 107–108
frequency distribution, 86–87
histogram, 87–91
line chart, 108–109
scatterplots, 101–102
O
Oakland Athletics, 6–7
Objective function, linear programming (LP) and, 578, 579
Observations
binning, categorization of and, 58
generate random, discrete probability distribution, 564–567
generate random observations, continuous probability distribution, 568–569
identify, 322. See also Similarity measures
k-nearest neighbor (KNN) method and, 371
measurement scales and, 18
missing values, 49–50
similarity measures for. See Similarity measures
OHRA, 319
Omission strategy, missing values, 46–50
One-tailed hypothesis test, 195, 199–200
Online subscription services, data analytics and, 6
Opinion Today, 193
Optimization
algorithms, prescriptive analytics and, 562
constraints, 578, 579
defined, 577, 578
integer programming, 596–603. See also Integer programming, optimization with
linear programming and, 577–593. See also Linear programming, optimization with
Oracle, 25, 32, 36
Oracle database, 113
Orbitz.com, 9
Ordinal scale, 17–18
Ordinary least squares (OLS) method, estimating, 214–215
classical linear regression model, assumptions and, 241–243
common violations of assumptions of, 243–249. See also Linear regression model,
detect/remedy common violations of
Outcomes
false negative/false positive, 368
summarizing, classifications and, 368
true negative/true positive, 368
Outliers
boxplots and, 127–129
defined, 115
z-scores and, 129–132. See also z-scores
Overfitting
cross-validation method, 300–301
performance evaluation and, 334
Oversampling, performance evaluation, 333
P
Pairwise observations, 323
Parameter, population, 178
Parameters, linear programming (LP) and, 578, 579
Partitioning data. See Data partitioning
Pattern recognition, unsupervised learning, 322
Paypal, 535
p-value approach
four-step process, hypothesis testing, 200
sample mean, alternative hypothesis, 199
Peck, Art, 6
Percentile, quartile functions, 118–119
Performance charts, classification
cumulative lift chart, 340–341
decile-wise lift chart, 341
Excel to obtain, 344–345
receiver operating characteristic (ROC) curve, 342–344
Performance evaluation
classification models and, 335–337
cut-off values, selecting, 338–340
data partitioning, 332–333
oversampling, 333
performance charts, classification, 340–344
prediction and, 345–348
supervised data mining, 334
using Excel to obtain confusion matrix/performance measures, 337–338
Performance measures
case study, 364–365
prediction, Excel, 347–348
time series forecasting, 522
using Excel to obtain confusion matrix and, 337–338
Pew Research Center, 93, 139, 152, 184, 208–209, 259
Philadelphia Phillies, 25
Platykurtic, 122
Point estimator, 178
Point-of-sale data, 11
Poisson distribution, 159–160, 564
Poisson process, 159
Poisson random variable, 159
Polynomial trend model, nonlinear regression model, time series forecasting
cubic trend model, 538–539
quadratic trend model, 538–539
Population
advantages/disadvantage, 8–9
mean, parameter, 115
See also Sampling distributions
Population mean μ
confidence interval, 186–189
hypothesis testing, 197–200
Population parameter, 178
Population proportion p
confidence interval, 186
hypothesis testing for, 203–204
test statistic for, 203
Positive linear relationship, 213
Positively skewed distribution, 88
Positive predictive value, classification performance measure, 335–336
Posterior probability, 148–149, 390
Precision, classification performance measure, 335–336, 336
Predicted probabilities
k-nearest neighbors (KNN), 381–382
linear probability regression models, 291
Prediction, performance evaluation for, 345–348
Prediction performance measures, 346
Prediction tree, 441–443. See also Regression tree
Predictive analytics, 4, 5
Predictor variables, 212–213
predictive model for price of house, case study, 254–256
regression trees and, 441–443
Preparing data, 45–53
Prescriptive analytics, 4–5, 562
Price, David, 25
Primary key (PK), 33
Principal component analysis (PCA)
applying to data set, 353
defining/uses for, 352
perform with Analytic Solver/R, 355–359
rotating data, 353–355
Prior probability, 148–149, 390
Prison Legal News, 298
Probabilistic simulation, 563
Probability
assigning, 141–142
defined, 140
events and, 140–141
locating tdf values, 186–188
rules of, 142–146
total probability rule, Bayes’ Theorem and, 148–151
Probability distribution, 154
Programming languages, 24
Pruning, decision tree, 421
Psychology Today, 285, 288
Pure subset, CART, 413
p-value, 198
Python, 24
Q
Quadratic regression model, 272–276
Quadratic trend model, 538–540
Qualitative forecasting, time series forecasting, 521
Qualitative variables, 15, 16
Quantitative method, time series forecasting, 521
Quantitative variables, 15, 16
Quarterly data, 539–540
R
R (exercise functions)
agglomerative clustering, 483–485
association rule analysis, 508–510
bar chart, frequency distribution, 85
binomial probabilities, 161
boxplots, detecting outliers, 128–129
bubble plot, 107–108
confidence interval for μ, 190
correlation coefficient, 124
cross-validation of regression models with, 547
cumulative lift chart, 383
decile-wise lift chart, 383
decision tree, develop in, 425–433
defining, 621
develop prediction tree, 445–449
ensemble tree models, develop, 459–465
entering/importing data, 623–625
estimate logistic regression model, 296
estimating linear trend model with seasonality, 533
generate random observations, continuous probability distribution, 568–569
ggplot2 package, 113
Gower’s coefficient, agglomerative clustering, 486–487
heat map, 111–112
histogram, numerical variable, 90–91
holdout method, logistic regression model, 306
Holt exponential smoothing method, 550–552
Holt-Winters exponential smoothing method, 553–555
installing for windows, 621
k-means cluster analysis, 495–497
k-nearest neighbor (KNN), 10-fold cross-validation, 378–385
linear regression model, estimating with, 218–219
line breaks, 626
line chart, numerical variable, 109
mean/median, summary function, 117–118
Monte Carlo simulation, formulate/develop, 573–575
normal distribution, 170
numerical variable, visualize relationship, 102
packages, 626–627
partitioning data, naïve Bayes’ method, 394–398
percentile/quartile/summary, 119
Poisson probabilities, 162
principal component analysis, perform, 355–359
random observations, discrete probability distribution, 566–567
regression models, compare linear and log-transformed, 283–284
residual plots, construct, 250–251
ROC curve and, 384
RStudio, 621, 622
scatterplot, categorical variable, 106
scoring results, 385
stacked column chart/contingency table, 100
testing for μ, 202
See also R lpSolve package; RStudio
Random component, time series forecasting process, 520
Random forest, ensemble tree models, 455, 456
Random variable
defined, 153
discrete, summary measures, 154–155
discrete probability/cumulative distribution and, 154
Poisson, 159
types of, 154
Ratio scale, 18
Real-world business scenarios. See Simulation algorithms
Recall, classification performance measure, 335
Receiver operating characteristic (ROC) curve
classification, 342–345, 368
k-nearest neighbor (KNN) and, 384
Recursive equations
Holt exponential smoothing and, 548–549
Holt-Winters exponential smoothing, 552–555
Recursive partitioning process, 420
Reference category, 213
Regression analysis
defined, 212
reporting results, 235
Regression model
cross-validation methods, 301–307. See also Cross-validation methods, regression
model
time series data, using R, 547
Regression model, interaction variables
dummy variable/numerical variable, 262–264
numerical variables, between two, 264–267
two dummy variables, 260–262
Regression model, nonlinear relationships
comparing linear/log-transformed, 282–284, 283–284
exponential, 278, 279–282
logarithmic, 278–279
logarithms, 276–277
log-log regression model, 277–278
quadratic regression model, 272–276
semi-log, 278
Regression sum of squares (SSR), 224–225
Regression trees
defining/uses for, 440–441
develop prediction tree, Analytic Solver/R, 443–449
split points, identify, 441–443
Relational database, 32
Relationship(s)
defining, 33
deterministic/stochastic, between variables, 212
entity diagram, 33–35
error rate/model complexity, 334
nonlinear, regression models for, 272–284. See also Regression models, nonlinear
relationships
simple linear regression model and, 213
Relationship between two variables
contingency table, categorical variable, 96–97
stacked column chart, categorical variables, 97–100
Relative references, Excel, 614
Religious Landscape Study, 93
Rescaling, transforming numerical data, 66
Residual, 214
Residual plots
against changing variability, 246–247
computing, 242
construct using Excel/R, 250–251
correctly specified model, 242
nonlinear patterns, 243–244
residuals against time t, 247–248
Response variable, 212
Retail customer data, 31
Retrieving data, 35–36
Reuters, 184
Reward-to-variability ratio, 120
Right-tailed test, 199–200
Risk-adjusted stock return, 234
R lpSolve package (exercise functions)
capital budgeting, integer programming optimization, 599
formulate/solve, linear programming problem, 589–590, 592
integer programming optimization, transportation problems, 602
See also R; RStudio
ROC curve. See Receiver operating characteristic (ROC) curve
Root mean squared error (RMSE), 300–301, 346, 368
Root node, decision tree, 412, 419
Rotating data, principal component analysis (PCA), 353–355
RStudio (exercise functions)
defining, 621
entering data/using functions, 623–624
installing for windows, 622
interface, 622–623
Rules of probability, 142–146
S
Safari browser, 23
Sample, Explore, Modify, Model, and Assess (SEMMA), 320. See SEMMA methodology
Sample data, advantages/disadvantage, 8–9
Sample mean
central limit theorem (CLT) for, 181–182
sampling distributions of, 178–180
standard error of, 179
statistic, 115, 178
Sample proportion, 178
central limit theorem (CLT) for, 183
sampling distribution of, 182–183
Sample regression equation, 214
Sample regression line, 214
Sample space, 140
Sampling distributions
central limit theorem (CLT) for the sample mean, 181–182
defining/types of, 178
normally distributed population, 180–181
of sample mean, 178–180
sample proportion of, 182–183
Samsung, 107
Samuel, Arthur, 318
Savage, Sam L., 563
SB Nation, 171
Scatterplots
categorical variable and, 102, 105–106
defined, 112
Schneider, Robert, 25
Scholastic Aptitude Test (SAT), 228
Scoring a record, 368
Seasonal component, time series forecasting process, 520
exponential trend, dummy variables and, 539
nonlinear trend model with, 539–541
Seasonality, linear trend model and forecasting, 531–533
SeekingAlpha.com, 534
Semi-log regression model, 278
SEMMA methodology, data mining, 320–321
Sensitivity, classification performance measure, 335, 336
Shadow price, 583
Shape, measures of
kurtosis coefficient, 122
skewness coefficient, 121–122
Sharpe, William, 120
Sharpe ratio measures, 120
Significance level, 199
Silhouette plot, 496
Similarity measures
defining/uses of, 322–323
matching coefficient/Jaccard’s coefficient, categorical data, 327–329
numerical data, 323–324
standardization/normalization, numerical data, 325–327
Simple exponential smoothing, time series forecasting, 525–528
Simple linear regression model, 213
Simulation algorithms
defining/uses for, 562
Monte Carlo real-world business scenarios, 562–575. See also Monte Carlo simulation
prescriptive analytics and, 562
probabilistic/stochastic, 563
Simulation algorithms, prescriptive analytics and, 562
Single linkage method, agglomerative clustering, 478
Skewness coefficient, 88, 121–122
Smoothing, 390
exponential, advanced methods of, 548–555. See also Exponential smoothing
Holt exponential, 548–552
Holt-Winters exponential, 552–555
moving average, 523–525
moving averages, and exponential, 528
simple exponential, time series forecasting, 525–528
time series forecasting, 522–523
Snapchat, 15
Social media, 15, 317
Software
database management systems (DBMS), 32
Excel. See Microsoft Excel entries
ggplot2 package, R, 113
R. See also R lpSolve package; RStudio
Tableau, 113
text editing, 22
Sorting, counting data, 39–44
Sources of data, 20–21
Spam filters, 369
Specificity, classification performance measure, 336
Split points
CART and, 415–416
identify, regression trees, 441–443
Sports, data analytics and, 6–7
Spotify, 528–529
Spreadsheet model, develop in Excel, 617–619
S&P’s Case-Shiller home price index, 542
SPSS, 319
SQL Server, 32
Stacked column chart, categorical variables, relationship between two, 97–100
Standard deviation, measure of dispersion, 120
Standard error of the estimate, 223–224
Standard error of the sample mean, 179
Standard error of the sample proportion, expected value, 182
Standardization, numerical data, 325
Standard Normal Curve Areas, 628–629
Standard normal distribution, 165–167
Standard normal table, 165
Standard transformation, normal random variables, 167–169
Star schema, 37–38
State of American Well-Being: 2017 State Well-Being Rankings, 209
Statistic, 115, 178
Stochastic relationship, variables, 212
Stochastic simulation, 563
Structured data, 10–11
file formats for, 21
unstructured/big data v., 13
Structured Query Language (SQL), data retrieval and, 35–36
Student’s t distribution, 186, 630–631
Subjective probability, 142
Subsetting, data preparation, 50–53, 79
Summary measures, 50–51. See also individual measures
association, 123–124
central location, 114–119
discrete random variable, 154–155
dispersion, 119–121
shape, 121–122
Sunnyville Bank, 411, 438
Supervised data mining, 321. See also Data mining; Unsupervised data mining
advantages/limitations of, 369
comparing techniques, 370
defining/uses for, 368
k-nearest neighbors (KNN) method, 370–385
naïve Bayes’ method, 390–399
oversampling, 333
performance evaluation, 334
Support of the association rule, association rule analysis, 503
Support of the item or item set, association rule analysis, 503
Symmetric distribution, 88, 129–133
T
Tableau data visualization, 113
Tabular data, 22
Target, 126
t distribution, 186, 630–631
tdf distribution, 186
Techfoogle.com, 185
Tech sales reps data set, 613
Teradata, 319
Test for a nonzero slope coefficient, 233–234
Test of individual significance, 231–232
Test of joint significance, 229–231
Tests of significance
computer-generated test statistic and p-value, 233
individual, 231–232
joint, 229–231
nonzero slope coefficient, 233–234
Test statistic and the p-value, 233
Test statistic for μ, 198
Text analytics, 12
TextEdit, 22
Text files, data format, 21
Time series data, 10
cross-validation with, 544–546
forecasting process for. See Forecasting process, time series
tdf values and probabilities, locating, 186–188
Total probability rule, Bayes’ Theorem and, 148–151
defined, 148–149
extensions of, 150–151
Total sum of squares (SST), 224–225
Training data set, 368
Training set, 300–301
Transforming data
categorical, 69–75
normal random variables, 167–169
numerical, 57–66
Transportation problems, integer programming optimization, 599–603
Trend component, time series forecasting process, 520, 530–531
cubic trend model, 538–539
exponential trend model, nonlinear regression model, 535–538
exponential with dummy variables, seasonal, 539
polynomial trend model, 538–539
quadratic, seasonal dummy variables, 540
Trout, Michael, 25
TrueCar, 542
True negative outcome (TN), 368
True positive (TP) outcome, 368
Trump, Donald, 7
Twitter, 10, 11
Two-tailed hypothesis test, 195, 199–200
Type I errors, 196–197
Type II errors, 196–197
U
Ulam, Stanislaw, 563
Unbiased estimator, 179
Uncertainty, random variable and, 153–154
Unconditional probability, 144
Under Armour, 15, 297–298, 309
Union of two events, 140–141, 143–145
Uniqlo, 6
United Nations, 27
Unstructured data, 11–13
Unsupervised data mining
agglomerative clustering, mixed data, 485–488
defined, 322. See also Data mining; Supervised data mining
hierarchical cluster analysis, 476–485
USA Today, 21
U.S. Bureau of Labor Statistics, 469–470
U.S. Census Bureau, 10, 15, 21, 55, 93, 95, 152, 163, 469–470, 542
U.S. Department of Agriculture (USDA), 475
U.S. Department of Health and Human Services, 268
U.S. Energy Information Administration, 529
U.S. National Cancer Institute, 7
U.S. News and World Report, 25
U-shaped cost curve, 272–276
V
Validation set, 300–301
Value, big data, 12
Values, handling missing, 46–50
Vanguard Balanced Index Fund, 170
Variable(s)
categorical. See Categorical variable
decision, 578, 579
defined, 15
dummy, category reduction and, 72–73
mathematical transformations, 64–65
numerical. See Numerical variable
predictor, 212
relationship between two, visualize, 96
response, 212
transforming numerical data, 58–62
types of, 15
Variance, measure of dispersion, 120
Variance inflation factor (VIF), 245
Variety, big data and, 12
Velocity, big data and, 12
Veracity, big data, 12
Vertical axis, charts/graphs, guidelines, 91–92
Visualize data
advanced options, ggplot2, Tableau, 112–113
bubble plot, numerical variable, relationship between three, 107–108
categorical variable, 82–85
charts/graphs, guidelines, 91–92
cluster plot, 496
heat map, 109–112
line chart, numerical variable, 108–109
numerical variable, 86–91
numerical variable, relationship between two, 101–102
relationship between two variables, 96
scatterplot, categorical variable, 105–106
silhouette plot, 496
Tableau, 113
Volume, big data and, 12
von Neumann, John, 563
W
Wall Street Journal, The, 21, 147, 177, 203, 519
Walmart, 126
Ward, Joe H., 478
Ward’s method, agglomerative clustering, 478
Washington Post, The, 104, 147
Washington Post-Kaiser Family Foundation, 152
Websites, data providing, 20–21
Whitelists, 369
World Bank, 113, 355, 361, 499, 501, 536
World development indicator data, 21
World Economic Forum, 501
World Health Organization, 238, 270, 271
Wrangling, data, 32, 79
Writing with data, 26–27, 77–78, 133–135, 207–208, 253–254, 309, 364, 404–405, 469–
471, 514, 556–557, 604–606
Y
Yahoo finance, 15
Young, Kimberly, 362
YouTube, 11
Z
Zara, 6
Zhang, Chun, 25
Zillow.com, 14, 254, 499
z-scores, detecting outliers
calculating, 131–132
empirical rule and, 130
z-table, 165