Essentials of Data Science and Analytics


Statistical Tools, Machine Learning, and
R-Statistical Software Overview

Amar Sahay
Essentials of Data Science and Analytics:
Statistical Tools, Machine Learning, and R-Statistical Software Overview

Copyright © Business Expert Press, LLC, 2021.

Cover design by Charlene Kronstedt

Interior design by Exeter Premedia Services Private Ltd., Chennai, India

All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted in any form or by any
means—electronic, mechanical, photocopy, recording, or any other
except for brief quotations, not to exceed 400 words, without the prior
permission of the publisher.

First published in 2021 by


Business Expert Press, LLC
222 East 46th Street, New York, NY 10017
www.businessexpertpress.com

ISBN-13: 978-1-63157-345-3 (paperback)


ISBN-13: 978-1-63157-346-0 (e-book)

Business Expert Press Quantitative Approaches to Decision Making
Collection

Collection ISSN: 2163-9515 (print)


Collection ISSN: 2163-9582 (electronic)

First edition: 2021

10 9 8 7 6 5 4 3 2 1
To Priyanka Nicole, Our Love and Joy
Description
This text provides a comprehensive overview of data science. With
continued advancement in storage and computing technologies, data
science has emerged as one of the most desired fields in driving
business decisions. Data science employs techniques and methods from
many other fields such as statistics, mathematics, computer science,
and information science. Besides the methods and theories drawn from
these fields, data science uses visualization techniques, specially
designed big data software, and statistical programming languages such
as R and Python. Data science has wide applications in the areas of
Machine Learning (ML) and Artificial Intelligence (AI).
The book has four parts, each divided into chapters that explain the
core of data science. Part I introduces the field of data science, the
disciplines it comprises, and its scope, future outlook, and career
prospects. This part also explains analytics, business analytics, and
business intelligence and their similarities to and differences from
data science. Since data is at the core of data science, Part II is
devoted to explaining data, big data, and other features of data. One
full chapter is devoted to data analysis, creating visuals, pivot
tables, and other applications using Excel with Office 365. Part III
explains the statistics behind data science. It uses several chapters to
explain statistics and its importance, numerical and data visualization
tools and methods, probability, and applications of probability
distributions in data science. Other chapters in Part III cover
sampling, estimation, and hypothesis testing. All of these are integral
parts of data science applications. Part IV provides the basics of
Machine Learning (ML) and R statistical software, which is widely used
by data science professionals. The book also outlines a brief history,
the body of knowledge, and the skills and education requirements for
data scientists and data science professionals. Some statistics on job
growth and prospects are also summarized. A career in data science was
ranked the third best job in America for 2020 by Glassdoor and the
number one best job from 2016 to 2019. [29]

Primary Audience

The book is appropriate for majors in data science, analytics, business,
statistics, and data analysis; graduate students in business; MBAs and
professional MBAs; and working people in business and industry who are
interested in learning and applying data science to make effective
business decisions. Data science is a vast area, and its tools have
proven effective in making timely business decisions and predicting
future outcomes in the current competitive business environment.
The book is designed with a wide audience in mind. It takes a unique
approach of presenting the body of knowledge and integrating that
knowledge into different areas of data science, analytics, and
predictive modeling. The importance and applications of data science
tools in analyzing and solving different problems are emphasized
throughout the book. It takes a simple yet unique learner-centered
approach to teaching data science and predictive modeling, covering the
required knowledge and skills as well as the tools. Students in
Information Systems who are interested in data science will also find
the book useful.

Scope

This book may be used as suggested reading for professionals interested
in data science and can also be used as a real-world applications text
in data science, analytics, and business intelligence.
Because of its subject matter and content, the book may also be
adopted as suggested reading in undergraduate and graduate data
science, data analytics, statistics, and data analysis courses, as well
as MBA and professional MBA courses. Businesses are now data-driven:
decisions are made using real data, both data collected over time and
current real-time data. Data analytics is now an integral part of
business, and a number of companies rely on data, analytics, business
intelligence, machine learning, and artificial intelligence (AI)
applications in making effective and timely business decisions.
Professionals involved in data science and analytics, big data, visual
analytics, information systems and business intelligence, and business
and data analytics will find this book useful.

Keywords
data science; data analytics; business analytics; business intelligence;
data analysis; decision making; descriptive analytics; predictive analytics;
prescriptive analytics; statistical analysis; quantitative techniques; data
mining; predictive modeling; regression analysis; modeling; time-series
forecasting; optimization; simulation; machine learning; neural networks;
artificial intelligence
Contents
Preface
Acknowledgments

Part I Data Science, Analytics, and Business Analytics
Chapter 1 Data Science and Its Scope
Chapter 2 Data Science, Analytics, and Business Analytics (BA)
Chapter 3 Business Analytics, Business Intelligence, and Their Relation to Data Science

Part II Understanding Data and Data Analysis Applications
Chapter 4 Understanding Data, Data Types, and Data-Related Terms
Chapter 5 Data Analysis Tools for Data Science and Analytics: Data Analysis Using Excel

Part III Data Visualization and Statistics for Data Science
Chapter 6 Basic Statistical Concepts for Data Science
Chapter 7 Descriptive Analytics: Visualizing Data Using Graphs and Charts
Chapter 8 Numerical Methods for Data Science Applications
Chapter 9 Applications of Probability in Data Science
Chapter 10 Discrete Probability Distributions Applications in Data Science
Chapter 11 Sampling and Sampling Distributions: Central Limit Theorem
Chapter 12 Estimation, Confidence Intervals, Hypothesis Testing

Part IV Introduction to Machine Learning and R-statistical Programming Software
Chapter 13 Basics of Machine Learning (ML)
Chapter 14 R Statistical Programming Software for Data Science

Online References
Additional Readings
About the Author
Index
Preface
This book is about Data Science, one of the fastest growing fields with
applications in almost all disciplines. The book provides a comprehensive
overview of data science.

Data science is a data-driven decision-making approach that uses
several different areas, methods, algorithms, models, and disciplines
with the purpose of extracting insights and knowledge from structured
and unstructured data. These insights are helpful in applying
algorithms and models to make decisions. The models in data science
are used in predictive analytics to predict future outcomes. Machine
learning and artificial intelligence (AI) are major application areas
of data science.

Data science is a multidisciplinary field that provides the knowledge
and skills to understand, process, and visualize data in the initial
stages, followed by applications of statistics, modeling, mathematics,
and technology to address and solve analytically complex problems using
structured and unstructured data. At the core of data science is data.
It is about using this data in creative and effective ways to help
businesses make data-driven decisions. Data science is about extracting
knowledge and insights from data. Businesses and processes today are
run using data. Data is now collected on such a massive scale that the
present is often referred to as the age of Big Data. The rapid
advancement in technology is making it possible to collect, store, and
process large volumes of data rapidly. Data science is about using this
data effectively with visualization, statistical analysis, and modeling
tools that can help businesses drive decisions.
The knowledge of statistics in data science is as important as the
applications of computer science. Companies now collect massive amounts
of data, from exabytes to zettabytes, both structured and unstructured.
Advances in technology and computing capabilities have made it possible
to process and analyze this huge volume of data with smarter storage
spaces.
Data science is a multidisciplinary field that involves the ability to
understand, process, and visualize data in the initial stages followed by
applications of statistics, modeling, mathematics, and technology to
address and solve analytically complex problems using structured and
unstructured data. At the core of data science is data. It is about using
this data in creative and effective ways to help businesses in making
data-driven business decisions.
The field of data science is vast and has a wide scope. The terms data
science, data analytics, business analytics, and business intelligence are
often used interchangeably, even by professionals in these fields. All
these areas are somewhat related, with the field of data science having
the largest scope. This book tries to outline the tools, techniques, and
applications of data science and explain the similarities and differences
between this field and data analytics, analytics, business analytics, and
business intelligence.
The knowledge of statistics in data science is as important as the
applications of computer science. Statistics is the science of data and
variation. Statistics, data analysis, and statistical analysis constitute
major applications of data science. Therefore, a significant part of this
book emphasizes the statistical concepts needed to apply data science in
the real world. It provides a solid foundation of statistics applied to
data science. Data visualization and other descriptive and inferential
tools, knowledge of which is critical for data science professionals, are
discussed in detail. The book also introduces the basics of machine
learning, which is now a major part of data science, and introduces the
statistical programming language R, which is widely used by data
scientists. A chapter-by-chapter synopsis follows.
Chapter 1 provides an overview of data science by defining it and
outlining its tools and techniques. It describes the differences and
similarities between data science and data analytics. This chapter also
discusses the role of statistics in data science, a brief history of data
science, the knowledge and skills needed by data science professionals,
and a broad view of data science with associated areas. The body of
knowledge essential for data science and the different tools and
technologies used in data science are also parts of this chapter.
Finally, the chapter looks into the career path for data scientists
along with the future outlook of data science as a field. The major
topics discussed in Chapter 1 are: (a) broad view of data science with
associated areas, (b) data science body of knowledge, (c) technologies
used in data science, (d) future outlook, and (e) career path for data
science professionals and data scientists.
The other concepts related to data science, including analytics,
business analytics, and business intelligence (BI), are discussed in
subsequent chapters. Data science continues to evolve as one of the most
sought-after areas by companies. The job outlook for this area continues
to be one of the strongest of all fields.
The topic of Chapter 2 is analytics and business analytics, which
together form one of the major areas of data science. These terms are
often used interchangeably with data science. We outline the differences
between the two, along with an explanation of the different types of
analytics and the tools used in each one. The decision-making process in
data science makes heavy use of analytics and business analytics tools,
and these are integral parts of data analysis. We, therefore, felt it
necessary to explain and describe the role of analytics in data science.
Analytics is the science of analysis: the processes by which we analyze
data, draw conclusions, and make decisions. Business analytics (BA)
covers a vast area. It is a complex field that encompasses
visualization, statistics and modeling, optimization, simulation-based
modeling, and statistical analysis. It uses descriptive, predictive, and
prescriptive analytics, including text and speech analytics, web
analytics, and other application-based analytics. This chapter also
discusses different predictive models and predictive analytics. Flow
diagrams outlining the tools of descriptive, predictive, and
prescriptive analytics are presented in this chapter. The
decision-making tools in analytics are part of data science.
Chapter 3 draws a comparison between business intelligence (BI)
and business analytics. Business analytics, data analytics, and advanced
analytics fall under the broad area of business intelligence (BI). The broad
scope of BI and the distinction between BI and business analytics
(BA) tools are outlined in this chapter.
Chapter 4 is devoted to the study of the collection, presentation, and
various classifications of data. Data science is about the study of data.
Data are of various types and are collected using different means. This
chapter explains the types of data and their classification with
examples. Companies collect massive amounts of data. The volume of data
collected and analyzed by businesses is so large that it is referred to as
“Big Data.” The volume, variety, and speed (velocity) with which data
are collected require specialized tools and techniques, including
specially designed big data software, for analysis.
In Chapter 5, we introduce Excel, a widely available and widely used
software application for data visualization and analysis. A number of
graphs and charts are presented with stepwise instructions. Several
packages are available as add-ins to Excel to enhance its capabilities.
The chapter presents basic to more involved features and capabilities.
It is divided into sections, beginning with “Getting Started with Excel”
and followed by several applications, including formatting data as a
table, filtering and sorting data, and simple calculations. Other
applications in this chapter are analyzing data using pivot tables and
pivot charts, descriptive statistics using Excel, visualizing data using
Excel charts and graphs, visualizing categorical data (bar charts, pie
charts, cross tabulation), and exploring the relationship between two
and three variables (scatter plot, bubble graph, and time-series plot).
Excel is a very widely used software application in data science.
Chapters 6 and 7 deal with the basics of statistical analysis for data
science. Statistics, data analysis, and analytics are at the core of data
science applications. Statistics involves making decisions from
data. Making effective decisions using statistical methods and data
requires an understanding of three areas of statistics: (1) descriptive
statistics, (2) probability and probability distributions, and
(3) inferential statistics. Descriptive statistics involves describing
data using graphical and numerical methods. These methods are used to
create visual representations of the variables or data and to calculate
various statistics that describe the data. Graphical tools are also
helpful in identifying patterns in the data. Chapter 7 discusses data
visualization tools. A number of graphical techniques are explained with
their applications.
There has been an increasing amount of pressure on businesses to
provide high-quality products and services. This is critical to improving
their market share in a highly competitive market. Not only is it critical
for businesses to meet and exceed customer needs and requirements, it is
also important for them to process and analyze large amounts of data
(in real time, in many cases). Visualizing, processing, and analyzing
data, and using it in a timely and effective way, are needed to make
data-driven business decisions. The processing and analysis of large
data sets come under the emerging fields known as big data, data
mining, and analytics.
To process these massive amounts of data, data mining uses statistical
techniques and algorithms and extracts nontrivial, implicit, previously
unknown, and potentially useful patterns. Because applications of data
mining tools are growing, there will be more demand for professionals
trained in data science and analytics. Using the knowledge discovered
from this data to make intelligent data-driven decisions is referred to
as business intelligence (BI) and business analytics. These are hot topics
in business and leadership circles today, as they use a set of techniques
and processes that aid in fact-based decision making. These concepts are
discussed in various chapters of the book.
Much of the data analysis and many of the statistical techniques we
discuss in Chapters 6 and 7 are prerequisites to fully understanding data
science and business analytics.
In Chapter 8, we discuss numerical methods that describe several
measures critical to data science and analysis. The calculated measures
are known as statistics when calculated from sample data. We
explain the measures of central tendency, measures of position,
and measures of variation. We also discuss the empirical rule, which
relates the mean and standard deviation and aids in understanding what
it means for data to be normal. Finally, in this chapter, we study the
statistics that measure the association between two variables: covariance
and the correlation coefficient. All these measures, along with the visual
tools, are essential parts of data analysis.
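As an illustration of the measures named in this synopsis, the following is a minimal Python sketch, not an example taken from the book. The hours and scores data are hypothetical, the helper names (covariance, correlation, share_within) are introduced here only for illustration, and the empirical rule is checked on simulated data rather than on a real data set.

```python
import random
import statistics


def covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both sample standard deviations."""
    return covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))


def share_within(data, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    center, spread = statistics.mean(data), statistics.stdev(data)
    return sum(1 for value in data if abs(value - center) <= k * spread) / len(data)


# Hypothetical sample: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

print("mean:", statistics.mean(scores))      # central tendency
print("median:", statistics.median(scores))  # position (50th percentile)
print("std dev:", statistics.stdev(scores))  # variation
print("covariance:", covariance(hours, scores))
print("correlation:", correlation(hours, scores))

# Empirical rule check on simulated, roughly normal data (seed fixed for repeatability).
rng = random.Random(0)
simulated = [rng.gauss(100, 15) for _ in range(10_000)]
for k in (1, 2, 3):
    print(f"share within {k} sd:", share_within(simulated, k))
```

For approximately normal data, the three printed shares should land near 68, 95, and 99.7 percent, which is what the empirical rule states.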
In data analytics and data science, probability and probability
distributions play an important role in decision making. These are
essential parts of drawing conclusions from data and are used in problems
involving inferential statistics. Chapter 9 provides a comprehensive review
of probability.
Chapter 10 discusses the concepts of random variables and discrete
probability distributions. These distributions play an important role in
the decision-making process. Several discrete probability distributions,
including the binomial, Poisson, hypergeometric, and geometric
distributions, are discussed with applications. The second part of this
chapter deals with continuous probability distributions. The emphasis is
on the normal distribution, which is perhaps the most important
distribution in statistics and plays a very important role in statistics
and data analysis. The normal distribution is the basis of quality
programs such as Six Sigma. The chapter also provides a brief explanation
of the exponential distribution, which has wide applications in modeling
and reliability engineering.
Chapter 11 introduces the concepts of sampling and sampling
distributions. In statistical analysis, we almost always rely on a sample
to draw conclusions about the population. The chapter also explains the
concepts of standard error and the central limit theorem.
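The standard error and the central limit theorem mentioned here can be seen in a small simulation. The sketch below is illustrative only and is not taken from the book: the exponential population, the sample size of 30, and the number of repeated samples are all assumptions chosen for the demonstration.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30      # observations per sample (hypothetical choice)
NUM_SAMPLES = 5_000   # number of repeated samples (hypothetical choice)

# Population: exponential with rate 1, so its mean and standard deviation are both 1.
# It is strongly right-skewed, which makes the effect of averaging easy to see.
sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

theoretical_se = 1.0 / math.sqrt(SAMPLE_SIZE)  # sigma / sqrt(n)
observed_se = statistics.stdev(sample_means)   # spread of the sample means

print("mean of sample means:", statistics.mean(sample_means))
print("theoretical standard error:", theoretical_se)
print("observed standard error:", observed_se)
```

The sample means cluster around the population mean, and their spread is close to sigma divided by the square root of n, even though the underlying population is far from normal.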
Chapter 12 discusses the concepts of estimation, confidence
intervals, and hypothesis testing. The concept of sampling theory is
important in studying these applications. Samples are used to make
inferences about the population, and this can be done through a sampling
distribution. The probability distribution of a sample statistic is called
its sampling distribution. We explain the central limit theorem. We also
discuss several examples of formulating and testing hypotheses about the
population mean and population proportion. Hypothesis tests are used in
assessing the validity of regression methods. They form the basis of many
of the assumptions underlying the regression analysis to be discussed in
the coming chapters.
Chapter 13 provides the basics of machine learning. It is a widely used
method in data science and is used in designing systems that can learn,
adjust, and improve based on the data fed to them without being
explicitly programmed. Machine learning is used to create models from huge
amounts of data, commonly referred to as big data. It is closely related to
artificial intelligence (AI); in fact, it is an application of AI.
Machine learning algorithms are based on teaching a computer
how to learn from training data. The algorithms learn and improve as
more data flows through the system. Fraud detection, e-mail spam
filtering, and GPS systems are some examples of machine learning
applications.
Machine learning tasks are typically classified into two broad
categories: supervised learning and unsupervised learning. These concepts
are described in this chapter.
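To make the two categories concrete, here is a minimal Python sketch under stated assumptions; it is not the book's method. A one-nearest-neighbor rule stands in for supervised learning (it needs labeled examples), and a two-cluster, one-dimensional k-means loop stands in for unsupervised learning (it sees no labels). The function names, feature values, and labels are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_predict(training, point):
    """Supervised: return the label of the closest labeled training example."""
    _, label = min(training, key=lambda item: squared_distance(item[0], point))
    return label


def two_means_1d(values, iterations=10):
    """Unsupervised: split unlabeled numbers into two groups (k-means with k = 2)."""
    centers = [min(values), max(values)]  # simple deterministic starting centers
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for value in values:
            closer = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            groups[closer].append(value)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Supervised example: features paired with known labels (hypothetical data).
labeled = [
    ((1.0, 1.0), "spam"),
    ((1.2, 0.9), "spam"),
    ((5.0, 5.1), "not spam"),
    ((4.8, 5.3), "not spam"),
]
print(nearest_neighbor_predict(labeled, (1.1, 1.0)))

# Unsupervised example: the same kind of numbers, but with no labels attached.
readings = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centers, groups = two_means_1d(readings)
print(centers, groups)
```

The design difference is in the inputs: the supervised function cannot run without labels, while the clustering function discovers the two groups from the values alone.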
Finally, in Chapter 14, we introduce R statistical software. R is a
powerful and widely used software for data analysis and machine
learning applications. This chapter introduces the software and provides
the basic statistical features and instructions on how to download R and
RStudio. The software can be downloaded to run on all major operating
systems, including Windows, Mac OS X, and Unix. It is supported by the
R Foundation for Statistical Computing. The R programming language was
designed for statistical computing and graphics and is widely used by
statisticians, data miners [36], and data science professionals
for data analysis. R is perhaps one of the most widely used and powerful
programming platforms for statistical programming and applied machine
learning. It is widely used for data science and analysis applications and
is a desired skill for data science professionals.
The book provides a comprehensive overview of data science and the
tools and technology used in this field. Mastery of the concepts in this
book is critical in the practice of data science. Data science is a growing
field. It continues to evolve as one of the most sought-after areas by
companies. A career in data science was ranked the third best job in
America for 2020 by Glassdoor and the number one best job from
2016 to 2019. Data scientists have a median salary of $118,370 per year
or $56.91 per hour. These figures depend on level of education and
experience in the field. Job growth in this field is also above average,
with a projected increase of 16 percent from 2018 to 2028.
Salt Lake City, Utah, U.S.A.
amar@xmission.com
amar@realleansixsigmaquality.com
Acknowledgments
I would like to thank the reviewers who took the time to provide excellent
insights, which helped shape this book. I wish to thank many people who
have helped to make this book a reality. I have benefitted from numerous
authors and researchers and their excellent work in the areas of data
science and analytics.
I would especially like to thank Mr. Karun Mehta, a friend and
engineer whom I miss so much. I greatly appreciate the numerous hours
he spent in correcting, formatting, and supplying distinctive comments.
The book would not have been possible without his tireless effort. Karun
has been a wonderful friend, counsel, and advisor.
I am very thankful to Prof. Edward Engh for his thoughtful advice
and counsel.
I would like to express my gratitude to Prof. Susumu Kasai, Professor
of CSIS, for his review and invaluable suggestions.
Thanks to all of my students for their input in making this book
possible. They have helped me pursue a dream filled with lifelong
learning. This book would not be a reality without them.
I am indebted to Scott Isenberg, senior acquisitions editor; Charlene
Kronstedt, director of production; Sheri Dean, director of marketing; all
the reviewers; and the publishing team at Business Expert Press for their
counsel and support during the preparation of this book. I also wish to
thank Mark Ferguson, Editor, for reviewing the manuscript and providing
helpful suggestions for improvement. I acknowledge the help and
support of the Exeter Premedia Services team in Chennai, India, with
editing and publishing.
I would like to thank my parents, who always emphasized the importance
of what education brings to the world. Lastly, I would like to express
a special appreciation to my lovely wife Nilima; to my daughter Neha
and her husband Dave; and to my daughter Smita and my son Rajeev, both
engineers, for their creative comments and suggestions. And finally, to our
beautiful Priyanka for her lovely smiles. I am grateful to all for their love,
support, and encouragement.
Part I

Data Science, Analytics, and Business Analytics
CHAPTER 1

Data Science and Its Scope


Chapter Highlights
• Introduction
• Objective and Overview of Chapters
• What Is Data Science?
• Another Look at Data Science
• Data Science and Statistics
• Role of Statistics in Data Science
• Data Science: A Brief History
• Difference between Data Science and Data Analytics
• Knowledge and Skills for Data Science Professionals
• Some Technologies used in Data Science
• Career Path for Data Science Professional and Data Scientist
• Future Outlook
• Summary

Introduction
Data science is about extracting knowledge and insights from data. The
tools and techniques of data science are used to drive business and process
decisions. It can be seen as a major data-driven approach
to decision making. Data science is a multidisciplinary field that involves
the ability to understand, process, and visualize data in the initial stages,
followed by applications of statistics, modeling, mathematics, and
technology to address and solve analytically complex problems using
structured and unstructured data. At the core of data science is data. It is
about using this data in creative and effective ways to help businesses make
data-driven decisions.
The knowledge of statistics in data science is as important as the
applications of computer science. Companies now collect massive
amounts of data, from exabytes to zettabytes, both structured
and unstructured. Advances in technology and computing
capabilities have made it possible to store, process, and analyze this huge
volume of data with smarter storage spaces.
Data science is applied to extract information from both structured
and unstructured data. [1, 2]
Unstructured data is usually not organized in a structured manner. It is
text heavy and may contain qualitative or categorical elements, such as
dates and categories, as well as numbers and other forms of
measurement. Compared to structured data, unstructured data
contains irregularities and ambiguities that make it difficult to apply
traditional tools of statistics and data analysis. Structured
data is usually stored in clearly defined fields in databases, and software
applications and programs are designed to process such data. In recent
years, a number of newly developed tools and software programs have
emerged that are capable of analyzing big and unstructured data. One
of the earliest applications of unstructured data analysis is the analysis
of text data using text mining and other methods.
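The contrast between the two kinds of data can be sketched in a few lines of Python. This is an illustrative example only: the order records, field names, and review text are hypothetical, and the word count is a very basic stand-in for text mining.

```python
import re
from collections import Counter

# Structured data: every record has the same clearly defined fields,
# so a question can be answered by filtering and summing directly.
orders = [
    {"order_id": 1, "region": "West", "amount": 120.0},
    {"order_id": 2, "region": "East", "amount": 80.0},
    {"order_id": 3, "region": "West", "amount": 45.5},
]
west_total = sum(row["amount"] for row in orders if row["region"] == "West")
print("West total:", west_total)

# Unstructured data: free text with no fields. It has to be broken into
# tokens before even a simple count is possible.
review = "Great product, great price. Delivery was slow, but support was great."
tokens = re.findall(r"[a-z]+", review.lower())
word_counts = Counter(tokens)
print(word_counts.most_common(3))
```

The structured query works because the fields are known in advance; the text requires a preprocessing step whose choices (lowercasing, what counts as a word) affect the result.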
Unstructured data has become more prevalent in recent years. In 1998,
Merrill Lynch said, “unstructured data comprises the vast majority of
data found in an organization, some estimates run as high as 80%.” [1]
Here are some other predictions. As of 2012, IDC (International Data
Corporation) [3] and Dell EMC [4] projected that data would grow to 40
zettabytes by 2020, a 50-fold growth from the beginning of 2010. [4] More
recently, IDC and Seagate predicted that the global datasphere will grow to
163 zettabytes by 2025 [5] and that the majority of it will be unstructured.
Computerworld magazine [7] states that unstructured information might
account for more than 70 to 80 percent of all data in organizations.
(https://en.wikipedia.org/wiki/Unstructured_data) [8]
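The 50-fold projection quoted above implies a starting volume and a yearly growth rate that are easy to work out. The short Python calculation below is a sketch, not a figure from the cited sources: it assumes the period from the beginning of 2010 to 2020 is treated as ten years and that growth compounds smoothly each year.

```python
projected_2020_zb = 40.0  # projection quoted in the text, in zettabytes
growth_factor = 50        # "50-fold growth" quoted in the text
years = 10                # assumption: start of 2010 to 2020 counted as ten years

# Volume at the start of 2010 implied by the two quoted figures.
implied_2010_zb = projected_2020_zb / growth_factor

# Constant yearly growth rate that compounds to a 50-fold increase.
implied_annual_growth = growth_factor ** (1 / years) - 1

print(f"implied 2010 volume: {implied_2010_zb:.1f} ZB")
print(f"implied annual growth: {implied_annual_growth:.1%}")
```

Under these assumptions the quoted figures correspond to roughly 0.8 zettabytes at the start of 2010 and growth of a little under 50 percent per year.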

Objective and Overview of Chapters


The objective of this book is to provide an introductory overview of
data science, explain what data science is, and show why it is
such an important field. We will also explore and outline the role of data
scientists/professionals and what they do.
The initial chapters of the book introduce data science and closely
related areas. The terms data science, data analytics, business analytics,
and business intelligence are often used interchangeably, even by
professionals in these fields. Therefore, Chapter 1, which provides an
overview of data science, is followed by two chapters that explain the
relationship between data science, analytics, and business intelligence.
Analytics itself is a wide area, and different forms of analytics, including
descriptive, predictive, and prescriptive analytics, are used by companies
to drive major business decisions. Chapters 2 and 3 outline the differences
and similarities between data science, analytics, and business
intelligence. Chapter 2 also outlines the tools of descriptive, predictive,
and prescriptive analytics along with the most recent and emerging
technologies of machine learning and artificial intelligence. Since the
field of data science is about data, a chapter is devoted to data and data
types. Chapter 4 provides definitions of data, different forms of data, and
their types, followed by some tools and techniques for working with data.
One of the major objectives of data science is to make sense of the massive
amounts of data companies collect. One way of making sense of data is to
apply data visualization or graphical techniques used in data analysis.
Understanding other tools and techniques for working with data is also
important. A chapter is devoted to data visualization.
Data science is a vast area. Besides visualization techniques and
statistical analysis, it uses statistical programming languages such as
R and requires a knowledge of databases (SQL or MySQL) or other
database management systems.
One major application of data science is in the area of machine
learning (ML) and artificial intelligence (AI). The book provides a
detailed overview of data science by defining and outlining its tools
and techniques. As mentioned earlier, the book also explains the
differences and similarities between data science and data analytics.
Other concepts related to data science, including analytics, business
analytics, and business intelligence (BI), are discussed in detail. The
field of data science is about processing, cleaning, and analyzing
data. These concepts and topics are important for understanding the
field and are discussed in this book. Data science is an emerging field
in data analysis and decision making.

What Is Data Science?


Data science may be thought of as a data-driven decision-making
approach that uses several different areas, methods, algorithms,
models, and disciplines with the purpose of extracting insights and
knowledge from structured and unstructured data. These insights are
helpful in applying algorithms and models to make decisions. The models
in data science are used in predictive analytics to predict future
outcomes.
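As a minimal illustration of a model being used to predict a future outcome, the sketch below fits a straight line by ordinary least squares and uses it to predict an unseen value. The data (advertising spend and sales) and the function name fit_line are hypothetical and made up for this example; real predictive models are usually far richer.

```python
# Hypothetical data, invented for illustration: advertising spend (x)
# and sales (y) for five periods, in arbitrary units.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]


def fit_line(x, y):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ad_spend, sales)

# Use the fitted model to predict sales for a spend level not yet observed
predicted_sales = intercept + slope * 6.0
print(f"sales = {intercept:.2f} + {slope:.2f} * spend")
print(f"Predicted sales at spend 6.0: {predicted_sales:.2f}")
```

The same pattern (fit a model to past data, then apply it to new inputs) underlies most predictive analytics, whatever the model.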
Data science, as a field, has a much broader scope than analytics,
business analytics, or business intelligence. It brings together and
combines several disciplines and areas including statistics, data
analysis9, statistical modeling, data mining,10,11,12,13,14 big data,15
machine learning,16 and artificial intelligence (AI), management
science, optimization techniques, and related methods in order to
“understand and analyze actual phenomena” from data.17
Data science employs techniques and methods from many other fields,
such as mathematics, statistics, computer science, and information
science. Besides the methods and theories drawn from several fields,
data science also uses data visualization techniques supported by
specially designed software, such as Tableau and other big data
software. Relational databases (queried with SQL), R statistical
software, and the Python programming language are all used in different
applications to analyze data, extract information, and draw
conclusions. These are the tools of data science. These tools,
techniques, and programming languages provide a unifying approach to
explore, analyze, draw conclusions, and make decisions from the massive
amounts of data companies collect.
Data science employs the tools of information technology and
management science (mathematical modeling and simulation), along with
data mining and fact-based data, to measure past performance and to
guide an organization in planning and predicting future outcomes to aid
in effective decision making.
Turing award18 winner Jim Gray viewed data science as a “fourth
paradigm” of science (empirical, theoretical, computational, and now
data-driven) and asserted that “everything about science is changing
because of the impact of information technology” and the data deluge.
In 2015, the American Statistical Association identified database
management, statistics and machine learning, and distributed and
parallel systems as the three emerging foundational professional
communities.

Another Look at Data Science


Data science can be viewed as a multidisciplinary field focused on
finding actionable insights from large sets of raw, structured, and
unstructured data. The field primarily uses different tools and
techniques to unearth answers to the things we don’t know. Data science
experts draw on data and statistical analysis, programming from varied
areas of computer science, predictive analytics, and machine learning
to parse through massive datasets in an effort to find solutions to
problems that haven’t been thought of yet.
A data scientist’s emphasis lies in asking the right questions, with
the goal of finding the right or acceptable solutions, rather than in
seeking specific answers. This is done by predicting potential trends,
exploring disparate and disconnected data sources, and finding better
ways to analyze information.
(https://sisense.com/blog/data-science-vs-data-analytics/)19
(Data Science: Wikipedia.org https://en.wikipedia.org/wiki/Data_science
(From Wikipedia, the free encyclopedia))

Data Science and Statistics


Conflicting Definitions of Data Science and Its Relation to Statistics

Stanford professor David Donoho, in September 2015, rejected three
simplistic and misleading definitions of data science, criticizing each
in turn.20
(1) For Donoho, data science does not equate to big data, in that the
size of the data set is not a criterion to distinguish data science and
statistics.20
(2) Data science is not defined by the computing skills of sorting big
data sets, in that these skills are already generally used for analyses
across all disciplines.20
(3) Data science is a heavily applied field where academic programs
right now do not sufficiently prepare data scientists for the jobs, in
that many graduate programs misleadingly advertise their analytics and
statistics training as a data science program.20,21
As a statistician, Donoho, following many in his field, champions the
broadening of learning scope in the form of data science,20 as does
John Chambers, who urges statisticians to adopt an inclusive concept of
learning from data.22 Together, these statisticians envision an
increasingly inclusive applied field that grows out of traditional
statistics and beyond.

Role of Statistics in Data Science


Data science professionals and data scientists should have a strong
background in statistics, mathematics, and computer applications. Good
analytical and statistical skills are a prerequisite to successful
application and implementation of data science tools. Besides the
simple statistical tools, data science also uses visualization,
statistical modeling including descriptive analytics, and predictive
modeling for predicting future business outcomes. Thus, a combination
of mathematical methods along with computational algorithms and
statistical models is needed for generating successful data science
solutions. Here are some key statistical concepts that every data
scientist should know.

• Descriptive statistics and data visualization
• Concepts and tools of inferential statistics
• Concepts of probability and probability distributions
• Concepts of sampling and sampling distributions, including
over- and under-sampling
• Bayesian statistics
• Dimensionality reduction
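A few of the concepts listed above can be made concrete in a short sketch. The example below uses only the Python standard library and entirely hypothetical numbers (a simulated population, a made-up class imbalance, and assumed test accuracy rates). It touches on descriptive statistics, simple random sampling, under-sampling, and Bayes’ theorem, and is not meant as a complete treatment of any of them.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 values drawn from a normal distribution
population = [random.gauss(100, 15) for _ in range(1000)]

# Descriptive statistics: center and spread of the population
pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

# Sampling: a simple random sample of 50 values and its mean
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

# Under-sampling: balance a skewed set of labels by randomly keeping
# only as many majority-class items as there are minority-class items
majority = ["no"] * 900
minority = ["yes"] * 100
balanced = random.sample(majority, len(minority)) + minority

# Bayes' theorem with assumed rates: P(condition | positive test)
prior = 0.01           # P(condition), assumed
sensitivity = 0.95     # P(positive | condition), assumed
false_positive = 0.05  # P(positive | no condition), assumed
evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence

print(f"Population mean {pop_mean:.1f}, standard deviation {pop_sd:.1f}")
print(f"Sample mean {sample_mean:.1f}")
print(f"Balanced classes: {balanced.count('yes')} yes, {balanced.count('no')} no")
print(f"Posterior probability: {posterior:.3f}")
```

The Bayes’ theorem step shows why the concept matters in practice: with a rare condition, even a fairly accurate test yields a posterior probability far below the test’s sensitivity.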

Data Science: A Brief History


1997 In November 1997, C.F. Jeff Wu gave the inaugural lecture
titled “Statistics = Data Science?”28 for his appointment to the
H. C. Carver Professorship at the University of Michigan. In this
lecture, he characterized statistical work as a trilogy of data
collection, data modeling and analysis, and decision making.
In his conclusion, he initiated the modern, non-computer science, usage
of the term “data science” and advocated that statistics be renamed data
science and statisticians data scientists.28 Later, he presented his
lecture titled “Statistics = Data Science?” as the first of his 1998
P.C. Mahalanobis Memorial Lectures.

2001 William S. Cleveland introduced data science as an independent
discipline, extending the field of statistics to incorporate “advances
in computing with data” in his article “Data Science.”

2002 In April 2002, the International Council for Science (ICSU):
Committee on Data for Science and Technology (CODATA)17 started the
Data Science Journal, a publication focused on issues such as the
description of data systems, their publication on the Internet,
applications, and legal issues.

2003 In January 2003, Columbia University began publishing The
Journal of Data Science,17 which provided a platform for all data
workers to present their views and exchange ideas. The journal was
largely devoted to the application of statistical methods and
quantitative research.

2005 The National Science Board published “Long-lived Digital Data
Collections: Enabling Research and Education in the 21st Century,”
defining data scientists as “the information and computer scientists,
database and software engineers and programmers, disciplinary experts,
curators and expert annotators, librarians, archivists, and others, who
are crucial to the successful management of a digital data collection”
whose primary activity is to “conduct creative inquiry and analysis.”18

2006/2007 Around 2007, Turing award winner Jim Gray envisioned
“data-driven science” as a “fourth paradigm” of science that uses the
computational analysis of large data as the primary scientific method
and “to have a world in which all of the science literature is online,
all of the science data is online, and they interoperate with each
other.”

2012 In the 2012 Harvard Business Review article “Data Scientist: The
Sexiest Job of the 21st Century,”24 DJ Patil claims to have coined this
term in 2008 with Jeff Hammerbacher to define their jobs at LinkedIn
and Facebook, respectively. He asserts that a data scientist is “a new
breed” and that a “shortage of data scientists is becoming a serious
constraint in some sectors” but describes a much more business-oriented
role.

2014 The first international conference, the IEEE International
Conference on Data Science and Advanced Analytics, was launched in 2014.

In 2014, the American Statistical Association (ASA) section on
Statistical Learning and Data Mining renamed its journal to Statistical
Analysis and Data Mining: The ASA Data Science Journal.

2015 In 2015, the International Journal on Data Science and Analytics was
launched by Springer to publish original work on data science and big data
analytics.

2016 In 2016, the ASA changed its section name to “Statistical Learning
and Data Science.”

Reference 17 cited above has excellent articles on Data Science.



Data Science and Data Analytics


(https://sisense.com/blog/data-science-vs-data-analytics/)
Data analytics focuses on processing and performing statistical
analysis on existing datasets. Analysts apply different tools and
methods to capture, process, organize, and analyze the data in company
databases in order to uncover actionable insights and find ways to
present them. More simply, the field of data analytics is directed
toward answering questions we know we don’t know the answers to. More
importantly, it is based on producing results that can lead to
immediate improvements.
Data analytics also encompasses a few different branches of broader
statistics and analysis, which help combine diverse sources of data and
locate connections while simplifying the results.

Difference Between Data Science and Data Analytics


While the terms data science and data analytics are often used
interchangeably, data science and big data analytics are distinct
fields, the major difference being their scope. Data science is an
umbrella term for a group of fields that are used to mine large
datasets. It has a much broader scope than data analytics, analytics,
and business analytics. Data analytics is a more focused version of
data science that concentrates on data analysis and statistics, and it
can even be considered part of the larger process, using simple to
advanced statistical tools. Analytics is devoted to realizing
actionable insights that can be applied immediately based on existing
queries.
Another significant difference between the two fields is a question of
exploration. Data science isn’t concerned with answering specific
queries; instead, it parses through massive datasets, sometimes in
unstructured ways, to expose insights. Data analysis works better when
it is focused, with questions in mind that need answers based on
existing data.
Data science produces broader insights that concentrate on which
questions should be asked, while big data analytics emphasizes
discovering answers to questions being asked.

More importantly, data science is more concerned with asking questions
than with finding specific answers. The field is focused on
establishing potential trends based on existing data, as well as
realizing better ways to analyze and model the data. Table 1.1 outlines
the differences.

Table 1.1  Difference between data science and data analytics

                   Data Science                     Data Analytics
Scope              Macro                            Micro
Goal               Ask the right questions          Find actionable data
Major fields       Machine learning, AI, search     Healthcare, gaming, travel,
                   engine engineering, statistics,  industries with immediate
                   analytics                        data needs
Analysis of data   Yes                              Yes
and big data

Some argue that the two fields, data science and data analytics, can
be considered different sides of the same coin, and their functions are
highly interconnected. Data science lays important foundations and
parses big datasets to create initial observations, future trends, and
potential insights that can be important. This information by itself is
useful for some fields, especially modeling, improving machine
learning, and enhancing AI algorithms, as it can improve how
information is sorted and understood. However, data science asks
important questions that we were unaware of before while providing
little in the way of answers. By combining data analytics with data
science, we gain additional insights, prediction capabilities, and
tools to apply in practical applications.
When thinking of these two disciplines, it’s important to forget about
viewing them as data science versus data analytics. Instead, we should
see them as parts of a whole that are vital to understanding not just
the information we have, but also how best to analyze it.

Knowledge and Skills for Data Science Professionals


The key function of the data science professional or data scientist is
to understand the data and identify the correct method or methods that
will lead to the desired solution. These methods are drawn from
different fields, including data and big data analysis (visualization
techniques), statistics (statistical modeling) and probability,
computer science and information systems, programming skills, and an
understanding of databases, including querying and database management.
Data science professionals should also have knowledge of many of the
software packages that can be used to solve different types of
problems. Some of the commonly used programs are statistical packages
(R statistical computing software, SAS, and others), relational
database packages (SQL, MySQL, Oracle, and others), and machine
learning libraries. Recently, software that automates machine learning
tasks has become available from software vendors; two known automated
machine learning products are Azure by Microsoft and SAS auto ML.
Figure 1.1 provides a broader view and the key areas of data science.
Figure 1.2 outlines the body of knowledge a data science professional
is expected to have.

Figure 1.1  Broad view of data science with associated areas (machine
learning and artificial intelligence (AI); statistics/mathematics;
computer science/information systems; simulation and management
science; knowledge of databases and data visualization; knowledge of
business and process and related data)


There are a number of off-the-shelf data science software packages and
platforms in use. Using this software requires significant knowledge
and expertise; without proper knowledge and background, off-the-shelf
software cannot be used easily.
(https://innoarchitech.com/blog/what-is-data-science-does-data-scientist-do)23

Some Technologies Used in Data Science


The following is a partial list of technologies used in solving data science
problems. Note that the technologies are from different fields including
statistics, data visualization, programming, machine learning, and big data.

Figure 1.2  Data science body of knowledge (statistics and data
analysis (R statistical programming); math and mathematical modeling;
predictive analytics; machine learning and artificial intelligence
(AI); data visualization; database management and query (SQL); business
and process knowledge; programming (Python))

• Python is a programming language with simple syntax that
is commonly used for data science.34 There are a number of
Python libraries that are used in data science and machine
learning applications including NumPy, pandas, Matplotlib,
scikit-learn, and others.
• R is a programming language that was designed for statistics
and data mining17,30 applications and is one of the popular
application packages used by data scientists and analysts.
• TensorFlow is a framework, developed by Google, for creating
machine learning models and applications.
• PyTorch is another framework for machine learning, developed
by Facebook.
• Jupyter Notebook is an interactive web interface for Python
that allows faster experimentation and is used in machine
learning applications of data science.

• Tableau makes a variety of software that is used for data
visualization.32 Its software is widely used for big data
applications, descriptive analytics, and data visualization.
• Apache Hadoop is a software framework that is used to
process data over large distributed systems.

Career Path for Data Science Professional and Data Scientist
In order to pursue a career in data science, a significant amount of
education and experience is required. As evident from Figure 1.2, a
data scientist requires knowledge and expertise from varied fields. The
field of data science provides a unifying approach by combining varied
areas ranging from statistics, mathematics, analytics, and business
intelligence to computer science, programming, and information systems.
It is rare to find a data science professional with knowledge and
background in all these areas. It is often the case that a data
scientist has a specialization in a subfield. The minimum education
requirement for a data science professional is a bachelor’s degree in
mathematics, statistics, or computer science. A number of data
scientists possess a master’s or a PhD degree in data science with
adequate experience in the field. The application of data science tools
varies depending on the field they are applied to. Note that data
science tools and applications when applied to engineering may be
different from those applied to computer science or business.
Therefore, successful application of the tools of data science requires
expertise and knowledge of the process.

Future Outlook
Data science is a growing field. It continues to evolve as one of the
most sought-after areas by companies. An excellent outlook is provided
in reference24: Davenport, T. H., and D.J. Patil (October 1, 2012).
“Data Scientist: The Sexiest Job of the 21st Century”. Harvard Business
Review (October 2012). ISSN 0017-8012. Retrieved 3 April 2020.

A career in data science was ranked the third-best job in America for
2020 by Glassdoor and the number one best job from 2016 to 2019.29
Data scientists have a median salary of $118,370 per year, or $56.91
per hour.30 These figures vary with level of education and experience
in the field. Job growth in this field is also above average, with a
projected increase of 16 percent from 2018 to 2028.30 The largest
employer of data scientists in the United States is the federal
government, employing 28 percent of the data science workforce.30 Other
large employers of data scientists are computer system design services,
research and development laboratories, big technology companies, and
colleges and universities. Typically, data scientists work full time,
and some work more than 40 hours a week. See references17,26,27 for the
above paragraphs.
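The salary and growth figures above are linked by simple arithmetic, sketched below. The annual salary, hourly wage, and 16 percent growth figure come from the text; the 2,080-hour working year (40 hours for 52 weeks) and the conversion of ten-year growth into an average yearly rate are assumptions made for illustration.

```python
# Assumption: a full-time year of 40 hours x 52 weeks = 2,080 hours
HOURS_PER_YEAR = 40 * 52

median_annual_salary = 118_370  # dollars per year (from the text)
hourly_wage = median_annual_salary / HOURS_PER_YEAR

# Projected job growth of 16 percent over 2018-2028 (from the text),
# expressed as an average yearly rate under assumed steady compounding
ten_year_growth = 0.16
average_yearly_multiplier = (1 + ten_year_growth) ** (1 / 10)

print(f"Hourly equivalent: ${hourly_wage:.2f}")
print(f"Average yearly job growth: {(average_yearly_multiplier - 1) * 100:.1f}%")
```

Under the 2,080-hour assumption, the hourly figure computed here agrees with the $56.91 quoted above.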
The outlook for the data science field looks promising. It is estimated
that 2 to 2.5 million jobs will be created in this area in the next ten
years. The data science area is vast and requires knowledge and
training from different fields. It is one of the fastest growing areas.
Data scientists can have a major positive impact on a business's
success.
Data science continues to evolve as one of the most promising and
in-demand career paths for skilled professionals. Today, successful data
professionals understand that they must advance past the traditional skills
of analyzing large amounts of data, data mining, and programming skills.
In order to uncover useful intelligence for their organizations, data sci-
entists must master the full spectrum of the data science life cycle and
possess a level of flexibility and understanding to maximize returns at
each phase of the process.
Much of the data collected by companies is underutilized. This data,
through meaningful information extraction and discovery, can be used to
make critical business decisions and drive significant business change.
It can also be used to optimize customer success and subsequent
acquisition, retention, and growth.
Businesses and research organizations treat their data as an asset.
Businesses, processes, and companies are run using their data. The data
and variables collected are highly dynamic and change continuously.
Data science professionals are needed to process, analyze, and model
the data, which is usually in big data form, so that it can be
visualized and can help companies make timely data-driven decisions.
“The data science professionals must be trained to understand, clean,
process, and analyze the data to extract value from it. It is also
important to be able to visualize the data using conventional and big
data software in order to communicate data in a meaningful way. This
will enable applying proper statistical, modeling, and programming
techniques to be able to draw conclusions. All these require knowledge
and skills from different areas and these are hugely important skills
in the next decades,” says Hal Varian, chief economist at Google and UC
Berkeley professor of information sciences, business, and economics.3
Demand for data science jobs is expected to grow by 28 percent by 2020
(https://datascience.berkeley.edu/about/what-is-data-science/).

Summary
Data science is a data-driven decision-making approach that uses
several different areas, methods, algorithms, models, and disciplines
with the purpose of extracting insights and knowledge from structured
and unstructured data. These insights are helpful in applying
algorithms and models to make decisions. The models in data science are
used in predictive analytics to predict future outcomes. Businesses
collect massive amounts of data in different forms and by different
means. With the continued advancement in technology and data science,
it is now possible for businesses to store and process huge amounts of
data in their databases. At the core of data science is data. The field
of data science is about using this data in creative and effective ways
to help businesses make data-driven business decisions.
Data science uses several disciplines and areas, including statistical
modeling, data mining, big data, machine learning, and artificial
intelligence (AI), management science, optimization techniques, and
related methods in order to “understand and analyze actual phenomena”
from data.3
Data science also employs techniques and methods from many other
fields, such as mathematics, statistics, computer science, and
information science. Besides the methods and theories drawn from
several fields, data science uses visualization techniques supported by
specially designed big data software and statistical programming
languages, such as R and Python. Data science has wide applications in
the areas of machine learning (ML) and artificial intelligence (AI).
The chapter provided an overview of data science by defining and
outlining the tools and techniques and explained the differences and
similarities between data science and data analytics. The other
concepts related to data science, including analytics, business
analytics, and business intelligence (BI), were discussed. Data science
continues to evolve as one of the most sought-after areas by companies.
The chapter also outlined the career path and job outlook for this
area, which continues to be one of the strongest of all fields. The
field is promising and is showing tremendous job growth.
Index
Addition law Box and Whisker Plot
mutually exclusive events, 279 home sales data, 116
nonmutually exclusive events, income data, 120–121
280–287 interpretation, 116–118
Analytics, 10–11, 21 Box plots, 200
analytical models, 37–38 applications, 192
big data, 35–36 categorical data (see Categorical data)
data mining, 37 displays, 170
Apache Hadoop, 14 exploratory data analysis, 241
Area plot, variation, 197 samples vs. machines, 173
Artificial neural network (ANN), samples vs. operators, 173
430, 435 shaft manufacturing process, 172
utility bill data, 242
Bar charts variations, 192
applications, 193 waiting time data, 170–171
cluster, 175, 193–194 Bubble graph/chart, 130–132
connected lines, 174 Business analytics (BA)
data visualization, 173–176 applications and implementation,
employment status and major, 182 31–32
gender and major, 182 vs. business intelligence (BI), 46–47
monthly sales, 174 business performance, 22
product rating, 180, 181 categories, 21–22
stacked, 176 data mining, 23–24
tally, 177–179 decision making, 22–24
variation, 193 definition, 21
vertical, 174–175 descriptive analytics, 25–26
Bayes’ theorem, 299–300 objectives, 40–41
Big data, 21, 22, 25, 146 overall process, 41
algorithm, 35 predictive analytics, 26–29
analytics, 35–36 prescriptive analytics, 29–31
data analysis, 63–64 statistical analysis, 23
definition, 35 statistics, 148–149
Gartner, 35 tools and algorithms, 20
visualization, 157 Business intelligence (BI), 24
Binomial distribution, 305, 310–311 advanced analytics projects, 45
binomial formula, 312 broad area, 45
binomial table, 312–313 vs. business analytics, 46–47
excel function, 314–315 definition, 45
mean or expected value, 316 statistics, 149
probability calculations, 314
probability of success, 311 Categorical data, 56
standard deviation, 316–317 bar chart
452 Index

clustered column chart, 123, 124 Classical method, probability,


largest Internet companies, 121 273–274
Pareto chart, 122 Classification, 436
revenue of Amazon, 126 Class-width, 162
stacked area chart, 124–125 Cluster bar chart, 175, 194
stacked column chart, 123, 124 Clustered column chart, 123
variations, 122–123 Clustering technique, 427–428, 436
vertical bar, 121 Cluster sampling, 359
bubble graph/chart, 130–132 Coefficient of correlation, 100–102
data visualization (see Data calculation, 249–251
visualization) covariance, 251
line chart, 125–126 linear relationship, 247–248
pie chart scatterplots, 252
bar, 127–128 Coefficient of variation, 224–226,
pie, 129 230
U.S. market share, 126–128 Complete contingency table, 265
scatter plot Comprehensive R Archive Network
fitted line plot, 130, 131 (CRAN), 439–340
sales and profit, 129, 130 Conditional probability
time series plot, 132–133 statistical dependence, 293–297
Census, 354 statistical independence, 292
Central limit theorem, 369–370, 419 Confidence interval estimate
sample mean, 370, 371 difference between two means,
sampling, 372–373 396–399
sampling distribution, 377 interpretation, 385
Central tendency mean, 386–387, 389
mean, 203–207 normal distribution, 388
mean, median and mode point estimate, 384
comparison, 209–211 population proportion, 389–390
median, 207–208 sample size determination, 390–395
mode, 209 sampling error, 385
symmetrical distribution, 203 t-distribution, 389
Charts and graphs. See also Data Contingency table
visualization categorical variables, 263–264
applications, 190–198, 194 complete, 265
area plot, 197 example, 263
bar chart, 193 incomplete, 265
box-plot, 192 marginal distribution, 264
dot plot, 192 movie preference by gender,
histogram, 190–191 263–264
line graph, 194 row and column totals, 264
pie chart, 195 Continuous data, 57, 152
probability plot, 198 Continuous probability distributions
sequence plot, 197 exponential distribution, 347–349
stem-and leaf plot, 191 location parameter, 346
symmetry plot, 198 normal distribution, 332–336
Chebyshev’s theorem, 232–234 probability density function, 328,
Class frequency, 160–161 331–332
Index 453

random variables, 328–331 data analysis, 64


scale parameter, 346 descriptive, 432
shape parameter, 346 financial applications, 149
standard normal distribution, knowledge discovery in databases,
336–338 432
Continuous random variables, 304, 305 machine learning, 431–432
Counting rules, 266–268 predictive, 432
Covariance statistics, 148
calculation, 246–247 Data point, 56
interpretation, 247 Data quality, 54–55, 65
limitation, 247 Data quality assurance (DQA), 55, 65
sales and advertising, 245–246 Data science, 40
sample, 245 body of knowledge, 13
Cross-sectional data, 56, 152 vs. data analytics, 10–11
data visualization (see Data
Data visualization)
categorical, 56 definition, 6, 7
continuous, 57, 152 history, 8–9
cross-sectional, 56, 152 knowledge of statistics, 3
discrete, 57 professional/scientist
elements, 56–57 career path, 14
levels of measurement, 58–60 knowledge and skills, 11–12
qualitative, 56, 152 statistics, 7–8
quantitative, 56, 152 structured data, 4
time series, 56, 152 techniques and methods, 16–17
variable, 57 technologies used, 12–14
Data analysis tools, 6
big data, 63–64 unstructured data, 4
data classification, 56–58, 61 Data set, 56
data cleansing, 53 Data transformation, 53
data mining, 64 Data visualization
data preparation, 53–54 bar charts, 173–176
data quality, 54–55 big data analysis, 157
data transformation, 53 box plots, 170–173
data warehouse, 54, 64 business intelligence, 156
developments, 52–53 categorical data
scripting/script language, 53 bar chart, 177
Data cleansing, 53 cross tabulation, 179–180
Data collection example, 176
experimental design, 63 interval plot, 184–185
government agencies, 62 pie chart, 181–184
processes, 63 sequence plot, 187–190
telephone/mail surveys, 63 time series plots, 185–187
web data source, 62 dashboards, 157
Data-driven decision-making data collection, 158–159
approach, 16 data organization, 159–160
Data mining, 21, 23–24, 27, 33, 37 frequency distribution, 160–165
current developments, 139–140 graphical presentation, 157, 158
454 Index

graphical techniques, 158 Poisson distribution, 305, 317–321


stem-and-leaf plot, 167–170 random variable, 302, 304–306
tableau and Olick, 157 single die rolling, 303–304
techniques, 6 standard deviation, 309
tools and software, 156 statistical experiments, 302
variation, graphical display, 166, tabular form, 303
167 variance, 309
Data warehouse (DW), 54, 64 Disjoint events, 260
Deep learning, 430–431, 435 Dot plot, 192
Dell EMC project, 4
Descriptive analytics, 22 Empirical rule
big data, 25 areas under the normal curve, 235
dashboards, 36 normal curve, 239–240
data visualization (see Data percent of observations, 236
visualization) standard normal table, 239
numerical methods, 25 symmetrical/bell-shaped
tools, 26, 31, 32, 42 distribution, 234–235
visual and simple analysis, 25 verification, 104–108
Descriptive data mining, 432 z-score, 236–239
Descriptive statistics, 143–144, 153 Empty set, 276
Excel, 244 Empty set, 276
coefficient of correlation, 125–126
100–102 Enterprise data warehouse (EDW), 54
computation, 96–97 Equality of sets, 276
covariance, 100–102 Equally likely events, 260
empirical rule verification, Estimates
104–108 confidence interval estimate (see
home sales, 117 Confidence interval estimate)
measures, 94–95 interval estimate, 384
numerical measures, 95–96 interval estimates, 399–400
random numbers generation, point estimates, 383, 399–400
102–104 Events
statistics tool, 97–99 disjoint events, 260
Z-score, 99–100 equally likely, 260
MINITAB, 214–216 example, 259
utility bill data, 242, 243–244 exhaustive, 260
Dimensionality reduction algorithm, mutually exclusive, 260
428 Excel
Discrete data, 57 Box and Whisker Plot
Discrete probability distributions home sales data, 116
binomial distribution, 305, income data, 120–121
310–317 interpretation, 116–118
customer arrival pattern, 301–302 buttons and tabs, 68–69
expected value, 308–309 chart editing, 74
geometric distribution, 325–327 data entering, 70–72
graphical display, 303 data formatting
hypergeometric probability blank column insertion, 80–81
distribution, 321–325 conditional formatting, 82–83
as currency, 81 fitted distribution, 191
filtering, 77–79 frequency distribution, 163–165
profit calculation, 81–82 home price, 164–166
Sales and Ad Data, 76 home sales data
sorting, 79–80 data sorting, 110–111
table, 77 frequency table, 111
data saving, 71 home sales distribution, 108–109
descriptive statistics (see Descriptive interpretation, 113
statistics, Excel) shape approximation, 112
edited chart with data, 76 user-specified bins, 112
graph printing, 75 income data
histogram (see Histogram) negatively skewed distribution,
Move Chart, 73–74 120
Office 365, 68 Pareto chart, 118, 119
Pareto chart, 109, 110 positively skewed data, 119
pivottable/pivot chart (see pivot chart and pivot table option
Pivottable/pivot chart) aggregation changes, 115
Poisson distribution, 321 frequency distribution table, 114
saved file retrieving, 71–72 frequency table, 116
spreadsheet program, 67 row labels grouping, 115–116
time series plot, 72–73 Hypergeometric probability
Exhaustive events, 260 distribution
Exploratory data analysis binomial distribution, 321–322
box plot, 241, 242 excel, 323
descriptive statistics, 243–244 Excel, 323
outliers detection, 243 probability of success, 323–324
sorted data, 241–242 Hypothesis, 401
Exponential distribution, 347–349 Hypothesis testing
critical values and decision areas,
Finite population correction factor, 416
377 error types, 404
Fitted line plot, 130, 131 left-sided test, 406
Frequency distribution, 146 population parameter, 401
class frequency, 160–161 power of the test, 403
class interval, 163 p-value approach, 410–412,
class-width, 162 417–418
grouping/forming, 161–162 rejection and nonrejection areas, 404
histogram, 163–165 right-sided test, 406–407
single population mean, 402–403,
Geometric distribution, 325–327 405
Graphical tools, 150 two population means, 412–416
Graphs two-sided/one-sided hypothesis,
cumulative frequency, 191 418–419
data summary, 191 two-sided test, 407–409
variation, 191
Incomplete contingency table, 265
Histogram Inferential statistics, 144–145, 153
applications, 190–191 confidence level, 382
definition, 381 Python, 434
estimation, 382 reinforcement learning, 430
interval estimate, 384 Scikit-learn, 434
point estimates, 383 semisupervised machine learning
tools, 381–382 algorithms, 429
Institute of Operations Research supervised learning, 425–426
and Management Science TensorFlow, 434
(INFORMS), 40 unsupervised learning, 426–429
Interference problems, 145 Matplotlib, 435
International Data Group (IDG), 4 Mean
Internet of Things (IoT), 37 average wage data, 204–205
Interquartile range (IQR), 212 disadvantages, 206
calculation, 227, 230–231 formula, 204
salary data, 227–228 sample and population, 205
third quartile, 228 weighted, 206–207
Interval estimates Measures of variation
definition, 384 coefficient of variation, 224–226
different population proportions, interquartile range, 227–231
399–400 range, 219–220
Interval plot, 184–185 standard deviation, 223–224
application, 196 variance, 220–223
variation, 196 Median, 207–208
Interval scale, 60 Microsoft machine learning software,
147
Joint probability, 290–291 Mode, 209
statistical dependence, 297–299 Mutually exclusive events, 260
statistical independence, 290–291
Jupyter Notebook, 13, 434 Negatively skewed distribution, 210,
211
Knowledge discovery in databases Neural network (NN), 430, 435
(KDD), 431 Nominal scale, 59
Nonmutually exclusive events,
Levels of measurement, 58–60 280–287
Line chart, 125–126 Nonprobability sampling, 375
Line graph applications, 194 Normal distribution
area property, 333, 334
Machine learning (ML), 12, 37 calculations, 338–339, 341–343
algorithms, 424 maximum probability, 332
analytical models, 424 normal curve, 333
applications, 424 parameters, 340–341
artificial neural network, 430 probability density function, 332
data mining, 431–432 same mean and different standard
deep learning, 430–431 deviations, 334–335
Jupyter Notebook, 434 same standard deviation and
Matplotlib, 435 different mean, 335
NumPy, 435 statistical theory, 335–336
Pandas, 434 Null set, 276
problems and tasks, 433 Numerical methods
bivariate relationship, 245 crosstabulation, 92–94
central tendency, 203–207 pie chart, 92
coefficient of correlation, 247–251 revenue and profit, 88–90
covariance, 245–247 Sales and Ad Data, 85–86
exploratory data analysis, 241–244 Point estimates
mean and standard deviation definition, 383
Chebyshev’s theorem, 232–234 different population proportions,
empirical rule, 234–241 399–400
measures of variation (see Measures Poisson distribution, 305
of variation) binomial distribution, 317
Numerical Python, 435 car arrival probability, 318
Numeric data, 58 characteristics, 318
Excel function, 321
Olick, 157 number of occurrences, 318
Ordinal scale, 59 probability density function, 317
two-car accidents, 319–321
Pandas, 434 Population, 144–145, 152
Parameter, 153 mean, 205
Pareto chart, 109, 110 variance, 220
income data, 119 Positively skewed distribution, 210,
largest Internet companies, 122 211
Pearson correlation, 251 Posterior probabilities, 300
Percentiles and quartiles Predictive analytics, 22
box plot, 212 applications, 29, 32
calculation, 212–213 data mining, 27, 33
five-measure data summary, 212 prerequisite, 28
interquartile range, 212 regression models, 27
monthly income, 215 techniques, 36
sorted data, 213 time series forecasting, 27
sorted income data, 215 tools of, 27–29, 31, 43
Permutations, 268–270 Predictive data mining, 432
Pie chart Prescriptive analytics
bar, 127–128, 195 flow chart, 30
pie, 129, 183–184, 195 operations management tools,
U.S. federal budget, 182–183 29–30
U.S. market share, 126–128 tools, 31, 33, 36, 44
variation, 195 Prior probability, 300
variations, 183 Probability, 145
Pivottable/pivot chart Addition Law
crosstabulation, 89 mutually exclusive events, 279
data analysis, 83–84 nonmutually exclusive events,
dimensions adding, 84, 85 280–287
for each day, 86–87 applications, 256
monthly revenue plotting, 87–88 Bayes’ theorem, 299–300
months and sales values, 84, 85 classical method, 273–274
qualitative/categorical data combination, 270–277
bar chart, 92 contingency table application,
brand television sets, 89–91 262–265
counting rules, 266–268 Regression models, 27
distributions, 150, 307 Reinforcement learning, 430
event A, 257 Relative frequency, 274–275
events, 259–260 R statistical analysis, 13
experiment, 257 R statistical programming software
permutations, 268–270 and R-Studio
random event, 255–256 Comprehensive R Archive
relative frequency, 274–275 Network, 439–440
sample space, 257 console window, 441
defective/non defective, 258 different windows, 440
examples, 262 environment window, 441–442
experiments, 257–259 script window, 441
six-sided die, 257, 258 statistical features, 438
tossing three coins, 260–261
sets, 275–279 Sample, 152
statistical dependence, 292–299 mean, 205
statistical independence, 289–291 statistics, 145
subjective probability, 275 variance, 220, 221
uncertainty, 255 Sampling
Probability density function nonprobability, 375
continuous probability population parameter, 374
distributions, 328, 331–332 probability, 375
normal distribution, 332 Sampling distributions, 419
Probability mass function (PMF), 306 bias, 355
Probability plot, 198 census, 354
Probability sampling methods, 375 exponential distribution, 368
cluster sampling, 359 formula for, 379
simple random sampling, 356–357 Medicaid, 352
stratified sampling, 358 population, 359
systematic sampling, 358 population parameter estimation,
p-value approach, 410–412, 417–418 355
Python, 13, 434 probability distribution, 376
PyTorch, 13 probability sampling methods
(see Probability sampling
Qualitative data, 56 methods)
Qualitative variable category, 59–60 process of, 360
Quantitative data, 56, 152 risks, 354–355
sample mean
Random variables, 302, 328–331 different populations, 366–367
continuous probability distribution, 361
distributions, 328–331 histogram, 362
discrete probability distributions, mean and standard deviation,
302, 304–306 362–363
Range, 219–220 population, 360
Rank order, 160 sample proportion, 360
Ratio scale, 60 standard deviation, 363–366
Raw data, 63, 160 samples, 353
Regression, 426 sampling error, 359
survey methods, 352–354 simple probability, 289
uniform distribution, 368–369 Statistical inference, 145, 150, 359,
uninsured rate, 351 381
Sampling error, 354–355, 375 Statistical programming language, 5
Scatter plot Statistical techniques, 56
fitted line plot, 130, 131 Statistical thinking, 147–148
sales and profit, 129, 130 Statistics, 7–8, 52
Scikit-learn, 434 characteristics, 141–142
Scripting/script language, 53 computers role, 146–147
Semisupervised machine learning data analysis, 139–140
algorithms, 429 data collection, 140
Sequence plot, 197 data mining, 148
measurements on machined parts, decision making, 150
188 definition, 141
pizza delivery time, 189–190 descriptive, 143–144, 151
process over time, 187 fields, 137–138, 151
specification limits, 189 frequency distribution, 146
Sets gaining skills, 138
complement of set A, 276–277 inferential, 144–145, 152
equality, 276 manufacturing, United States,
intersection, 278–279 138–139
union of sets A and B, 277–278 mathematics, 141–142
universal, 276 population, 142
Venn diagram, 276, 277 population parameters and
Simple data analysis tools, 63 symbols, 202
Simple probability, 289 retail business data, 149
Simple random sampling, 356–357 statistical thinking, 147–148
Skewness, 210 student loan debt, United States,
Stacked area chart, 124–125 138
Stacked bar chart tools, 52
application, 194 uses and application areas, 142
carbon dioxide emissions by sector, US households income level,
176 200–202
Stacked column chart, 123 variable, 151
Standard deviation, 222, 223–224 variation, 140–141, 151
binomial distribution, 316–317 Stem-and-leaf plot, 167–170, 192
sample mean, 363–366 Stratified sampling, 358
Standard error of the mean, 376–377 Structured data, 4, 38, 54, 65
Standardized value, 240–241 Subjective probability, 275
Standard normal distribution, Supervised learning, 425–426, 436
336–338, 344–345 Symmetrical data distribution, 210
Statistical dependence, 292 Symmetry plot, 198
conditional probability, 293–297 Systematic sampling, 358
joint probability, 297–299
marginal probability, 299 Tableau, 14, 157
Statistical independence, 289–291 Telephone/mail surveys, 63
conditional probability, 292 TensorFlow, 13, 434
joint probability, 290–291 Time series data, 56, 152
Time series forecasting, 27 Variable, 151
Time series plots, 72–73, 132–133 Variance
applications, 194 calculation, 221–222
demand data, 185, 186 dollar value, 222
sales, 186 features, 223
seasonal pattern, 187 population, 220
trend, 187 sample, 220, 221
variation, 196–197 Variations, 140–141. See also
Measures of variation
Uncertainty, 255 bar chart, 122–123
Ungrouped data, 160 graphical display, 166, 167
Universal set, 276 Venn diagram, 276
Unstructured data, 4, 38, 54, 65 Vertical bar chart, 174–175
Unsupervised learning, 436
classification vs. clustering, 428–429 Waiting time data, boxplot, 170–171
clustering, 427–428 Weighted mean, 206–207
dimensionality reduction, 428
goal, 426 Z-score, 99–100, 237–238