Study Unit 14: F.4. Data Analytics


Data analytics is the process of gathering and analyzing data in a way that produces meaningful information that can be used to aid in decision-making. As businesses become more technologically sophisticated, their capacity to collect data increases. However, the stockpiling of data is meaningless without a method of efficiently collecting, aggregating, analyzing, and utilizing it for the benefit of the company.

Data analytics can be classified into four types: descriptive analytics, diagnostic analytics, predictive
analytics, and prescriptive analytics.

1) Descriptive analytics report past performance. Descriptive analytics are the simplest type of data analytics, and they answer the question, "What happened?"

Example: Sales revenue may have recently declined. Descriptive analytics are used to recognize the situation.

2) Diagnostic analytics are used with descriptive analytics to answer the question, "Why did it happen?" The historical data is mined to understand the past performance and to look for the reasons behind success or failure. For example, sales data might be broken down into segments such as revenue by region or by product rather than revenue in total.

Example: If the decline in sales is centered in a particular product line, diagnostic analytics can be used to find possible reasons for the decline by analyzing relationships between the sales decline and other data such as poor product reviews, a price increase, or a competitor's price decrease. (A short code sketch of this kind of segment-level breakdown follows this list.)

3) Predictive analytics focus on the future using correlative112 analysis. Predictive analytics answer the question, "What is likely to happen?" Historical data is combined with other data using rules and algorithms. Large quantities of data are processed to identify patterns and relationships between and among known random variables or data sets. Those patterns and relationships are then used to make predictions about what is likely to occur in the future.

Example: A sales forecast made using past sales trends is a form of predictive analytics. In the
event of a sales decline in a particular product, predictive analytics can be used to determine
whether the sales decline is likely to continue.

4) Prescriptive analytics answer the question "What needs to happen?" by charting the best course of action based on an objective interpretation of the data. Prescriptive analytics make use of structured and unstructured data and apply rules to predict what will happen and to help management decide how best to take advantage of the predicted events to create added value. Prescriptive analytics are the type of analytics likely to yield the most impact for an organization, but they are also the most complex type of analytics.

Example: Prescriptive analytics might generate a sales forecast and then use that information to determine what is needed to address the forecasted situation. If sales of a particular product have declined, prescriptive analytics could help management develop plans to reverse the decline. Prescriptive analytics can incorporate new data and re-predict and re-prescribe, as well.

112 If two things are correlated with one another, it means there is a close connection between them. It may be that one of the things causes or influences the other, or it may be that something entirely different is causing or influencing both things that are correlated.
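For illustration, the segment-level breakdown described in the diagnostic analytics item above might look like the following sketch in Python using the pandas library. The column names and figures here are invented for this example, not taken from the text.

import pandas as pd

# Hypothetical sales records: each row is one transaction.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "West", "West"],
    "product": ["A", "B", "A", "B", "A", "B"],
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "revenue": [120_000, 95_000, 80_000, 70_000, 105_000, 60_000],
})

# Descriptive: total revenue per year ("What happened?").
print(sales.groupby("year")["revenue"].sum())

# Diagnostic: break revenue down by region and product
# ("Why did it happen?" -- which segment drove the change?).
print(sales.groupby(["year", "region", "product"])["revenue"].sum())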


Business Intelligence (BI)


Business intelligence is the combination of architectures, analytical and other tools, databases, applications,
and methodologies that enable interactive access—sometimes in real time—to data such as sales revenue,
costs, income, and product data. Business intelligence provides historical, current, and predicted values for
internal, structured data regarding products and segments. Further, business intelligence gives managers
and analysts the ability to conduct analysis to be used to make more informed strategic decisions and thus
optimize performance.

The business intelligence process involves the transformation of data into information, then to knowledge,
then to insight, then to strategic decisions, and finally to action.

• Data is facts and figures, but data alone is not information.

• Information is data that has been processed, analyzed, interpreted, organized, and put into context, such as in a report, so that it is meaningful and useful.

• Knowledge is the theoretical or practical understanding of something. It is facts, information, and skills acquired through experience or study. Thus, information becomes knowledge through experience, study, or both.

• Insight is a deep and clear understanding of a complex situation. Insight can be gained through perception or intuition, but it can also be gained through use of business intelligence: data analytics, modeling, and other tools.

• The insights gained from the use of business intelligence lead to recommendations for the best
action to take. Strategic decisions are made by choosing from among the recommendations.

• The strategic decisions made are implemented and turned into action.

A Business Intelligence system has four main components:

1) A data warehouse (DW) containing the source data.

2) Business analytics, that is, the collection of tools used to mine, manipulate, and analyze the data
in the DW. Many Business Intelligence systems include artificial intelligence capabilities, as well as
analytical capabilities.

3) A business performance management component (BPM) to monitor and analyze performance.

4) A user interface, usually in the form of a dashboard.

Note: A dashboard is an information management tool. It is a screen in a software application, a browser-based application, or a desktop application, and it organizes and displays in one place information relevant to a given objective or process, or for senior management, it may show patterns and trends in data across the organization.

A dashboard is typically connected to an underlying data source so it can be updated continuously. It frequently has an interactive element that enables a user to view the data in different ways, drill down to learn more about a particular data point, sort the data, or filter the data to exclude portions of it.

For example, a dashboard for a manufacturing process might show productivity information for a period,
variances from standards, and quality information such as the average number of failed inspections per
hour. For senior management, it might present key performance indicators, balanced scorecard data, or
sales performance data, to name just a few possible metrics that might be chosen.

A dashboard for a senior manager may show data on manufacturing processes, sales activity, and current
financial metrics.


Big Data and the Four “V”s of Big Data


Big Data refers to vast datasets that are too large to be analyzed using standard software tools and so require new processing technologies. Those new processing technologies are collectively referred to as data analytics.

Big Data can be broken down into three categories:

1) Structured data is in an organized format that enables it to be input into a relational database
management system and analyzed.

2) Unstructured data has no defined format or structure. It is typically free-form and text-heavy,
making in-depth analysis difficult.

3) Semi-structured data has some format or structure but does not follow a defined model. It
cannot be organized in a relational database, or it does not have a strict structural framework,
although it does have some structural properties or a loose organizational framework.

Examples of Structured, Unstructured, and Semi-structured Data

Structured:
• Transaction data
• Customer data
• Financial data
• Employee data
• Vendor data
• Data from RFID (radio-frequency identification) tags

Unstructured:
• Word processing documents
• Emails
• Call center communications and other customer service interactions
• Contracts
• Audio and video recordings
• Untagged photos
• Online reviews
• Web pages

Semi-structured:
• XML (Extensible Markup Language) files
• CSV (Comma-Separated Values) files
• HTML (Hyper-Text Markup Language) files
• Server logs
• Tweets organized by hashtags
• Photos containing meta tags relating to the location, date, or by whom taken
• Social media information that may be organized by user, friends, groups, and so forth
• Folders organized by topic
• Text organized by subject or topic, although the text itself has no structure

Note: Emails are usually considered unstructured data because the message field is free-form and text-heavy. However, emails may sometimes be considered semi-structured data because they do have structure according to sender, recipient, subject, and date; and they may be automatically categorized with the help of machine learning, for example as spam.

Big Data is characterized by four attributes, known as the four V’s: volume, velocity, variety, and veracity.

1) Volume: Volume refers to the amount of data that exists. The volume of data available is increasing exponentially as people and processes become more connected, creating problems for accountants. The tools used to analyze data in the past—spreadsheet programs such as Excel and database software such as Access—are no longer adequate to handle the complex analyses that are needed. Data analytics is best suited to processing immense amounts of data.


2) Velocity: Velocity refers to the speed at which data is generated and changed, also called its flow
rate. As more devices are connected to the Internet, the velocity of data grows, and organizations
can be overwhelmed with the speed at which the data arrives. The velocity of data can make it
difficult to discern which data items are useful for a given decision. Data analytics is designed to
handle the rapid influx of new data.

3) Variety: Variety refers to the diverse forms of data that organizations create and collect. In the
past, data was created and collected primarily by processing transactions. The information was in
the form of currency, dates, numbers, text, and so forth. It was structured, that is, it was easily
stored in relational databases and flat files. However, today unstructured data such as media files,
scanned documents, Web pages, texts, emails, and sensor data are being captured and collected.
These forms of data are incompatible with traditional relational database management systems
and traditional data analysis tools. Data analytics can capture and process diverse and complex
forms of information.

4) Veracity: Veracity is the accuracy of data, or the extent to which it can be trusted for decision-
making. Data must be objective and relevant to the decision at hand if it is to have value for use
in making decisions. However, various distributed processes—such as millions of people signing up
online for services or free downloads—generate data, and the information they input is not subject
to controls or quality checks. If biased, ambiguous, irrelevant, inconsistent, incomplete, or even
deceptive data is used in analysis, poor decisions will result. Controls and governance over data to
be used in decision-making are essential to ensure the data’s accuracy. Poor-quality data leads to
inaccurate analysis and results, commonly referred to as “garbage in, garbage out.”

Some data experts have added two additional Vs that characterize data:

5) Variability: Data flows can be inconsistent; for example, they can exhibit seasonal peaks. Furthermore, data can be interpreted in varying ways. Different questions require different interpretations.

6) Value: Value is the benefit that the organization receives from data. Without the necessary data analytics processes and tools, the information is more likely to overwhelm an organization than to help the organization. The organization must be able to determine the relative importance of different data to the decision-making process. Furthermore, an investment in Big Data and data analytics should provide benefits that are measurable.

Data Science
Data science is a field of study and analysis that uses algorithms and processes to extract hidden knowledge and insights from data. The objective of data science is to use structured, unstructured, and semi-structured data to extract information that can be used to develop knowledge and insights for forecasting and strategic decision making.

The difference between data analytics and data science is in their goals.

• The goal of data analytics is to provide information about issues that the analyst or manager either
knows or knows he or she does not know (that is, “known unknowns”).

• On the other hand, the goal of data science is to provide actionable insights into issues where the analyst or manager does not know what he or she does not know (that is, "unknown unknowns").

Example: Data science would be used to try to identify a future technology that does not exist today
but that will impact the organization in the future.

Decision science, machine learning (that is, the use of algorithms that learn from the data to predict
outcomes), and prescriptive analytics are three examples of means by which actionable insights can be
discovered in a situation where “unknowns are unknown.”


Data science involves data mining, analysis of Big Data, data extraction, and data retrieval. Data science
draws on knowledge of data engineering, social engineering, data storage, natural language processing,
and many other fields.

The size, value, and importance of Big Data has brought about the development of the profession of data
scientist. Data science is a multi-disciplinary field that unifies several specialized areas, including statistics,
data analysis, machine learning, math, programming, business, and information technology. A data scientist
is a person with skills in all the areas, though most data scientists have deep skills in one area and less
deep skills in the other areas.

Note: “Data mining” involves using algorithms in complex data sets to find patterns in the data that can
be used to extract usable data from the data set. Data mining is discussed in more detail below.

Data and Data Science as Assets


Data and data science capabilities are strategic assets to an organization, but they are complementary
assets.

• Data science is of little use without usable data.

• Good data cannot be useful in decision-making without good data science talent.

Good data and good data science, used together, can lead to large productivity gains for a company and
the ability to do things it has never done before. Data and data science together can provide the following
opportunities and benefits to an organization:

• They can enable the organization to make decisions based on data and evidence.

• The organization can leverage relevant information from various data sources in a timely manner.

• When the cloud is used, the organization can get the answers it needs using any device, any time.

• The organization can transform data into actionable insights.

• The organization can discover new opportunities.

• The organization can increase its competitive advantage.

• Management can explore data to get answers to questions.

The result can be maximized revenue, improved operations, and mitigated risks. The return on investment
from the better decision-making that results from using data and data science can be significant.

As with any strategic asset, it is necessary to make investments in data and data science. The investments
include building a modern business intelligence architecture using the right tools, investing in people with
data science skills, and investing in the training needed to enable the staff to use the business intelligence
and data analytics tools.

Challenges of Managing Data Analytics


Some general challenges of managing data analytics include data capture, data curation (that is, the organization and integration of disparate data collected from various sources), data storage, security and privacy protection, data search, data sharing, data transfer, data analysis, and data visualization.

In addition, some specific challenges of managing data analytics include:

• The growth of data and especially of unstructured data.

• The need to generate insights in a timely manner for the data to be useful.

• Recruiting and retaining Big Data talent. Demand has increased for data engineers, data scientists,
and business intelligence analysts, causing higher salaries and creating difficulty filling positions.


Study Unit 15: F.4. Data Mining


Data mining is the use of statistical techniques to search large data sets to extract and analyze data to discover previously unknown, useful patterns, trends, and relationships within the data that go beyond simple analysis and that can be used to make decisions. Data mining uses specialized computational methods derived from the fields of statistics, machine learning, and artificial intelligence.

Data mining involves trying different hypotheses repeatedly and making inferences from the results that can be applied to new data. Data mining is thus an iterative process. Iteration is the repetition of a process to generate a sequence of outcomes. Each repetition of the process is a single iteration, and the outcome of each iteration is the starting point of the next iteration.113

Data mining is a process with defined steps, and thus it is a science. Science is the pursuit and application
of knowledge and understanding of the natural and social world following a systematic methodology based
on evidence.114

Data mining is also an art. In data mining, decisions must be made regarding what data to use, what tools
to use, and what algorithms to use. For example, one word can have many different meanings. In mining
text, the context of words must be considered. Therefore, instead of just looking for words in relation to
other words, the data scientist looks for whole phrases in relation to other phrases. The data scientist must
make thoughtful choices to get usable results.

Data mining differs from statistics. Statistics focuses on explaining or quantifying the average effect of an input or inputs on an outcome, such as determining the average demand for a product based on some variable like price or advertising expenditures. Statistical analysis includes determining whether the relationships observed could be a result of the variable or could be a matter of chance instead.

An example of statistical analysis is a linear regression model that relates total historical sales revenues
(the dependent variable) to various levels of historical advertising expenditures (the independent variable)
to discover whether the level of advertising expenditures affects total sales revenues.

Statistics may involve using a sample from a dataset to make predictions about the population. Alternatively, it may involve using the entire dataset to estimate the best-fit model to maximize the information available about the hypothesized relationship in the population and predict future results.

In contrast, data mining involves open-ended exploring and searching within a large dataset without putting
limits around the question being addressed. The goal is to predict outcomes for new individual records. The
data is usually divided into a training set and a validation set. The training set is used to estimate the model,
and the validation set is used to assess the model’s predictive performance on new data.

Data mining might be used to classify potential customers into different groups to receive different marketing approaches based on some characteristic common to each group that is yet to be discovered. It may be used to answer questions such as what specific online advertisement should be presented to a particular person browsing on the Internet based on their previous browsing habits and the fact that other people who browsed the same topics purchased a particular item.

Thus, data mining involves generalization of patterns from a data set. “Generalization” is the ability to
predict or assign a label to a “new” observation based on a model built from experience. In other words,
the generalizations developed in data mining should be valid not just for the data set used in observing the
pattern but should also be valid for new, unknown data.

Software used for data mining uses statistical models, but it also incorporates algorithms that can "learn" from the patterns in the data. An algorithm is applied to the historical data to create a mining model, and then the model is applied to new data to create predictions and make inferences about relationships.

113 Definition of "iteration" from Wikipedia, https://en.wikipedia.org/wiki/Iteration, accessed May 8, 2019.
114 Definition of "science" from the Science Council, https://sciencecouncil.org/about-science/our-definition-of-science/, accessed May 8, 2019.


For example, data mining software can help find customers with common interests and determine which products customers with each particular interest typically purchase, to direct advertising messages about specific products to the customers who are most likely to purchase those products.

Data mining is used in predictive analytics. Basic concepts of predictive analytics include:

• Classification – Any data analysis involves classification, such as whether a customer will purchase or not purchase. Data mining is used when the classification of the data is not known. Similar data where the classification is known is used to develop rules, and then those rules are applied to the data with the unknown classification to predict what the classification is or will be.

Example: Customers may be classified as predicted purchasers or predicted non-purchasers.

• Prediction – Prediction is similar to classification, but the goal is to predict the value of a variable.
Although classification also involves prediction, “prediction” in this context refers to prediction of a
numerical value, which can be an integer (a whole number such as 1, 2, or 3) or a continuous
variable.115

Example: Prediction involves not just classifying customers as predicted purchasers or predicted non-purchasers, but for those who are predicted purchasers, predicting the amount of their purchases.

• Association rules – Also called affinity analysis, association rules are used to find patterns of
association between items in large databases, such as associations among items purchased from
a retail store, or “what goes with what.”

Example: When customers purchase a 3-ring notebook, do they usually also purchase a package of 3-hole punched paper? If so, the 3-hole punched paper can be placed on the store shelf next to the 3-ring notebooks.

• Online recommendation systems – In contrast to association rules, which generate rules that apply to an entire population, online recommendation systems use collaborative filtering to deliver personalized recommendations to users. Collaborative filtering generates rules for "what goes with what" at the individual user level. Recommendations can be made to individuals based on their historical purchases, online browsing history, or other measurable behaviors that indicate their preferences, as well as other users' historical purchases, browsing, or other behaviors.

Example: Online purchasers may be presented with a suggestion for another product to go
with what they just ordered because other people who bought what they just ordered have
ordered this other product, as well.

• Data reduction – Data reduction is the process of consolidating a large number of records into a
smaller set by grouping the records into homogeneous groups.

• Clustering – Clustering is discovering groups in data sets that have similar characteristics without
using known structures in the data or fixed groups. Clustering can be used in data reduction to
reduce the number of groups to be included in the data mining algorithm.

Example: Customer segmentation is a form of clustering. Customers can be segmented according to geographic region, demographics such as age or gender, behavior such as usage level of a product or service, or lifestyle such as outdoor enthusiast or sports fan. (A short code sketch of clustering-based segmentation follows this list.)

115 A continuous variable is a numerical variable that can take on any value at all. It does not need to be an integer such as 1, 2, 3, or 4, though it can be an integer. A continuous variable can be 8, 8.456, 10.62, 12.3179, or any other number, and the variable can have any number of decimal points.


• Dimension reduction – Dimension reduction entails reducing the number of variables in the data
before using it for data mining to improve its manageability, interpretability, and predictive ability.

• Data exploration – Data exploration is used to understand the data and detect unusual values.
The analyst explores the data by looking at each variable individually and looking at relationships
between and among the variables to discover patterns in the data. Data exploration can include
creating charts and dashboards, called data visualization or visual analytics (see next item). Data
exploration can lead to the generation of a hypothesis.

• Data visualization – Data visualization is another type of data exploration. Visualization, or visual discovery, consists of creating graphics such as histograms and boxplots for numerical data to visualize the distribution of the variables and to detect outliers116 in order to gain insights to support better decisions.

Example: Pairs of numerical variables can be plotted on a scatter plot117 graph to discover possible relationships. When the variables are categorical, bar charts can be used.

Visualization is covered in detail later in this section.
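As a concrete illustration of the clustering concept referenced in this list, the following is a minimal customer-segmentation sketch using scikit-learn's KMeans. The library choice, feature names, and data are illustrative assumptions, not something the text prescribes.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [age, annual usage level of a product].
customers = np.array([
    [22, 140], [25, 150], [27, 135],   # younger, heavy-usage customers
    [48,  40], [52,  35], [55,  50],   # older, light-usage customers
])

# Discover two groups without any predefined labels (unsupervised).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)

print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the "typical" member of each segment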

Supervised and Unsupervised Learning in Data Mining


Supervised learning algorithms are used in classification and prediction. To “train” the algorithm, it is
necessary to have a dataset in which the value of the outcome to be predicted is already known, such as
whether the customer made a purchase. The dataset with the known outcome is called the training data
because that dataset is used to “train” the algorithm. The data in the dataset is called labeled data because
it contains the outcome value (called the label) for each record. The classification or prediction algorithm
“learns” or is “trained” about the relationship between the predictor variables and the outcome variable in
the training data. After the algorithm has “learned” from the training data, it is tested by applying it to
another sample of labeled data for which the outcome is already known but is initially hidden (called the
validation data) to see if it works properly. If several different algorithms are being tested, additional test
data with known outcomes should be used with the selected algorithm to predict how well it will work. After
the algorithm has been thoroughly tested, it can be used to classify or make predictions in data where the
outcome is unknown.

Example: Simple linear regression is an example of a basic supervised learning algorithm. The x variable, the independent variable, serves as the predictor variable. The y variable, the dependent variable, is the outcome variable in the training and test data where the y value for each x value is known. The regression line is drawn so that it minimizes the sum of the squared deviations between the actual y values and the values predicted by the regression line. Then, the regression line is used to predict the y values that will result for new values of x for which the y values are unknown.118
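The train-then-validate workflow described above might be sketched as follows, assuming scikit-learn's data-splitting and regression utilities (an illustrative toolchain, not one named by the text):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Labeled data: x is the predictor, y is the known outcome (the "label").
x = np.arange(1, 101).reshape(-1, 1)
y = 3.0 * x.ravel() + np.random.default_rng(0).normal(0, 5, 100)

# Hold out 30% of the labeled records as validation data.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(x_train, y_train)   # "train" the algorithm

# Score on held-out records whose outcomes were hidden during training.
print("validation R^2:", model.score(x_val, y_val))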

Unsupervised learning algorithms are used when there is no outcome variable to predict or classify.
Association rules, dimension reduction, and clustering are unsupervised learning methods.

Neural Networks in Data Mining


Neural networks are systems that can recognize patterns in data and use the patterns to make predictions using new data. Neural networks derive their knowledge from their own data by sifting through the data and recognizing patterns.

116 Outliers are data entries that do not fit into the model because they are extreme observations.
117 A scatter plot uses dots to represent the intersection of two different numeric variables. The position of each dot on the horizontal and vertical axes indicates the two values for an individual data point. Scatter plots can be used to observe relationships between the two variables.
118 Regression analysis is covered in more detail later in this section.


Neural networks are used to learn about the relationships in the data and combine predictor information in such a way as to capture the complicated relationships among predictor variables and between the predictor variables and the outcome variable.

Neural networks are based on the human brain and mimic the way humans learn. In a human brain, neurons are interconnected, and humans can learn from experience. Similarly, a neural network can learn from its mistakes by finding out the results of its predictions. In the same way as a human brain uses a network of neurons to respond to stimuli from sensory inputs, a neural network uses a network of artificial neurons, called nodes, to simulate the brain's approach to problem solving. A neural network solves learning problems by modeling the relationship between a set of input signals and an output signal.

The results of the neural network's predictions—the output of the model—become the input to the next iteration of the model. Thus, if a prediction did not produce the expected results, the neural network uses that information in making future predictions.

Neural networks can look for trends in historical data and use them to make predictions. Some examples of uses of neural networks include the following.

• Picking stocks for investment by performing technical analysis of financial markets and individual
investment holdings.

• Making bankruptcy predictions. A neural network can be given data on firms that have gone bankrupt and firms that have not gone bankrupt. The neural network will use that information to learn to recognize early warning signs of impending bankruptcy, and it can thus predict whether a particular firm will go bankrupt.

• Detecting fraud in credit card and other monetary transactions by recognizing that a given transaction is outside the ordinary pattern of behavior for that customer.

• Identifying a digital image as, for example, a cat or a dog.

• Self-driving vehicles use neural networks with cameras on the vehicle as the inputs.

The structure of neural networks enables them to capture complex relationships between predictors and an outcome by fitting the model to the data. The network calculates weights for the individual input variables. The weights allow each of the inputs to contribute a greater or lesser amount to the output, which is the sum of the weighted inputs. Depending on the effect of those weights on how well the output of the model—the prediction it makes—fits the actual output, the neural network then revises the weights for the next iteration.

The learning algorithm of a neural network can be either supervised or unsupervised. If the desired output
is known, the neural network is supervised. If there are no desired outputs because it is not known what
the result of the learning process will be, the neural network is unsupervised.
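The iterative weight-revision process described above can be made concrete with a deliberately tiny sketch: a single artificial neuron (one node) trained by gradient descent in NumPy. This is a pedagogical toy under simplified assumptions, not a production neural network.

import numpy as np

rng = np.random.default_rng(0)

# Toy labeled data: the outcome is 1 when the sum of the inputs is "large".
X = rng.uniform(0, 1, size=(200, 2))
y = (X.sum(axis=1) > 1.0).astype(float)

w = np.zeros(2)   # weights: how much each input contributes to the output
b = 0.0           # bias term
lr = 0.5          # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(1000):                 # each pass is one iteration
    p = sigmoid(X @ w + b)            # the node's current predictions
    # The prediction error feeds back into the next weight revision.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print("learned weights:", w, "bias:", b)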

Steps in Data Mining


A typical data mining project will include the following steps.

1) Understand the purpose of the project. The data scientist needs to understand the user’s needs
and what the user will do with the results. Also, the data scientist needs to know whether the
project will be a one-time effort or ongoing.
2) Select the dataset to be used. The data scientist will take samples from a large database or
databases, or from other sources. The samples should reflect the characteristics of the records of
interest so the data mining results can be generalized to records outside of the sample. The data
may be internal or external.
3) Explore, clean (cleanse), and preprocess the data. Verify that the data is in usable condition, that is, whether the values are in a reasonable range and whether there are obvious outliers. Determine how missing data (that is, blank fields) should be handled. Visualize the data by reviewing the information in chart form.119 If using structured data, ensure consistency in the definitions of fields, units of measurement, time periods covered, and so forth.

119 Data visualization, also called data discovery or discovery analysis, is covered later in this topic.


New variables may be created in this step, for example using the start and end dates to calculate the duration of a period. (A short code sketch illustrating typical cleaning and partitioning operations follows this list of steps.)

Note: Data cleaning, or data cleansing, involves modifying the data in a data set so that the
data set is consistent with other, similar data sets in the system. It may include:

• Finding and correcting corrupt or inaccurate records in a data set, such as typographical
errors that may have occurred during data entry or corruption that may have occurred in
data transmission or storage.

• Identifying incomplete parts of the data and modifying it to make it complete.

• Identifying irrelevant portions of the data and deleting them.

• Cross-checking the data with a validated data set.

• Adding related information to the data, called data enhancement, for example adding
names of additional people associated with an address.

• Harmonization, or normalization, of the data, which brings together data sets that may
contain differing file formats, naming conventions, or columns, and makes it consistent.

• Consolidating the data and transforming it into a cohesive, consistent, data set that can
be used for analysis.

○ Data consolidation involves taking data from various sources throughout the organization and combining it into a single location.

○ Data transformation involves changing the format, structure, or values of the data. It
can be constructive, including adding, copying, or replicating data; destructive, including
deleting fields and records; aesthetic, involving standardizing the data items such as
street names; or structural, including renaming, moving, and combining columns in a
database.

4) Data reduction: reduce the data dimension if needed. Eliminate unneeded variables, trans-
form variables as necessary, and create new variables. The data scientist should be sure to
understand what each variable means and whether it makes sense to include it in the model.
5) Determine the data mining task. Determining the task includes classification, prediction, clustering, and other activities. Translate the general question or problem from Step 1 into the specific data mining question.
6) Partition the data. If supervised learning will be used (classification or prediction), partition the
dataset randomly into three parts: one part for training, one for validation, and one for testing.
7) Select the data mining techniques to use. Techniques include regression, neural networks,
hierarchical clustering, and so forth.
8) Use algorithms to perform the task. The use of algorithms is an iterative process. The data
scientist tries multiple algorithms, often using multiple variants of the same algorithm by choosing
different variables or settings. The data scientist uses feedback from an algorithm’s performance
on validation data to refine the settings.
9) Interpret the results of the algorithm. The data scientist chooses the best algorithm and tests
the final choice on the test data to learn how well it will perform.
10) Deploy the model. The model is run on the actual records to produce actionable information that
can be used in decisions. The chosen model is used to predict the outcome value for each new
record, called scoring.120

120 Shmueli, Galit, Bruce, Peter C., Yahav, Inbal, Patel, Nitin R., and Lichtendahl Jr., Kenneth C., Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, 1st Edition, John Wiley & Sons, Hoboken, NJ, 2018, pp. 19-21.
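As referenced in Step 3 above, here is a brief sketch of what several of the cleaning and partitioning steps might look like in practice, using pandas and scikit-learn. The file name and column names are hypothetical, invented for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("records.csv")        # hypothetical source dataset

# Step 3: explore and clean.
df = df.drop_duplicates()                                  # remove duplicate records
df["amount"] = df["amount"].fillna(df["amount"].median())  # handle blank fields
df = df[df["amount"].between(0, 1_000_000)]                # drop obvious outliers
# Create a new variable from start and end dates, as the text suggests.
df["duration_days"] = (pd.to_datetime(df["end_date"])
                       - pd.to_datetime(df["start_date"])).dt.days

# Step 6: partition into training, validation, and test sets (60/20/20).
train, rest = train_test_split(df, test_size=0.4, random_state=0)
validation, test = train_test_split(rest, test_size=0.5, random_state=0)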


A data mining project does not end when a particular solution is deployed, however. The results of the data
mining may raise new questions that can then be used to develop a more focused model.

Challenges of Data Mining


Some of the challenges inherent in data mining include the following.

• Poor data quality. Data stored in relational databases may be incomplete, out of date, or inconsistent. For example, mailing lists can contain duplicate records, leading to duplicate mailings and excess costs. Poor quality data can lead to poor decisions.

Furthermore, use of inaccurate data can cause problems for consumers. For example, when credit
rating agencies have errors in their data, consumers can have difficulty obtaining credit.

• Information exists in multiple locations within the organization and thus is not centrally located, for example Excel spreadsheets that are in the possession of individuals in the organization. Information that is not accessible cannot be used.

• Biases are amplified in evaluating data. The meaning of a data analysis must be assessed by a
human being, and human beings have biases. A “bias” is a preference or an inclination that gets
in the way of impartial judgment. Most people tend to trust data that supports their pre-existing
positions and tend not to trust data that does not support their pre-existing positions. Other biases
include relying on the most recent data or trusting only data from a trusted source. All such biases
contribute to the potential for errors in data analysis.

• Analyzed data often displays correlations.121 However, correlation does not prove causation.
Establishing a causal relationship is necessary before using correlated data in decision-making. If
a causal relationship is assumed where none exists, decisions made using the data will be flawed.

• Ethical issues such as data privacy related to the aggregation of personal information on millions
of people. Profiling according to ethnicity, age, education level, income, and other characteristics
results from the collection of so much personal information.

• Data security is an issue because personal information on individuals is frequently stolen by hackers or even employees.

• A growing volume of unstructured data. Data items that are unstructured do not conform to
relational database management systems, making capturing and analyzing unstructured data more
complex. Unstructured data includes items such as social media posts, videos, emails, chat logs,
and images, for example images of invoices or checks received.

121 A "correlation" is a relationship between or among values in multiple sets of data where the values in one data set move in relation to the values in one or more other data set or sets.


Study Unit 16: F.4. Regression Analysis


Analytic Tools

Linear Regression Analysis


Linear regression analysis is a statistical technique that is used to develop a mathematical equation that
models the extent to which one variable (called the dependent or response variable) has historically been
affected by one or more other variables (called the independent or predictor variables). If the historical
relationship between the cause (or causes) and the effect has been sufficiently strong, regression analysis
using the historical data can be used to make decisions and predictions about the dependent variable.

Time Series Analysis

Note: Time series analysis was introduced in Section B in Volume 1 of this textbook, topic B.3. Fore-
casting Techniques. Candidates may wish to review that information before proceeding. The trend
pattern in time series analysis was introduced in Forecasting Techniques and will be further explained in
this topic. Additional patterns will be discussed in this topic, as well.

A time series is a sequence of measurements taken at equally spaced, ordered points in time. A time series looks at relationships between the passage of time as the independent variable and another variable as the dependent variable. The dependent variable may be sales revenue for a segment of the organization, production volume for a plant, expenses in one expense classification, or anything being monitored over time. In time series analysis, only one set of historical time series data is used, and that set of historical data is not compared to any other set of data.

A time series can be descriptive or predictive.

• When time series analysis is used for descriptive modeling, a time series is modeled to determine
its components, that is, whether it demonstrates a trend pattern, a seasonal pattern, a cyclical
pattern, or an irregular pattern. The information gained from a time series analysis can be used
for decision-making and policy determination.

• When time series analysis is used for predictive modeling, it involves using the information from
a time series to forecast future values of that series.

A time series may have one or more of four patterns (also called components) that influence its behavior
over time:

1) Trend pattern or component

2) Cyclical pattern or component

3) Seasonal pattern or component

4) Irregular pattern or component
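One common way to separate these components in practice is classical time series decomposition. The sketch below uses the statsmodels library (an assumed tool, not one named in the text) on invented quarterly data.

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical quarterly sales (in $000s) with trend and seasonality.
sales = pd.Series(
    [1200, 2500, 3200, 1500, 1400, 2800, 3800, 1400, 1700, 2500, 3900, 1600],
    index=pd.period_range("2016Q1", periods=12, freq="Q").to_timestamp(),
)

# period=4: the seasonal pattern repeats every four quarters.
result = seasonal_decompose(sales, model="additive", period=4)

print(result.trend)     # the trend component
print(result.seasonal)  # the seasonal component
print(result.resid)     # the irregular (residual) component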

Trend Pattern in Time Series Analysis


A trend pattern is the most frequent time series pattern and the one most amenable to use for predicting
because the historical data exhibits a gradual shifting to a higher or lower level. If a long-term trend exists,
short-term fluctuations may take place within that trend; however, the long-term trend will be apparent.
For example, sales from year to year may fluctuate but overall, they may be trending upward, as is the
case in the example that follows.


Example of a Trend Pattern in Time Series Analysis


Sales revenues for each year, Year 1 through Year 10, are as follows:

Year (x)    Sales Revenue (y)

Year 1 $2,250,000

Year 2 $2,550,000

Year 3 $2,300,000

Year 4 $2,505,000

Year 5 $2,750,000

Year 6 $2,800,000

Year 7 $2,600,000

Year 8 $2,950,000

Year 9 $3,000,000

Year 10 $3,200,000

The following graph illustrates the trend pattern of the sales revenue. It indicates a strong relationship between the passage of time (the independent variable x) and sales revenue (the dependent variable y) because the historical data points fall close to the regression line. Because of the trend pattern, this historical data can be used to make a forecast for sales revenues for Year 11, as seen in the extension of the trend line on the graph.

[Graph: Sales Revenues, Year 1 through Year 10, with Year 11 Forecast. Yearly sales revenue (y-axis, $0 to $4,000,000) is plotted against years 1 through 11 (x-axis). The actual sales points fall close to an upward-sloping regression line, which is extended to Year 11 as the forecast.]
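The regression line and Year 11 forecast above can be reproduced with ordinary least squares. A minimal sketch using NumPy's polyfit (one of several tools that could be used) follows:

import numpy as np

years = np.arange(1, 11)
sales = np.array([2_250_000, 2_550_000, 2_300_000, 2_505_000, 2_750_000,
                  2_800_000, 2_600_000, 2_950_000, 3_000_000, 3_200_000])

# Fit a straight line (degree 1): polyfit returns slope b and intercept a.
b, a = np.polyfit(years, sales, deg=1)
print(f"regression line: y-hat = {a:,.0f} + {b:,.0f}x")

# Extend the trend line to forecast Year 11.
print(f"Year 11 forecast: {a + b * 11:,.0f}")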


Trends in a time series analysis are not always upward and linear like the preceding graph. Time series data
can exhibit an upward linear trend, a downward linear trend, a nonlinear (that is, curved) trend, or no trend
at all. A scattering of points that have no relationship to one another would represent no trend at all.

Note: The CMA exam tests linear regression only.

Cyclical Pattern in Time Series Analysis


Any recurring fluctuation that lasts longer than one year is attributable to the cyclical component of the
time series. A cyclical component in sales data is usually due to the cyclical nature of the economy.

A long-term trend can be established even if the sequential data fluctuates greatly from year to year due
to cyclical factors.

Example of a Cyclical Pattern in Time Series Analysis


Sales for each year, Year 1 through Year 10, are as follows:

Year (x)    Sales Revenue (y)

Year 1 $1,975,000

Year 2 $2,650,000

Year 3 $2,250,000

Year 4 $2,450,000

Year 5 $2,250,000

Year 6 $2,250,000

Year 7 $2,600,000

Year 8 $2,450,000

Year 9 $3,000,000

Year 10 $2,750,000

The following graph illustrates the cyclical pattern of the sales. The fluctuations from year to year are
greater than they were for the graph containing the trend pattern. However, a long-term trend is still
apparent.


[Graph: Sales Revenues, Year 1 through Year 10, with Year 11 Forecast (cyclical pattern). Yearly sales (y-axis, $0 to $4,000,000) fluctuate more widely around the regression line than in the trend-pattern graph, but an upward long-term trend is still apparent, and the line is extended to Year 11 as the forecast.]

Seasonal Pattern in Time Series Analysis


Usually, trend and cyclical components of a time series are tracked as annual historical movements over
several years. However, a time series can fluctuate within a year due to seasonality in the business. For
example, a surfboard manufacturer’s sales would be highest during the warm summer months, whereas a
manufacturer of snow skis would experience its peak sales in the wintertime. Variability in a time series
due to seasonal influences is called the seasonal component.

Note: Seasonal behavior can take place within any period. Seasonal behavior is not limited to periods
of a year. A business that is busiest at the same time every day is said to have a within-the-day
seasonal component. Any pattern that repeats regularly is a seasonal component.

Seasonality in a time series is identified by regularly spaced peaks and troughs with a consistent direction
that are of approximately the same magnitude each time, relative to any trend. The graph that follows
shows a strongly seasonal pattern. Sales are low during the first quarter each year. Sales begin to increase
each year in the second quarter and they reach their peak in the third quarter, then they drop off and are
low during the fourth quarter. However, the overall trend is upward, as illustrated by the trend line.


Example of a Seasonal Pattern in Time Series Analysis


Sales for each quarter, March 20X6 through December 20X8, are as follows:

Quarter (x)    Sales Revenues (y)

Mar. 20X6 $1,200,000

Jun. 20X6 $2,500,000

Sep. 20X6 $3,200,000

Dec. 20X6 $1,500,000

Mar. 20X7 $1,400,000

Jun. 20X7 $2,800,000

Sep. 20X7 $3,800,000

Dec. 20X7 $1,400,000

Mar. 20X8 $1,700,000

Jun. 20X8 $2,500,000

Sep. 20X8 $3,900,000

Dec. 20X8 $1,600,000

The graph that follows contains historical sales by quarter for three years and forecasted sales by quarter
for the fourth year. The fourth year’s quarterly forecasts were calculated in Excel using the FORECAST.ETS
function, which is an exponential smoothing algorithm.

Note: Exponential smoothing is outside the scope of the CMA exams, so it is not covered any further in
these study materials.

The graph illustrates that sales volume begins to build in the second quarter of each year. The sales volume
reaches its peak in the third quarter and is at its lowest in the fourth quarter of each year.


[Graph: Seasonal Pattern. Historical sales revenue by quarter, 20X6-20X8, with 20X9 quarterly forecasts. Quarterly sales (y-axis, $0 to $4,000,000) peak in the third quarter and bottom out in the fourth quarter of each year, while an upward trend line runs through the series; the 20X9 quarters show the forecast.]

Irregular Pattern in a Time Series


A time series may vary randomly, not repeating itself in any regular pattern. Such a pattern is called an
irregular pattern. It is caused by short-term, non-recurring factors and its impact on the time series
cannot be predicted.

Example of an Irregular Pattern in Time Series Analysis


Sales revenues for each year, Year 1 through Year 10, are as follows:

Year (x)    Sales Revenue (y)

Year 1 $1,500,000

Year 2 $3,200,000

Year 3 $2,100,000

Year 4 $2,500,000

Year 5 $1,400,000

Year 6 $1,600,000

Year 7 $3,600,000

Year 8 $2,000,000

Year 9 $2,500,000

Year 10 $1,700,000


The following graph exhibits the irregular pattern of the sales. A trend line is not useful, although it can be
and has been added.

[Graph: Irregular Pattern. Sales Revenues, 20X0-20X9. Yearly sales (y-axis, $0 to $4,000,000) vary randomly with no repeating pattern; a trend line has been added but is not useful for prediction.]

Time Series Trend Pattern and Regression Analysis


A time series that has a long-term upward or downward trend pattern can be used to make a forecast.
Simple linear regression analysis is used to create a trend projection and to forecast values using historical
information from all available past observations of the value.

Note: Simple regression analysis is called “simple” to differentiate it from multiple regression analysis.
The difference between simple linear regression and multiple linear regression is in the number of inde-
pendent variables.

• A simple linear regression has only one independent variable. In a time series, that independent
variable is the passage of time.

• A multiple linear regression has more than one independent variable.

Linear regression means the regression equation graphs as a straight line.

Simple linear regression analysis relies on two assumptions:

• Variations in the dependent variable (the value being predicted) are explained by variations in
one single independent variable (the passage of time, for a time series).

• The relationship between the independent variable and the dependent variable (whatever is being predicted) is linear. A linear relationship is one in which the relationship between the independent variable and the dependent variable can be approximated by a straight line on a graph. The regression equation, which approximates the relationship, will graph as a straight line.

Note: The equation of a linear regression line graphs as a straight line because none of the variables in
the equation are squared or cubed or have any other exponents. If an equation contains any exponents,
the graph of the equation will be a curved line.


The regression line is called a trend line when the regression is being performed on a time series. The
equation of a simple linear regression line is:

ŷ = a + bx

Where:

ŷ = the predicted value of y, the dependent variable, on the regression line corresponding to each value of x, the independent variable.

a = the constant coefficient, or the value of ŷ on the regression line when x is zero; also the y-intercept, the point where the linear representation of ŷ on the regression line crosses the y-axis.

b = the variable coefficient, the amount by which the ŷ value on the regression line changes (increases or decreases) when the value of x changes by one unit; also the slope of the line.122 The variable coefficient is always next to the independent variable in the regression equation. In a time series, x represents time, so the value of x only increases because time flows in only one direction.

x = the independent variable, the value of x on the x-axis that corresponds to the predicted value of ŷ on the regression line.

The symbol over the “y” in the formula is called a “hat,” and it is read as “y-hat.” The y-hat indicates the
predicted value of y, not the actual value of y. The predicted value of y is the value of y on the
regression line (the line created from the historical data) at any given value of x.
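For a concrete illustration, the coefficients a and b can be computed directly from the standard least-squares formulas. The following sketch uses invented data points:

# Least-squares coefficients for y-hat = a + bx, from the standard formulas:
#   b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   a = mean_y - b * mean_x
xs = [1, 2, 3, 4, 5]                     # invented independent variable
ys = [110, 125, 133, 148, 156]           # invented dependent variable

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

print(f"y-hat = {a:.1f} + {b:.1f}x")     # the fitted regression line
print(f"prediction for x = 6: {a + b * 6:.1f}")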

The line of best fit as determined by simple linear regression is a formalization of the way one would fit a
trend line through the graphed data just by looking at it. To fit a line by looking at it, one would use a ruler
or some other straight edge and move it up and down, changing the angle, until it appears the differences
between the points and the line drawn with the straight edge have been minimized. The line that results
will be a straight line located at the position where approximately the same number of points are above the
line as are below it and the distance between each point and the line has been minimized (that is, the
distance is as small as possible).

Linear regression is used to calculate the location of the regression line mathematically. Linear regression analysis is performed on a computer or a financial calculator, using the observed values of x and y.

On a graph, the difference between each actual, observed point and its corresponding point on the calculated regression line is called a deviation or a residual. When the position of the regression line is calculated mathematically, the line will be in the position where the deviations between each graphed value and the regression line have been minimized. The resulting regression line is the line of best fit. That line may be able to be used to predict the value of y for any given value of x, if the independent variable being used (such as the passage of time for a time series) serves well as the predictor variable.

Note: The statistical method used to perform simple regression analysis is called the least squares method, also known as the Ordinary Least Squares (OLS) method. The regression line is called the least squares regression line.

122 The slope of a line on a graph is the "rise over the run" or the amount of change in the ŷ value (either an increase or a decrease) divided by the amount of increase in the x value (the amount of increase when moving from left to right on the graph). For example, if ŷ increases by 20,000 when x increases by 1, the slope of the line is 20,000. If ŷ decreases by 20,000 when x increases by 1, the slope of the line is −20,000.


Simple linear regression was used to calculate the regression line and the forecast on the graph presented earlier as an example of a trend pattern. The regression line can be extended out for one or more additional years to create forecasts.

Before Using Regression Analysis to Develop a Prediction


Before using regression analysis to predict a value, though, determine whether regression analysis even
can be used to make a prediction. The following two requirements must be met:

1) The dependent variable, y, must have a linear relationship with the independent variable, x. (A code sketch of this scatter-plot check follows this list.)

To determine whether a linear relationship exists, make a scatter plot123 of the actual historical
values in the time series and review the results. Plotting the x and y coordinates on a scatter plot
will indicate whether there is a linear relationship between them.

If the long-term trend appears to be linear, simple linear regression analysis may be able to
be used (subject to correlation analysis as described below) to determine the location of the linear
regression line, and that linear regression line can be used to make a prediction.

Below is a scatter plot that exhibits no correlation between the x-variable, time, and the y-
variable, sales revenue. The graph below is also an example of the irregular pattern described
previously.

[Figure: Scatter plot, "Sales Year 1 - Year 10." Sales dollars ($0 to $4,000,000) on the y-axis plotted
against years (0 to 10) on the x-axis. The points show no discernible pattern.]

Historical sales like the above indicate that regression analysis using a time series would not be a
good way to make a prediction.

123
A scatter plot uses dots to represent the intersection of two different numeric variables. The position of each dot on
the horizontal and vertical axes indicates the two values for an individual data point. Scatter plots can be used to observe
relationships between the two variables.


On the other hand, the following scatter plot does display a linear relationship between the x-
values and y-values, and the use of regression analysis to make a prediction could be helpful.

[Figure: Scatter plot, "Sales Year 1 - Year 10." Same axes as above. The points fall approximately along
a straight line.]

2) Correlation analysis should indicate a high degree of correlation between the independent
variable x and the dependent variable y.

In addition to plotting the points on a scatter plot graph, correlation analysis should be per-
formed before relying on regression analysis to develop a prediction.

Correlation Analysis
Correlation analysis is used to understand the relationship or absence of a relationship between two varia-
bles and to determine the strength of the linear relationship between the two variables.

Note: Correlation describes the degree of the relationship between two variables. If two things are
correlated with one another, it means there is a close connection between them.

• If high measurements of one variable tend to be associated with high measurements of the other
variable, or low measurements of one variable tend to be associated with low measurements of the
other variable, the two variables are said to be positively correlated.

• If high measurements of one variable tend to be associated with low measurements of the other
variable, or if low measurements of one variable tend to be associated with high measurements of
the other variable, the two variables are said to be negatively correlated.

• If there is a close match in the movements of the two variables over a period, either positive or
negative, it is said that the degree of their correlation is high.

However, correlation alone does not prove causation. Rather than one variable affecting the other
variable, it may be that some other entirely different factor is affecting both variables.

Correlation analysis is used to determine how well correlated the variables are and how well a model can
predict an outcome, in order to decide whether the independent variable or variables can be used to make
decisions and predictions regarding the dependent variable in the analysis.


In a time series, the only independent variable is the passage of time. Many factors in addition to time can
affect the dependent variable. For example, if sales revenue is being predicted, economic cycles, promo-
tional programs undertaken, the size of the sales staff, and industry-wide conditions such as new
government regulations can cause changes in sales revenue. If time series regression analysis is used to
develop a prediction, the prediction should be adjusted for other known factors that may have affected the
historical data and that may affect the prediction.

Correlation analysis involves several statistical measures calculated on a computer using a statistics appli-
cation and the observed values of x and y. Financial calculators have some limited capabilities with respect
to correlation analysis but cannot provide all the information that a statistics application on a computer can.

The constant coefficient and the variable coefficient for the regression equation are part of the output of a
regression analysis, and thus that output provides the equation of the regression line. In addition, the
output of the regression analysis provides the statistics used in correlation analysis. The most important of
those statistics are as follows:

1) The correlation coefficient, R

2) The coefficient of determination, R²

3) The standard error of the estimate, also called the standard error of the regression
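
For illustration, all three statistics can be computed directly from a fitted regression and its residuals; the following sketch reuses the hypothetical time series from the earlier NumPy example:

    import numpy as np

    x = np.arange(1, 11, dtype=float)
    y = np.array([1.2, 1.4, 1.3, 1.6, 1.8, 1.7, 2.0, 2.2, 2.1, 2.4]) * 1_000_000

    b, a = np.polyfit(x, y, 1)      # least squares slope and intercept
    y_hat = a + b * x               # predicted values on the regression line
    residuals = y - y_hat           # e = y - y-hat for each value of x

    r = np.corrcoef(x, y)[0, 1]     # 1) correlation coefficient, R
    r_squared = r ** 2              # 2) coefficient of determination, R²

    # 3) standard error of the estimate: a standard deviation of the residuals,
    # using n - 2 degrees of freedom for the two estimated coefficients.
    n = len(x)
    see = np.sqrt(np.sum(residuals ** 2) / (n - 2))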

1) The Correlation Coefficient (R)


The correlation coefficient measures the strength of the relationship between the independent variable
and the dependent variable. The coefficient of correlation expresses how closely related, or correlated, the
two variables are and the extent to which a variation in the independent variable has historically resulted
in a variation in the dependent variable.

Mathematically, the correlation coefficient, represented by R, is a numerical measure that expresses both
the direction of the correlation—positive or negative—and the strength of the linear association be-
tween the two variables.

Note: In a time series linear regression analysis, the period of time serves as the independent variable
(x-axis) while the variable such as sales revenue serves as the dependent variable (y-axis). Since time
moves only forward, the independent variable in a time series only increases.

When a time series (such as sales revenue over a period of several years) is graphed, the data points on
the graph may show an upsloping linear pattern, a downsloping linear pattern, a nonlinear pattern (such
as a curve), or no pattern at all. The pattern of the data points indicates the amount of correlation between
the values on the x-axis (time) and the values on the y-axis (such as sales revenue).

The correlation coefficient (R) is expressed as a number between −1 and +1. The sign of the correlation
coefficient describes the direction of the relationship (positive or negative), and the absolute value of
the correlation coefficient describes the magnitude of the relationship between the two variables.

• A correlation coefficient (R) of +1 means the linear relationship between each value for x and its
corresponding value for y is perfectly positive. As time on the x-axis increases in a time series,
y increases by the same proportion, so the regression line is upsloping.

• A correlation coefficient (R) of −1 means the linear relationship between each value for x and its
corresponding value for y is perfectly negative. As time on the x-axis increases in a time series,
y decreases by the same proportion, so the regression line is downsloping.

• A correlation coefficient (R) that is close to zero usually means there is very little or no relation-
ship between each value of x and its corresponding y value. However, a correlation coefficient that
is close to zero may also mean there is a strong relationship between the two variables, but the
relationship is not a linear one. (Candidates do not need to know how to recognize a non-linear
relationship. Just be aware that non-linear relationships occur.)


A high correlation coefficient (R), that is, a number close to either +1 or −1, means that simple linear
regression analysis would be useful as a way of making a projection. Generally, a correlation coefficient of
±0.50 or higher indicates enough correlation that a linear regression can be useful for forecasting. The
closer R is to ±1, the better the forecast should be.

A moderate correlation coefficient (R), generally defined as ±0.30 to ±0.49, indicates a lower amount of
correlation and questionable value of the historical data for forecasting.

A low correlation coefficient (R), around ±0.10, indicates that a forecast made from the data using
simple regression analysis would not be useful.
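
These rules of thumb might be encoded as a small helper function; the following is only a sketch, and the thresholds are the guidelines above, not universal standards:

    def interpret_r(r: float) -> str:
        """Rule-of-thumb interpretation of a correlation coefficient R."""
        strength = abs(r)
        if strength >= 0.50:
            return "high: linear regression may be useful for forecasting"
        if strength >= 0.30:
            return "moderate: forecasting value of the data is questionable"
        return "low: a forecast from this data would not be useful"

    interpret_r(-0.87)  # 'high: linear regression may be useful for forecasting'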

The correlation coefficient (R) does not indicate how much, that is, the proportion, of the variation in the
dependent variable that is explained by changes in the independent variable. The correlation coefficient
indicates only whether there is a direct (upsloping) or inverse (downsloping) relationship between the pairs
of x and y variables and the strength of that relationship.

Note: It is important to first look at the plotted data points on the scatter plot graph when determining
whether a relationship exists between the independent variable and the dependent variable. Do not rely
on the value of the correlation coefficient alone to indicate whether there is a relationship between the
two variables, because the correlation coefficient will not detect a non-linear relationship.

2) The Coefficient of Determination (R²)


The coefficient of determination is the percentage of the total variation in the dependent variable (y)
that can be explained by variations in the independent variable (x), as depicted by the regression
line.

In a simple linear regression with only one independent variable, the coefficient of determination is the
square of the correlation coefficient. The coefficient of determination is represented by the term R².

R² is expressed as a number between 0 and 1.

• If R² is 1, then 100% of the variation in the dependent variable is explained by variations in the
independent variable.

• If R² is 0, then none of the variation in the dependent variable is explained by variations in the
independent variable.

• If R² is greater than 0 but less than 1, for example 0.68, it means that 68% of the total variation
in the dependent variable can be explained by variations in the independent variable.

In a regression analysis with a high coefficient of determination (R²), the data points will all lie close to the
trend line. In a regression analysis with a low R², the data points will be scattered at some distances above
and below the regression line. The higher the R², the better the predictive ability of the linear regression.

There is no absolute standard for an acceptable value of the coefficient of determination. The coefficient of
determination should be evaluated along with other statistics. However, higher is better.

3) The Standard Error of the Estimate (SEE) or the Standard Error of the Regression
In algebra, an equation such as y = 2,000 + 300x means that y is exactly 2,000 + 300x. However, with
regression data, the equation ŷ = 2,000 + 300x is true on average but is not true for any given value of
x.

The equation of the simple linear regression model results in the average, or predicted, value of ŷ (the
response) for any given value of x (the predictor). However, the actual observed data has responses that
are not on the line itself, but rather they are scattered around the regression line. Thus, on a graph of
a regression of historical data, there are two y values for each value of x: one is the actual historical value
of y for that value of x, and the other is the estimated, or predicted, value of y (ŷ) for that value of x,
represented by the ŷ value on the trend line aligned with each value of x.


The scatter, that is, the difference between the actual value of the dependent variable y and the predicted
value of the dependent variable y (that is, ŷ) for each value of the independent variable x, is called the
error term or the residual for that value of x. The error term—the scatter of the data around the regression
line—is represented by e in the linear regression model. If all the historical data fell on the regression line,
the error term for each value of x would be zero.
Therefore, the equation that describes a given actual, historical value of y for a given value of x
used in a regression is as follows:

y = a + bx + e

Where:

y= the dependent variable (its actual, historical value, not its predicted
value) corresponding to each given value of x.
a= the constant coefficient, or the y-intercept, the value of ŷ on the
regression line when x is zero.
b= the variable coefficient and the slope of the regression line, or the
average amount of change in ŷ resulting from one unit of change in x.
x= the independent variable, or the value of x on the x-axis that corre-
sponds to the value of ŷ on the regression line.

e= the error term, also called the residual, which for each value of the
independent variable x is the difference between the predicted value of
y (ŷ) on the regression line for that value of x and the actual value of y
for that value of x. The error term will be different for each value
of the independent variable x used in the regression function.

Each value of x has one residual, or error term. For any given value of x,

e=y−ŷ

The standard error of the estimate (SEE), also called the standard error of the regression, measures
the average distance that the actual values of the dependent variable y fall from the predicted values of y
on the regression line (the residuals). It is computed as a standard deviation, with the deviations being the
residuals, or the distance of each actual value of y from its predicted value at that value of x. In other
words, it describes how wrong the regression model is on average, using the units of the dependent
variable, y.

Some of the residuals for a dataset that has been regressed will be positive and some will be negative.
However, in a regression that has a constant term, the sum of the residuals will be exactly zero.

The standard error of the estimate can be used to determine the precision with which the regression model
can be expected to predict the dependent y-variable. It provides the average difference between the actual
values of y (that is, the values that did occur historically) and the predicted values of ŷ on the regression
line. The predicted values of ŷ on the regression line are the values that result from putting the various
values for the independent variable x into the regression function and calculating the resulting predicted
value of the dependent variable y (that is, ŷ) at each value of x.

The lower the standard error of the estimate is, the closer are the actual dots on the graph to the regression
line and the more accurate will be the predictions made using the regression model.

The size of the standard error of the estimate must be interpreted in relation to the average size
of the dependent variable. If the standard error of the estimate is less than 5 to 10 percent of the average
size of the dependent variable, the regression analysis is fairly precise and should be usable for prediction.

Example: If the average size of the dependent variable is 5,000,000 and the standard error is 250,000,
250,000 is only 5 percent of 5,000,000. The percentage of error is small, and the model should be usable
for prediction.
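
As a quick check in code, using the figures from the example above:

    see = 250_000        # standard error of the estimate
    mean_y = 5_000_000   # average size of the dependent variable

    see / mean_y         # 0.05, i.e., 5 percent: within the 5-10% guideline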

The inclusion of an error term in the regression model recognizes that:

• The regression model is imperfect.

• Some variables that help to “explain” the behavior of the dependent variable might not be included.

• The included variables may have been measured with error.

• There is always some component in the variation of the dependent variable that is completely random.

Note: The regression equation may be written in various ways, though the standard form of the equation
for an upsloping line used in statistics is
ŷ = a + bx
x will usually represent the independent variable and y will usually represent the dependent variable.

If the variables are negatively correlated, that is, the regression line has a negative slope, meaning
the values of ŷ decrease for each increase in the value of x, the equation will be ŷ = a – bx.

The constant coefficient in the equation is the letter that stands by itself. The constant coefficient
represents the y-intercept on the graph because it is the value of y when x is zero. In the equation
above, a represents the constant coefficient.

The coefficient of the independent variable, or the variable coefficient, is whatever term is next
to the independent variable in the formula. That term represents the amount of change in the pre-
dicted value of the dependent variable for each unit of increase in the independent variable.
The variable coefficient is the slope of the regression line. In the equation above, b represents the
variable coefficient.

However, different letters may be used to represent all the variables, and the terms on the right side of
the equation may be reversed. The constant coefficient may come first as in the equation above, or it
may be the final term.

Thus, the equations ŷ = a + bx, ŷ = bx + a, ŷ = b + ax, and ŷ = ax + b are the same equation. The
a and the b just stand for different things and the terms on the right side of the equation are in different
orders. Furthermore, letters other than x and y may be used to represent the independent variable and
the dependent variable, although that is uncommon.

The independent variable, usually x, can be recognized because it will always have a coefficient next to
it. The coefficient next to the independent variable will be the variable coefficient, or the amount of
change in the predicted value of the dependent variable for each unit of change in the independent
variable.

The term that stands all by itself will be the constant coefficient, usually a. The constant coefficient
will also be the y-intercept, or the value of ŷ on the regression line when x is zero.

The symbol over the y is a circumflex, also called a "hat," and it is read as "y-hat." The y-hat indicates
the predicted value of y, not the actual, historical value of any y in the input to the regression model.
The predicted value of y is the value of y on the regression line (the trend line created from the
historical data) at any given value of x.


Summary
Following is a summary table of the interpretation of a correlation analysis.

Summary – Interpretation of a Correlation Analysis

Coefficient of Correlation, R
   Description: A number between −1 and +1 that expresses the direction of the correlation (a positive
   coefficient means a direct correlation and a negative coefficient means an indirect correlation) and the
   strength of the linear relationship between the independent and dependent variables. It indicates the
   direction and the extent to which a variation in the independent variable x has historically resulted in
   a variation in the dependent variable y.
   Standard for prediction use: ±0.50 or higher.

Coefficient of Determination, R²
   Description: The percentage of the total variation in the dependent variable y that can be explained
   by variations in the independent variable x.
   Standard for prediction use: There is no absolute standard, but higher is better.

Standard Error of the Estimate, SEE
   Description: The average distance the observed values of the dependent variable y fall from the
   regression line. Indicates the predictive ability of the regression model.
   Standard for prediction use: Less than 5 to 10% of the average size of the dependent variable.

Multiple Regression Analysis


When more than one independent variable is known to impact a dependent variable and each independent
variable can be expressed numerically, regression analysis using all the independent variables to forecast
the dependent variable is called multiple regression analysis.

Note: Remember that there must be a reasonable basis to assume a cause-and-effect relationship be-
tween the independent variable(s) and the dependent variable. If there is no reason for a connection,
any correlation found by using regression analysis is accidental. A linear relationship does not prove
a cause-and-effect relationship, and correlation does not prove causation.


Recall that the graph of a multiple regression has more than one x-axis, because a multiple regression has
more than one independent variable.


[Figure: A three-dimensional scatter of data points plotted against two independent-variable axes, x₁
and x₂.]

The equation of a multiple regression function is usually written with either all "a"s or all "b"s as the coef-
ficients, with a subscripted zero to indicate the constant coefficient and subscripted subsequent numerals
to indicate the variable coefficients, such as the following, although any letters could be used:

ŷ = a₀ + a₁x₁ + a₂x₂ + ... + aₖxₖ

Note: The variables and the coefficients in a multiple regression equation could be identified using any
letters. To identify the various components, look for the form of the equation rather than the specific
letters.

• The equation will have one component that stands by itself on the right side of the equals sign, and
that will be the constant coefficient. If a is used for the coefficients, the constant coefficient should
have a subscripted "0."

• The independent variables may or may not be identified by "x"s, but if x is used, the independent
variables should be identified as x₁, x₂, and so forth.

• The variable coefficients will be next to their independent variables. If a is used for the coefficients,
the variable coefficient on x₁ will be a₁, the variable coefficient on x₂ will be a₂, and so forth.
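
As a sketch of how the coefficients of such an equation might be estimated (the data and the meanings assigned to the variables here are hypothetical), NumPy's least squares solver can fit a multiple regression directly:

    import numpy as np

    # Hypothetical observations: two independent variables and one dependent.
    x1 = np.array([10, 12, 15, 18, 20, 24, 25, 28], dtype=float)  # e.g., ad spend
    x2 = np.array([3, 3, 4, 4, 5, 5, 6, 6], dtype=float)          # e.g., sales staff
    y = np.array([120, 130, 150, 165, 180, 200, 205, 220], dtype=float)

    # Design matrix: a column of ones for the constant coefficient a0,
    # then one column per independent variable.
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Least squares estimates of the coefficients [a0, a1, a2].
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)

    # R² for the whole regression (all independent variables together).
    y_hat = X @ coefficients
    r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)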

Evaluating the Reliability of a Multiple Regression Analysis


In simple regression analysis, the coefficient of determination, R², is the proportion of the total variation
in the dependent variable (y) that can be explained by variations in the independent variable (x). Thus, R²
is an indicator of the reliability of a simple regression analysis.

R² is used as an indicator of the reliability of a multiple regression analysis, as well. In multiple regression
analysis, though, the R² value evaluates the whole regression, including all the independent variables
used. The higher the R² is, the better. If R² is above 0.50 (50%), the regression is fairly reliable because
variations in the independent variables explain more than half of the total variation in the dependent
variable.


Note: Remember, correlation does not prove causation. In addition to a strong correlation between each
independent variable and the dependent variable, there must be a logical cause-and-effect relationship
between them before the independent variable can be used effectively to predict the dependent variable.

Goodness of Fit in Linear Regression Analysis


The term goodness of fit describes how close the actual values used in a statistical model are to the
expected values, that is, the predicted values, in the model.

In regression analysis, the regression equation is the model used to predict future values based on the
behavior of the actual observations in response to a predictor. Thus, the correlation analysis described in
this topic leads to the measurement of the regression equation’s goodness of fit.

When the independent variable or variables used in the regression are not well correlated with the obser-
vations of the dependent variable used in the regression, the regression line is said to have a low goodness
of fit.

Example of Low Goodness of Fit


The graph that follows exemplifies low goodness of fit for the regression equation, ŷ = $2,352,273 + $0.75x.
Sales of soda (the dependent variable on the y-axis) are regressed on sales of broccoli (the independent
variable on the x-axis). Although a regression line is shown on the graph, it is not meaningful. Most of the
points on the regression line are very far from the observed values of the dependent variable at that value
for the independent variable, indicating very little correlation between the two variables.

[Figure: "Sales of Soda as a Function of Sales of Broccoli." Sales of soda (y-axis, $0 to $4,000 thousand)
plotted against sales of broccoli (x-axis, $0 to $200 thousand), with the regression line
ŷ = $2,352,273 + $0.75x. The observed points are widely scattered around the line.]


Example of High Goodness of Fit


On the other hand, when the independent variable or variables used in the regression are highly correlated
with the observations of the dependent variable used in the regression, the regression equation is said to
have a high goodness of fit.

Following is a graph showing sales of soda (the dependent variable on the y-axis) regressed on historical
sales of hot dogs (the independent variable on the x-axis) for a representative period. The regression
equation, ŷ = $2,062,045 + $9.77x, has a high goodness of fit. The points representing historical sales of
soda as a function of sales of hot dogs are very close to the regression line.

[Figure: "Sales of Soda as a Function of Sales of Hot Dogs." Sales of soda (y-axis, $0 to $4,000 thousand)
plotted against sales of hot dogs (x-axis, $0 to $100 thousand), with the regression line
ŷ = $2,062,045 + $9.77x. The observed points lie very close to the line.]

The equation of the regression line on the preceding graph is:

ŷ = $2,062,045 + $9.77x
The constant coefficient is $2,062,045, meaning that if no hot dogs are sold, sales of soda would still be
$2,062,045.

The variable coefficient is $9.77, meaning that for every dollar that hot dog sales increase, sales of soda
increase by $9.77.

Therefore, when sales of hot dogs equal $80,000, the predicted value of sales of soda is $2,843,645:

ŷ = $2,062,045 + ($9.77 × $80,000)


ŷ = $2,843,645
And that can be seen on the graph, as well.
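
Expressed as code, the regression equation from the graph is simply a linear function; a minimal sketch:

    def predicted_soda_sales(hot_dog_sales: float) -> float:
        """y-hat = $2,062,045 + $9.77 per dollar of hot dog sales."""
        return 2_062_045 + 9.77 * hot_dog_sales

    predicted_soda_sales(80_000)  # 2843645.0, matching the worked example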


Confidence Interval
The confidence interval is used to describe the amount of uncertainty caused by the sampling method
used when drawing conclusions about a population based on a sample. If several samples are drawn from
a population using the same sampling method and a confidence interval at a confidence level of 95% is
used, 95% of the interval estimates in the samples can be expected to include the true parameter
of the population.

The following graph contains information on number of sales made by an Internet retailer on its website
regressed against the number of unique visitors to the website each day during a representative calendar
period, January 1 through January 31, 20X1. The graph includes the upper and lower bands of a confidence
interval at a 95% confidence level.

The equation of the regression line is ŷ = 38.0667 + 0.0158x

When the number of unique website visitors is 15,000, the predicted number of sales is:

ŷ = 38.0667 + (0.0158 × 15,000)


ŷ = 275

[Figure: "Number of Sales Versus Number of Unique Website Visitors Per Day, January 1 through
January 31, 20X1." Number of sales per day (y-axis, 0 to 1,000) plotted against number of unique
visitors to the website per day (x-axis, 0 to 50,000), with the regression line ŷ = 38.0667 + 0.0158x and
the upper and lower boundaries of a confidence interval at a 95% confidence level.]

Note that several of the observations are outside the 95% confidence interval. That fact illustrates what the
confidence interval is and highlights what it is not.

This sample’s confidence interval at a confidence level of 95% does not mean that 95% of the observations
in this sample, in any other sample, or in the population will be within the confidence interval, nor does it
mean that the true value of sales as a function of website visitors will be within that interval 95% of the
time. Instead, a confidence interval of 95% means that if several periods are sampled and analyzed
using the same 95% confidence interval, 95% of those sample intervals can be expected to contain
the true population relationship between sales and website visitors.


Example: If the same sampling of sales versus website visitors were performed for each month of 20X1,
20X2, and 20X3 (36 months) using the same 95% confidence interval, the location of the confidence
interval bands would be slightly different for each month. However, for approximately 34 out of the 36
sampled months (about 95%), the bands would contain the true value of sales related to website visitors
for the population.
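
A sketch of how such a regression and a 95% confidence interval for a prediction might be produced with the statsmodels library; the observations are simulated here, since the underlying daily data is not given:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)

    # Simulated daily data for one month: unique visitors (x) and sales (y).
    visitors = rng.uniform(5_000, 45_000, size=31)
    sales = 38.0667 + 0.0158 * visitors + rng.normal(0, 60, size=31)

    model = sm.OLS(sales, sm.add_constant(visitors)).fit()

    # Predicted sales and a 95% confidence interval at 15,000 visitors.
    new_x = sm.add_constant(np.array([15_000.0]), has_constant="add")
    prediction = model.get_prediction(new_x)
    print(prediction.summary_frame(alpha=0.05))  # mean, mean_ci_lower, mean_ci_upper, ...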

Benefits and Limitations of Regression Analysis

Benefits of Regression Analysis

• Regression analysis is a quantitative method and as such it is objective. A given data set generates
specific results. The results can be used to draw conclusions and make predictions.

• Regression analysis is an important tool for drawing insights, making recommendations, and deci-
sion-making.

Limitations of Regression Analysis

• To use regression analysis, historical data are required. If historical data are not available, regression
analysis cannot be used.

• Even when historical data are available, the use of historical data is questionable for making predic-
tions if a significant change has taken place in the conditions surrounding that data.

• The usefulness of the data generated by regression analysis depends on the choice of independent
variable(s). If the choice of independent variable(s) is inappropriate, the results can be misleading.

• The statistical relationships that can be developed using regression analysis may be valid only for the
range of data in the sample.

Study Unit 17: F.4. Sensitivity Analysis

Sensitivity Analysis
Sensitivity analysis can be used to determine how much the prediction of a model will change if one input
to the model is changed. It can be used to determine which input parameter is most important for achieving
accurate predictions. Sensitivity analysis is also known as "what-if" analysis.

To perform sensitivity analysis, define the model and run it using the base-case assumptions to determine
the predicted output. Next, change one assumption at a time, leaving the other assumptions unchanged
and run the model again to determine what effect changing that one assumption has on the predictions of
the model. The amount of sensitivity of the prediction to the change in the input is the percentage of change
in the output divided by the percentage of change in the input. Sensitivity analysis may reveal some area
of risk that the company had not been aware of previously.
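
A minimal sketch of one-at-a-time sensitivity analysis on a hypothetical profit model (all names and figures are illustrative):

    def profit(price, units, unit_cost, fixed_cost):
        """A simple base-case profit model."""
        return (price - unit_cost) * units - fixed_cost

    base = {"price": 25.0, "units": 10_000, "unit_cost": 15.0, "fixed_cost": 60_000}
    base_profit = profit(**base)  # 40,000 in the base case

    # Change one input at a time by +10% and compute the sensitivity:
    # percentage change in output divided by percentage change in input.
    for name in base:
        trial = dict(base, **{name: base[name] * 1.10})
        output_change = (profit(**trial) - base_profit) / base_profit
        print(f"{name}: sensitivity = {output_change / 0.10:.2f}")

In this sketch, price has the largest sensitivity (6.25): a 1% increase in price increases profit by about 6.25%, so price would be the most critical input to estimate accurately.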

Monte Carlo Simulation Analysis


Whereas sensitivity analysis involves changing one input variable at a time, a Monte Carlo simulation
analysis can be used to find solutions to mathematical problems that involve changes to multiple variables
at the same time. Monte Carlo simulation can be used to develop an expected value when the situation is
complex and the values cannot be expected to behave predictably. Monte Carlo simulation uses repeated
random sampling and can develop probabilities of various scenarios coming to pass that can be used to
compute a predicted result.


Adding a Monte Carlo simulation to a model allows the analyst to assess various scenario probabilities
because various random values for the probabilistic inputs can be generated based on their probability
distributions. The analyst determines ranges for the probabilistic inputs and their probability distributions,
means, and standard deviations. The application then generates the random values for the probabilistic
inputs based on their ranges, probability distributions, means, and standard deviations as determined by
the analyst.

The values for the probabilistic inputs are used to generate multiple possible scenarios, similar to performing
statistical sampling experiments, except that it is done on a computer and over a much shorter time span
than actual statistical sampling experiments. Enough trials are conducted (often hundreds or thousands)
with different values for the probabilistic inputs to determine a probability distribution for the resulting
scenario, which is the output. The repetition is an essential part of the simulation.

For example, if the simulation is run to evaluate the probability that a new product will be profitable, the
output may include a prediction for average profit and the probability of a loss.
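
A minimal Monte Carlo sketch for such a new-product evaluation, in which the distributions, means, and standard deviations are hypothetical choices an analyst might make:

    import numpy as np

    rng = np.random.default_rng(42)
    n_trials = 100_000  # repetition is an essential part of the simulation

    # Probabilistic inputs, randomly generated from analyst-chosen distributions.
    units = rng.normal(10_000, 1_500, n_trials)    # demand in units
    price = rng.uniform(22.0, 28.0, n_trials)      # selling price per unit
    unit_cost = rng.normal(15.0, 1.0, n_trials)    # variable cost per unit

    profit = (price - unit_cost) * units - 60_000  # hypothetical fixed costs

    print("Average profit:", profit.mean())
    print("Probability of a loss:", (profit < 0).mean())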

Benefits of Sensitivity Analysis and Simulation Models

• Sensitivity analysis can identify the most critical variables, that is, the variables that are most likely
to affect the result if they are inaccurate. Since those are the variables that will make the most
difference, those are the variables that should receive the most attention in making predictions.
• Simulation is flexible and can be used for a wide variety of problems.
• Both sensitivity analysis and simulation analysis can be used for “what-if” situations, because they
enable the study of the interactive effect of variables.
• Both sensitivity analysis and simulation analysis are easily understood.
• Many simulation models can be implemented without special software packages because most
spreadsheet packages provide usable add-ins. For more complex problems, simulation applications
are available.

Limitations of Sensitivity Analysis and Simulation Models

• The results of sensitivity analysis or simulation analysis can be ambiguous when the inputs used are
themselves predictions.
• The variables used in a sensitivity analysis are likely to be interrelated. Changing just one variable
at a time may fail to take into consideration the effect that variable’s change will have on other
variables.
• Simulation is not an optimization technique. It is a method that can predict how a system will
operate when certain decisions are made for controllable inputs and when randomly generated val-
ues are used for the probabilistic inputs.
• Although simulation can be effective for designing a system that will provide good performance,
there is no guarantee it will be the best performance.
• The results will be only as accurate as the model that is used. A poorly developed model or a model
that does not reflect reality will provide poor results and may even be misleading.
• There is no way to test the accuracy of assumptions and relationships used in the model until a
certain amount of time has passed.


Benefits of Data Analytics in General

• The process of cleaning the data preparatory to processing it can detect errors, duplicate infor-
mation, and missing values. If the errors and duplicate information can be corrected and the missing
values supplied, the data quality can be improved.
• The results of data analytics done correctly can lead to improved sales revenues and profits.
• It can help to reduce fraud losses by recognizing potentially fraudulent transactions and flagging
them for investigation.
• Some easy-to-use data analytics tools are available that average users with little knowledge of data
science can use to access data, perform queries, and generate reports. As a result, data scientists
can be freed up to do more critical data analysis projects.
• The use of data analytics can vastly improve forecasting.

Limitations of Data Analytics in General

• Big Data is used in data analytics to find correlations between variables. However, correlation does
not prove causation. The fact that two variables are correlated does not mean that one variable
caused the other. Both variables could have been caused by a third, unidentified, factor.
• Big Data can be used to find correlations and insights using an endless number of questions. But if
the wrong questions are asked of the data, the answer will be meaningless even though it may be
the “right” answer.
• Failure to take into consideration all relevant variables can lead to inaccurate predictions.
• Data breaches are a risk of using Big Data.
• Customer privacy issues and the risk of the misuse of data obtained from data analytics are matters
for concern.
• In addition to the cost of the data analytics tools themselves, training on the use of the tools so
they are used to their best advantage may entail costs, as well.
• Some easy-to-use data analytics tools are available that average users with little knowledge of data
science can make use of to access data, perform queries, and generate reports. Use of the tools by
those without a background in statistical analysis and data science and without adequate training,
though, can cause risks such as data inconsistency, a lack of knowledgeable verification of the
results, a lack of proper data governance, and ultimately, poor decisions.
• Selection of the right data analytics tools can be difficult.

Study Unit 18: F.4. Visualization or Visual Discovery


Data visualization, also called visual discovery or data discovery analysis, is the visual representation
of information. It is used to better understand data and the predictions made from data. Data discovery
analysis is the process by which businesses collect data from various sources and analyze it, using advanced
analytics and visual analysis to detect patterns, trends, and outliers in the data sets. Business information
can be consolidated and used to develop enhanced business processes and models, share insights across
departments, develop intelligent business strategies, and make data-driven decisions. Use of data discovery
analysis can enable businesses to gain a competitive edge, meet their goals, and generate business value.

Charts, tables, and dashboards can be used to explore, examine, analyze, and display data. Interactive
dashboards allow users to access and interact with real-time data and give managers a means to quickly
see what might otherwise not be readily apparent. The choice of information to include in a dashboard
depends on what a manager needs to see and can include visual presentations such as colored graphs
showing, for instance, current customer orders. A dashboard can include drill-down capability to enable the
user to explore the details behind the visual.

In the data mining process, visualization is primarily used in exploration and cleansing of the data in the
preprocessing step of data mining and in the reduction of the data dimensions step of the pro-
cess.124 For example:

• Visualization used in data exploration can help the analyst determine which variables to include in
the analysis and which variables might be unnecessary.

• Visualization is used in data cleansing to find erroneous values in the historical data that need to
be corrected (such as a sale recorded with a date 10 years in the future or a patient aged 250
years because his birth date is incorrect), missing values, duplicate records, columns that may
have the same values in all the rows, and so forth.

• In data reduction, visualization can help in determining which categories can be combined.

Visualization is also used in communicating information to users of data. Examples include charts and graphs
presented on managers’ dashboards to help them understand the information by visualizing it, enabling
faster decision-making.

Benefits of Data Visualization

• Data visualization can make data more understandable.


• It can promote quick assimilation of large amounts of data.
• Visualization can illustrate relationships between data items and events, enabling easier identification
of correlations such as causes of sales increases and decreases.
• Data visualization can lead to improved understanding of business operations, facilitating faster deci-
sion-making.
• It helps to communicate business insights and promotes interaction with data.
• Visualization can be used in data mining to help the data analyst determine which variables to include
and which may not be necessary.
• In data cleansing, visualization helps in finding erroneous values that need to be corrected.
• In data reduction, visualization can help determine categories that can be combined.
• Visualization can enable identification of trends and better interpretation of the data.
• It helps in recognizing patterns, for example in customer behavior.

Limitations of Data Visualization

• Visualization may lead to speculative conclusions when it is used to make estimations and projections.
• Data that would be meaningful may be excluded in choosing the data to use in the visualization,
leading to biased results.
• The information presented in the visualization may be oversimplified, leading to oversimplified con-
clusions.
• Users may rely too much on the visuals and fail to look more deeply into the data on which the visuals
are based, thereby missing important insights.
• The potential exists for misrepresentation or distortion of the data either accidentally or deliberately
through choices made in the way it is presented.

124
See Steps in Data Mining earlier in this section.


Dashboard Design Best Practices


A dashboard displays several items of information in a single place. For a middle manager, the information
would be relevant to a given objective or process to enable monitoring and analysis of the root cause of
problems. For senior management, it may show patterns and trends in data across the organization.

Characteristics of a well-designed dashboard are:

• All the required information is available on a single screen in a way that can be quickly understood,
without distractions.

• It uses visual components such as charts to highlight the information and exceptions requiring
action.

• It is easy to use, requiring minimal training.

• It allows for further exploration, but it does not require it. That is, a user does not need to click
through several screens or perform several operations to get the basic information needed.

• It draws data from multiple systems and combines them into a summarized view of the business.

• It does not perform complex calculations that are not transparent and that could even incorporate
a small calculation error that would produce inaccurate information and cause a manager to make
a wrong decision.

• It provides the ability to drill down to the underlying data to get more detail for evaluating and
analyzing it.

• The information presented is refreshed in a timely manner so the user can keep up to date, and
alerts and exceptions are streamed to the dashboard.

• It provides benchmarks against which to compare key performance indicators. Benchmarks are
current practices of other firms, current practices of the best-performing divisions within the same
company, or the company’s own historical performance that serve as a standard against which
current performance can be compared.

Table and Chart Design Best Practices


One of the limitations of data visualization is that the potential exists for misrepresentation or distortion of
data either accidentally or deliberately through choices made in the way it is presented. Tables and charts
should be designed to enhance understanding and avoid distortion in the communication of complex infor-
mation.


A few design best practices for avoiding distortion are:

• Don’t omit important data. For example, in a time series that displays the increase in global ocean
temperatures, the graph should begin at the point of the industrial revolution since that is when
the increase in global ocean temperatures began to accelerate. Beginning with a much later date
omits important information and misrepresents the proportionate amount of the increase.

• If the graph is a time series, make sure the dates on the horizontal axis (x-axis) are sequential
and the amount of time elapsed from one date to the next is consistent.

• The scale of the vertical axis (y-axis) should be the right size to properly present the data. An
inappropriate y-scale can either intentionally or accidentally obscure or magnify differences in data.
A line on a chart or graph may give the wrong impression by appearing too flat or too steep,
depending on the scale used; and a small movement can appear to be a big movement, or a big
movement can appear to be a small movement if the scale is inappropriate.

• If it is practical, the scale of the y-axis should begin with zero, particularly if two or more items
are being compared on the graph or if a value is being presented over time. For example, if the
scale begins at a value just below the lowest item on the graph, the proportional difference between
or among the values represented on the graph will be distorted. One item may appear to be twice
the size of another item when that is not the case. Depending on the data, though, sometimes it
is better to start with a number other than zero. The starting number of the y-scale is a matter of
the chart designer’s judgment.

• The increments (the amount of increase in the values of the tick marks) on the y-axis should be
consistent. Varying the amounts of the increments without varying the size of the increments
displayed creates a misleading graph. If presenting multiple graphs using the same data, be con-
sistent in the y-scale used from one graph to the next.

• On a pie chart, make sure the percentages of the items presented on the chart sum to 100%.

• Keep the number of items included on a pie chart to a minimum to avoid clutter. The maximum
should be about five items. If the number of items is greater than five, a pie chart may not be the
best type of visualization to use.

• Indicate sampling error with confidence intervals when appropriate, so viewers can separate real
variations from random variations.

• Provide a comparison to an external benchmark such as an industry average when available to
enhance understanding and interpretation of the data by viewers.

• Use the right chart or graph for the job. Different types of charts and graphs convey different types
of information. For example:

o Line charts work well for time series presentations.

o Bar charts are useful for depicting data that can be easily categorized.

o Pie charts show proportions.

o Scatter plots and bubble charts can be used to illustrate relationships between variables.

o Histograms look like bar charts, but they are used to depict frequency distributions.

• Consider whether a table might be a better solution than a graph or chart. A table works better
when values need to be expressed precisely or when there is a need to look up individual values.
A graph or chart works better when patterns, trends, and exceptions need to be emphasized or
when full sets of values need to be compared.


Best practices in table and chart design also take into consideration the way the table or chart will appear
to viewers. Some general guidelines are the following.

• Keep it simple. In general, less is more in chart and graph design. Do not use elements that are
not needed to convey the information desired. Thick gridlines, heavy borders, shadows, and 3-D
elements distract from the information in the graph unless they are necessary such as in a bubble
chart.125

• Use gridlines and tick marks only when necessary to make the information clear, and then use a
shade of gray to avoid distracting from the actual information in the chart. However, do not make
the lines so light that a person with low vision or another visual disability would have difficulty
seeing them.

• Use lines for the axes only if necessary to make the information clear, and then use gray, as above.

• Use grayed-out labels for the axes for the same reason: to avoid distracting from the data.

• Borders around the graph or around the plot area are usually not needed. But if a border is needed,
use a thin gray line.

• Use white for the background of a graph.

• Whenever possible, avoid using slanted labels for the x-axis because they are difficult to interpret.

• Color should be used meaningfully, not simply to decorate the chart or graph. Using color only as
a decoration distracts from the information being presented.

• In general, use different colors only when it will make the information more understandable, such
as using a different color for each data series. Using different colors for various items in a single
data series is not good practice.

• Colors of small or thin objects are harder to distinguish than colors of larger objects. For objects
such as bars, use colors of medium intensity. For lines and dots, use colors of higher intensity and
if necessary, enlarge the dots or widen the lines.

• If the number of separate data series is large, it is better to use multiple panels than to try to
display all of them on one chart.

• If a chart contains only one data series, a legend is not needed because the chart title will identify
what is being presented.

• Data labels are information directly in the chart identifying the number that each data point rep-
resents, such as a number above each bar on a vertical bar chart. If data labels are used, axis
labels (usually y-axis) that would provide the same information are not necessary. The colors of
the fonts used for the data labels should match the colors of the bars or lines that each relates to.
The vertical bar, bubble chart, and histogram examples that follow illustrate the use of data labels
instead of numbers on the y-axis.

• When choosing colors to use for the elements in a chart, keep in mind the needs of viewers with
visual disabilities such as low vision or color-blindness. Use colors with sufficient contrast so that
the elements will be easily distinguishable from the background by people with low vision. Check
whether colors used for different data series are distinguishable by color-blind people. Color con-
trast checkers and information on what colors are not distinguishable from each other by people
with the various types of color-blindness are available on the Internet.

125
A bubble chart is illustrated in the examples that follow.


Note: Color-blindness, which is usually a genetic condition, is an inability to perceive the dif-
ference between certain colors. Approximately 10 percent of males and 1 percent of females
are color-blind, and most of them cannot distinguish between red and green. If red and green
are used on a graph to distinguish “poor” values from “good” values, the distinction will be lost
on color-blind people who cannot tell the difference. Therefore, when color coding information,
avoid using red and green in the same presentation. Another, less-common, condition is an
inability to distinguish between shades of blue and yellow. A few rare individuals see no color
at all but see everything in shades of black and white.

The best practices above are illustrated in the visualization examples that follow and in other graphics
throughout this textbook.

Tables Used in Visualization


A table can be in any form and can include all the data available or only certain data.

The data table below will be used in all the chart examples that follow. The table contains data on the
number of pounds of strawberries sold on each day of the week by a grocery store over an eleven-week
period from June 3 through August 18. The store’s produce buyer uses this information in placing orders,
so that enough strawberries are purchased to meet the anticipated demand each day without over-buying
and having too many unsold strawberries that will need to be thrown away because they spoil.

In addition to daily strawberry sales for each of the days of the week for eleven weeks, the data table below
contains the mean (average) for each day of the week over the eleven-week period.

Day of     Jun.  Jun.   Jun.   Jun.   Jul.  Jul.  Jul.   Jul.   Jul. 29-  Aug.  Aug.   Mean
the Week   3-9   10-16  17-23  24-30  1-7   8-14  15-21  22-28  Aug. 4    5-11  12-18  (calculated)

Mon.       15    10     28     39     48    25    12     20     30        23    28     25

Tues.      35    25     45     40     46    49    30     60     38        32    22     38

Wed.       68    42     57     74     84    55    30     55     60        75    88     63

Thur.      60    80     65     90     65    85    50     70     110       75    45     72

Fri.       95    60     85     90     70    80    105    85     50        80    105    82

Sat.       110   75     85     98     75    102   85     50     120       100   65     88

Sun.       11    7      14     10     40    18    20     25     35        20    22     20

Scatter Plot
A scatter plot can be used to show all the values for a dataset, typically when there are two variables, and
to illustrate the relationship between the variables. A scatter plot shows the intersection of the two variables.
One variable may be independent and the other dependent, or both variables may be independent.
When there is a dependent variable, the independent variable is generally plotted on the horizontal (x) axis
and the dependent variable is plotted on the vertical (y) axis.

A scatter plot can reveal a correlation between variables or alternatively, a lack of correlation. For example,
do sales of strawberries correlate with days of the week? A scatter plot can answer that question. On the
scatter plot that follows, all sales occurring on Mondays are plotted on the x-axis at “Monday,” all Tuesday
sales are plotted at “Tuesday,” and so forth.
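
A sketch of how such a scatter plot might be produced with matplotlib, using the weekday values from the data table above:

    import matplotlib.pyplot as plt

    days = ["Mon.", "Tues.", "Wed.", "Thur.", "Fri.", "Sat.", "Sun."]
    # Pounds sold on each day of the week for the eleven weeks (data table above).
    weekly_sales = {
        "Mon.":  [15, 10, 28, 39, 48, 25, 12, 20, 30, 23, 28],
        "Tues.": [35, 25, 45, 40, 46, 49, 30, 60, 38, 32, 22],
        "Wed.":  [68, 42, 57, 74, 84, 55, 30, 55, 60, 75, 88],
        "Thur.": [60, 80, 65, 90, 65, 85, 50, 70, 110, 75, 45],
        "Fri.":  [95, 60, 85, 90, 70, 80, 105, 85, 50, 80, 105],
        "Sat.":  [110, 75, 85, 98, 75, 102, 85, 50, 120, 100, 65],
        "Sun.":  [11, 7, 14, 10, 40, 18, 20, 25, 35, 20, 22],
    }

    fig, ax = plt.subplots()
    for position, day in enumerate(days):
        # One column of points per weekday; identical values overlap (see Note below).
        ax.scatter([position] * len(weekly_sales[day]), weekly_sales[day])
    ax.set_xticks(range(len(days)))
    ax.set_xticklabels(days)
    ax.set_ylabel("Number of Pounds Sold Per Day")
    ax.set_title("Pounds of Strawberries Sold Per Day, June 3 - August 18")
    plt.show()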


Note: If a particular day of the week has the same number of sales for two or more weeks, only one
value is visible on the scatter plot that follows. For example, 65 pounds of strawberries were sold on
Thursday of the week of June 17-23 and again on Thursday of the week of July 1-7. However, only the
point corresponding to Thursday of the week of July 1-7 can be seen at “Thurs.” on the x-axis and “65”
on the y-axis. The point representing Thursday’s sales during the week of June 17-23 is covered up by
the data point for Thursday of the later week, so only ten points are visible for Thursdays instead of
eleven. That does not affect the usefulness of the scatter plot, however.

In this case, it appears that strawberry sales do correlate with days of the week. Sales build from Monday
through Saturday and then they drop off on Sunday each week.

[Figure: Scatter plot, "Number of Pounds of Strawberries Sold Per Day, June 3 through August 18."
Pounds sold per day (y-axis, 0 to 120) plotted by day of the week (x-axis, Monday through Sunday), with
one data series per week (Jun. 3-9 through Aug. 12-18).]

Charts Containing Summarized Statistics


Several charts are used to present summarized statistics such as means, maximum values, and minimum
values.

Dot Plot
A dot plot provides information in the form of dots. A dot plot can be used to visualize summarized data
points for each category on the x-axis. For example, the following dot plot shows the minimum, the maxi-
mum, and the mean number of pounds of strawberries sold for each day of the week during the eleven-
week period.


Dot Plot
Strawberry Sales By Day of the Week
June 3 through August 18

[Chart: for each day of the week (Monday through Sunday) on the x-axis, three dots mark the minimum, the maximum, and the mean number of pounds sold per day on the y-axis (0 to 140).]

The dot plot above displays the central tendency as well as the dispersion in the values of the dataset.

Note: A measure of central tendency is a value that represents the center point of a set of data. It
helps a user to understand a set of data more quickly than would be possible by simply looking at all the
individual values. Measures of central tendency are the mean, the median, and the mode.

• The mean is the numerical average of the numbers in the dataset. It is the sum of the individual
values divided by the total number of values.

• The median is the value in a dataset that has an equal number of values above it and below it.
When all the values are arranged from the smallest to the largest, if the number of values is an odd
number, the median is the middle value. If the number of values is an even number, the median
is the average of the two values in the middle.

• The mode is the value occurring most frequently in the dataset. If none of the values repeat, the
dataset has no mode. If several values repeat an equal number of times, a dataset can have multiple
modes.

Dispersion describes how much individual values in a set of data are scattered or spread out about their
center. The amount of dispersion in a dataset is important because it is an indicator of the amount of
risk involved in any prediction of future events. If historical values are highly dispersed about their
mean, they vary widely. When observations vary widely, the probability is greater that future results will
vary widely from predicted results, and that wide variability creates risk.
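
These measures can be computed directly with Python's statistics module. As a minimal sketch, applied to the Wednesday values from the data table:

    from statistics import mean, median, multimode, pstdev

    wed = [68, 42, 57, 74, 84, 55, 30, 55, 60, 75, 88]

    print(round(mean(wed)))   # 63: the sum of the values divided by 11
    print(median(wed))        # 60: the middle of the eleven sorted values
    print(multimode(wed))     # [55]: 55 occurs twice; every other value occurs once
    print(pstdev(wed))        # population standard deviation, one measure of dispersion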

The dot plot shows that the dispersion of the data points about their means is greater on Wednesdays,
Thursdays, Fridays, and Saturdays and less so on Mondays, Tuesdays, and Sundays. The produce buyer
can see that strawberry sales volume is more volatile on Wednesdays, Thursdays, Fridays, and Saturdays
and less volatile on Mondays, Tuesdays, and Sundays. Thus, sales projections for Mondays, Tuesdays, and
Sundays may be more accurate than they would be for the days with greater variation.
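
As a sketch (matplotlib assumed, reusing the sales dictionary shown earlier), a dot plot of these three statistics can be drawn with one scatter series per statistic:

    import matplotlib.pyplot as plt
    from statistics import mean

    days = ["Mon.", "Tues.", "Wed.", "Thur.", "Fri.", "Sat.", "Sun."]
    x = range(len(days))

    plt.scatter(x, [min(sales[d]) for d in days], label="Minimum")
    plt.scatter(x, [max(sales[d]) for d in days], label="Maximum")
    plt.scatter(x, [round(mean(sales[d])) for d in days], label="Mean")

    plt.xticks(x, days)
    plt.ylabel("Number of Pounds Sold Per Day")
    plt.legend()
    plt.show()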

Bar Chart
A bar chart is useful for comparing a statistic across groups, and it works well for depicting data that can
be easily categorized. The height of the bar (or its length, if the bar is displayed horizontally) represents
the value of the statistic.

The following bar chart shows the mean number of pounds of strawberries sold per day over the eleven-
week period. Thus, the Monday sales figure is the average of eleven Mondays, and so forth for each of the

days of the week. This chart can be used to easily visualize which are the heaviest days of the week for
selling strawberries so that orders can be placed at the proper times.

Vertical Bar Chart
Mean Strawberry Sales by Day of the Week
June 3 through August 18

[Chart: one vertical bar per day of the week (Monday through Sunday), with data labels giving the mean number of pounds sold per day: 25, 38, 63, 72, 82, 88, and 20; no numbered y-axis is shown.]

The vertical bar chart above illustrates the use of data labels (the numbers above each bar that tell what
the bar size depicts) and the elimination of axis labels for the y-axis, one of the best practices for data
presentation. Note that because of the labels, a y-axis with numbers does not need to be shown.
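
One way to apply both practices in code is shown in the following sketch, which assumes matplotlib 3.4 or later (the version that introduced bar_label):

    import matplotlib.pyplot as plt

    days  = ["Mon.", "Tues.", "Wed.", "Thur.", "Fri.", "Sat.", "Sun."]
    means = [25, 38, 63, 72, 82, 88, 20]   # from the data table above

    fig, ax = plt.subplots()
    bars = ax.bar(days, means)
    ax.bar_label(bars)            # data labels above each bar
    ax.yaxis.set_visible(False)   # the labels make y-axis numbers unnecessary
    ax.set_title("Mean Strawberry Sales by Day of the Week")
    plt.show()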

A bar chart can also be used to portray values horizontally, and when it does, it becomes the exception to
the rule that the independent variable is generally on the x-axis (horizontal) and the dependent variable is
generally on the y-axis (vertical). When a bar chart portrays the values horizontally, the independent vari-
able is on the vertical axis and the bars extend to their values on the horizontal axis, representing the
dependent variable.

The following horizontal bar chart is used to show not only the mean sales in pounds for each day of the
week but also the minimum sales and maximum sales for each day, which are also important information.
Data labels are not used on the horizontal bar chart due to a lack of space, but if the bars were wider, labels
could be used.


Horizontal Bar Chart
Strawberry Sales by Day of the Week
June 3 through August 18

[Chart: for each day of the week (Monday through Sunday) on the vertical axis, three horizontal bars show the minimum, the mean, and the maximum number of pounds sold per day on the horizontal axis (0 to 140).]

Here is the same information presented vertically. The vertical bar chart illustrates the use of data labels
and the elimination of axis labels for the y-axis, one of the best practices for data presentation. This chart
presents three data series, and the color of the data label for each data series matches the color of the bar.

Vertical Bar Chart
Strawberry Sales by Day of the Week
June 3 through August 18

[Chart: for each day of the week (Monday through Sunday) on the x-axis, three vertical bars show the maximum, the mean, and the minimum number of pounds sold per day, each with a color-matched data label (for example, Monday: 48, 25, and 10); no numbered y-axis is shown.]
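
A sketch of such a grouped (three-series) vertical bar chart, assuming matplotlib and the sales dictionary shown earlier, offsets each series by a fraction of the category width so the bars sit side by side:

    import matplotlib.pyplot as plt
    from statistics import mean

    days  = ["Mon.", "Tues.", "Wed.", "Thur.", "Fri.", "Sat.", "Sun."]
    x     = range(len(days))
    width = 0.25   # width of each bar; also the offset between the series

    plt.bar([i - width for i in x], [max(sales[d]) for d in days], width, label="Maximum")
    plt.bar(list(x), [round(mean(sales[d])) for d in days], width, label="Mean")
    plt.bar([i + width for i in x], [min(sales[d]) for d in days], width, label="Minimum")

    plt.xticks(x, days)
    plt.ylabel("Number of Pounds Sold Per Day")
    plt.legend()
    plt.show()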


Pie Chart
Pie charts are primarily for showing proportions. A pie chart is in the form of a circle that portrays one value
for each category, marked as pieces of a pie. In the example that follows showing the mean pounds of
strawberries sold per day for each day of the week over the eleven-week period, the pieces of the pie
represent the days of the week and each one is sized to represent the mean sales for that day. The size of
the “pieces” helps the user to visualize the relative sizes of the mean sales for each day.

A pie chart does not use axes. Therefore, including values on the chart as labels helps users to interpret
and use the information. The mean pounds of strawberries sold on each day have been added to each pie
piece in the chart that follows. Values can be added as labels to data in other types of charts as well, but
in some charts, doing so tends to make the chart hard to read.

A limitation of the pie chart is that it can present only one value for each category.

Pie Chart
Mean Number of Pounds of Strawberries Sold
by Day of the Week
June 3 through August 18

Sun. Mon.
20 25
Tues.
38
Sat.
88
Wed.
63

Fri.
82 Thurs.
72

Best practices in pie chart design include limiting the number of items on the chart to five. Illustrating a
pie chart with the same daily sales data used in the other charts and graphs in this study unit requires
seven categories, one for each day of the week. Therefore, in this case a pie chart may not be the
best way to visualize the data.

If a pie chart is used to present percentages of a whole, make sure the percentages total to 100%.
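
As a minimal sketch (matplotlib assumed), plt.pie sizes the slices in proportion to the values supplied, and the autopct argument prints each slice's share of the whole, so the displayed percentages total 100%:

    import matplotlib.pyplot as plt

    days  = ["Mon.", "Tues.", "Wed.", "Thurs.", "Fri.", "Sat.", "Sun."]
    means = [25, 38, 63, 72, 82, 88, 20]

    # Label each slice with its day and its percentage of total mean sales.
    plt.pie(means, labels=days, autopct="%1.0f%%")
    plt.title("Mean Pounds of Strawberries Sold by Day of the Week")
    plt.show()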


Line Chart
A line chart can be used to visualize several observations for each category, using one line for each series
of observations. A line chart can also work well to depict change over time, such as in a time series, as
illustrated in the topic of Regression Analysis in this volume.

The strawberry sales data are shown on the following line chart as the minimum, the maximum, and the
mean values for each day of the week for the eleven-week period.

Line Chart
Strawberry Sales by Day of the Week
June 3 through August 18

[Chart: three lines (maximum, minimum, and mean) plotted across the days of the week (Monday through Sunday) on the x-axis against the number of pounds sold per day (0 to 140) on the y-axis.]
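
The same three summary series can be drawn as lines. A minimal sketch, assuming matplotlib and the sales dictionary from earlier:

    import matplotlib.pyplot as plt
    from statistics import mean

    days = ["Mon.", "Tues.", "Wed.", "Thur.", "Fri.", "Sat.", "Sun."]

    # One line per statistic, with a marker at each weekday's value.
    plt.plot(days, [max(sales[d]) for d in days], marker="o", label="Maximum")
    plt.plot(days, [min(sales[d]) for d in days], marker="o", label="Minimum")
    plt.plot(days, [round(mean(sales[d])) for d in days], marker="o", label="Mean")

    plt.ylabel("Number of Pounds Sold Per Day")
    plt.legend()
    plt.show()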

Bubble Chart
Like a scatter plot (shown earlier in this topic), a bubble chart can be used to illustrate relationships between
variables, but a bubble chart replaces data points with bubbles that vary in size according to the size of the
values they depict, thus adding the relative sizes of the values plotted as an additional dimension to the
chart.

However, unlike the scatter plot in which all the values for a dataset can be depicted, a bubble chart works
better when only one data series is depicted. Because the bubbles can be significantly larger than the data
points on a scatter plot, including more than one data series on a bubble chart can become unwieldy.

The bubble chart that follows shows only one value for each day of the week: the mean of the strawberry
sales for that weekday.


Bubble Chart
Mean Strawberry Sales by Day of the Week
June 3 through August 18

[Chart: one bubble per day of the week (Monday through Sunday) on the x-axis, positioned by its mean number of pounds sold on the y-axis and sized in proportion to that value, with data labels: 25, 38, 63, 72, 82, 88, and 20.]

Note the data labels used on each bubble in place of numbers on the y-axis.
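
In matplotlib, a bubble chart is a scatter plot whose marker sizes vary with the data. A sketch follows; the scale factor of 10 is an arbitrary choice made here for legibility, not part of the source data:

    import matplotlib.pyplot as plt

    days  = ["Mon.", "Tues.", "Wed.", "Thur.", "Fri.", "Sat.", "Sun."]
    means = [25, 38, 63, 72, 82, 88, 20]

    # The s= argument sets each marker's area, so larger means make larger bubbles.
    plt.scatter(days, means, s=[m * 10 for m in means])
    for day, m in zip(days, means):
        plt.annotate(str(m), (day, m), ha="center", va="center")  # data label on each bubble
    plt.ylabel("Mean Number of Pounds Sold")
    plt.show()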

Charts Containing the Entire Distribution of Values


Although summaries and means (averages) are very useful, much can be gained by looking at additional
statistics such as the median of a set of data or by examining the full distribution of the data. Histograms
and boxplots can be used to display the entire distribution of a numerical variable.

Histogram
A histogram shows the frequencies of a variable using a series of vertical bars. The values of the variable
may occur over time, or they may be as of a moment in time.

A histogram looks like a bar graph, but the two differ: a bar graph relates two variables to one another,
whereas a histogram communicates only one variable and is used to depict the frequency distribution of
that variable.

Note: A frequency distribution is a representation of the number of observations falling within each
of several intervals. For example, on 11 of the days in the dataset, total daily strawberry sales fell
within the interval of 1-20 pounds.

To construct a histogram, the range of values of the variable must first be divided into intervals, or bins,
and then the number of values falling into each interval is counted.

The following histogram shows frequencies: how many days during the eleven-week period the sales of
strawberries were 1 through 20 pounds, how many days 21 through 40 pounds were sold, and so forth. Six
bins are used for the strawberry sales data.


The bins and the values in each bin are:

Number of Pounds    Frequency
Sold Per Day        (Number of Days)

1-20 pounds              11
21-40 pounds             18
41-60 pounds             16
61-80 pounds             14
81-100 pounds            12
101-120 pounds            6
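
These frequencies can be verified in code. A minimal sketch (standard library only, reusing the sales dictionary shown earlier in this topic):

    # Flatten the seven weekday lists into the 77 individual daily values.
    all_sales = [pounds for week in sales.values() for pounds in week]

    bins = [(1, 20), (21, 40), (41, 60), (61, 80), (81, 100), (101, 120)]
    for low, high in bins:
        count = sum(1 for v in all_sales if low <= v <= high)
        print(f"{low}-{high} pounds: {count} days")

With matplotlib, plt.hist(all_sales, bins=[1, 21, 41, 61, 81, 101, 121]) would draw the corresponding histogram, since with whole-pound data those bin edges reproduce the same six intervals.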

Histogram
Pounds of Strawberries Sold Per Day
June 3 through August 18

[Chart: six vertical bars, one per bin of pounds sold per day (1-20 through 101-120) on the x-axis, with data labels giving the frequency in number of days: 11, 18, 16, 14, 12, and 6; no numbered y-axis is shown.]

The histogram illustrates that on 11 days out of the 77 days in the eleven-week period, the total number
of pounds of strawberries sold during the day was 1 through 20 pounds; on 18 days it was 21 through 40
pounds, and so forth.

Note again the use of data labels instead of y-axis labels.

Boxplot
A boxplot is another type of chart that can be used to display the full distribution of a variable.

In the example of strawberry sales, the boxplot will show the minimum, the maximum, the mean, the
median, the first quartile, the second quartile (which is the same as the median), and the third quar-
tile for each day’s distribution, plus all the individual observations and any outliers.

• The minimum is the smallest value in a distribution, excluding smaller outliers.

• The maximum is the largest value in the distribution, excluding larger outliers.

• The mean is the numerical average of all the values in a particular set of data. It is the sum of the
individual values divided by the total number of values.

• The median is the middle value in a distribution when the values are ordered from the smallest to
the largest. In general, 50 percent of the data is larger than the median and 50 percent is smaller.
If the distribution contains an odd number of values, the median is the value with an equal number
of values below it and above it. If the distribution contains an even number of values, the median is
the mean (average) of the middle two numbers.

•   The first quartile (Q1) is the middle value between the minimum value and the median of the
distribution. It is called the 25th percentile because 25% of the values in the data set are below
Q1. The first quartile may also be called the “lower quartile.”

• The second quartile (Q2) is the same as the median of the distribution. It is the 50th percentile
because 50% of the values in the data set are below Q2.

• The third quartile (Q3) is the middle value between the median and the maximum value in the
data set. It is called the 75th percentile because 75% of the values in the data set are below Q3.
The third quartile may also be called the “upper quartile.”

• Outliers are values that are far away from most of the other values in the dataset. Outliers are
explained in more detail later in this topic.

The boxplot chart of the strawberry sales data follows. An explanation of what is depicted follows the chart.

[Chart (boxplot): for each day of the week (Monday through Sunday) on the x-axis, a box spans the first through third quartiles of pounds sold per day, with a horizontal bar at the median, an “X” at the mean, whiskers at the minimum and maximum excluding outliers, and circles at the individual data points.]

The boxplot covers the full eleven weeks of data by day of the week. Wednesdays will be used to exemplify
and explain the interpretation of the boxplot.

Below are the Wednesday sales volumes for the eleven weeks, with the individual Wednesday sales
reordered in ascending order and the minimum, maximum, median, mean, and first and third quartiles
indicated. The first and third quartiles are the averages of the two middle data points between the
minimum and the median (Q1) and between the median and the maximum (Q3).

Wed. sales, sorted:   30   42   55   55   57   60   68   74   75   84   88

            Minimum   Q1 (Average)   Median (Q2)   Q3 (Average)   Maximum   Mean (Calculated)
Wed.          30           55             60            74.5         88            63

Each day’s box on the boxplot encloses the first quartile through the third quartile of values for that day
during the eleven-week period. For Wednesdays:

• The bottom of the box marks the first quartile, 55 (the average of 55 and 55, the two middle
values between the minimum of 30 and the median of 60).

•   The top of the box marks the third quartile, 74.5 (the average of 74 and 75, the two middle
values between the median of 60 and the maximum of 88).

• The horizontal bar through the box marks the location of the median at 60.

• The “X” in the box marks the location of the mean (the average) for that day, which for Wednes-
days is 63 (rounded).

• The circles mark the individual data points other than Q1, Q3, the median, and the mean.

The horizontal lines at the ends of the vertical lines above and below each box are called “whiskers.” The
whiskers mark the minimum and the maximum for each day (excluding outliers):

• The minimum for Wednesdays is 30.

• The maximum for Wednesdays is 88.

The area from the bottom “whisker” to the bottom of the box (Q1) is the bottom 25% of the observations,
while the area from the top of the box (Q3) to the top “whisker” is the top 25% of the observations.
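
As a sketch, matplotlib's boxplot function draws most of these elements directly from the raw values (the individual non-outlier observations are not shown by default); it reuses the sales dictionary from earlier, showmeans adds the mean marker, and the default whisker rule is the 1.5 × IQR convention described below:

    import matplotlib.pyplot as plt

    days = ["Mon.", "Tues.", "Wed.", "Thur.", "Fri.", "Sat.", "Sun."]

    # One box per weekday: box from Q1 to Q3, bar at the median, whiskers at the
    # most extreme values within 1.5 x IQR, and circles for points beyond them.
    plt.boxplot([sales[d] for d in days], showmeans=True)
    plt.xticks(range(1, len(days) + 1), days)
    plt.ylabel("Number of Pounds Sold Per Day")
    plt.show()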

Outliers
Outliers are values that are far away from most of the other values in the data set, either because they
are much greater than the third quartile or much less than the first quartile. On the boxplot, a circle above
the top whisker or below the bottom whisker is an outlier.

There are no outliers in the Wednesday sales data. However, there is an outlier in the Monday data. When
there is an outlier, the whiskers mark a minimum or maximum for that day that is other than the outlier.
For Mondays, there is an outlier above the top whisker.

An outlier is determined by the locations of the first and third quartiles and the size of the interquartile
range (IQR) of the data set. The interquartile range is calculated as (Q3 − Q1).

Note: The definition of an outlier within a data set is a value that is less than Q1 or greater than Q3 by
more than 1.5 times the set’s interquartile range (Q3 − Q1).

Thus, an outlier is any value that is either

• Less than Q1 − (1.5 × IQR), or

• Greater than Q3 + (1.5 × IQR)


For Mondays, the first quartile is at 17.5 and the third quartile is at 29, so the IQR (interquartile range) is
29 − 17.5, which equals 11.5.

Therefore, for Mondays, an outlier is any value that is either

•   Less than (17.5 − [1.5 × 11.5]), which equals 0.25, or

•   Greater than (29 + [1.5 × 11.5]), which equals 46.25.

The maximum value for Mondays is 48, which is greater than 46.25, so for Mondays, 48 is an outlier. That
point is visible as the circle on the boxplot for Mondays that is above the whisker. The whisker marking
the maximum for Mondays is instead at 39, because 39 is the highest Monday value that is not greater
than 46.25.
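
The same test can be expressed in code. A sketch using numpy, whose default quantile interpolation reproduces Q1 = 17.5 and Q3 = 29 for this data (other quartile conventions exist and can give slightly different fences):

    import numpy as np

    monday = [15, 10, 28, 39, 48, 25, 12, 20, 30, 23, 28]

    q1, q3 = np.quantile(monday, [0.25, 0.75])   # 17.5 and 29.0
    iqr = q3 - q1                                # 11.5

    lower_fence = q1 - 1.5 * iqr                 # 0.25
    upper_fence = q3 + 1.5 * iqr                 # 46.25

    outliers = [v for v in monday if v < lower_fence or v > upper_fence]
    print(outliers)                              # [48], the circle above Monday's whisker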

