Data Literacy Course Notes 365 Data Science
Data Literacy
TABLE OF CONTENTS
ABSTRACT
1. Introduction
1.1. What Exactly Is Data Literacy
1.2. Why Do We Need Data Literacy
1.3. Data-Driven Decision Making
1.4. Benefits of Data Literacy
1.5. How to Get Started
2. Understanding Data
2.1. Data Definition
2.2. Types of Data
2.2.1. Qualitative vs. Quantitative Data
2.2.2. Structured vs. Unstructured Data
2.2.3. Data at Rest vs. Data in Motion
2.2.4. Transactional vs. Master Data
2.2.5. Big Data
2.3. Storing Data
2.3.1. Database
2.3.2. Data Warehouse
2.3.3. Data Marts
2.3.4. The ETL Process
2.3.5. Apache Hadoop
2.3.6. Data Lake
2.3.7. Cloud Systems
2.3.8. Edge Computing
2.3.9. Batch vs. Stream Processing
2.3.10. Graph Database
3. Using Data
3.1. Analysis vs. Analytics
3.2. Statistics
3.3. Business Intelligence (BI)
3.4. Artificial Intelligence (AI)
3.5. Machine Learning (ML)
ABSTRACT
Being data literate means having the necessary competencies to work with data.
Any manager or business executive worth their salt is able to articulate a problem that can be addressed with data.
If you want to build a successful career in any industry, acquiring full data literacy is essential.
First, you will start with understanding data terminology – we will discuss the
different types of data, data storage systems, and the technical tools needed to analyze
data.
Then, we will proceed with showing you how to use data. We’ll talk about
Business Intelligence (BI), Artificial Intelligence (AI), as well as various machine and
deep learning techniques.
In the third chapter of the course, you will learn how to comprehend data,
perform data quality assessments, and read major statistics (measures of central
tendency and measures of spread) that describe data. You will become familiar with fundamental analysis techniques such as
correlation, regression, forecasting, statistical tests, and classification.
1. Introduction
A data literate person has the necessary competencies to work with data.
Some of the most important questions a data literate person should be able to
answer are:
• How do we store data?
• Are the data complete and clean enough to support a correct decision?
Data are everywhere:
• Customers leave their digital footprints when engaging with products and services, e.g., through website visits
• Even cars are on the verge of becoming autonomous, thanks to the huge amounts of data they collect about traffic and road conditions
• Devices and machines are producing and processing more and more data. Most people feel overwhelmed by the flood of data and information
Data literacy helps organizations make sense of all their data, creating value by:
• Optimizing activities
• Improving productivity
2018 Survey “Lead with Data - How to Drive Data Literacy in the Enterprise” by the analytics company Qlik
Results:
• 24% of business decision makers are fully confident in their own data
literacy
• 94% of respondents using data in their current role recognize that data help them do their jobs better
• 82% of respondents believe that greater data literacy gives them stronger
professional credibility
Conclusion:
Although most professionals recognize the value of data, confidence and data literacy levels remain low.
2020 Survey “The Human Impact of Data Literacy” by Accenture and Qlik
Results:
• Only 32% of companies are able to realize tangible and measurable value
from data
• Only 21% of the global workforce are fully confident in their data literacy
skills
Conclusion:
Although employees are expected to become self-sufficient with data and make
data-driven decisions, many do not have sufficient skills to work with data comfortably
and confidently.
Gut-feeling decisions:
• The human mind forgets, ignores, or rejects times you were wrong, while remembering the times you were right
• Gut feeling can be blind to key input if the relevant facts are not concrete or objective enough
• It is also subject to confirmation bias when selecting and interpreting information
Definition:
“Confirmation bias is the tendency to search for, interpret, favor, and recall information in a way that confirms or supports one’s prior beliefs.”
It leads people to unconsciously select the information that supports their views.
Data-driven decisions:
• They constitute evidence of past results; with data, you have a record of what happened and of what worked
• A large pool of evidence is the starting point for a virtuous cycle that builds over time and may benefit not only a single decision-maker, but the entire organization
Conclusion:
Gut feeling should not be the only basis for decision making. Data should not be
overlooked, but rather added to the mix of ideas, intuition, and experience when
making decisions.
More and more professionals will be expected to be able to use data in their
roles.
• Which posts had the strongest impact in the last social media campaign?
• What kind of users visit the company’s website and are more likely to
convert?
Important notice:
The use of data is not about weakening humans and their decision-making
power. It should not be seen as a threat to managers’ jobs. Instead, data support and augment human judgment and decision making.
Conclusion:
There are four stages of data literacy. This course concerns the first stage only.
First stage:
• Read and understand data, and communicate about them with others
Second stage:
• Prepare data
Third stage:
• Understand statistical concepts, and apply the right techniques and tools
• Put data into operation in one’s own business, understanding the techniques employed
Fourth stage:
• Interpret the output of a predictive model to make sure that the results are reliable
2. Understanding Data
Definition:
“Data” should be used in the plural; the singular form is datum, which is a single value of a single variable.
Data ≠ Information
Examples of data:
• E-mails
• Browsing history
Types of data:
Quantitative data:
Data that can be measured in numerical form. The value is measured in the form
of numbers or counts. Each data set is associated with a unique numerical value.
Quantitative data are used to describe numeric variables. Types of quantitative data:
• Discrete data: Data that can only take certain values (counts). It involves integers.
Examples:
- Number of products sold
• Continuous data: Data that can take any value. It involves real numbers.
Examples:
- Stock price
• Interval (scale) data: Data that are measured along a scale. Each point on that scale is placed at an equal distance (interval) from one another, but there is no true zero point (e.g., temperature in degrees Celsius)
• Ratio (scale) data: Data that are measured along a scale with an equal distance between points and a true zero point.
Examples:
- Revenue
- Age
Qualitative data:
• Nominal (scale) data: Data that do not have a natural order or ranking. They are used for labeling variables.
Examples:
- Marital status
• Ordinal (scale) data: Data that have ordered categories; the distances
between one another are not known. Order matters but not the difference
between values.
Examples:
- Customer satisfaction ratings (low, medium, high)
• Dichotomous data: Qualitative data with binary categories. They can only take two values, typically 0 or 1. Nominal or ordinal data with more than two categories can often be recoded into dichotomous variables.
Examples:
- Default on a loan
Structured data: Data that are organized according to a predefined (tabular) model
• They come in a tabular (grid) format; each row is a record (case) and each column is an attribute (variable)
• These rows and columns can, but do not have to, be labeled
• The cells at each intersection of the rows and columns contain values
Examples:
- Spreadsheets
- Relational databases
Semi-structured data: Data that do not conform to a tabular data model but nonetheless contain markers or tags that separate elements and enforce hierarchies
Examples:
- JSON and XML files
Unstructured data: Data that are not organized in a predefined manner. They are not arranged according to a pre-set data model or schema and cannot be stored in relational databases
Examples:
- Presentations
- Emails
- Invoices
Metadata: These are data about data, providing information about other data.
Examples:
- Author
- Geo-location
- File size
Data at rest: Data that are stored physically on computer data storage, such as cloud storage
• They are inactive most of the time, but meant for later use
• They can be compromised when access is gained to the storage media, e.g., by hacking into the operating system
Examples:
- Databases
- Data warehouses
- Spreadsheets
- Archives
Data in motion: Data that are flowing through a network of two or more systems or temporarily residing in computer memory
• Data in motion are often sent over public networks (such as the internet)
Examples:
- Data of a user logging into a website
- Telemetry data
- Video streaming
When working with data involving larger systems, such as ERP (Enterprise Resource Planning) or CRM (Customer Relationship Management, e.g., Salesforce Marketing Cloud) systems, people make a distinction between two other types of data.
Transactional data: Data recorded from transactions. They are volatile, because they
change frequently.
Examples:
- Purchase orders
- Sales receipts
- Log records
Master data: Data that describe core entities around which business is conducted.
Master data may describe transactions, but they are not transactional in nature. They change rarely compared to transactional data.
Examples:
- Prospects
- Customers
- Accounting items
- Contracts
Illustrative example: When a company repeatedly buys goods from the same supplier, a new transaction record needs to be created for each purchase (transactional data). However, the data about the supplier itself stay the same (master data).
• Master data are seldom stored in one location; they can be dispersed in several systems across the organization
• Various parts of the business may have different definitions and concepts of the same core entity
• Master data are often shared and used by different business units or
corporate functions
Big data are data that are too large or too complex to handle by conventional
data-processing techniques and software. They are characterized across three defining
properties:
1. Volume: This aspect refers to the amount of data available. Data become big when they no longer fit into a single machine’s random-access memory (RAM)
• RAM is different from storage capacity, which refers to how much space the hard disk drive (HDD) or solid-state drive (SSD) of a device can store
• Big data starts with the real-time processing of gigabytes (GB) of data, scaling up to terabytes (TB) and petabytes (PB)
2. Variety: This aspect refers to the multiplicity of types, formats, and sources of data available
• Nowadays, organizations use data from different sources, e.g. ERP, CRM, and Supply Chain Management (SCM) systems
• Data users need to structure and clean the data before they can analyze them
3. Velocity: This aspect refers to the speed at which data are generated and processed
• Systems need to log users’ activities as data points in order to provide the best possible experience
• The update needs to be made in near real time (e.g., daily, several times per hour, or instantly)
• Big data flow from different sources in real time into a central environment
2.3. Storing Data
There are many applications and tools needed for end-to-end data management.
Examples:
- Metadata management
The end consumers of data can access data through databases, data warehouses, data lakes, etc. Depending on the type of data stored and processed, a distinction is made between several kinds of storage systems.
2.3.1. Database
Simple databases are made of tables, which consist of rows and columns:
• Rows - represent a set of related data; each row has the same structure (e.g., a customer record)
• Columns - represent the attributes (e.g., names, addresses, phone numbers) describing the item in the row. A column provides one single value for each row
• A schema describes the relationships between the different tables. It documents the way data are stored and retrieved
(Diagram: tables connected by relationships)
Relational databases:
• Data are organized into tables that are linked (or “related”) to one another
• Each row in the table contains a record with a unique ID (the key), e.g., customer ID
• Relational databases are best suited for storing and querying structured information
SQL: One term that you will often hear in the context of relational databases is SQL, which stands for Structured Query Language.
• It is used to query, but also to manipulate and define data, and to provide access control
• A popular relational database management system is MySQL
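For illustration, a relational table and an SQL query can be sketched with Python's built-in sqlite3 module (the table and the sample values below are invented, not part of the course):

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alex", "Berlin"), (2, "Bill", "Paris"), (3, "Chris", "Berlin")],
)
# SQL queries the table; the key (customer_id) uniquely identifies each record
for row in conn.execute("SELECT name FROM customers WHERE city = 'Berlin'"):
    print(row)  # ('Alex',) then ('Chris',)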
2.3.2. Data Warehouse
A data warehouse is a large store of data accumulated from a wide range of disparate sources. The data are stored in a structured, architected fashion and serve as the basis for Business Intelligence (BI), e.g.:
• Dashboards
2.3.3. Data Marts
A data mart is a subset of a data warehouse oriented to a specific line of business or department of an organization. Vendors such as IBM, Oracle, SAP or Teradata offer data warehouse and data mart technologies. Compared to a full data warehouse, a data mart has a simpler setup and holds a smaller amount of data.
2.3.4. The ETL Process
Building data warehouses and data marts involves copying data from one or more source systems into a destination system. The data in the destination system may represent the data differently from the source. To prepare data and deliver these in a format so that they can feed into or be used by the target system, companies apply the ETL process:
• Extracting (the data from the source systems)
• Transforming (the data to fit operational needs)
• Loading (the data into the specified target data warehouse or data mart)
Transforming includes:
• Changing the format and structure so that the data can be queried or
visualized
Companies offering ETL tools include Informatica, IBM, Microsoft, and Oracle.
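As a rough sketch of the three ETL steps in plain Python (the file name, field names, and transformation rule below are hypothetical):

import csv

def extract(path):
    # Extract: read raw records from a source system (here, a CSV export)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    # Transform: change format and structure, e.g., normalize names and
    # convert amounts from text to numbers
    return [{"customer": r["customer"].strip().title(),
             "amount": float(r["amount"])} for r in records]

def load(records, target):
    # Load: write the prepared records into the target system
    # (a plain list stands in for the data warehouse here)
    target.extend(records)

warehouse = []
load(transform(extract("sales_export.csv")), warehouse)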
2.3.5. Apache Hadoop
In the past, data were “regular”, i.e., there were limited volumes of structured data at rest.
Nowadays, companies store big data “just in case”. For this, they need a broader range of business intelligence and analytical capabilities, which should enable them to cope with two constraints:
• All these data cannot all fit in a single file, database or even a single computer
• Processing these data exceeds the capacity of a single computer
Apache Hadoop was the first major solution to address these issues, becoming almost synonymous with big data.
Definition:
“Apache Hadoop is an open-source framework that uses a network (cluster) of many computers to solve “big” data problems, i.e., problems involving massive amounts of data and computation.”
Benefits:
• Make data available to users far faster than with traditional data
warehousing
• Stream data from different sources and in different formats with Hadoop
2.3.6. Data Lake
Definition:
“A data lake is a single repository of data stored in their native format.” Data lakes can store structured, semi-structured and unstructured data, at any scale.
Examples:
- Text documents
- Images
• Data lakes typically run on inexpensive commodity hardware, which makes it economical to store huge amounts of data
• Companies commonly dump their data in the lake in case they need them
later
• Data lakes are advantageous in scenarios where it is not yet clear whether and how the data will be used
• The lack of structure of the data makes them more suitable for exploration and experimentation
• Data lakes contain information that has not been pre-processed for analysis
• Data lakes retain all data (not just data that can be used today)
• Data lakes support all types of data (not just structured data)
• Data lakes require regular curation; if this does not take place, data lakes can deteriorate or become
inaccessible to their intended users; such valueless data lakes are also
known as data swamps
Vendors that provide data lake technology: Amazon, Databricks, Delta Lake, or
Snowflake
2.3.7. Cloud Systems
Data lakes can be stored in two locations – “on premise” and “in the cloud”.
On premise: Data lakes are stored within an organization's data centers. Advantages:
• Security: The owner’s data are under control within its firewall
• Cost: Owning the hardware can be cheaper than renting capacity from a third-party service provider (at least if the system is under constant use)
In the cloud: Data lakes are stored using Internet-based storage services such as Amazon Web Services (AWS), Google Cloud, Microsoft Azure, etc. Advantages:
• Accessibility: Since the data are online, users can retrieve them wherever they are
• Reliability: The provider stores copies of the data in several locations, which can serve as a fallback
2.3.8. Edge Computing
Edge (fog) computing: The computing structure is located between the cloud and the devices that produce and act on data
• Data can be processed and analyzed on the device itself without having to be transferred all the way to a central processing unit
• It can also take on some of the workload from the central computer
Benefits:
• Speed: Less data need to travel over the network, which increases processing speed
• Reliability: Devices can continue to operate even when they are temporarily offline
• Privacy: Data are not exposed to the internet or to the central storage or processing unit
Example:
In 2020, contact-tracing apps against COVID-19 were launched. Many countries opted for a version where the storage and processing of users’ geo-location data take place on their mobile devices (as opposed to a central server).
2.3.9. Batch vs. Stream Processing
Batch processing: Data (at rest) are collected over a period of time and processed in groups (batches)
• After the collection is complete, the data are then fed into a system for
further processing
• Batch processing is efficient for large data sets and in cases where a deeper analysis of the data matters more than speed
• Consumers can explore and use the data, e.g., to develop statistical models
Example: A retail chain records each outlet’s revenue and processes the batches of each outlet’s numbers once per day.
Stream processing: Data (in motion) are fed into a system as soon as they are generated
• Stream processing enables immediate analytics and reactions
Practical applications:
- Cybersecurity
- Stock trading
- Programmatic buying (in advertising)
- Fraud detection
- Air traffic control
- Production line monitoring
Example: A credit card company processes each transaction as soon as the user swipes the card. Its systems can immediately detect and block fraudulent transactions.
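The contrast between the two styles can be sketched in a few lines of Python (the transaction amounts and the fraud threshold are invented):

def transaction_stream():
    # Stands in for card swipes arriving one by one over a network
    yield from [120.0, 35.5, 9800.0, 42.0]

# Stream processing: each record is handled as soon as it arrives,
# allowing an immediate reaction to a suspicious payment
for amount in transaction_stream():
    if amount > 5000:  # hypothetical fraud threshold
        print(f"ALERT: suspicious transaction of {amount}")

# Batch processing would instead collect the whole day's transactions
# first and only then compute aggregates, such as the daily total
daily_total = sum(transaction_stream())
print(f"End-of-day batch total: {daily_total}")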
2.3.10. Graph Database
Relational databases are not well suited for analyses that involve many connections across records beyond a certain degree. To achieve that, companies use graph databases.
• Nodes represent the data entities themselves
• Nodes can have properties, e.g., name, age, gender, nationality
• There is no limit to the number and kind of relationships a node can have
• An edge always has a start node, end node, type, and direction
• Like nodes, edges can have properties, which are usually quantitative, e.g., weights, costs, or distances
(Diagram: nodes Alex, Bill, and Chris connected to nodes Music and Sports by “Likes” edges)
The people (Alex, Bill, Chris) and hobbies (Music, Sports) are data entities,
represented by “nodes”.
Example: On Facebook, the nodes represent users and content, while the edges represent their relationships and interactions, e.g., friendships and likes.
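The small graph above can be mimicked with a toy adjacency list in Python; real graph databases offer far richer storage and querying, and the node-edge assignments below are only illustrative:

# Nodes are the data entities; each edge has a start node, an end node,
# a type, and a direction
nodes = {"Alex", "Bill", "Chris", "Music", "Sports"}
edges = [
    ("Alex", "Music", "Likes"),
    ("Bill", "Music", "Likes"),
    ("Chris", "Sports", "Likes"),
]

def who_likes(hobby):
    # Traversing relationships is the basic operation of a graph query
    return [start for (start, end, kind) in edges
            if end == hobby and kind == "Likes"]

print(who_likes("Music"))  # ['Alex', 'Bill']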
3. Using Data
3.1. Analysis vs. Analytics
Definition:
• Data analysis is the in-depth study of all the components of a given data set
• Its purpose is to understand a past situation or phenomenon
• It involves the dissection of a data set and the examination of all parts individually
• The data being analyzed describe things that already happened in the past
Definition:
• Data analytics is a broader term, with data analysis as a subset
• It encompasses not only the examination of data, but also their collection, and working with data throughout their life cycle, which includes: data collection, preparation, analysis, visualization, and communication
Differences:
• Analysis only looks at the past, whereas analytics also tries to predict the future
A widely used tool for both data analysis and analytics is Microsoft Excel.
3.2. Statistics
Definition:
“Statistics concerns the collection, organization, analysis, interpretation, and
presentation of data.”
Well-known software tools for statistical analysis: IBM SPSS Statistics, SAS,
STATA, and R.
Types of statistics:
• Descriptive statistics
• Inferential statistics
Descriptive statistics describe the data collected, e.g.:
• How many people in Georgia voted for Joe Biden during the US presidential election in 2020?
• How many of these voters were aged 18 to 34?
They summarize the data in a form that is easy to grasp
Inferential statistics draw conclusions that go beyond the data at hand, e.g.:
• Does this new packaging significantly increase the sales volume of this product?
• In this country, is the brand preference of women the same as that of men?
• The reliability of such inferences depends, among other things, on the sample size
Disciplines that build on statistics:
• Econometrics
• Operations research
3.3. Business Intelligence (BI)
Definition:
“Business Intelligence (BI) comprises the strategies and technologies used by enterprises for the data-driven analysis of business information.”
Practical applications:
• Monitoring and improving business processes
• Reporting on past and current data
• Planning and budgeting
3.4. Artificial Intelligence (AI)
Definition:
“Artificial Intelligence (AI) is the ability of a machine, particularly a computer, to do tasks that are usually done by humans because they require human intelligence.”
Related fields: machine learning, deep learning, natural language processing
Practical applications:
• Translating languages
Examples:
• In 2016, AlphaGo (developed by DeepMind Technologies, later acquired by Google) defeated the world champion in the game of Go
• An AI system for reading medical scans identified diseases more reliably than human healthcare professionals (who were correct 86% of the time) and correctly gave the all-clear 93% of the time (compared with 91% for human experts)
Narrow AI: AI designed to perform one dedicated task
General AI:
• Such a machine would have the capacity to learn any intellectual task that a human being can
3.5. Machine Learning (ML)
Definition:
“Machine Learning (ML) is the ability of a computer to learn from data without being explicitly programmed.”
Model:
• A formula that takes some values as input and delivers one or more values as output
• The computer can use this description to learn and then make predictions
Definition:
An algorithm is a set of rules or instructions followed to solve a problem. An algorithm:
• Can be run again with a few parameter variations, to see if the new results are better
• It can take several iterations for the algorithm to produce a good enough result
• Users must determine how to define when the problem is solved or what result is good enough
Training data:
• These are past observations, used to “feed” the algorithm so that it can learn the underlying patterns and relationships
• The process in which an algorithm goes through the training data again and again is called training
Definition:
“Mapping is the association of all elements of a given data set with the elements
of a second set.”
3.6. Supervised Learning
• The available data (inputs and known outputs for past observations) are split into training data and validation data
• The algorithm “learns” the mapping function from the input to the output
• The algorithm learns by comparing its own computed output with the correct output provided in the training data
• Once the patterns and relationships have been computed, these can then
be applied to new data, e.g., to predict the output for new observations
• That result is then compared with the “correct” answer (provided by the training data), and the model is adjusted accordingly
• The models produced through supervised learning are such that they are able to predict the outcome for new, unseen observations
• Each observation in the training data set is tagged with the outcome the algorithm is supposed to predict
3.6.1. Regression
• Regression estimates the relationship between a dependent variable, i.e., the variable whose value is to be predicted, and one or more independent variables, i.e., the variables that are used to make the prediction
• The goal of the regression analysis is to draw a line on a plot, namely the line that best fits the observed data
• Ideally, the independent variables represent factors that can be controlled (which is not the case for rainfall).
Example:
Simple linear regression: scatter plot of umbrellas sold against rainfall (mm), with fitted line y = 0.45x - 18.69.
• The dots on the scatter plot correspond to the observed data, each of which represents one month
• For each month, we can see the rainfall in mm (the independent variable, on the horizontal axis) and the number of umbrellas sold (the dependent variable, on the vertical axis)
• The two variables move together: the higher the rainfall, the higher the number of umbrellas sold – and vice versa
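A line like the one in the chart can be fitted with NumPy. The rainfall and sales figures below are made up, chosen only so that the fit lands close to the chart's y = 0.45x - 18.69:

import numpy as np

rainfall = np.array([50.0, 80.0, 110.0, 140.0, 170.0])  # mm (independent variable)
umbrellas = np.array([4.0, 18.0, 30.0, 45.0, 58.0])     # units sold (dependent variable)

b, a = np.polyfit(rainfall, umbrellas, 1)  # slope b and intercept a of the best-fit line
print(f"y = {b:.2f}x {a:+.2f}")            # close to the chart's y = 0.45x - 18.69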
Example 1:
Exponential regression: plot of weight (kg) against height (cm).
Example 2:
Polynomial regression: plot of salary (EUR) against age.
3.6.2. Time Series Forecasting
Definition:
“Time series forecasting is defined as the use of models to predict future values based on previously observed, time-stamped values.”
• We employ previous time periods as input variables, while the next time period is the output to be predicted
• Time series analysis has the objective to forecast the future values of a series based on the history of that same series
• Time series are used in various fields, e.g., economics, finance, weather forecasting
• Time series are typically plotted via temporal line charts (run charts)
Examples:
- Monthly sales figures
- Daily closing prices of a stock
3.6.3. Classification
Definition:
“Classification is the prediction of the class of a new observation, i.e., a qualitative output (also called category, target, or label) is predicted for a new observation.” The main goal is to assign each new observation to the correct category.
• The learning is based on a training set of data with observations for which the categories are already known
• Classification is best used when the output has finite and discrete values
• When there are only two categories, we have “binary” classification, e.g. Spam/Not spam; with more categories, we have “multi-class” classification, e.g. Positive/Neutral/Negative, Win/Draw/Loss, 0 to 9
Example: In spam detection, there are two classes – “Spam” or “Not spam”, respectively Spam “yes” or “no”. The input variables are characteristics of the email, e.g., sender, subject, and words used.
The model (here - a mapping function) quantifies the extent to which the input variables affect the output (or “predicted”) variable. The objective is to approximate the mapping function so that we can predict the nature of an email (spam or not) as soon as it comes in.
Classifier:
An algorithm that implements classification. Classifiers are used for all kinds of
applications:
• Image classification
• Churn prediction -> What is the chance that this subscriber switches to a competitor?
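A hedged sketch of a binary classifier with scikit-learn; the two input variables (counts of suspicious words and of links per email) and the labels are toy values:

from sklearn.linear_model import LogisticRegression

# Each row describes one email: [suspicious words, links]
X = [[8, 5], [6, 4], [7, 6], [0, 1], [1, 0], [2, 1]]
y = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

clf = LogisticRegression().fit(X, y)  # the classifier learns the mapping
print(clf.predict([[5, 4]]))          # likely [1], i.e., classified as spam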
Definition:
“Unsupervised learning is based on an algorithm that analyzes the data and searches for patterns on its own.”
• The only objective is for the algorithm to arrange the data as best as it can
• Unlike with supervised learning techniques, the training data do not include a corresponding output (labels)
• The two most useful unsupervised learning techniques are clustering and
association
Unsupervised learning can be used to:
• Detect anomalies (e.g., fraudulent transactions in a payment system)
• Group observations based on shared attributes
Definition:
“Clustering is the splitting of a data set into a number of categories (or classes, or clusters), the members of which are similar to one another.”
• The algorithm must group the data in a way that maximizes the similarity within groups and the dissimilarity between groups
• The interpretation of the clusters constitutes a test for the quality of the
clustering
Example:
X-axis - the average weekly consumption of chocolate bars in the last 12 months; Y-axis - the growth of that consumption
(Scatter plots: customer segmentation, plotting consumption growth against average weekly consumption; the second plot shows the same customers grouped into clusters.)
Each cluster can then be addressed with tailored adverts. Cluster B could be lured with discounts; Cluster A might prefer healthier products.
Interpreting the clusters raises questions such as:
• What do these groups tell us about chocolate bar consumption habits and their drivers?
• Why did these patterns emerge and how are they relevant to our business?
Practical applications:
• Customer segmentation
• Grouping of products or stores based on shared characteristics
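A minimal clustering sketch with scikit-learn's KMeans; the consumption figures are invented stand-ins for the chocolate-bar example:

import numpy as np
from sklearn.cluster import KMeans

# Columns: average weekly consumption, consumption growth
X = np.array([[1.0, 0.2], [1.5, 0.4], [2.0, 0.3],      # light consumers
              [9.0, 2.8], [10.5, 3.1], [12.0, 3.5]])   # heavy, growing consumers

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment for each customer, e.g., [0 0 0 1 1 1]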
Definition:
“Association is the identification of relationships between variables or items in large data sets.”
The associations are usually represented in the form of rules or frequent item sets.
Practical applications:
• Market basket analysis: finding associations between items that shoppers put into their baskets, e.g., “tent and sleeping bags” or “beer and diapers”; it produces the rules used for product recommendations
• Design of promotions: When the retailer offers a special price for just one of the products (crackers) but not for the other one at the same time (dip), sales of the associated product may rise as well
Definition:
“Reinforcement learning is a technique in which an algorithm (the agent) interacts with its environment, initially acting randomly, and “sees what happens” in every iteration. Over time, it learns which actions lead to rewards.”
• They use mapping between input and output; outcomes can be positive
or negative
• The agent is not given any directions on how to complete the task
Example:
In 2016, AlphaGo defeated Lee Sedol, the world champion in the game of Go.
Within a period of just a few days, AlphaGo had accumulated thousands of years of
experience.
Definition:
“Deep learning is a subset of machine learning that can build on supervised, unsupervised, or reinforcement learning.”
• It does not require human-engineered rules or domain knowledge to produce its outcome
• Neural networks can be taught to classify data and identify patterns like humans
• The more data the network can train on, the more likely the algorithm is to solve the problem, e.g., recognize a certain animal on an image
Artificial neural networks (ANN) - The algorithms used to achieve deep learning.
• ANN were inspired by the way the brain processes information and how it is structured (networks of interconnected neurons)
• Data pass through several layers of artificial neurons as they flow through the network
Practical applications:
• Image recognition
Definition:
“Natural Language Processing (NLP) is concerned with the interactions between computers and human language, in particular how to process and analyze large amounts of natural language data.”
• NLP helps machines understand, interpret, and generate human languages
• It taps into the treasure troves of unstructured data that are held by organizations
• Companies can use it to analyze what customers communicate about their products and service, e.g., to determine how they feel about their brands, products, services, etc.
• Brands can apply it to emails, social media posts, phone calls, reviews on websites - any text produced by customers in an unprompted fashion
Natural Language Generation (NLG) turns data and concepts into human language, e.g., to communicate results and support decisions
• An NLG system makes decisions about how to turn a concept into words and sentences
• It is less complex than NLU (Natural Language Understanding), since the concepts to articulate are usually known
• NLP can be applied to communications with existing customers as well as potential ones
4. Reading Data
In some areas of life, the right questions are more important than the right
answers. This also applies to working with data. It will often be the case that the
ultimate beneficiary of data (for example, the decision makers) is different from the
person who processed these data and produced the results (or the answers).
• The data consumer needs to know what has been done with these data, i.e., how these were collected, manipulated, and processed
• One needs to be able to “read” the data, which includes the assessment of their quality
Acceptable (or high) data quality - if the data are fit for their intended uses (e.g., planning, decision making, operations, etc.)
Poor data quality (e.g., if the data contain errors) - decision makers could be
misled or algorithms will not learn efficiently, or perhaps even worse, learn the wrong
things.
* “Garbage in, garbage out” (GIGO) - Without meaningful data, you cannot get
meaningful results
The impact of poor data quality can be significant. Financial expenses, productivity losses, missed opportunities and reputational damage are but some of the possible consequences.
*In 2016, IBM estimated that businesses were losing, in the US alone, $3.1 trillion every year due to poor data quality.
Inaccurate data: This is when the data contain errors. Examples:
• A clerk misreads the handwriting of the customer and fills in the wrong name
• The CRM system only has space for 40 characters in the “address” field, so a shortened version of the address is entered instead
• A data analyst makes a mistake when joining two data sets, leading to a corrupted result
Incomplete data: This is when there are missing values in the data set.
• If there are too many missing values for one attribute, the latter becomes unusable
• One can either discard the records with the missing values or make an assumption about what the values could be
Inconsistent data: This is when the same information is represented in different ways. Examples:
• Spelling mistakes
• Inconsistent units
• A famous case is that of NASA’s Mars Climate Orbiter, which in 1999 was lost because one team worked in metric units and another in English units
• Inconsistent formats
• E.g., when numbers are declared as text, which makes them unreadable for analytics software
• Impossible values
• An impossible value may be the result of an error, but could also be an outlier (i.e., a data point that differs significantly from other observations)
Even for non-experts, there are ways to explore data and to get an overview about what is available.
• A description of the data also helps determine how useful they can be to answer a given question
Completeness and accuracy are two key properties of data quality.
Beyond that, the questions that a data consumer should ask are the following:
• What does each record represent? How granular (vs. coarse) are they?
• How “fresh” are the data? Do they reflect the current situation? When were they collected?
Descriptive statistics provide a summary about the sample and the measures. They allow one to simplify large amounts of data in a sensible way. There are two main groups:
1) Measures of central tendency
2) Measures of spread
Measures of central tendency describe the most common patterns of the analyzed data set. There are three main such measures: 1) the mean, 2) the median, and 3) the mode
The mean is the sum of all values divided by the number of values (the simple average).
• It is easy to calculate
Example: The following data set indicates the number of chocolate bars consumed by
11 individuals in the previous week. The series has 11 values (one value per person):
0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16
Mean = (0 + 1 + 2 + 2 + 3 + 4 + 5 + 6 + 7 + 9 + 16) / 11 = 55 / 11 = 5
This means that any given person in this sample ate 5 chocolate bars in the previous week, on average.
The median is the value lying in the “middle” of the data set.
• It separates the higher half from the lower half of a data sample
• It is calculated as follows: sort the values and take the one in the middle position; if the number of values is even, take the number in the middle position and average it with the number in the next higher position
Median = 4
=> one half of the people in the sample consumed 4 or less chocolate bars and the other half consumed 4 or more
If the first person (0 bars) is left out, the series has 10 values:
Median is 4.5 (the average of 4 and 5 in the middle of the dataset), Mean = 5.5
=> In that case, half of the people in the sample ate less than 4.5 chocolate bars and half ate more
If the heaviest consumer had eaten 93 bars instead of 16:
Median = 4, Mean = 12
The median is not affected by extreme values or outliers (i.e., data points that differ significantly from other observations).
• For example, when discussing the “average” income for a country, analysts will often use the median (rather than the mean) salary.
The mode is the most commonly observed value in the data set. It is calculated as
follows:
For the original series: Mode = 2.
• The value “2” is represented twice, while all the other values are only represented once
If one person had eaten 4 bars instead of 3, the series would contain both “2” and “4” twice:
• The modes are 2 and 4, as they are both represented twice while all the other values appear only once
• This shows that a data set can have two (or more) modes
The mode is not frequently used in daily or business life. However, it presents the key advantage that it can be used for nominal data (which is not possible with the mean or the median).
This means that it would be possible to calculate the mode if the data set showed the brand names of the chocolate bars purchased. Yet it makes no sense to speak of a “mean brand” or a “median brand”.
Measures of central tendency are often used:
• To compare the income, wealth and tax rate for different subgroups of a population
N.B. A comparison of the mean and median can offer additional clues about the data
• A similar mean and median indicate that the values are quite balanced
• If, however, the median is lower than the mean, it is possible that there
are outliers at the high end of the value spectrum (or “distribution” in the
statistics jargon)
If median income < mean income => a small number of very high earners pulls the mean up, away from the large low- or middle-income population
If median income > mean income => the economy probably consists of a large middle class with few extremely low incomes
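The three measures can be recomputed from the chapter's chocolate-bar series with Python's built-in statistics module:

import statistics

bars = [0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16]
print(statistics.mean(bars))    # 5
print(statistics.median(bars))  # 4
print(statistics.mode(bars))    # 2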
Measures of spread (also known as “measures of dispersion”) describe the dispersion of data within the data set. The dispersion is the extent to which data points depart from the center and from each other.
The higher the dispersion, the more “scattered” the data points are.
The main measures of spread are:
• the minimum and maximum,
• the range,
• the variance,
• the standard deviation
Minimum and maximum - the lowest, respectively the highest values of the data set
A minimum or maximum that appears too low, respectively too high may
suggest problems with the data set. The records related to these values should be
carefully examined.
Range - the difference between the maximum and the minimum value.
• It is easy to calculate
• It provides a quick estimate about the spread of values in the data set, but it says nothing about how the values are distributed in between
Variance - describes how far each value in the data set is from the mean (and hence from every other value)
• It is always positive
• Due to its squared nature, the variance is not widely used in practice
Standard deviation - measures the dispersion of a data set relative to its mean
• It expresses by how much the values of the data set typically differ from the mean value
• A low standard deviation reveals that the values tend to be close to the mean
• Inversely, a high standard deviation signifies that the values are spread out over a wider range
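The measures of spread for the same series, again with the built-in statistics module (using the sample variance and sample standard deviation):

import statistics

bars = [0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16]
print(max(bars) - min(bars))      # range: 16 - 0 = 16
print(statistics.variance(bars))  # sample variance: 20.6
print(statistics.stdev(bars))     # sample standard deviation: ≈ 4.54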
5. Interpreting Data
The outputs of data analyses can be statistics, coefficients, probabilities, errors, etc., which can provide different insights. One has to understand what these results mean, to know what kind of conclusions can be drawn and how to apply them for decision making.
Data interpretation requires domain expertise, but also curiosity to ask the right
questions. When the results are not in line with one’s expectations, these should be
met with a sufficient level of doubt and examined in further depth. Such a sense of healthy skepticism is essential when interpreting the results of the following techniques:
1) Correlation
2) Linear regression
3) Forecasting
4) Statistical tests
5) Classification
5.1. Correlation
Correlation describes the relationship between two variables. A first impression can be gained from a scatter plot:
When a correlation exists, you should be able to draw a straight line (called the regression line) through the cloud of dots.
Positive correlation (upward sloping regression line) - both variables move in the same direction; when one of them increases, the other one increases as well.
(Scatter plots: temperature plotted against daylight (hours), with upward sloping regression lines.)
The more tightly the plot forms a line rising from left to right, the stronger the
correlation.
“Perfect” correlation - all the points would lie on the regression line itself
Lack of correlation - If a line does not fit, i.e. if the dots are located far away from the
line, it means that the variables are not correlated. In that case, the line is relatively
flat.
(Scatter plot: correlation between daylight (hours) and rainfall amount (mm) - the dots lie far from the regression line, which is relatively flat.)
Although the visual examination of a scatter plot can provide some initial clues about the correlation between two variables, relying solely on it may lead to an incorrect conclusion. The correlation coefficient expresses the relationship as a number between -1.0 and +1.0.
When the value of one variable increases/decreases, the value of the other one increases/decreases accordingly.
The higher the absolute value of the correlation coefficient, the stronger the correlation.
A coefficient of 0 indicates that there is no linear relationship between the variables.
However, this does not necessarily mean the variables are not related at all. They may still be related in a non-linear way.
Correlation in the real world rarely returns coefficients of exactly +1.0, –1.0, or 0.
A rule of thumb for the interpretation of the absolute value of correlation coefficients: below 0.3 - weak; 0.3 to 0.7 - moderate; above 0.7 - strong.
When applying this rule of thumb, always bear in mind that the definition of a weak,
moderate, or strong correlation may vary from one domain to the next.
No matter how strong or statistically significant the correlation is, it never provides sufficient evidence to claim a causal link between variables. This applies to both the existence and the direction of a causal relationship.
The strong correlation between two variables A and B can be due to various scenarios: A causes B; B causes A; a third variable causes both A and B; or pure coincidence. Telling these scenarios apart requires further investigation.
Establishing and interpreting correlations is often not enough to make quality decisions. A good analyst or a wise decision maker will prefer to strive for an explanation and always bear in mind the old statistical adage: “Correlation does not imply causation.”
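A correlation coefficient can be computed with NumPy; the daylight and temperature values below are invented for illustration:

import numpy as np

daylight = np.array([8.0, 10.0, 12.0, 14.0, 16.0])    # hours
temperature = np.array([5.0, 9.0, 14.0, 16.0, 21.0])  # °C

r = np.corrcoef(daylight, temperature)[0, 1]
print(round(r, 2))  # ≈ 0.99, i.e., a strong positive correlation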
5.2. Linear Regression
With simple linear regression, it is possible to predict the value of the dependent variable based on the value of the independent variable.
(Scatter plot: temperature against daylight (hours), with fitted regression line Y = 1.67X - 6.64.)
Regression equation: Y = a + bX
• a is the intercept: the value of Y when X equals zero
• b is the slope: the change of Y when X increases by one unit
N.B. The higher the absolute value of b, the steeper the regression curve.
The sign of b indicates the direction of the relationship between the Y and X variables:
• A positive b: an increase of X goes along with an increase of Y
• A negative b: an increase of X goes along with a decrease of Y
Notice that a prediction using simple linear regression does not prove any causality. The coefficient b, no matter its absolute value, says nothing about a causal link between X and Y.
Summary: The objective of the regression is to plot the one line that best fits the observed data.
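The intercept a and slope b can be recovered with NumPy; the data points below are made up so that the result lands near the chart's Y = 1.67X - 6.64:

import numpy as np

X = np.array([8.0, 10.0, 12.0, 14.0, 16.0])  # daylight (hours)
Y = np.array([6.5, 10.0, 13.5, 16.5, 20.0])  # temperature (°C)

b, a = np.polyfit(X, Y, 1)        # polyfit returns the slope first
print(f"Y = {a:.2f} + {b:.2f}X")  # close to the chart's Y = 1.67X - 6.64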
5.2.1. R-squared
After running a linear regression analysis, you need to examine how well the model fits the data, i.e., determine whether the regression equation does a good job of predicting the dependent variable.
The regression model fits the data well when the differences between the observations and the predicted values are relatively small. If these differences are too large, or if the model is biased, you cannot trust the results.
R-squared measures the share of the variance of the dependent variable that is explained by the model, relative to the data’s inherent variability.
• R-squared = 0 => the model does not explain any of the variance in the dependent variable and cannot be used to make predictions
• R-squared = 1 => the model explains all the variance in the dependent variable. Its predictions perfectly fit the data, as all the observations lie on the regression line
Example:
(Scatter plot as above, with regression line Y = 1.67X - 6.64 and R² = 0.72.)
R-squared = 0.72
=> the number of daylight hours accounts for about 72% of the variance of the temperature
What is a good R-squared value? At what level can you trust a model?
• Some people or textbooks claim that 0.70 is such a threshold, i.e., that if R-squared is higher, the model can be trusted
• Feel free to use this rule of thumb. At the same time, you should be aware
of its caveats.
• Indeed, the properties of R-squared are not as clear-cut as one may think
A higher R-squared indicates a better fit for the model, producing more accurate
predictions
*A model that explains 70% of the variance is likely to be much better than one
that explains 30% of the variance. However, such a conclusion is not necessarily correct.
The level of R-squared one can expect depends on:
• The sample size: The larger the number of observations, the lower the R-squared tends to be
• The granularity of the data: Models based on case-level data have lower R-squared than models based on aggregated data (e.g., country data)
• The type of data employed in the model: When the variables are categorical rather than continuous, R-squared tends to be lower than with continuous data
• The field of research: studies that aim at explaining human behavior tend to yield lower R-squared values
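R-squared can also be computed by hand from the residuals; this continues the invented daylight/temperature example:

import numpy as np

X = np.array([8.0, 10.0, 12.0, 14.0, 16.0])  # daylight (hours)
Y = np.array([6.0, 11.0, 12.5, 17.5, 19.0])  # temperature (°C)

b, a = np.polyfit(X, Y, 1)
predicted = a + b * X
ss_res = np.sum((Y - predicted) ** 2)  # variation the model leaves unexplained
ss_tot = np.sum((Y - Y.mean()) ** 2)   # total variation in the data
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 2))             # ≈ 0.97: the model explains most of the variance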
5.3. Forecasting
Forecast accuracy – the closeness of the forecasted value to the actual value.
For a business decision maker, the key question is how to determine the
accuracy of forecasts:
“Can you trust the forecast enough to make a decision based on it?”
As the actual value cannot be measured at the time the forecast is made, the accuracy can only be assessed afterwards, based on the deviations between forecasted and actual values,
known as errors. These errors can be assessed without knowing anything about a forecast’s underlying model.
*The MAE, MAPE and RMSE only measure typical errors. They cannot anticipate black
swan events like financial crises, global pandemics, terrorist attacks, or the Brexit.
As the errors associated with these events are not covered by the time series data,
they cannot be modeled. Accordingly, it is impossible to determine in advance how
big the error will be.
5.3.1. Forecast errors
Mean Absolute Error (MAE) - the average of the absolute values of the differences between the forecasted and the actual values.
• With the MAE, we can get an idea about how large the error from the forecast can be expected to be on average
• The main problem with it is that it can be difficult to anticipate the relative size of the error. How can we tell a big error from a small error?
That depends on the underlying quantities and their units. For example, if your monthly average sales volume is 10’000 units and the MAE of your forecast for the next month is 100, then this is an amazing forecasting accuracy. However, if sales volume is only 200 units, the same MAE of 100 points to a very poor forecast.
Mean Absolute Percentage Error (MAPE) - the sum of the individual absolute errors divided by the actual values, expressed as a percentage. It puts the errors in relation to the actual values, which makes forecasts comparable across different scales.
• The MAPE allows comparing forecasts of value and volume in US$, EUR, units, liters or gallons - even though these are expressed in different units
• Thanks to this property, the MAPE is one of the most used indicators to assess forecast accuracy
• One key problem with the MAPE is that it may understate the influence of big, but rare, errors. Consequently, the error is smaller than it should be, conveying a false sense of accuracy
Root Mean Square Error (RMSE) - the square root of the average squared error.
RMSE has the key advantage of giving more importance to the most significant errors.
Accordingly, one big error is enough to lead to a higher RMSE. The decision maker is
not as easily pointed in the wrong direction as with the MAE or MAPE.
• Best practice is to compare the MAE and RMSE to determine whether the
forecast contains large errors. The smaller the difference between RMSE
and MAE, the more consistent the error size, and the more reliable the
value.
What is a “good” or “acceptable” value of the MAE, MAPE or RMSE depends on:
• The industry: In stable industries, where demand does not vary so much over time (e.g., electricity or water distribution), a demand forecasting model may therefore yield a very low MAPE, possibly under 5%. In volatile industries (e.g., machine building, oil & gas, the FMCG or travel industries), sales volumes vary significantly over time. The MAPE of a model could be much higher than 5%, and yet be useful for decision making
• The granularity: Forecasts for larger areas (e.g., at the continental or national level) are generally more accurate than for smaller areas
• The time frame: Longer period (e.g., monthly) forecasts usually yield relatively smaller errors than shorter period (e.g., daily) ones
With three well-established indicators available, one cannot conclude that one is
better than the other. Each indicator can help you avoid some shortcomings but will be
prone to others. Only experimentation with all three indicators can tell you which one best suits your needs.
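All three indicators are straightforward to compute; the actual and forecasted values below are invented:

import numpy as np

actual = np.array([100.0, 120.0, 90.0, 110.0])
forecast = np.array([95.0, 130.0, 85.0, 100.0])
errors = actual - forecast

mae = np.mean(np.abs(errors))                  # typical error, in units (7.5)
mape = np.mean(np.abs(errors) / actual) * 100  # typical error, in percent (≈ 7.0)
rmse = np.sqrt(np.mean(errors ** 2))           # weights big errors more (≈ 7.9)
print(mae, round(mape, 1), round(rmse, 1))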
5.4. Statistical Tests
Hypothesis testing is used in many domains - social sciences, medicine, and market research. The purpose of hypothesis testing is to determine whether there is enough evidence in a sample of data to support a particular belief about the population.
365 DATA SCIENCE 90
Example: If you want to compare the satisfaction levels of male and female employees in your company, surveying the entire workforce may not be feasible.
Population - the entire group that you want to draw conclusions about.
Sample - the specific group that data are collected from. Its size is always smaller than that of the population.
If you randomly select 189 men and 193 women among these employees to carry out the survey, these 382 employees constitute your sample.
Sometimes data about the whole population of interest cannot be collected, for example because the process would be too lengthy or too expensive.
In these cases, researchers need to develop specific experiment designs, and rely on samples and statistical inference.
Modern statistical software is there to calculate various relevant statistics, test values,
probabilities, etc. All you need to do is to learn to interpret the most important ones:
- null hypothesis
- the p-value
- statistical significance.
A hypothesis resembles a theory in science. But it is “less” than that, because it first needs to be tested and verified. In hypothesis testing, we propose two opposite, mutually exclusive, hypotheses so that only one can be right:
- A blue conversion button on the website results in the same CTR (click-through rate) as a red button
In statistical terms, the null hypothesis (H0) is therefore usually stated as the equality between two values:
- The mean CTRs of the red and blue conversion buttons are the same.
OR: the difference of the mean CTRs of the red and the blue conversion buttons
is equal to zero
It is called “null” hypothesis, because it is usually the hypothesis that we want to nullify
or to disprove.
The alternative hypothesis (H1) is the one that you want to investigate. It can be formulated as follows:
- A blue conversion button on the website will lead to a different CTR than the
one with a red button
In this case, the objective is to determine whether the population parameter is generally distinct or differs in either direction from the hypothesized value. It is called a two-sided (or two-tailed) test.
A one-sided (or one-tailed) test, in contrast, investigates whether the population parameter differs from the hypothesized value in a specific direction, i.e. is smaller or greater than that value.
- Example: The difference of the mean CTRs of the blue and the red conversion buttons is greater than zero
- Here we only care about the blue button yielding a higher CTR than the red button
We can also be even more aggressive in our statement and quantify that difference:
- The difference of the mean CTRs of the red and the blue conversion button is greater than a given amount
- That would be equivalent to stating that the mean CTR of the blue button is higher than that of the red one by at least that amount
N.B. You do not have to specify the alternative hypothesis. Given that the two
hypotheses are opposites and mutually exclusive, only one can, and will, be true. For
the purpose of statistical testing, it is enough to reject the null hypothesis. It is therefore common to state only the null hypothesis.
5.4.2. P-value
The p-value is the probability of obtaining results at least as extreme as the ones observed, assuming that the null hypothesis is true. The greater the dissimilarity between the observed and the hypothesized patterns, the less likely it is that the difference occurred by chance.
Examples:
If p-value = 0.0326 => there is a 0.0326 (or 3.26%) chance that the results
happened randomly.
If p-value = 0.9429 => the results have a 94.29% chance of being random
The smaller the p-value, the stronger the evidence that you should reject the
null hypothesis
When you see a report with the results of statistical tests, look out for the p-value.
Normally, the closer to 0.000, the better – depending, of course, on the hypotheses being tested.
Significance level (alpha) - a threshold for the p-value, set in advance of the experiment:
• If the p-value falls below the significance level, the result of the test is
statistically significant.
• Unlike the p-value, the alpha does not depend on the underlying data
• The alpha will often depend on the scientific domain the research is being
carried out in
If p-value < alpha => you can reject the null hypothesis at the level alpha.
If the P-value is lower than the significance level alpha, which should be set in
advance, then we can conclude that the results are strong enough to reject the old
notion (the null hypothesis) in favor of a new one (the alternative hypothesis).
Example:
If p-value = 0.0321, alpha = 0.05 => “Based on the results, we can reject the null hypothesis at the 5% significance level.”
If the p-value = 0.1474, alpha = 0.05 => “Based on the results, we accept (i.e., fail to reject) the null hypothesis.”
N.B. A statistically significant result is not necessarily “important”. That will depend on the real-world relevance of that result, which the decision maker has to assess.
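As a sketch of how such a test is run in practice, here is SciPy's two-sample t-test; the daily CTR samples for the two buttons are invented:

from scipy import stats

ctr_red = [0.031, 0.028, 0.033, 0.030, 0.029]   # daily click-through rates
ctr_blue = [0.036, 0.034, 0.038, 0.035, 0.037]

t_stat, p_value = stats.ttest_ind(ctr_red, ctr_blue)
alpha = 0.05  # significance level, set in advance
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f}: cannot reject the null hypothesis")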
5.5. Classification
A classification model can only achieve two results: Either the prediction is
correct (i.e., the observation was placed in the right category), or it is incorrect.
This makes it relatively easy to assess the quality of a classification model, especially when there are only two available categories or labels.
Out-of-sample validation - withholding some of the sample data from the training of the model. Once the model is ready, it is validated with the data initially set aside.
Example:
Imagine that we trained a model for a direct marketing campaign. We used the data of existing customers to predict who will respond favorably to the offer.
We set aside 100 customer records, which constitute our validation data.
• For these 100 customers, we use the model to predict their responses.
• As these customers also receive the marketing offer, we also get to know
who responded favorably, and who did not. These responses constitute
• You can compare the predicted with the actual classes, and find out which predictions were correct
Confusion matrix - shows the actual and predicted classes of a classification problem
(correct and incorrect matches). The rows represent the occurrences in the actual
class, while the columns represent the occurrences in the predicted class.
                Predicted class
n = 100         Yes    No
Actual   Yes     10     5    | 15
class    No      15    70    | 85
Total            25    75    | 100
Out of the 100 customers who received the offer, the model predicted that 25 customers would accept it (i.e., 25 times “yes”) and that 75 customers would reject it (75 times “no”).
After running the campaign, it turned out that 15 customers responded favorably (actual “yes”), while 85 did not.
Based on the confusion matrix, one can estimate the quality of a classification model
by calculating its:
- Accuracy
- Recall
- Precision
5.5.1. Accuracy
The model correctly predicted 10 “Yes” cases and 70 “No” cases =>
Accuracy = (10 + 70) / 100 = 80%
Accuracy is the most intuitive measure of the quality of classification models.
• It comes with a major flaw, which becomes apparent when the classes are
imbalanced.
• Experienced analysts are familiar with this issue, and have at their disposal complementary measures, such as recall and precision
5.5.2. Recall
Recall (also known as sensitivity) is the ability of a classification model to identify all relevant instances. The model identified 10 of the 15 actual “Yes” cases =>
Recall = 10 / 15 = 66.67%
• This means that only two-thirds of the positives were identified as positive, while one-third was missed
5.5.3. Precision
Precision is the ability of a classification model to return only relevant instances (to be correct when “Yes” is predicted). The model was correct in 10 of the 25 cases where “Yes” was predicted =>
Precision = 10 / 25 = 40%
There are two types of incorrect predictions: false positives (predicted “Yes”, actual “No”) and false negatives (predicted “No”, actual “Yes”).
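The three measures follow directly from the four cells of the confusion matrix above:

tp, fn, fp, tn = 10, 5, 15, 70  # true/false positives and negatives from the matrix

accuracy = (tp + tn) / (tp + fn + fp + tn)  # 0.80
recall = tp / (tp + fn)                     # ≈ 0.6667: share of actual "Yes" found
precision = tp / (tp + fp)                  # 0.40: share of predicted "Yes" correct
print(accuracy, round(recall, 4), precision)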
Copyright 2022 365 Data Science Ltd. Reproduction is forbidden unless authorized. All rights reserved.
Olivier Maugain
Email: team@365datascience.com