Data Literacy
By Olivier Maugain
TABLE OF CONTENTS
Abstract
1. Introduction
1.1. What Exactly Is Data Literacy
1.2. Why Do We Need Data Literacy
1.3. Data-Driven Decision Making
1.4. Benefits of Data Literacy
1.5. How to Get Started
2. Understanding Data
2.1. Data Definition
2.2. Types of Data
2.2.1. Qualitative vs. Quantitative Data
2.2.2. Structured vs. Unstructured Data
2.2.3. Data at Rest vs. Data in Motion
2.2.4. Transactional vs. Master Data
2.2.5. Big Data
2.3. Storing Data
2.3.1. Database
2.3.2. Data Warehouse
2.3.3. Data Marts
2.3.4. The ETL Process
2.3.5. Apache Hadoop
2.3.6. Data Lake
2.3.7. Cloud Systems
2.3.8. Edge Computing
2.3.9. Batch vs. Stream Processing
2.3.10. Graph Database
3. Using Data
3.1. Analysis vs. Analytics
3.2. Statistics
3.3. Business Intelligence (BI)
3.4. Artificial Intelligence (AI)
3.5. Machine Learning (ML)
ABSTRACT
Being data literate means having the necessary competencies to work with data.
Any manager or business executive worth their salt is able to articulate a problem that
can be solved using data.
If you want to build a successful career in any industry, acquiring full data literacy
should certainly be one of your key objectives.
First, you will start with understanding data terminology – we will discuss the different
types of data, data storage systems, and the technical tools needed to analyze data.
Then, we will proceed with showing you how to use data. We’ll talk about Business
Intelligence (BI), Artificial Intelligence (AI), as well as various machine and deep
learning techniques.
In the third chapter of the course, you will learn how to comprehend data, perform data
quality assessments, and read major statistics (measures of central tendency and
measures of spread).
1. Introduction
Definition:
“Data literacy is the ability to read, understand, create, and communicate data as
information.”
A data literate person has the necessary competencies to work with data:
Articulate a problem that can potentially be solved using data
Understand the data sources used
Check the adequacy and fitness of the data involved
Interpret the results of an analysis and extract insights
Make decisions based on the insights
Explain the value generated with a use case
Some of the most important questions a data literate person should be able to
answer are:
How do we store data?
What systems do we use to do that?
Are the data complete and clean enough to support a correct decision?
What are the main characteristics of a data set?
What methodology was applied to analyze the data?
How reliable is the result of an analysis or forecast?
Data literacy helps organizations make sense of all their data, creating value through:
Better customer understanding
Faster decision making
More accurate predictions
Optimized activities
Reduced risks and costs
Improved productivity
Better service to customers, suppliers, patients, etc.
2018 Survey “Lead with Data - How to Drive Data Literacy in the Enterprise” by the
software vendor Qlik
Results:
24% of business decision makers are fully confident in their own data literacy
32% of senior leaders are viewed as data literate
Only 21% of 16 to 24-year-olds are data literate
94% of respondents using data in their current role recognize that data help
them do their jobs better
82% of respondents believe that greater data literacy gives them stronger
professional credibility
Conclusion:
While enterprise-wide data literacy is considered important, data literacy levels
remain low.
2020 Survey “The Human Impact of Data Literacy” by Accenture and Qlik
Results:
Data-driven organizations benefit from improved corporate performance,
leading to an increased enterprise value of 3-5 percent
Only 32% of companies are able to realize tangible and measurable value
from data
Only 21% of the global workforce are fully confident in their data literacy
skills
At the same time, 74% of employees feel overwhelmed or unhappy when
working with data
Conclusion:
Although employees are expected to become self-sufficient with data and make data-
driven decisions, many do not have sufficient skills to work with data comfortably and
confidently.
Definition:
“Confirmation bias is the tendency to search for, interpret, favor, and recall information
in a way that confirms or supports one's prior beliefs or values.”
It leads people to unconsciously select the information that supports their views while
dismissing non-supportive information. Confirmation bias is not the only cognitive
bias the human mind is exposed to.
Data-driven decisions:
They constitute evidence of past results; with data, you have a record of
what works best, which can be revisited, examined, or scrapped if useless
Data and their analysis allow us to get more information
A large pool of evidence is the starting point for a virtuous cycle that
builds over time and may benefit not only a single decision-maker, but
the organization as a whole
The analysis of data allows us to handle more attributes, values,
parameters, and conditions than the human mind could process
Conclusion:
Gut feeling should not be the only basis for decision making. Data should not be
overlooked, but rather added to the mix of ideas, intuition, and experience when
making decisions.
1.4. Benefits of Data Literacy
Benefits of working with data:
Important notice:
The use of data is not about weakening humans and their decision-making power. It
should not be seen as a threat to managers’ jobs. Instead, data support and inform
managers, helping them make better decisions.
Conclusion:
To be successful in a “digital world”, one needs to become more confident in the use
of data.
1.5. How to Get Started
There are four stages of data literacy. This course concerns the first stage only.
First stage:
Terminology: Understand and employ the right terms
Describe and communicate with data
Ascertain the quality of data
Interpret and question the results of analyses performed by others
Extract information: Understand and internalize the importance of data for
one’s own success
Second stage:
Prepare data
Choose the right chart to visualize information
Carry out complete analyses
Identify patterns and extract insights
Tell business stories using data: Become autonomous in the processing of data
and the extraction of value from these
Third stage:
Design analyses and experiments
Understand statistical concepts, and apply the right techniques and tools
Remember and observe the assumptions and conditions related to the
techniques employed: Put data into operation in one’s own business domain
(e.g., marketing, medicine, sports, social science, etc.)
Fourth stage:
Develop and fine-tune statistical or mathematical models
Choose the right machine learning algorithm
Apply scientific standards in the use of data
Interpret the output of a predictive model to make sure that the results are
reliable
Using programming languages: Data as a profession
2. Understanding Data
2.1. Data Definition
Definition:
“Data are defined as factual information (such as measurements or statistics) used as a
basis for reasoning, discussion, or calculation.”
The word data should be used in the plural; the singular form is datum, a single value of a
single variable.
Data ≠ Information
Data are the raw facts
Information is derived from data
Data need to be processed and interpreted in a certain context in order
to be transformed into information
Examples of data:
A Spreadsheet with sales information
E-mails
Browsing history
Video files you shared on social media
Geolocation as tracked by mobile phones
The amount of fuel consumption recorded by vehicles
Types of data:
Quantitative vs. qualitative data
Structured vs. unstructured data
Data at rest vs. data in motion
Transactional vs. master data
(Small) data vs. “Big” data
Quantitative data:
Data that can be measured in numerical form. The values are measured as numbers or
counts, and each data point is associated with a unique numerical value.
Quantitative data are used to describe numeric variables. Types of quantitative data:
Discrete data: Data that can only take certain values (counts). They involve
integers (positive and negative). Examples:
- Sales volume (in units)
- Website traffic (number of visitors, sessions)
Continuous data: Data that can take any value. It involves real numbers.
Examples:
- ROI (return on investment) of a project
- Stock price
Interval (scale) data: Data that are measured along a scale. Each point on that
scale is placed at an equal distance (interval) from one another, with no
absolute zero. Examples:
- Credit score (300-850)
- Year of the foundation of a company
Ratio (scale) data: Data that are measured along a scale with an equal ratio
between each measurement and an absolute zero (the point of origin). They
cannot be negative. Examples:
- Revenue
- Age
Qualitative data:
Data that are collected in non-numerical form. They are descriptive and involve text or
categories, but can also be recoded as integers. Types of qualitative data:
Nominal (scale) data: Data that do not have a natural order or ranking. They
cannot be measured. Calculations with these data are meaningless.
Examples:
- Marital status
- Response to an email campaign (yes/no)
Ordinal (scale) data: Data that have ordered categories; the distances
between one another are not known. Order matters but not the difference
between values. Examples:
- Socio-economic class (“poor”, “working class”, “lower middle class”, “upper
middle class”, “upper class”)
2.2.2. Structured vs. Unstructured Data
Structured data are organized according to a pre-defined model (e.g., tables with rows and columns), whereas unstructured data are not.
Examples of unstructured data:
- Text documents (word processing files, pdf files, etc.)
- Presentations
- Emails
- Media files (images, audio, video)
- Customer feedback (written and spoken complaints)
- Invoices
- Sensor data (as collected by mobile devices, activity trackers, cars, airplane
engines, machines, etc.)
Metadata: These are data about data, providing information about other data.
2.2.3. Data at Rest vs. Data in Motion
Data at rest: Data that are stored physically (e.g., in a database or archive) and are not actively moving between systems.
They are inactive most of the time, but meant for later use
Data at rest are vulnerable to theft when physical or logical access is
gained to the storage media, e.g., by hacking into the operating system
hosting the data or by stealing the device itself
Examples:
- Databases
- Data warehouses
- Spreadsheets
- Archives
Data in motion: Data that are flowing through a network of two or more systems or
temporarily residing in computer memory.
Data in motion are often sent over public networks (such as the internet)
and must be protected against spying attacks
Examples:
- Data of a user logging into a website
- Telemetry data
- Video streaming
- Surveillance camera data
2.2.4. Transactional vs. Master Data
When working with data involving larger systems, such as ERP (Enterprise Resource
Planning, e.g., S/4HANA) or CRM (Customer Relationship Management, e.g.,
Salesforce Marketing Cloud), people make a distinction between two other types of
data.
Transactional data: Data recorded from transactions. They are volatile, because they
change frequently.
Examples:
- Purchase orders
- Sales receipts
- Bank transaction history
- Log records
Master data: Data that describe core entities around which business is conducted.
Master data may describe transactions, but they are not transactional in nature. They
change infrequently and are more static than transaction data.
Examples:
- Prospects
- Customers
- Accounting items
- Contracts
Illustrative example:
If a manufacturer buys multiple pieces of equipment at different times, a transaction
record needs to be created for each purchase (transactional data). However, the data
about the supplier itself stay the same (master data).
Challenges of master data management:
Master data are seldom stored in one location; they can be dispersed in
various software applications, files (e.g., databases, spreadsheets) or physical
media (e.g., paper records)
Various parts of the business may have different definitions and concepts for
the same business entity
Master data are often shared and used by different business units or corporate
functions
2.2.5. Big Data
Big data are data that are too large or too complex to handle with conventional data-
processing techniques and software. They are characterized by three defining
properties:
1. Volume: This aspect refers to the amount of data available. Data become big when
they no longer fit into the RAM of a desktop or laptop.
Examples:
- Sensor data produced by autonomous cars (up to 3.6 TB per hour, from about
100 built-in sensors constantly monitoring speed, engine temperature, braking
processes, etc.)
- Search queries on Google or other search engines (40,000 per second, or 3.5
billion per day)
- Data generated by aircraft engines (1 TB per flight according to GE)
2. Velocity: This aspect refers to the speed at which data are generated and need to be processed.
3. Variety: This aspect refers to the diversity of data types and sources (structured and unstructured; text, media, sensor data, etc.).
Some sources mention “Veracity” (or Validity) as a fourth defining property. This
aspect refers to the accuracy or truthfulness of a data set.
2.3. Storing Data
2.3.1. Database
A database organizes data into tables, rows, columns, and relationships:
Tables - collections of related records
Rows - each row represents a set of related data with the same structure (e.g., one
customer record)
Columns (a.k.a. attributes, fields) - made of data values (text values or
numbers) describing the item in the row; a column provides one single value for
each row
Example: customer ID, customer name, customer address
Relationships - the links between the tables
Relational databases:
Data are organized into tables that are linked (or “related”) to one
another
Each row in the table contains a record with a unique ID (the key), e.g.,
customer ID
Relational databases are very efficient and flexible when accessing
structured information
SQL:
One term that you will often hear in the context of relational databases is SQL
(Structured Query Language), the standard language used to define, query, and
update the data in a relational database.
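Below is a minimal sketch of how a relational table and a SQL query fit together, using Python’s built-in sqlite3 module; the customers table, its columns, and the sample rows are purely illustrative.

```python
import sqlite3

# Illustrative in-memory relational database with one "customers" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alex", "12 Main St"), (2, "Bill", "34 Oak Ave")],
)

# SQL query: retrieve a record via its unique key (customer_id).
for row in conn.execute("SELECT name, address FROM customers WHERE customer_id = 1"):
    print(row)  # ('Alex', '12 Main St')
```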
2.3.4. The ETL Process
The destination system may represent the data differently from the source, which
involves a series of treatments of the data. To prepare data and deliver them in a
format in which they can feed into or be read by applications, companies go through
an ETL process:
Extracting (the data from the source)
Transforming (the data while in transit)
Loading (the data into the specified target data warehouse or data mart)
Transforming includes:
Selecting specific columns
Recoding values (e.g., No -> 0, Yes -> 1)
Deriving new values through calculations
Deduplicating, i.e., identifying and removing duplicate records
Joining data from multiple sources
Aggregating multiple rows of data
Conforming data so that separate sources can be used together
Cleansing data to ensure data quality and consistency
Changing the format and structure so that the data can be queried or
visualized
Companies offering ETL tools include Informatica, IBM, Microsoft, and Oracle.
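A minimal pandas sketch of a few of the transformation steps listed above; the data frame, its columns, and the recoding map are illustrative only and are not taken from any particular ETL tool.

```python
import pandas as pd

# Illustrative "extracted" source data.
orders = pd.DataFrame({
    "order_id": [101, 101, 102, 103],
    "customer": ["Alex", "Alex", "Bill", "Chris"],
    "newsletter": ["Yes", "Yes", "No", "Yes"],
    "amount": [120.0, 120.0, 80.0, 50.0],
})

# Transform: select columns, recode values (No -> 0, Yes -> 1), deduplicate.
transformed = (
    orders[["order_id", "customer", "newsletter", "amount"]]
    .assign(newsletter=lambda df: df["newsletter"].map({"No": 0, "Yes": 1}))
    .drop_duplicates()
)

# Aggregate multiple rows per customer before loading into the target system.
loaded = transformed.groupby("customer", as_index=False)["amount"].sum()
print(loaded)
```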
2.3.5. Apache Hadoop
Data storage in the past:
The data a company needed could not all fit in a single file, database, or even a
single computer
Processing them simultaneously is also practically impossible with one
computer
Apache Hadoop was the first major solution to address these issues, becoming
synonymous with “big” data.
Definition:
“Hadoop is a set of software utilities designed to enable the use of a network of
computers to solve “big” data problems, i.e., problems involving massive amounts of
data and computation.”
This technology allows a company to distribute the storage and processing of massive
data sets across a network of ordinary computers.
2.3.6. Data Lake
A data lake stores large amounts of raw data in their native format. The lack of
structure of the data makes them more suitable for exploration than for operational
purposes.
On premise: Data lakes are stored on the company’s own hardware. Advantages:
Security: The owner’s data are under control within its firewall
Speed: Access over the local network within the same building tends to be
faster than over the Internet
Cost: It is cheaper to purchase one’s own hardware than to lease it from a
third-party service provider (at least if the system is under constant use)
In the cloud: Data lakes are stored using Internet-based storage services such as
Amazon Web Services (AWS), Google Cloud, Microsoft Azure, etc. Advantages:
Accessibility: Since the data are online, users can retrieve them wherever and
whenever they need them
Adaptability: Additional storage space and computing power can be added
immediately as needed
Convenience: All maintenance (e.g., replacement of broken computers) is taken
care of by the cloud service provider
Resilience: Service providers typically offer redundancy across their own data
centers by making duplicates of their clients’ data, retained as a fallback
2.3.9. Batch vs. Stream Processing
Batch processing: Data (at rest) are collected over a period of time and then processed together in batches.
Example: A fast-food chain keeps track of daily revenue across all restaurants. Instead
of processing all transactions in real time, it aggregates the revenue and processes
the batches of each outlet’s numbers once per day.
Stream processing: Data (in motion) are fed into a system as soon as they are
generated, one bit at a time.
A better choice when speed matters
Each data point or “micro-batch” flows directly into an analytics platform,
producing key insights in near real-time or real-time.
Good for supporting non-stop data flows, enabling instantaneous
reactions
It works best when speed and agility are required
Common fields of applications include:
- Cybersecurity
- Stock trading
- Programmatic buying (in advertising)
- Fraud detection
- Air traffic control
- Production line monitoring
Example: A credit card provider receives data related to a purchase transaction as
soon as the user swipes the card. Its systems can immediately recognize and block
anomalous transactions, prompting additional inspection procedures. Non-fraudulent
charges are approved without delay, so that customers do not have to wait
unnecessarily.
Both processing techniques:
Require different types of data (at rest vs. in motion)
Rely on different infrastructure (storage systems, database,
programming languages)
Impose different processing methods
Involve different analytics techniques
Address different issues
Serve different goals
2.3.10. Graph Database
Traditional relational database systems are not equipped to support data connections
beyond a certain degree of complexity. To handle such highly connected data,
companies use graph databases.
Graph (semantic) databases:
[Diagram: the people Alex, Bill, and Chris are connected by “Likes” edges to the hobbies Music and Sports]
The people (Alex, Bill, Chris) and hobbies (Music, Sports) are data entities,
represented by “nodes”. The “Likes” relationships connecting them are represented by “edges”.
Example: On Facebook, the nodes represent users and content, while the edges
constitute activities such as “is a friend”, “posted”, “like”, “clicked”, etc.
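A minimal sketch of how such a graph could be represented and traversed; it uses a plain Python dictionary with made-up “Likes” edges rather than an actual graph database, purely to illustrate the idea of nodes and edges.

```python
# Nodes: people and hobbies; edges: "Likes" relationships (illustrative assignments).
likes = {
    "Alex": ["Music"],
    "Bill": ["Music", "Sports"],
    "Chris": ["Sports"],
}

# Traverse the edges: who shares at least one hobby with Alex?
alex_hobbies = set(likes["Alex"])
for person, hobbies in likes.items():
    if person != "Alex" and alex_hobbies & set(hobbies):
        print(person, "shares a hobby with Alex")  # Bill
```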
3. Using Data
3.1. Analysis vs. Analytics
Definition:
“Analysis is a detailed examination of anything complex in order to understand its
nature or to determine its essential features.”
Data analysis is the in-depth study of all the components of a given data
set
Analysis is about looking backward to understand the reasons behind a
phenomenon
It involves the dissection of a data set and the examination of all parts
individually, as well as of their relationships with one another
The ultimate purpose of analysis is to extract useful information from the
data (discovery of trends, patterns, or anomalies)
It involves the detailed review of current or historical facts
The data being analyzed describe things that already happened in the
past
Examples:
- Comparison of the sales performances across regions or products
- Measurement of the effectiveness of marketing campaigns
- Assessment of risks (in finance, medicine, etc.)
Definition:
“Analytics is a broader term covering the complete management of data.”
It encompasses not only the examination of data, but also their
collection, organization, and storage, as well as the methods and tools
employed
Data analytics implies a more systematic and scientific approach to
working with data throughout their life cycle, which includes: data
acquisition, data filtering, data extraction, data aggregation, data
validation, data cleansing, data analysis, data visualization, etc.
3.2. Statistics
Definition:
“Statistics concerns the collection, organization, analysis, interpretation, and
presentation of data.”
Well-known software tools for statistical analysis: IBM SPSS Statistics, SAS, STATA, and
R.
Types of statistics:
Descriptive statistics
Inferential statistics
Practical Applications:
Operating decisions, e.g., the allocation of marketing budget to different
campaigns, brand positioning, pricing
Strategic decisions, e.g., international expansion, opening or closing of
plants, priority setting
Enterprise reporting — the regular creation and distribution of reports
describing business performance
Dashboarding — the provision of displays that visually track key performance
indicators (KPIs) relevant to a particular objective or business process
Online analytical processing (OLAP) — querying information from multiple
database systems at the same time, e.g., the multi-dimensional analysis of
data
Financial Planning and Analysis — a set of activities that support an
organization's financial health, e.g., financial planning, forecasting, and
budgeting
Related Fields:
Artificial intelligence research relies on approaches and methods from various fields:
Mathematics, statistics, economics, probability, computer science, linguistics,
psychology, philosophy, etc.
Practical Applications:
Recognizing and classifying objects (on a photo or in reality)
Playing a game of chess or poker
Driving a car or plane
Recognizing and understanding human speech
Translating languages
Moving and manipulating objects
Solving complex business problems and making decisions like humans
Examples:
In 2016, AlphaGo (developed by DeepMind Technologies, later
acquired by Google) defeated Go champion LEE Sedol
In 2017, Libratus (developed at Carnegie Mellon University) won a Texas
hold 'em tournament involving four top-class human poker players
In 2019, a report pooling the results of 14 separate studies revealed that
AI systems correctly detected a disease state 87% of the time (compared
with 86% for human healthcare professionals) and correctly gave the all-
clear 93% of the time (compared with 91% for human experts)
In 2020, researchers of Karlsruhe Institute of Technology developed a
system that outperforms humans in recognizing spontaneously spoken
language with minimum latency.
Narrow AI:
AI programs that are able to solve one single kind of problem
Narrow AI applications that work in different individual domains could
be incorporated into a single machine
Such a machine would have the capacity to learn any intellectual task
that a human being can, a.k.a. “artificial general intelligence” (AGI)
Conclusions:
AI is used to improve the effectiveness or efficiency of processes or decisions. It
should be thought of as a facilitator of human productivity – not as a replacement for
human intelligence.
Definition:
“An algorithm is a procedure for solving a mathematical problem in a finite number of
steps that frequently involves repetition of an operation.”
An algorithm:
Provides step-by-step guidance on what to do to solve the problem
Can be run again with a few parameter variations, to see if the new
configuration leads to a better result
It can take several iterations for the algorithm to produce a good
enough solution to the problem
Selects the model that yields the best solution
Machine learning algorithms find their way to better solutions without
being explicitly programmed where to look
Users must determine how to define when the problem is solved or what
kind of changes need to be made for each iteration
Training data:
These are past observations, used to “feed” the algorithm so that it can gain
initial experience
These observations represent units of information that teach the machine
trends and similarities derived from the data
The machine gets better with every iteration
Once the algorithm is able to distinguish patterns, you can make predictions
on new data
The process in which an algorithm goes through the training data again and again is
called “training the model”.
3.6. Supervised Learning
Definition:
“Mapping is the association of all elements of a given data set with the elements of a
second set.”
[Scatter plot: number of umbrellas sold (vertical axis) vs. Rainfall in mm (horizontal axis), with fitted regression line y = 0.45x - 18.69]
The dots on the scatter plot correspond to the observed data, each of
them representing one month (24 months in total)
For each month, we can see the rainfall in mm (the independent
variable, on the horizontal axis) and the number of umbrellas sold (the
dependent variable, on the vertical axis)
There’s a positive relationship between both variables: the more rainfall, the
more umbrellas sold – and vice versa
Polynomial regression - when the relationship between the dependent and
independent variable is non-linear.
Example 1:
Exponential regression
[Scatter plot: Weight (kg) vs. Height (cm), showing a non-linear relationship fitted with an exponential curve]
Example 2:
Polynomial regression
[Scatter plot: Salary (EUR) vs. Age, showing a non-linear relationship fitted with a polynomial curve]
We employ previous time periods as input variables, while the next time
period is the output variable
Time series analysis has the objective to forecast the future values of a
series based on the history of that series
Time series are used in various fields, e.g., economics, finance, weather
forecasting, astronomy, supply chain management, etc.
Time series are typically plotted via temporal line charts (run charts).
Examples:
Daily closing prices of stocks
Weekly sales volume of laundry detergents
Monthly count of a website’s active users
Annual birth or death rate for a given country
3.6.3. Classification
Definition:
“Classification is a supervised machine learning approach where a class (also called
category, target, or label) is predicted for a new observation. The main goal is to
identify which class the new observation will fall into. “
Detect anomalies
- Fraud detection (via flagging of outliers)
- Predictive maintenance (via discovery of defective parts in a machine or system)
- Network intrusion detection
Reduce features in a data set
- Describing customers with 5 attributes almost as precisely as with 10 attributes
3.7.1. Clustering Analysis
Definition:
“Clustering is the splitting of a data set into a number of categories (or classes, labels),
which are initially unknown.”
You do not know in advance what you are looking for
The data are unlabeled
The objective is to discover possible labels automatically
The categories produced are only interpreted after their creation
A clustering algorithm automatically identifies categories to arrange the
data in a meaningful way
The grouping of the data into categories is based on some measure of
inherent similarity or distance
The algorithm must group the data in a way that maximizes the
homogeneity within and the heterogeneity between the clusters
The interpretation of the clusters constitutes a test for the quality of the
clustering
Example:
30 data points (each data point corresponding to one consumer of chocolate
bars) and two variables:
X-axis - the average weekly consumption of chocolate bars in the last 12 months
Y-axis - the growth in consumption of chocolate bars in the last 12 months, compared to
the previous 12 months
[Scatter plots: Customer segmentation - Consumption growth (vertical axis) vs. Average weekly consumption (horizontal axis)]
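A minimal scikit-learn sketch of clustering such two-dimensional customer data; the consumption numbers, the choice of two clusters, and the use of k-means are illustrative assumptions, not the course’s actual example data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: average weekly consumption (x) and consumption growth (y).
rng = np.random.default_rng(0)
light_buyers = rng.normal(loc=[2.0, 0.5], scale=0.4, size=(15, 2))
heavy_buyers = rng.normal(loc=[9.0, 3.0], scale=0.6, size=(15, 2))
customers = np.vstack([light_buyers, heavy_buyers])

# The algorithm groups the 30 points so that each cluster is as homogeneous as possible.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the "average" customer of each segment
```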
The associations are usually represented in the form of rules or frequent item sets.
Practical applications:
Market basket analysis: The exploration of customer buying patterns by finding
associations between items that shoppers put into their baskets, e.g., “tent and
sleeping bags” or “beer and diapers”; it produces the probability of a joint
occurrence
Product recommendation: When a customer puts a tent in his online shopping
cart, the website displays an offer for a sleeping bag
Design of promotions: The retailer offers a special price for just one of the
associated products (e.g., crackers) but not for the other one (e.g., dip) at the same time
Customization of store layout: Increasing customer traffic so that customers
have more opportunities to buy products, e.g., strawberries and whipped
cream can be placed at opposite ends of the store
3.8. Reinforcement Learning
Definition:
“Reinforcement learning is a technique where a machine (“agent”) learns through trial
and error, using feedback from its own actions.“
Example:
In 2016, AlphaGo defeated LEE Sedol, the world champion in the game of Go. Within
a period of just a few days, AlphaGo had accumulated thousands of years of human
knowledge, enough to beat a champion with 21 years of professional experience.
3.9. Deep Learning
Definition:
“Deep learning is a subfield of machine learning, applying methods inspired by the
structure and function of the human brain.”
Deep learning can be supervised, unsupervised or take place through
reinforcement learning
With deep learning, computers learn how to solve particularly complex
tasks – close to what we describe as “artificial intelligence”
Deep learning teaches machines to imitate the way humans gain
knowledge
The algorithm performs a task repeatedly, adjusting it a bit to improve
the outcome
Neural networks can be taught to classify data and identify patterns like
humans
When the machine receives new information, the algorithm compares it
with known objects and tries to draw similar conclusions
The more it knows about the objects’ distinguishing characteristics, the
more likely the algorithm is to solve the problem, e.g., recognize a
certain animal on an image
The system improves its ability as it gets exposed to more examples
Artificial neural networks (ANN) - The algorithms used to achieve deep learning.
ANN were inspired by the way the brain processes information and how
communication nodes are distributed
A neural network is made of a collection of connected nodes (called
neurons)
By connecting the nodes to one another, information can “flow”
through the network
Neurons are organized in layers
The more layers involved, the “deeper” the neural network
Practical applications:
Image recognition
Autonomous driving: The computer can differentiate between
pedestrians of different sizes, e.g., children vs. adults
An NLG (natural language generation) system makes decisions about how to turn a
concept into words and sentences
It is less complex than NLU (natural language understanding), since the
concepts to articulate are usually known
An NLG system must simply choose the right expression among several
potential ones
Classic examples: the production of (written-out) weather forecasts from
weather data, automated journalism, generating product descriptions for
eCommerce sites, interacting with customers via chatbots, etc.
4. Reading Data
In some areas of life, the right questions are more important than the right answers.
This also applies to working with data. It will often be the case that the ultimate
beneficiary of data (for example, the decision maker) is different from the person
who processed these data and produced the results (or the answers).
4.1. Data quality
Errors can creep into data at many points, for example:
A customer makes a spelling mistake when writing his address in the paper
registration form
A clerk misreads the handwriting of the customer and fills in the wrong address
into the system
The CRM system only has space for 40 characters in the “address” field and cuts
every character beyond this limit
The migration tool cannot read special characters, using placeholders instead
A data analyst makes a mistake when joining two data sets, leading to a
mismatch of attributes for all records
Key data flaws:
Incomplete data: This is when there are missing values in the data set.
If there are too many missing values for one attribute, the latter becomes
useless and needs to be discarded from the analysis
If only a small percentage is missing, then we can eliminate the records with the
missing values or make an assumption about what the values could be
Business decision makers should ask the following questions to the people who
processed the data:
What is the proportion of missing values for each attribute?
How were the missing values replaced?
Inaccurate data: This can happen in many ways, for example:
Spelling mistakes
Inconsistent units
A famous case is that of NASA’s Mars Climate Orbiter, which was
unintentionally destroyed during a mission in 1999. The loss was due to a piece
of software that used the wrong unit of impulse (pound-force seconds instead
of the SI unit of newton-seconds). The total cost of the mistake was estimated
at $327.6 million
Inconsistent formats
E.g., when numbers are declared as text, which makes them unreadable for
some programs
Impossible values
A negative age (for a person)
A height of 5.27 meters for a human
A household size of 4.8 (for one family)
A future birthday (for someone already born)
An employer called “erufjdskdfnd”
“999” years as a contract duration
An unusually high or low value does not necessarily have to be inaccurate; it
could also be an outlier (i.e., a data point that differs significantly from other
observations)
4.2. Data description
When communicating about or with data, it is essential to be able to describe these.
Although it is not always possible to go through them one by one, there are
ways to explore data and to get an overview about what is available.
The objective is to single out possible issues or patterns that might be worth
digging further into
A description of the data also helps determine how useful they can be to
answer the questions one is interested in
Completeness and accuracy are two key data quality properties that should be
clarified
Beyond that, the questions that a data consumer should ask are the following:
o What do the data describe?
o Do the data cover the problems we are trying to solve?
o What does each record represent? How granular (vs. coarse) are they?
o How “fresh” are the data? Do they reflect the current situation? When
were they collected?
Descriptive statistics can be used to describe and understand the basic
characteristics of a data set. A descriptive statistic is a number that provides a simple
summary of the sample and the measures. It allows one to summarize large amounts of
data in a convenient way.
Descriptive statistics are broken into two basic categories:
1) Measures of central tendency
2) Measures of spread
4.3. Measures of central tendency
Measures of central tendency (or “measures of center”) - focus on the average or
middle values of a data set
They describe the most common patterns of the analyzed data set. There are three
main such measures: 1) the mean, 2) the median, and 3) the mode
The mean is the average of the numbers.
It is easy to calculate
It is the sum of the values divided by the number of values
Example: The following data set indicates the number of chocolate bars consumed by
11 individuals in the previous week. The series has 11 values (one value per person):
0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16
Mean = (0 + 1 + 2 + 2 + 3 + 4 + 5 + 6 + 7 + 9 + 16) / 11 = 55 / 11 = 5
This means that any given person in this sample ate 5 chocolate bars in the previous
week (on average)
The median is the value lying in the “middle” of the data set.
It separates the higher half from the lower half of a data sample
It is calculated as follows:
o Arrange the numbers in numerical order
o Count how many numbers there are
o If it is an odd number, divide by 2 and round up to get the position of the
median number
o If you have an even number, divide by 2. Go to the number in that
position and average it with the number in the next higher position
Example 1: Dataset with 11 values: 0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16
Median = 4 (4 is in the middle of the dataset), Mean = 5
=> one half of the people in the sample consumed 4 or fewer chocolate bars and the
other half consumed 4 or more bars
Example 2: Dataset with 10 values: 1, 2, 2, 3, 4, 5, 6, 7, 9, 16
Median is 4.5 (the average of 4 and 5 in the middle of the dataset), Mean = 5.5
=> In that case, half of the people in the sample ate less than 4.5 chocolate bars and
the other half ate more than 4.5 bars
Example 3: Dataset with 11 values: 0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 93
Median = 4, Mean = 12
The median is not affected by extreme values or outliers (i.e., data points that differ
significantly from other observations)
The median is a “robust” measure of central tendency
The mean is susceptible to outliers
N.B. A comparison of the mean and median can offer additional clues about the data
A similar mean and median indicate that the values are quite balanced around
the mean
If, however, the median is lower than the mean, it is possible that there are
outliers at the high end of the value spectrum (or “distribution” in the statistics
jargon)
If median income < mean income => there is a large low- or middle-class population
with a small minority with extremely high incomes (billionaires).
If median income > mean income => the economy probably consists of a large middle
class and a small, extremely poor, minority.
The mode is the value that appears most frequently in the data set. In the example
data set (0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16), the mode is 2, as it occurs twice.
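A minimal sketch using Python’s built-in statistics module on the chocolate-bar data sets from the examples above.

```python
import statistics

bars = [0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16]
print(statistics.mean(bars))    # 5
print(statistics.median(bars))  # 4
print(statistics.mode(bars))    # 2

# With an outlier (93 instead of 16), the mean jumps while the median is unaffected.
bars_outlier = [0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 93]
print(statistics.mean(bars_outlier))    # 12
print(statistics.median(bars_outlier))  # 4
```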
4.4. Measures of spread
Measures of spread (also known as “measures of variability” and “measures of
dispersion”) describe the dispersion of data within the data set. The dispersion is the
extent to which data points depart from the center and from each other
The higher the dispersion, the more “scattered” the data points are.
Main measures of spread: 1) The minimum and maximum, 2) the range, 3) the
variance and standard deviation
Minimum and maximum - the lowest and the highest values of the data set, respectively
Dataset with 11 values: 0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16
o The minimum = 0; the maximum = 16
A minimum that appears too low or a maximum that appears too high may suggest
problems with the data set. The records related to these values should be carefully
examined.
Range - the difference between the maximum and the minimum
It is easy to calculate
It provides a quick estimate about the spread of values in the data set, but is
not a very good measure of variability
In the previous example, the range is 16 (= 16-0)
Variance - describes how far each value in the data set is from the mean (and hence
from every other value in the set)
Mathematically, it is defined as the average of the squared differences
between the observed values and the mean
It is always positive
Due to its squared nature, the variance is not widely used in practice
Dataset: 0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16 - the variance is 18.73
Standard deviation - measures the dispersion of a data set relative to its mean
It is calculated as the square root of the variance
It expresses by how much the values of the data set differ from the mean value for
that data set
In simple terms, it can be regarded as the average distance from the mean.
A low standard deviation reveals that the values tend to be close to the mean
of the data set
Inversely, a high standard deviation signifies that the values are spread out
over a wider range
A useful property of the standard deviation compared to the variance is that it
is expressed in the same unit as the data
Dataset: 0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16 - the standard deviation is approximately 4.33
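A minimal sketch of the measures of spread, again using Python’s statistics module; pvariance and pstdev compute the population variance and standard deviation, the convention that matches the numbers quoted above.

```python
import statistics

bars = [0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16]
print(min(bars), max(bars))                  # minimum 0, maximum 16
print(max(bars) - min(bars))                 # range: 16
print(round(statistics.pvariance(bars), 2))  # variance: 18.73
print(round(statistics.pstdev(bars), 2))     # standard deviation: 4.33
```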
5. Interpreting Data
Different types of analyses or methods produce various forms of output. These can be
statistics, coefficients, probabilities, errors, etc., which can provide different insights
to the reader.
It is important to be able to interpret these results, to understand what they mean, to
know what kind of conclusions can be drawn and how to apply them for further
decision making.
Data interpretation requires domain expertise, but also curiosity to ask the right
questions. When the results are not in line with one’s expectations, these should be
met with a sufficient level of doubt and examined in further depth. Such a sense of
mistrust and inquisitiveness can be trained.
Five types of data interpretation approaches:
1) Correlation
2) Linear regression
3) Forecasting
4) Statistical tests
5) Classification
5.1. Correlation
When a correlation exists, you should be able to draw a straight line (called the
“regression line”) that fits the data well.
Positive correlation (upward sloping regression line) - both variables move in the
same direction. When one increases/decreases, the other increases/decreases as
well.
Negative correlation (downward sloping regression line) - the variables move in
opposite directions. When one increases/decreases, the other decreases/increases.
[Scatter plot: an example of positive correlation, with Daylight (hours) on the horizontal axis]
The more tightly the plot forms a line rising from left to right, the stronger the
correlation.
[Scatter plot: another example of positive correlation, with Temperature low (°C) on the horizontal axis]
“Perfect” correlation - all the points would lie on the regression line itself
Lack of correlation - If a line does not fit, i.e. if the dots are located far away from the
line, it means that the variables are not correlated. In that case, the line is relatively
flat.
[Scatter plot: lack of correlation - the dots lie far from a nearly flat line, with Daylight (hours) on the horizontal axis]
The strong correlation between two variables A and B can be due to various
scenarios:
o Direct causality: A causes B
o Reverse causality: B causes A
o Bidirectional causality: A causes B and B causes A
o A and B are both caused by C
o Pure coincidence, i.e., there is no connection between A and B
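A minimal sketch of computing the correlation coefficient discussed above; the daylight and temperature values are made up for illustration.

```python
import numpy as np

# Illustrative monthly data: daylight hours and average temperature (made up).
daylight = np.array([8.5, 9.2, 11.0, 13.1, 15.0, 16.4, 16.0, 14.5, 12.4, 10.5, 9.0, 8.3])
temperature = np.array([1.0, 2.5, 6.0, 11.0, 16.0, 20.0, 22.0, 21.0, 16.0, 10.0, 5.0, 2.0])

# Pearson correlation coefficient: +1 = perfect positive, 0 = none, -1 = perfect negative.
r = np.corrcoef(daylight, temperature)[0, 1]
print(round(r, 2))  # close to +1, i.e., a strong positive correlation
```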
5.2. Linear Regression
[Scatter plot: temperature (vertical axis) vs. Daylight in hours (horizontal axis), with fitted regression line Y = 1.67X - 6.64]
Regression equation: Y = a + bX
Y (dependent variable) - the one to be explained or to be predicted
X (independent variable, predictor) – the one explaining or predicting the value of Y
a (intercept) - a constant, corresponding to the point at which the line crosses the
vertical axis, i.e., when X is equal to zero
b (slope) - the coefficient of X, quantifying how much Y changes for each incremental
(one-unit) change in X.
N.B. The higher the absolute value of b, the steeper the regression line
The sign of b indicates the direction of the relationship between Y and X
b > 0 - the regression line shows an upward slope. An increase of X results in an
increase of Y
b < 0 - the regression line shows a downward slope. An increase of X results in a
decrease of Y
Notice that a prediction using simple linear regression does not prove any causality.
The coefficient b, no matter its absolute value, says nothing about a causal
relationship between the dependent and the independent variables.
Summary: The objective of the regression is to plot the one line that best
characterizes the cloud of dots.
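A minimal sketch of fitting such a line with NumPy; it reuses the same made-up daylight/temperature numbers as in the correlation sketch above, so the fitted coefficients and R-squared are illustrative, not the course’s exact values.

```python
import numpy as np

# Illustrative data: daylight hours (X) and temperature (Y); values are made up.
daylight = np.array([8.5, 9.2, 11.0, 13.1, 15.0, 16.4, 16.0, 14.5, 12.4, 10.5, 9.0, 8.3])
temperature = np.array([1.0, 2.5, 6.0, 11.0, 16.0, 20.0, 22.0, 21.0, 16.0, 10.0, 5.0, 2.0])

# Fit Y = a + bX: polyfit returns the slope b and the intercept a.
b, a = np.polyfit(daylight, temperature, deg=1)
print(f"Y = {a:.2f} + {b:.2f}X")

# R-squared: the proportion of the variance in Y explained by the fitted line.
predicted = a + b * daylight
ss_res = np.sum((temperature - predicted) ** 2)
ss_tot = np.sum((temperature - temperature.mean()) ** 2)
print("R-squared:", round(1 - ss_res / ss_tot, 2))
```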
5.2.1. R-squared
After running a linear regression analysis, you need to examine how well the model
fits the data, i.e., determine whether the regression equation does a good job of
explaining changes in the dependent variable.
The regression model fits the data well when the differences between the
observations and the predicted values are relatively small. If these differences
are too large, or if the model is biased, you cannot trust the results.
R-squared (R2, coefficient of determination) - estimates the scatter of the data points
around the fitted regression line
It represents the proportion of the variance in a dependent variable that can be
explained by the independent variable.
The remaining variance can be attributed to additional, unknown, variables or
inherent variability.
In simple linear regression models, the R-squared statistic is always a number
between 0 and 1 (i.e., between 0% and 100%)
R-squared = 0 => the model does not explain any of the variance in the
dependent variable. The model is useless and should not be used to
make predictions
R-squared = 1 => the model explains the whole variance in the
dependent variable. Its predictions perfectly fit the data, as all the
observations fall exactly on the regression line.
Example:
[Scatter plot: temperature vs. Daylight (hours), with regression line Y = 1.67X - 6.64 and R² = 0.72]
R-squared = 0.72
=> the number of daylight hours accounts for about 72% of the variance of the temperature
=> the number of daylight hours is a good predictor of the temperature.
What is a good R-squared value? At what level can you trust a model?
Some people or textbooks claim that 0.70 is such a threshold, i.e., that if a
model returns an R-squared of 0.70, it fits well enough to make predictions
based on it. Inversely, a value of R-squared below 0.70 indicates that the
model does not fit well.
Feel free to use this rule of thumb. At the same time, you should be aware of
its caveats.
Indeed, the properties of R-squared are not as clear-cut as one may think.
Intuitively, a higher R-squared indicates a better fit for the model, producing more
accurate predictions, so a model that explains 70% of the variance should be much
better than one that explains 30% of the variance. However, such a conclusion is not
necessarily correct.
R-squared depends on various factors, such as:
- The sample size: The larger the number of observations, the lower the R-squared
typically gets
- The granularity of the data: Models based on case-level data have lower R-
squared statistics than those based on aggregated data (e.g., city or country
data)
- The type of data employed in the model: When the variables are categorical
or counts, the R-squared will typically be lower than with continuous data
- The field of research: studies that aim at explaining human behavior tend to
have lower R-squared values than those dealing with natural phenomena.
This is simply because people are harder to predict than stars, molecules,
cells, viruses, etc.
5.3. Forecasting
Forecast accuracy – the closeness of the forecasted value to the actual value.
For a business decision maker, the key question is how to determine the accuracy of
forecasts:
“Can you trust the forecast enough to make a decision based on it?”
As the actual value cannot be measured at the time the forecast is made, the accuracy
can only be determined retrospectively.
Mean Absolute Error (MAE) - the average of the absolute differences between the
forecasted and the actual values.
With the MAE, we can get an idea about how large the error from the
forecast is expected to be on average.
The main problem with it is that it can be difficult to anticipate the
relative size of the error. How can we tell a big error from a small error?
What does a MAE of 100 mean? Is it a good or bad forecast?
That depends on the underlying quantities and their units. For example, if your
monthly average sales volume is 10,000 units and the MAE of your forecast for the
next month is 100, then this is an amazing forecasting accuracy. However, if the sales
volume is only 10 units on average, then we are talking of a rather poor accuracy.
Mean Absolute Percentage Error (MAPE) - the average of the individual absolute errors
divided by the corresponding actual values, i.e., the average error expressed as a
percentage of the actual value.
Root Mean Square Error (RMSE) - the square root of the average squared error.
RMSE has the key advantage of giving more importance to the most significant errors.
Accordingly, one big error is enough to lead to a higher RMSE. The decision maker is
not as easily pointed in the wrong direction as with the MAE or MAPE.
Best practice is to compare the MAE and RMSE to determine whether the
forecast contains large errors. The smaller the difference between RMSE and
MAE, the more consistent the error size, and the more reliable the value.
As with all “errors”, the objective is always to avoid them.
The smaller the MAE, MAPE or RMSE, the better!
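A minimal sketch of how the three error measures are computed from a set of actual and forecasted values; the numbers are made up.

```python
import numpy as np

# Illustrative monthly sales: actual values vs. forecasted values (made up).
actual = np.array([10200, 9800, 11000, 10500, 9900, 10400])
forecast = np.array([10000, 10100, 10800, 10900, 9700, 10600])

errors = actual - forecast
mae = np.mean(np.abs(errors))                  # Mean Absolute Error
mape = np.mean(np.abs(errors) / actual) * 100  # Mean Absolute Percentage Error (%)
rmse = np.sqrt(np.mean(errors ** 2))           # Root Mean Square Error

print(round(mae, 1), round(mape, 2), round(rmse, 1))
```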
What is a “good” or “acceptable” value of the MAE, MAPE or RMSE depends on:
The environment: In a stable environment, in which demand or prices do not
vary so much over time (e.g., electricity or water distribution), demand or sales
volumes are likely to be rather steady and predictable. A forecasting model
may therefore yield a very low MAPE, possibly under 5%.
The industry: In volatile industries (e.g., machine building, oil & gas, chemicals)
or if the company is exposed to hypercompetition and constantly has to run
advertising campaigns or price promotions (like in the FMCG or travel
industries), sales volumes vary significantly over time and are much more
difficult to forecast accurately. Accordingly, the MAPE of a model could
be much higher than 5%, and yet be useful for decision makers in the sales,
finance or supply chain departments.
The geographic scope: Forecasts for larger geographic areas (e.g., continental
or national level) are generally more accurate than those for smaller areas (e.g.,
regional or local).
The time frame: Longer period (e.g., monthly) forecasts usually yield higher
accuracies than shorter period (e.g., daily or hourly) forecasts.
With three well-established indicators available, none can be singled out as universally
better than the others. Each indicator can help you avoid some shortcomings but will be
prone to others. Only experimentation with all three indicators can tell you which one
is best, depending on the phenomenon to be forecasted.
5.4. Statistical tests
Hypothesis testing is a key tool in inferential statistics and is used in various domains -
social sciences, medicine, and market research. The purpose of hypothesis testing is
to establish whether there is enough statistical evidence in favor of a certain idea or
assumption, i.e., the hypothesis.
The process involves testing an assumption regarding a population by measuring
and analyzing a random sample taken from that population
Population - the entire group that you want to draw conclusions about, e.g., all
employees of a company.
Sample - the specific group that data are collected from. Its size is always smaller than
that of the population.
If you randomly select 189 men and 193 women among these employees to carry out
a survey, these 382 employees constitute your sample.
The null hypothesis (H0) is the assumption that there is no effect or no difference, for example:
- The mean CTRs (click-through rates) of the red and blue conversion buttons are the same.
OR: the difference between the mean CTRs of the red and the blue conversion buttons
is equal to zero
It is called the “null” hypothesis, because it is usually the hypothesis that we want to nullify
or to disprove.
The alternative hypothesis (H1) is the one that you want to investigate, because
you think that it can help explain a phenomenon
It represents what you believe to be true or hope to prove true.
The greater the dissimilarity between the observed data and what the null hypothesis
would predict, the less likely it is that the difference occurred by chance.
The p-value is the probability of obtaining results at least as extreme as the ones
observed, assuming that the null hypothesis is true.
Examples:
If p-value = 0.0326 => there is only a 3.26% chance of seeing results at least
this extreme if the null hypothesis is true.
If p-value = 0.9429 => results like these would be very likely (94.29%) even if
the null hypothesis were true.
The smaller the p-value, the stronger the evidence that you should reject the
null hypothesis
When you see a report with the results of statistical tests, look out for the p-value.
Normally, the closer to 0.000, the better – depending, of course, on the hypotheses
stated in that report.
5.4.3. Statistical significance
The significance level (alpha, α) is a number stated in advance to determine how
small the p-value must be in order to reject the null hypothesis.
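A minimal sketch of such a test, using a two-sample t-test from SciPy on made-up daily click-through rates for the red and blue buttons; the data, the choice of test, and the 0.05 significance level are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Illustrative daily click-through rates (in %) for two button colors (made up).
ctr_red = np.array([2.1, 2.4, 1.9, 2.6, 2.3, 2.2, 2.5, 2.0])
ctr_blue = np.array([1.8, 1.7, 2.0, 1.6, 1.9, 1.8, 2.1, 1.7])

# H0: the mean CTRs are the same; H1: they differ.
t_stat, p_value = stats.ttest_ind(ctr_red, ctr_blue)

alpha = 0.05  # significance level chosen in advance
print(round(p_value, 4), "reject H0" if p_value < alpha else "fail to reject H0")
```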
5.5. Classification
Example: Suppose a classification model is used to predict which of 100 customers will
respond favorably to a marketing offer. For these 100 customers, we use the model to
predict their responses. These constitute the predicted classes.
As these customers also receive the marketing offer, we also get to know who
responded favorably, and who did not. These responses constitute the actual
classes.
You can compare the predicted with the actual classes, and find out which
predictions were correct.
Confusion matrix - shows the actual and predicted classes of a classification problem
(correct and incorrect matches). The rows represent the occurrences in the actual
class, while the columns represent the occurrences in the predicted class.
n = 100              Predicted: Yes    Predicted: No    Total
Actual: Yes                10                 5            15
Actual: No                 15                70            85
Total                      25                75           100
There are two possible responses to a marketing offer:
- "yes" - the customers accept it
- "no" - they ignore or reject it.
Out of the 100 customers who received the offer, the model predicted that 25
customers would accept it (i.e., 25 times “Yes”) and that 75 customers would reject it
(i.e., 75 times “No”).
After running the campaign, it turned out that 15 customers responded favorably
(“Yes”), while 85 customers ignored it (“No”).
Based on the confusion matrix, one can estimate the quality of a classification model
by calculating its:
- Accuracy
- Recall
- Precision
5.5.1. Accuracy
Accuracy is the proportion of the total number of correct predictions.
Accuracy = Number of correct predictions / Number of total observations
The model correctly predicted 10 “Yes” cases and 70 “No” cases =>
Accuracy = (10 + 70) / 100 = 80%
Accuracy is the most fundamental metric used to assess the performance of
classification models.
It is intuitive and widely used.
It comes with a major flaw, however, which becomes apparent when the classes are
imbalanced (i.e., when one class occurs much more often than the other).
Experienced analysts are familiar with this issue, and have at their disposal
various techniques to handle imbalanced data sets
5.5.2. Recall and precision
Recall (also known as sensitivity) is the ability of a classification model to identify all
relevant instances.
Recall = Correctly predicted positive cases / Actual positive cases
The model correctly predicted 10 out of 15 positive responses =>
Recall = 10 / 15 = 66.67%
This means that only two-thirds of the positives were identified as positive,
which is not a very good score.
Precision is the ability of a classification model to return only relevant instances (to be
correct when predicting “yes”).
Precision = Correctly predicted positive cases / Predicted positive cases
The model predicted 25 positive responses, out of which 10 were correctly predicted =>
Precision = 10 / 25 = 40%
There are two types of incorrect predictions: false positives and false negatives
A false positive - when a case is labeled as positive although it is actually negative
- The model predicts “yes” where the case is actually “no”
- When the objective is to minimize false positives, it is advisable to assess
classification models using precision.
A false negative - when a case is labeled as negative although it is actually positive
- The model predicts “no” where the case is actually “yes”
- When the objective is to minimize false negatives, it is advisable to assess
classification models using recall.
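A minimal sketch that reproduces the accuracy, recall, and precision figures from the confusion matrix above.

```python
# Confusion matrix from the marketing example: rows = actual, columns = predicted.
true_positives = 10   # predicted "Yes", actually "Yes"
false_negatives = 5   # predicted "No",  actually "Yes"
false_positives = 15  # predicted "Yes", actually "No"
true_negatives = 70   # predicted "No",  actually "No"
total = true_positives + false_negatives + false_positives + true_negatives  # 100

accuracy = (true_positives + true_negatives) / total              # 0.80
recall = true_positives / (true_positives + false_negatives)      # 10 / 15 ≈ 0.6667
precision = true_positives / (true_positives + false_positives)   # 10 / 25 = 0.40

print(accuracy, round(recall, 4), precision)
```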
Olivier Maugain
Email: team@365datascience.com