DATA ANALYTICS EXAM LITERATURE
EXIN Data Analytics Foundation
The EXIN Data Analytics Foundation certification is essential preparation for any business and technical-oriented professional aspiring to work with data. Find out more about available study options on our website.
Edition 202306
Copyright © EXIN Holding B.V. 2023. All rights reserved.
EXIN® is a registered trademark.
No part of this publication may be reproduced, stored, utilized or transmitted in any form or by any means, electronic,
mechanical, or otherwise, without the prior written permission from EXIN.
Contents
1. Preface
2. Glossary
3. The data analytics domain
4. Concepts and process
5. Business intelligence (BI)
6. Lagging and leading metrics
7. Risks in data analytics
1. Preface
This booklet is part of the exam literature for candidates preparing for the EXIN Data Analytics
Foundation certification. The content supplements the main literature of the certification. The
booklet contains a glossary of basic terms which are not explained in the main literature, and
it covers some additional subjects which are regarded as essential knowledge in the field.
All specifications and characteristics of the exam can be found in the EXIN preparation guide for the certification, which can be downloaded from www.exin.com.
EXIN would like to give special thanks to Quint Consulting Services Private Ltd, specifically Sharad Nalawade,
for investing their time and knowledge and for the great collaborative work that resulted in this exam
literature.
2. Glossary
This glossary introduces basic concepts that are not explained in the main literature. Please note that it is not intended to explain these terms exhaustively, but rather to provide context and extra information that clarifies and complements the candidate's knowledge.
The table below contains the terms followed by their definitions and the sources from which the definitions
were taken. Definitions without sources were written exclusively for this booklet.
accuracy Data accuracy refers to the degree to which data represents a real-world object,
event or scenario.
analysis, regression Regression analysis is a powerful statistical method that allows one to examine the relationship between two or more variables of interest. While there are many types of regression analysis, at their core they all examine the influence of one or more independent variables on a dependent variable.
Source: alchemer.com
anomaly An anomaly (also known as an outlier) is something that is outside of the norm,
when it stands out or deviates from what is expected. An anomaly is an irregular,
or not easily classified, piece of information. It is essentially a piece of data that,
for one reason or another, does not fit with the rest of the results. It is often an
indicator of something unexpected or problematic happening.
Source: millimetric.ai
artificial intelligence (AI) Artificial intelligence (AI) makes it possible for machines to learn from experience, adjust to new inputs and perform human-like tasks. Most AI examples rely heavily on deep learning and natural language processing (NLP). Computers using these technologies can be trained to accomplish specific tasks by processing large amounts of data and recognizing patterns in datasets.
Source: sas.com
big data Big data refers to extremely large and complex datasets that cannot be easily
managed, processed, or analyzed using traditional data processing tools and
methods. These datasets are characterized by their volume, velocity, and variety,
and are typically generated from a wide range of sources such as social media,
sensors, transactions, and weblogs. The processing and analysis of big data
often require specialized tools and techniques such as distributed computing,
machine learning, and data mining to extract meaningful insights and knowledge
from the data.
data analysis Data analysis is a process of inspecting, cleaning, transforming, and modeling
data with the goal of discovering useful information, informing conclusions, and
supporting decision-making.
Source: Clarke, E. (2022). Everything Data Analytics -A Beginner’s Guide to Data Literacy:
Understanding the processes that turn data into insights. Kenneth Michael Fornari.
data analytics Data analytics is the broad term referring to turning data into insights. It is a
network of processes and techniques focused on the analysis of raw sets of data
so that concrete conclusions can be drawn.
Source. Clarke, E. (2022). Everything Data Analytics -A Beginner’s Guide to Data Literacy:
Understanding the processes that turn data into insights. Kenneth Michael Fornari.
data quality Data quality describes the degree to which data fits its intended purpose. Data is
considered high quality when it accurately and consistently represents real-world
scenarios.
Source: tibco.com
data science Data science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from noisy, structured,
and unstructured data and to apply knowledge and actionable insights from data
across a broad range of application domains.
Source: Clarke, E. (2022). Everything Data Analytics - A Beginner’s Guide to Data Literacy:
Understanding the processes that turn data into insights. Kenneth Michael Fornari.
data security Data security refers to the process of protecting data from unauthorized access
and data corruption throughout its lifecycle. Data security includes encryption,
hashing, tokenization, and key management practices that protect data across all
applications and platforms.
data validation Data validation is the process of verifying and validating data that is collected
before it is used. Any type of data handling task, whether it is gathering data,
analyzing it, or structuring it for presentation, must include data validation to
ensure accurate results. […] Common data validation rules that check for data
integrity and clarity are:
• Data type: if a value of any data type other than the expected one (for example, text) is entered, it should be rejected by the system.
• Code check: checks whether the value is from a list of accepted values, like
the zip code of a particular location.
• Range: verifies if the data entered falls into the range specified, for example,
between 20 and 30 characters.
• Consistent expressions: it is important that the data entered makes logical
sense. For instance, the date of leaving must be later than the date of
joining.
• Format: several data types have a defined format, like a date.
• Uniqueness: data fields need to contain unique values if defined. For
example, no two users can use the same phone number.
• No null values: certain input fields cannot be empty.
• Standards for formatting: the structure of the data must be validated to
ensure that the data model being used is compatible with the applications
that are being used to work with the data.
Source: tibco.com
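A few of these rules can be expressed as simple checks in Python, as in the sketch below (the field names and rules are hypothetical examples, not a standard).

import re

def validate_record(record, existing_phone_numbers):
    """Apply a few of the validation rules above to a hypothetical user record."""
    errors = []
    if not isinstance(record.get("name"), str):  # data type check
        errors.append("name must be text")
    if not 20 <= len(record.get("comment", "")) <= 30:  # range check
        errors.append("comment must be 20 to 30 characters")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", record.get("joined", "")):  # format check
        errors.append("joined must use the YYYY-MM-DD format")
    if record.get("phone") in existing_phone_numbers:  # uniqueness check
        errors.append("phone number already in use")
    if record.get("email") in (None, ""):  # no null values
        errors.append("email cannot be empty")
    return errors

print(validate_record(
    {"name": "Ada", "comment": "x" * 25, "joined": "2023-08-01",
     "phone": "555-0100", "email": "ada@example.com"},
    existing_phone_numbers={"555-0199"},
))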
decision tree A decision tree is a map of the possible outcomes of a series of related choices.
It allows an individual or organization to weigh possible actions against one
another based on their costs, probabilities, and benefits. They can be used either
to drive informal discussion or to map out an algorithm that predicts the best
choice mathematically.
Source: lucidchart.com
distribution A distribution in statistics is a function that shows the possible values for a
variable and how often they occur.
Source: 365datascience.com
duplicate Duplicate data is any record that inadvertently shares data with another record
in a database. Duplicate data mostly occurs when transferring data between
systems. The most common occurrence of duplicate data is a complete carbon
copy of a record.
Source: hevodata.com
frequency Frequency, in general, means the number of times a certain event has taken place. It can simply be defined as the count of a certain event that has occurred.
Source: storyofmathematics.com
hyperparameter Hyperparameters are parameters whose values control the learning process
and determine the values of model parameters that a learning algorithm ends
up learning. The prefix ‘hyper’ suggests that they are ‘top-level’ parameters that
control the learning process and the model parameters that result from it.
Source: towardsdatascience.com
median The median is the value in the middle of a dataset, meaning that 50% of the data points have a value smaller than or equal to the median, and 50% of the data points have a value higher than or equal to the median. For a small dataset, first count the number of data points (n) and arrange the data points in increasing order; the median is then the middle value (or, if n is even, the average of the two middle values).
Source: 150.statcan.gc.ca
mean In mathematics and statistics, the mean refers to the average of a set of values.
The mean can be computed in a number of ways, including the simple arithmetic
mean (add up the numbers and divide the total by the number of observations),
the geometric mean, and the harmonic mean.
Source: investopedia.com
percentile A percentile (or a centile) is a measure used in statistics indicating the value
below which a given percentage of observations falls in a group of observations.
For example, the 20th percentile is the value (or score) below which 20% of the
observations may be found.
Source: pallipedia.org
probability Probability is simply how likely something is to happen. Whenever the outcome of
an event is unsure, the probabilities of certain outcomes can be discussed — how
likely the outcomes are. The analysis of events governed by probability is called
statistics.
Source: khanacademy.org
quartile Quartiles are three values that split sorted data into four parts, each with an equal
number of observations. Quartiles are a type of quantile.
• First quartile: also known as Q1, or the lower quartile.
• Second quartile: also known as Q2, or the median.
• Third quartile: also known as Q3, or the upper quartile.
Source: scribbr.com
R R is a programming language that provides a wide variety of statistical (linear and
nonlinear modeling, classical statistical tests, time-series analysis, classification,
clustering, etc.) and graphical techniques and is highly extensible. The S language
is often the vehicle of choice for research in statistical methodology, and R
provides an open-source route to participation in that activity.
Source: r-project.org
regression, exponential Exponential regression is a model that explains processes that experience growth at a double rate. It is used for situations where the growth begins slowly but then rapidly speeds up without bound, or where the decay starts rapidly but then slows down to approach zero.
Source: voxco.com
streaming data Streaming data is data that is generated continuously by thousands of data
sources, which typically send in the data records simultaneously and in small
sizes (order of Kilobytes). Streaming data includes a wide variety of data, such as
log files generated by customers using mobile or web applications, e-commerce
purchases, in-game player activities, information from social networks, financial
trading floors, or geospatial services, and telemetry from connected devices or
instrumentation in data centers.
Source: aws.amazon.com
structured query language (SQL) Structured query language (SQL) is a standardized programming language used to manage relational databases and perform various operations on the data in them.
Source: techtarget.com
variable, qualitative A categorical variable (also called a qualitative variable) refers to a characteristic that cannot be quantified. Categorical variables can be either nominal or ordinal.
Source: 150.statcan.gc.ca
3. The data analytics domain
Data engineering, data analytics and data science
Every data analytics practitioner needs to understand the Data Science Hierarchy of Needs model shown
below. The model helps them prioritize their activities and deliverables.
The model depicts the activities corresponding to the levels of data science (source: Hackernoon.com). The phases are briefly explained below. The EXIN Data Analytics Foundation certification covers the move/store and explore/transform phases and part of the aggregate/label phase.
Collect phase
As can be seen from the model above, data collection is the initial phase. Collecting data is the crucial first
step of any data science project. Data is available in various formats and from a variety of sources, such
as user-generated data or data from external sources, like sensors. Core activities are data logging, data
instrumentation, and data collection.
Move/store phase
Once the data is collected, the next step is to move and secure it. This involves structuring and migrating the
data to a suitable platform. Typical activities include data extraction, data transformation and data loading,
collectively termed ETL (extract, transform, load).
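To make the ETL idea concrete, the following minimal Python sketch (using the pandas library; the file name, column names, and table name are hypothetical) extracts raw records from a CSV file, transforms them, and loads the result into a database table.

import sqlite3
import pandas as pd

# Extract: read raw records from a source file (hypothetical file name).
raw = pd.read_csv("sales_raw.csv")

# Transform: standardize column names and derive a total amount per record.
raw.columns = [column.strip().lower() for column in raw.columns]
raw["total_amount"] = raw["quantity"] * raw["unit_price"]

# Load: write the transformed records into a database table.
with sqlite3.connect("analytics.db") as connection:
    raw.to_sql("sales", connection, if_exists="replace", index=False)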
Explore/transform phase
During this phase, data is further refined and prepared for the next analytics phase. Activities during this
phase include data cleaning, anomaly detection and preparing the data for data analytics projects.
Aggregate/label phase
For data analytics projects, data can be collected from various sources, like ERP, CRM, and other enterprise
applications, in addition to data from external sources. The data received from these sources is aggregated
or consolidated to make it ready for the analysis phase.
Aggregation is a type of data mining where data is searched, gathered, and presented in a report-based,
summarized format to achieve specific business objectives or processes and/or conduct human analysis.
The aggregated data is easier to interpret using statistical functions. For example, by querying the aggregated data, one can compute statistics such as the minimum salary of an employee or the average age of a customer. For business purposes, data can be aggregated into monthly or quarterly sales summaries to help stakeholders make informed business decisions.
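As a small illustration of aggregation, the following Python sketch (using pandas; the employee records are made up) groups raw records by department and computes the minimum salary and the average age per group.

import pandas as pd

# Made-up raw records, as they might arrive from an enterprise application.
employees = pd.DataFrame({
    "department": ["Sales", "Sales", "HR", "HR"],
    "age": [31, 45, 29, 52],
    "salary": [42000, 58000, 39000, 61000],
})

# Aggregate the raw records into a summarized, per-department view.
summary = employees.groupby("department").agg(
    min_salary=("salary", "min"),
    average_age=("age", "mean"),
)
print(summary)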
Labeling, in data analytics, is a process where the correct answers (labels) are assigned to the dependent variable. This is called supervised learning and helps in training machine learning algorithms. In other cases, data is simply provided to an algorithm without labels, and the algorithm comes up with its own inferences. This is called unsupervised learning.
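The difference between the two approaches can be sketched with scikit-learn (assuming scikit-learn is available; the data points are made up).

from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Six observations with two features each (independent variables).
X = [[1, 0], [2, 1], [3, 1], [8, 9], [9, 8], [10, 9]]

# Supervised learning: the "right answers" (labels) are provided for training.
y = [0, 0, 0, 1, 1, 1]
classifier = LogisticRegression().fit(X, y)
print(classifier.predict([[2, 2], [9, 9]]))

# Unsupervised learning: no labels are given; the algorithm infers two groups on its own.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)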
Learn/optimize phase
In this phase, machine learning models or algorithms are trained using the datasets. The machine learning
models help in predictive analytics based on historical data. Several experiments, such as A/B testing, could
be set up to predict customer behavior or sales trends using these trained models. These machine learning models can be further optimized by re-training them on larger datasets and on well-selected features created through a process called feature engineering. The accuracy of a machine learning model's predictions depends on the quality and size of the datasets.
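As a minimal sketch of this phase (using scikit-learn and one of its bundled example datasets, not data from the model above), a model can be trained on part of the historical data and its predictive accuracy measured on the part that was held back.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Historical, labeled data (here a bundled example dataset).
X, y = load_iris(return_X_y=True)

# Hold back part of the data to estimate how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model, then measure the accuracy of its predictions on unseen data.
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))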
4. Concepts and process
The previous chapter describes the model that explains the phases of a typical data analytics project. This
chapter provides additional important concepts with examples.
The goal of a data analytics project is to provide one of the following four outcomes:
• Descriptive analytics: what happened?
• Diagnostic analytics: why did it happen?
• Predictive analytics: what is likely to happen?
• Prescriptive analytics: what should be done about it?
Depending upon the goal of the data analytics project, one can choose a suitable machine learning algorithm
to achieve the desired outcome. The quality of the data plays a crucial role in the outcome of the project.
The data that is collected and aggregated must be suitable for the purpose of data analysis in terms of
its quality, source, and size. During the data collection phase, the focus is on ensuring that the data is clean: it must not contain duplicates, missing values, or outliers, as these elements would adversely affect the outcomes.
Further, datasets may have bias or variance associated with them. Bias in the dataset arises due to data
collection prejudices or preferences. For example, data bias occurs if data from one source or area is
excessively collected and data from other areas is ignored. On the other hand, variance can occur due
to the data being spread across multiple areas, making a dataset unsuitable for training machine learning
algorithms. Variance can also occur due to outliers in the dataset.
Some datasets may have variables with different ranges or scales. For example, the age variable could be
a 2-digit value, whereas salary could be a 6-digit value. When using such a dataset, the machine learning
algorithm may develop a bias towards a higher-value variable (salary) compared to a low-value variable
(age). To resolve such issues, normalization and standardization techniques can be used.
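The following sketch illustrates both techniques with scikit-learn (the age and salary values are made up): normalization rescales each variable to the range 0 to 1, while standardization rescales each variable to a mean of 0 and a standard deviation of 1.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data with very different scales: 2-digit ages and 5- to 6-digit salaries.
data = pd.DataFrame({"age": [23, 35, 47, 59], "salary": [30000, 55000, 82000, 120000]})

# Normalization: every variable ends up in the range 0 to 1.
print(MinMaxScaler().fit_transform(data))

# Standardization: every variable ends up with mean 0 and standard deviation 1.
print(StandardScaler().fit_transform(data))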
The data cleaning process involves removing duplicates and outliers and substituting missing values with
average values to make datasets free from bias and variance. After data collection, aggregation, and cleaning,
it is possible to start with the data analysis. Analysis is the process of interpreting data and extracting
insights to help decision-making. This process is also called data mining. During the analysis process, a
particular machine learning algorithm is selected and trained on the dataset, so the model learns how the variables in the dataset are interrelated. Machine learning algorithms are mathematical models designed to be trained on datasets. Several machine learning models or algorithms are available according to the
analysis type needed, as explained earlier.
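A minimal pandas sketch of the cleaning steps mentioned above (the values are made up) removes a duplicate record, filters out an obvious outlier, and substitutes a missing value with the mean.

import pandas as pd

# Made-up records: a duplicate, a missing value, and an obvious outlier (999).
ages = pd.DataFrame({"age": [25, 25, 31, None, 28, 999]})

cleaned = ages.drop_duplicates()  # remove the duplicate record
cleaned = cleaned[(cleaned["age"] < 120) | (cleaned["age"].isna())].copy()  # drop the outlier, keep the missing value
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())  # substitute the missing value with the mean
print(cleaned)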
Before any of the machine learning algorithms are used, datasets are assessed using statistical functions.
Statistical techniques are used to understand the dataset in terms of its overall distribution of values, trends,
minimum and maximum values of each feature, etc. This will help analysts understand various aspects of
the dataset and get a better idea about the nature of the datasets they are dealing with. There are groups of
statistical functions such as minimum, maximum, mean, median, mode, variance, and standard deviation
that help the analysts to learn about the overall composition and distribution of the dataset.
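In Python, such an initial assessment can be done with a few standard-library functions (a sketch with made-up salary values).

import statistics

salaries = [39000, 42000, 42000, 58000, 61000]

print("minimum:", min(salaries))
print("maximum:", max(salaries))
print("mean:", statistics.mean(salaries))
print("median:", statistics.median(salaries))
print("mode:", statistics.mode(salaries))
print("variance:", statistics.variance(salaries))
print("standard deviation:", statistics.stdev(salaries))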
Statistical analysis can be broadly classified as descriptive analysis and inferential analysis. The two types
are described below.
Descriptive statistics
Descriptive statistics refers to the analysis and interpretation of the data in terms of its main features, such as
central tendency, variability, and distribution. It summarizes the data by the measures of central tendency, like
mean, median, and mode, and measures of variability, such as variance and standard deviation. Descriptive
statistics also uses visualization techniques to graphically display the trends (histograms, scatter plots, box
plots, among others).
Variance, one of these measures of variability, indicates how far each number in a dataset is from the mean of its values. It essentially measures the volatility associated with a dataset.
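The visualization techniques mentioned above can be produced, for example, with matplotlib (a sketch assuming matplotlib is installed; the age values are made up).

import matplotlib.pyplot as plt

ages = [23, 25, 25, 28, 31, 31, 34, 35, 47, 59]

figure, (left, right) = plt.subplots(1, 2)
left.hist(ages, bins=5)  # histogram: the distribution of the values
left.set_title("Histogram of age")
right.boxplot(ages)      # box plot: median, quartiles, and potential outliers
right.set_title("Box plot of age")
plt.show()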
Inferential statistics
Inferential statistics is a type of analysis that uses a small sample of data and makes inferences or decisions
about a population. Sample data is a small set of representative data derived from a larger, voluminous
dataset called a population. Sample data is used instead of population data to save time, effort, and cost of
analysis.
Examples of inferential statistical methods include hypothesis testing and regression analysis. These methods help test hypotheses about a population based on sample data and a few assumptions.
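For instance, a two-sample t-test can be run with SciPy (a sketch with made-up weekly sales samples; SciPy is assumed to be available).

from scipy import stats

# Made-up weekly sales samples from two store regions.
region_a = [102, 98, 110, 105, 99, 101]
region_b = [120, 115, 123, 118, 121, 119]

# Null hypothesis: both regions have the same average weekly sales.
t_statistic, p_value = stats.ttest_ind(region_a, region_b)
print("t =", t_statistic, "p =", p_value)
# A small p-value (for example, below 0.05) suggests rejecting the null hypothesis.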
5. Business intelligence (BI)
What is business intelligence (BI)?
Business intelligence (BI) is a process of using data analytics to gain insights into large volumes of business
data to make informed decisions. BI typically uses a data warehouse that integrates data from various
enterprise sources, such as backend systems like ERPs and CRMs. Some examples of business data used
in BI projects are financial data, sales data, customer data, market data, and payment data. Popular BI tools
like Microsoft Power BI and Tableau provide rich management reports and interactive dashboards to be
used by business decision-makers. BI helps stakeholders make both strategic and operational decisions.
BI is typically built from five core components: data analytics, data warehousing, online analytical processing (OLAP), reporting, and interactive dashboards.
1. Data analytics.
This is the core BI engine that does the heavy lifting of the analytics work using advanced AI and machine learning algorithms.
2. Data warehousing.
This is the storehouse of historical data supporting data analytics.
3. Online analytical processing (OLAP).
OLAP tools allow analysts to slice and dice the data across multiple dimensions, such as region, product, and time, to uncover patterns and trends.
4. Reporting.
Reporting is the outcome of the analytics process that is used by decision makers to understand trends and insights.
5. Interactive dashboards.
These provide decision makers with visual, interactive user interfaces to learn insights and undertake what-if analysis.
Although the five BI components described above are core to every BI project, some organizations may choose to have an additional component called Online Transaction Processing (OLTP) for real-time transaction processing environments. Large e-commerce platforms like Amazon and eBay use OLTP capabilities.
The next section will discuss how BI can be applied to improve the decision-making process.
How does business intelligence (BI) lead to business decisions?
BI plays a crucial role in helping stakeholders with strategic and operational decisions. Key strategic
choices, like introducing new products in the marketplace, providing product discounts, and investing in advertisements and customer satisfaction initiatives, are made based on the results of data analysis.
Let’s consider an example to further understand how BI is used to enable the decision-making process.
Business intelligence (BI) provides a series of steps from data collection to analysis. During the
data collection stage of the BI project, XYZ Megastore collects historical data about sales from
across the regions and cities. The data is collected from backend systems, such as the CRM,
ERP, sales systems, customer feedback, web channels, social media, etc. The data is scrubbed
and loaded into the staging area, where a well-architected data warehouse is created.
During the analysis phase, advanced OLAP tools are used to slice and dice the data and look for
hidden patterns and trends. The data is analyzed using multi-dimensional parameters, such as
region, store, city and product ranges, price ranges, etc. Powerful algorithms unearth unexpected
findings and produce reports for further analysis.
The management team uses these reports to get insights from the data. They use interactive
dashboards to carry out what-if analyses. These reveal further surprises about the poor performance of some stores compared to others. They learn that some stores in certain regions have poor sales during specific periods, possibly due to the high pricing of the products and the lack of a promotional campaign. They also learn that customer confidence and loyalty are decreasing in specific regions and during specific timeframes.
Armed with these insights, the management team decides to act by starting an aggressive
advertising campaign and restructuring its pricing policy for some regions where sales were
affected. They hope the company will recover its lost market share with these actions.
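The slicing and dicing described in the scenario can be mimicked on a very small scale with a pandas pivot table (a sketch with hypothetical columns and values, not the XYZ Megastore data).

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "revenue": [1200, 800, 950, 400],
})

# Dice the data by region and product and summarize the revenue per combination.
pivot = sales.pivot_table(values="revenue", index="region", columns="product", aggfunc="sum")
print(pivot)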
6. Lagging and leading metrics
As discussed, the purpose of a data analytics project is to help business stakeholders to make wise and
profitable business decisions. The best way to assess the success of a data analytics project is to measure
business outcomes after the enterprise has started using the data analytics solution(s). Metrics can be used
to track past performance as well as to formulate strategies for the future performance of the business.
Lagging metrics
These metrics are used to measure past performance of the business. These metrics take time
to show impact.
Example: if the goal is to improve customer retention, customer churn rate metrics can be used,
but it takes a long time for these metrics to show results.
Leading metrics
These metrics are used to predict future performance. They can be used as early indicators of
how the business is going to perform in the future. The benefit of these metrics is that they can
be used to track performance and change strategy to meet future goals.
Lagging and leading metrics collectively help leadership teams to assess the current situation of the
company based on its past performance and use historical data to formulate a plan of action.
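As a simple illustration (with made-up monthly figures), a lagging metric such as the customer churn rate and a leading metric such as new trial sign-ups could be tracked side by side.

# Made-up monthly figures for one product line.
customers_at_start = 2000
customers_lost = 120
new_trial_signups = 350  # early indicator of possible future sales

# Lagging metric: the churn rate only reflects what has already happened.
churn_rate = customers_lost / customers_at_start
print(f"Churn rate (lagging): {churn_rate:.1%}")

# Leading metric: trial sign-ups help anticipate future performance.
print(f"New trial sign-ups (leading): {new_trial_signups}")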
7. Risks in data analytics
While data analytics projects bring immense benefits, they are not without risks. In some cases, errors can result in financial damage or could even endanger lives. For example, a wrong diagnosis or treatment in the healthcare industry can spell disaster for both the patient and the hospital. Physicians may rely on incorrect insights derived from data analytics projects when treating a patient or prescribing a drug, which may impact the patient's life.
Errors in conclusions and decisions occur primarily due to poor data quality and incorrect assumptions made about the data. Many mistakes can be made during the data entry, data extraction, and data preparation steps. Errors during data entry are a major source of risk: data collected with errors can lead to misleading statistical results. The predictive ability of any machine learning algorithm is only as good as the quality of the data. Data treatment, such as data cleaning, data validation, and the removal of duplicates and outliers, can drastically improve data quality. As discussed earlier, bias, variance, missing values, outliers, and duplicates may result in erroneous predictions. The risks from these erroneous predictions can be mitigated by ensuring data is collected and cleaned at the source.
Let’s have a look at a few risks associated with data analytics projects.
Risks due to over-trusting results
A special group of risks comes from placing too much trust in the results of data analytics, for example the answers of chatbots and language models in generative artificial intelligence (AI). These answers often look complete and authentic. When chatbots are treated like a person, it is easy to forget that chatbots have no moral and social considerations and cannot make ethical decisions.
For example, the following conversation took place when testing ChatGPT (a chatbot) for health
care purposes:
(Source: https://www.wired.com/story/large-language-models-artificial-intelligence/)
Graphs and models can also look so convincing that people forget to be critical.
Thus, before starting a data analytics project, it is important to address the risks described above and take the necessary mitigation actions. Risks associated with poor data quality and a lack of data integrity can result in wrong predictive analytics outcomes, and the damage incurred can be irreversible.
Ready to be certified for what's next? Visit us at:
www.exin.com
The EXIN Data Analytics Foundation certification is essential preparation for any business and technical-oriented professional aspiring to work with data. Find out more about available study options on our website.