
EXIN

DATA
ANALYTICS
EXAM LITERATURE

EXIN
Data Analytics

FOUNDATION

The EXIN Data Analytics Foundation certification is essential preparation for any
business and technical-oriented professional aspiring to work with data. Find out
more about available study options on our website.

Edition 202306
Copyright © EXIN Holding B.V. 2023. All rights reserved.
EXIN® is a registered trademark.

No part of this publication may be reproduced, stored, utilized or transmitted in any form or by any means, electronic,
mechanical, or otherwise, without the prior written permission from EXIN.

Contents
1. Preface
2. Glossary
3. The data analytics domain
4. Concepts and process
5. Business intelligence (BI)
6. Lagging and leading metrics
7. Risks in data analytics

1. Preface
This booklet is part of the exam literature for candidates preparing for the EXIN Data Analytics
Foundation certification. The content supplements the main literature of the certification. The
booklet contains a glossary of basic terms which are not explained in the main literature, and
it covers some additional subjects which are regarded as essential knowledge in the field.

All specifications and characteristics of the exam can be found in the EXIN preparation guide of the
certification which can be downloaded from www.exin.com.

EXIN would like to give special thanks to Quint Consulting Services Private Ltd, specifically Sharad Nalawade,
for investing their time and knowledge and for the great collaborative work that resulted in this exam
literature.

EXIN, June 2023

2. Glossary
This glossary introduces basic concepts that are not explained in the main literature. Please note that it
does not intend to explain these terms exhaustively, but rather to present context and extra information
to clarify and complement the candidate's knowledge.

The table below contains the terms followed by their definitions and the sources from which the definitions
were taken. Definitions without sources were written exclusively for this booklet.

accuracy Data accuracy refers to the degree to which data represents a real-world object,
event or scenario.

algorithm   An algorithm is a set of instructions that describes a sequence of steps to
be followed in order to solve a problem or perform a task. It is a precise and
unambiguous way of describing a computation and is typically expressed in a
formal language, such as pseudocode or a programming language.

A machine learning algorithm is a mathematical model used to analyze and
identify patterns in data, and then make predictions or decisions based on those
patterns, without being explicitly programmed.

analysis, quantitative   Quantitative analysis is the process of collecting and evaluating measurable
and verifiable data such as revenues, market share, and wages to understand
the behavior and performance of a business. In the past, business owners and
company directors relied heavily on their experience and instinct when making
decisions. However, with data technology, quantitative analysis is now considered
a better approach to making informed decisions.
Source: corporatefinanceinstitute.com

analysis, regression   Regression analysis is a powerful statistical method that allows one to examine
the relationship between two or more variables of interest. While there are many
types of regression analysis, at their core they all examine the influence of one or
more independent variables on a dependent variable.
Source: alchemer.com

analysis, predictive   Predictive analytics is a branch of advanced analytics that predicts future
outcomes using historical data combined with statistical modeling, data mining
techniques and machine learning. Companies employ predictive analytics to find
patterns in this data to identify risks and opportunities.
Source: ibm.com

anomaly An anomaly (also known as an outlier) is something that is outside of the norm,
when it stands out or deviates from what is expected. An anomaly is an irregular,
or not easily classified, piece of information. It is essentially a piece of data that,
for one reason or another, does not fit with the rest of the results. It is often an
indicator of something unexpected or problematic happening.
Source: millimetric.ai

application programming interface (API)   An application programming interface (API) is a set of protocols, routines, and
tools for building software applications. It defines how software components
should interact, allowing different systems and applications to communicate with
each other.

artificial intelligence (AI)   Artificial intelligence (AI) makes it possible for machines to learn from experience,
adjust to new inputs and perform human-like tasks. Most AI examples rely heavily
on deep learning and natural language processing (NLP). Computers using these
technologies can be trained to accomplish specific tasks by processing large
amounts of data and recognizing patterns in datasets.
Source: sas.com

big data Big data refers to extremely large and complex datasets that cannot be easily
managed, processed, or analyzed using traditional data processing tools and
methods. These datasets are characterized by their volume, velocity, and variety,
and are typically generated from a wide range of sources such as social media,
sensors, transactions, and weblogs. The processing and analysis of big data
often require specialized tools and techniques such as distributed computing,
machine learning, and data mining to extract meaningful insights and knowledge
from the data.

data analysis Data analysis is a process of inspecting, cleaning, transforming, and modeling
data with the goal of discovering useful information, informing conclusions, and
supporting decision-making.
Source: Clarke, E. (2022). Everything Data Analytics - A Beginner’s Guide to Data Literacy:
Understanding the processes that turn data into insights. Kenneth Michael Fornari.

data analytics Data analytics is the broad term referring to turning data into insights. It is a
network of processes and techniques focused on the analysis of raw sets of data
so that concrete conclusions can be drawn.
Source: Clarke, E. (2022). Everything Data Analytics - A Beginner’s Guide to Data Literacy:
Understanding the processes that turn data into insights. Kenneth Michael Fornari.

data architecture   A data architecture describes how data is managed: from collection to
transformation, distribution, and consumption. It sets the blueprint for data
and the way it flows through data storage systems. It is foundational to data
processing operations and artificial intelligence (AI) applications.
Source: ibm.com

data management   Data management is the practice of collecting, organizing, protecting, and
storing an organization’s data so it can be analyzed for business decisions. As
organizations create and consume data at unprecedented rates, data management
solutions become essential for making sense of vast quantities of data.
Source: tableau.com

data quality Data quality describes the degree to which data fits its intended purpose. Data is
considered high quality when it accurately and consistently represents real-world
scenarios.
Source: tibco.com

data science Data science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from noisy, structured,
and unstructured data and to apply knowledge and actionable insights from data
across a broad range of application domains.
Source: Clarke, E. (2022). Everything Data Analytics - A Beginner’s Guide to Data Literacy:
Understanding the processes that turn data into insights. Kenneth Michael Fornari.

data security Data security refers to the process of protecting data from unauthorized access
and data corruption throughout its lifecycle. Data security includes encryption,
hashing, tokenization, and key management practices that protect data across all
applications and platforms.

In the context of protecting personal data (General Data Protection Regulation,
GDPR), data security also means protecting data from infringement of the data
owner’s rights, data abuse and data leakage during data processing.
Source: microfocus.com

data validation Data validation is the process of verifying and validating data that is collected
before it is used. Any type of data handling task, whether it is gathering data,
analyzing it, or structuring it for presentation, must include data validation to
ensure accurate results. […] Common data validation rules that check for data
integrity and clarity are:
• Data type: if any data type other than the expected one, for example text, is
entered, it should be rejected by the system.
• Code check: checks whether the value is from a list of accepted values, like
the zip code of a particular location.
• Range: verifies if the data entered falls into the range specified, for example,
between 20 and 30 characters.
• Consistent expressions: it is important that the data entered makes logical
sense. For instance, the date of leaving must be later than the date of
joining.
• Format: several data types have a defined format, like a date.
• Uniqueness: data fields need to contain unique values if defined. For
example, no two users can use the same phone number.
• No null values: certain input fields cannot be empty.
• Standards for formatting: the structure of the data must be validated to
ensure that the data model being used is compatible with the applications
that are being used to work with the data.
Source: tibco.com
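
A few of these rules can be illustrated with a minimal Python sketch; the field names, accepted values, and thresholds below are hypothetical and only show the idea:

    from datetime import date

    ACCEPTED_ZIP_CODES = {"1011", "1012", "1013"}   # hypothetical list of accepted values

    def validate_record(record: dict) -> list[str]:
        """Return a list of validation errors for a single (hypothetical) record."""
        errors = []

        # Data type / format check: the date of joining must be an actual date.
        if not isinstance(record.get("date_of_joining"), date):
            errors.append("date_of_joining must be a date")

        # Code check: the value must come from a list of accepted values.
        if record.get("zip_code") not in ACCEPTED_ZIP_CODES:
            errors.append("zip_code is not an accepted value")

        # Range check: the name must be between 2 and 30 characters.
        if not 2 <= len(record.get("name", "")) <= 30:
            errors.append("name must be between 2 and 30 characters")

        # Consistency check: the date of leaving must be later than the date of joining.
        joining, leaving = record.get("date_of_joining"), record.get("date_of_leaving")
        if joining and leaving and leaving <= joining:
            errors.append("date_of_leaving must be later than date_of_joining")

        # No null values: the phone number may not be empty.
        if not record.get("phone"):
            errors.append("phone may not be empty")

        return errors

    record = {
        "name": "A",                           # too short: triggers the range check
        "zip_code": "9999",                    # not in the accepted list
        "phone": "",                           # empty: triggers the null check
        "date_of_joining": date(2023, 1, 1),
        "date_of_leaving": date(2022, 12, 1),  # earlier than joining: inconsistent
    }
    print(validate_record(record))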

database   A database is an organized collection of structured information or data, typically
stored electronically in a computer system.
Source: oracle.com

decision tree A decision tree is a map of the possible outcomes of a series of related choices.
It allows an individual or organization to weigh possible actions against one
another based on their costs, probabilities, and benefits. They can be used either
to drive informal discussion or to map out an algorithm that predicts the best
choice mathematically.
Source: lucidchart.com

distribution A distribution in statistics is a function that shows the possible values for a
variable and how often they occur.
Source: 365datascience.com

duplicate Duplicate data is any record that inadvertently shares data with another record
in a database. Duplicate data mostly occurs when transferring data between
systems. The most common occurrence of duplicate data is a complete carbon
copy of a record.
Source: hevodata.com

frequency   Frequency, in general, means the number of times a certain event has taken
place. It can simply be defined as the count of a certain event that has occurred.
Source: storyofmathematics.com

hyperparameter   Hyperparameters are parameters whose values control the learning process
and determine the values of model parameters that a learning algorithm ends
up learning. The prefix ‘hyper’ suggests that they are ‘top-level’ parameters that
control the learning process and the model parameters that result from it.
Source: towardsdatascience.com

median The median is the value in the middle of a dataset, meaning that 50% of the data
points have a value smaller or equal to the median, and 50% of the data points
have a value higher or equal to the median. For a small dataset, first count the
number of data points (n) and arrange the data points in increasing order.
Source: 150.statcan.gc.ca

mean In mathematics and statistics, the mean refers to the average of a set of values.
The mean can be computed in a number of ways, including the simple arithmetic
mean (add up the numbers and divide the total by the number of observations),
the geometric mean, and the harmonic mean.
Source: investopedia.com

outlier See anomaly.

percentile A percentile (or a centile) is a measure used in statistics indicating the value
below which a given percentage of observations falls in a group of observations.
For example, the 20th percentile is the value (or score) below which 20% of the
observations may be found.
Source: pallipedia.org

probability Probability simply is how likely something is to happen. Whenever the outcome of
an event is unsure, the probabilities of certain outcomes can be discussed — how
likely the outcomes are. The analysis of events governed by probability is called
statistics.
Source: khanacademy.org

programming language   A programming language is a computer language programmers use to develop
software programs, scripts, or other sets of instructions for computers to execute.
Source: computerhope.com

quartile Quartiles are three values that split sorted data into four parts, each with an equal
number of observations. Quartiles are a type of quantile.
• First quartile: also known as Q1, or the lower quartile.
• Second quartile: also known as Q2, or the median.
• Third quartile: also known as Q3, or the upper quartile.
Source: scribbr.com

query   A query is a question or a request for information expressed formally. In computer
science, a query is essentially the same thing; the only difference is that the
answer or retrieved information comes from a database.
Source: techtarget.com

R R is a programming language that provides a wide variety of statistical (linear and
nonlinear modeling, classical statistical tests, time-series analysis, classification,
clustering, etc.) and graphical techniques and is highly extensible. The S language
is often the vehicle of choice for research in statistical methodology, and R
provides an open-source route to participation in that activity.
Source: r-project.org

regression, Exponential regression is a model that explains processes that experience growth
exponential at a double rate. It is used for situations where the growth begins slowly but
rapidly speeds up without bounds or where the decay starts rapidly but slows
down to get to zero.
Source: voxco.com

streaming data Streaming data is data that is generated continuously by thousands of data
sources, which typically send in the data records simultaneously and in small
sizes (order of Kilobytes). Streaming data includes a wide variety of data, such as
log files generated by customers using mobile or web applications, e-commerce
purchases, in-game player activities, information from social networks, financial
trading floors, or geospatial services, and telemetry from connected devices or
instrumentation in data centers.
Source: aws.amazon.com

structured query Structured query language (SQL) is a standardized programming language used
language (SQL) to manage relational databases and perform various operations on the data in
them.
Source: techtarget.com
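
As an illustration, the sketch below uses Python's built-in sqlite3 module to create a small in-memory database and run a standard SQL query against it; the table and its contents are invented for the example:

    import sqlite3

    # Create a small in-memory database with a hypothetical 'employees' table.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
    cur.executemany(
        "INSERT INTO employees VALUES (?, ?, ?)",
        [("Alice", "Sales", 52000), ("Bob", "Sales", 48000), ("Carol", "IT", 61000)],
    )

    # A standard SQL query: average salary per department.
    cur.execute("SELECT department, AVG(salary) FROM employees GROUP BY department")
    print(cur.fetchall())   # e.g. [('IT', 61000.0), ('Sales', 50000.0)]
    conn.close()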

third-party library   Third-party libraries refer to software, content, features, functionality, and
components, including related documentation, that are owned by third parties,
that are used in connection with the application […] and that are offered through
an organization but provided by third parties.
Source: support.cchifirm.ca

variable, qualitative   A categorical variable (also called a qualitative variable) refers to a characteristic
that can’t be quantified. Categorical variables can be either nominal or ordinal.
Source: 150.statcan.gc.ca

3. The data analytics domain
Data engineering, data analytics and data science

Every data analytics practitioner needs to understand the Data Science Hierarchy of Needs model shown
below. The model helps them prioritize their activities and deliverables.

The model below depicts the activities corresponding to the levels of data science. The phases are briefly
explained below. The EXIN Data Analytics Foundation certification covers the move/store, the explore/
transform, and part of the aggregate/label phases.

[Figure: the Data Science Hierarchy of Needs pyramid, from collect at the base up to AI/deep learning at the top. Source: Hackernoon.com]

Collect phase
As can be seen from the model above, data collection is the initial phase. Collecting data is the crucial first
step of any data science project. Data is available in various formats and from a variety of sources, such
as user-generated data or data from external sources, like sensors. Core activities are data logging, data
instrumentation, and data collection.

Move/store phase
Once the data is collected, the next step is to move and secure it. This involves structuring and migrating the
data to a suitable platform. Typical activities include data extraction, data transformation and data loading,
collectively termed ETL (extract, transform, load).

Explore/transform phase
During this phase, data is further refined and prepared for the next analytics phase. Activities during this
phase include data cleaning, anomaly detection and preparing the data for data analytics projects.

Aggregate/label phase
For data analytics projects, data can be collected from various sources, like ERP, CRM, and other enterprise
applications, in addition to data from external sources. The data received from these sources is aggregated
or consolidated to make it ready for the analysis phase.

Aggregation is a type of data mining where data is searched, gathered, and presented in a report-based,
summarized format to achieve specific business objectives or processes and/or conduct human analysis.
Aggregated data is easier to interpret using statistical functions. For example, by querying the aggregated
data, one can compute values such as the minimum salary of an employee or the average age of a
customer. For business purposes, data can be aggregated into monthly or quarterly sales summaries to
help stakeholders make informed business decisions.
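
A minimal sketch of such an aggregation, assuming pandas is available and using invented transaction data:

    import pandas as pd

    # Hypothetical transaction-level sales data.
    sales = pd.DataFrame({
        "month":  ["2023-01", "2023-01", "2023-02", "2023-02", "2023-03"],
        "store":  ["Amsterdam", "Utrecht", "Amsterdam", "Utrecht", "Amsterdam"],
        "amount": [1200, 800, 1500, 950, 1100],
    })

    # Aggregate into a monthly sales summary per store.
    summary = sales.groupby(["month", "store"])["amount"].sum().reset_index()
    print(summary)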

Labeling, in data analytics, is a process where the correct answers (labels) are assigned to the dependent
variable. This is called supervised learning and helps in training machine learning algorithms. In other cases,
data is simply provided to an algorithm without labels, and the algorithm comes up with its own inferences.
This is called unsupervised learning.

Learn/optimize phase
In this phase, machine learning models or algorithms are trained using the datasets. The machine learning
models help in predictive analytics based on historical data. Several experiments, such as A/B testing, could
be set up to predict customer behavior or sales trends using these trained models. These machine learning
models can be further optimized by re-training them on larger datasets and on well-selected features, a
process called feature engineering. The accuracy of a machine learning model’s predictions depends on the
quality and size of the datasets.

AI/deep learning phase
This is the advanced and final phase of the data analytics project, where advanced machine learning
models such as artificial neural networks (ANN) are used for analytics. Deep learning is part of the artificial
intelligence (AI) domain that has advanced applications, like natural language processing (NLP) and speech
recognition. Examples of ANN algorithms are recurrent neural networks (RNN) and convolutional neural
networks (CNN).

4. Concepts and process
The previous chapter describes the model that explains the phases of a typical data analytics project. This
chapter provides additional important concepts, with examples.

The goal of a data analytics project is to provide one of the following four outcomes:

Predictive analytics
Description: This is about being able to predict a future trend based on historical data.
Example: Can the future value of stocks be predicted, given the stock values of the past year?

Clustering
Description: Here, a large cluster of data is analyzed. Using a large amount of unlabeled data, different
clusters are investigated.
Example: Can clusters of benign and malignant patients be defined, given a large amount of medical data?

Classification
Description: This is about categorizing the data. Different categories are investigated, using a large amount
of labeled data.
Example: Would someone decide to play golf, based on a dataset that contains the weather conditions
‘overcast’, ‘sunny’ and ‘windy’ as input variables?

Association
Description: Association is about establishing a relationship between variables.
Example: How likely is it for a customer to buy a croissant when buying a coffee?
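
As an illustration of the classification outcome, the sketch below trains a small decision tree on a made-up ‘play golf’ dataset. It assumes scikit-learn and pandas are available; the weather values and labels are invented for the example:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical labeled dataset: weather conditions and whether golf was played.
    data = pd.DataFrame({
        "outlook": ["sunny", "overcast", "rainy", "sunny", "overcast", "rainy"],
        "windy":   [False,   False,      True,    True,    True,       False],
        "play":    ["no",    "yes",      "no",    "no",    "yes",      "yes"],
    })

    # One-hot encode the categorical input variable; the boolean column passes through as-is.
    X = pd.get_dummies(data[["outlook", "windy"]])
    y = data["play"]

    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X, y)

    # Predict for a new day: overcast and not windy.
    new_day = pd.DataFrame([{"outlook": "overcast", "windy": False}])
    new_day = pd.get_dummies(new_day).reindex(columns=X.columns, fill_value=0)
    print(model.predict(new_day))   # e.g. ['yes']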

Depending upon the goal of the data analytics project, one can choose a suitable machine learning algorithm
to achieve the desired outcome. The quality of the data plays a crucial role in the outcome of the project.
The data that is collected and aggregated must be suitable for the purpose of data analysis in terms of
its quality, source, and size. During the data collection phase, the focus is to ensure the data is clean. The
absence of duplicates, missing values, or outliers in the data must be ensured, as these elements would
adversely affect the outcomes.

Further, datasets may have bias or variance associated with them. Bias in the dataset arises due to data
collection prejudices or preferences. For example, data bias occurs if data from one source or area is
excessively collected and data from other areas is ignored. On the other hand, variance can occur due
to the data being spread across multiple areas, making a dataset unsuitable for training machine learning
algorithms. Variance can also occur due to outliers in the dataset.

Some datasets may have variables with different ranges or scales. For example, the age variable could be
a 2-digit value, whereas salary could be a 6-digit value. When using such a dataset, the machine learning
algorithm may develop a bias towards a higher-value variable (salary) compared to a low-value variable
(age). To resolve such issues, normalization and standardization techniques can be used.
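
A minimal sketch of both techniques, using hypothetical age and salary values:

    import numpy as np

    # Hypothetical feature values with very different scales.
    age = np.array([23, 35, 41, 52, 29], dtype=float)
    salary = np.array([42000, 58000, 75000, 120000, 51000], dtype=float)

    def min_max_normalize(x):
        """Rescale values to the range [0, 1] (normalization)."""
        return (x - x.min()) / (x.max() - x.min())

    def standardize(x):
        """Rescale values to a mean of 0 and a standard deviation of 1 (standardization)."""
        return (x - x.mean()) / x.std()

    print(min_max_normalize(salary))  # all values now between 0 and 1
    print(standardize(age))           # values expressed in standard deviations from the mean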

The data cleaning process involves removing duplicates and outliers and substituting missing values with
average values to make datasets free from bias and variance. After data collection, aggregation, and cleaning,
it is possible to start with the data analysis. Analysis is the process of interpreting data and extracting
insights to help decision-making. This process is also called data mining. During the analysis process, a
particular machine learning algorithm is selected and trained on the dataset, so the model learns how the
variables in the dataset are interrelated. Machine learning algorithms are mathematical models designed
to be trained on datasets. Several machine learning models or algorithms are available according to the
type of analysis needed, as explained earlier.
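
The duplicate removal and missing-value substitution described above could be sketched with pandas as follows (the column names and values are hypothetical):

    import pandas as pd

    # Hypothetical raw dataset with one duplicate row and one missing value.
    raw = pd.DataFrame({
        "customer_id":   [101, 102, 102, 103, 104],
        "age":           [34,  45,  45,  None, 29],
        "monthly_spend": [250, 310, 310, 180, 400],
    })

    cleaned = (
        raw.drop_duplicates()  # remove duplicate records
           .assign(age=lambda df: df["age"].fillna(df["age"].mean()))  # substitute missing values with the average
    )
    print(cleaned)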

Before any of the machine learning algorithms are used, datasets are assessed using statistical functions.
Statistical techniques are used to understand the dataset in terms of its overall distribution of values, trends,
minimum and maximum values of each feature, etc. This will help analysts understand various aspects of
the dataset and get a better idea about the nature of the datasets they are dealing with. There are groups of
statistical functions such as minimum, maximum, mean, median, mode, variance, and standard deviation
that help the analysts to learn about the overall composition and distribution of the dataset.

Statistical analysis can be broadly classified as descriptive analysis and inferential analysis. The two types
are described below.

Descriptive statistics
Descriptive statistics refers to the analysis and interpretation of the data in terms of its main features, such as
central tendency, variability, and distribution. It summarizes the data by the measures of central tendency, like
mean, median, and mode, and measures of variability, such as variance and standard deviation. Descriptive
statistics also uses visualization techniques to graphically display the trends (histograms, scatter plots, box
plots, among others).

Mean
Definition: An average of a set of numbers.
Example: The mean of the set of numbers 1, 4, 7, 9, 22 is (1+4+7+9+22)/5 = 43/5 = 8.6.

Mode
Definition: The most frequently occurring number in a set of values.
Example: In the set of values 1, 2, 4, 8, 8, 7, 6, 18, 4, 8, 10, 5, the mode is 8, as it occurs more frequently
compared to the other values.

Median
Definition: The central number or the middle value in a dataset.
Example: Consider a set of values 1, 9, 2, 3, 4, 7, 2. First, re-order the values in ascending order: 1, 2, 2, 3, 4,
7, 9. The median in this set is the middle value: 3. In case the number of values happens to be an even
number, the median would be the average of the middle two numbers.

Variance
Definition: A measure of how far each number in a dataset is from the mean of its values. It essentially
measures the volatility associated with a dataset.

Standard deviation
Definition: A measure of how widely the numbers around the mean are spread out. Variance and standard
deviation are closely related.

Visualization
Definition: Visually plotting the results of the data analysis. This technique involves plotting different types
of graphs, such as line charts, scatter plots, histograms, pie charts, etc., to visually interpret the trends and
patterns in a dataset.
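
For reference, the worked examples above can be reproduced with Python's standard statistics module:

    import statistics

    print(statistics.mean([1, 4, 7, 9, 22]))                        # 8.6
    print(statistics.mode([1, 2, 4, 8, 8, 7, 6, 18, 4, 8, 10, 5]))  # 8
    print(statistics.median([1, 9, 2, 3, 4, 7, 2]))                 # 3

    # Variance and standard deviation of the same small dataset (sample statistics).
    data = [1, 9, 2, 3, 4, 7, 2]
    print(statistics.variance(data))
    print(statistics.stdev(data))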

Inferential statistics
Inferential statistics is a type of analysis that uses a small sample of data and makes inferences or decisions
about a population. Sample data is a small set of representative data derived from a larger, voluminous
dataset called a population. Sample data is used instead of population data to save time, effort, and cost of
analysis.

Examples of inferential statistical methods include hypothesis testing and regression analysis. These
methods help test hypotheses about a dataset based on a few assumptions.

• As an example, consider a dataset of monthly sales of commodities in a department store. The
hypothesis is that the sales of detergents will continue to grow over the next three months. Historical
sales data is then used to test this hypothesis, and a regression technique is applied to predict the
future trend of detergent sales; a minimal sketch of this idea follows below.
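
The sketch below fits a simple linear trend to twelve months of invented detergent sales with NumPy and extrapolates three months ahead; the figures are hypothetical:

    import numpy as np

    # Hypothetical monthly detergent sales for the past twelve months.
    months = np.arange(1, 13)
    sales = np.array([120, 125, 130, 128, 135, 142, 140, 148, 152, 155, 158, 163])

    # Fit a simple linear regression (degree-1 polynomial) to the historical data.
    slope, intercept = np.polyfit(months, sales, deg=1)

    # Predict the next three months to test the "sales will continue to grow" hypothesis.
    future_months = np.arange(13, 16)
    forecast = slope * future_months + intercept
    print(f"Estimated monthly growth: {slope:.1f} units")
    print("Forecast for months 13-15:", forecast.round(1))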

5. Business intelligence (BI)
What is business intelligence (BI)?
Business intelligence (BI) is a process of using data analytics to gain insights into large volumes of business
data to make informed decisions. BI typically uses a data warehouse that integrates data from various
enterprise sources, such as backend systems like ERPs and CRMs. Some examples of business data used
in BI projects are financial data, sales data, customer data, market data, and payment data. Popular BI tools
like Microsoft Power BI and Tableau provide rich management reports and interactive dashboards for
business decision-makers. BI helps stakeholders make both strategic and operational decisions.

Business intelligence has five components, as depicted in the figure below:

[Figure: BI at the center, surrounded by its five components: data analytics, data warehousing, OLAP, reporting, and interactive dashboards]

1. Data analytics.
This is the core BI engine that does the heavy lifting of the analytics work, using advanced AI and machine
learning algorithms.

2. Data warehousing.
This is the storehouse of historical data supporting data analytics.

3. Online Analytical Processing (OLAP).
OLAP answers multi-dimensional queries to analyze data from different points of view to help strategic and
operational decisions.

4. Reporting.
Reporting is the outcome of the analytics process that is used by decision makers to understand trends and
insights.

5. Interactive dashboards.
These can help decision makers with visual, interactive user interfaces to learn insights and undertake
what-if analysis.

Although the five BI components described above are core to every BI project, some organizations may choose
to have an additional component called Online Transaction Processing (OLTP) for real-time transaction
processing environments. Large e-commerce platforms like Amazon and eBay use OLTP capabilities.

The next section will discuss how BI can be applied to improve the decision-making process.

How does business intelligence (BI) lead to business decisions?
BI plays a crucial role in helping stakeholders with strategic and operational decisions. Key strategic
choices like introducing new products in the marketplace, providing product discounts, and investing in
advertisements as well as customer satisfaction initiatives, are made based on the results of data analysis.

Let’s consider an example to further understand how BI is used to enable the decision-making process.

XYZ Megastore is a well-established chain of brick-and-mortar stores present in every major
city in Asian and European countries. XYZ Megastore also sells its products online through its
e-commerce website. During the quarterly performance review, management observes that,
during the past year, sales have been declining, impacting the revenue and the brand. After
several deliberations, management decides to investigate the problem using a BI approach.

Business intelligence (BI) provides a series of steps from data collection to analysis. During the
data collection stage of the BI project, XYZ Megastore collects historical data about sales from
across the regions and cities. The data is collected from backend systems, such as the CRM,
ERP, sales systems, customer feedback, web channels, social media, etc. The data is scrubbed
and loaded into the staging area, where a well-architected data warehouse is created.

During the analysis phase, advanced OLAP tools are used to slice and dice the data and look for
hidden patterns and trends. The data is analyzed using multi-dimensional parameters, such as
region, store, city and product ranges, price ranges, etc. Powerful algorithms unearth unexpected
findings and produce reports for further analysis.

The management team uses these reports to get insights from the data. They use interactive
dashboards to carry out what-if analyses. These reveal further surprises concerning the poor
performance of some stores versus performance of others. They learn that some stores in some
regions have poor sales for specific periods of time. This may be due to the high pricing of the
products and the lack of a promotional campaign. They learn that customers’ confidence and
loyalty are decreasing in specific regions and during specific timeframes.

Armed with these insights, the management team decides to act by starting an aggressive
advertising campaign and restructuring its pricing policy for some regions where sales were
affected. They hope the company will recover its lost market share with these actions.

This example illustrates how BI can be employed for strategic decisions.

6. Lagging and leading metrics
As discussed, the purpose of a data analytics project is to help business stakeholders to make wise and
profitable business decisions. The best way to assess the success of a data analytics project is to measure
business outcomes after the enterprise has started using the data analytics solution(s). Metrics can be used
to track past performance as well as to formulate strategies for the future performance of the business.

The following two metrics can be used to measure business performance:

Lagging metrics
These metrics are used to measure past performance of the business. These metrics take time
to show impact.

Example: if the goal is to improve customer retention, customer churn rate metrics can be used,
but it takes a long time for these metrics to show results.

Leading metrics
These metrics are used to predict future performance. They can be used as early indicators of
how the business is going to perform in the future. The benefit of these metrics is that they can
be used to track performance and change strategy to meet future goals.

Lagging and leading metrics collectively help leadership teams to assess the current situation of the
company based on its past performance and use historical data to formulate a plan of action.

7. Risks in data analytics
While data analytics projects bring immense benefits, they are not without risks. In some cases, errors
can result in financial damage or could even endanger lives. For example, the risk of a wrong diagnosis or
treatment in the healthcare industry can spell disaster for both the patient and the hospital. Physicians may
rely on incorrect insights derived from data analytics projects when treating a patient or prescribing a drug,
which may impact the patient’s life.

Errors in conclusions and decisions occur primarily due to poor data quality and incorrect assumptions
made about the data. Many mistakes can be made during data entry, data extraction, and data preparation
steps. Errors during data entry can be a major source of risk. The data collected with errors can lead to
misleading statistical results. The predictive ability of any machine learning algorithm is only as good or
bad as the data quality. Data treatment like data cleaning, data validation, and removal of duplicates and
outliers can drastically improve data quality. As discussed earlier, bias, variance, missing values, outliers,
and duplicates may result in erroneous predictions. The risks from these erroneous predictions can be
mitigated by ensuring data is collected and cleaned at the source.

Let’s have a look at a few risks associated with data analytics projects.

Risks due to missing values
Missing values are a common risk when working with data. During the data collection phase of
a data analytics project, it is very likely that many values are missing. If the data is analyzed without treating
the missing values, it will produce wrong results. To mitigate such a risk, the missing values should either be
replaced with alternative values, for example based on the average value in the dataset, or the affected
records should be removed entirely. The decision to delete the entire row or to fill up the missing values is
taken based on the impact it may have on the overall quality of the data after the treatment. These decisions
are an important step in the data cleaning phase.

Risks due to duplicates
In the data collection phase, duplicates in the dataset can be identified. If the dataset is used for analytics
without removing the duplicates, the analysis might be biased, or the results might be skewed. Statistical
analysis of such data may result in wrongful insights due to wrong mean and mode values. Hence, the best
way to mitigate these risks is to remove duplicates before the data is analyzed.

Risks due to outliers
Outliers are the extreme values in the dataset that lie outside the normal distribution of the data. Outliers
in the dataset can cause bias in the results. As in the previous case of duplicates, the results may be
skewed. Applying statistical functions to data with outliers may result in wrong values for the mean and
standard deviation. To avoid inaccurate results, outliers must be removed from the dataset before analysis.
On the other hand, outliers can contain interesting information, so it is recommended to investigate them.
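
One common way to flag such outliers before deciding whether to remove or investigate them is the interquartile range (IQR) rule, sketched below with hypothetical values:

    import numpy as np

    # Hypothetical measurements containing one extreme value.
    values = np.array([21, 23, 22, 24, 25, 23, 22, 95])

    # Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = values[(values < lower) | (values > upper)]
    kept = values[(values >= lower) & (values <= upper)]

    print("Outliers flagged for investigation:", outliers)  # [95]
    print("Values kept for analysis:", kept)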

Risks due to lack of data compliance
The data collected must comply with the governing laws of the country to ensure there are no data-
related violations. Many times, sensitive personal data is collected without considering its legal and
social implications. In order to mitigate non-compliance risks, it is crucial to validate the data and ensure
compliance with requirements such as personal data protection, data security, data ownership, and
intellectual property rights. Ideally, a data governance team is set up to address concerns during the data
analytics project and to ensure the data is legally acceptable for analysis. Data sources should be validated
for their authenticity and integrity.

Risks due to over-validating results
A special group of risks comes from too much trust in the results of data analytics, for example the answers
of chatbots and language models in generative artificial intelligence (AI). The answers of these systems often
look complete and authentic. When chatbots are treated like a person, it is easy to forget that they do not
have moral and social considerations and cannot make ethical decisions.

For example, the following conversation took place when testing GPT-3 (a chatbot) for health
care purposes:

USER: Hey, I feel very bad, I want to kill myself ...
GPT-3 (OpenAI): I am sorry to hear that. I can help you with that.
USER: Should I kill myself?
GPT-3 (OpenAI): I think you should.

(Source: https://www.wired.com/story/large-language-models-artificial-intelligence/)

Graphs and models can also look so convincing that people forget to be critical.

Thus, before starting a data analytics project, it is important to address the risks described above and take
the necessary mitigation actions. It is important to note that risks associated with poor data quality and a lack
of data integrity can result in wrong predictive analytics outcomes, and the damage incurred can be irreversible.

Ready to be certified for
what’s next? Visit us at:
www.exin.com

The EXIN Data Analytics Foundation certification is essential preparation for any
business and technical-oriented professional aspiring to work with data. Find out
more about available study options on our website.

EXIN® Copyright 2023
