Advanced Data Analytics Assignment

The document provides instructions for an advanced data analytics assignment involving exploratory data analysis, data mining techniques, hypothesis testing, and predictive modeling. For exploratory data analysis, students are asked to conduct analysis on a dataset to identify patterns, trends, and outliers. For data mining, students should use techniques like decision trees, association rules, or clustering to identify relationships. Hypothesis testing involves determining if a marketing campaign was effective by stating hypotheses, selecting a test, analyzing sample data, and drawing conclusions. Finally, predictive modeling requires building a model to forecast sales based on customer and purchase data.

a. Exploratory Data Analysis: Conduct an exploratory data analysis of the dataset to identify any patterns, trends, or outliers. [15 marks]
• Start by loading the dataset.
• When a dataset is too big to use within a Jupyter notebook, subsample it so that you have a representation of the data that is neither too big nor too small to work with. Such a subsample is sometimes referred to as a synthetic dataset.
• Check for missing data.
You can impute missing values (for example, with the mean) or remove them; where a row or column is mostly missing, it is usually best to drop that row or column entirely. This cleaning step can be done after the exploratory data analysis (EDA).
• Provide basic descriptions of your sample and features.
- Start by categorizing the data (continuous, discrete, or categorical).
Categorizing the data will help with choosing visualisations to use for the exploratory data analysis.
• Identify the shape of the data. This refers to the distribution of the data. Plot a few features from the dataset. If the dataset is a time series, investigate how each feature changes over time; perhaps there is a seasonality to the feature, or a positive/negative linear trend over time. These are all important things to consider in the EDA. Then calculate the average and the variance of each feature, note whether there is any change and how frequently it occurs, and try to hypothesize about any behaviour you see. Probability Density Functions (PDFs) and Probability Mass Functions (PMFs) are important for understanding the shape of your features: PMFs are used for discrete features and PDFs for continuous features. They tell you:
o Skewness
o Whether the feature is heterogeneous (multimodal)
o Whether the feature is disconnected (a gap in the PDF)
o Whether it is bounded
• Identify significant correlations.
Correlation measures the relationship between two variable quantities. When a dataset has many features, the Pearson correlation is a good choice: it measures the linear correlation between each pair of features and assigns the pair a value between -1 and 1. A positive value indicates a positive relationship and a negative value a negative relationship. Take note of all significant correlations between features. You might observe many relationships between features in your dataset, or very few; every dataset is different. Try to form hypotheses around why features might be correlated with each other.
• Spot outliers.
Outliers are significantly different from the other samples in your dataset and can lead to major problems when performing statistical tasks after your EDA. There are many reasons why an outlier might occur. A box plot is often used to show outliers. A consolidated code sketch for these EDA steps follows this list.
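
As a rough illustration of the steps above, here is a minimal EDA sketch in Python using pandas, matplotlib, and seaborn. The file name sales_data.csv is a placeholder, and the exact plots you choose should follow from how you categorized each feature.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load the dataset (file name is a placeholder)
    df = pd.read_csv("sales_data.csv")

    # Check for missing data
    print(df.isnull().sum())

    # Drop rows that are mostly empty, then impute numeric gaps with the mean
    df = df.dropna(thresh=int(df.shape[1] * 0.5))
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

    # Basic descriptions of the sample and features
    print(df.describe(include="all"))

    # Shape of the data: histograms approximate each feature's PMF/PDF
    df[numeric_cols].hist(bins=30, figsize=(10, 8))
    plt.tight_layout()
    plt.show()

    # Significant correlations: Pearson correlation between feature pairs
    sns.heatmap(df[numeric_cols].corr(method="pearson"), annot=True, cmap="coolwarm")
    plt.show()

    # Spot outliers with box plots
    df[numeric_cols].boxplot(figsize=(10, 4))
    plt.show()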
****Re-engineering a Dataset****
Since the dataset we have is too big to load into a Jupyter notebook, we can subsample it as a way of re-engineering the dataset so that we can use it. Re-engineering is a fancy way of saying we make the data suitable for use.
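
A minimal subsampling sketch, assuming the same DataFrame df as above; the 10% fraction and the random seed are arbitrary choices:

    # Draw a 10% random subsample so the data fits comfortably in the notebook
    subsample = df.sample(frac=0.1, random_state=42)
    print(len(df), "->", len(subsample))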
b. Data Mining: Use data mining techniques such as decision trees, association rules, or
clustering to identify any relationships or associations between customer
demographics, product purchases, and campaign responses. [15 marks]

Data mining process: How does it work?


The data mining process can be broken down into these four primary stages:

1. Data gathering. Relevant data for an analytics application is identified and assembled. The data may be located in different source systems, a data warehouse or a data lake, an increasingly common repository in big data environments that contains a mix of structured and unstructured data. External data sources may also be used. Wherever the data comes from, a data scientist often moves it to a data lake for the remaining steps in the process.

2. Data preparation. This stage includes a set of steps to get the data ready to be mined. It starts with data exploration, profiling and pre-processing, followed by data cleansing work to fix errors and other data quality issues. Data transformation is also done to make data sets consistent, unless a data scientist is looking to analyze unfiltered raw data for a particular application. (A small sketch of this stage follows this list.)

3. Mining the data. Once the data is prepared, a data scientist chooses the
appropriate data mining technique and then implements one or more
algorithms to do the mining. In machine learning applications, the algorithms
typically must be trained on sample data sets to look for the information being
sought before they're run against the full set of data.

4. Data analysis and interpretation. The data mining results are used to create
analytical models that can help drive decision-making and other business
actions. The data scientist or another member of a data science team also
must communicate the findings to business executives and users, often
through data visualization and the use of data storytelling techniques.
These steps are part of the data mining process.
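
To make the data preparation stage concrete, here is a small, hypothetical pandas sketch; the column names and values are invented for illustration:

    import pandas as pd

    # Invented raw customer data with typical quality issues
    raw = pd.DataFrame({
        "age": [25, None, 41, 25],
        "gender": ["F", "M", "m", "F"],
        "spend": ["100", "250", "bad", "100"],
    })

    # Data cleansing: fix inconsistent categories and invalid numbers
    raw["gender"] = raw["gender"].str.upper()
    raw["spend"] = pd.to_numeric(raw["spend"], errors="coerce")

    # Data transformation: impute and deduplicate for a consistent data set
    raw["age"] = raw["age"].fillna(raw["age"].mean())
    raw = raw.drop_duplicates()
    print(raw)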
Types of data mining techniques
Various techniques can be used to mine data for different data science applications.
Pattern recognition is a common data mining use case that's enabled by multiple
techniques, as is anomaly detection, which aims to identify outlier values in data
sets. Popular data mining techniques include the following types:

• Association rule mining. In data mining, association rules are if-then statements that identify relationships between data elements. Support and confidence criteria are used to assess the relationships -- support measures how frequently the related elements appear in a data set, while confidence reflects the number of times an if-then statement is accurate. (A small worked example follows this list.)

• Classification. This approach assigns the elements in data sets to different categories defined as part of the data mining process. Decision trees, Naive Bayes classifiers, k-nearest neighbor and logistic regression are some examples of classification methods.

• Clustering. In this case, data elements that share particular characteristics are grouped together into clusters as part of data mining applications. Examples include k-means clustering, hierarchical clustering and Gaussian mixture models.

• Regression. This is another way to find relationships in data sets, by calculating predicted data values based on a set of variables. Linear regression and multivariate regression are examples. Decision trees and some other classification methods can be used to do regressions, too.

• Sequence and path analysis. Data can also be mined to look for patterns in which a particular set of events or values leads to later ones.

• Neural networks. A neural network is a set of algorithms that simulates the activity of the human brain. Neural networks are particularly useful in complex pattern recognition applications involving deep learning, a more advanced offshoot of machine learning.
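
As promised above, a small worked example of support and confidence with invented market-basket data:

    # Invented transactions: which items appear in each basket
    transactions = [
        {"bread", "butter"}, {"bread", "butter", "milk"},
        {"bread"}, {"milk"}, {"bread", "milk"},
    ]

    n = len(transactions)
    both = sum(1 for t in transactions if {"bread", "butter"} <= t)
    antecedent = sum(1 for t in transactions if "bread" in t)

    # Rule: bread -> butter
    support = both / n              # how often bread and butter co-occur (2/5)
    confidence = both / antecedent  # how often the rule holds given bread (2/4)
    print(f"support={support:.2f}, confidence={confidence:.2f}")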

More on decision trees
• https://www.softwaretestinghelp.com/decision-tree-algorithm-examples-data-mining/#:~:text=Decision%20Tree%20Mining%20is%20a,target%20result%20is%20already%20known.
Association rules
• https://www.upgrad.com/blog/association-rule-mining-an-overview-and-its-applications/
Cluster
• https://www.upgrad.com/blog/cluster-analysis-data-mining/#:~:text=Clustering%20in%20data%20mining%20helps,analyzes%20the%20pattern%20of%20deception.
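
For a concrete starting point on decision trees and clustering, here is a minimal scikit-learn sketch; the feature matrix and labels are toy stand-ins for customer demographics, purchases, and campaign responses:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    # Toy stand-in: [age, annual_spend] per customer, and campaign response
    X = np.array([[25, 200], [40, 800], [35, 600], [22, 150], [50, 900], [28, 300]])
    y = np.array([0, 1, 1, 0, 1, 0])  # 1 = responded to the campaign

    # Decision tree: learn if-then rules linking demographics to responses
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(tree.predict([[30, 500]]))

    # k-means: group customers into segments with shared characteristics
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)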

c. Hypothesis Testing: Use hypothesis testing to determine whether the marketing campaign was effective in increasing sales. [10 marks]

State the hypotheses. Every test involves a null and an alternative hypothesis, which are mutually exclusive. For example: the null hypothesis is that the average exam scores of the two classes are equal (H0: μ1 = μ2); the alternative hypothesis is that the average exam scores of the two classes are not equal (H1: μ1 ≠ μ2).

Formulate an analysis plan. This step involves picking a test method: z-test, t-test, chi-square, etc. Then pick a significance level, α. This is a threshold at which anything equal to or below this probability level would be considered statistically unlikely, so you can reject the null hypothesis and state the alternative hypothesis. Typically α = 0.05* (<5% probability), but I'll put an asterisk on this because it is loosely followed in our industry. If you are in another industry, like aerospace, you may want to use an even smaller α.

Analyze sample data. This step involves calculating a test statistic and a p-value, which is just re-calibrating the numbers onto a distribution curve with one sample group fixed at zero and the other projected onto the distribution. The observation is associated with a p-value: the probability of getting an observation equal to or more extreme than the one seen, assuming the null hypothesis is true. The smaller the p-value, the farther the observation is from zero. Typically, if the p-value is below the α level, the null hypothesis is rejected.

Draw conclusions.

https://medium.com/@jw207427/how-to-apply-hypothesis-test-in-marketing-data-fbe1e1ac2388
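
A minimal sketch of such a test with scipy, using invented weekly sales figures from before and after the campaign:

    import numpy as np
    from scipy import stats

    # Invented weekly sales before and after the marketing campaign
    before = np.array([100, 110, 95, 105, 98, 102, 97, 108])
    after = np.array([115, 120, 112, 118, 125, 110, 122, 117])

    # H0: mean sales are equal; H1: they are not (two-sided t-test)
    t_stat, p_value = stats.ttest_ind(after, before)
    print(f"t={t_stat:.2f}, p={p_value:.4f}")

    # Reject H0 at the chosen significance level
    alpha = 0.05
    print("Campaign effective" if p_value < alpha else "No significant effect")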

d. Predictive Modeling: Build a predictive model to forecast the sales for the next
quarter based on customer demographics, product purchases, and campaign responses.
[15 marks]

5 Types of Predictive Models

Fortunately, predictive models don't have to be created from scratch for every application. Predictive analytics tools use a variety of vetted models and algorithms that can be applied to a wide spread of use cases.

Predictive modeling techniques have been perfected over time. As we add more data and more muscular computing, as AI and machine learning advance, and as analytics improves overall, we're able to do more with these models.

The top five predictive analytics models are:

1. Classification model: Considered the simplest model, it categorizes data for simple and direct query response. An example use case would be to answer the question "Is this a fraudulent transaction?"
2. Clustering model: This model nests data together by common attributes. It works by grouping things or people with shared characteristics or behaviors, so that strategies can be planned for each group at a larger scale. An example is in determining credit risk for a loan applicant based on what other people in the same or a similar situation did in the past.
3. Forecast model: This is a very popular model, and it works on
anything with a numerical value based on learning from historical
data. For example, in answering how much lettuce a restaurant
should order next week or how many calls a customer support agent
should be able to handle per day or week, the system looks back to
historical data.
4. Outliers model: This model works by analyzing abnormal or outlying
data points. For example, a bank might use an outlier model to
identify fraud by asking whether a transaction is outside of the
customer’s normal buying habits or whether an expense in a given
category is normal or not. For example, a $1,000 credit card charge
for a washer and dryer in the cardholder’s preferred big box store
would not be alarming, but $1,000 spent on designer clothing in a
location where the customer has never charged other items might be
indicative of a breached account.
5. Time series model: This model evaluates a sequence of data points
based on time. For example, the number of stroke patients admitted
to the hospital in the last four months is used to predict how many
patients the hospital might expect to admit next week, next month or
the rest of the year. A single metric measured and compared over
time is thus more meaningful than a simple average.

https://www.wallstreetmojo.com/predictive-modeling/
https://www.javatpoint.com/logistic-regression-in-machine-learning
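
Tying this back to the assignment, here is a minimal forecasting sketch using scikit-learn linear regression; the quarterly sales history and the planned marketing spend are invented for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented history: [quarter_index, marketing_spend] -> quarterly sales
    X = np.array([[1, 10], [2, 12], [3, 9], [4, 15], [5, 14], [6, 18]])
    y = np.array([100, 115, 95, 140, 135, 160])

    # Fit a linear regression on past quarters
    model = LinearRegression().fit(X, y)

    # Forecast the next quarter (index 7) with a planned spend of 20
    next_quarter = np.array([[7, 20]])
    print(f"Forecast sales: {model.predict(next_quarter)[0]:.1f}")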

e. Visualization: Create visualizations such as charts and graphs to present your findings to the marketing team.
You can use the plots produced by the predictive models in the Jupyter notebook, or connect Power BI to the notebook.
Sites for extra help: https://www.geeksforgeeks.org/predictive-analysis-in-data-mining/
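
A minimal matplotlib sketch for presenting the forecast to the marketing team, reusing the invented figures from the previous example (the forecast value of 175 is likewise invented):

    import matplotlib.pyplot as plt

    # Invented quarterly sales history plus the forecast point
    quarters = [1, 2, 3, 4, 5, 6]
    sales = [100, 115, 95, 140, 135, 160]

    plt.plot(quarters, sales, marker="o", label="Actual sales")
    plt.scatter([7], [175], color="red", label="Forecast (Q7)")
    plt.xlabel("Quarter")
    plt.ylabel("Sales")
    plt.title("Quarterly Sales and Next-Quarter Forecast")
    plt.legend()
    plt.show()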
