Advanced Data Analytics Assignment

The document provides instructions for an advanced data analytics assignment involving exploratory data analysis, data mining techniques, hypothesis testing, and predictive modeling. For exploratory data analysis, students are asked to conduct analysis on a dataset to identify patterns, trends, and outliers. For data mining, students should use techniques like decision trees, association rules, or clustering to identify relationships. Hypothesis testing involves determining if a marketing campaign was effective by stating hypotheses, selecting a test, analyzing sample data, and drawing conclusions. Finally, predictive modeling requires building a model to forecast sales based on customer and purchase data.

a. Exploratory Data Analysis: Conduct an exploratory data analysis of the dataset to identify any patterns, trends, or outliers. [15 marks]
• Start by loading the dataset.
• When a dataset is too big to use within a Jupyter notebook, subsample it so that you have a representation of the data that is neither too big nor too small to work with. Such a subsample is sometimes referred to as a synthetic dataset.
• Check for missing data.
You can impute missing values (for example, with the mean) or remove them; where a row or column is mostly missing, it is usually best to drop that row or column entirely. This cleaning step can be done after the exploratory data analysis (EDA).
• Provide basic descriptions of your sample and features.
- Start by categorizing the data (continuous, discrete, or categorical).
Categorizing the data will help with choosing visualisations to use for the exploratory data analysis.
• Identify the shape of the data. This refers to the distribution of the data. Plot a few features from the dataset. If the dataset is a time series, investigate how each feature changes over time; perhaps there is a seasonality to the feature, or a positive/negative linear trend over time. These are all important things to consider in the EDA. Then calculate the average and the variance of each feature, note whether there is any change and how frequently it occurs, and try to hypothesize about any behaviour you see. Probability Density Functions (PDFs) and Probability Mass Functions (PMFs) are important for understanding the shape of your features: PMFs are used for discrete features and PDFs for continuous features. They tell you:
o Skewness
o Whether the feature is heterogeneous (multimodal)
o Whether the feature is disconnected (a gap in the PDF)
o Whether it is bounded
• Identify significant correlations.
Correlation measures the relationship between two variable quantities. When a dataset has many features, the Pearson correlation is a good choice: it measures the linear correlation between each pair of features and assigns the pair a value between -1 and 1. A positive value indicates a positive relationship and a negative value a negative relationship. Take note of all significant correlations between features. You might observe many relationships between features in your dataset, or very few; every dataset is different. Try to form hypotheses around why features might be correlated with each other.
• Spot outliers.
Outliers are significantly different from the other samples in your dataset and can lead to major problems when performing statistical tasks after your EDA. There are many reasons why an outlier might occur. A box plot is often used to show outliers. A consolidated code sketch for these EDA steps follows this list.
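
As a rough illustration of the steps above, here is a minimal EDA sketch in Python using pandas, matplotlib, and seaborn. The file name sales_data.csv is a placeholder, and the exact plots you choose should follow from how you categorized each feature.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load the dataset (file name is a placeholder)
    df = pd.read_csv("sales_data.csv")

    # Check for missing data
    print(df.isnull().sum())

    # Drop rows that are mostly empty, then impute numeric gaps with the mean
    df = df.dropna(thresh=int(df.shape[1] * 0.5))
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

    # Basic descriptions of the sample and features
    print(df.describe(include="all"))

    # Shape of the data: histograms approximate each feature's PMF/PDF
    df[numeric_cols].hist(bins=30, figsize=(10, 8))
    plt.tight_layout()
    plt.show()

    # Significant correlations: Pearson correlation between feature pairs
    sns.heatmap(df[numeric_cols].corr(method="pearson"), annot=True, cmap="coolwarm")
    plt.show()

    # Spot outliers with box plots
    df[numeric_cols].boxplot(figsize=(10, 4))
    plt.show()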
****Re-engineering a Dataset****
Since the dataset we have is too big to load into a Jupyter notebook, we can subsample it as a way of re-engineering the dataset so that we can use it. Re-engineering is a fancy way of saying we make the data suitable for use.
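
A minimal subsampling sketch, assuming the same DataFrame df as above; the 10% fraction and the random seed are arbitrary choices:

    # Draw a 10% random subsample so the data fits comfortably in the notebook
    subsample = df.sample(frac=0.1, random_state=42)
    print(len(df), "->", len(subsample))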
b. Data Mining: Use data mining techniques such as decision trees, association rules, or
clustering to identify any relationships or associations between customer
demographics, product purchases, and campaign responses. [15 marks]

Data mining process: How does it work?


The data mining process can be broken down into these four primary stages:

1. Data gathering. Relevant data for an analytics application is identified and assembled. The data may be located in different source systems, a data warehouse or a data lake, an increasingly common repository in big data environments that contains a mix of structured and unstructured data. External data sources may also be used. Wherever the data comes from, a data scientist often moves it to a data lake for the remaining steps in the process.

2. Data preparation. This stage includes a set of steps to get the data ready to be mined. It starts with data exploration, profiling and pre-processing, followed by data cleansing work to fix errors and other data quality issues. Data transformation is also done to make data sets consistent, unless a data scientist is looking to analyze unfiltered raw data for a particular application. (A small sketch of this stage follows this list.)

3. Mining the data. Once the data is prepared, a data scientist chooses the
appropriate data mining technique and then implements one or more
algorithms to do the mining. In machine learning applications, the algorithms
typically must be trained on sample data sets to look for the information being
sought before they're run against the full set of data.

4. Data analysis and interpretation. The data mining results are used to create
analytical models that can help drive decision-making and other business
actions. The data scientist or another member of a data science team also
must communicate the findings to business executives and users, often
through data visualization and the use of data storytelling techniques.
These steps are part of the data mining process.
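
To make the data preparation stage concrete, here is a small, hypothetical pandas sketch; the column names and values are invented for illustration:

    import pandas as pd

    # Invented raw customer data with typical quality issues
    raw = pd.DataFrame({
        "age": [25, None, 41, 25],
        "gender": ["F", "M", "m", "F"],
        "spend": ["100", "250", "bad", "100"],
    })

    # Data cleansing: fix inconsistent categories and invalid numbers
    raw["gender"] = raw["gender"].str.upper()
    raw["spend"] = pd.to_numeric(raw["spend"], errors="coerce")

    # Data transformation: impute and deduplicate for a consistent data set
    raw["age"] = raw["age"].fillna(raw["age"].mean())
    raw = raw.drop_duplicates()
    print(raw)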
Types of data mining techniques
Various techniques can be used to mine data for different data science applications.
Pattern recognition is a common data mining use case that's enabled by multiple
techniques, as is anomaly detection, which aims to identify outlier values in data
sets. Popular data mining techniques include the following types:

• Association rule mining. In data mining, association rules are if-then statements that identify relationships between data elements. Support and confidence criteria are used to assess the relationships -- support measures how frequently the related elements appear in a data set, while confidence reflects the number of times an if-then statement is accurate. (A small worked example follows this list.)

• Classification. This approach assigns the elements in data sets to different categories defined as part of the data mining process. Decision trees, Naive Bayes classifiers, k-nearest neighbor and logistic regression are some examples of classification methods.

• Clustering. In this case, data elements that share particular characteristics are grouped together into clusters as part of data mining applications. Examples include k-means clustering, hierarchical clustering and Gaussian mixture models.

• Regression. This is another way to find relationships in data sets, by calculating predicted data values based on a set of variables. Linear regression and multivariate regression are examples. Decision trees and some other classification methods can be used to do regressions, too.

• Sequence and path analysis. Data can also be mined to look for patterns in which a particular set of events or values leads to later ones.

• Neural networks. A neural network is a set of algorithms that simulates the activity of the human brain. Neural networks are particularly useful in complex pattern recognition applications involving deep learning, a more advanced offshoot of machine learning.
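
As promised above, a small worked example of support and confidence with invented market-basket data:

    # Invented transactions: which items appear in each basket
    transactions = [
        {"bread", "butter"}, {"bread", "butter", "milk"},
        {"bread"}, {"milk"}, {"bread", "milk"},
    ]

    n = len(transactions)
    both = sum(1 for t in transactions if {"bread", "butter"} <= t)
    antecedent = sum(1 for t in transactions if "bread" in t)

    # Rule: bread -> butter
    support = both / n              # how often bread and butter co-occur (2/5)
    confidence = both / antecedent  # how often the rule holds given bread (2/4)
    print(f"support={support:.2f}, confidence={confidence:.2f}")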

More on decision trees
• https://www.softwaretestinghelp.com/decision-tree-algorithm-examples-data-mining/#:~:text=Decision%20Tree%20Mining%20is%20a,target%20result%20is%20already%20known.
Association rules
• https://www.upgrad.com/blog/association-rule-mining-an-overview-and-its-applications/
Cluster
• https://www.upgrad.com/blog/cluster-analysis-data-mining/#:~:text=Clustering%20in%20data%20mining%20helps,analyzes%20the%20pattern%20of%20deception.
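
For a concrete starting point on decision trees and clustering, here is a minimal scikit-learn sketch; the feature matrix and labels are toy stand-ins for customer demographics, purchases, and campaign responses:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    # Toy stand-in: [age, annual_spend] per customer, and campaign response
    X = np.array([[25, 200], [40, 800], [35, 600], [22, 150], [50, 900], [28, 300]])
    y = np.array([0, 1, 1, 0, 1, 0])  # 1 = responded to the campaign

    # Decision tree: learn if-then rules linking demographics to responses
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(tree.predict([[30, 500]]))

    # k-means: group customers into segments with shared characteristics
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)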

c. Hypothesis Testing: Use hypothesis testing to determine whether the marketing campaign was effective in increasing sales. [10 marks]

State the hypotheses. Every test involves a null and an alternative hypothesis, which are mutually exclusive. For example: the null hypothesis is that the average exam scores of the two classes are equal (H0: μ1 = μ2); the alternative hypothesis is that the average exam scores of the two classes are not equal (H1: μ1 ≠ μ2).

Formulate an analysis plan. This step involves picking a test method: z-test, t-test, chi-square, etc. Then pick a significance level, α. This is a threshold at which anything equal to or below this probability level would be considered statistically unlikely, so you can reject the null hypothesis and state the alternative hypothesis. Typically α = 0.05* (<5% probability), but I'll put an asterisk on this because it is loosely followed in our industry. If you are in another industry, like aerospace, you may want to use an even smaller α.

Analyze sample data. This step involves calculating a test statistic and a p-value, which is just re-calibrating the numbers onto a distribution curve with one sample group fixed at zero and the other projected onto the distribution. The observation is associated with a p-value: the probability of getting an observation equal to or more extreme than the one seen, assuming the null hypothesis is true. The smaller the p-value, the farther the observation is from zero. Typically, if the p-value is below the α level, the null hypothesis is rejected.

Draw conclusions.

https://medium.com/@jw207427/how-to-apply-hypothesis-test-in-marketing-data-fbe1e1ac2388
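
A minimal sketch of such a test with scipy, using invented weekly sales figures from before and after the campaign:

    import numpy as np
    from scipy import stats

    # Invented weekly sales before and after the marketing campaign
    before = np.array([100, 110, 95, 105, 98, 102, 97, 108])
    after = np.array([115, 120, 112, 118, 125, 110, 122, 117])

    # H0: mean sales are equal; H1: they are not (two-sided t-test)
    t_stat, p_value = stats.ttest_ind(after, before)
    print(f"t={t_stat:.2f}, p={p_value:.4f}")

    # Reject H0 at the chosen significance level
    alpha = 0.05
    print("Campaign effective" if p_value < alpha else "No significant effect")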

d. Predictive Modeling: Build a predictive model to forecast the sales for the next
quarter based on customer demographics, product purchases, and campaign responses.
[15 marks]

5 Types of Predictive Models

Fortunately, predictive models don't have to be created from scratch for every application. Predictive analytics tools use a variety of vetted models and algorithms that can be applied to a wide spread of use cases.

Predictive modeling techniques have been perfected over time. As we add more data and more muscular computing, as AI and machine learning advance, and as analytics improves overall, we're able to do more with these models.

The top five predictive analytics models are:

1. Classification model: Considered the simplest model, it categorizes data for simple and direct query response. An example use case would be to answer the question "Is this a fraudulent transaction?"
2. Clustering model: This model nests data together by common attributes. It works by grouping things or people with shared characteristics or behaviors, so that strategies can be planned for each group at a larger scale. An example is in determining credit risk for a loan applicant based on what other people in the same or a similar situation did in the past.
3. Forecast model: This is a very popular model, and it works on
anything with a numerical value based on learning from historical
data. For example, in answering how much lettuce a restaurant
should order next week or how many calls a customer support agent
should be able to handle per day or week, the system looks back to
historical data.
4. Outliers model: This model works by analyzing abnormal or outlying
data points. For example, a bank might use an outlier model to
identify fraud by asking whether a transaction is outside of the
customer’s normal buying habits or whether an expense in a given
category is normal or not. For example, a $1,000 credit card charge
for a washer and dryer in the cardholder’s preferred big box store
would not be alarming, but $1,000 spent on designer clothing in a
location where the customer has never charged other items might be
indicative of a breached account.
5. Time series model: This model evaluates a sequence of data points
based on time. For example, the number of stroke patients admitted
to the hospital in the last four months is used to predict how many
patients the hospital might expect to admit next week, next month or
the rest of the year. A single metric measured and compared over
time is thus more meaningful than a simple average.

https://www.wallstreetmojo.com/predictive-modeling/
https://www.javatpoint.com/logistic-regression-in-machine-learning
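
Tying this back to the assignment, here is a minimal forecasting sketch using scikit-learn linear regression; the quarterly sales history and the planned marketing spend are invented for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented history: [quarter_index, marketing_spend] -> quarterly sales
    X = np.array([[1, 10], [2, 12], [3, 9], [4, 15], [5, 14], [6, 18]])
    y = np.array([100, 115, 95, 140, 135, 160])

    # Fit a linear regression on past quarters
    model = LinearRegression().fit(X, y)

    # Forecast the next quarter (index 7) with a planned spend of 20
    next_quarter = np.array([[7, 20]])
    print(f"Forecast sales: {model.predict(next_quarter)[0]:.1f}")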

e. Visualization: Create visualizations such as charts and graphs to present your findings to the marketing team.
You can use the plots produced by the predictive models in the Jupyter notebook, or connect Power BI to the notebook.
Sites for extra help: https://www.geeksforgeeks.org/predictive-analysis-in-data-mining/
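
A minimal matplotlib sketch for presenting the forecast to the marketing team, reusing the invented figures from the previous example (the forecast value of 175 is likewise invented):

    import matplotlib.pyplot as plt

    # Invented quarterly sales history plus the forecast point
    quarters = [1, 2, 3, 4, 5, 6]
    sales = [100, 115, 95, 140, 135, 160]

    plt.plot(quarters, sales, marker="o", label="Actual sales")
    plt.scatter([7], [175], color="red", label="Forecast (Q7)")
    plt.xlabel("Quarter")
    plt.ylabel("Sales")
    plt.title("Quarterly Sales and Next-Quarter Forecast")
    plt.legend()
    plt.show()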
