
Data Analysis &

Visualization

Presented by: Dr. Dina El Kayaly


Learning Outcomes
▪ Understand the data-driven decision-making process
▪ Understand the various types of data
▪ Understand sampling design
▪ Set an analysis plan
▪ Enter, import and prepare data using SPSS
▪ Summarize data using descriptive statistics
▪ Explore relationships between different types of variables
▪ Choose and run the appropriate inferential tests, such as chi-square, t-tests and regression
▪ Interpret all the statistics presented in the course
Course RoadMap

Module 1: Data supporting the Decision-Making Process
Module 2: Introduction to the basics of Statistics
Module 3: Data preparation and Descriptive Statistics using SPSS software
Session 1: What is data-driven decision
making?
What is data-driven decision making?

In business, you constantly have to make decisions — from how much


raw material to order to how to optimize retail traffic for changing
weather.

Today, companies are finding that the best answers to these questions
come from another source entirely: large amounts of data and
computer-driven analysis that you rigorously leverage to make
predictions. This is called data-driven decision making (DDDM).
What is data-driven decision making?

It is vital because it may lead to greater organizational


performance.

This can help your company earn more money and flourish.

If you make judgments based only on intuition, you may find


yourself spending your money in the wrong locations.
Steps for Implementing Data-
Driven Decision Making
Pitfalls to Avoid in DDDM
What is Data Analysis?

Data analysis is the practice of working with data to glean useful


information, which can then be used to make informed decisions
Why Do We Need Data Analysis?

▪ Uncover hidden insights.
▪ Generate reports based on the available data.
▪ Perform market analysis.
▪ Improve business strategy.
Who is a Data Analyst?

A data analyst is a person whose job is to gather and interpret data in order to
solve a specific problem. Data analysts require strong skills in data
management, statistical analysis, data visualization, and business
domain knowledge
What is Data?
▪ Data represents raw elements or unprocessed facts,
ranging from numbers and symbols to text and images.
▪ When collected and observed without interpretation, these
elements remain just data—simple and unorganized.
▪ When these pieces are analyzed and contextualized, they
transform into something more meaningful.
What is Data Collection?
▪ Data collection is the process of preparing and gathering data.
▪ It is the systematic gathering of data for a particular purpose
from various sources, where the data has been systematically
observed, recorded and organized.
▪ Data are the basic inputs to any decision-making process in
business.
Primary vs.
Secondary Data
Types of
Secondary
Data
Examples of
Secondary
Data Sources
Advantages
and
Disadvantages
of Secondary
Data Sources
Sources of Primary Data
Examples: Sources of Quantitative Data
Session 2: Sampling Design
Population vs. Sample:
Understanding the Difference
Population vs. Samples
Population:
• It refers to the whole data set for the use case.
• It includes all groups which can be correlated with each other.
• Example: All the members of an online forum reading articles.

Samples:
• A subset of the population.
• A random selection of data points.
• The process of determining the sample from population data is known as sampling.
• Example: A group of club members who read technical articles.
Advantages and
disadvantages of
Sampling
Two Concerns...
Sampling Design
Sampling design is the method
you use to choose your sample.
There are several types of
sampling designs, and they all
serve as roadmaps for the
selection of your survey sample.
How do we choose what members of
the population to sample?
Exercise
You are trying to estimate the average valuation of start-ups in
Egypt. Imagine you are able to visit 200 start-ups in Polaris, located in
6th of October City, in a random manner. What is a possible problem
with your study?

•The sample is not random.


•The sample is too small.
•The sample was not representative.
•The population is unknown.
Exercise

• The high school principal wants to conduct a survey on student


satisfaction for the entire school. You were tasked with contacting
your classmates about their opinion and then presenting them to the
principal.
• Would you say this was population or sample data? What is the value
you presented called?
Sample Size

A sample size calculator makes it easy to get the right number of
responses for your survey.

Sample Size Calculator and Tips for Determining Sample Size |


SurveyMonkey
Session 3: Variables and How to Set Analysis
Plan?
Variable

• A variable is any characteristic, number, or quantity that can
be measured or counted.
• Examples: age, sex, business income and expenses, country
of birth, capital expenditure, class grades, eye color and
vehicle type.
Respondents / cases Gender Degree
Ahmed M MBA
Dina F DBA
Reem F PhD
Jamila F Undergraduate
Dataset
• A dataset is a collection of data.
• A collection of information obtained through observations,
measurements, study, or analysis is referred to as data.
Variable Data Types

(variable types shown ordered from less to more informative)
Variable Data Types
Variable Data Types
Variable Data Types: Examples
Does the
questionnaire
have one or
more types of
data?
Exercise
Exercise

• A variable represents the weight of a person. What type of data does


it represent?
• A variable represents the gender of a person. What type of data does
it represent?
How to set an analysis plan?
How to Create an Analysis Plan?
Once you get survey feedback, you might think that the job is done.
The next step, however, is to analyze those results. Creating a data
analysis plan will help guide you through how to analyze the data and
come to logical conclusions.

It starts with the goals you set for your survey in the first place. This
guide will help you create a data analysis plan that will effectively
utilize the data your respondents provided.
Customer
Satisfaction
Survey
Steps to Create an Analysis Plan
1. Review your goals

If you’re testing customer satisfaction, your survey goal might be
“How satisfied are our customers?” You probably came up with
several topics you wanted to address, such as:
▪ What is the overall level of satisfaction among our customers?
▪ Which demographics are responding most positively?
▪ Are there any specific pain points that need to be addressed?
▪ What are the main factors affecting the overall average buying experience?
Steps to Create an Analysis Plan
2. Evaluate the results for your top questions

Your survey questions probably included at least one or two questions


that directly relate to your primary goals. For example, in the customer
satisfaction example above, your top two questions might be:
▪ What is the overall level of satisfaction among our customers?
▪ Which demographics are responding most positively?
Those questions offer a general overview of how your customers feel.
The next goal is to determine why customers feel the way they do.
Steps to Create an Analysis Plan
3. Assign Questions to specific goals

Next, you’ll organize your survey questions and responses by which


research question they answer. Regarding the overall satisfaction question:
▪ If a satisfaction score does not exist, consider creating one.
▪ Set a target for it (a management decision, or by comparison with a
previous score)
Steps to Create an Analysis Plan
4. Pay Special Attention to demographics

Breaking down results by demographics can be illuminating
and helps with segmentation and drafting a suitable marketing plan.

Depending on your ultimate survey goals, you may want to compare


multiple demographic types to get accurate insight into your results.

You will report only the statistically significant results.


Steps to Create an Analysis Plan
5. Consider Correlation & Causation

Consider studying the prevailing patterns and the reasons behind


certain patterns using statistical techniques to ensure objectivity.
Steps to Create an Analysis Plan
6. Perform the analysis

Once you’ve assigned survey questions to the overall research


questions they’re designed to answer, you can move on to the actual
data analysis.
Choose the analysis types that suit your questions and goals, then use
SPSS to analyze the data. Complement the quantitative data with
qualitative data if possible.

At the end of the process, you should be able to answer your major
research questions.
DDDM Checklist

See this
check list

IC-Data-Driven-Decision-Making-Checklist-Template-10545_PDF.pdf (smartsheet.com)
References
Data Analysis Plan: Ultimate Guide and Examples (voiceform.com)

IC-Data-Driven-Decision-Making-Checklist-Template-
10545_PDF.pdf (smartsheet.com)

Creating a Data Analysis Plan: https://youtu.be/djVHKjmImrw


A Beginners Guide to the Data Analysis Process:
https://youtu.be/lgCNTuLBMK4
Exercise

Task: Identify the key research questions
Activity
▪ Download SPSS and install it
Session 4: Opening SPSS
Opening SPSS
Difference between SPSS and Excel data file
Data View vs. Variable View
Details of the variable View
Importing Excel file
Exporting to Excel
Types of SPSS data files (input and output)
Analysis techniques
Help option
Session 5: Data Preparation
What is Data Quality?
Data quality is a measure of a data set's condition based on factors
such as accuracy, completeness, consistency, reliability and validity.

Measuring data quality can help organizations identify errors and


inconsistencies in their data and assess whether the data fits its
intended purpose.
Why Data Quality?
Low-quality data can have significant business consequences for an
organization. Bad data is often the reason behind operational
difficulties, inaccurate analytics and ill-conceived business strategies.

Fact: In 2021, consulting firm Gartner stated that bad data quality costs
organizations an average of $12.9 million per year.
Another figure that's still often cited comes from IBM, which estimated
that data quality issues in the U.S. cost $3.1 trillion in 2016.
A data set that meets all of these measures is much more reliable and
trustworthy than one that does not.

However, these are not necessarily the only standards that organizations use
to assess their data sets.

For example, they might also take into account qualities such as
appropriateness, credibility, relevance, reliability or usability. The goal is to
ensure that the data fits its intended purpose and that it can be trusted.
What is Data Preparation?
▪ Data preparation is the process of cleaning and transforming raw data
prior to processing and analysis.

▪ Data preparation is often a lengthy undertaking for data professionals or


business users, but it is essential as a prerequisite to put data in context
in order to turn it into insights and eliminate bias resulting from poor
data quality

▪ The data preparation process usually includes standardizing data formats,


enriching source data, and/or removing outliers.
Why Data Preparation?
Data in the real world is dirty and not ready for analysis:

Incomplete: Some data lack attribute values or certain attributes of
interest, or contain only aggregate data.
For example, First name = “” or Last name = “”

Noisy: Some data contains errors.


For example, Age = -10

Inconsistent: Some data contain discrepancies in codes and names


For example, Age = 56, Birthdate = ’04–05–1995’
Why Data Preparation?
▪ Fix errors quickly — Data preparation helps catch errors before
processing. After data has been removed from its original source, these
errors become more difficult to understand and correct.

▪ Produce top-quality data — Cleaning and reformatting datasets ensures


that all data used in analysis will be high quality.

▪ Make better business decisions — higher quality data that can be


processed and analyzed more quickly and efficiently leads to more
timely, efficient and high-quality business decisions.
Steps of Data Preparation
Steps of Data Preparation

1. Data discretization: part of data reduction, grouping continuous values of
variables into contiguous intervals. For example, age can be transformed
into bands (0-10, 11-20, …).
2. Data cleaning: removing inconsistent or erroneous values and handling
outliers. For example, handling missing values and coding categorical data.
3. Data integration: accessing data from different sources and of
different formats.
4. Data transformation: coding and converting data formats, or applying a
mathematical formula; it is also used to reverse scales.
5. Data reduction: obtaining a reduced representation in volume that produces
the same or similar analytical results. For example, reducing several data
columns into a single score.
How to Convert Categorical Data to
Numerical Data?
Integer/Label Encoding One-Hot Encoding
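As a minimal sketch (in Python rather than SPSS, with made-up example categories), the two encodings work like this:

```python
# Two ways to turn a categorical variable into numbers before analysis.
# The category values below are made up for illustration.
categories = ["beef", "meat", "poultry", "beef"]
levels = sorted(set(categories))  # ['beef', 'meat', 'poultry']

# Integer/label encoding: each distinct category gets an integer code.
codes = {c: i for i, c in enumerate(levels)}
label_encoded = [codes[c] for c in categories]

# One-hot encoding: one 0/1 indicator column per category level.
one_hot = [[1 if c == level else 0 for level in levels] for c in categories]

print(label_encoded)  # [0, 1, 2, 0]
print(one_hot[0])     # [1, 0, 0]
```

Label encoding imposes an ordering on the codes, so one-hot encoding is usually preferred for nominal data.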
Missing Values
Strategies To Handle Missing Values
• Remove observation/records that have missing values.
• Replacing With Mean/Median/Mode
• Predicting the missing value
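A minimal Python sketch of the first two strategies, using an assumed example column (predicting missing values would require a model and is out of scope here):

```python
import statistics

# An assumed example column with missing values (None).
ages = [25, None, 31, 40, None, 35]

# Strategy 1: remove records that have missing values.
complete = [a for a in ages if a is not None]

# Strategy 2: replace missing values with the mean of the observed values
# (the median or mode can be substituted the same way).
mean_age = statistics.mean(complete)
imputed = [a if a is not None else mean_age for a in ages]

print(complete)  # [25, 31, 40, 35]
print(imputed)   # [25, 32.75, 31, 40, 32.75, 35]
```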
Outliers

An outlier is a single data point that
lies far outside the average value of
a group of data.

Outliers may also be exceptions that
stand outside individual samples of
populations. In a more general
context, an outlier is an individual
that is markedly different from the
norm in some respect.
Key Considerations When Dealing with
Outliers
▪ Verify data quality: Check for potential errors or anomalies in data
collection before assuming outliers are genuine observations.
▪ Choose appropriate outlier statistics techniques: Select the most
suitable method for handling outliers based on the dataset's
characteristics and the analysis goals.
▪ Understand the context: Consider the nature of the data and the
domain-specific factors that may influence the presence of outliers.
▪ Document procedures: Document the steps taken to identify and
handle outliers to ensure transparency and reproducibility of the
analysis.
Techniques to find outliers
1. Sorting method
You can sort quantitative variables from low to high and scan for extremely low or
extremely high values. Flag any extreme values that you find.
Techniques to find outliers
2. Using visualization
Techniques to find outliers
2. Using visualization
3. Z Score
4. Interquartile range
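A minimal Python sketch of the z-score method on an assumed example data set; a cut-off of 3 is common, but with small samples a lower cut-off such as 2.5 is often used:

```python
import statistics

# Assumed example data with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 13, 12, 11, 60]

mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation

# Flag values whose standardized distance from the mean exceeds the cut-off.
cutoff = 2.5
outliers = [x for x in data if abs((x - mean) / sd) > cutoff]
print(outliers)  # [60]
```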
How to handle the outliers?
• Deleting the value
• Valuing the outliers

Should I lose such information?
What if it is a new trend? Or a special
case?
Exercise
▪ Calculating Score
▪ Coding and decoding
▪ Handling open ended questions
Activity
▪ Prepare the file “Insurance Claim” for analysis
Activity
▪ Enter this data column to an SPSS file
Annual income
$ 62,000.00
$ 64,000.00
$ 49,000.00
$ 324,000.00
$ 1,264,000.00
$ 54,330.00
$ 64,000.00
$ 51,000.00
$ 55,000.00
$ 48,000.00
$ 53,000.00
Session 6: Data Analysis – Descriptive
What is Statistical Analysis?
There are two types of tests:
• Descriptive Type of Statistical Analysis
• Inferential Type of Statistical Analysis
Types of Statistical Analysis
Descriptive Statistics vs. Inferential Statistics

Definition
• Descriptive statistics is used to describe the characteristics of the population using a sample.
• Inferential statistics uses various analytical tools to draw inferences about the population using samples.

Tools
• Descriptive: measures of central tendency and measures of dispersion.
• Inferential: hypothesis testing and regression analysis.

Use
• Descriptive: organizes, describes and presents data in a meaningful way with the help of charts and graphs.
• Inferential: tests, predicts, and compares data obtained from various samples.

Relevance
• Descriptive: summarizes known data in a way that can be used for further predictions and analysis.
• Inferential: tries to use the summarized samples to draw conclusions about the population.
Descriptive Statistics
Descriptive analysis is the type of
analysis that helps describe, show
or summarize data points in a
constructive way so that patterns
might emerge from the data.
Mean
Median
Mode
Interquartile Range (IQR)
Standard Deviation
Variance
• Variance reflects the degree of spread in the data set. The more
spread the data, the larger the variance is in relation to the mean.
Example: data = 1, 2, 3, 4, 5; mean = 3

((1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)²) / 5 = 2
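The worked example above can be computed directly (a quick Python sketch; SPSS is the course tool, this just checks the arithmetic):

```python
# The worked example above: population variance of 1, 2, 3, 4, 5.
data = [1, 2, 3, 4, 5]
mean = sum(data) / len(data)  # 3.0
variance = sum((x - mean) ** 2 for x in data) / len(data)
print(variance)  # 2.0
```

Dividing by n - 1 instead of n would give the sample variance (2.5).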


Coefficient Of variation
The coefficient of variation is a dimensionless relative measure of
dispersion that is defined as the ratio of the standard deviation to
the mean.
Two plants C and D of a factory show the following results about the number of
workers and the wages paid to them.

Plant D has greater variability in individual wages
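Since the two plants' wage figures are not reproduced here, the sketch below uses assumed wage samples to show how the coefficient of variation compares spread across groups whose means differ:

```python
import statistics

# Assumed wage samples for two plants; CV = standard deviation / mean.
wages = {
    "Plant C": [200, 210, 190, 205, 195],
    "Plant D": [100, 160, 80, 150, 110],
}

cv = {name: statistics.stdev(w) / statistics.mean(w) for name, w in wages.items()}
print({name: round(v, 3) for name, v in cv.items()})
```

With these assumed numbers, Plant D's CV is much larger than Plant C's, so Plant D shows greater relative variability even though a raw standard deviation comparison could be misleading when means differ.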


Exercise
▪ Calculating Central Tendency and dispersion measures
▪ Interpreting the results
Exercise
Background You have a sample of 11 people and their personal annual income.
Task 1 Calculate the mean, median and mode
Task 2 Try to interpret on the numbers you got

Annual income (11 values):
$ 62,000.00, $ 64,000.00, $ 49,000.00, $ 324,000.00, $ 1,264,000.00, $ 54,330.00,
$ 64,000.00, $ 51,000.00, $ 55,000.00, $ 48,000.00, $ 53,000.00

Task 1:
Mean $ 189,848.18
Median $ 55,000.00
Mode $ 64,000.00

Task 2: Income is a very interesting topic. There is extreme variability in the income of different individuals.
Generally, most people gravitate around a certain salary.
Moreover, in most countries there is a minimum salary, therefore most data points are constrained
between the minimum salary and some number.
Finally, there are certain individuals that earn much more than others. They are the outliers.
Usually, whenever we have research on income, we use the median income instead of the mean
income.
Income is an example where averages are meaningless. You should be aware that the correct
measure to use depends on the research that you are conducting.
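Task 1 can be checked directly with the 11 income values from the exercise (a Python sketch; SPSS produces the same figures):

```python
import statistics

# The 11 income values from the exercise.
income = [62000, 64000, 49000, 324000, 1264000, 54330,
          64000, 51000, 55000, 48000, 53000]

print(round(statistics.mean(income), 2))  # 189848.18
print(statistics.median(income))          # 55000
print(statistics.mode(income))            # 64000
```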
Exercise
Background You have the annual personal income of 11 people from the USA. You have the mean income from the exercise on mean, median and mode
Task 1 Decide whether you have to use sample or population formula for the variance
Task 2 Calculate the variance of their income
Task 3 Generally, what does this number tell you?

Annual income (11 values):
$ 62,000.00, $ 64,000.00, $ 49,000.00, $ 324,000.00, $ 1,264,000.00, $ 54,330.00,
$ 64,000.00, $ 51,000.00, $ 55,000.00, $ 48,000.00, $ 53,000.00
Mean: $ 189,848.18

Task 1: The question is asking if this is a sample or a population. In other words, are those all the people in the USA receiving salaries?
Obviously not. This is a sample, drawn from the population of all working people in the USA.

Task 2: Variance = $² 133,433,409,536.36

Task 3: There is great dispersion between the incomes of different people in the USA.
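Task 2 can be checked in Python; because the data are a sample, the sample variance (dividing by n - 1) is used, which is what `statistics.variance` computes:

```python
import statistics

# The same 11 income values; statistics.variance divides by n - 1
# (the sample variance), matching the exercise's answer.
income = [62000, 64000, 49000, 324000, 1264000, 54330,
          64000, 51000, 55000, 48000, 53000]

sample_variance = statistics.variance(income)
print(round(sample_variance, 2))  # 133433409536.36
```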
Exercise
▪ How to handle outliers: calculate the z-score and draw a boxplot
Exercise
• Find Q1, Q2, Q3 for the following dataset. Identify any outlier, and
draw a box plot
{5,40,42,46,48, 49, 50, 50,52, 53, 55,56,58,75,102}
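A Python sketch of this exercise, using the median-of-halves rule for quartiles (other quartile conventions interpolate and give slightly different values):

```python
import statistics

# The data set from the exercise, quartiles by the median-of-halves rule.
data = sorted([5, 40, 42, 46, 48, 49, 50, 50, 52, 53, 55, 56, 58, 75, 102])

half = len(data) // 2                  # 7 values on each side of the median
q2 = statistics.median(data)           # 50
q1 = statistics.median(data[:half])    # 46
q3 = statistics.median(data[-half:])   # 56
iqr = q3 - q1                          # 10

# Boxplot fences: values beyond 1.5 * IQR from the quartiles are outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 31.0, 71.0
outliers = [x for x in data if x < low or x > high]
print(q1, q2, q3, outliers)  # 46 50 56 [5, 75, 102]
```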
Activity
▪ Conduct the descriptive statistics for “Insurance Claim” file
Session 7: Statistical Distribution
What is a Distribution?
A distribution in statistics is a function that shows the possible values for a variable
and how often they occur. Think about a die.

It has six sides, numbered from 1 to 6.


We roll the die.

What is the probability of getting 1? 1/6


What is the probability of getting 2?
The same holds for 3, 4, 5 and 6.

Now. What is the probability of getting a 7? Zero

Uniform Distribution
What is a Distribution?
How about rolling two dice?
Distribution
▪ A distribution represents all possible values and their frequency of occurrence.
▪ It represents reality.
▪ We can understand everything we need from the shape of the distribution.
▪ Anything can generate a data set.
▪ Because we know that the sample approximates the shape of the population, we can draw a
connection between the sample and the population.
▪ The distribution of a dataset is a listing or plot showing all the possible values or
intervals of the data and how often they occur.
Normal Distribution
Normal Distribution

The normal distribution is the most
common type of distribution.

The normal distribution has two
parameters: the mean and the
standard deviation. (The standard
normal distribution is the special
case with mean 0 and standard
deviation 1.)
Why Normal Distribution is important?

▪ The normal distribution is symmetric and bell-shaped. This shape is useful
because it can be used to describe many populations, from classroom grades to
heights and weights.
▪ The Central Limit Theorem states that when the sample size is very large, the
sample mean will generally follow a normal distribution even if the original
population is not normally distributed.

This means that we can often use inferential statistical methods that assume
normality, even if the data in our sample doesn’t follow a normal distribution.
Exercise
▪ Discuss the various usage of Normal Distribution and its
descriptive statistics
Empirical Rule
The average is bigger but sigma is the same: improving product quality!

(Figure: the distribution shown against the Lower and Upper Specification Limits)
Skewness

Mean < Median < Mode        Mean = Median = Mode        Mean > Median > Mode


Commonly
observed
shapes of
Distributions
Can we approximate the
continuous distribution
to the Normal
distribution?
How to approximate the normal distribution
to standard Normal Distribution?

Z-Score Method
A z-score describes the position of a raw score in terms of its distance from
the mean when measured in standard deviation units. The z-score is positive
if the value lies above the mean and negative if it lies below the mean.
Standardization
Exercise
X        X - mean (mean = 0.97)        Z = (X - mean) / sigma
0.8 -0.17 -0.51515
1.6 0.63 1.909091
0.9 -0.07 -0.21212
0.8 -0.17 -0.51515
1.2 0.23 0.69697
0.4 -0.57 -1.72727
0.7 -0.27 -0.81818
1 0.03 0.090909
1.2 0.23 0.69697
1.1 0.13 0.393939
Mean of the z-scores = almost 0; sigma of the z-scores = almost 1
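The standardization above can be reproduced in Python; note that the table's z-scores use the sample standard deviation (about 0.33):

```python
import statistics

# The ten values from the table; z = (x - mean) / s with the
# sample standard deviation s.
x = [0.8, 1.6, 0.9, 0.8, 1.2, 0.4, 0.7, 1.0, 1.2, 1.1]
mean = statistics.mean(x)   # 0.97
s = statistics.stdev(x)     # about 0.33
z = [(v - mean) / s for v in x]

# After standardization, the mean is (almost) 0 and the sigma is (almost) 1.
print(round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
```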
Session 8: Statistical Analysis – Inferential (t-testing)
Statistical Analysis Types

Inferential tests that we will cover include:


▪ Testing of Hypothesis (mean)
▪ Crosstabulation
▪ Correlation
▪ Multiple Linear Regression
▪ Logistic regression
What is a Hypothesis?
▪ A hypothesis is used to define the relationship between two variables.

▪ It is a proposed explanation made on the basis of limited evidence as a


starting point for further investigation

Example: The mean monthly cell


phone bill of this city is EGP 500
for Red Group A
What is Testing of Hypothesis?
Hypothesis Testing is a type of statistical analysis in which you put your
assumptions about a population parameter to the test.

For Example: A teacher assumes that 60% of his college's students come from
lower-middle-class families.

What is a Statistical Hypothesis?
If the hypothesis is stated in terms of population parameters (such as mean and
variance), it is called a statistical hypothesis. Statistical hypotheses are divided into:

▪ Null hypothesis (denoted by H0) is a statement that the value of a


population parameter (such as proportion, mean, or standard deviation) is
equal to some claimed value. We either reject H0 or fail to reject H0

▪ Alternative hypothesis (denoted by H1 or Ha or HA) is the statement that the


parameter has a value that somehow differs from the Null Hypothesis

In SPSS, H0 always contains “=”. Example: H0: µ = EGP 500
What is a Statistical Hypothesis?
We have two kinds of alternative hypothesis:
(a) One-sided alternative hypothesis (one-tailed test)
Example: H1: µ < EGP 500 or µ > EGP 500
(b) Two-sided alternative hypothesis (two-tailed test)
In SPSS it is always ≠. Example: H1: µ ≠ EGP 500
Significance Level
The significance level (denoted by α) is the probability that the
test statistic will fall in the critical region when the null hypothesis
is actually true. A common choice for α is 0.05.

(Figure: the acceptance region covers 95% of the area around µH0;
accept H0 if the sample mean falls in this region. The rejection
regions cover 0.025 of the area in each tail; reject H0 if the sample
mean falls in either of these regions.)
P-Value
The p value is a number, calculated from a statistical test, that describes how
likely you are to have found a particular set of observations if the null hypothesis
were true.

The smaller the p value, the more likely you are to reject the null hypothesis.
How small is small enough?

The most common threshold is p < 0.05.


P-Value
• Determine the p-value
(part of the standard output of SPSS)

• Compare the p-value with α

• If p-value < α of 0.05, reject H0
• If p-value ≥ α of 0.05, do not reject H0

If the alternative hypothesis is one-tailed, then:
divide the p-value by 2 and compare the result with α of 0.05
Steps of Hypothesis Testing
❖ Step 1. Define null and alternative hypothesis
❖ Step 2. Choose the appropriate test at significance level of 5%
❖ Step 3: Perform the test
❖ Step 4: Determine the significance level α = 5%
❖ Step 5: Get the P-value (for one sided or two sided)
❖ Step 6: Determine the statistical significance.
❖ Step 7: Interpret the results (managerial Decision)
Exercise
Ford motor company has worked to reduce road noise inside the
cab of the redesigned F150 pickup truck. It would like to report
in its advertising that the truck is quieter. The average of the
prior design was 68 decibels at 60 mph.

• Formulate the statistical hypothesis to be suitable for SPSS


• How are you going to interpret the results?
Solution
• Formulate the statistical hypothesis (for SPSS):
H0: µ = 68 (the truck is not quieter), the status quo
HA: µ ≠ 68 (the truck is quieter), what Ford wants to support

• If the null hypothesis is rejected, Ford has sufficient evidence to support that the
truck is now quieter.
Types of Hypothesis Testing for Means


One Sample t-test
The population of pike in a lake was investigated for its content of
mercury (Hg). A sample of 10 pike of a certain size was caught and
the concentration of mercury was determined (unit: mg/kg). The
standard requires the average mercury level to be less than 0.9
mg/kg.

1. Explore the data through summary statistics


2. Should we close the lake and prohibit fishing?

For SPSS, you need a one column data set and a


test value (target value)
Step 1: H0: μ = 0.9 against the alternative H1: μ > 0.9 (one-tailed)
Step 2: One-sample t-test (against the quality standard)
Step 3: Significance level α = 5%
Step 4: Perform the test in SPSS
Step 5: P-value = 0.519/2
Step 6: P-value/2 > α
Step 7: Cannot reject H0
Do not close the lake
Further analysis is needed
Calculating the P-value for a one-sided t-test
• The p-value for this difference, using the one-tailed test, is 0.2595
(which is half the p-value for the two-tailed test, which SPSS reports to
be 0.519).
• Since this value is bigger than 0.05, the mercury concentration is not
significantly > 0.9
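The t statistic behind this SPSS output can be sketched by hand. The mercury values below are assumed to be the ten values shown in the earlier standardization exercise (they reproduce the mean of 0.97); SPSS converts this t, with n - 1 = 9 degrees of freedom, into the two-tailed p-value of 0.519:

```python
import math
import statistics

# One-sample t-test by hand: t = (mean - mu0) / (s / sqrt(n)).
# Assumed data: the ten mercury values (mg/kg) from the
# standardization exercise; test value mu0 = 0.9 as in the exercise.
hg = [0.8, 1.6, 0.9, 0.8, 1.2, 0.4, 0.7, 1.0, 1.2, 1.1]
mu0 = 0.9

n = len(hg)
mean = statistics.mean(hg)   # 0.97
s = statistics.stdev(hg)     # sample standard deviation
t = (mean - mu0) / (s / math.sqrt(n))
print(round(t, 3))  # 0.67
```

Halving the two-tailed p-value for the one-tailed test gives 0.2595 > 0.05, so H0 is not rejected, as the slide concludes.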
Activity
▪ Using “Insurance Claim” File:
Regarding the cost of claims, can you help us to check if the
cost of claim = 200?
Session 9: Statistical Analysis – Inferential (t-testing)
Comparing 2 Populations

Estimating two population values:

• Population means, independent samples: Group 1 vs. independent Group 2
• Paired samples: the same group before vs. after treatment
Independent Sample t-test

• A campaign to motivate citizens to reduce the consumption of petrol


was planned. Before the campaign was launched, an experiment was
carried out to evaluate the effectiveness of such a campaign. For the
experiment, the campaign was conducted in a small but
representative geographical area. Twelve families were randomly
selected from the area, and the amount of petrol (unit: litre) they used
was monitored for 1 month prior to the advertising campaign and for
1 month following the campaign.
• Unfortunately, the variable identifying the different families was
lost during a data conversion from one format to another, so you will
have to treat the data as two independent samples.
Step 1: H0: μ1 = μ2 against the alternative H1: μ1 > μ2
(one-tailed)
Step 2: Independent sample t-test
Step 3: Significance level α = 5%
Step 4: Perform the test in SPSS

For SPSS, you need two columns: one is the numerical data set
and the other is the factor column that represents the two
groups / categories included in the testing.
The p-value for Levene’s test for equality of variances is 0.618 which is
greater than 0.05 (our standard significance level), so we do not reject
the null hypothesis that the variances of the two groups are equal.
The p-value for the two-tailed test is 0.462, but since we have chosen a
one sided alternative hypothesis, we should use a one-tailed test.
Thus, we should divide the p value computed by SPSS by two, which
gives p = 0.231 > 0.05

Step 5: P-value = 0.462/2
Step 6: P-value/2 > α
Step 7: Cannot reject H0
The campaign is not successful! But look at the averages: isn't the
"after" really less than the "before"?
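The "equal variances assumed" t statistic can be sketched by hand; the litre values below are hypothetical stand-ins, since the slide's data file is not reproduced here:

```python
import math
import statistics

# Independent-samples t-test with pooled variance ("equal variances
# assumed" in SPSS). Hypothetical petrol consumption per family (litres).
group1 = [55, 60, 58, 62, 57, 59]   # before the campaign
group2 = [54, 58, 56, 60, 55, 57]   # after the campaign

n1, n2 = len(group1), len(group2)
m1, m2 = statistics.mean(group1), statistics.mean(group2)
v1, v2 = statistics.variance(group1), statistics.variance(group2)

# Pooled variance, then t with n1 + n2 - 2 degrees of freedom.
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t, 3))
```

SPSS reports the two-tailed p-value for this t; for a one-sided alternative the p-value is halved, as in the steps above.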
Paired Sample t-test

• By a stroke of luck, the original data file was found again, so the data
analysis can now be carried out according to the original design of the
experiment.
• However, in SPSS, the paired samples t-test requires data in a different
format than the format used in the independent samples t-test.
Step 1: H0: μ1 - μ2 = 0 against the alternative H1: μ1 - μ2 ≠ 0
(one-tailed)
Step 2: Paired sample t-test
Step 3: Significance level α = 5%
Step 4: Perform the test in SPSS

For SPSS, you need two columns: one data set representing the
"before" and the other representing the "after".
The p-value for the two-tailed test is 0.014, and since we are still
working under the same one-sided alternative hypothesis as above, we
should divide this by two to obtain the p-value for the corresponding
one-tailed test.
Since 0.007 < 0.05, the null hypothesis of no effect is rejected in favor of
the alternative (i.e. that the experimental campaign has led to a
statistically significant reduction in fuel consumption).

Step 5: P-value = 0.014/2
Step 6: P-value/2 < α
Step 7: Reject H0
The campaign is successful!
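The paired test can be sketched with hypothetical before/after values; pairing reduces the test to a one-sample t-test on the per-family differences, which removes the between-family variation that would otherwise mask the effect:

```python
import math
import statistics

# Paired-samples t-test: a one-sample t-test on the differences.
# Hypothetical before/after petrol values for six families (litres).
before = [55, 60, 58, 62, 57, 59]
after = [54, 58, 56, 60, 55, 57]

d = [b - a for b, a in zip(before, after)]   # per-family reduction
n = len(d)
t = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
print(round(t, 2))  # 11.0
```

With these assumed numbers the per-family reductions are very consistent, so the paired t is large even though the two group means are close.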
Important Note

• The experiment was set up according to a paired samples pre-test/post-test


design to eliminate the influence of variation between families, which was
considered to be a nuisance or noise factor in this case.
• If the variation between families is considerable, the paired t-test is the
appropriate test to use – otherwise the variation between families may
mask the effect of the treatment (campaign).
• That is actually what happened in the independent samples case, which did
not show a significant effect due to the amount of ‘noise’ introduced by the
variation between families. So, design and type of statistical test do matter!
Activity
▪ Using “Insurance Claim” File:
Regarding the costs of claims, is there a difference across
the gender of policy holder?
Session 10: Statistical Analysis – Inferential (ANOVA &
Crosstab)
One- Way ANOVA
• People who are concerned about their health may prefer hot
dogs that are low in salt and calories. The data are results of a
laboratory analysis of calories and sodium content of major hot
dog brands. The variables are:
1. Type: Type of hotdog (beef, meat, or poultry).
2. Calories: Calories per hot dog.
3. Sodium: Milligrams of sodium per hot dog.

• Start with the number of calories and formulate a null


hypothesis, test the hypothesis on the 5% level of significance.
Step 1: H0: μ1 = μ2 = μ3 against the alternative that at least one
mean is not equal (two-tailed)
Step 2: One-Way ANOVA
Step 3: Significance level α = 5%
Step 4: Perform the test in SPSS

For SPSS, you need two columns: one data set and the other
representing the factor or categories.
The p-value for ‘Between Groups’ in the ANOVA table is less than
0.001, which means that there is a highly significant difference
between at least two groups.
Step 5: P-Value < 0.001
Step 6: P-Value < α
Step 7: Reject H0
There is a significant difference: poultry hot dogs are lower in calories
than the two meat types.
Post Hoc: Tukey Test
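What SPSS computes behind the One-Way ANOVA button can be sketched in plain Python. The hot-dog data file is not reproduced here, so the three groups below are toy numbers chosen only to show the mechanics: the F statistic is the between-groups mean square divided by the within-groups mean square.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for one-way ANOVA: between-groups MS over within-groups MS."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-groups sum of squares: group size times squared mean deviation
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: deviations from each group's own mean
    ssw = sum((v - mean(g)) ** 2 for g in groups for v in g)
    msb = ssb / (k - 1)     # between-groups mean square
    msw = ssw / (n - k)     # within-groups mean square
    return msb / msw

# Toy calorie counts for three hot-dog types (hypothetical values)
beef    = [180, 190, 200]
meat    = [170, 180, 190]
poultry = [120, 130, 140]
f_stat = one_way_anova_f([beef, meat, poultry])
```

A large F (relative to the F distribution with k-1 and n-k degrees of freedom) leads to a small p-value and rejection of the equal-means hypothesis; a post hoc test such as Tukey's then pins down which pairs differ.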
Activity
▪ Complete the rest of enquiries related to “Insurance
Claim” case.
Activity
▪ Prepare the file “NGU perceptual Image Survey”.
Comparing frequencies of events - Crosstabs
An education expert gathered data about an education program and
wants to compare the frequencies among the various groups.
• H0: Type of program is independent of gender
• Ha: Type of program is related to gender
For a significance level of α = 5% (categorical data)
How to do it on SPSS?
Recall that the cross tabulation is a way to examine
the relationship between two variables. In SPSS, you
can get a cross tabulation in
Analyze > Descriptive Statistics > Crosstabs…
In other words, click on Analyze in the menu bar,
select Descriptive Statistics, then Crosstabs…
In the Crosstabs window, find and move the
independent and dependent variable to
Column(s): and Row(s): boxes, respectively. In
this case, the independent variable [sex] goes to
Column(s): box and the dependent variable
[attend] goes to Row(s): box.
In the Cell Display window, select the Column in the Percentages section. It
allows you to have percentage value separately calculated for each category of
the independent variable. Make sure that Observed was selected in the Counts
section although it should be checked by default.
Click on Continue and OK buttons to see the cross tabulation.
Step 1: H0: the two categorical variables are independent, against the
alternative that they are dependent
(two tailed)
Step 2: Crosstabs
Step 3: Significance level α = 5%
Step 4:
Decision Rule: if P-value < 0.05, reject H0 in favor of Ha
Chi-Square: used to test whether there is an association or not
Phi & Cramér's V: used to measure the strength of the association
Step 5: P-Value = 0.974
Step 6: P-Value > α
Step 7: Cannot reject H0
There is no significant association. All types of programs should be
promoted to everyone (regardless of gender).
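The chi-square statistic SPSS reports can be computed by hand: expected counts come from the row and column margins under independence, and the statistic sums (observed − expected)²/expected over all cells. A minimal sketch on a hypothetical 2×2 program-by-gender table (not the course's actual data):

```python
def chi_square(observed):
    """Pearson chi-square for a contingency table: sum of (O - E)^2 / E."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count under independence
            chi2 += (o - e) ** 2 / e
    return chi2

# Hypothetical counts: rows = program type, columns = gender
table = [[20, 30],
         [30, 20]]
chi2 = chi_square(table)   # df = (rows-1)*(cols-1) = 1 for a 2x2 table
```

The statistic is then compared against the chi-square distribution with the appropriate degrees of freedom; SPSS performs that lookup and prints the p-value in the Chi-Square Tests table.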
Session 11: Statistical Analysis – Inferential (Correlation and
Regression)
Correlation & Regression
▪ Correlation analysis is applied in quantifying the association
between two continuous variables, for example, a dependent
and independent variable or among two independent
variables.
▪ Regression analysis refers to assessing the relationship
between the outcome variable and one or more other variables. The
outcome variable is known as the dependent or response
variable, and the risk factors and confounders are known as
predictors or independent variables. The dependent variable is
denoted by “y” and independent variables by “x” in
regression analysis.
Correlation & Scatter Plot
Correlation Coefficient
• Correlation measures the strength of the linear association between two
variables. The Pearson correlation is used for quantitative values.
▪ It ranges between -1 and 1; 0 means no correlation.
▪ The sign indicates the direction of the relationship, not its strength.
▪ The closer to -1, the stronger the negative linear relationship; the closer
to 1, the stronger the positive linear relationship.
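The Pearson coefficient described above is straightforward to compute directly: the sum of cross-deviations from the means, divided by the product of the two root sums of squares. A sketch on toy data (hypothetical hours-vs-marks values, not the course's data set):

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy data: hours studied vs. exam mark (perfectly linear, so r = 1)
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 6, 8, 10]
r = pearson_r(hours, marks)
```

Reversing `marks` flips the sign of r to -1 while leaving its magnitude unchanged, which illustrates the bullet above: the sign carries direction, the magnitude carries strength.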
Exercise
A university faces a challenge, as the grades of its students are
declining. Students say that it is because of the long time spent
commuting back home. Professors say that it is because of the new pub
that opened two months ago.
Help University management to take a decision.
Answer: Computing the Pearson correlation coefficients, we get two
significant correlations and one non-significant correlation between
the dependent variable and the independent variables. Commuting is
not significantly correlated with grades, so it is not the reason behind
the declining grades.
Always use stepwise regression to rationally select the independent
variables affecting the dependent variable.
[Annotations on the correlation matrix: predictors that are highly
correlated could substitute for one another; one predictor shows no
significant correlation with the dependent variable.]
Regression Analysis
Linear regression is a linear approach to modelling the relationship
between a scalar response (the dependent variable) and one or more
independent variables. If the regression has one independent variable,
it is known as simple linear regression. If it has more than one
independent variable, it is known as multiple linear regression.
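For the simple (one-predictor) case, the least-squares estimates have a closed form: the slope is the sum of cross-deviations over the sum of squared x-deviations, and the intercept makes the line pass through the point of means. A minimal sketch on toy data (hypothetical values chosen so the fitted line is y = -1 + 2x):

```python
from statistics import mean

def fit_simple_linear(x, y):
    """Least-squares estimates for the line y = b0 + b1*x."""
    mx, my = mean(x), mean(y)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
         sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx            # the line passes through (mean(x), mean(y))
    return b0, b1

# Toy data (hypothetical)
x = [1, 2, 3, 4]
y = [1, 3, 5, 7]
b0, b1 = fit_simple_linear(x, y)
```

With more than one predictor, SPSS solves the analogous multivariate least-squares problem; the single-predictor formulas above are the special case behind the regression output shown in the following slides.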
Regression Analysis
Regression analysis is used to:
• Explain the impact of changes in an independent variable on the
dependent variable
• Predict the value of a dependent variable based on the value of at
least one independent variable
Dependent variable: the variable we wish to explain
Independent variables: the variables used to explain the dependent variable
[Diagram: the simple linear regression line]
y = β0 + β1·x + ε
β0 is the intercept and β1 is the slope. For each observed value xi, the
line gives the predicted value of y, and the random error εi is the
vertical gap between the observed and predicted values.
Model Building
Goal is to develop a model with the best set of independent variables
• Easier to interpret if unimportant variables are removed
• Lower probability of collinearity
Stepwise regression procedure
• Provides evaluation of alternative models as variables are added
Exercise
• A distributor of frozen dessert pies wants to evaluate factors thought
to influence demand
• Dependent variable: Pie sales (units per week)
• Independent variables: Price (in $), Advertising ($100s)
• Data is collected for 15 weeks

Week  Pie Sales  Price ($)  Advertising ($100s)
 1      350        5.50        3.3
 2      460        7.50        3.3
 3      350        8.00        3.0
 4      430        8.00        4.5
 5      350        6.80        3.0
 6      380        7.50        4.0
 7      430        4.50        3.0
 8      470        6.40        3.7
 9      450        7.00        3.5
10      490        5.00        4.0
11      340        7.20        3.5
12      300        7.90        3.2
13      440        5.90        4.0
14      450        5.00        3.5
15      300        7.00        2.7
Correlation Matrix

              Pie Sales   Price     Advertising
Pie Sales      1
Price         -0.44327    1
Advertising    0.55632    0.03044   1

• Price vs. Sales: r = -0.44327 → there is a negative association
• Advertising vs. Sales: r = 0.55632 → there is a positive association
Regression Table

Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15

ANOVA        df    SS          MS          F        Significance F
Regression    2    29460.027   14730.013   6.53861  0.01201
Residual     12    27033.306    2252.776
Total        14    56493.333

             Coefficients  Standard Error  t Stat    P-value  Lower 95%  Upper 95%
Intercept    306.52619     114.25389        2.68285  0.01993   57.58835  555.46404
Price        -24.97509      10.83213       -2.30565  0.03979  -48.57626   -1.37392
Advertising   74.13096      25.96732        2.85478  0.01449   17.55303  130.70888
How to Judge the Model?
• Model Efficiency
Adjusted R² = 0.44172: 44.2% of the variation in pie sales is explained
by the variation in price and advertising, taking into account the sample
size and number of independent variables.
• Is the model significant?
Significance F < 5%, thus the model is significant and we can build
upon it.
• Are individual variables significant?
See the P-value for every coefficient.
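The "model efficiency" figure comes from the adjusted R² formula, which penalizes R² for the number of predictors: R²_adj = 1 − (1 − R²)(n − 1)/(n − k − 1). Plugging in the values from the regression table reproduces the 0.44172 shown there:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2: penalizes R^2 for the number of predictors k given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Values from the regression table: R^2 = 0.52148, n = 15 weeks, k = 2 predictors
r2_adj = adjusted_r_squared(0.52148, 15, 2)
```

This is why adjusted R² (rather than raw R²) is the right number to quote when comparing models with different numbers of predictors.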
Equation Interpretation

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

where Sales is in number of pies per week, Price is in $, and
Advertising is in $100s.

b1 = -24.975: sales will decrease, on average, by 24.975 pies per week
for each $1 increase in selling price, net of the effects of changes due
to advertising.
b2 = 74.131: sales will increase, on average, by 74.131 pies per week
for each $100 increase in advertising, net of the effects of changes due
to price.
Using The Model to Make Predictions
Predict sales for a week in which the selling price is $5.50 and
advertising is $350:

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
      = 306.526 - 24.975(5.50) + 74.131(3.5)
      = 428.62

Note that Advertising is in $100s, so $350 means that x2 = 3.5.
Predicted sales is 428.62 pies.
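The prediction step above can be wrapped in a small helper using the coefficients from the regression table, with the $100s unit conversion handled inside so it cannot be forgotten (the function name is mine, not SPSS's):

```python
def predict_sales(price_dollars, advertising_dollars):
    """Predicted weekly pie sales from the fitted model in the regression table."""
    x2 = advertising_dollars / 100.0   # the model expects advertising in $100s
    return 306.526 - 24.975 * price_dollars + 74.131 * x2

sales = predict_sales(5.50, 350)       # the slide's scenario: price $5.50, advertising $350
```

Running different price/advertising combinations through such a helper is exactly the "simulation to get the best option" idea on the next slide.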
Simulation to get the best option
Exercise
A university faces a challenge, as the grades of its students are
declining. Students say that it is because of the long time spent
commuting back home. Professors say that it is because of the new pub
that opened two months ago.
Help University management to take a decision.

The Adjusted R squared = 0.732, so this model explains about 73% of the
variation in the dependent variable using the independent variable.

The overall linear regression model is significant since the F statistic is
31.066 with a p-value < 0.05.

Thus, we can write the estimated straight-line equation as:

Mark = 11.710 - 0.075 × Hrs at the pub

Interpretation: time spent at the pub explains about 73% of the variation
in the grades of the students included in the sample. University
management needs to contact the authorities to relocate the pub.
Activity
▪ Complete the enquiries related to “NGU Perceptual Image
Survey” .
Session 12: Statistical Analysis – Inferential (Logistic
Regression)
What is Logistic Regression?
• Logistic regression is useful for situations in which you want to be
able to predict the presence or absence of a characteristic or
outcome based on values of a set of predictor variables.
• It is similar to a linear regression model but is suited to models
where the dependent variable is dichotomous.
• Logistic regression is also useful when we expect interaction
between independent variables.
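Logistic regression turns a linear combination of predictors into a probability through the logistic (sigmoid) function, which is what makes it suitable for a dichotomous outcome. A minimal sketch with assumed coefficients (b0 and b1 below are illustrative, not estimates from any data in this course):

```python
import math

def predicted_probability(b0, b1, x):
    """Logistic model: P(outcome present) = 1 / (1 + e^-(b0 + b1*x))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Assumed coefficients b0 = 0, b1 = 1 for illustration
p_mid = predicted_probability(0.0, 1.0, 0.0)   # probability 0.5 where b0 + b1*x = 0
p_hi  = predicted_probability(0.0, 1.0, 3.0)   # probability rises toward 1 as x grows
```

Because the output is always strictly between 0 and 1, it can be read as the probability of the characteristic being present; classifying a case usually means comparing this probability to a cutoff such as 0.5.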
Interaction Effects
• Hypothesizes interaction between pairs of x variables
• Response to one x variable varies at different levels of
another x variable
• Contains two-way cross product terms

y = β0 + β1x1 + β2x2 + β3x3 + β4x1x2 + β5x1x3
    (basic terms)              (interactive terms)
Multinomial Logistic Regression
• Multinomial Logistic Regression is useful for situations in which you
want to be able to classify subjects based on values of a set of
predictor variables.
• This type of regression is similar to logistic regression, but it is more
general because the dependent variable is not restricted to two
categories.
For example, you can conduct a survey in which participants are asked to
select one of several competing products as their favorite. Using
multinomial logistic regression, you can create profiles of people who are
most likely to be interested in your product, and plan your advertising
strategy accordingly.
Exercise
As part of an effort to improve the marketing of its breakfast options, a
Consumer Packaged Goods company polls 880 people, noting their age,
gender, marital status, and whether or not they have an active lifestyle
(based upon whether they exercise at least twice a week).

Each participant then tasted 3 breakfast foods and was asked which one
they liked best.
• From the menus
choose:
• Analyze
Regression
Multinomial
Logistic...
• Select Preferred
breakfast as the
dependent
variable.
► Select Age
category,
Gender, and
Lifestyle as
factors.
► Click Model.
• Select
Custom/Stepwise.
► Select Main
effects from the
Build Terms
dropdown.
► Select agecat and
active as forced
entry terms.
► Click Continue.
► Click Statistics in
the Multinomial
Logistic Regression
dialog box.
• Select Cell
probabilities,
Classification table,
and Goodness of
fit.
► Click Continue.
► Click OK in the
Multinomial
Logistic
Regression
dialog box.
Decision: Sig < 0.05 – the final model outperforms the null model, so
the effect contributes to the model.

Decision (goodness of fit): Sig > 0.10, so the data are consistent with
the model assumptions.
The classification table shows the practical results of using the
multinomial logistic regression model.
Cells on the diagonal are correct predictions.
Cells off the diagonal are incorrect predictions.
Of the cases used to create the model, 118 of the 231 people who chose
the breakfast bar are classified correctly.
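The diagonal-versus-off-diagonal reading of the classification table can be summarized as an overall accuracy: correct (diagonal) predictions divided by all cases. The first row below uses the 118-of-231 breakfast-bar figure from the slide; the other rows are hypothetical, since the full table is not reproduced here.

```python
def classification_accuracy(confusion):
    """Overall accuracy: diagonal (correct) predictions over all cases."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Rows = observed breakfast choice, columns = predicted choice
confusion = [[118,  60,  53],    # 118 of 231 breakfast-bar choosers correct (from the slide)
             [ 40, 180,  50],    # hypothetical counts for the other two foods
             [ 30,  45, 200]]
acc = classification_accuracy(confusion)
```

Comparing this overall accuracy across candidate models (with different factor combinations) is one simple way to judge which model is "better" on the following slides.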
Options with various combinations of factors
This one is better. What to do next?
Save the predicted values, and keep the cases where the observed and
predicted categories are equal – you will get 3 groups in this case;
omit the rest.
For each of the three groups, start profiling the group using other
socio-demographic variables that were part of the questionnaire.
Activity
▪ Final Project “Electrical Car Acceptance Survey”
Next Step
Machine Learning vs.
Statistics
What is Machine Learning ?
Machine learning algorithms find patterns in data when they would be impractical or
impossible for a human to observe.

Once these patterns have been defined, machine learning tools can be used to forecast
future observations based on the discovered rules.
Simple machine learning models can be based on probability, while the most
advanced machine learning algorithms can leverage artificial intelligence to magnify
the predictive power of statistical modeling.
Practical examples
How does Amazon know to suggest toys and accessories for pet owners?
How did Target learn a teenager was pregnant before her family did?

Machine learning can be used to piece together various scraps of information,
including purchase history and shopping habits, to build a profile for customers.

With this data, retailers can predict what customers are most willing to purchase.
Misconceptions About Machine Learning
▪ Machine learning is AI.
▪ Machine learning can be used anywhere!
▪ Computers can actually “learn.”

Misconceptions About Statistics
▪ You always need a large sample size to use statistics.
▪ Visual representations of data are enough to determine significant differences.
In a nutshell…
Machine learning and statistics are intrinsically linked.
Machine learning is always based on statistics, but statistics is not always machine
learning.
A machine learning or statistical model is only as good as the practitioner who
deployed it.
It is important to keep in mind that a comprehensive understanding of and familiarity
with the data is key to choosing the most appropriate tools for the problem.
For a Data Exploration project:
▪ Tableau: can help in a data exploration task that involves geographical
locations, or many numbers that can be grouped to cluster information
using graphical representations, without the need for statistical
summaries beyond percentages, etc.
▪ PowerBI: a business intelligence tool that can work with data and also
provides the Query Editor to enhance the data cleaning and wrangling
process before you prepare the visualizations. It can be used when you
want to explore unclean data.
▪ Python: the crown jewel because of its versatility as a programming tool;
it can help individuals prepare data visualizations to dive deeper into the
data, while also enabling them to perform high-level statistical analysis to
ensure that no detail is missed.
For a Machine Learning project:
▪ Tableau: probably not the best choice when it comes to data exploration
with some predictive analytics involved.
▪ Power BI: its capabilities are limited when it comes to data wrangling,
though it can do a little here and there through integration with R to make
predictive analysis possible.
▪ Python: the best choice when it comes to predictive analytics; it helps data
scientists perform predictive analysis and derive metrics to check the
performance of their statistical models on the data.
Python Basics: A Practical Introduction to Python 3 (realpython.com)

A_Practical_Introduction_to_Python_Programming_Heinold.pdf (brianheinold.net)
THANK YOU
