Data Analysis and Visualization - To Send
Today, companies are finding that the best answers to these questions
come from another source entirely: large amounts of data and
computer-driven analysis that you rigorously leverage to make
predictions. This is called data-driven decision making (DDDM).
What is data-driven decision making?
This can help your company earn more money and flourish.
A data analyst is a person whose job is to gather and interpret data in order to
solve a specific problem. Data analysts require strong skills in data
management, statistical analysis, data visualization, and business
domain knowledge.
What is Data?
▪ Data represents raw elements or unprocessed facts,
ranging from numbers and symbols to text and images.
▪ When collected and observed without interpretation, these
elements remain just data—simple and unorganized.
▪ When these pieces are analyzed and contextualized, they
transform into something more meaningful.
What is Data Collection?
▪ Data collection is a term used to describe a process of
preparing and collecting data.
▪ It is the systematic gathering of data for a particular purpose from
various sources; the data is systematically observed,
recorded and organized.
▪ Data are the basic inputs to any decision-making process in
business.
Primary vs. Secondary Data
Types of Secondary Data
Examples of Secondary Data Sources
Advantages and Disadvantages of Secondary Data Sources
Sources of Primary Data
Examples: Sources of Quantitative Data
Session 2: Sampling Design
Population vs. Sample: Understanding the Difference
Population vs. Samples
Population
• It refers to the whole data set for the use case.
• It includes all groups which can be correlated with each other.
• Example: All the members of an online forum reading articles.
Samples
• It is a subset of the population.
• These are a random sample of data points.
• The process of determining the sample from population data is known as sampling.
• Example: A sample group of club members who read technical articles.
Advantages and Disadvantages of Sampling
Two Concerns...
Sampling Design
Sampling design is the method
you use to choose your sample.
There are several types of
sampling designs, and they all
serve as roadmaps for the
selection of your survey sample.
How do we choose what members of
the population to sample?
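As a rough illustration of two common sampling designs, the sketch below draws a simple random sample and a proportional stratified sample using Python's standard library. The population records, the "region" attribute, and the helper names are hypothetical and exist only for this example.

```python
import random

def simple_random_sample(population, k, seed=0):
    """Draw k members so that every member has the same chance of selection."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(population, k)

def stratified_sample(population, key, fraction, seed=0):
    """Split the population into strata, then sample the same fraction of each."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(key(member), []).append(member)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 100 members, 75 in region "A" and 25 in region "B".
population = [{"id": i, "region": "A" if i % 4 else "B"} for i in range(100)]

srs = simple_random_sample(population, 10)
strat = stratified_sample(population, key=lambda m: m["region"], fraction=0.2)

print(len(srs), "members in the simple random sample")
print(len(strat), "members in the stratified sample")
```

A simple random sample may over- or under-represent the smaller region by chance, while the stratified design keeps each region at its population share.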
Exercise
You are trying to estimate the average valuation of start-ups in
Egypt. Imagine you are able to visit 200 start-ups in Polaris, located in 6th
of October City, in a random manner. What is a possible problem with your
study?
More informative
Variable Data Types
Variable Data Types: Examples
Does the questionnaire have one or more types of data?
Exercise
It starts with the goals you set for your survey in the first place. This
guide will help you create a data analysis plan that will effectively
utilize the data your respondents provided.
Customer Satisfaction Survey
Steps to Create an Analysis Plan
1. Review your goals
At the end of the process, you should be able to answer your major
research questions.
DDDM Checklist
See this checklist:
IC-Data-Driven-Decision-Making-Checklist-Template-10545_PDF.pdf (smartsheet.com)
References
Data Analysis Plan: Ultimate Guide and Examples (voiceform.com)
IC-Data-Driven-Decision-Making-Checklist-Template-10545_PDF.pdf (smartsheet.com)
Fact: In 2021, consulting firm Gartner stated that bad data quality costs
organizations an average of $12.9 million per year.
Another figure that's still often cited comes from IBM, which estimated
that data quality issues in the U.S. cost $3.1 trillion in 2016.
A data set that meets all of these measures is much more reliable and
trustworthy than one that does not.
However, these are not necessarily the only standards that organizations use
to assess their data sets.
For example, they might also take into account qualities such as
appropriateness, credibility, relevance, reliability or usability. The goal is to
ensure that the data fits its intended purpose and that it can be trusted.
What is Data Preparation?
▪ Data preparation is the process of cleaning and transforming raw data
prior to processing and analysis.
Descriptive vs. Inferential Statistics
Tools
• Descriptive: Measures of central tendency and measures of dispersion.
• Inferential: Hypothesis testing and regression analysis.
Use
• Descriptive: Organizes, describes and presents data in a meaningful way with the help of charts and graphs.
• Inferential: Tests, predicts, and compares data obtained from various samples.
Relevance
• Descriptive: It is used to summarize known data in a way that can be used for further predictions and analysis.
• Inferential: It tries to use the summarized samples to draw conclusions about the population.
Descriptive Statistics
Descriptive Analysis is the
type of analysis of data that
helps describe, show or
summarize data points in a
constructive way such that
patterns might emerge that
fulfill every condition of the
data.
Mean
Median
Mode
Interquartile Range (IQR)
Standard Deviation
Variance
• Variance reflects the degree of spread in the data set. The more
spread the data, the larger the variance is in relation to the mean.
Example data: 1, 2, 3, 4, 5 (mean = 3)
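A minimal sketch of how variance is computed by hand for the five values 1 to 5 (mean 3). It shows both the population formula (divide by n) and the sample formula (divide by n - 1); the variable names are illustrative only.

```python
values = [1, 2, 3, 4, 5]
n = len(values)

mean = sum(values) / n  # 3.0

# Squared distance of every value from the mean: 4, 1, 0, 1, 4
squared_deviations = [(v - mean) ** 2 for v in values]

# Population formula divides by n; sample formula divides by n - 1.
population_variance = sum(squared_deviations) / n
sample_variance = sum(squared_deviations) / (n - 1)

print("Population variance:", population_variance)  # 2.0
print("Sample variance:", sample_variance)          # 2.5
```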
Annual income
$ 62,000.00
$ 64,000.00
$ 49,000.00
$ 324,000.00
$ 1,264,000.00
$ 54,330.00
$ 64,000.00
$ 51,000.00
$ 55,000.00
$ 48,000.00
$ 53,000.00
Task 1:
Mean $ 189,848.18
Median $ 55,000.00
Mode $ 64,000.00
Task 2: Income is a very interesting topic. There is extreme variability in the income of different individuals.
Generally, most people gravitate around a certain salary.
Moreover, in most countries there is a minimum salary, therefore most data points are constrained
between the minimum salary and some number.
Finally, there are certain individuals that earn much more than others. They are the outliers.
Usually, whenever we have research on income, we use the median income instead of the mean
income.
Income is an example where averages are meaningless. You should be aware that the correct
measure to use depends on the research that you are conducting.
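The three measures for the 11 incomes in this exercise can be reproduced with Python's standard `statistics` module. This is only a sketch to check the figures; the variable names are illustrative.

```python
import statistics

# The 11 annual incomes from the exercise, in USD.
incomes = [62000, 64000, 49000, 324000, 1264000, 54330,
           64000, 51000, 55000, 48000, 53000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
mode_income = statistics.mode(incomes)

print(f"Mean   $ {mean_income:,.2f}")    # Mean   $ 189,848.18
print(f"Median $ {median_income:,.2f}")  # Median $ 55,000.00
print(f"Mode   $ {mode_income:,.2f}")    # Mode   $ 64,000.00
```

The two very large incomes pull the mean far above what most people in the list earn, while the median stays close to the typical value.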
Exercise
Background: You have the annual personal income of 11 people from the USA. You have the mean income from the exercise on mean, median and mode.
Task 1: Decide whether you have to use the sample or the population formula for the variance.
Task 2: Calculate the variance of their income.
Task 3: Generally, what does this number tell you?
Answer to Task 1: The question is asking if this is a sample or a population. In other words, are those all the people in the USA receiving salaries?
Obviously not. This is a sample, drawn from the population of all working people in the USA.
Answer to Task 3: There is great dispersion between the income of different people in the USA.
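One possible way to carry out Task 2 is sketched below, using the sample formula because the 11 people are a sample. It relies on `statistics.variance` and `statistics.stdev`, which both divide by n - 1; the incomes are the ones listed in the mean, median and mode exercise.

```python
import statistics

# The 11 annual incomes from the earlier exercise, in USD.
incomes = [62000, 64000, 49000, 324000, 1264000, 54330,
           64000, 51000, 55000, 48000, 53000]

# Sample variance and sample standard deviation (n - 1 in the denominator).
sample_variance = statistics.variance(incomes)
sample_std_dev = statistics.stdev(incomes)

print(f"Sample variance: {sample_variance:,.2f} (squared dollars)")
print(f"Sample standard deviation: $ {sample_std_dev:,.2f}")
```

The standard deviation is easier to interpret than the variance because it is expressed in dollars rather than squared dollars.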
Exercise
▪ How to handle outliers: calculate the z-score and draw a box plot
Exercise
• Find Q1, Q2, Q3 for the following dataset. Identify any outlier, and
draw a box plot
{5,40,42,46,48, 49, 50, 50,52, 53, 55,56,58,75,102}
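A sketch of one way to check this exercise in Python. It uses `statistics.quantiles` with its default (exclusive) method and the common 1.5 x IQR rule for the box plot fences; other quartile conventions can give slightly different Q1 and Q3 values, so treat the method choice as an assumption.

```python
import statistics

data = [5, 40, 42, 46, 48, 49, 50, 50, 52, 53, 55, 56, 58, 75, 102]

# Quartiles with the default "exclusive" method of statistics.quantiles.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Box plot fences: points beyond 1.5 * IQR from the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, Q2, Q3:", q1, q2, q3)           # 46.0 50.0 56.0
print("IQR:", iqr)                         # 10.0
print("Fences:", lower_fence, upper_fence) # 31.0 71.0
print("Outliers:", outliers)               # [5, 75, 102]
```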
Activity
▪ Conduct the descriptive statistics for “Insurance Claim” file
Session 7: Statistical Distribution
What is a Distribution?
A distribution in statistics is a function that shows the possible values for a variable
and how often they occur. Think about a die.
Uniform Distribution
What is a Distribution?
How about rolling two dice?
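The distribution for one die and for the sum of two dice can be listed by enumerating every equally likely outcome. The sketch below does this with exact fractions; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# One fair die: every face has the same probability (a uniform distribution).
one_die = {face: Fraction(1, 6) for face in faces}

# Two dice: count how many of the 36 equally likely pairs give each sum.
sum_counts = Counter(a + b for a, b in product(faces, repeat=2))
two_dice = {total: Fraction(count, 36)
            for total, count in sorted(sum_counts.items())}

for total, probability in two_dice.items():
    print(total, probability)
```

The sum of two dice is no longer uniform: 7 is the most likely total (6 of 36 outcomes) while 2 and 12 are the least likely (1 of 36 each).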
Distribution
▪ A distribution represents all possible values and their frequency of occurrence.
▪ It represents reality.
▪ We can understand everything we need from the shape of the distribution.
▪ Everything can generate a data set.
▪ Because we know that the sample approximates the shape, we can draw a connection
between the sample and the population.
▪ The distribution of a dataset is a listing or plot showing all the possible values or
intervals of the data and how often they occur.
Normal Distribution
This means that we can often use inferential statistical methods that assume
normality, even if the data in our sample doesn’t follow a normal distribution.
Exercise
▪ Discuss the various uses of the Normal Distribution and its
descriptive statistics
Empirical Rule
Average is bigger but sigma is the same !!!
Improving product quality!!!
Lower Specification Limit
Upper Specification Limit
Skewness
Z-Score Method
A z-score describes the position of a raw score in terms of its distance from
the mean when measured in standard deviation units. The z-score is positive
if the value lies above the mean and negative if it lies below the mean.
Standardization
Exercise
X      X - mean (0.97)   Z = (X - mean) / sigma
0.8    -0.17             -0.51515
1.6     0.63              1.909091
0.9    -0.07             -0.21212
0.8    -0.17             -0.51515
1.2     0.23              0.69697
0.4    -0.57             -1.72727
0.7    -0.27             -0.81818
1       0.03              0.090909
1.2     0.23              0.69697
1.1     0.13              0.393939
Mean of Z = almost 0
Sigma of Z = almost 1
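The standardization in this exercise can be reproduced with a few lines of Python. The sketch assumes sigma is the sample standard deviation (about 0.33 for these values); the table appears to use sigma rounded to 0.33, so the z-scores computed here differ from it slightly in the later decimals.

```python
import statistics

x = [0.8, 1.6, 0.9, 0.8, 1.2, 0.4, 0.7, 1.0, 1.2, 1.1]

mean_x = statistics.mean(x)    # about 0.97
sigma_x = statistics.stdev(x)  # sample standard deviation, about 0.33

# z-score: distance from the mean measured in standard deviation units.
z = [(value - mean_x) / sigma_x for value in x]

for value, score in zip(x, z):
    print(f"{value:4.1f}  {value - mean_x:6.2f}  {score:9.5f}")

print("Mean of z:", round(statistics.mean(z), 6))    # almost 0
print("Sigma of z:", round(statistics.stdev(z), 6))  # almost 1
```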
Session 8: Statistical Analysis – Inferential (t-testing)
Statistical Analysis Types
For Example: A teacher assumes that 60% of his college's students come from
lower-middle-class families.
The significance level (α) tested for is 0.05.
Two-tailed test: 0.025 of area in each tail, centered on µH0.
Rejection region: reject H0 if the sample mean falls
in either of these regions.
P-Value
The p value is a number, calculated from a statistical test, that describes how
likely you are to have found a particular set of observations if the null hypothesis
were true.
The smaller the p value, the more likely you are to reject the null hypothesis.
How small is small enough?
• If the null hypothesis is rejected, Ford has sufficient evidence to support that the
truck is now quieter.
Types of testing of hypothesis of Means
One Sample t-test
The population of pike in a lake was investigated for its content of
mercury (Hg). A sample of 10 pike of a certain size was caught and
the concentration of mercury was determined (unit: mg/kg). The
standard accepts the average Mercury level to be less than 0.9
mg/kg.
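A minimal sketch of the one-sample t statistic for a case like this. The ten measurements are not given on this slide, so the values below are placeholders borrowed from the standardization exercise elsewhere in these slides; treating them as the pike data is an assumption. The hypotheses (H0: mean = 0.9 against H1: mean < 0.9) are also an assumed reading of the standard, and 1.833 is the one-tailed critical t value for alpha = 0.05 with 9 degrees of freedom.

```python
import math
import statistics

# Placeholder measurements in mg/kg (assumed, see the note above).
sample = [0.8, 1.6, 0.9, 0.8, 1.2, 0.4, 0.7, 1.0, 1.2, 1.1]
mu0 = 0.9  # hypothesized mean under H0, from the standard

n = len(sample)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# t statistic: how many standard errors the sample mean lies from mu0.
t_stat = (sample_mean - mu0) / standard_error

# Lower-tailed test (H1: mean < 0.9): reject H0 only if t is below -1.833.
t_critical = 1.833  # one-tailed, alpha = 0.05, df = n - 1 = 9
reject_null = t_stat < -t_critical

print(f"t = {t_stat:.3f}, reject H0: {reject_null}")
```

In SPSS the same test would report a p-value directly; the critical-value comparison here is just the stdlib-only equivalent of that decision.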
• By a stroke of luck, the original data file was found again, so the data
analysis can now be carried out according to the original design of the
experiment.
• However, in SPSS, the paired samples t-test requires data in a different
format than the format used in the independent samples t-test.
Step 1: H0: μ1 - μ2 = zero against the alternative H1: μ1 - μ2 ≠ zero
(one tail)
Step 2: Paired sample t-test
Step 3: Significance level = 5%
Step 4:
For SPSS, you need two columns: one data set representing the before and the other data set
representing the after.
The p-value for the two-tailed test is 0.014, and since we are still
working under the same one-sided alternative hypothesis as above, we
should divide this by two to obtain the p-value for the corresponding
one-tailed test.
Since 0.007 < 0.05, the null hypothesis of no effect is rejected in favor of
the alternative (i.e. that the experimental campaign has led to a
statistically significant reduction in fuel consumption).
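The arithmetic behind this decision can be sketched in a few lines. The 0.014 and the 5% significance level come from the slides; halving the two-tailed p-value is only valid when the observed difference points in the direction of the one-sided alternative, which is assumed here.

```python
alpha = 0.05          # significance level from Step 3
p_two_tailed = 0.014  # p-value reported for the two-tailed test

# One-sided alternative: halve the two-tailed p-value
# (assumes the observed effect is in the hypothesized direction).
p_one_tailed = p_two_tailed / 2

reject_null = p_one_tailed < alpha

print(f"One-tailed p-value: {p_one_tailed:.3f}")  # 0.007
print("Reject H0:", reject_null)                  # True
```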
Post Hoc, Tukey Test
Activity
▪ Complete the rest of enquiries related to “Insurance
Claim” case.
Activity
▪ Prepare the file “NGU perceptual Image Survey”.
Comparing frequencies of events - Crosstabs
Dependent variable
No significant correlation
Regression Analysis
[Figure: simple linear regression line of y against x, showing the intercept β0, the slope β1, the predicted value of y for xi, and the random error εi for this x value]
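To make the intercept, slope and error terms concrete, here is a small least-squares sketch in plain Python. The function name `fit_line` and the data points are hypothetical and are not taken from the course files; the sketch only illustrates how β0, β1 and R squared are obtained for a simple linear regression.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx              # slope
    b0 = mean_y - b1 * mean_x   # intercept
    # Residuals are the estimated random errors: observed minus predicted y.
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return b0, b1, r_squared

# Hypothetical data points, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, r_squared = fit_line(xs, ys)
print(f"Intercept = {b0:.3f}, slope = {b1:.3f}, R squared = {r_squared:.3f}")
```

R squared is the share of the variation in y that the fitted line accounts for, which is the quantity interpreted in a regression output.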
Model Building
ANOVA df SS MS F Significance F
The overall linear regression model is significant: the F statistic is 31.066 and
its Significance F (p-value) is < 0.05.
Interpretation: time spent at the pub explains 70% of the variation in the grades
of the students included in the sample. The university's management needs to
contact the authorities to relocate the pub.
Activity
▪ Complete the enquiries related to “NGU Perceptual Image
Survey” .
Session 12: Statistical Analysis – Inferential (Logistic Regression)
What is Logistic Regression?
𝑦 = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x1 x2 + β5 x1x2
For Example, you can conduct a survey in which participants are asked to
select one of several competing products as their favorite. Using
multinomial logistic regression, you can create profiles of people who are
most likely to be interested in your product, and plan your advertising
strategy accordingly.
Exercise
Each participant then tasted 3 breakfast foods and was asked which one
they liked best.
• From the menus choose: Analyze > Regression > Multinomial Logistic...
• Select Preferred breakfast as the dependent variable.
► Select Age category, Gender, and Lifestyle as factors.
► Click Model.
• Select Custom/Stepwise.
► Select Main effects from the Build Terms dropdown.
► Select agecat and active as forced entry terms.
► Click Continue.
► Click Statistics in the Multinomial Logistic Regression dialog box.
• Select Cell probabilities, Classification table, and Goodness of fit.
► Click Continue.
► Click OK in the Multinomial Logistic Regression dialog box.
Decision: Sig < 0.05, so the final model outperforms the
null model and the effect contributes to the model.
This one is better. What to do next?
Save the predicted values and keep the cases where the observed and predicted
categories are equal; you will get 3 groups in this case. Omit the rest.
Then profile each of the three groups using the other
socio-demographic variables that were part of the questionnaire.
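Outside SPSS, the same filtering and profiling idea could be sketched as follows. The records, the category labels ("option_1" to "option_3") and the "gender" field are hypothetical stand-ins for the saved predictions and the questionnaire variables.

```python
from collections import Counter, defaultdict

# Hypothetical cases: observed choice, model prediction, one profiling variable.
cases = [
    {"observed": "option_1", "predicted": "option_1", "gender": "F"},
    {"observed": "option_1", "predicted": "option_2", "gender": "M"},
    {"observed": "option_2", "predicted": "option_2", "gender": "M"},
    {"observed": "option_3", "predicted": "option_3", "gender": "F"},
    {"observed": "option_3", "predicted": "option_3", "gender": "M"},
    {"observed": "option_2", "predicted": "option_1", "gender": "F"},
]

# Keep only correctly predicted cases, grouped by their category.
groups = defaultdict(list)
for case in cases:
    if case["observed"] == case["predicted"]:
        groups[case["predicted"]].append(case)

# Profile each group on a socio-demographic variable (gender here).
profiles = {name: Counter(member["gender"] for member in members)
            for name, members in groups.items()}

for name, profile in sorted(profiles.items()):
    print(name, dict(profile))
```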
Activity
▪ Final Project “Electrical Car Acceptance Survey”
Next Step
Machine Learning vs. Statistics
What is Machine Learning?
Machine learning algorithms find patterns in data when they would be impractical or
impossible for a human to observe.
Once these patterns have been defined, machine learning tools can be used to forecast
future observations based on found rules.
Simple machine learning models can be based on probability, while the most
advanced machine learning algorithms can leverage artificial intelligence to magnify
the predictive power of statistical modeling.
Practical examples
How does Amazon know to suggest toys and accessories for pet owners?
How did Target learn a teenager was pregnant before her family did?
With this data, retailers can predict what customers are most willing to purchase.
Misconceptions About Machine Learning
▪ Machine learning is AI.
▪ Machine learning can be used anywhere!
▪ Computers can actually “learn.”
Machine learning is always based on statistics, but statistics is not always machine
learning.
▪ Tableau: probably not the best choice when it comes to data exploration with
some predictive analytics involved.
▪ Power BI: capabilities are limited when it comes to data wrangling, although
integration with R makes some predictive
analysis possible.
▪ Python: best choice when it comes to predictive analytics, it helps data scientists
perform predictive analysis and derive metrics to check the performance of their
statistical models on the data.
Python Basics: A Practical Introduction to Python 3 (realpython.com)
A_Practical_Introduction_to_Python_Programming_Heinold.pdf (brianheinold.net)
THANK YOU