Module 2 - Statistical Foundations

Module 2 covers statistical foundations including descriptive statistics, statistical distributions, dimensionality reduction techniques, Bayesian statistics, and statistical modeling. Key concepts include summarizing and visualizing data, identifying outliers, understanding how variables relate, and using statistics to analyze data and make inferences. Statistical methods are divided into descriptive statistics, which summarize data, and inferential statistics, which are used to study samples and infer conclusions about populations.

Uploaded by

Farheen Nawazi

Module 2 - Statistical Foundations
• Descriptive statistics, Statistical Features,
summarizing the data, outlier analysis,
Understanding distributions and plots,
Univariate statistical plots and usage, Bivariate
and multivariate statistics, Dimensionality
Reduction, Over and Under Sampling,
Bayesian Statistics, Statistical Modeling for
data analysis
Basics of Statistics
• Definition: Science of collection, presentation, analysis,
and reasonable interpretation of data.
• Statistics presents a rigorous scientific method for
gaining insight into data. For example, suppose we
measure the weight of 100 patients in a study. With so
many measurements, simply looking at the data fails to
provide an informative account. However, statistics can
give an instant overall picture of the data through graphical
presentation or numerical summarization, irrespective of
the number of data points. Besides data summarization,
another important task of statistics is to make inferences
and predict relations between variables.
A Taxonomy of Statistics
Statistics in Data Science
• Statistics is a set of mathematical methods and tools that
enable us to answer important questions about data. It is
divided into two categories:
– Descriptive Statistics - this offers methods to
summarise data by transforming raw observations into
meaningful information that is easy to interpret and share.
– Inferential Statistics - this offers methods to study
experiments done on small samples of data and extend
the inferences to the entire population (entire domain).
General Statistics Skills
• How to define statistically answerable questions for
effective decision making.
• Calculating and interpreting common statistics and how
to use standard data visualization techniques to
communicate findings.
• Understanding of how mathematical statistics is applied to
the field, concepts such as the central limit theorem and
the law of large numbers.
• Making inferences from estimates of location and
variability (ANOVA).
• How to identify the relationship between target variables
and independent variables.
• How to design statistical hypothesis testing experiments,
A/B testing, and so on.
• How to calculate and interpret performance metrics like
p-value, alpha, Type I and Type II errors, and so on.
Important Statistics Concepts
• Getting Started: Understanding types of data (rectangular
and non-rectangular), estimate of location, estimate of
variability, data distributions, binary and categorical data,
correlation, relationship between different types of variables.
• Distribution of Statistic: random numbers, the law of
large numbers, Central Limit Theorem, standard error, and
so on.
• Data sampling and Distributions: random sampling,
sampling bias, selection bias, sampling distribution,
bootstrapping, confidence interval, normal distribution,
t-distribution, binomial distribution, chi-square distribution,
F-distribution, Poisson and exponential distribution.
Six-Step Problem Solving Model
• This technique uses an analytical approach to solve any given
problem. As the name suggests, this technique uses 6 steps to
solve a problem, which are:
1. Have a clear and concise problem definition.
2. Study the roots of the problem.
3. Brainstorm possible solutions to the problem.
4. Examine the possible solutions and choose the best one.
5. Implement the solution effectively.
6. Evaluate the results.
• This model follows the mindset of continuous development and
improvement. So, at step 6, if your results didn’t turn out the
way you wanted, you can go back to step 4 and choose another
solution, or to step 1 and try to define the problem differently.
Statistical Description of Data
• Statistics describes a numeric set of data by its
• Center
• Variability
• Shape
• Statistics describes a categorical set of data by
• Frequency, percentage or proportion of each category
Some Definitions
Variable - any characteristic of an individual or entity. A variable can take different
values for different individuals. Variables can be categorical or quantitative. Per S. S.
Stevens…
• Nominal - Categorical variables with no inherent order or ranking sequence, such as names or classes
(e.g., gender). Values may be written as numerals but carry no numerical meaning (e.g., I, II, III). The only
operation that can be applied to Nominal variables is enumeration.
• Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe. Can be compared for
equality, or greater or less, but not how much greater or less.
• Interval - Values of the variable are ordered as in Ordinal, and additionally, differences between
values are meaningful, however, the scale is not absolutely anchored. Calendar dates and
temperatures on the Fahrenheit scale are examples. Addition and subtraction, but not
multiplication and division are meaningful operations.
• Ratio - Variables with all properties of Interval plus an absolute, non-arbitrary zero point, e.g. age,
weight, temperature (Kelvin). Addition, subtraction, multiplication, and division are all meaningful
operations.
Some Definitions
Distribution - (of a variable) tells us what values the variable takes and how often it
takes these values.
• Unimodal - having a single peak
• Bimodal - having two distinct peaks
• Symmetric - left and right half are mirror images.
Frequency Distribution
Consider a data set of 26 children of ages 1-6 years. Then the frequency
distribution of variable ‘age’ can be tabulated as follows:

Frequency Distribution of Age

Age        1  2  3  4  5  6
Frequency  5  3  7  5  4  2

Grouped Frequency Distribution of Age

Age Group  1-2  3-4  5-6
Frequency    8   12    6
Cumulative Frequency
Cumulative frequency of the data on the previous page:

Age                   1  2   3   4   5   6
Frequency             5  3   7   5   4   2
Cumulative Frequency  5  8  15  20  24  26

Age Group             1-2  3-4  5-6
Frequency               8   12    6
Cumulative Frequency    8   20   26
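
The frequency, grouped frequency, and cumulative frequency tables above can be reproduced with a short Python sketch. The raw ages of the 26 children are not listed in the slides, so the `ages` list here is rebuilt from the frequency table; the variable names are illustrative only.

```python
from collections import Counter
from itertools import accumulate

# The raw ages are not given in the text; rebuild them from the frequency table.
table = {1: 5, 2: 3, 3: 7, 4: 5, 5: 4, 6: 2}
ages = [age for age, count in table.items() for _ in range(count)]

# Ungrouped frequency distribution: how often each age occurs.
freq = Counter(ages)
values = sorted(freq)
counts = [freq[v] for v in values]

# Cumulative frequency: running total of the counts.
cumulative = list(accumulate(counts))

# Grouped frequency distribution with groups of width 2: 1-2, 3-4, 5-6.
grouped = Counter()
for age in ages:
    low = age - (age - 1) % 2          # lower bound of the age's group
    grouped[f"{low}-{low + 1}"] += 1

print(counts)          # frequencies for ages 1..6
print(cumulative)      # cumulative frequencies
print(dict(grouped))   # grouped frequencies
```

The last cumulative value always equals the number of observations (here 26), which is a quick sanity check on any frequency table.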
Data Presentation
Two types of statistical presentation of data - graphical and numerical.

Graphical Presentation: We look for the overall pattern and for striking deviations from
that pattern. The overall pattern is usually described by the shape, center, and spread of the data.
An individual value that falls outside the overall pattern is called an outlier.

Bar diagrams and pie charts are used for categorical variables.

Histograms, stem-and-leaf plots, and box plots are used for numerical variables.
Data Presentation –Categorical Variable
Bar Diagram: Lists the categories and presents the percent or count of individuals who fall
in each category.

Figure 1: Bar Chart of Subjects in Treatment Groups (x-axis: Treatment Group; y-axis: Number of Subjects)

Treatment Group  Frequency  Proportion     Percent (%)
1                15         15/60 = 0.250  25.0
2                25         25/60 = 0.417  41.7
3                20         20/60 = 0.333  33.3
Total            60         1.00           100
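
As a quick check of the proportion and percent columns, the following minimal Python sketch recomputes them from the three group frequencies given in the table (15, 25, and 20 subjects). The variable names are illustrative.

```python
# Frequencies per treatment group, taken from the table in the text.
counts = {1: 15, 2: 25, 3: 20}
total = sum(counts.values())

rows = []
for group, count in counts.items():
    proportion = count / total            # share of all subjects
    percent = 100 * proportion            # same share expressed in percent
    rows.append((group, count, round(proportion, 3), round(percent, 1)))

for row in rows:
    print(row)
```

The proportions must sum to 1 and the percentages to 100, which makes swapped or mistyped entries easy to spot.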
Data Presentation –Categorical Variable
Pie Chart: Lists the categories and presents the percent or count of individuals who fall in
each category.

Figure 2: Pie Chart of Subjects in Treatment Groups (slices: Group 1 25%, Group 2 42%, Group 3 33%; total 60 subjects)
Graphical Presentation –Numerical Variable

Histogram: Overall pattern can be described by its shape, center, and spread. The
following age distribution is right skewed. The center lies between 80 to 100. No
outliers.

Figure 3: Age Distribution (histogram; x-axis: Age in Month; y-axis: Number of Subjects)

Summary statistics:
Mean                90.41666667
Standard Error      3.902649518
Median              84
Mode                84
Standard Deviation  30.22979318
Sample Variance     913.8403955
Kurtosis            -1.183899591
Skewness            0.389872725
Range               95
Minimum             48
Maximum             143
Sum                 5425
Count               60
Graphical Presentation –Numerical Variable

Box-Plot: Describes the five-number summary

Figure 3: Distribution of Age (box plot; vertical axis from 0 to 160; marked values: min, q1, median, q3, max)
Numerical Presentation
A fundamental concept in summary statistics is that of a central value for a set of
observations and the extent to which the central value characterizes the whole
set of data. Measures of central value such as the mean or median must be
coupled with measures of data dispersion (e.g., average distance from the
mean) to indicate how well the central value characterizes the data as a whole.

To understand how well a central value characterizes a set of observations, let
us consider the following two sets of data:
A: 30, 50, 70
B: 40, 50, 60
The mean of both data sets is 50. But the distance of the observations from
the mean in data set A is larger than in data set B. Thus, the mean of data
set B is a better representation of its data than the mean of set A is.
Methods of Center Measurement

Center measurement is a summary measure of the overall level of a dataset

Commonly used methods are mean, median, mode, geometric mean etc.

Mean: The sum of all observations divided by the number of observations. The mean
of 20, 30, 40 is (20+30+40)/3 = 30.
Notation: Let $x_1, x_2, \dots, x_n$ be n observations of a variable
x. Then the mean of this variable is

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Methods of Center Measurement

Median: The middle value in an ordered sequence of observations. That is, to find
the median we need to order the data set and then find the middle value. In case of
an even number of observations the average of the two middle most values is the
median. For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data
giving {3, 5, 6, 7, 9}, then choose the middle value 6. If the number of observations
is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the average of the two middle
values from the sorted sequence, in this case, (5 + 6) / 2 = 5.5.

Mode: The value that is observed most frequently. The mode is undefined for
sequences in which no observation is repeated.
Mean or Median
The median is less sensitive to outliers (extreme scores) than the mean and thus a
better measure than the mean for highly skewed distributions, e.g. family income. For
example mean of 20, 30, 40, and 990 is (20+30+40+990)/4 =270. The median of these
four observations is (30+40)/2 =35. Here 3 observations out of 4 lie between 20-40.
So, the mean 270 really fails to give a realistic picture of the major part of the data. It
is influenced by extreme value 990.
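
The worked examples for mean, median, and mode, and the effect of the extreme value 990, can be checked with Python's standard `statistics` module. The helper `mode_or_none` is a hypothetical name introduced here to follow the slide's convention that the mode is undefined when no observation repeats, and the list `[1, 2, 2, 3]` is an illustrative input not taken from the slides.

```python
from collections import Counter
from statistics import mean, median

def mode_or_none(values):
    """Return the most frequent value, or None when no observation repeats."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > 1 else None

print(mean([20, 30, 40]))              # (20+30+40)/3
print(median([9, 3, 6, 7, 5]))         # odd count: the middle value
print(median([9, 3, 6, 7, 5, 2]))      # even count: average of the two middle values
print(mode_or_none([9, 3, 6, 7, 5]))   # no repeated value
print(mode_or_none([1, 2, 2, 3]))      # illustrative list with one repeated value

# One extreme value pulls the mean far more than the median.
skewed = [20, 30, 40, 990]
print(mean(skewed), median(skewed))
```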
Methods of Variability Measurement

Variability (or dispersion) measures the amount of scatter in a dataset.

Commonly used methods: range, variance, standard deviation, interquartile range,
coefficient of variation, etc.

Range: The difference between the largest and the smallest observations. The range of
10, 5, 2, 100 is (100-2)=98. It’s a crude measure of variability.
Methods of Variability Measurement

Variance: The (sample) variance of a set of observations is the sum of the squares of the
deviations of the observations from their mean, divided by n − 1. In symbols, the variance of the n
observations $x_1, x_2, \dots, x_n$ is

$$S^2 = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$

Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is

$$\frac{(5-5)^2 + (3-5)^2 + (7-5)^2}{3-1} = 4$$

Standard Deviation: Square root of the variance. The standard deviation of the above
example is 2.
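
A minimal Python sketch of the range, sample variance, and standard deviation, using the standard `statistics` module (its `variance` and `stdev` functions use the n − 1 divisor shown above). It also revisits data sets A and B from the earlier slide, which share a mean of 50 but differ in spread.

```python
from statistics import stdev, variance

data = [5, 7, 3]
print(variance(data))   # sample variance, divisor n - 1
print(stdev(data))      # square root of the sample variance

# Range of the earlier example 10, 5, 2, 100.
values = [10, 5, 2, 100]
value_range = max(values) - min(values)
print(value_range)

# Sets A and B have the same mean (50) but different variability.
var_a = variance([30, 50, 70])
var_b = variance([40, 50, 60])
print(var_a, var_b)
```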
Methods of Variability Measurement

Quartiles: Ordered data can be divided into four regions that cover the total range of observed
values, each holding about a quarter of the observations. The cut points between these regions are known as quartiles.

In notation, the q-th quartile of a data set is the ((n+1)/4)·q-th observation of the ordered data, where q is
the desired quartile (1, 2, or 3) and n is the number of observations.

The first quartile (Q1) is the 25% cut point: about 25% of the data lie at or below it. The second quartile (Q2)
is the 50% cut point, which is the median. The third quartile (Q3) is the 75% cut point: about 25% of the data
lie between the median and Q3.

Q1 is the median of the first half of the ordered observations and Q3 is the median of
the second half of the ordered observations.
Methods of Variability Measurement

In the following example Q1 = ((15+1)/4)·1 = 4th observation of the data. The 4th observation is
11, so Q1 of this data is 11.

An example with 15 numbers:

3 6 7 11 13 22 30 40 44 50 52 61 68 80 94
(Q1 = 11 is the 4th value, Q2 = 40 is the 8th value, Q3 = 61 is the 12th value)

The first quartile is Q1=11. The second quartile is Q2=40 (this is also the median). The
third quartile is Q3=61.

Inter-quartile Range: Difference between Q3 and Q1. The inter-quartile range of the previous
example is 61 − 11 = 50. The middle half of the ordered data lie between 11 and 61.
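
The ((n+1)/4)·q rule can be sketched in Python as follows. The function name `quartile` is illustrative, and the linear interpolation used when the position is not a whole number is an assumption on top of the slides, which only show a case where the position is an integer.

```python
def quartile(data, q):
    """q-th quartile (q = 1, 2, 3) using the ((n + 1) / 4) * q position rule."""
    xs = sorted(data)
    pos = (len(xs) + 1) * q / 4        # 1-based position in the ordered data
    lower = int(pos)                   # whole part of the position
    frac = pos - lower                 # fractional part, used for interpolation
    if lower < 1:
        return xs[0]
    if lower >= len(xs):
        return xs[-1]
    # Assumption: interpolate linearly between neighbours for fractional positions.
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
q1, q2, q3 = (quartile(data, q) for q in (1, 2, 3))
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Other software may use slightly different position rules, so quartiles of small data sets can differ between tools.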
Deciles and Percentiles
Deciles: If data is ordered and divided into 10 parts, then the cut points are called deciles.

Percentiles: If data is ordered and divided into 100 parts, then the cut points are called
percentiles. The 25th percentile is Q1, the 50th percentile is the median (Q2), and the 75th
percentile of the data is Q3.

In notation, the p-th percentile of a data set is the ((n+1)/100)·p-th observation of the ordered data, where p
is the desired percentile and n is the number of observations.

Coefficient of Variation: The standard deviation of the data divided by its mean. It is usually
expressed in percent.

$$\text{Coefficient of Variation} = \frac{S}{\bar{x}} \times 100$$
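
As a worked example, the coefficient of variation of the earlier data 5, 7, 3 (mean 5, standard deviation 2) is 2/5 × 100 = 40%. A minimal Python sketch, where the function name `coefficient_of_variation` is illustrative and the sample standard deviation (n − 1 divisor) is assumed:

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, expressed in percent."""
    return stdev(values) / mean(values) * 100

cv = coefficient_of_variation([5, 7, 3])
print(round(cv, 1))
```

Because it is unit-free, the coefficient of variation allows the spread of variables measured on different scales to be compared.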
Five Number Summary

Five Number Summary: The five number summary of a distribution consists of the
smallest (Minimum) observation, the first quartile (Q1), the median (Q2), the third
quartile (Q3), and the largest (Maximum) observation, written in order from smallest
to largest.

Box Plot: A box plot is a graph of the five number summary. The central box spans
the quartiles. A line within the box marks the median. Lines extending above and
below the box mark the smallest and the largest observations (i.e., the range).
Outlying samples may be additionally plotted outside the range.
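
A minimal sketch of the five number summary for the 15-number example used earlier, based on `statistics.quantiles` from the Python standard library. Its `method="exclusive"` option places the cut points using n + 1 positions, which matches the ((n+1)/4)·q rule in this deck; the function name `five_number_summary` is illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of the data."""
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    return (min(values), q1, q2, q3, max(values))

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
summary = five_number_summary(data)
print(summary)
```

These five values are exactly what a box plot draws: the box spans Q1 to Q3, the inner line marks the median, and the whiskers reach the minimum and maximum.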
Boxplot
Figure: Distribution of Age in Month (box plot; vertical axis from 0 to 160; marked values: min, q1, median, q3, max)
Choosing a Summary
The five number summary is usually better than the mean and standard deviation for
describing a skewed distribution or a distribution with extreme outliers. The mean and
standard deviation are reasonable for symmetric distributions that are free of outliers.

In real life we can’t always expect symmetry of the data. It’s common practice to report the
number of observations (n), mean, median, standard deviation, and range when
summarizing data. We can include other summary statistics such as Q1, Q3, and the
coefficient of variation if they are considered important for describing the data.
Shape of Data
• Shape of data is measured by
– Skewness
– Kurtosis
Skewness
• Measures asymmetry of data
– Positive or right skewed: Longer right tail
– Negative or left skewed: Longer left tail

Let $x_1, x_2, \dots, x_n$ be n observations. Then,

$$\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}}$$
Kurtosis
• Measures peakedness of the distribution of data. With the
formula below, which subtracts 3, the kurtosis of a normal distribution is 0.

Let $x_1, x_2, \dots, x_n$ be n observations. Then,

$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{2}} - 3$$
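
The skewness and kurtosis formulas can be sketched in plain Python as below. This assumes the moment-based forms (with the √n factor in the skewness and the −3 correction in the kurtosis); spreadsheet functions often apply additional small-sample adjustments, so their output can differ slightly. The input lists are illustrative, not data from the slides.

```python
def skewness(xs):
    """sqrt(n) * sum((x - mean)^3) / (sum((x - mean)^2))^(3/2)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs)
    s3 = sum((x - m) ** 3 for x in xs)
    return n ** 0.5 * s3 / s2 ** 1.5

def kurtosis(xs):
    """n * sum((x - mean)^4) / (sum((x - mean)^2))^2 - 3 (excess kurtosis)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs)
    s4 = sum((x - m) ** 4 for x in xs)
    return n * s4 / s2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]            # illustrative symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]     # illustrative data with a long right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed))
```

A symmetric data set gives a skewness of zero, while a long right tail gives a positive value, in line with the definitions above.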
Summary of the Variable ‘Age’ in the given data set

Mean                90.41666667
Standard Error      3.902649518
Median              84
Mode                84
Standard Deviation  30.22979318
Sample Variance     913.8403955
Kurtosis            -1.183899591
Skewness            0.389872725
Range               95
Minimum             48
Maximum             143
Sum                 5425
Count               60

Figure: Histogram of Age (x-axis: Age in Month; y-axis: Number of Subjects)
Figure: Boxplot of Age in Month (vertical axis: Age (month))
Class Summary (First Part)
So far we have learned-

Statistics and data presentation/data summarization

Graphical Presentation: Bar Chart, Pie Chart, Histogram, and Box Plot
Numerical Presentation: Measuring Central value of data (mean, median, mode etc.),
measuring dispersion (standard deviation, variance, co-efficient of variation, range, inter-
quartile range etc), quartiles, percentiles, and five number summary
Outlier Analysis
What Are Outliers?

• Outlier: A data object that deviates significantly from the normal objects as if it
were generated by a different mechanism
– Ex.: Unusual credit card purchase; in sports: Michael Jordan, Wayne Gretzky, ...
• Outliers are different from noise
– Noise is random error or variance in a measured variable
– Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: at an early stage a novel pattern is treated as an outlier, but later it is merged into
the model
• Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
Types of Outliers (I)
• Three kinds: global, contextual and collective outliers
• Global outlier (or point anomaly)
– Object is Og if it significantly deviates from the rest of the data set
– Ex. Intrusion detection in computer networks
– Issue: Find an appropriate measurement of deviation
• Contextual outlier (or conditional outlier)
– Object is Oc if it deviates significantly based on a selected context
– Ex. 80°F in Urbana: outlier? (depends on whether it is summer or winter)
– Attributes of data objects should be divided into two groups
• Contextual attributes: define the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
– Can be viewed as a generalization of local outliers, whose density
significantly deviates from its local area
– Issue: How to define or formulate meaningful context?
Types of Outliers (II)
• Collective Outliers
– A subset of data objects collectively deviate significantly
from the whole data set, even if the individual data objects
may not be outliers
– Applications: E.g., intrusion detection:
• When a number of computers keep sending denial-of-service
packets to each other
• Detection of collective outliers
– Consider not only behavior of individual objects, but also that of
groups of objects
– Need to have background knowledge on the relationship
among data objects, such as a distance or similarity measure
on objects
• A data set may have multiple types of outlier
• One object may belong to more than one type of outlier
Challenges of Outlier Detection
• Modeling normal objects and outliers properly
– Hard to enumerate all possible normal behaviors in an application
– The border between normal and outlier objects is often a gray area
• Application-specific outlier detection
– Choice of distance measure among objects and the model of
relationship among objects are often application-dependent
– E.g., in clinical data a small deviation could be an outlier, while in
marketing analysis larger fluctuations are normal
• Handling noise in outlier detection
– Noise may distort the normal objects and blur the distinction
between normal objects and outliers. It may help hide outliers and
reduce the effectiveness of outlier detection
• Understandability
– Understand why these are outliers: justification of the detection
– Specify the degree of an outlier: the unlikelihood of the object being
generated by a normal mechanism
Outlier Detection (1): Statistical Methods
• Statistical methods (also known as model-based methods) assume that the normal data
follow some statistical model (a stochastic model)
– The data not following the model are outliers.
• Example (figure): First use a Gaussian distribution to model the normal data
– For each object y in region R, estimate g_D(y), the probability that y fits the
Gaussian distribution
– If g_D(y) is very low, y is unlikely to have been generated by the Gaussian model,
and is thus an outlier
• Effectiveness of statistical methods: highly depends on whether the
assumption of the statistical model holds in the real data
• There are rich alternatives to use various statistical models
– E.g., parametric vs. non-parametric
Statistical Approaches
• Statistical approaches assume that the objects in a data set are generated by
a stochastic process (a generative model)
• Idea: learn a generative model fitting the given data set, and then identify the
objects in low probability regions of the model as outliers
• Methods are divided into two categories: parametric vs. non-parametric
• Parametric method
– Assumes that the normal data is generated by a parametric distribution
with parameter θ
– The probability density function of the parametric distribution f(x, θ)
gives the probability that object x is generated by the distribution
– The smaller this value, the more likely x is an outlier
• Non-parametric method
– Does not assume an a priori statistical model; the model is determined from
the input data
– Not completely parameter-free, but the number and nature of
the parameters are flexible and not fixed in advance
– Examples: histogram and kernel density estimation
Parametric Methods I: Detection Univariate Outliers Based on
Normal Distribution
• Univariate data: A data set involving only one attribute or variable
• Often assume that data are generated from a normal distribution, learn the
parameters from the input data, and identify the points with low probability as
outliers
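• Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
– Use the maximum likelihood method to estimate μ and σ
– Taking derivatives of the log-likelihood with respect to μ and σ², we derive the following
maximum likelihood estimates:

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

– For the above data with n = 10, we have μ̂ = 28.61 and σ̂ = √2.38 ≈ 1.54
– Then (24 – 28.61)/1.54 ≈ –2.99, so 24 lies about three standard deviations below the mean; it is
flagged as an outlier since the μ ± 3σ region contains 99.7% of the data under the normal
distribution assumption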
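
The maximum likelihood estimates and the z-score of the value 24.0 can be recomputed with the Python standard library. `statistics.pstdev` uses the divisor n, which corresponds to the maximum likelihood estimate of σ under the normal model. The 3σ cutoff is the rule of thumb used in the example, not a fixed law.

```python
from statistics import fmean, pstdev

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

mu = fmean(temps)        # maximum likelihood estimate of the mean
sigma = pstdev(temps)    # maximum likelihood estimate of sigma (divisor n)

# z-score of the suspicious value: distance from the mean in units of sigma.
z = (24.0 - mu) / sigma

print(round(mu, 2), round(sigma, 2), round(z, 2))
```

The recomputed z-score sits very close to −3, so the decision for 24.0 is sensitive to rounding and to the choice between the n and n − 1 divisors.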
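Parametric Methods I: Grubbs' Test
• Univariate outlier detection: Grubbs' test (maximum normed residual
test), another statistical method under normal distribution
– For each object x in a data set, compute its z-score $z = |x - \bar{x}| / s$, where $\bar{x}$ and $s$ are
the mean and standard deviation of the data; x is an outlier if

$$z \ge \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$

where $t^2_{\alpha/(2N),\,N-2}$ is the value taken by a t-distribution (with N − 2 degrees of freedom) at a significance
level of α/(2N), and N is the # of objects in the data set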
Parametric Methods II: Detection of Multivariate
Outliers
• Multivariate data: A data set involving two or more attributes or variables
• Transform the multivariate outlier detection task into a univariate outlier
detection problem
• Method 1. Compute Mahalanobis distance
– Let ō be the mean vector for a multivariate data set. The Mahalanobis
distance from an object o to ō is $MDist(o, \bar{o}) = (o - \bar{o})^T S^{-1} (o - \bar{o})$, where S is
the covariance matrix
– Use Grubbs' test on this measure to detect outliers
• Method 2. Use the χ²-statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}$$

– where $o_i$ is the value of object o on the i-th dimension, $E_i$ is the mean of the i-th dimension
among all objects, and n is the dimensionality
– If the χ²-statistic is large, then object o is an outlier
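
A minimal pure-Python sketch of Method 1 for two-dimensional data follows. It computes MDist(o, ō) = (o − ō)ᵀ S⁻¹ (o − ō) with a hand-inverted 2×2 sample covariance matrix. The function name and the six points are hypothetical illustration data, and the sample covariance (n − 1 divisor) is an assumption since the slides do not specify the divisor.

```python
def mahalanobis_2d(points):
    """Return (o - mean)^T S^-1 (o - mean) for every 2-D point."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix S = [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Inverse of S is (1/det) * [[syy, -sxy], [-sxy, sxx]].
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Hypothetical data: five clustered points and one distant point.
points = [(2, 2), (2, 4), (4, 2), (4, 4), (3, 3), (10, 10)]
distances = mahalanobis_2d(points)
farthest = points[distances.index(max(distances))]
print([round(d, 2) for d in distances], farthest)
```

The resulting one-dimensional distances can then be passed to a univariate test, which is the reduction from multivariate to univariate outlier detection described above.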
Parametric Methods III: Using Mixture of Parametric
Distributions
• Assuming that the data are generated by a single normal distribution can
sometimes be an oversimplification
• Example (figure): The objects between the two
clusters cannot be captured as outliers since they are
close to the estimated mean
• To overcome this problem, assume the normal data is generated by two
normal distributions. For any object o in the data set, the probability that
o is generated by the mixture of the two distributions is given by

$$\Pr(o \mid \theta_1, \theta_2) = f_{\theta_1}(o) + f_{\theta_2}(o)$$

where $f_{\theta_1}$ and $f_{\theta_2}$ are the probability density functions of the two normal distributions with parameters θ1 and θ2
• Then use the EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from the data
• An object o is an outlier if it does not belong to any cluster
Non-Parametric Methods: Detection Using Histogram

• The model of normal data is learned from the input
data without any a priori structure.
• Often makes fewer assumptions about the data, and
thus can be applicable in more scenarios
• Outlier detection using histogram:
– Figure shows the histogram of purchase amounts in transactions
– A transaction in the amount of $7,500 is an outlier, since only 0.2% of
transactions have an amount higher than $5,000
• Problem: Hard to choose an appropriate bin size for the histogram
– Too small a bin size → normal objects in empty/rare bins, false positives
– Too big a bin size → outliers in some frequent bins, false negatives
• Solution: Adopt kernel density estimation to estimate the probability
density distribution of the data. If the estimated density function is high,
the object is likely normal. Otherwise, it is likely an outlier.
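
The histogram idea, and its sensitivity to bin size, can be sketched as follows. The purchase amounts, the function name `rare_bin_outliers`, and the 6% threshold are all hypothetical choices for illustration; the slides do not provide the underlying transaction data.

```python
from collections import Counter

def rare_bin_outliers(values, bin_width, min_share):
    """Flag values that fall into histogram bins holding less than min_share of the data."""
    bins = Counter(int(v // bin_width) for v in values)
    n = len(values)
    return [v for v in values if bins[int(v // bin_width)] / n < min_share]

# Hypothetical purchase amounts in dollars (not from the slides).
amounts = [120, 250, 310, 450, 480, 520, 670, 760, 890, 940, 1300,
           1500, 1800, 2100, 2400, 2600, 3100, 3300, 4200, 4700, 7500]

wide = rare_bin_outliers(amounts, bin_width=1000, min_share=0.06)
narrow = rare_bin_outliers(amounts, bin_width=100, min_share=0.06)

print(wide)          # only the isolated large amount is flagged
print(len(narrow))   # very narrow bins flag many normal transactions as well
```

With the narrow bins almost every transaction sits alone in its bin, which is the false-positive problem noted above and the motivation for kernel density estimation.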
Analysis Strategies
• Why do we have to have them?
– People who read our ‘research’ are interested in the highlights
– Should try to communicate findings in an understandable and ‘painless’ fashion
Three types of analysis
• Univariate analysis
– the examination of the distribution of cases on
only one variable at a time (e.g., college
graduation)
• Bivariate analysis
– the examination of two variables simultaneously
(e.g., the relation between gender and college
graduation)
• Multivariate analysis
– the examination of more than two variables
simultaneously (e.g., the relationship between
gender, race, and college graduation)
“Purpose”
• Univariate analysis
– Purpose: description
• Bivariate analysis
– Purpose: determining the empirical relationship between the two variables
• Multivariate analysis
– Purpose: determining the empirical relationship among the variables
Univariate Analysis
• Involves examination of the distribution of
cases on only ONE variable at a time
• Frequency distributions are listings of the
number of cases in each attribute of a variable
– Ungrouped frequency distribution
– Grouped frequency distribution
• Proportions express the number of cases of the
criterion variable as part of the total
population: frequency of the criterion variable
divided by N
• Percentages are simply 100 × proportion
– Or [100 × (frequency of criterion variable
divided by N)]
• Rates make comparisons more
meaningful by controlling for population
differences
Measures of Central Tendency
• Measures of central tendency reflect the
central tendencies of a distribution
– Mode reflects the attribute with the
greatest frequency
– Median reflects the attribute that cuts the
distribution in half
– Mean reflects the average: sum of the
values divided by # of cases
Measures of Dispersion
• Measures of dispersion reflect the spread of
the distribution
– Range is the difference between largest & smallest
scores; high – low
– Variance is the average of the squared differences
between each observation and the mean
– Standard deviation is the square root of variance
Types of Variables
• Continuous: values increase steadily in tiny
fractions
• Discrete: values jump from category to category
Subgroup Comparisons
• Somewhere between univariate & bivariate
are Subgroup Comparisons
• Present descriptive univariate data for each of
several subgroups
– Ratios: compare the number of cases in one
category with the number in another
Contingency Tables
• Format: attributes of independent variable are
used as column headings and attributes of the
dependent variable are used as row headings
• Guidelines for presenting & interpreting
contingency tables
– Contents of table described in title
– Attributes of each variable clearly described
– Base on which percentages are computed should be shown
– Norm is to percentage down & compare across
– Table should indicate # of cases omitted from analysis
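
The "percentage down & compare across" guideline can be sketched in Python. The records below are hypothetical counts invented for illustration (the slides give no data): gender is treated as the independent variable (columns) and college graduation as the dependent variable (rows).

```python
from collections import Counter

# Hypothetical (gender, outcome) records; the counts are illustrative only.
records = (
    [("women", "graduated")] * 30
    + [("women", "did not graduate")] * 20
    + [("men", "graduated")] * 20
    + [("men", "did not graduate")] * 30
)

cells = Counter(records)                                 # count per table cell
column_totals = Counter(gender for gender, _ in records) # base for each column

# Percentage down: each cell as a percent of its column (independent variable) total.
percent = {
    (gender, outcome): 100 * count / column_totals[gender]
    for (gender, outcome), count in cells.items()
}

for outcome in ("graduated", "did not graduate"):
    row = [f"{percent[(g, outcome)]:.1f}% of {g} (N={column_totals[g]})"
           for g in ("women", "men")]
    print(outcome, "|", " | ".join(row))
```

Each column sums to 100%, and the comparison is then read across a row, with the base N shown so the reader knows what the percentages were computed on.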
Bivariate Analysis
• Bivariate analysis focuses on the
relationship between two variables
Bivariate Analysis Examples
Multivariate Analysis
• Multivariate analysis allows the separate
and combined effects of the
independent variables to be examined
Multivariate Analysis
• Many statistical techniques focus on just one
or two variables
• Multivariate analysis (MVA) techniques allow
more than two variables to be analysed at
once
– Multiple regression is not typically included under
this heading, but can be thought of as a
multivariate analysis
Many Variables
• Commonly have many relevant variables in market
research surveys
– E.g. one not atypical survey had ~2000 variables
– Typically researchers pore over many crosstabs
– However it can be difficult to make sense of these, and the
crosstabs may be misleading
• MVA can help summarise the data
– E.g. factor analysis and segmentation based on agreement
ratings on 20 attitude statements
• MVA can also reduce the chance of obtaining
spurious results
Multivariate Analysis Methods
• Two general types of MVA technique
– Analysis of dependence
• Where one (or more) variables are dependent
variables, to be explained or predicted by others
– E.g. Multiple regression, PLS, MDA
– Analysis of interdependence
• No variables thought of as “dependent”
• Look at the relationships among variables, objects or
cases
– E.g. cluster analysis, factor analysis
Multivariate Analysis Example
• Example 1. A researcher has collected data on three psychological variables, four
academic variables (standardized test scores), and the type of educational program
the student is in for 600 high school students. She is interested in how the set of
psychological variables is related to the academic variables and the type of program
the student is in.
• Example 2. A doctor has collected data on cholesterol, blood pressure, and weight.
She also collected data on the eating habits of the subjects (e.g., how many ounces
of red meat, fish, dairy products, and chocolate consumed per week). She wants to
investigate the relationship between the three measures of health and eating habits.
• Example 3. A researcher is interested in determining what factors influence the
health African Violet plants. She collects data on the average leaf diameter, the mass
of the root ball, and the average diameter of the blooms, as well as how long the
plant has been in its current container. For predictor variables, she measures several
elements in the soil, as well as the amount of light and water each plant receives.
• The study to determine the possible causes of a medical condition, such as heart disease. An
initial survey of non-disease males is conducted and data are collected on age, body weight,
height, serum cholesterol, phospholipids, blood glucose, diet, and many other putative factors.
The history of these males is followed and it is determined if and when they may be diagnosed
with heart disease.
• • Determining the value of an apartment. Factors possibly related to the value are size of the
apartment, age of the building, number of bedrooms, number of bathrooms, and location (e.g.
floor, view, etc.).
• • A medical study is conducted to determine the effects of air pollution on lung function.
Because you can’t assign people randomly to treatment groups (i.e. a rural environment with
no air quality concerns vs. a large city with air quality issues), a research chose four cohorts
that live in areas with very different air quality and each location is close to an air-monitoring
device. The researchers took measures of lung function on each individual at two different
time periods by recording one breath. Data collected on this breath included length of time for
the inhale and exhale, speed and force of the exhale, and amount of air exhaled after one
second and at the mid-point of the exhale. The air quality at these two times was also
recorded.
• Political surveys to determine which qualities in a candidate are most important in garnering
popularity.
Case Study
• Fake News Detection
• Road Lane Line Detection
• Sentiment Analysis
• Detecting Parkinson’s Disease
• Color Detection [Fruit]
• Leaf Disease Detection
• Chatbot Project
Case Study 1: How can we improve client acquisition rate?
Case Study 2: How do I create a sales incentive model?
Case Study 3: How can I price more accurately?
• https://www.baselismail.com/data-science-guide-and-case-studies/
• https://data-flair.training/blogs/data-science-case-studies/
• https://data-flair.training/blogs/data-science-use-cases/
• https://eleuven.github.io/statthink/exercise-solutions.html#chapter-1
Graphic Presentation
• The Pie Chart
• The Bar Graph
• The Statistical Map
• The Histogram
• Statistics in Practice
• The Frequency Polygon
• Time Series Charts
• Distortions in Graphs
It is important to choose the appropriate graphs
to make statistical information coherent.
The Pie Chart: The Race and Ethnicity of the Elderly
• Pie chart: a graph showing the differences in
frequencies or percentages among categories
of a nominal or an ordinal variable. The
categories are displayed as segments of a
circle whose pieces add up to 100 percent of
the total frequencies.
Too many categories can be messy!

[Pie chart. Legend: White, Black, American Indian, Asian, Pacific Islander, Two or more. Slice labels: 87.7%, 8.3%, 2.8%, 0.8%, 0.6%, 0.5%. N = 35,919,174]

Figure 3.1 Annual Estimates of U.S. Population 65 Years and Over by Race, 2003
We can reduce some of the categories

[Pie chart. Legend: White, Black, Other and two or more. Slice labels: 87.7%, 8.3%, 4%. N = 35,919,174]
Figure 3.2 Annual Estimates of U.S. Population 65 Years and Over, 2003
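The step from Figure 3.1 to Figure 3.2 can be sketched in code. The Python snippet below is only an illustration, not the method used to produce the figures: it converts category counts to percentages and merges every category below a chosen cutoff into one slice. The function names, the 10 percent cutoff, and the counts are all hypothetical.

```python
def percentage_shares(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def collapse_small(counts, min_percent, other_label="Other"):
    """Merge categories whose share is below min_percent into one category."""
    shares = percentage_shares(counts)
    collapsed = {}
    other_total = 0
    for label, n in counts.items():
        if shares[label] < min_percent:
            other_total += n          # too small to show as its own slice
        else:
            collapsed[label] = n
    if other_total:
        collapsed[other_label] = other_total
    return collapsed


# Hypothetical counts for five categories (not taken from the figures).
counts = {"A": 700, "B": 200, "C": 50, "D": 30, "E": 20}

reduced = collapse_small(counts, min_percent=10.0)
reduced_shares = percentage_shares(reduced)

for label, share in reduced_shares.items():
    print(f"{label}: {share:.1f}%")
```

The shares of the reduced chart still add up to 100 percent, because merging only regroups the counts.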
The Bar Graph: The Living Arrangements
and Labor Force Participation of the Elderly
• Bar graph: a graph showing the differences
in frequencies or percentages among
categories of a nominal or an ordinal
variable. The categories are displayed as
rectangles of equal width with their height
proportional to the frequency or
percentage of the category.
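The proportionality rule in this definition can be illustrated with a small text-only sketch in Python. It draws the bars horizontally, so the bar length rather than the height is proportional to the frequency. The function name, the category labels, and the frequencies are hypothetical.

```python
def text_bar_chart(freqs, max_width=40, symbol="#"):
    """Return one text line per category; bar length is proportional to frequency."""
    largest = max(freqs.values())
    label_width = max(len(label) for label in freqs)
    lines = []
    for label, f in freqs.items():
        # Scale so that the largest frequency gets the full width.
        bar = symbol * round(max_width * f / largest)
        lines.append(f"{label:<{label_width}} | {bar} {f}")
    return lines


# Hypothetical frequencies for three categories.
freqs = {"Category A": 20, "Category B": 80, "Category C": 10}

chart_lines = text_bar_chart(freqs)
for line in chart_lines:
    print(line)
```

Because every bar is scaled by the same factor, the ratio between any two bar lengths matches the ratio between the two frequencies (up to rounding).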
[Bar chart. Categories: Living alone, Married, Other. Vertical axis ticks: 0 to 80. N = 13,886,000]

Figure 3.3 Living Arrangements of Males (65 and Older) in the United States, 2000
We can display more information by splitting by sex
[Grouped bar chart. Series: Males, Females. Categories: Living alone, Married, Other. Vertical axis ticks: 0 to 50.]

Figure 3.4 Living Arrangement of U.S. Elderly (65 and Older) by Gender, 2003
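[Horizontal bar chart. Age groups: 55 to 59, 60 to 64, 65+. Series: Women, Men. Bar labels (Women, Men): 63.4 and 77.1 (55 to 59), 44.3 and 57.2 (60 to 64), 9.8 and 18 (65+). Horizontal axis: 0 to 100.]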

Figure 3.5 Percent of Men and Women 55 Years and Over in the Civilian Labor Force,
2002
The Statistical Map: The Geographic
Distribution of the Elderly
We can display dramatic geographical changes
in American society by using a statistical map.
Maps are especially useful for describing
geographical variations in variables, such as
population distribution, voting patterns,
crime rates, or labor force participation.
The Histogram
• Histogram: a graph showing the differences
in frequencies or percentages among
categories of an interval-ratio variable. The
categories are displayed as contiguous bars,
with width proportional to the width of the
category and height proportional to the
frequency or percentage of that category.
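Building a histogram starts with counting how many values fall into each contiguous class interval. The Python sketch below shows that counting step under simplifying assumptions: all classes have equal width, and there is no open-ended class such as "95+". The function name and the sample ages are hypothetical.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in contiguous intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:       # ignore values outside the covered range
            counts[index] += 1
    intervals = [(start + i * width, start + (i + 1) * width) for i in range(n_bins)]
    return list(zip(intervals, counts))


# Hypothetical ages, binned into five-year classes starting at 65.
ages = [65, 66, 69, 70, 72, 74, 75, 78, 81, 84, 85]
bins = histogram_counts(ages, start=65, width=5, n_bins=5)

for (low, high), count in bins:
    print(f"{low}-{high - 1}: {count}")
```

Each count would then set the height of the bar drawn over its interval.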
[Histogram. Age categories: 65-69, 70-74, 75-79, 80-84, 85-89, 90-94, 95+. Vertical axis ticks: 0 to 30.]

Figure 3.7 Age Distribution of U.S. Population 65 Years and Over, 2000
The Frequency Polygon

• Frequency polygon: a graph showing the
differences in frequencies or percentages
among categories of an interval-ratio
variable. Points representing the
frequencies of each category are placed
above the midpoint of the category and are
joined by straight lines.
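The construction described here, a point above each class midpoint joined to its neighbours, can be sketched as follows. This is a minimal Python illustration; the function name, the class limits, and the frequencies are hypothetical.

```python
def frequency_polygon_points(class_limits, frequencies):
    """Return (midpoint, frequency) pairs, one per class, in class order."""
    points = []
    for (low, high), freq in zip(class_limits, frequencies):
        midpoint = (low + high) / 2   # the point sits above the class midpoint
        points.append((midpoint, freq))
    return points


# Hypothetical classes (stated limits) and frequencies.
class_limits = [(55, 59), (60, 64), (65, 69)]
frequencies = [12, 9, 5]

points = frequency_polygon_points(class_limits, frequencies)
# Consecutive points are the endpoints of the straight line segments.
segments = list(zip(points, points[1:]))

print(points)
```

Plotting several series on the same midpoints, as in a chart with one line per year, makes group comparisons easy.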
Source: Adapted from U.S. Bureau of the Census, Center for International Research,
International Data Base, 2003.

[Frequency polygon. Series: 2000, 2010, 2020. Horizontal axis: age group (55-59, 60-64, 65-69, 70-74, 75-79, 80+). Vertical axis: population, 0 to 12,000,000.]

Figure 3.11. Population of Japan, Age 55 and Over, 2000, 2010, and 2020

Time Series Charts

• Time series chart: a graph displaying
changes in a variable at different points in
time. It shows time (measured in units
such as years or months) on the horizontal
axis and the frequencies (percentages or
rates) of another variable on the vertical
axis.
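Preparing data for a time series chart amounts to ordering the observations by time and separating them into horizontal-axis and vertical-axis values. A minimal Python sketch, with a hypothetical function name and made-up observations:

```python
def time_series_points(observations):
    """Sort (time, value) pairs by time and split them into two axis lists."""
    ordered = sorted(observations)    # tuples sort by their first element, the time
    times = [t for t, _ in ordered]   # horizontal axis
    values = [v for _, v in ordered]  # vertical axis
    return times, values


# Hypothetical (year, percentage) observations, deliberately out of order.
observations = [(1980, 11.0), (1900, 4.0), (1940, 7.0)]

times, values = time_series_points(observations)
print(times)
print(values)
```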
Source: Federal Interagency Forum on Aging Related Statistics, Older Americans 2004:
Key Indicators of Well Being, 2004.
[Time series chart. Horizontal axis: year, 1900 to 2060. Vertical axis ticks: 0 to 25.]

Figure 3.12 Percentage of Total U. S. Population 65 Years and Over, 1900 to 2050
Source: U.S. Bureau of the Census, “65+ in America,” Current Population Reports,
1996, Special Studies, P23-190, Table 6-1.

[Time series chart. Series: Males, Females. Horizontal axis: year, 1960 to 2040. Vertical axis ticks: 0 to 10.]

Figure 3.13 Percentage Currently Divorced Among U.S. Population 65 Years and
Over, by Gender, 1960 to 2040
Distortions in Graphs
Graphs not only quickly inform us; they can
quickly deceive us. Because we are often
more interested in general impressions
than in detailed analyses of the numbers,
we are more vulnerable to being swayed by
distorted graphs.

– What are graphical distortions?

– How can we recognize them?
Shrinking and Stretching the Axes: Visual
Confusion
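One way to see why rescaling the axes changes the visual impression is to compute how steep the same change looks on paper under different scales. The Python sketch below is an illustration under stated assumptions, not a reconstruction of the original slides: the function name, the scales, and the numbers are hypothetical.

```python
def drawn_slope(dx, dy, x_units_per_cm, y_units_per_cm):
    """Slope of a line as it appears on paper (cm of rise per cm of run)."""
    rise_cm = dy / y_units_per_cm
    run_cm = dx / x_units_per_cm
    return rise_cm / run_cm


# Hypothetical change: 10 percentage points over 20 years.
dx_years, dy_points = 20, 10

baseline = drawn_slope(dx_years, dy_points, x_units_per_cm=10, y_units_per_cm=10)
# Stretching the vertical axis: each cm now covers only 2 points.
stretched_y = drawn_slope(dx_years, dy_points, x_units_per_cm=10, y_units_per_cm=2)
# Shrinking the horizontal axis: each cm now covers 40 years.
shrunk_x = drawn_slope(dx_years, dy_points, x_units_per_cm=40, y_units_per_cm=10)

print(baseline, stretched_y, shrunk_x)
```

The data are identical in all three cases; only the axis scales differ, yet the line looks four to five times steeper in the rescaled versions.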
Distortions with Picture Graphs
Statistics in Practice
The following graphs are particularly
suitable for making comparisons among
groups:

- Bar chart
- Frequency polygon
- Time series chart
Source: Smith, 2003.

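[Horizontal bar chart. Age groups: 55-64, 65-74, 85+. Series: Women, Men. Bar labels (Women, Men): 21.5 and 31.1 (55-64), 14.7 and 23.3 (65-74), 11.8 and 20.8 (85+). Horizontal axis: 0 to 40.]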

Figure 3.17 Percentage of College Graduates among People 55 years and over by age
and sex, 2002
Source: Stoops, Nicole. 2004. “Educational Attainment in the United States: 2003.”
Current Population Reports, P20-550. Washington D.C.: U.S. Government Printing
Office.

[Chart. Series: all races 15-64, all races 65+, black alone 15-64, black alone 65+. Horizontal axis: years of school completed (0 to 8, 9 to 12, 13 to 15, 16+). Vertical axis ticks: 0 to 50.]

Figure 3.18 Years of School Completed in the United States by Race and Age, 2003
Why use charts and graphs?
• What do you lose?
– ability to examine numeric detail offered by a table
– potentially the ability to see additional relationships
within the data
– potentially time: often we get caught up in selecting
colors and formatting charts when a simply formatted
table is sufficient
• What do you gain?
– ability to direct readers’ attention to one aspect of the
evidence
– ability to reach readers who might otherwise be
intimidated by the same data in a tabular format
– ability to focus on bigger picture rather than perhaps
minor technical details
References
• The material is prepared by taking inputs from
various textbooks and internet sources.
