Module 2 - Statistical Foundations
• Descriptive statistics, Statistical Features,
summarizing the data, outlier analysis,
Understanding distributions and plots,
Univariate statistical plots and usage, Bivariate
and multivariate statistics, Dimensionality
Reduction, Over and Under Sampling,
Bayesian Statistics, Statistical Modeling for
data analysis
Basics of Statistics
• Definition: Science of collection, presentation, analysis,
and reasonable interpretation of data.
• Statistics presents a rigorous scientific method for
gaining insight into data. For example, suppose we
measure the weight of 100 patients in a study. With so
many measurements, simply looking at the data fails to
provide an informative account. However, statistics can
give an instant overall picture of the data based on graphical
presentation or numerical summarization, irrespective of
the number of data points. Besides data summarization,
another important task of statistics is to make inferences
and predict relations between variables.
A Taxonomy of Statistics
Statistics in Data science
Age                   1   2   3   4   5   6
Frequency             5   3   7   5   4   2

Age group             1-2   3-4   5-6
Frequency             8     12    6

Cumulative Frequency
Cumulative frequency of the data above:
Age                   1   2   3    4    5    6
Frequency             5   3   7    5    4    2
Cumulative Frequency  5   8   15   20   24   26
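The cumulative frequency row is a running total of the frequency row. A minimal sketch using the age data from the table (the variable names are illustrative):

```python
from itertools import accumulate

ages = [1, 2, 3, 4, 5, 6]
frequency = [5, 3, 7, 5, 4, 2]

# Each cumulative value is the sum of all frequencies up to and including that age.
cumulative = list(accumulate(frequency))

for age, f, cf in zip(ages, frequency, cumulative):
    print(f"age {age}: frequency {f}, cumulative frequency {cf}")
```

The last cumulative value equals the total number of observations.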
Graphical Presentation: We look for the overall pattern and for striking deviations from
that pattern. The overall pattern is usually described by the shape, center, and spread of the data.
An individual value that falls outside the overall pattern is called an outlier.
Bar diagrams and pie charts are used for categorical variables.
Histograms, stem-and-leaf plots, and box plots are used for numerical variables.
Data Presentation – Categorical Variable
Bar Diagram: Lists the categories and presents the percent or count of individuals who fall
in each category.
Treatment Group   Frequency   Proportion       Percent (%)
1                 15          (15/60)=0.250    25.0
2                 25          (25/60)=0.417    41.7
3                 20          (20/60)=0.333    33.3
Total             60          1.00             100
[Figure: bar diagram of frequency (vertical axis, 0 to 30) by Treatment Group (1, 2, 3)]
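The proportions and percentages behind such a bar diagram can be computed directly from the counts. A small sketch, using the three treatment-group counts from the table (the dictionary layout is an illustrative choice):

```python
# Treatment group -> number of individuals in that group.
counts = {1: 15, 2: 25, 3: 20}
total = sum(counts.values())

# Proportion = count / total; percent = 100 * proportion, rounded to one decimal.
proportions = {group: count / total for group, count in counts.items()}
percents = {group: round(100 * p, 1) for group, p in proportions.items()}

for group in counts:
    print(group, counts[group], round(proportions[group], 3), percents[group])
```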
Data Presentation –Categorical Variable
Pie Chart: Lists the categories and presents the percent or count of individuals who fall in
each category.
Histogram: Overall pattern can be described by its shape, center, and spread. The
following age distribution is right skewed. The center lies between 80 and 100. There are no
outliers.
Figure 3: Age Distribution (histogram; vertical axis: Number of Subjects)
Summary statistics: Mean 90.41666667, Standard Error 3.902649518, Median 84, Mode 84
Commonly used measures of center are the mean, median, mode, and geometric mean.
Mean: The sum of all observations divided by the number of observations. The mean
of 20, 30, 40 is (20+30+40)/3 = 30.
Notation: Let x_1, x_2, ..., x_n be n observations of a variable
x. Then the mean of this variable is
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
Methods of Center Measurement
Median: The middle value in an ordered sequence of observations. That is, to find
the median we need to order the data set and then find the middle value. In case of
an even number of observations the average of the two middle most values is the
median. For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data
giving {3, 5, 6, 7, 9}, then choose the middle value 6. If the number of observations
is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the average of the two middle
values from the sorted sequence, in this case, (5 + 6) / 2 = 5.5.
Mode: The value that is observed most frequently. The mode is undefined for
sequences in which no observation is repeated.
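The three measures of center can be checked with Python's standard `statistics` module. This is a sketch on the examples above; `mode_or_none` is a hypothetical helper written here to follow the convention that the mode is undefined when no observation repeats, and the list `[84, 90, 84, 72]` is made-up data.

```python
import statistics
from collections import Counter

def mode_or_none(values):
    """Most frequent value, or None when no observation is repeated."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > 1 else None

mean_value = statistics.mean([20, 30, 40])
median_odd = statistics.median([9, 3, 6, 7, 5])       # middle value after sorting
median_even = statistics.median([9, 3, 6, 7, 5, 2])   # average of the two middle values

print(mean_value, median_odd, median_even)
print(mode_or_none([9, 3, 6, 7, 5]))    # no value repeats
print(mode_or_none([84, 90, 84, 72]))   # 84 appears twice
```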
Mean or Median
The median is less sensitive to outliers (extreme scores) than the mean and is thus a
better measure than the mean for highly skewed distributions, e.g. family income. For
example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270. The median of these
four observations is (30+40)/2 = 35. Here 3 observations out of 4 lie between 20 and 40,
so the mean of 270 fails to give a realistic picture of the major part of the data. It
is influenced by the extreme value 990.
Methods of Variability Measurement
Range: The difference between the largest and the smallest observations. The range of
10, 5, 2, 100 is (100-2)=98. It’s a crude measure of variability.
Methods of Variability Measurement
Variance: The variance of a set of observations is the average of the squares of the
deviations of the observations from their mean, where the sample variance uses the divisor n − 1.
In symbols, the variance of the n observations x_1, x_2, ..., x_n is
$S^2 = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$
Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is
$\frac{(5-5)^2 + (3-5)^2 + (7-5)^2}{3 - 1} = 4$
Standard Deviation: Square root of the variance. The standard deviation of the above
example is 2.
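The worked example can be reproduced in a few lines. A minimal sketch of the sample variance with the n − 1 divisor; the function name is illustrative, and `statistics.variance` from the standard library uses the same divisor.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

data = [5, 7, 3]
variance = sample_variance(data)
std_dev = math.sqrt(variance)

print(variance, std_dev)
print(statistics.variance(data))  # same divisor, so the same result
```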
Methods of Variability Measurement
Quartiles: Data can be divided into four regions that cover the total range of observed
values. Cut points for these regions are known as quartiles.
The first quartile (Q1) is the cut point below which the first 25% of the ordered data lie.
The second quartile (Q2) is the cut point below which 50% of the data lie, that is, the median.
The third quartile (Q3) is the cut point below which 75% of the data lie, so 25% of the data
lie between the median and Q3.
Q1 is the median of the first half of the ordered observations and Q3 is the median of
the second half of the ordered observations.
Methods of Variability Measurement
In the following example with n = 15 observations, Q1 is at position ((15+1)/4) × 1 = 4, i.e. the 4th
ordered observation of the data. The 4th observation is 11, so Q1 of this data is 11.
Inter-quartile Range: Difference between Q3 and Q1. Inter-quartile range of the previous
example is 61 − 40 = 21. The middle half of the ordered data lie between 40 and 61.
Deciles and Percentiles
Deciles: If data is ordered and divided into 10 parts, then cut points are called Deciles
Percentiles: If data is ordered and divided into 100 parts, then cut points are called
Percentiles. 25th percentile is the Q1, 50th percentile is the Median (Q2) and the 75th
percentile of the data is Q3.
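Quartiles, deciles, and percentiles are all cut points of the ordered data and can be obtained from one routine. The sketch below assumes Python's `statistics.quantiles` with `method="exclusive"`, which places cut points using n + 1 and so agrees with the (n + 1)/4 rule for 15 observations. The 15 values are made up for illustration (chosen so that the 4th ordered value is 11); they are not the data set from the slides.

```python
import statistics

# Hypothetical ordered sample of n = 15 observations.
data = [3, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44]

q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

deciles = statistics.quantiles(data, n=10, method="exclusive")       # 9 cut points
percentiles = statistics.quantiles(data, n=100, method="exclusive")  # 99 cut points

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("5th decile:", deciles[4], "25th percentile:", percentiles[24])
```

Other software may use slightly different interpolation rules, so quartiles for small samples can differ between tools.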
Coefficient of Variation: The standard deviation of the data divided by its mean. It is usually
expressed in percent.
$\text{Coefficient of Variation} = \frac{S}{\bar{x}} \times 100$
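As a quick check of the definition, the coefficient of variation is the sample standard deviation divided by the mean, times 100. A sketch on the data 5, 7, 3 from the variance example (the function name is illustrative, and the measure is only meaningful when the mean is not zero):

```python
import statistics

def coefficient_of_variation(xs):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * statistics.stdev(xs) / statistics.mean(xs)

cv = coefficient_of_variation([5, 7, 3])  # standard deviation 2, mean 5
print(f"{cv:.1f}%")
```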
Five Number Summary
Five Number Summary: The five number summary of a distribution consists of the
smallest (Minimum) observation, the first quartile (Q1), the median (Q2), the third
quartile (Q3), and the largest (Maximum) observation, written in
order from smallest to largest.
Box Plot: A box plot is a graph of the five number summary. The central box spans
the quartiles. A line within the box marks the median. Lines extending above and
below the box mark the smallest and the largest observations (i.e., the range).
Outlying samples may be additionally plotted outside the range.
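The numbers that a box plot draws can be computed before plotting. A minimal sketch, assuming the same `statistics.quantiles` convention (`method="exclusive"`) for the quartiles; `five_number_summary` is an illustrative name and the sample is made up.

```python
import statistics

def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum) of the observations."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="exclusive")
    return min(xs), q1, median, q3, max(xs)

# Hypothetical sample of 15 observations.
sample = [3, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44]
summary = five_number_summary(sample)
print(summary)
```

A plotting library would draw the box from Q1 to Q3, a line at the median, and whiskers out to the minimum and maximum.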
[Figure: Boxplot, Distribution of Age in Month (vertical axis 0 to 160; marked values: min, q1, median, q3, max)]
Choosing a Summary
The five number summary is usually better than the mean and standard deviation for
describing a skewed distribution or a distribution with extreme outliers. The mean and
standard deviation are reasonable for symmetric distributions that are free of outliers.
In real life we can’t always expect symmetry of the data. It is common practice to report the
number of observations (n), mean, median, standard deviation, and range when
summarizing data. We can include other summary statistics such as Q1, Q3, and the
coefficient of variation if they are considered important for describing the data.
Shape of Data
• Shape of data is measured by
– Skewness
– Kurtosis
Skewness
• Measures asymmetry of data
– Positive or right skewed: Longer right tail
– Negative or left skewed: Longer left tail
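Both shape measures can be written in terms of central moments. The sketch below uses the simple moment-based definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 − 3); this choice is an assumption, and spreadsheet or statistics packages often report bias-adjusted variants that give slightly different numbers for small samples.

```python
def central_moment(xs, k):
    """k-th central moment: average of (x - mean)^k."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)

def skewness(xs):
    """Positive for a longer right tail, negative for a longer left tail."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Zero for a normal distribution; negative for flatter, lighter-tailed data."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

print(skewness([1, 2, 3, 4, 5]))        # symmetric data
print(skewness([1, 2, 3, 4, 20]))       # long right tail
print(skewness([-20, 1, 2, 3, 4]))      # long left tail
print(excess_kurtosis([1, 2, 3, 4, 5]))
```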
[Figure: histogram; vertical axis: Number of Subjects]
Summary statistics for the distribution shown:
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Class Summary (First Part)
So far we have learned-
• Outlier: A data object that deviates significantly from the normal objects, as if it
were generated by a different mechanism
– Ex.: Unusual credit card purchase; in sports: Michael Jordan, Wayne Gretzky, ...
• Outliers are different from noise
– Noise is random error or variance in a measured variable
– Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: a novel pattern is an outlier at an early stage,
but is later merged into the model
• Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
Types of Outliers (I)
• Three kinds: global, contextual and collective outliers
• Global outlier (or point anomaly)
Types of Outliers (II)
• Collective Outliers
– A subset of data objects collectively deviates significantly
from the whole data set, even if the individual data objects
may not be outliers
– Applications: e.g., intrusion detection
• When a number of computers keep sending denial-of-service
packets to each other
Detection of collective outliers
– Consider not only the behavior of individual objects, but also that of
groups of objects
– Need to have background knowledge on the relationship among the data objects
The border between normal and outlier objects is often a gray area.
Noise can blur the distinction between normal objects and outliers. It may help hide outliers and
reduce the effectiveness of outlier detection.
Understandability
Understand why these are outliers: justification of the detection
Statistical Approaches
• Statistical approaches assume that the objects in a data set are generated by
a stochastic process (a generative model)
• Idea: learn a generative model fitting the given data set, and then identify the
objects in low probability regions of the model as outliers
• Methods are divided into two categories: parametric vs. non-parametric
• Parametric method
– Assumes that the normal data is generated by a parametric distribution
with parameter θ
– The probability density function of the parametric distribution f(x, θ)
gives the probability that object x is generated by the distribution
– The smaller this value, the more likely x is an outlier
• Non-parametric method
– Does not assume an a priori statistical model; instead determines the model from
the input data
– Not completely parameter free, but the number and nature of
the parameters are flexible and not fixed in advance
– Examples: histogram and kernel density estimation
Parametric Methods I: Detection of Univariate Outliers Based on
Normal Distribution
• Univariate data: A data set involving only one attribute or variable
• Often assume that data are generated from a normal distribution, learn the
parameters from the input data, and identify the points with low probability as
outliers
• Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
– Use the maximum likelihood method to estimate μ and σ
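For a normal distribution, the maximum likelihood estimates are the sample mean and the standard deviation with divisor n. The sketch below fits these to the temperature data and flags values that lie many standard deviations from the mean. The cutoff of 2.5 standard deviations is an arbitrary illustrative choice, not a value from the slides.

```python
import statistics

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

# Maximum likelihood estimates under a normal model:
# mu is the sample mean, sigma uses the divisor n (population form).
mu = statistics.fmean(temps)
sigma = statistics.pstdev(temps)

def z_scores(xs, mu, sigma):
    """Absolute distance of each value from mu, in units of sigma."""
    return [abs(x - mu) / sigma for x in xs]

THRESHOLD = 2.5  # assumption: illustrative cutoff, tune for the application
flagged = [x for x, z in zip(temps, z_scores(temps, mu, sigma)) if z > THRESHOLD]

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print("flagged as outliers:", flagged)
```

Values far from μ fall in the low-probability tails of the fitted normal density, which is why a large z-score is used as the outlier signal.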
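Parametric Methods I: Grubbs' Test
• Univariate outlier detection: Grubbs' test (maximum normed residual
test), another statistical method under the normal distribution
– For each object x in a data set, compute its z-score $z = \frac{|x - \bar{x}|}{s}$, where $\bar{x}$ is the
mean and s the standard deviation of the data: x is an outlier if
$z \geq \frac{N - 1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$
where $t_{\alpha/(2N),\,N-2}$ is the value taken by a t-distribution with N − 2 degrees of freedom at a
significance level of α/(2N), and N is the number of objects in the data set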
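A sketch of Grubbs' test on the temperature data from the previous slide. The test statistic is the largest absolute deviation from the mean divided by the sample standard deviation, and it is compared with the critical value (N − 1)/√N · sqrt(t² / (N − 2 + t²)). The standard library has no t-distribution quantile function, so the critical t value is passed in by the caller; the value 3.833 used here (α = 0.05, N = 10, so level 0.0025 with 8 degrees of freedom) is an approximate table value and should be treated as an assumption.

```python
import math
import statistics

def grubbs_statistic(xs):
    """Largest absolute deviation from the mean, in sample standard deviations."""
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)
    return max(abs(x - mean) for x in xs) / s

def grubbs_critical(n, t_crit):
    """Critical value for n objects, given the t quantile at level alpha/(2n), n-2 dof."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
T_CRIT = 3.833  # assumption: approximate t-table value for level 0.0025, 8 dof

g = grubbs_statistic(temps)
g_crit = grubbs_critical(len(temps), T_CRIT)
is_outlier = g > g_crit

print(f"G = {g:.3f}, critical value = {g_crit:.3f}, outlier detected: {is_outlier}")
```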
Parametric Methods II: Detection of Multivariate
Outliers
• Multivariate data: A data set involving two or more attributes or variables
• Transform the multivariate outlier detection task into a univariate outlier
detection problem
• Method 1. Compute the Mahalanobis distance
– Let ō be the mean vector for a multivariate data set. The Mahalanobis
distance for an object o to ō is $MDist(o, \bar{o}) = (o - \bar{o})^T S^{-1} (o - \bar{o})$, where S is
the covariance matrix
– Use Grubbs' test on this measure to detect outliers
• Method 2. Use the χ²-statistic: $\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}$
– where o_i is the value of o on the i-th dimension, E_i is the mean of the i-th dimension
among all objects, and n is the dimensionality
– If the χ²-statistic is large, then object o is an outlier
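A minimal sketch of Method 1 for two-dimensional data, written without external libraries by using the closed-form inverse of a 2×2 covariance matrix. It computes (o − ō)ᵀ S⁻¹ (o − ō) as defined above (often called the squared Mahalanobis distance). The function name and the six sample points are made up for illustration; for more dimensions a linear algebra library would normally be used.

```python
from statistics import fmean

def mahalanobis_2d(points, o):
    """(o - mean)^T S^-1 (o - mean) for 2-D points, with sample covariance S."""
    n = len(points)
    mx = fmean(p[0] for p in points)
    my = fmean(p[1] for p in points)
    # Entries of the sample covariance matrix S = [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = o[0] - mx, o[1] - my
    # Inverse of S is [[syy, -sxy], [-sxy, sxx]] / det, applied to (dx, dy).
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Hypothetical data: five points near a line and one point far above it.
points = [(1.0, 2.0), (2.0, 3.1), (3.0, 3.9), (4.0, 5.2), (5.0, 6.0), (3.0, 9.0)]
distances = [mahalanobis_2d(points, p) for p in points]
most_distant = points[distances.index(max(distances))]

print([round(d, 3) for d in distances])
print("largest distance:", most_distant)
```

Once every object has a single distance value, a univariate test such as Grubbs' test can be applied to those values.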
Parametric Methods III: Using Mixture of Parametric
Distributions
• Assuming that the data are generated by a single normal distribution can
sometimes be overly simplistic
• Example (right figure): The objects between the two
clusters cannot be captured as outliers since they are
close to the estimated mean
• To overcome this problem, assume the normal data is generated by two
normal distributions. For any object o in the data set, the probability that
o is generated by the mixture of the two distributions is given by
$\Pr(o \mid \theta_1, \theta_2) = f_{\theta_1}(o) + f_{\theta_2}(o)$
where $f_{\theta_1}$ and $f_{\theta_2}$ are the probability density functions of θ1 and θ2
• Then use the EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from the data
• An object o is an outlier if it does not belong to any cluster
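The idea can be sketched with a small EM loop for a one-dimensional mixture of two normal distributions. This is an illustrative implementation under several assumptions that are not in the slides: it adds mixing weights to the two densities, starts from the lower and upper halves of the sorted data, runs a fixed number of iterations, and uses made-up data with two clusters and one value in between.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_two_normals(data, iterations=100, min_sigma=1e-3):
    """EM for a 1-D mixture of two normals; returns (weights, mus, sigmas)."""
    xs = sorted(data)
    half = len(xs) // 2
    # Crude start (assumption): means of the lower and upper halves of the data.
    mus = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    sigmas = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of component 0 for every point.
        resp0 = []
        for x in data:
            p0 = weights[0] * normal_pdf(x, mus[0], sigmas[0])
            p1 = weights[1] * normal_pdf(x, mus[1], sigmas[1])
            total = p0 + p1
            resp0.append(p0 / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean, and spread of each component.
        for k in (0, 1):
            r = resp0 if k == 0 else [1.0 - v for v in resp0]
            nk = sum(r)
            if nk == 0:
                continue  # empty component: keep its previous parameters
            mus[k] = sum(ri * x for ri, x in zip(r, data)) / nk
            var = sum(ri * (x - mus[k]) ** 2 for ri, x in zip(r, data)) / nk
            sigmas[k] = max(math.sqrt(var), min_sigma)
            weights[k] = nk / len(data)
    return weights, mus, sigmas

def mixture_density(x, weights, mus, sigmas):
    """Weighted sum of the two component densities at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# Hypothetical data: a cluster near 1, a cluster near 9, and one value in between.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 5.0, 9.0, 9.2, 8.8, 9.1, 8.9, 9.3, 8.7]
weights, mus, sigmas = fit_two_normals(data)
least_likely = min(data, key=lambda x: mixture_density(x, weights, mus, sigmas))

print("estimated means:", [round(m, 2) for m in mus])
print("lowest-density object:", least_likely)
```

The object with the lowest mixture density is the natural outlier candidate; in practice a density threshold would be chosen to decide which objects belong to no cluster.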
Non-Parametric Methods: Detection Using Histogram
• Multivariate analysis
Figure 3.1 Annual Estimates of U.S. Population 65 Years and Over by Race, 2003
(pie chart; categories: White, Black, American Indian, Asian, Pacific Islander, Two or more; White 87.7%; N = 35,919,174)
We can reduce some of the categories
Figure 3.2 Annual Estimates of U.S. Population 65 Years and Over, 2003
(pie chart; categories: White, Black, Other and two or more; slice labels 87.7%, 8.3%, and 4%; N = 35,919,174)
The Bar Graph: The Living Arrangements
and Labor Force Participation of the Elderly
Figure 3.3 Living Arrangements of Males (65 and Older) in the United States, 2000
(bar graph; categories: Living alone, Married, Other; vertical axis 0 to 70; N = 13,886,000)
We can display more information by splitting the data by sex.
Figure 3.4 Living Arrangement of U.S. Elderly (65 and Older) by Gender, 2003
(grouped bar graph; series: Males, Females; categories: Living alone, Married, Other; vertical axis 0 to 80)
Figure 3.5 Percent of Men and Women 55 Years and Over in the Civilian Labor Force,
2002
(horizontal bar graph; series: Women, Men; horizontal axis 0 to 100; bar labels by age group: 65+: 9.8 and 18; 60 to 64: 44.3 and 57.2; 55 to 59: 63.4 and 77.1)
The Statistical Map: The Geographic
Distribution of the Elderly
We can display dramatic geographical changes
in American society by using a statistical map.
Maps are especially useful for describing
geographical variations in variables, such as
population distribution, voting patterns,
crime rates, or labor force participation.
The Histogram
• Histogram: a graph showing the differences
in frequencies or percentages among
categories of an interval-ratio variable. The
categories are displayed as contiguous bars,
with width proportional to the width of the
category and height proportional to the
frequency or percentage of that category.
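The counts behind a histogram come from grouping an interval-ratio variable into contiguous class intervals. A small sketch, assuming integer ages and equal-width classes; the function name and the ages are made up for illustration.

```python
from collections import Counter

def bin_counts(values, start, width):
    """Count integer values in contiguous classes of equal width, beginning at start."""
    # Assumption: every value is an integer >= start.
    counts = Counter((v - start) // width for v in values)
    last_bin = max(counts)
    return {
        f"{start + k * width}-{start + (k + 1) * width - 1}": counts[k]
        for k in range(last_bin + 1)  # include empty classes so bars stay contiguous
    }

ages = [65, 67, 70, 71, 74, 82, 66]   # hypothetical ages
histogram = bin_counts(ages, start=65, width=5)
print(histogram)
```

Each class count (or its percentage of the total) becomes the height of one bar.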
Figure 3.7 Age Distribution of U.S. Population 65 Years and Over, 2000
(histogram; age categories: 65-69, 70-74, 75-79, 80-84, 85-89, 90-94, 95+; vertical axis 0 to 30)
The Frequency Polygon
Figure 3.11. Population of Japan, Age 55 and Over, 2000, 2010, and 2020
(frequency polygon; series: 2000, 2010, 2020; age categories: 55-59, 60-64, 65-69, 70-74, 75-79, 80+; vertical axis 0 to 12,000,000)
Time Series Charts
Figure 3.12 Percentage of Total U.S. Population 65 Years and Over, 1900 to 2050
(time series chart; horizontal axis 1900 to 2060; vertical axis 0 to 20)
Source: U.S. Bureau of the Census, “65+ in America,” Current Population Reports,
1996, Special Studies, P23-190, Table 6-1.
Figure 3.13 Percentage Currently Divorced Among U.S. Population 65 Years and
Over, by Gender, 1960 to 2040
(time series chart; series: Males, Females; horizontal axis 1960 to 2040; vertical axis 0 to 16)
Distortions in Graphs
Graphs not only quickly inform us; they can
quickly deceive us. Because we are often
more interested in general impressions
than in detailed analyses of the numbers,
we are more vulnerable to being swayed by
distorted graphs.
- Bar chart
- Frequency polygon
- Time series chart
Source: Smith, 2003.
Figure 3.17 Percentage of College Graduates among People 55 years and over by age
and sex, 2002
(horizontal bar graph; series: Women, Men; horizontal axis 0 to 40; bar labels by age group: 85+: 11.8 and 20.8; 65-74: 14.7 and 23.3; 55-64: 21.5 and 31.1)
Source: Stoops, Nicole. 2004. “Educational Attainment in the United States: 2003.”
Current Population Reports, P20-550. Washington D.C.: U.S. Government Printing
Office.
Figure 3.18 Years of School Completed in the United States by Race and Age, 2003
(series: all races 65+, black alone 15-64, black alone 65+; categories: 0 to 8, 9 to 12, 13 to 15, 16+; vertical axis 0 to 60)
Why use charts and graphs?
• What do you lose?
– ability to examine numeric detail offered by a table
– potentially the ability to see additional relationships
within the data
– potentially time: often we get caught up in selecting
colors and formatting charts when a simply formatted
table is sufficient
• What do you gain?
– ability to direct readers’ attention to one aspect of the
evidence
– ability to reach readers who might otherwise be
intimidated by the same data in a tabular format
– ability to focus on the bigger picture rather than perhaps
minor technical details
References
• The material is prepared by taking inputs from
various text books and other internet sources.