
Applied Multivariate Analysis

LILI WANG

Time: Friday, 2:00-4:00 PM


Location: DingDing Group
Objectives: What is the scientific problem of
interest? How can it be framed as a multivariate
analysis problem? Does the explanation agree with
common sense and domain knowledge?

Challenges in Multivariate Analysis

Dataset: Data is noisy. Measurement error.
Missing data. Normality versus heavy-tailedness.
Homogeneity versus heterogeneity.

Methodology: Which method should we use?
What are the assumptions? Is it statistically
accurate? Computationally efficient?
Exploratory Data Analysis
• An important step in data analysis is exploratory data analysis (EDA).

• EDA methods include visualization of data, descriptive statistics, and more.

• The benefits of EDA:

1. Check the quality of data: cleaned or not, missing data, outliers, …
2. Gain a first impression of the data: data type, distribution, symmetry, …
3. Illustrate analysis results: gain intuition, collaboration, general audience, …
Visualization of Multivariate Data
• Why do we look at graphical displays of the data?
One difficulty in understanding complex multivariate data is the limits of the
human perceptual system. Visualization tools can help!

• Visualization may:
1. suggest a plausible model for the data,
2. assess the validity of model assumptions,
3. detect outliers or suggest a plausible normalizing transformation,
4. and more.
Scatter Plot
• A scatter plot is a data visualization tool that uses dots to represent the values
obtained for two different variables.

• Plotted on Cartesian coordinates: x-axis is the value of the first variable and y-
axis is the value of the second variable.

• Used to check the relationship between two variables:


1. Correlation
2. Linear or nonlinear
3. Joint normality
4. Groups
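A minimal Python sketch of such a scatter plot is shown below. It is not part of the original slides; the file name "nutrient.csv" and the column names are assumptions used only for illustration.

```python
# Minimal scatter-plot sketch with pandas/matplotlib (illustrative only).
# The file "nutrient.csv" and its column names are assumed, not given in the slides.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("nutrient.csv")            # assumed: one row per woman, one column per nutrient

fig, ax = plt.subplots()
ax.scatter(df["Calcium"], df["Iron"], s=10, color="red", alpha=0.6)
ax.set_xlabel("Calcium (mg)")
ax.set_ylabel("Iron (mg)")
ax.set_title("Scatter plot of Calcium vs. Iron")
plt.show()
```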
Example 1: USDA Women’s Health Survey

• In 1985, the USDA commissioned a study of women’s nutrition.


Nutrient intake was measured for a random sample of 737 women
aged 25-50 years.

• The following variables were measured:


1. Calcium (mg)
2. Iron (mg)
3. Protein (g)
4. Vitamin A (μg)
5. Vitamin C (mg)
A Peek at the Data
• Dataset contains 5 variables and 737 observations.
• Table of the first five observations:

     Calcium     Iron   Protein   Vitamin A   Vitamin C
1     522.29   10.188    42.561      349.13      54.141
2     343.32    4.113    67.793      266.99      24.839
3     858.26   13.741    59.933      667.90     155.455
4     575.98   13.245    42.215      792.23     224.688
5    1927.50   18.919   111.316      740.27      80.961
Scatter Plot between Calcium and Iron

• Each red dot represents an observation

• Calcium on x-axis

• Iron on y-axis

What can we say about this plot?
Some Questions

• Are there outliers?

• What model should we use to fit the data?

• Can we model the data using a bivariate normal?

• The majority of the data is clustered in the shaded area. Are the other points outliers?

• How large is large?

• How do we compare a value of 500 in Calcium with 30 in Iron?
Standardize Data
• Rescale data from different sources and measurement scales to a "standard" scale.
• Avoid comparing apples to oranges.
• A common standardization is Z-score scaling, which scales a random sample to have
zero sample mean and unit sample variance.
• For a random sample $(x_1, \ldots, x_n)$, the Z-score scaling transforms each observation by

$x_i^* = \frac{x_i - \bar{x}}{s},$

where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.
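A small numpy sketch of Z-score scaling applied column by column is given below; the values reuse the first rows of the table above, and everything else is an assumption for illustration.

```python
# Z-score scaling sketch: subtract the sample mean and divide by the
# sample standard deviation (ddof=1), column by column.
import numpy as np

X = np.array([[522.29, 10.188],
              [343.32,  4.113],
              [858.26, 13.741],
              [575.98, 13.245]])            # Calcium and Iron, first rows of the table above

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(X_std.mean(axis=0))                   # approximately 0 for each column
print(X_std.std(axis=0, ddof=1))            # 1 for each column
```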
Scatter plot after Z-score standardization

• Zero sample mean and unit sample variance.

• Solid lines stand for 3 standard deviations.

• If the data is normal, we should expect about 99% of the data in the bottom-left box.
Transformation Methods
• Sometimes data is "irregular": non-normal, with outliers, skewed, heavy-tailed, …

• Data transformation techniques can be used to stabilize the variance, make
the data more normal-like, and improve the validity of measures of association.

• Power transformation:
$y = x^{\alpha}, \quad 0 < \alpha < 1.$

• Log transformation:
$y = \ln(x), \quad x > 0.$
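Below is a short numpy sketch of both transformations; the sample values are taken from the Calcium column above and the choice of exponent is an assumption.

```python
# Power and log transformation sketch (illustrative values only).
import numpy as np

x = np.array([522.29, 343.32, 858.26, 575.98, 1927.50])   # e.g. Calcium intake (mg)

y_power = x ** 0.5        # power transformation with alpha = 0.5 (assumed)
y_log   = np.log(x)       # log transformation, defined for x > 0
```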
Scatter plot after Log transformation

• More normal-like.

• Reduced variance.

• Fewer outliers.
Scatter Plot for Three Variables
• The scatter plot can be extended to visualize the relationship among three
different variables; this is called a 3D scatter plot.

• Plotted on Cartesian coordinates: the x-axis is the value of the first variable,
the y-axis is the value of the second variable, and the z-axis is the value of
the third variable.

• A fourth variable can be encoded as the color or size of the markers (see the sketch below).
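A minimal 3D scatter-plot sketch with matplotlib follows; the file name and columns are the same assumptions as before.

```python
# 3D scatter-plot sketch with matplotlib; file and column names are assumed.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("nutrient.csv")

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(df["Calcium"], df["Iron"], df["Protein"], s=10, color="red")
ax.set_xlabel("Calcium (mg)")
ax.set_ylabel("Iron (mg)")
ax.set_zlabel("Protein (g)")
# A fourth variable could be mapped to marker color via the c= argument.
plt.show()
```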
3D scatter plot for Calcium, Iron and Protein

• Each red dot represents an observation

• Calcium on x-axis

• Iron on y-axis

• Protein on z-axis
3D scatter plot after Log transformation

• Each blue dot represents an observation

• More clustered in a "ball"

• Reduced variance

• Fewer outliers
Pros and Cons of 3D Scatter Plot

Pros
• Visualizes 3 or 4 variables.
• Shows joint relationships rather than only pairwise ones.
• Shows the joint sample distribution.

Cons
• Not friendly to the naked eye (angle-dependent).
• Hard to interpret.
• Does not work for more (≥ 5) variables.
Pairwise Scatter Plot for More Variables

• The pairwise scatter plot aims to visualize the relationship for each
pair of variables in a multivariate dataset (see the sketch below).

• A pairwise scatter plot is an array of scatter plots; the (i, j)-th plot in
the array is the scatter plot between the i-th and j-th variables.
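A one-call sketch using pandas' scatter-matrix helper is shown below; the data file is again an assumption.

```python
# Pairwise scatter-plot (scatter-plot matrix) sketch using pandas.
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

df = pd.read_csv("nutrient.csv")                  # assumed file with the 5 nutrient columns
scatter_matrix(df, figsize=(8, 8), diagonal="hist", alpha=0.5)
plt.show()
```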
Pairwise Scatter Plot for USDA Women’s Health Survey
Pros and Cons of Pairwise Scatter Plot

Pros
• Visualizes many variables simultaneously.
• Easy to interpret pairwise relationships.

Cons
• No joint relationship for more than 2 variables.
• The array becomes huge when the number of variables is large.
Time Plot
• A time plot (sometimes called a time series graph) displays values versus time.
It is similar to a scatter plot, but the x-axis is chosen to be time (or age,
survival time, …).

• A time plot is useful for comparing the "growth" of multiple variables with respect
to time (or some other common index).

• Applications of the time plot (see the sketch after this list):

1. Finance: compare multiple stock returns vs time
2. Clinical: compare multiple patients vs survival time
3. Biology: compare multiple genome sequences vs positions
4. Physics: multiple measurements vs time
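A minimal time-plot sketch follows; the file name "stocks.csv" and its layout (a Date column plus one price column per asset) are assumptions, not part of the slides.

```python
# Time-plot sketch: one line per variable against a shared date index.
import pandas as pd
import matplotlib.pyplot as plt

prices = pd.read_csv("stocks.csv", parse_dates=["Date"], index_col="Date")

prices.plot()                                     # one line per column (per asset)
plt.ylabel("Closing price (USD)")
plt.title("Daily stock prices")
plt.show()
```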
Time Plot for Financial Time Series

• Time series are one of the most common data types encountered in finance
and weather forecasting.

• One powerful yet simple visualization tool in financial analysis is to draw the
time plot for multiple assets.

• Things to check in a time plot:

1. Co-movement of variables
2. Trend (increasing or decreasing, …)
3. Periodic patterns (weekly, seasonal, long term, …)
4. Black swan events (huge gaps, financial crises, …)
Example 2: Stock Prices of High-tech Companies

• We collect daily stock prices (closing price) of four leading high-tech


companies between January 2018 and January 2019.

• The following variables were included:


1. Daily stock price of Apple
2. Daily stock price of Facebook
3. Daily stock price of IBM
4. Daily stock price of Microsoft
A Peek at the Data
• Dataset contains 4 variables and 250 observations (250 trading days).
• Table of the first five observations:

Date         Apple    Facebook   IBM      Microsoft
1/16/2018    176.19   178.39     163.85   88.35
1/17/2018    179.10   177.60     168.65   90.14
1/18/2018    179.26   179.80     169.12   90.10
1/19/2018    178.46   181.29     162.37   90
1/22/2018    177      185.37     162.60   91.61

Time Plot of Stock Prices

What can you say about this plot?
Some Questions?

• Can you observe any co-movement of these four stocks?

• Any trends?

• Which asset is most risky?
Pros and Cons of Price Data

Pros
• Straightforward
• Easy to check trends and high/low prices

Cons
• Comparing apples to oranges (different price scales)
• No relative gain/loss
• Non-stationary data
Log-return of Financial Assets

• In financial analysis, the logarithm of returns is more popular than prices
or raw returns.

• For an asset (e.g. stock, bond, gold, bitcoin, …), the log-return at time $t$ is

$r_t = \log(1 + R_t) = \log\!\left(1 + \frac{P_t - P_{t-1}}{P_{t-1}}\right) = \log(P_t) - \log(P_{t-1}),$

where $P_t$ and $R_t$ are the price and simple return at time $t$.


Why Log-return?
• Log-return is favored for multiple reasons:

• More normal-like (recall the log transformation)

• More stationary time series (zero mean and fixed variance)

• Log additivity:
$\sum_{t=1}^{T} r_t = \log(P_T) - \log(P_0)$

• Easy calculus:
$\frac{d}{dx} e^x = e^x, \qquad \int e^x \, dx = e^x$
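The sketch below computes log-returns from a price table with pandas and numpy; the file name and column layout are the same assumptions as in the earlier time-plot sketch.

```python
# Log-return sketch: r_t = log(P_t) - log(P_{t-1}), applied column-wise.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

prices = pd.read_csv("stocks.csv", parse_dates=["Date"], index_col="Date")

log_returns = np.log(prices).diff().dropna()      # log(P_t) - log(P_{t-1}) per asset
log_returns.plot()
plt.ylabel("Daily log-return")
plt.show()
```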
Time Plot of Stock Returns

What can you say about this plot?
Some Questions?

• Which asset is the most risky (largest variance)?

• Can you observe any black swan event?

• What is this huge drop for FB? What happened?
Radar Chart
• A radar chart (also known as a spider, web, polar, or star chart) is a graphical
method of comparing multivariate data in the form of a two-dimensional chart
of three or more quantitative variables.

• A radar chart is useful to:

1. Find similar observations
2. Find observations with high/low scores
3. Find clusters of observations
4. Find outliers
Example 3: High School Final Scores

• We generate a toy dataset. The dataset contains the final scores of
some students in a hypothetical high school.

• Suppose the following 8 subjects are tested:

1. Math, 2. English, 3. Biology, 4. Music,
5. Programming, 6. French, 7. Physics and 8. Statistics

• Each subject is scored from 1 to 10.
A Peek at the Data
• We want to compare the performance of students
• Table of first four observations

ID Math English Biology Music Prog. French Physics Stat.


1 4 6 9 3 9 2 6 8
2 5 5 3 5 3 3 8 5
3 3 2 3 5 4 8 6 8
4 9 2 6 9 4 2 7 6
Radar Chart for 1st Student

• Each axis represents the score in one subject

• Requires standardized variables

• The area covered can be considered as a score for overall performance
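One way such a chart could be drawn is sketched below on matplotlib's polar axes, using the first student's scores from the toy table above; the styling choices are assumptions.

```python
# Radar (spider) chart sketch on matplotlib polar axes.
import numpy as np
import matplotlib.pyplot as plt

subjects = ["Math", "English", "Biology", "Music",
            "Programming", "French", "Physics", "Statistics"]
scores = [4, 6, 9, 3, 9, 2, 6, 8]                 # first student in the toy dataset

angles = np.linspace(0, 2 * np.pi, len(subjects), endpoint=False).tolist()
angles += angles[:1]                              # repeat the first angle to close the polygon
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, color="tab:blue")
ax.fill(angles, values, color="tab:blue", alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(subjects)
ax.set_ylim(0, 10)
plt.show()
```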
Comparison on Parallel Charts

Comparison on One Chart

• Draw multiple observations on the same chart

• Easy to compare areas, pros and cons

• Not good for a large number of observations
Chernoff Face
• Chernoff faces, invented by Herman Chernoff in 1973, display
multivariate data in the shape of a human face.

• The individual parts, such as the eyes, ears, mouth and nose, represent
values of the variables by their shape, size, placement and orientation.

• The idea behind using faces is that people easily recognize faces and
notice small changes without difficulty.
Example 4: Crime Rates by State in 2008

• The data contains the rates of various types of crimes in different
states. The data source is Table 301 of the 2008 US Statistical Abstract.

• Rates of the following crime types are recorded:

1. Murder                2. Forcible rape
3. Robbery               4. Aggravated assault
5. Burglary              6. Larceny theft
7. Motor vehicle theft
• Each type of crime corresponds to a feature of the face.

• The shape of each feature depends on the value of the variable.

How to make a face?

Let's make some faces

Pros
• Easy to tell and remember the differences between states

Cons
• Hard to translate faces back to the values of the variables

More Faces
Heat Map
• A heat map (or heatmap) is a visualization tool which represents the values
in a data matrix by colors in a 2D graph.

• Applications of heat maps:

1. Molecular biology
2. Neuroscience
3. Physics
4. Density plots
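A minimal heat-map sketch is given below; it colors the entries of the sample correlation matrix of the nutrient data that appears later in the lecture, and the colormap choice is an assumption.

```python
# Heat-map sketch with matplotlib: color-code the entries of a matrix.
import numpy as np
import matplotlib.pyplot as plt

labels = ["Calcium", "Iron", "Protein", "Vitamin A", "Vitamin C"]
R = np.array([[1.000, 0.395, 0.500, 0.158, 0.229],
              [0.395, 1.000, 0.623, 0.244, 0.313],
              [0.500, 0.623, 1.000, 0.147, 0.212],
              [0.158, 0.244, 0.147, 1.000, 0.184],
              [0.229, 0.313, 0.212, 0.184, 1.000]])

fig, ax = plt.subplots()
im = ax.imshow(R, cmap="viridis")
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
fig.colorbar(im, ax=ax)
plt.show()
```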
Heatmap for Whole Brain Analysis
Network Map
• A network map is a visualization tool to study the connectivity of a network.
• Each node in the map represents a variable (e.g. users, characters, features).
• Two nodes are connected if there is an edge between them.

• An example: a network map for the Marvel Cinematic Universe (see the sketch below).
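A small networkx sketch follows; the edge list is a made-up toy example, not the actual Marvel Cinematic Universe data.

```python
# Network-map sketch with networkx; the edge list is hypothetical.
import networkx as nx
import matplotlib.pyplot as plt

edges = [("Iron Man", "Captain America"),
         ("Iron Man", "Spider-Man"),
         ("Captain America", "Black Widow"),
         ("Black Widow", "Hulk")]                 # hypothetical character pairs

G = nx.Graph()
G.add_edges_from(edges)                           # nodes are created automatically

nx.draw_networkx(G, with_labels=True, node_color="lightblue")
plt.axis("off")
plt.show()
```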


Univariate - Review
Descriptive Statistics
• Descriptive statistics is the term given to the analysis of data that
helps describe, show or summarize data in a meaningful way.

• Descriptive statistics are very important since raw data is hard to
interpret and visualization is not quantitatively accurate.

• Descriptive statistics do not, however, allow us to make conclusions
beyond the data we have analyzed or reach conclusions regarding any
hypotheses we might have made.
Descriptive Statistics (cont.)
• The goal of descriptive statistics is to obtain some partial descriptions of the
joint distribution of the data.

• Three aspects of the data are of importance:

1. Central Tendency. What is a typical value for each variable?
2. Dispersion. How far do the individual observations deviate from a central
value for a given variable?
3. Association. When more than one variable is studied together, how does each
variable relate to the remaining variables? How are the variables simultaneously
related to one another? Are they positively or negatively related?
Population
• A population is the collection of all people, plants, animals, or objects of interest about
which we wish to make statistical inferences (generalizations).

• The population may also be viewed as the collection of all possible random draws from a
stochastic model; for example, independent draws from a normal distribution with a given
population mean and population variance.

• A population parameter is a numerical characteristic of a population.

• In nearly all statistical problems we do not know the value of a parameter because we do
not measure the entire population. We use sample data to make an inference about the
value of a parameter.
Sample
• A sample is the subset of the population that we actually measure or observe.

• A sample statistic is a numerical characteristic of a sample. A sample statistic
estimates the unknown value of a population parameter.

• Information collected from sample statistics is sometimes referred to
as descriptive statistics.

• Statistics, as a subject matter, is the science and art of using sample
information to make generalizations about populations.
Example: USDA Women's Health Survey

Population: Intake of the 5 nutrients for all women aged between 25 and 50
in the United States.

Sample: Intake of the 5 nutrients observed from 737 women aged between 25
and 50 in the United States.

A Population Parameter: Average intake of Calcium.

A Sample Statistic: Sample mean of Calcium intake.
Notation
• $p$ = number of variables, $n$ = number of observations

• $x_{ij}$ = value of variable $j$ for the $i$-th observation

• Vector of observations for the j-th variable:

$\boldsymbol{x}_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix} = (x_{1j}, x_{2j}, \ldots, x_{nj})^{T}$
Notation (cont.)
• Data matrix (sometimes called the design, regressor or model matrix) whose j-th
column is the vector of observations for the j-th variable:

$\mathbf{X} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_p) = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} \in \mathbb{R}^{n \times p}$

• Each column of $\mathbf{X}$ contains all observations of one variable.

• Each row of $\mathbf{X}$ contains all variables for one observation.
Measures of Central Tendency
• Throughout this lecture, we use the symbol $\mu_j$ to represent the population mean of the
j-th variable and the symbol $\bar{x}_j$ to represent the sample mean based on the observed
data for the j-th variable.

• The population mean is the measure of central tendency for the population. The population
mean for the j-th variable is

$\mu_j = E(x_{ij})$

• The population mean can be estimated by the sample mean

$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$
Population Mean Vector
• A collection of the population means of all variables can be written as the
population mean vector

$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} = E\begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix} = E(\boldsymbol{x}_i)$
Sample Mean Vector
• We can estimate the population mean vector by the sample mean vector

$\bar{\boldsymbol{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix} = \begin{pmatrix} \frac{1}{n}\sum_{i=1}^{n} x_{i1} \\ \frac{1}{n}\sum_{i=1}^{n} x_{i2} \\ \vdots \\ \frac{1}{n}\sum_{i=1}^{n} x_{ip} \end{pmatrix} = \frac{1}{n}\sum_{i=1}^{n} \boldsymbol{x}_i$
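In numpy this is simply a column-wise average, as the sketch below shows; the small matrix reuses the first rows of the nutrient data for illustration.

```python
# Sample mean vector sketch: the column-wise average of the n x p data matrix.
import numpy as np

X = np.array([[522.29, 10.188, 42.561],
              [343.32,  4.113, 67.793],
              [858.26, 13.741, 59.933],
              [575.98, 13.245, 42.215]])          # first rows of Calcium, Iron, Protein

x_bar = X.mean(axis=0)                            # length-p vector of sample means
```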
Sample Mean is Unbiased
• The sample mean (vector) is an unbiased descriptive statistic of the population mean (vector):

$E(\bar{x}_j) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} x_{ij}\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_{ij}) = \frac{1}{n}\sum_{i=1}^{n} \mu_j = \mu_j,$

$\text{and} \quad E(\bar{\boldsymbol{x}}) = \begin{pmatrix} E(\bar{x}_1) \\ E(\bar{x}_2) \\ \vdots \\ E(\bar{x}_p) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} = \boldsymbol{\mu}.$
Why Do We Care About Bias?
• Statistical bias is defined as the difference between the population parameter and
the expectation of the estimator. For example,

$\mu_j - E(\bar{x}_j).$

• The expectation of the estimator is the value your estimator converges to (on average)
when the sample size is large enough. This is the "best" you can expect from your estimator.

• If the bias is not 0, there will be a non-vanishing estimation error even if you increase
your sample size.
Measures of Dispersion
• A variance measures the degree of dispersion (spread) in a variable's values.

• The population variance of the j-th variable is

$\sigma_j^2 = \mathrm{Var}(x_{ij}) = E(x_{ij} - \mu_j)^2 = E(x_{ij}^2) - \mu_j^2.$

• The population standard deviation of the j-th variable is

$\sigma_j = \sqrt{\mathrm{Var}(x_{ij})}$
Sample Variance
• The population variance $\sigma_j^2$ can be estimated by the sample variance

$s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2.$

• The sample standard deviation for the j-th variable is simply the square root of
the sample variance, i.e. $s_j$.

• Question: why divide by $n - 1$ instead of $n$?
Why divide by n − 1?
• $\sum_{i=1}^{n} (x_{ij} - \bar{x}_j) = 0$; thus, if we know $n-1$ of the deviations, we can compute
the last one.

• This means that there are only $n-1$ freely varying deviations, i.e. $n-1$ degrees of
freedom.

• Dividing by $n-1$ makes the sample variance an unbiased descriptive statistic of the
population variance:

$E(s_j^2) = \sigma_j^2.$
Dividing by n
Pros
• From a purely descriptive point of view, dividing by n in the definition of the
sample variance makes more sense.
Cons
• Biased sample variance.

Dividing by n − 1
Pros
• Unbiased sample variance.
Cons
• Not intuitive.

When n is large, n ≈ n − 1 and the difference is negligible (see the sketch below).
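The two conventions correspond to numpy's ddof argument, as the short sketch below illustrates with the first Iron values from the data peek.

```python
# Dividing by n versus n - 1 in numpy: ddof=0 divides by n (biased),
# ddof=1 divides by n - 1 (unbiased sample variance).
import numpy as np

x = np.array([10.188, 4.113, 13.741, 13.245, 18.919])   # Iron, first five observations

var_n  = x.var(ddof=0)       # divide by n
var_n1 = x.var(ddof=1)       # divide by n - 1
sd_n1  = x.std(ddof=1)       # sample standard deviation
```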
Relation between center and dispersion
• Measures of center and measures of dispersion are best thought of together, in
the context of an error function.

• The error function measures how well a single number 𝑎 represents the entire
data set.

• The values of 𝑎 (if they exist) that minimize the error functions are our measures
of center.

• The minimum value of the error function is the corresponding measure of spread.
Mean Square Error Function
• The mean square error (MSE) function is defined by

$\mathrm{MSE}(a) = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - a)^2.$

• Minimizing the MSE with respect to $a$ is equivalent to solving

$\frac{d}{da}\mathrm{MSE}(a) = -\frac{2}{n-1}\sum_{i=1}^{n} (x_{ij} - a) = 0.$

• The MSE is minimized at $a = \bar{x}_j$, the sample mean.

• The minimum value of the MSE is $s_j^2$, the sample variance.
Measures of Association: Covariance
• The population covariance is a measure of the association between pairs
of variables. The population covariance between variables j and k is

$\sigma_{jk} = E\{(x_{ij} - \mu_j)(x_{ik} - \mu_k)\}$

• The product $(x_{ij} - \mu_j)(x_{ik} - \mu_k)$ is a function of the random variables
$x_{ij}$ and $x_{ik}$. Therefore it is itself a random variable and has a
population mean.

• A positive population covariance means that the two variables are
positively associated (and similarly for a negative covariance).
Population Covariance Matrix
• The population variances and covariances can be collected into the population
variance-covariance matrix; this is also known as the population dispersion matrix.

$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_p^2 \end{pmatrix} \in \mathbb{R}^{p \times p}.$

• The population variance-covariance matrix is a symmetric matrix.
Sample Covariance
• The population covariance between variables j and k can be estimated by the sample
covariance

$s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k).$

• $s_{jk} = 0$: suggests the two variables are uncorrelated (not independent!);
• $s_{jk} > 0$: suggests the two variables are positively correlated (j ↑ when k ↑);
• $s_{jk} < 0$: suggests the two variables are negatively correlated (j ↓ when k ↑).

Unbiasedness: $E(s_{jk}) = \sigma_{jk}$
Sample Covariance Matrix
• The population variance-covariance matrix may be estimated by the sample
variance-covariance matrix

$\mathbf{S} = \begin{pmatrix} s_1^2 & s_{12} & \cdots & s_{1p} \\ s_{21} & s_2^2 & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_p^2 \end{pmatrix} \in \mathbb{R}^{p \times p}.$

• The sample variance-covariance matrix is also a symmetric matrix.

Unbiasedness: $E(\mathbf{S}) = \boldsymbol{\Sigma}$ (implied by the element-wise result)
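A short numpy sketch of the sample covariance matrix follows; the small matrix reuses the first rows of the nutrient data and is only illustrative.

```python
# Sample covariance matrix sketch. np.cov treats rows as variables by default,
# so pass rowvar=False for an n x p data matrix; it divides by n - 1.
import numpy as np

X = np.array([[522.29, 10.188, 42.561],
              [343.32,  4.113, 67.793],
              [858.26, 13.741, 59.933],
              [575.98, 13.245, 42.215]])

S = np.cov(X, rowvar=False)                       # p x p sample covariance matrix
```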
Measures of Association: Correlation
• The sign of the covariance value is useful to suggest positive correlation, negative
correlation, or no correlation.
• The magnitude of the covariance value is not particularly helpful, as it depends on
the magnitudes (scales) of the two variables. It does not tell us the strength of the
association.
• To assess the strength of an association, we use correlation values. The population
correlation between variables j and k is

$\rho_{jk} = \frac{\sigma_{jk}}{\sigma_j \sigma_k}$
Correlation and Data Transformation
• The correlation of the raw data is equivalent to the covariance of the Z-score
standardized data.

• After Z-score standardization, $\sigma_j = \sigma_k = 1$.

• Correlation is a "standardized" version of covariance.

• Correlation is "scale invariant", as its value does not change if we apply a
linear transformation (except multiplication by 0) to the variable.
Population Correlation
• The population correlation $\rho_{jk}$ has the same sign as $\sigma_{jk}$.
• The population correlation $\rho_{jk}$ must lie between -1 and 1:

$-1 \le \rho_{jk} \le 1.$

• $\rho_{jk} = 0$: the two variables are uncorrelated;
• $\rho_{jk}$ close to 1: strong positive dependence;
• $\rho_{jk}$ close to -1: strong negative dependence.
Sample Correlation
• The population correlation may be estimated by substituting the sample covariances
and standard deviations into the formula.
• The sample correlation between variables j and k is

$r_{jk} = \frac{s_{jk}}{s_j s_k}$

• $r_{jk} = 0$: suggests the two variables are uncorrelated;
• $r_{jk}$ close to 1: suggests strong positive dependence;
• $r_{jk}$ close to -1: suggests strong negative dependence.

Note: unlike the sample covariance, $r_{jk}$ is only approximately unbiased,
$E(r_{jk}) \approx \rho_{jk}$, with a bias that vanishes as $n$ grows.
Correlation Matrix
• The population correlation matrix and the sample correlation matrix are

$\mathbf{P} = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{pmatrix} \quad \text{and} \quad \mathbf{R} = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix}.$

• The above two matrices are also symmetric, with 1's on the diagonal.

As with the sample correlation, $E(\mathbf{R}) \approx \mathbf{P}$ element-wise.
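In numpy the sample correlation matrix can be obtained directly, as sketched below with the same illustrative rows as before.

```python
# Sample correlation matrix sketch: standardize then take covariances,
# or simply use np.corrcoef.
import numpy as np

X = np.array([[522.29, 10.188, 42.561],
              [343.32,  4.113, 67.793],
              [858.26, 13.741, 59.933],
              [575.98, 13.245, 42.215]])

R = np.corrcoef(X, rowvar=False)                  # p x p matrix with 1's on the diagonal
```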
Example 1: USDA Women’s Health Survey

• In 1985, the USDA commissioned a study of women’s nutrition.


Nutrient intake was measured for a random sample of 737 women
aged between 25 and 50.

• The following variables were measured:


1. Calcium (mg), 2. Iron (mg), 3. Protein (g)
4. Vitamin A (μg), 5. Vitamin C (mg)

Q: Find the descriptive statistics of this dataset.


Sample Mean and Sample Standard Deviation

• Here we calculate the sample mean and sample standard deviation for
each variable in the dataset.

Variable     Sample mean   Sample SD
Calcium      624.0 mg      397.3 mg
Iron         11.1 mg       6.0 mg
Protein      65.8 g        30.6 g
Vitamin A    839.6 μg      1634.0 μg
Vitamin C    78.9 mg       73.6 mg
How to interpret?
• The sample mean estimates the central tendency (average amount of nutrient intake).

• The sample SD estimates dispersion.

• Notice that the standard deviations are large relative to their respective means,
especially for Vitamin A & C.

• This would indicate a high variability among women in nutrient intake.

However, whether the standard deviations are relatively large or not will depend on the
context of the application. Skill in interpreting the statistical analysis depends very
much on the researcher's subject matter knowledge.
Sample Covariance Matrix

• The sample variance-covariance matrix is shown below.

            Calcium     Iron    Protein   Vitamin A   Vitamin C
Calcium    157829.4    940.1     6075.8    102411.1      6701.6
Iron          940.1     35.8      114.1      2383.2       137.7
Protein      6075.8    114.1      934.9      7330.1       477.2
Vitamin A  102411.1   2383.2     7330.1   2668452.4     22063.3
Vitamin C    6701.6    137.7      477.2     22063.3      5416.3

How to interpret?
• The sample covariance estimates the association between different variables.

• All off-diagonal elements in this table are positive, which indicates positive dependency.

• A woman with above-average Calcium intake may also have above-average intake of the
other nutrients.

• A woman with below-average Iron intake may also have below-average intake of the
other nutrients.

However, the magnitude of the covariance value cannot be directly interpreted as the
strength of association, as it depends on the scales of the variables.
Sample Correlation Matrix

• The sample correlation matrix is shown below.

            Calcium    Iron   Protein   Vitamin A   Vitamin C
Calcium      1.000    0.395    0.500      0.158       0.229
Iron         0.395    1.000    0.623      0.244       0.313
Protein      0.500    0.623    1.000      0.147       0.212
Vitamin A    0.158    0.244    0.147      1.000       0.184
Vitamin C    0.229    0.313    0.212      0.184       1.000
How to interpret?
• The sample correlation estimates the association between standardized variables.

• All off-diagonal elements in this table are positive, which indicates positive dependency.

• The magnitude indicates the strength of the dependency.

• High-correlation pairs: Calcium-Iron, Calcium-Protein, Iron-Protein.

Why are these three nutrients highly correlated? This could be a good research problem!
Overall Measures of Dispersion
• Sometimes it is also useful to have an overall measure of dispersion in
the data. Such a measure should include all of the variables simultaneously,
rather than one at a time.

• The following two quantities are used to measure the dispersion of all
variables together:
1. Total variance
2. Generalized variance
Total Variance
• The population total variance is defined as the trace of the population variance-
covariance matrix:

$\mathrm{trace}(\boldsymbol{\Sigma}) = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_p^2$

• The total variance is the sum of the variances of all variables in the dataset.
• The population total variance can be estimated by the trace of $\mathbf{S}$:

$\mathrm{trace}(\mathbf{S}) = s_1^2 + s_2^2 + \cdots + s_p^2$
Generalized Variance
• The population generalized variance is defined as the determinant of the population
variance-covariance matrix:

$\det(\boldsymbol{\Sigma}) \quad \text{or} \quad |\boldsymbol{\Sigma}|$

• The generalized variance also accounts for the off-diagonal elements of $\boldsymbol{\Sigma}$
(i.e. covariance effects).
• The population generalized variance can be estimated by the determinant of $\mathbf{S}$:

$\det(\mathbf{S}) \quad \text{or} \quad |\mathbf{S}|$
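Both overall measures are one-liners in numpy once the sample covariance matrix is available; the sketch below uses the first five nutrient rows purely for illustration.

```python
# Total variance (trace) and generalized variance (determinant) of the
# sample covariance matrix, sketched with numpy.
import numpy as np

X = np.array([[522.29,  10.188,  42.561],
              [343.32,   4.113,  67.793],
              [858.26,  13.741,  59.933],
              [575.98,  13.245,  42.215],
              [1927.50, 18.919, 111.316]])

S = np.cov(X, rowvar=False)

total_variance       = np.trace(S)                # sum of the sample variances
generalized_variance = np.linalg.det(S)           # det(S)
```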
