
Applied Multivariate Analysis

LILI WANG

Time: Friday, 2:00-4:00 PM


Location: DingDing Group
Objectives: What is the scientific problem of
interest? How can it be framed as a multivariate
analysis problem? Does the explanation agree with
common sense and domain knowledge?

Challenges in Multivariate Analysis

Dataset: Data is noisy. Measurement error.
Missing data. Normality versus heavy-tailedness.
Homogeneity versus heterogeneity.

Methodology: Which method should we use?
What are the assumptions? Is it statistically
accurate? Computationally efficient?
Exploratory Data Analysis
• An important step in data analysis is exploratory data analysis (EDA).

• EDA methods include visualization of data, descriptive statistics, and more.

• The benefits of EDA:

1. Check the quality of data: cleaned or not, missing data, outliers, …
2. Gain a first impression of the data: data type, distribution, symmetry, …
3. Illustrate analysis results: gain intuition, collaboration, general audience, …
Visualization of Multivariate Data
• Why do we look at graphical displays of the data?
One difficulty in understanding complex multivariate data is the limits of the
human perceptual system. Visualization tools can help!

• Visualization may:
1. suggest a plausible model for the data,
2. assess the validity of model assumptions,
3. detect outliers or suggest a plausible normalizing transformation,
4. and more.
Scatter Plot
• A scatter plot is a data visualization tool that uses dots to represent the values
obtained for two different variables.

• Plotted on Cartesian coordinates: x-axis is the value of the first variable and y-
axis is the value of the second variable.

• Used to check the relationship between two variables:


1. Correlation
2. Linear or nonlinear
3. Joint normality
4. Groups
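A minimal Python sketch of such a scatter plot is shown below. It is not part of the original slides; the file name "nutrient.csv" and the column names are assumptions used only for illustration.

```python
# Minimal scatter-plot sketch with pandas/matplotlib (illustrative only).
# The file "nutrient.csv" and its column names are assumed, not given in the slides.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("nutrient.csv")            # assumed: one row per woman, one column per nutrient

fig, ax = plt.subplots()
ax.scatter(df["Calcium"], df["Iron"], s=10, color="red", alpha=0.6)
ax.set_xlabel("Calcium (mg)")
ax.set_ylabel("Iron (mg)")
ax.set_title("Scatter plot of Calcium vs. Iron")
plt.show()
```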
Example 1: USDA Women’s Health Survey

• In 1985, the USDA commissioned a study of women’s nutrition.


Nutrient intake was measured for a random sample of 737 women
aged 25-50 years.

• The following variables were measured:


1. Calcium (mg)
2. Iron (mg)
3. Protein (g)
4. Vitamin A (μg)
5. Vitamin C (mg)
A Peek at the Data
• Dataset contains 5 variables and 737 observations.
• Table of the first five observations:

     Calcium     Iron   Protein   Vitamin A   Vitamin C
1     522.29   10.188    42.561      349.13      54.141
2     343.32    4.113    67.793      266.99      24.839
3     858.26   13.741    59.933      667.90     155.455
4     575.98   13.245    42.215      792.23     224.688
5    1927.50   18.919   111.316      740.27      80.961
Scatter Plot between Calcium and Iron

• Each red dot represents an observation

• Calcium on x-axis

• Iron on y-axis

What can we say about this plot?
Some Questions

• Are there outliers?

• What model should we use to fit the data?

• Can we model the data using a bivariate normal?

• The majority of the data is clustered in the shaded area. Are the other points outliers?

• How large is large?

• How do we compare a value of 500 in Calcium with 30 in Iron?
Standardize Data
• Rescale data from different sources and measurement scales to a "standard" scale.
• Avoid comparing apples to oranges.
• A common standardization is Z-score scaling, which scales a random sample to have
zero sample mean and unit sample variance.
• For a random sample $(x_1, \ldots, x_n)$, the Z-score scaling transforms each observation by

$x_i^* = \frac{x_i - \bar{x}}{s},$

where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.
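A small numpy sketch of Z-score scaling applied column by column is given below; the values reuse the first rows of the table above, and everything else is an assumption for illustration.

```python
# Z-score scaling sketch: subtract the sample mean and divide by the
# sample standard deviation (ddof=1), column by column.
import numpy as np

X = np.array([[522.29, 10.188],
              [343.32,  4.113],
              [858.26, 13.741],
              [575.98, 13.245]])            # Calcium and Iron, first rows of the table above

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(X_std.mean(axis=0))                   # approximately 0 for each column
print(X_std.std(axis=0, ddof=1))            # 1 for each column
```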
Scatter plot after Z-score standardization

• Zero sample mean and unit sample variance.

• Solid lines stand for 3 standard deviations.

• If the data is normal, we should expect about 99% of the data in the bottom-left box.
Transformation Methods
• Sometimes data is "irregular": non-normal, with outliers, skewed, heavy-tailed, …

• Data transformation techniques can be used to stabilize the variance, make
the data more normal-like, and improve the validity of measures of association.

• Power transformation:
$y = x^{\alpha}, \quad 0 < \alpha < 1.$

• Log transformation:
$y = \ln(x), \quad x > 0.$
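Below is a short numpy sketch of both transformations; the sample values are taken from the Calcium column above and the choice of exponent is an assumption.

```python
# Power and log transformation sketch (illustrative values only).
import numpy as np

x = np.array([522.29, 343.32, 858.26, 575.98, 1927.50])   # e.g. Calcium intake (mg)

y_power = x ** 0.5        # power transformation with alpha = 0.5 (assumed)
y_log   = np.log(x)       # log transformation, defined for x > 0
```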
Scatter plot after Log transformation

• More normal-like.

• Reduced variance.

• Fewer outliers.
Scatter Plot for Three Variables
• The scatter plot can be extended to visualize the relationship among three
different variables; this is called a 3D scatter plot.

• Plotted on Cartesian coordinates: the x-axis is the value of the first variable,
the y-axis is the value of the second variable, and the z-axis is the value of
the third variable.

• A fourth variable can be encoded as the color or size of the markers (see the sketch below).
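A minimal 3D scatter-plot sketch with matplotlib follows; the file name and columns are the same assumptions as before.

```python
# 3D scatter-plot sketch with matplotlib; file and column names are assumed.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("nutrient.csv")

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(df["Calcium"], df["Iron"], df["Protein"], s=10, color="red")
ax.set_xlabel("Calcium (mg)")
ax.set_ylabel("Iron (mg)")
ax.set_zlabel("Protein (g)")
# A fourth variable could be mapped to marker color via the c= argument.
plt.show()
```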
3D scatter plot for Calcium, Iron and Protein

• Each red dot represents an observation

• Calcium on x-axis

• Iron on y-axis

• Protein on z-axis
3D scatter plot after Log transformation

• Each blue dot represents an observation

• More clustered in a "ball"

• Reduced variance

• Fewer outliers
Pros and Cons of 3D Scatter Plot

Pros
• Visualizes 3 or 4 variables.
• Shows joint relationships rather than only pairwise ones.
• Shows the joint sample distribution.

Cons
• Not friendly to the naked eye (angle-dependent).
• Hard to interpret.
• Does not work for more (≥ 5) variables.
Pairwise Scatter Plot for More Variables

• The pairwise scatter plot aims to visualize the relationship for each
pair of variables in a multivariate dataset (see the sketch below).

• A pairwise scatter plot is an array of scatter plots; the (i, j)-th plot in
the array is the scatter plot between the i-th and j-th variables.
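A one-call sketch using pandas' scatter-matrix helper is shown below; the data file is again an assumption.

```python
# Pairwise scatter-plot (scatter-plot matrix) sketch using pandas.
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

df = pd.read_csv("nutrient.csv")                  # assumed file with the 5 nutrient columns
scatter_matrix(df, figsize=(8, 8), diagonal="hist", alpha=0.5)
plt.show()
```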
Pairwise Scatter Plot for USDA Women’s Health Survey
Pros and Cons of Pairwise Scatter Plot

Pros
• Visualizes many variables simultaneously.
• Easy to interpret pairwise relationships.

Cons
• No joint relationship for more than 2 variables.
• The array becomes huge when the number of variables is large.
Time Plot
• A time plot (sometimes called a time series graph) displays values versus time.
It is similar to a scatter plot, but the x-axis is chosen to be time (or age,
survival time, …).

• A time plot is useful for comparing the "growth" of multiple variables with respect
to time (or some other common index).

• Applications of the time plot (see the sketch after this list):

1. Finance: compare multiple stock returns vs time
2. Clinical: compare multiple patients vs survival time
3. Biology: compare multiple genome sequences vs positions
4. Physics: multiple measurements vs time
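A minimal time-plot sketch follows; the file name "stocks.csv" and its layout (a Date column plus one price column per asset) are assumptions, not part of the slides.

```python
# Time-plot sketch: one line per variable against a shared date index.
import pandas as pd
import matplotlib.pyplot as plt

prices = pd.read_csv("stocks.csv", parse_dates=["Date"], index_col="Date")

prices.plot()                                     # one line per column (per asset)
plt.ylabel("Closing price (USD)")
plt.title("Daily stock prices")
plt.show()
```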
Time Plot for Financial Time Series

• Time series are one of the most common data types encountered in finance
and weather forecasting.

• One powerful yet simple visualization tool in financial analysis is to draw the
time plot for multiple assets.

• Things to check in a time plot:

1. Co-movement of variables
2. Trend (increasing or decreasing, …)
3. Periodic patterns (weekly, seasonal, long term, …)
4. Black swan events (huge gaps, financial crises, …)
Example 2: Stock Prices of High-tech Companies

• We collect daily stock prices (closing price) of four leading high-tech


companies between January 2018 and January 2019.

• The following variables were included:


1. Daily stock price of Apple
2. Daily stock price of Facebook
3. Daily stock price of IBM
4. Daily stock price of Microsoft
A Peek at the Data
• Dataset contains 4 variables and 250 observations (250 trading days).
• Table of the first five observations:

Date         Apple    Facebook   IBM      Microsoft
1/16/2018    176.19   178.39     163.85   88.35
1/17/2018    179.10   177.60     168.65   90.14
1/18/2018    179.26   179.80     169.12   90.10
1/19/2018    178.46   181.29     162.37   90
1/22/2018    177      185.37     162.60   91.61

Time Plot of Stock Prices

What can you say about this plot?
Some Questions?

• Can you observe any co-movement of these four stocks?

• Any trends?

• Which asset is most risky?
Pros and Cons of Price Data

Pros
• Straightforward
• Easy to check trends and high/low prices

Cons
• Comparing apples to oranges (different price scales)
• No relative gain/loss
• Non-stationary data
Log-return of Financial Assets

• In financial analysis, the logarithm of returns is more popular than prices
or raw returns.

• For an asset (e.g. stock, bond, gold, bitcoin, …), the log-return at time $t$ is

$r_t = \log(1 + R_t) = \log\!\left(1 + \frac{P_t - P_{t-1}}{P_{t-1}}\right) = \log(P_t) - \log(P_{t-1}),$

where $P_t$ and $R_t$ are the price and simple return at time $t$.


Why Log-return?
• Log-return is favored for multiple reasons:

• More normal-like (recall the log transformation)

• More stationary time series (zero mean and fixed variance)

• Log additivity:
$\sum_{t=1}^{T} r_t = \log(P_T) - \log(P_0)$

• Easy calculus:
$\frac{d}{dx} e^x = e^x, \qquad \int e^x \, dx = e^x$
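The sketch below computes log-returns from a price table with pandas and numpy; the file name and column layout are the same assumptions as in the earlier time-plot sketch.

```python
# Log-return sketch: r_t = log(P_t) - log(P_{t-1}), applied column-wise.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

prices = pd.read_csv("stocks.csv", parse_dates=["Date"], index_col="Date")

log_returns = np.log(prices).diff().dropna()      # log(P_t) - log(P_{t-1}) per asset
log_returns.plot()
plt.ylabel("Daily log-return")
plt.show()
```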
Time Plot of Stock Returns

What can you say about this plot?
Some Questions?

• Which asset is the most risky (largest variance)?

• Can you observe any black swan event?

• What is this huge drop for FB? What happened?
Radar Chart
• A radar chart (also known as a spider, web, polar, or star chart) is a graphical
method of comparing multivariate data in the form of a two-dimensional chart
of three or more quantitative variables.

• A radar chart is useful to:

1. Find similar observations
2. Find observations with high/low scores
3. Find clusters of observations
4. Find outliers
Example 3: High School Final Scores

• We generate a toy dataset. The dataset contains the final scores of
some students in a hypothetical high school.

• Suppose the following 8 subjects are tested:

1. Math, 2. English, 3. Biology, 4. Music,
5. Programming, 6. French, 7. Physics and 8. Statistics

• Each subject is scored from 1 to 10.
A Peek at the Data
• We want to compare the performance of students
• Table of first four observations

ID Math English Biology Music Prog. French Physics Stat.


1 4 6 9 3 9 2 6 8
2 5 5 3 5 3 3 8 5
3 3 2 3 5 4 8 6 8
4 9 2 6 9 4 2 7 6
Radar Chart for 1st Student

• Each axis represents the score in one subject

• Requires standardized variables

• The area covered can be considered as a score for overall performance
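One way such a chart could be drawn is sketched below on matplotlib's polar axes, using the first student's scores from the toy table above; the styling choices are assumptions.

```python
# Radar (spider) chart sketch on matplotlib polar axes.
import numpy as np
import matplotlib.pyplot as plt

subjects = ["Math", "English", "Biology", "Music",
            "Programming", "French", "Physics", "Statistics"]
scores = [4, 6, 9, 3, 9, 2, 6, 8]                 # first student in the toy dataset

angles = np.linspace(0, 2 * np.pi, len(subjects), endpoint=False).tolist()
angles += angles[:1]                              # repeat the first angle to close the polygon
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, color="tab:blue")
ax.fill(angles, values, color="tab:blue", alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(subjects)
ax.set_ylim(0, 10)
plt.show()
```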
Comparison on Parallel Charts

Comparison on One Chart

• Draw multiple observations on the same chart

• Easy to compare areas, pros and cons

• Not good for a large number of observations
Chernoff Face
• Chernoff faces, invented by Herman Chernoff in 1973, display
multivariate data in the shape of a human face.

• The individual parts, such as the eyes, ears, mouth and nose, represent
values of the variables by their shape, size, placement and orientation.

• The idea behind using faces is that people easily recognize faces and
notice small changes without difficulty.
Example 4: Crime Rates by State in 2008

• The data contains the rates of various types of crimes in different
states. The data source is Table 301 of the 2008 US Statistical Abstract.

• Rates of the following crime types are recorded:

1. Murder                2. Forcible rape
3. Robbery               4. Aggravated assault
5. Burglary              6. Larceny theft
7. Motor vehicle theft
• Each type of crime corresponds to a feature of the face.

• The shape of each feature depends on the value of the variable.

How to make a face?

Let's make some faces

Pros
• Easy to tell and remember the differences between states

Cons
• Hard to translate faces back to the values of the variables

More Faces
Heat Map
• A heat map (or heatmap) is a visualization tool which represents the values
in a data matrix by colors in a 2D graph.

• Applications of heat maps:

1. Molecular biology
2. Neuroscience
3. Physics
4. Density plots
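A minimal heat-map sketch is given below; it colors the entries of the sample correlation matrix of the nutrient data that appears later in the lecture, and the colormap choice is an assumption.

```python
# Heat-map sketch with matplotlib: color-code the entries of a matrix.
import numpy as np
import matplotlib.pyplot as plt

labels = ["Calcium", "Iron", "Protein", "Vitamin A", "Vitamin C"]
R = np.array([[1.000, 0.395, 0.500, 0.158, 0.229],
              [0.395, 1.000, 0.623, 0.244, 0.313],
              [0.500, 0.623, 1.000, 0.147, 0.212],
              [0.158, 0.244, 0.147, 1.000, 0.184],
              [0.229, 0.313, 0.212, 0.184, 1.000]])

fig, ax = plt.subplots()
im = ax.imshow(R, cmap="viridis")
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
fig.colorbar(im, ax=ax)
plt.show()
```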
Heatmap for Whole Brain Analysis
Network Map
• A network map is a visualization tool to study the connectivity of a network.
• Each node in the map represents a variable (e.g. users, characters, features).
• Two nodes are connected if there is an edge between them.

• An example: a network map for the Marvel Cinematic Universe (see the sketch below).
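A small networkx sketch follows; the edge list is a made-up toy example, not the actual Marvel Cinematic Universe data.

```python
# Network-map sketch with networkx; the edge list is hypothetical.
import networkx as nx
import matplotlib.pyplot as plt

edges = [("Iron Man", "Captain America"),
         ("Iron Man", "Spider-Man"),
         ("Captain America", "Black Widow"),
         ("Black Widow", "Hulk")]                 # hypothetical character pairs

G = nx.Graph()
G.add_edges_from(edges)                           # nodes are created automatically

nx.draw_networkx(G, with_labels=True, node_color="lightblue")
plt.axis("off")
plt.show()
```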


Univariate - Review
Descriptive Statistics
• Descriptive statistics is the term given to the analysis of data that
helps describe, show or summarize data in a meaningful way.

• Descriptive statistics are very important since raw data is hard to
interpret and visualization is not quantitatively accurate.

• Descriptive statistics do not, however, allow us to make conclusions
beyond the data we have analyzed or reach conclusions regarding any
hypotheses we might have made.
Descriptive Statistics (cont.)
• The goal of descriptive statistics is to obtain some partial descriptions of the
joint distribution of the data.

• Three aspects of the data are of importance:

1. Central Tendency. What is a typical value for each variable?
2. Dispersion. How far do the individual observations deviate from a central
value for a given variable?
3. Association. When more than one variable is studied together, how does each
variable relate to the remaining variables? How are the variables simultaneously
related to one another? Are they positively or negatively related?
Population
• A population is the collection of all people, plants, animals, or objects of interest about
which we wish to make statistical inferences (generalizations).

• The population may also be viewed as the collection of all possible random draws from a
stochastic model; for example, independent draws from a normal distribution with a given
population mean and population variance.

• A population parameter is a numerical characteristic of a population.

• In nearly all statistical problems we do not know the value of a parameter because we do
not measure the entire population. We use sample data to make an inference about the
value of a parameter.
Sample
• A sample is the subset of the population that we actually measure or observe.

• A sample statistic is a numerical characteristic of a sample. A sample statistic
estimates the unknown value of a population parameter.

• Information collected from sample statistics is sometimes referred to
as descriptive statistics.

• Statistics, as a subject matter, is the science and art of using sample
information to make generalizations about populations.
Example: USDA Women's Health Survey

Population: Intake of the 5 nutrients for all women aged between 25 and 50
in the United States.

Sample: Intake of the 5 nutrients observed from 737 women aged between 25
and 50 in the United States.

A Population Parameter: Average intake of Calcium.

A Sample Statistic: Sample mean of Calcium intake.
Notation
• $p$ = number of variables, $n$ = number of observations

• $x_{ij}$ = value of variable $j$ for the $i$-th observation

• Vector of observations for the j-th variable:

$\boldsymbol{x}_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix} = (x_{1j}, x_{2j}, \ldots, x_{nj})^{T}$
Notation (cont.)
• Data matrix (sometimes called the design, regressor or model matrix) whose j-th
column is the vector of observations for the j-th variable:

$\mathbf{X} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_p) = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} \in \mathbb{R}^{n \times p}$

• Each column of $\mathbf{X}$ contains all observations of one variable.

• Each row of $\mathbf{X}$ contains all variables for one observation.
Measures of Central Tendency
• Throughout this lecture, we use the symbol $\mu_j$ to represent the population mean of the
j-th variable and the symbol $\bar{x}_j$ to represent the sample mean based on the observed
data for the j-th variable.

• The population mean is the measure of central tendency for the population. The population
mean for the j-th variable is

$\mu_j = E(x_{ij})$

• The population mean can be estimated by the sample mean

$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$
Population Mean Vector
• A collection of the population means of all variables can be written as the
population mean vector

$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} = E\begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix} = E(\boldsymbol{x}_i)$
Sample Mean Vector
• We can estimate the population mean vector by the sample mean vector

$\bar{\boldsymbol{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix} = \begin{pmatrix} \frac{1}{n}\sum_{i=1}^{n} x_{i1} \\ \frac{1}{n}\sum_{i=1}^{n} x_{i2} \\ \vdots \\ \frac{1}{n}\sum_{i=1}^{n} x_{ip} \end{pmatrix} = \frac{1}{n}\sum_{i=1}^{n} \boldsymbol{x}_i$
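In numpy this is simply a column-wise average, as the sketch below shows; the small matrix reuses the first rows of the nutrient data for illustration.

```python
# Sample mean vector sketch: the column-wise average of the n x p data matrix.
import numpy as np

X = np.array([[522.29, 10.188, 42.561],
              [343.32,  4.113, 67.793],
              [858.26, 13.741, 59.933],
              [575.98, 13.245, 42.215]])          # first rows of Calcium, Iron, Protein

x_bar = X.mean(axis=0)                            # length-p vector of sample means
```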
Sample Mean is Unbiased
• The sample mean (vector) is an unbiased descriptive statistic of the population mean (vector):

$E(\bar{x}_j) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} x_{ij}\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_{ij}) = \frac{1}{n}\sum_{i=1}^{n} \mu_j = \mu_j,$

$\text{and} \quad E(\bar{\boldsymbol{x}}) = \begin{pmatrix} E(\bar{x}_1) \\ E(\bar{x}_2) \\ \vdots \\ E(\bar{x}_p) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} = \boldsymbol{\mu}.$
Why Do We Care About Bias?
• Statistical bias is defined as the difference between the population parameter and
the expectation of the estimator. For example,

$\mu_j - E(\bar{x}_j).$

• The expectation of the estimator is the value your estimator converges to (on average)
when the sample size is large enough. This is the "best" you can expect from your estimator.

• If the bias is not 0, there will be a non-vanishing estimation error even if you increase
your sample size.
Measures of Dispersion
• A variance measures the degree of dispersion (spread) in a variable's values.

• The population variance of the j-th variable is

$\sigma_j^2 = \mathrm{Var}(x_{ij}) = E(x_{ij} - \mu_j)^2 = E(x_{ij}^2) - \mu_j^2.$

• The population standard deviation of the j-th variable is

$\sigma_j = \sqrt{\mathrm{Var}(x_{ij})}$
Sample Variance
• The population variance $\sigma_j^2$ can be estimated by the sample variance

$s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2.$

• The sample standard deviation for the j-th variable is simply the square root of
the sample variance, i.e. $s_j$.

• Question: why divide by $n - 1$ instead of $n$?
Why divide by n − 1?
• $\sum_{i=1}^{n} (x_{ij} - \bar{x}_j) = 0$; thus, if we know $n-1$ of the deviations, we can compute
the last one.

• This means that there are only $n-1$ freely varying deviations, i.e. $n-1$ degrees of
freedom.

• Dividing by $n-1$ makes the sample variance an unbiased descriptive statistic of the
population variance:

$E(s_j^2) = \sigma_j^2.$
Dividing by n
Pros
• From a purely descriptive point of view, dividing by n in the definition of the
sample variance makes more sense.
Cons
• Biased sample variance.

Dividing by n − 1
Pros
• Unbiased sample variance.
Cons
• Not intuitive.

When n is large, n ≈ n − 1 and the difference is negligible (see the sketch below).
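The two conventions correspond to numpy's ddof argument, as the short sketch below illustrates with the first Iron values from the data peek.

```python
# Dividing by n versus n - 1 in numpy: ddof=0 divides by n (biased),
# ddof=1 divides by n - 1 (unbiased sample variance).
import numpy as np

x = np.array([10.188, 4.113, 13.741, 13.245, 18.919])   # Iron, first five observations

var_n  = x.var(ddof=0)       # divide by n
var_n1 = x.var(ddof=1)       # divide by n - 1
sd_n1  = x.std(ddof=1)       # sample standard deviation
```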
Relation between center and dispersion
• Measures of center and measures of dispersion are best thought of together, in
the context of an error function.

• The error function measures how well a single number 𝑎 represents the entire
data set.

• The values of 𝑎 (if they exist) that minimize the error functions are our measures
of center.

• The minimum value of the error function is the corresponding measure of spread.
Mean Square Error Function
• The mean square error (MSE) function is defined by

$\mathrm{MSE}(a) = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - a)^2.$

• Minimizing the MSE with respect to $a$ is equivalent to solving

$\frac{d}{da}\mathrm{MSE}(a) = -\frac{2}{n-1}\sum_{i=1}^{n} (x_{ij} - a) = 0.$

• The MSE is minimized at $a = \bar{x}_j$, the sample mean.

• The minimum value of the MSE is $s_j^2$, the sample variance.
Measures of Association: Covariance
• The population covariance is a measure of the association between pairs
of variables. The population covariance between variables j and k is

$\sigma_{jk} = E\{(x_{ij} - \mu_j)(x_{ik} - \mu_k)\}$

• The product $(x_{ij} - \mu_j)(x_{ik} - \mu_k)$ is a function of the random variables
$x_{ij}$ and $x_{ik}$. Therefore it is itself a random variable and has a
population mean.

• A positive population covariance means that the two variables are
positively associated (and similarly for a negative covariance).
Population Covariance Matrix
• The population variances and covariances can be collected into the population
variance-covariance matrix; this is also known as the population dispersion matrix.

$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_p^2 \end{pmatrix} \in \mathbb{R}^{p \times p}.$

• The population variance-covariance matrix is a symmetric matrix.
Sample Covariance
• The population covariance between variables j and k can be estimated by the sample
covariance

$s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k).$

• $s_{jk} = 0$: suggests the two variables are uncorrelated (not independent!);
• $s_{jk} > 0$: suggests the two variables are positively correlated (j ↑ when k ↑);
• $s_{jk} < 0$: suggests the two variables are negatively correlated (j ↓ when k ↑).

Unbiasedness: $E(s_{jk}) = \sigma_{jk}$
Sample Covariance Matrix
• The population variance-covariance matrix may be estimated by the sample
variance-covariance matrix

$\mathbf{S} = \begin{pmatrix} s_1^2 & s_{12} & \cdots & s_{1p} \\ s_{21} & s_2^2 & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_p^2 \end{pmatrix} \in \mathbb{R}^{p \times p}.$

• The sample variance-covariance matrix is also a symmetric matrix.

Unbiasedness: $E(\mathbf{S}) = \boldsymbol{\Sigma}$ (implied by the element-wise result)
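A short numpy sketch of the sample covariance matrix follows; the small matrix reuses the first rows of the nutrient data and is only illustrative.

```python
# Sample covariance matrix sketch. np.cov treats rows as variables by default,
# so pass rowvar=False for an n x p data matrix; it divides by n - 1.
import numpy as np

X = np.array([[522.29, 10.188, 42.561],
              [343.32,  4.113, 67.793],
              [858.26, 13.741, 59.933],
              [575.98, 13.245, 42.215]])

S = np.cov(X, rowvar=False)                       # p x p sample covariance matrix
```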
Measures of Association: Correlation
• The sign of the covariance value is useful to suggest positive correlation, negative
correlation, or no correlation.
• The magnitude of the covariance value is not particularly helpful, as it depends on
the magnitudes (scales) of the two variables. It does not tell us the strength of the
association.
• To assess the strength of an association, we use correlation values. The population
correlation between variables j and k is

$\rho_{jk} = \frac{\sigma_{jk}}{\sigma_j \sigma_k}$
Correlation and Data Transformation
• The correlation of the raw data is equivalent to the covariance of the Z-score
standardized data.

• After Z-score standardization, $\sigma_j = \sigma_k = 1$.

• Correlation is a "standardized" version of covariance.

• Correlation is "scale invariant", as its value does not change if we apply a
linear transformation (except multiplication by 0) to the variable.
Population Correlation
• The population correlation $\rho_{jk}$ has the same sign as $\sigma_{jk}$.
• The population correlation $\rho_{jk}$ must lie between -1 and 1:

$-1 \le \rho_{jk} \le 1.$

• $\rho_{jk} = 0$: the two variables are uncorrelated;
• $\rho_{jk}$ close to 1: strong positive dependence;
• $\rho_{jk}$ close to -1: strong negative dependence.
Sample Correlation
• The population correlation may be estimated by substituting the sample covariances
and standard deviations into the formula.
• The sample correlation between variables j and k is

$r_{jk} = \frac{s_{jk}}{s_j s_k}$

• $r_{jk} = 0$: suggests the two variables are uncorrelated;
• $r_{jk}$ close to 1: suggests strong positive dependence;
• $r_{jk}$ close to -1: suggests strong negative dependence.

Note: unlike the sample covariance, $r_{jk}$ is only approximately unbiased,
$E(r_{jk}) \approx \rho_{jk}$, with a bias that vanishes as $n$ grows.
Correlation Matrix
• The population correlation matrix and the sample correlation matrix are

$\mathbf{P} = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{pmatrix} \quad \text{and} \quad \mathbf{R} = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix}.$

• The above two matrices are also symmetric, with 1's on the diagonal.

As with the sample correlation, $E(\mathbf{R}) \approx \mathbf{P}$ element-wise.
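In numpy the sample correlation matrix can be obtained directly, as sketched below with the same illustrative rows as before.

```python
# Sample correlation matrix sketch: standardize then take covariances,
# or simply use np.corrcoef.
import numpy as np

X = np.array([[522.29, 10.188, 42.561],
              [343.32,  4.113, 67.793],
              [858.26, 13.741, 59.933],
              [575.98, 13.245, 42.215]])

R = np.corrcoef(X, rowvar=False)                  # p x p matrix with 1's on the diagonal
```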
Example 1: USDA Women’s Health Survey

• In 1985, the USDA commissioned a study of women’s nutrition.


Nutrient intake was measured for a random sample of 737 women
aged between 25 and 50.

• The following variables were measured:


1. Calcium (mg), 2. Iron (mg), 3. Protein (g)
4. Vitamin A (μg), 5. Vitamin C (mg)

Q: Find the descriptive statistics of this dataset.


Sample Mean and Sample Standard Deviation

• Here we calculate the sample mean and sample standard deviation for
each variable in the dataset.

Variable     Sample mean   Sample SD
Calcium      624.0 mg      397.3 mg
Iron         11.1 mg       6.0 mg
Protein      65.8 g        30.6 g
Vitamin A    839.6 μg      1634.0 μg
Vitamin C    78.9 mg       73.6 mg
How to interpret?
• The sample mean estimates the central tendency (average amount of nutrient intake).

• The sample SD estimates dispersion.

• Notice that the standard deviations are large relative to their respective means,
especially for Vitamin A & C.

• This would indicate a high variability among women in nutrient intake.

However, whether the standard deviations are relatively large or not will depend on the
context of the application. Skill in interpreting the statistical analysis depends very
much on the researcher's subject matter knowledge.
Sample Covariance Matrix

• The sample variance-covariance matrix is shown below.

            Calcium     Iron    Protein   Vitamin A   Vitamin C
Calcium    157829.4    940.1     6075.8    102411.1      6701.6
Iron          940.1     35.8      114.1      2383.2       137.7
Protein      6075.8    114.1      934.9      7330.1       477.2
Vitamin A  102411.1   2383.2     7330.1   2668452.4     22063.3
Vitamin C    6701.6    137.7      477.2     22063.3      5416.3

How to interpret?
• The sample covariance estimates the association between different variables.

• All off-diagonal elements in this table are positive, which indicates positive dependency.

• A woman with above-average Calcium intake may also have above-average intake of the
other nutrients.

• A woman with below-average Iron intake may also have below-average intake of the
other nutrients.

However, the magnitude of the covariance value cannot be directly interpreted as the
strength of association, as it depends on the scales of the variables.
Sample Correlation Matrix

• The sample correlation matrix is shown below.

            Calcium    Iron   Protein   Vitamin A   Vitamin C
Calcium      1.000    0.395    0.500      0.158       0.229
Iron         0.395    1.000    0.623      0.244       0.313
Protein      0.500    0.623    1.000      0.147       0.212
Vitamin A    0.158    0.244    0.147      1.000       0.184
Vitamin C    0.229    0.313    0.212      0.184       1.000
How to interpret?
• The sample correlation estimates the association between standardized variables.

• All off-diagonal elements in this table are positive, which indicates positive dependency.

• The magnitude indicates the strength of the dependency.

• High-correlation pairs: Calcium-Iron, Calcium-Protein, Iron-Protein.

Why are these three nutrients highly correlated? This could be a good research problem!
Overall Measures of Dispersion
• Sometimes it is also useful to have an overall measure of dispersion in
the data. Such a measure should include all of the variables simultaneously,
rather than one at a time.

• The following two quantities are used to measure the dispersion of all
variables together:
1. Total variance
2. Generalized variance
Total Variance
• The population total variance is defined as the trace of the population variance-
covariance matrix:

$\mathrm{trace}(\boldsymbol{\Sigma}) = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_p^2$

• The total variance is the sum of the variances of all variables in the dataset.
• The population total variance can be estimated by the trace of $\mathbf{S}$:

$\mathrm{trace}(\mathbf{S}) = s_1^2 + s_2^2 + \cdots + s_p^2$
Generalized Variance
• The population generalized variance is defined as the determinant of the population
variance-covariance matrix:

$\det(\boldsymbol{\Sigma}) \quad \text{or} \quad |\boldsymbol{\Sigma}|$

• The generalized variance also accounts for the off-diagonal elements of $\boldsymbol{\Sigma}$
(i.e. covariance effects).
• The population generalized variance can be estimated by the determinant of $\mathbf{S}$:

$\det(\mathbf{S}) \quad \text{or} \quad |\mathbf{S}|$
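Both overall measures are one-liners in numpy once the sample covariance matrix is available; the sketch below uses the first five nutrient rows purely for illustration.

```python
# Total variance (trace) and generalized variance (determinant) of the
# sample covariance matrix, sketched with numpy.
import numpy as np

X = np.array([[522.29,  10.188,  42.561],
              [343.32,   4.113,  67.793],
              [858.26,  13.741,  59.933],
              [575.98,  13.245,  42.215],
              [1927.50, 18.919, 111.316]])

S = np.cov(X, rowvar=False)

total_variance       = np.trace(S)                # sum of the sample variances
generalized_variance = np.linalg.det(S)           # det(S)
```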
