Lecture1&2slides PDF
LILI WANG
• Visualization may:
1. suggest a plausible model for the data,
2. assess validity of model assumptions,
3. detect outliers or suggest plausible normalizing transformation,
4. and more.
Scatter Plot
• A scatter plot is a data visualization tool that uses dots to represent the values
obtained for two different variables.
• Plotted on Cartesian coordinates: x-axis is the value of the first variable and y-
axis is the value of the second variable.
• Calcium on x-axis
• Iron on y-axis
• If data is normal, we
should expect about
99% of data in the
bottom-left box.
Transformation Methods
• Sometimes data is “irregular”: non-normal, outliers, skewed, heavy-tailed,
…
• Data transformation techniques can be used to stabilize variance, make
the data more normal-like, and improve the validity of measures of association.
• Power transformation:
$y = x^{\alpha}$, $0 < \alpha < 1$.
• Log transformation:
$y = \ln(x)$, $x > 0$.
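As a rough illustration of the two transformations above, the sketch below applies a square-root transformation (α = 0.5) and a log transformation to a small right-skewed sample and compares a simple moment-based skewness. The data values and the helper name `skewness` are made up for illustration; they are not from the lecture.

```python
import math

def skewness(values):
    """Moment-based skewness: the average cubed z-score (population form)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

# Hypothetical right-skewed, strictly positive sample with one large value.
x = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 13.0, 40.0]

alpha = 0.5                            # any 0 < alpha < 1 is allowed
y_power = [v ** alpha for v in x]      # power transformation y = x^alpha
y_log = [math.log(v) for v in x]       # log transformation, requires x > 0

skew_raw = skewness(x)
skew_power = skewness(y_power)
skew_log = skewness(y_log)
print(skew_raw, skew_power, skew_log)
```

On this sample both transformations pull in the long right tail, and the log transformation does so more strongly than the square root.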
Scatter plot after Log transformation
• More normal-like.
• Reduced variance.
• Fewer outliers.
Scatter Plot for Three Variables
• The scatter plot can be extended to visualize the relationship among
three different variables; this is called a 3D scatter plot.
• A fourth variable can be set to denote the color or size of the markers.
3D scatter plot for
Calcium, Iron and Protein
• Calcium on x-axis
• Iron on y-axis
• Protein on z-axis
3D scatter plot after Log transformation
• More clustered in a “ball”
• Reduced variance
• Fewer outliers
Pros and Cons of 3D Scatter Plot
Pros
• Visualizes 3 or 4 variables at once.
• Shows complex relationships rather than only pairwise ones.
• Shows the joint sample distribution.
Cons
• Hard to read by eye (the view depends on the angle).
• Hard to interpret.
• Does not work for five or more variables.
Pairwise Scatter Plot for More Variables
• The pairwise scatter plot aims to visualize the relationship for each
pair of variables in a multivariate dataset.
• A pairwise scatter plot is an array of scatter plots; the (i, j)-th plot in
the array is the scatter plot between the i-th and j-th variables.
Pairwise Scatter Plot for USDA Women’s Health Survey
Pros and Cons of Pairwise Scatter Plot
Pros
• Visualization for many variables
simultaneously
• Interpret pairwise relationships
Cons
•No joint relationship for more
than 2 variables
•Huge array when the number of
variables is large
Time Plot
• A time plot (sometimes called a time series graph) displays values versus time.
It is similar to scatter plot, but x-axis is chosen to be time (or age, survival time
…).
• A time plot is useful to compare the “growth” of multiple variables with respect
to time (or some other common index).
• Time series are one of the most common data types encountered in finance
and weather forecasting.
• One powerful yet simple visualization tool in financial analysis is to draw the
time plot for multiple assets.
• Any trends?
Pros
• Straightforward
• Easy to check trends and
high/low prices
Cons
• Compares apples to pears (raw prices are on different scales)
• Shows no relative gain/loss
• Data are non-stationary
Log-return of Financial Assets
• For an asset (e.g. stock, bond, gold, bitcoin …), the log-return at time $t$ is
\[ r_t = \log(1 + R_t) = \log\left(1 + \frac{P_t - P_{t-1}}{P_{t-1}}\right) = \log(P_t) - \log(P_{t-1}). \]
• Log additivity
\[ \sum_{t=1}^{T} r_t = \log(P_T) - \log(P_0) \]
• Easy calculus
\[ \frac{d}{dx} e^{x} = \int e^{x}\,dx = e^{x} \]
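The log-return definition and the log additivity property can be checked on a short price series. The sketch below is only an illustration; the prices are hypothetical values, not data from the lecture.

```python
import math

# Hypothetical price series P_0, P_1, ..., P_T (illustrative values only).
prices = [100.0, 102.0, 99.0, 105.0, 110.0]

# Simple returns R_t = (P_t - P_{t-1}) / P_{t-1}
simple_returns = [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

# Log-returns r_t = log(1 + R_t), which equals log(P_t) - log(P_{t-1})
log_returns = [math.log(1.0 + r) for r in simple_returns]

# Log additivity: the log-returns sum to log(P_T) - log(P_0).
total_log_return = sum(log_returns)
direct = math.log(prices[-1]) - math.log(prices[0])
print(total_log_return, direct)
```

Because the intermediate prices cancel in the sum, the multi-period log-return depends only on the first and last price.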
Time Plot of
Stock Returns
• Require standardized
variables
• Draw multiple
observations on the
same chart
• The individual parts, such as eyes, ears, mouth and nose represent
values of the variables by their shape, size, placement and orientation.
• The idea behind using faces is that people easily recognize faces and
notice small changes without difficulty.
Example 4: Crime Rates by State in 2008
• The shape of each facial feature depends on the value of the
corresponding variable
Pros
• Easy to tell and remember
the differences between
states
Cons
• Hard to translate faces back
to the value of variables
More Faces
Heat Map
• A heat map (or heatmap) is a visualization tool that represents values
in a data matrix by colors in a 2D graph.
• The population may also be viewed as the collection of all possible random draws from a
stochastic model; for example, independent draws from a normal distribution with a given
population mean and population variance.
• In nearly all statistical problems we do not know the value of a parameter because we do
not measure the entire population. We use sample data to make an inference about the
value of a parameter.
Sample
• A sample is the subset of the population that we actually measure or observe.
\[ \boldsymbol{x}_j = \begin{pmatrix} x_{j1} \\ x_{j2} \\ \vdots \\ x_{jn} \end{pmatrix} = (x_{j1}, x_{j2}, \dots, x_{jn})^{\mathsf{T}} \]
Notation (cont.)
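• Data matrix (sometimes called the design, regressor, or model matrix), whose
j-th column is the vector of observations for the j-th variable
\[ \mathbf{X} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_p) = \begin{pmatrix} x_{11} & \cdots & x_{p1} \\ \vdots & \ddots & \vdots \\ x_{1n} & \cdots & x_{pn} \end{pmatrix} \in \mathbb{R}^{n \times p} \]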
• The population mean is the measure of central tendency for the population. The population
mean for j-th variable is
𝜇𝑗 = E(𝑥𝑖𝑗)
\[ \bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij} \]
Population Mean Vector
• A collection of population means of all variables can be written as a
population mean vector
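\[ \boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} = \mathrm{E} \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix} = \mathrm{E}(\boldsymbol{x}_i) \]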
Sample Mean Vector
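• We can estimate the population mean vector by the sample mean vector
\[ \bar{\boldsymbol{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix} = \begin{pmatrix} \frac{1}{n} \sum_{i=1}^{n} x_{i1} \\ \frac{1}{n} \sum_{i=1}^{n} x_{i2} \\ \vdots \\ \frac{1}{n} \sum_{i=1}^{n} x_{ip} \end{pmatrix} = \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_i \]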
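The sample mean vector can be computed either variable by variable or by averaging the observation vectors; both give the same result. The sketch below shows this on a tiny data set, assuming rows are observations and columns are variables. The numbers are hypothetical.

```python
# Hypothetical data: n = 4 observations (rows) of p = 3 variables (columns).
X = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [6.0, 40.0, 400.0],
]
n = len(X)
p = len(X[0])

# Component-wise form: x_bar_j = (1/n) * sum over i of x_ij
mean_by_variable = [sum(row[j] for row in X) / n for j in range(p)]

# Vector form: x_bar = (1/n) * sum over i of the observation vector x_i
total = [0.0] * p
for row in X:
    total = [t + v for t, v in zip(total, row)]
mean_vector = [t / n for t in total]

print(mean_vector)
```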
Sample Mean is Unbiased
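• The sample mean (vector) is an unbiased estimator of the population mean (vector):
\[ \mathrm{E}(\bar{x}_j) = \mathrm{E}\left(\frac{1}{n} \sum_{i=1}^{n} x_{ij}\right) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{E}(x_{ij}) = \frac{1}{n} \sum_{i=1}^{n} \mu_j = \mu_j, \]
and
\[ \mathrm{E}(\bar{\boldsymbol{x}}) = \begin{pmatrix} \mathrm{E}(\bar{x}_1) \\ \mathrm{E}(\bar{x}_2) \\ \vdots \\ \mathrm{E}(\bar{x}_p) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} = \boldsymbol{\mu}. \]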
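Unbiasedness of the sample mean can be verified exactly on a toy example by enumerating every equally likely sample instead of simulating. The sketch below assumes a hypothetical three-value population with independent draws (sampling with replacement); it is an illustration, not part of the lecture.

```python
from itertools import product

# Hypothetical population: each value is equally likely on a single draw.
population = [1.0, 2.0, 6.0]
mu = sum(population) / len(population)   # population mean

n = 2  # sample size
# Every equally likely i.i.d. sample of size n (draws with replacement).
samples = list(product(population, repeat=n))
sample_means = [sum(s) / n for s in samples]

# E(x_bar) is the average of the sample mean over all equally likely samples.
expected_sample_mean = sum(sample_means) / len(sample_means)
print(mu, expected_sample_mean)
```

Individual sample means differ from the population mean, but their average over all possible samples matches it.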
Why Do We Care About Bias?
• Statistical bias is defined as the difference between population parameter and
the expectation of the estimator
• For example
\[ \mu_j - \mathrm{E}(\bar{x}_j). \]
• The expectation of the estimator is the value that your estimator converges to when the
sample size is large enough. This is the “best” you can expect from your estimator.
• If the bias is not 0, there will be a non-vanishing estimation error even if you increase
your sample size.
Measures of Dispersion
• The variance measures the degree of dispersion (spread) in a variable’s values.
\[ \sigma_j^2 = \mathrm{Var}(x_{ij}) \]
Sample Variance
• The population variance 𝜎𝑗2 can be estimated by the sample variance
\[ s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2. \]
• The sample standard deviation for the j-th variable is simply the square root of
the sample variance, i.e. 𝑠𝑗.
• The deviations $x_{ij} - \bar{x}_j$ always sum to zero. This means that there are only
$n - 1$ freely varying deviations, i.e. $n - 1$ degrees of freedom.
E(𝑠𝑗2) = 𝜎𝑗2 .
Dividing by $n$
Pros
• From a purely descriptive point of view, dividing by $n$ in the definition of the
sample variance makes more sense.
Cons
• Biased sample variance.
Dividing by $n - 1$
Pros
• Unbiased sample variance.
Cons
• Not intuitive.
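The trade-off between the two divisors can be seen exactly on a toy example by averaging each version of the sample variance over every equally likely sample. The sketch below assumes a hypothetical three-value population with independent draws; the helper name `sum_sq_dev` is made up for illustration.

```python
from itertools import product

# Hypothetical population: each value is equally likely on a single draw.
population = [1.0, 2.0, 6.0]
N = len(population)
mu = sum(population) / N
sigma2 = sum((v - mu) ** 2 for v in population) / N   # population variance

n = 2  # sample size
samples = list(product(population, repeat=n))          # all i.i.d. samples

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = sum(sample) / len(sample)
    return sum((v - m) ** 2 for v in sample)

# Expected value of each estimator: average over all equally likely samples.
exp_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
exp_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print(sigma2, exp_div_n_minus_1, exp_div_n)
```

Dividing by n − 1 reproduces the population variance on average, while dividing by n underestimates it by the factor (n − 1)/n.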
• The error function measures how well a single number 𝑎 represents the entire
data set.
• The values of 𝑎 (if they exist) that minimize the error functions are our measures
of center.
• The minimum value of the error function is the corresponding measure of spread.
Mean Square Error Function
• The mean square error (𝑀𝑆𝐸) function is defined by
\[ \mathrm{MSE}(a) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - a)^2. \]
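The link between the error function, the measure of center, and the measure of spread can be checked numerically. The sketch below uses a coarse grid search over candidate values of a, with the n − 1 divisor from the slide; the data and the grid are hypothetical choices made for illustration.

```python
import statistics

# Hypothetical observations of one variable.
x = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(x)

def mse(a):
    """Mean square error of representing the whole sample by the number a."""
    return sum((v - a) ** 2 for v in x) / (n - 1)

x_bar = sum(x) / n

# Coarse grid search over candidate centers 0.00, 0.01, ..., 12.00.
grid = [i / 100 for i in range(0, 1201)]
a_best = min(grid, key=mse)

print(a_best, x_bar)                        # minimizer vs. sample mean
print(mse(a_best), statistics.variance(x))  # minimum vs. sample variance
```

On this sample the minimizer coincides with the sample mean, and the minimum value equals the sample variance computed with the n − 1 divisor.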
\[ s_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) \]
Unbiasedness: $\mathrm{E}(s_{jk}) = \sigma_{jk}$
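A minimal sketch of the sample covariance follows, assuming the usual definition with divisor n − 1 (the average product of deviations from the two sample means). The function name `sample_covariance` and the data are hypothetical.

```python
import statistics

def sample_covariance(xj, xk):
    """Sample covariance s_jk with divisor n - 1 (assumed standard definition)."""
    n = len(xj)
    mean_j = sum(xj) / n
    mean_k = sum(xk) / n
    return sum((a - mean_j) * (b - mean_k) for a, b in zip(xj, xk)) / (n - 1)

# Hypothetical paired observations of two variables.
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 4.0, 6.0, 9.0]

s_uv = sample_covariance(u, v)
s_vu = sample_covariance(v, u)   # covariance is symmetric in its arguments
s_uu = sample_covariance(u, u)   # covariance of a variable with itself
print(s_uv, s_vu, s_uu, statistics.variance(u))
```

The covariance of a variable with itself reduces to its sample variance, which is why the variances sit on the diagonal of the variance-covariance matrix.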
Sample Covariance Matrix
• The population variance-covariance matrix may be estimated by the sample
variance-covariance matrix
−1 ≤ 𝜌𝑗𝑘 ≤ 1 .
Calcium 1.000 0.395 0.500 0.158 0.229
Iron 0.395 1.000 0.623 0.244 0.313
Protein 0.500 0.623 1.000 0.147 0.212
• All off-diagonal elements in this table are positive, which
indicates positive dependency.
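The bounds on the correlation coefficient can be illustrated with a small sketch. It assumes the standard Pearson sample correlation (covariance divided by the product of standard deviations); the function name and the data are hypothetical and unrelated to the survey values in the table.

```python
import math

def sample_correlation(xj, xk):
    """Pearson sample correlation (assumed standard definition)."""
    n = len(xj)
    mean_j = sum(xj) / n
    mean_k = sum(xk) / n
    s_jk = sum((a - mean_j) * (b - mean_k) for a, b in zip(xj, xk))
    s_jj = sum((a - mean_j) ** 2 for a in xj)
    s_kk = sum((b - mean_k) ** 2 for b in xk)
    # The common 1/(n - 1) factors cancel in the ratio.
    return s_jk / math.sqrt(s_jj * s_kk)

# Hypothetical data.
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 4.0, 6.0, 9.0]   # increases with u, but not perfectly linearly
w = [8.0, 6.0, 4.0, 2.0]   # exactly linear and decreasing in u

r_uv = sample_correlation(u, v)
r_uw = sample_correlation(u, w)
print(r_uv, r_uw)
```

A positive value, as in the table above, means the two variables tend to move together; an exactly linear decreasing relationship reaches the lower bound of −1.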
• The following two quantities are used to measure the dispersion of all
variables together
1. Total variance
2. Generalized variance
Total Variance
• Population total variance is defined as the trace of the population variance-
covariance matrix
• Total variance is the sum of variances for all variables in the dataset.
• Population total variance can be estimated by the trace of 𝐒
$\det(\boldsymbol{\Sigma})$ or $|\boldsymbol{\Sigma}|$
$\det(\mathbf{S})$ or $|\mathbf{S}|$
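Both summary measures can be computed directly from a variance-covariance matrix. The sketch below takes the total variance as the trace and, as the det notation above suggests, the generalized variance as the determinant. The 2 × 2 matrix is hypothetical, and the determinant formula used here only covers the 2 × 2 case.

```python
# Hypothetical 2 x 2 sample variance-covariance matrix S (symmetric).
S = [
    [4.0, 1.5],
    [1.5, 9.0],
]

# Total variance: trace of S, i.e. the sum of the diagonal variances.
total_variance = sum(S[j][j] for j in range(len(S)))

# Generalized variance: determinant of S (closed form for a 2 x 2 matrix only).
generalized_variance = S[0][0] * S[1][1] - S[0][1] * S[1][0]

print(total_variance, generalized_variance)
```

The total variance ignores the off-diagonal covariances, while the determinant shrinks as the variables become more strongly correlated.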