Stats Lecture 1: Probability

1. Summary statistics

- The mean and standard deviation summarize a set of data:
$$\mathrm{mean}(x) = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$\mathrm{std}(x) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
In these equations, $x_i$ is the $i$th data point and $n$ is the total number of data points.
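- As a concrete check of these formulas, here is a minimal Python sketch (assuming NumPy is available; the data values are hypothetical):

```python
import numpy as np

# Hypothetical data; any 1-D array works.
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])
n = len(x)

mean_manual = x.sum() / n
std_manual = np.sqrt(((x - mean_manual) ** 2).sum() / (n - 1))

# Library equivalents; ddof=1 selects the n - 1 denominator above.
assert np.isclose(mean_manual, np.mean(x))
assert np.isclose(std_manual, np.std(x, ddof=1))
```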
- The median and interquartile range (IQR) also summarize data. They are nonparametric
statistics, as they make minimal assumptions about the form of the data. The Xth percentile is the
value below which X% of the data points lie. The median is the 50th percentile. The IQR is the
difference between the 75th and 25th percentiles.
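- A minimal sketch of these quantities in Python (assuming NumPy; the data values, including the outlier, are hypothetical):

```python
import numpy as np

# Hypothetical data with one outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

median = np.percentile(x, 50)           # 50th percentile, same as np.median(x)
q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25

print(median, iqr)   # barely affected by the outlier
print(np.mean(x))    # pulled far toward the outlier
```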
- Mean and standard deviation are appropriate when the data are roughly Gaussian. When the
data are not Gaussian (e.g. skewed, heavy-tailed, outliers present), the mean and standard
deviation may be misleading and the median and IQR may be preferable.
2. Probability distributions
- A probability distribution (or probability density function) is a mathematical function of one or
more variables that describes the likelihood of observing any specific set of values for the
variables. Distributions can be univariate (pertaining to one variable) or multivariate (pertaining
to more than one variable); we will stick with the univariate case for now. The integral of a
probability density function necessarily equals one.
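- As a quick numerical sanity check of the unit-integral property (a sketch assuming NumPy; it uses the standard Gaussian density with µ = 0 and σ = 1, introduced in the next bullet):

```python
import numpy as np

# Standard Gaussian density (mu = 0, sigma = 1) on a fine grid.
x = np.linspace(-10.0, 10.0, 100_001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Riemann-sum approximation of the integral.
dx = x[1] - x[0]
print((p * dx).sum())   # ~1.0, as required of a probability density
```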
- The Gaussian (or normal) distribution is a very useful probability distribution. It is parametric
in the sense that it places certain constraints on the distribution of the data (the distribution must
be unimodal, symmetric, etc.). The Gaussian distribution has two parameters, the mean (µ) and
the standard deviation (σ), and is given by the following equation:
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
For any given value x, this equation specifies how to compute p(x), the likelihood of that value.
When points are drawn from a Gaussian distribution, approximately 68% and 95% of the points will fall within 1 and 2 standard deviations of the mean, respectively.
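- A small simulation illustrating the 68%/95% rule (a sketch assuming NumPy; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility
mu, sigma = 0.0, 1.0
draws = rng.normal(mu, sigma, size=100_000)

# Fraction of draws within 1 and 2 standard deviations of the mean.
within1 = np.mean(np.abs(draws - mu) <= 1 * sigma)
within2 = np.mean(np.abs(draws - mu) <= 2 * sigma)
print(within1, within2)   # ~0.68 and ~0.95
```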
- Given a set of data, the Gaussian distribution that best describes the data (i.e. maximizes the
likelihood of the data) is the one whose mean and standard deviation are matched to the mean
and standard deviation of the data. Thus, when computing the mean and standard deviation of a
set of data, you are in a sense fitting a Gaussian distribution to the data.
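- This correspondence can be verified numerically (a sketch assuming SciPy, whose norm.fit performs a maximum-likelihood Gaussian fit; the data are simulated):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(5.0, 2.0, size=1_000)   # hypothetical sample

mu_hat, sigma_hat = norm.fit(data)   # maximum-likelihood Gaussian fit
print(mu_hat, np.mean(data))     # fitted mean equals the sample mean
print(sigma_hat, np.std(data))   # fitted sigma equals the sample std (ddof=0)
```

Note that the maximum-likelihood estimate of σ uses n rather than n − 1 in the denominator, which is why it matches np.std with its default ddof=0 rather than the n − 1 formula given earlier.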
- An advantage of the Gaussian distribution is that it is simple and may be a reasonable
approximation for many types of data. But what if the data are not Gaussian? If there is a suitable
parametric probability distribution for the data (e.g. the Poisson distribution), we could choose to
use it. Alternatively, we can adopt nonparametric techniques that take a more flexible approach,
allowing the data themselves to determine the form of the probability distribution. Such
techniques include histograms, bootstrapping, and kernel density estimation, and are covered
later in this lecture.
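- As a preview, here is a minimal sketch of two of these techniques (assuming NumPy and SciPy; the skewed data are simulated):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
data = rng.exponential(scale=1.0, size=500)   # skewed, non-Gaussian data

# Histogram as a density estimate: density=True normalizes the bars
# so that they integrate to one.
counts, edges = np.histogram(data, bins=30, density=True)

# Kernel density estimate: a smooth, data-driven density.
kde = gaussian_kde(data)
grid = np.linspace(0.0, data.max(), 200)
density = kde(grid)
```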
3. Error bars
- When measuring some quantity, we may find that the measurement is different each time it is
performed. We attribute this variability to noise, i.e. any factor that contributes to variability in
the measurement.
- Statistically speaking, the measurements we make constitute a sample from the population, i.e.
the underlying probability distribution that describes the measurement process. The problem is
that we are interested in characteristics of the population but all we can observe is our finite
sample from the population.
- A statistic (e.g. mean) computed on a random sample is subject to variability and is not the
same as the statistic computed on the whole population (technically known as the parameter).
Thus, we need to distrust, to some degree, the statistic computed on the sample. To indicate
uncertainty on the statistic, it is useful to plot error bars indicating the standard error.
- To understand standard error, let's consider a simple example. Suppose we randomly draw n points from a Gaussian distribution with standard deviation σ and compute the mean of these points. Then suppose we repeat this process many more times. The distribution of the resulting means will have a standard deviation equal to σ/√n. This is the standard error, i.e. the standard
deviation of the sampling distribution of the statistic. Thus, given a single sample of n data
points, the mean of the sample may be offset from the true population mean, and the standard
error indicates about how far away the true population mean may be. (Note that when computing
standard error on actual data, the standard deviation of the population is unknown, so we use the
standard deviation of the sample as an estimate.)
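- The example above is easy to simulate (a sketch assuming NumPy; σ, n, and the number of repetitions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n = 2.0, 25

# Draw n points and take the mean, repeated 10,000 times.
means = rng.normal(0.0, sigma, size=(10_000, n)).mean(axis=1)

print(means.std())          # empirical SD of the means...
print(sigma / np.sqrt(n))   # ...matches sigma / sqrt(n) = 0.4

# With real data we only have one sample, so we plug in its std.
sample = rng.normal(0.0, sigma, size=n)
se_hat = np.std(sample, ddof=1) / np.sqrt(n)
```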
- Confidence intervals are intimately related to standard error. Assuming that the sampling
distribution is Gaussian, ±1 standard error gives the 68% confidence interval and ±2
standard errors gives the 95% confidence interval. Technically, the interpretation of confidence
intervals is that with repeated experiments, we can expect that X% of the time, the true
population parameter will be contained within the X% confidence interval. More loosely, we can
use confidence intervals as indicators of our uncertainty in our estimates.
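- A minimal sketch of computing confidence intervals from a single sample (assuming NumPy; the data are simulated):

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.normal(10.0, 3.0, size=50)   # hypothetical sample

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))

ci68 = (mean - se, mean + se)            # ~68% confidence interval
ci95 = (mean - 2 * se, mean + 2 * se)    # ~95% confidence interval
print(ci68, ci95)
```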
- Test-retest refers to the idea of collecting a set of data, performing some analyses on those data,
and then repeating the whole process on a fresh set of data. Variation between the first set of
results and the second set of results tells us something about the reliability (or replicability or
reproducibility) of the results. We can construe test-retest as a simple procedure for estimating error bars: each set of results is one draw from the distribution of possible results, and test-retest gives us two such draws, as sketched below.
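- A tiny simulation of the test-retest idea (assuming NumPy; all parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, n = 2.0, 25

# "Test" and "retest": the same analysis on two fresh samples.
test = rng.normal(0.0, sigma, size=n).mean()
retest = rng.normal(0.0, sigma, size=n).mean()

# The two results are two draws from the sampling distribution of the
# mean; the difference of two independent draws has SD sqrt(2) * SE,
# so their disagreement is on the order of the standard error.
print(abs(test - retest), sigma / np.sqrt(n))
```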