Stats Lecture 1: Probability

1. Summary statistics

- The mean and standard deviation summarize a set of data:
$$\mathrm{mean}(x) = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$\mathrm{std}(x) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
In these equations, $x_i$ is the $i$th data point and $n$ is the total number of data points.
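- As a concrete check of these formulas, here is a minimal Python sketch (assuming NumPy is available; the data values are hypothetical):

```python
import numpy as np

# Hypothetical data; any 1-D array works.
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])
n = len(x)

mean_manual = x.sum() / n
std_manual = np.sqrt(((x - mean_manual) ** 2).sum() / (n - 1))

# Library equivalents; ddof=1 selects the n - 1 denominator above.
assert np.isclose(mean_manual, np.mean(x))
assert np.isclose(std_manual, np.std(x, ddof=1))
```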
- The median and interquartile range (IQR) also summarize data. They are nonparametric
statistics, as they make minimal assumptions about the form of the data. The Xth percentile is the
value below which X% of the data points lie. The median is the 50th percentile. The IQR is the
difference between the 75th and 25th percentiles.
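- A minimal sketch of these quantities in Python (assuming NumPy; the data values, including the outlier, are hypothetical):

```python
import numpy as np

# Hypothetical data with one outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

median = np.percentile(x, 50)           # 50th percentile, same as np.median(x)
q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25

print(median, iqr)   # barely affected by the outlier
print(np.mean(x))    # pulled far toward the outlier
```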
- Mean and standard deviation are appropriate when the data are roughly Gaussian. When the
data are not Gaussian (e.g. skewed, heavy-tailed, outliers present), the mean and standard
deviation may be misleading and the median and IQR may be preferable.
2. Probability distributions
- A probability distribution (or probability density function) is a mathematical function of one or
more variables that describes the likelihood of observing any specific set of values for the
variables. Distributions can be univariate (pertaining to one variable) or multivariate (pertaining
to more than one variable); we will stick with the univariate case for now. The integral of a
probability density function necessarily equals one.
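- As a quick numerical sanity check of the unit-integral property (a sketch assuming NumPy; it uses the standard Gaussian density with µ = 0 and σ = 1, introduced in the next bullet):

```python
import numpy as np

# Standard Gaussian density (mu = 0, sigma = 1) on a fine grid.
x = np.linspace(-10.0, 10.0, 100_001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Riemann-sum approximation of the integral.
dx = x[1] - x[0]
print((p * dx).sum())   # ~1.0, as required of a probability density
```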
- The Gaussian (or normal) distribution is a very useful probability distribution. It is parametric
in the sense that it places certain constraints on the distribution of the data (the distribution must
be unimodal, symmetric, etc.). The Gaussian distribution has two parameters, the mean (µ) and
the standard deviation (σ), and is given by the following equation:
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
For any given value x, this equation specifies how to compute p(x), the likelihood of that value.
When points are drawn from a Gaussian distribution, approximately 68% and 95% of the points will fall within 1 and 2 standard deviations of the mean, respectively.
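- A small simulation illustrating the 68%/95% rule (a sketch assuming NumPy; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility
mu, sigma = 0.0, 1.0
draws = rng.normal(mu, sigma, size=100_000)

# Fraction of draws within 1 and 2 standard deviations of the mean.
within1 = np.mean(np.abs(draws - mu) <= 1 * sigma)
within2 = np.mean(np.abs(draws - mu) <= 2 * sigma)
print(within1, within2)   # ~0.68 and ~0.95
```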
- Given a set of data, the Gaussian distribution that best describes the data (i.e. maximizes the
likelihood of the data) is the one whose mean and standard deviation are matched to the mean
and standard deviation of the data. Thus, when computing the mean and standard deviation of a
set of data, you are in a sense fitting a Gaussian distribution to the data.
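- This correspondence can be verified numerically (a sketch assuming SciPy, whose norm.fit performs a maximum-likelihood Gaussian fit; the data are simulated):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(5.0, 2.0, size=1_000)   # hypothetical sample

mu_hat, sigma_hat = norm.fit(data)   # maximum-likelihood Gaussian fit
print(mu_hat, np.mean(data))     # fitted mean equals the sample mean
print(sigma_hat, np.std(data))   # fitted sigma equals the sample std (ddof=0)
```

Note that the maximum-likelihood estimate of σ uses n rather than n − 1 in the denominator, which is why it matches np.std with its default ddof=0 rather than the n − 1 formula given earlier.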
- An advantage of the Gaussian distribution is that it is simple and may be a reasonable
approximation for many types of data. But what if the data are not Gaussian? If there is a suitable
parametric probability distribution for the data (e.g. the Poisson distribution), we could choose to
use it. Alternatively, we can adopt nonparametric techniques that take a more flexible approach,
allowing the data themselves to determine the form of the probability distribution. Such
techniques include histograms, bootstrapping, and kernel density estimation, and are covered
later in this lecture.
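- As a preview, here is a minimal sketch of two of these techniques (assuming NumPy and SciPy; the skewed data are simulated):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
data = rng.exponential(scale=1.0, size=500)   # skewed, non-Gaussian data

# Histogram as a density estimate: density=True normalizes the bars
# so that they integrate to one.
counts, edges = np.histogram(data, bins=30, density=True)

# Kernel density estimate: a smooth, data-driven density.
kde = gaussian_kde(data)
grid = np.linspace(0.0, data.max(), 200)
density = kde(grid)
```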
3. Error bars
- When measuring some quantity, we may find that the measurement is different each time it is
performed. We attribute this variability to noise, i.e. any factor that contributes to variability in
the measurement.
- Statistically speaking, the measurements we make constitute a sample from the population, i.e.
the underlying probability distribution that describes the measurement process. The problem is
that we are interested in characteristics of the population but all we can observe is our finite
sample from the population.
- A statistic (e.g. mean) computed on a random sample is subject to variability and is not the
same as the statistic computed on the whole population (technically known as the parameter).
Thus, we need to distrust, to some degree, the statistic computed on the sample. To indicate
uncertainty on the statistic, it is useful to plot error bars indicating the standard error.
- To understand standard error, let's consider a simple example. Suppose we randomly draw n points from a Gaussian distribution with standard deviation σ and compute the mean of these points. Then suppose we repeat this process many more times. The distribution of the resulting means will have a standard deviation equal to σ/√n. This is the standard error, i.e. the standard
deviation of the sampling distribution of the statistic. Thus, given a single sample of n data
points, the mean of the sample may be offset from the true population mean, and the standard
error indicates about how far away the true population mean may be. (Note that when computing
standard error on actual data, the standard deviation of the population is unknown, so we use the
standard deviation of the sample as an estimate.)
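- The example above is easy to simulate (a sketch assuming NumPy; σ, n, and the number of repetitions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n = 2.0, 25

# Draw n points and take the mean, repeated 10,000 times.
means = rng.normal(0.0, sigma, size=(10_000, n)).mean(axis=1)

print(means.std())          # empirical SD of the means...
print(sigma / np.sqrt(n))   # ...matches sigma / sqrt(n) = 0.4

# With real data we only have one sample, so we plug in its std.
sample = rng.normal(0.0, sigma, size=n)
se_hat = np.std(sample, ddof=1) / np.sqrt(n)
```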
- Confidence intervals are intimately related to standard error. Assuming that the sampling
distribution is Gaussian, ±1 standard error gives the 68% confidence interval and ±2
standard errors gives the 95% confidence interval. Technically, the interpretation of confidence
intervals is that with repeated experiments, we can expect that X% of the time, the true
population parameter will be contained within the X% confidence interval. More loosely, we can
use confidence intervals as indicators of our uncertainty in our estimates.
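- A minimal sketch of computing confidence intervals from a single sample (assuming NumPy; the data are simulated):

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.normal(10.0, 3.0, size=50)   # hypothetical sample

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))

ci68 = (mean - se, mean + se)            # ~68% confidence interval
ci95 = (mean - 2 * se, mean + 2 * se)    # ~95% confidence interval
print(ci68, ci95)
```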
- Test-retest refers to the idea of collecting a set of data, performing some analyses on those data,
and then repeating the whole process on a fresh set of data. Variation between the first set of
results and the second set of results tells us something about the reliability (or replicability or
reproducibility) of the results. We can construe test-retest as a simple procedure for estimating error bars: each set of results is one draw from the distribution of possible results, and test-retest gives us two such draws, as sketched below.
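- A tiny simulation of the test-retest idea (assuming NumPy; all parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, n = 2.0, 25

# "Test" and "retest": the same analysis on two fresh samples.
test = rng.normal(0.0, sigma, size=n).mean()
retest = rng.normal(0.0, sigma, size=n).mean()

# The two results are two draws from the sampling distribution of the
# mean; the difference of two independent draws has SD sqrt(2) * SE,
# so their disagreement is on the order of the standard error.
print(abs(test - retest), sigma / np.sqrt(n))
```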