FIN 640 - Lecture Notes 4 - Sampling and Estimation
FALL 2020
1. Sampling
2. The Central Limit Theorem
3. Point and Interval Estimates
4. Sampling Bias
1. Sampling
Simple Random Sampling
When an analyst chooses to sample, he must formulate a sampling
plan. A sampling plan is the set of rules used to select a sample. The
basic type of sample from which we can draw statistically sound
conclusions about a population is the simple random sample
(random sample, for short).
A simple random sample is a subset of a larger population created
in such a way that each element of the population has an equal
probability of being selected to the subset.
The procedure of drawing a sample to satisfy the definition of a
simple random sample is called simple random sampling.
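As a minimal sketch (not part of the lecture), the snippet below draws a simple random sample from a hypothetical population of simulated monthly returns; the population values and seed are assumptions chosen only for illustration.

    import numpy as np

    # Hypothetical population: 500 simulated monthly returns (values are assumptions)
    rng = np.random.default_rng(seed=42)
    population = rng.normal(loc=0.01, scale=0.05, size=500)

    # Simple random sample: each element has an equal probability of selection
    sample = rng.choice(population, size=30, replace=False)

    print("Population mean:", round(population.mean(), 4))
    print("Sample mean:    ", round(sample.mean(), 4))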
Systematic sampling
With systematic sampling, we select every kth member until we
have a sample of the desired size. The sample that results from this
procedure should be approximately random. Real sampling
situations may require that we take an approximately random
sample.
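A hedged sketch of systematic sampling on the same kind of hypothetical simulated data: pick a random starting point within the first interval, then take every kth member.

    import numpy as np

    rng = np.random.default_rng(seed=0)
    population = rng.normal(loc=0.01, scale=0.05, size=500)   # hypothetical returns

    n = 50                        # desired sample size
    k = len(population) // n      # sampling interval: take every kth member
    start = rng.integers(k)       # random starting point within the first interval
    systematic_sample = population[start::k][:n]

    print("Sample size:", len(systematic_sample),
          "Sample mean:", round(systematic_sample.mean(), 4))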
Sampling Error
Sampling error is the difference between the observed value of a
statistic and the quantity it is intended to estimate.
Sampling distribution
The sampling distribution of a statistic is the distribution of all the
distinct possible values that the statistic can assume when
computed from samples of the same size randomly drawn from the
same population.
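To make the definition concrete, the following illustrative snippet (a made-up five-element population, not from the lecture) enumerates every possible sample of size 2 and lists the resulting sample means; that list of values is the sampling distribution of the mean for this population and sample size.

    from itertools import combinations
    import numpy as np

    # Tiny made-up population so every possible sample can be listed
    population = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

    # All distinct samples of size 2 drawn without replacement, and their means
    sample_means = [np.mean(s) for s in combinations(population, 2)]

    # The sampling distribution of the mean is the distribution of these values
    print(sorted(sample_means))
    print("Mean of the sample means:", np.mean(sample_means))   # equals the population mean, 6.0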
2. The Central Limit Theorem
The central limit theorem states that, for a population with mean μ
and finite variance σ², the sampling distribution of the sample mean
X̄ computed from samples of size n becomes approximately normal,
with mean μ and variance σ²/n, as the sample size n becomes large.
Let’s recall the example we used in estimating mean stock returns
for IBM.
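A simple simulation sketch, illustrative only and not the IBM example from class, that is consistent with the theorem: sample means drawn from a decidedly non-normal population have mean close to μ and standard deviation close to σ/√n.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Strongly skewed (non-normal) population: exponential with mean 1 and variance 1
    n, n_trials = 50, 10_000
    sample_means = rng.exponential(scale=1.0, size=(n_trials, n)).mean(axis=1)

    print("Mean of sample means:", round(sample_means.mean(), 4))        # close to mu = 1
    print("Std of sample means: ", round(sample_means.std(ddof=1), 4))   # close to sigma/sqrt(n)
    print("sigma / sqrt(n):     ", round(1.0 / np.sqrt(n), 4))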
Standard Error of the Sample Mean
For a sample mean X̄ calculated from a sample generated by a
population with standard deviation σ, the standard error of the
sample mean is given by one of two expressions.
When we know σ, the population standard deviation:
σ_X̄ = σ / √n
When we do not know σ, we use the sample standard deviation s:
s_X̄ = s / √n
We will soon see how we can use the sample mean and its
standard error to make probability statements about the
population mean by using the technique of confidence intervals.
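A short sketch of the two expressions, using a hypothetical return sample and an assumed value of σ for the known-σ case:

    import numpy as np

    sample = np.array([0.02, -0.01, 0.03, 0.00, 0.015, -0.005, 0.01, 0.02])  # hypothetical returns
    n = len(sample)

    # Case 1: population standard deviation sigma is known (value assumed for illustration)
    sigma = 0.02
    se_known = sigma / np.sqrt(n)

    # Case 2: sigma is unknown, so we use the sample standard deviation s (divisor n - 1)
    s = sample.std(ddof=1)
    se_unknown = s / np.sqrt(n)

    print("Standard error (known sigma): ", round(se_known, 4))
    print("Standard error (estimated s): ", round(float(se_unknown), 4))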
3. Point and Interval Estimates
We care most about the population mean, so we use estimators
calculated from the sample to estimate the population mean.
The formulas that we use to compute the sample mean and all the
other sample statistics are examples of estimation formulas or
estimators.
The particular value that we calculate from sample observations
using an estimator is called an estimate.
Point Estimate
To take the example of the mean, the calculated value of the
sample mean in a given sample, used as an estimate of the
population mean, is called a point estimate of the population
mean.
In many applications, we have a choice among a number of
possible estimators for estimating a given parameter. How do we
make our choice? We often select estimators because they have
one or more desirable statistical properties. Following is a brief
description of three desirable properties of estimators:
unbiasedness (lack of bias), efficiency, and consistency.
Unbiasedness
An unbiased estimator is one whose expected value (the mean of its
sampling distribution) equals the parameter it is intended to estimate.
For example, the expected value of the sample mean, X̄, equals μ, the
population mean, so we say that the sample mean is an unbiased estimator
(of the population mean).
The sample variance, s², which is calculated using a divisor of n − 1 (equation
3), is an unbiased estimator of the population variance, σ². If we were to
calculate the sample variance using a divisor of n, the estimator would be
biased: its expected value would be smaller than the population variance. We
would say that the sample variance calculated with a divisor of n is a biased
estimator of the population variance.
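A quick simulation sketch consistent with this claim (the population parameters and seed are arbitrary assumptions): averaging the two estimators over many samples shows the n − 1 divisor centering on σ² while the n divisor falls short.

    import numpy as np

    rng = np.random.default_rng(seed=2)
    mu, sigma, n, n_trials = 0.0, 1.0, 5, 100_000

    samples = rng.normal(mu, sigma, size=(n_trials, n))

    var_unbiased = samples.var(axis=1, ddof=1)   # divisor n - 1
    var_biased = samples.var(axis=1, ddof=0)     # divisor n

    print("Average s^2 with divisor n - 1:", round(var_unbiased.mean(), 3))  # close to sigma^2 = 1.0
    print("Average s^2 with divisor n:    ", round(var_biased.mean(), 3))    # close to (n-1)/n = 0.8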
Unbiasedness
Sample mean and sample variance are both unbiased estimators of
the population mean and variance.
Consistency
A consistent estimator is one for which the probability of estimates
close to the value of the population parameter increases as sample
size increases.
Consistency
Law of Large Numbers (LLN)
The weak law of large numbers (also called Khinchin's law) states
that the sample average converges in probability towards the
expected value.
The strong law of large numbers states that the sample average
converges almost surely to the expected value.
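An illustrative sketch of the law of large numbers (the true mean of 1 percent and the other parameters are assumptions): the running average of simulated returns drifts toward the expected value as the number of observations grows.

    import numpy as np

    rng = np.random.default_rng(seed=3)

    # Running average of simulated monthly returns with an assumed true mean of 1 percent
    draws = rng.normal(loc=0.01, scale=0.05, size=100_000)
    running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

    for n in (10, 100, 1_000, 100_000):
        print(f"n = {n:>7}: running average = {running_mean[n - 1]: .5f}")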
Efficiency
An unbiased estimator is efficient if no other unbiased estimator of
the same parameter has a sampling distribution with smaller
variance.
Sample mean X̄ is an efficient estimator of the population mean;
sample variance s² is an efficient estimator of σ².
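The following simulation sketch (illustrative assumptions only) compares the sample mean with the sample median, which is also an unbiased estimator of the mean of a normal population, and shows the mean's smaller sampling variance.

    import numpy as np

    rng = np.random.default_rng(seed=4)
    n, n_trials = 50, 20_000

    samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n))
    means = samples.mean(axis=1)
    medians = np.median(samples, axis=1)

    # Both estimators are unbiased for the mean of a normal population,
    # but the sample mean has the smaller sampling variance (it is more efficient)
    print("Variance of sample means:  ", round(means.var(ddof=1), 4))     # about 1/n = 0.02
    print("Variance of sample medians:", round(medians.var(ddof=1), 4))   # about pi/(2n) ≈ 0.031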
Interval Estimate
Confidence Interval:
A confidence interval is a range for which one can assert with a given
probability 1 − α, called the degree of confidence, that it will contain the
parameter it is intended to estimate. This interval is often referred to as
the 100(1 − α)% confidence interval for the parameter.
The endpoints of a confidence interval are referred to as the lower and
upper confidence limits.
Confidence interval estimate
A 100(1 − α)% confidence interval for a parameter has the
following structure:
Point estimate ± Reliability factor × Standard error
Reliability Factors for Confidence Intervals Based on the Standard Normal
Distribution. We use the following reliability factors when we construct
confidence intervals based on the standard normal distribution:
90 percent confidence intervals: use z_0.05 = 1.65
95 percent confidence intervals: use z_0.025 = 1.96
99 percent confidence intervals: use z_0.005 = 2.58
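These factors can be recovered from the standard normal quantile function; a small sketch using scipy (rounding explains the 1.65 and 2.58 quoted above):

    from scipy.stats import norm

    # z_(alpha/2) reliability factors for standard-normal confidence intervals
    for conf in (0.90, 0.95, 0.99):
        alpha = 1 - conf
        z = norm.ppf(1 - alpha / 2)
        print(f"{conf:.0%} interval: alpha/2 = {alpha / 2:.3f}, z = {z:.3f}")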
Confidence interval for the Population Mean
(with Unknown Pop. Variance)
If we are sampling from a population with unknown variance, then a 100(1 −
α)% confidence interval for the population mean μ is given by
X̄ ± t_{α/2} × (s / √n)
where t_{α/2} is the reliability factor from the t distribution. For a sample of
size n, the t distribution will have n − 1 degrees of freedom, denoted t(n − 1).
Example
Suppose an investment analyst takes a random sample of US equity
mutual funds and calculates the average Sharpe ratio. The sample
size is 100, and the average Sharpe ratio is 0.45. The sample has a
standard deviation of 0.30.
Calculate and interpret the 90 percent confidence interval for the
population mean of all US equity mutual funds.
Recognizing that the population variance of the distribution of
Sharpe ratios is unknown, the analyst decides to calculate the
confidence interval using the theoretically correct t-statistic.
Example (solution)
The standard error of the sample mean is s/√n = 0.30/√100 = 0.03.
With n − 1 = 99 degrees of freedom, the reliability factor for a 90
percent interval is t_0.05 ≈ 1.66, so the interval is
0.45 ± 1.66 × 0.03 = 0.45 ± 0.0498, or roughly 0.40 to 0.50.
We are 90 percent confident that this interval contains the
population mean Sharpe ratio of all US equity mutual funds.
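A sketch that reproduces the calculation with scipy's t distribution (the inputs are taken from the example above):

    import numpy as np
    from scipy.stats import t

    n, xbar, s = 100, 0.45, 0.30              # figures from the example above
    alpha = 0.10                              # 90 percent confidence

    t_crit = t.ppf(1 - alpha / 2, df=n - 1)   # reliability factor t_0.05 with 99 degrees of freedom
    se = s / np.sqrt(n)                       # standard error of the sample mean

    lower, upper = xbar - t_crit * se, xbar + t_crit * se
    print(f"t reliability factor = {t_crit:.3f}")
    print(f"90% confidence interval = [{lower:.4f}, {upper:.4f}]")   # roughly 0.40 to 0.50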
Selection of Sample Size
The width of a confidence interval depends on the standard error of the
estimator, which shrinks in proportion to 1/√n.
Hence, the larger the sample size, the greater the precision with which we can
estimate the population parameter.
This might explain why we sometimes do not observe statistically significant
results: the sample may simply be too small.
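As an illustrative extension (the numbers and the 95 percent level are assumptions, not from the notes), the standard-error formula can be inverted to find the sample size needed for a target interval half-width:

    import numpy as np
    from scipy.stats import norm

    # Sample size needed so that a 95 percent confidence interval for the mean
    # has a half-width of at most E, given a planning value for the standard deviation s
    s, E = 0.30, 0.02
    z = norm.ppf(0.975)                        # 95 percent reliability factor, about 1.96

    n_required = int(np.ceil((z * s / E) ** 2))
    print("Required sample size:", n_required)   # (1.96 * 0.30 / 0.02)^2 ≈ 865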
4. Sampling Bias
Data-Mining Bias
Data-mining is the practice of determining a model by extensive
searching through a dataset for statistically significant patterns
(that is, repeatedly “drilling” in the same data until finding
something that appears to work).
Out-of-sample test
The usual safeguard against data-mining bias is an out-of-sample test:
evaluate the model on data other than the data used to develop it.
If we were to just report the significant variables, without also
reporting the total number of variables that we tested that were
unsuccessful as predictors, we would be presenting a very
misleading picture of our findings (e.g., in asset pricing tests).
Sample Selection Bias
When data availability leads to certain assets being excluded from
the analysis, we call the resulting problem sample selection bias.
Funds or companies that are no longer in business do not appear
in many commonly used databases. So, a study that uses these
types of databases suffers from a type of sample selection bias
known as survivorship bias.
Look-ahead Bias
A test design is subject to look-ahead bias if it uses information that
was not available on the test date.
For example, tests of trading rules that use stock market returns
and accounting balance sheet data must account for look-ahead
bias. In such tests, a company’s book value per share is commonly
used to construct the P/B variable. Although the market price of a
stock is available for all market participants at the same point in
time, fiscal year-end book equity per share might not become
publicly available until sometime in the following quarter.
Time-Period Bias
A test design is subject to time-period bias if it is based on a time
period that may make the results time-period specific.
Q&A