- Module 4-Sampling 2

Module 4 covers Probability Distribution and Sampling Theory, focusing on Poisson and Normal distributions, sampling distributions, and hypothesis testing. It explains key concepts such as population, sample, and various sampling techniques, along with statistical inference methods including estimation and hypothesis testing. The module also discusses the Central Limit Theorem and the use of Student's t-distribution for small sample sizes.


Module 4: Probability Distribution and Sampling Theory

Sr. No.  Topics                                                          Hours

4.1      Probability Distribution: Poisson and Normal distribution       7 hr

4.2      Sampling distribution, Test of Hypothesis, Level of
         Significance, Critical region, One-tailed and two-tailed
         tests, Degrees of freedom.

4.3      Student's t-distribution (small samples): test of significance
         of a mean and of the difference between the means of two
         samples. Chi-Square Test: test of goodness of fit and
         independence of attributes, contingency table.

Self-learning Topics: Tests of significance for large samples, estimation of
the parameters of a population, Yates's correction.
Basic terminology and logic
Population
The populations we wish to study are almost always so large that we are unable to
gather information from every case.
E.g.: if we are interested in the weights of students enrolled in an engineering
college, the weights of all enrolled students form the population.
Population Size
The number of elements in the population is called the population size and is
denoted by N.
Sample
A sample is a part of a population. From the population, we select various
elements on which we collect our data. This part of the population on which we
collect data is called the sample.
Eg: Suppose that we are interested in studying the characteristics of the weights
of the students enrolled in the college of engineering. If we randomly select 50
students among the students of the college of engineering and measure their
weights, then the weights of these 50 students form our sample.
Sample Size
The number of elements in the sample is called the sample size and is denoted by
n.
Sampling and Statistical Inference
There are several types of sampling techniques, some of which are:
1. Simple random sampling
If a sample of size n is selected from a population of size N in such a way that each
element in the population has the same chance of being selected, the sample is called a
simple random sample.
E.g.: suppose we want to know what percent of students at a large university work during
the semester. We could draw a sample of 500 from a list of all students (N = 20,000) by
writing the name of each student on a slip of paper, folding and mixing the slips, and
drawing 500 of them. Since this is a lengthy method, there is another way:
each student has a unique 6-digit ID number ranging from 000001 to 999999. Use a table
of random numbers or a computer program to generate random 6-digit numbers. Each
time a randomly generated 6-digit number matches the ID of a student, that student is
selected for the sample, until 500 students have been chosen.
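As a sketch, the ID-matching procedure above can be run with Python's `random` module; the roster contents and the seed here are illustrative assumptions:

```python
import random

# Hypothetical roster: N = 20,000 students, each holding a unique 6-digit ID.
random.seed(42)  # for reproducibility of this sketch
student_ids = random.sample(range(1, 1000000), k=20000)

# Draw a simple random sample of n = 500 IDs without replacement:
# every student has the same chance of being selected.
sample = random.sample(student_ids, k=500)

print(len(sample), len(set(sample)))  # 500 distinct students
```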
2. Stratified Random Sampling:
In this type of sampling, the elements of the population are classified into several homogenous
groups (strata). From each group, an independent simple random sample is drawn. The sample
resulting from combining these samples is called a stratified random Sample.
E.g.: suppose we want to conduct a sample survey relating to lung cancer among smokers.
Suppose the population of smokers is 1000, comprising 300 pipe smokers, 500 cigarette
smokers and 200 bidi smokers, and we have to select a sample of 250, i.e. 1/4th of the
population. We therefore select 1/4th from each stratum, i.e. 75, 125 and 50 from the
three strata respectively.
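The proportional allocation above can be sketched as follows; the stratum frames are hypothetical ID lists:

```python
import random

random.seed(0)
# Hypothetical sampling frames for the three strata of smokers (IDs only).
strata = {
    "pipe": list(range(300)),
    "cigarette": list(range(300, 800)),
    "bidi": list(range(800, 1000)),
}

fraction = 250 / 1000  # sample 1/4 of the population, proportionally per stratum

# An independent simple random sample is drawn from each stratum.
stratified_sample = {
    name: random.sample(frame, k=int(len(frame) * fraction))
    for name, frame in strata.items()
}

# Allocation matches the slide: 75 pipe, 125 cigarette, 50 bidi smokers.
print({name: len(s) for name, s in stratified_sample.items()})
```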
Sampling Distribution
Sampling distribution is the link between sample and population. The probability
distribution of a statistic is called the sampling distribution of that statistic. The
sampling distribution of the statistic is used to make statistical inference about the
unknown parameter.
How?
• Statistics are used to estimate parameters.
• Statistics are mathematical characteristics of samples.
• Parameters are mathematical characteristics of populations.
Distinctions Between Parameters and Statistics

             Parameters       Statistics

Source       Population       Sample

Notation     Greek (e.g. μ)   Roman (e.g. x̄)

Vary         No               Yes

Calculated   No               Yes
Sampling Distribution
A sampling distribution acts as a frame of reference for
statistical decision making.
 It is a theoretical probability distribution of the possible values of some sample
statistic that would occur if we were to draw all possible samples of a fixed size
from a given population.
 The sampling distribution allows us to determine whether, given the variability
among all possible sample means, the one we observed is a common outcome or a
rare outcome.
 Imagine that each one of you asks a random sample of 10 people
in this class what their height is.
 You each calculate the average height of your sample to get the
sample mean.
 When you report back, would you expect all of your sample
means to be the same?
 How much would you expect them to differ?
Sampling Distribution of the Mean
• Random samples rarely exactly represent the underlying population. We rely
on sampling distributions to give us a better idea whether the sample we’ve
observed represents a common or rare outcome.
• Sampling distribution of the mean: the probability distribution of the means of ALL
possible random samples OF A GIVEN SIZE from some population. It describes the
behavior of the sample mean.
• ALL possible samples is a lot!
• Example: the number of possible samples of size 5 from a class of 90 is
C(90, 5) = 43,949,268.
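The count quoted above is the binomial coefficient C(90, 5), which can be verified directly:

```python
import math

# Number of distinct samples of size 5 drawn from a class of 90 students:
print(math.comb(90, 5))  # 43949268
```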
Sampling Distribution of the Mean
The value of the sample mean x̄ varies from one random sample to
another.
- The value of x̄ is random and it depends on the random
sample.
- The sample mean x̄ is a random variable.
- The probability distribution of x̄ is called the sampling
distribution of the sample mean x̄.
- Questions:
 What is the sampling distribution of the sample mean x̄?
 What is the mean of the sample mean x̄?
 What is the variance of the sample mean x̄?
Sampling Distribution of the Mean
• Example: a population of 10 people holding $0–$9.

[Figures: the sampling distribution of the mean for n = 1 and for n = 2.]

Sampling Distribution of the Mean
Properties of sampling distribution
 It has a mean (μx̄) equal to the population mean (μ).

 It has a standard deviation (σx̄) equal to the population standard
deviation (σ) divided by the square root of n. This is also called the
standard error of the mean; it measures the variability in the sampling
distribution (roughly, the average amount by which the sample means
deviate from the mean of the sampling distribution).

 It has a normal distribution.
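A minimal simulation can check the first two properties, assuming a uniform population of the values 0 to 9 (like the $0–$9 example) and sampling with replacement:

```python
import random
import statistics

random.seed(1)

# Assumed population: the ten values 0..9, with known standard deviation sigma.
population = list(range(10))
sigma = statistics.pstdev(population)  # population SD (divisor N)
n = 25

# Draw many samples of size n (with replacement) and record each sample mean.
means = [statistics.mean(random.choices(population, k=n)) for _ in range(20000)]

# The spread of the sample means should be close to sigma / sqrt(n).
observed_se = statistics.stdev(means)
predicted_se = sigma / n ** 0.5
print(round(observed_se, 3), round(predicted_se, 3))
```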


Sampling Distribution of the Mean
Draw observations at random from any population with finite mean μ. As the number
of observations drawn increases, the sample mean x̄ of the observed values gets closer
and closer to the mean μ of the population.
Central limit theorem
 The importance of the central limit theorem is that it removes the constraint
of normality on the population.
– It applies to large samples (n ≥ 30).

 If the sample is small (n < 30)

– we must have information on the normality of the population before we
can assume the sampling distribution is normal.
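The theorem can be illustrated by simulation; the right-skewed population below is an illustrative assumption, not from the slides:

```python
import random
import statistics

random.seed(2)

# A clearly non-normal (right-skewed) population: exponential-like values.
population = [random.expovariate(1.0) for _ in range(100000)]

mu = statistics.mean(population)
sigma = statistics.pstdev(population)
n = 30  # "large sample" threshold from the slide

# Distribution of the sample mean for samples of size n.
means = [statistics.mean(random.choices(population, k=n)) for _ in range(5000)]

# If x-bar is approximately normal, about 95% of sample means fall
# within 1.96 standard errors of mu.
se = sigma / n ** 0.5
inside = sum(abs(m - mu) <= 1.96 * se for m in means) / len(means)
print(round(inside, 3))  # close to 0.95
```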
Some results on the sampling distribution of x̄

Result 1: If x̄ is the mean of a random sample of size n drawn from a normal
population with mean μ and known variance σ², then

x̄ ~ N(μ, σ²/n), i.e. Z = (x̄ − μ)/(σ/√n) ~ N(0, 1).

We use this result when sampling from a normal distribution with known variance
σ².

Result 2 (Central Limit Theorem): For a large sample size n,

x̄ ≈ N(μ, σ²/n), i.e. Z = (x̄ − μ)/(σ/√n) ≈ N(0, 1).

Note: "≈" means "approximately distributed".

We use this result when sampling from a non-normal distribution with known
variance and with a large sample size.
Statistical Inferences:
(Estimation and Hypotheses Testing)
It is the procedure by which we reach a conclusion about a population on the
basis of the information contained in a sample drawn from that population.
There are two main purposes of statistics;
• Descriptive Statistics: Organization & summarization of the data
• Statistical Inference: Answering research questions about some unknown
population parameters.
(1) Estimation: Approximating (or estimating) the actual values of the
unknown parameters:
- Point Estimate: A point estimate is a single value used to estimate the
corresponding population parameter.
- Interval Estimate (or Confidence Interval): An interval estimate
consists of two numerical values defining a range of values that most
likely includes the parameter being estimated with a specified degree of
confidence.
(2) Hypothesis Testing: Answering research questions about the unknown
parameters of the population (confirming or denying some conjectures or
statements about the unknown parameters).
Hypothesis Testing
Hypothesis
A statement or assumption about a parameter is called a hypothesis.
For e.g.: the average lifetime of a table lamp is 1500 hrs.

Statistical Hypothesis
An assumption or statement about population parameters in numerical form
is called a statistical hypothesis.
For e.g.: the height of Indian soldiers is 6 feet.

Types of Statistical Hypothesis

---Null Hypothesis
An assumption which is to be tested for possible rejection is called the null
hypothesis; it is denoted by Ho.
---Alternative Hypothesis
An assumption which is opposite to the null hypothesis is called the
alternative hypothesis; it is denoted by H1.
Hypothesis Testing
The null hypothesis always involves equality, i.e.
Ho: μ = μo
Ho: μ ≥ μo
Ho: μ ≤ μo
The alternative hypothesis always involves strict inequality, i.e.
H1: μ ≠ μo
H1: μ < μo
H1: μ > μo
A procedure for deciding whether to accept or to reject a null hypothesis (or
to reject or accept alternative hypothesis) is called Test of Hypothesis.
Type of Errors
 Type-I error: rejecting Ho when Ho is true.
 Type-II error: accepting Ho when Ho is false.
Level of Significance
 The probability of a Type-I error is called the level of significance. It is
denoted by α:
α = P(Type-I error) = P(rejecting Ho when Ho is true)
• It is usually expressed in %. The usual LoS values are α = 5% and α = 1%.
• α = 5% means the probability of rejecting a true hypothesis is 0.05, i.e.
only 5 chances out of 100 that we reject the hypothesis when it is true.
 Note:
• When a hypothesis is rejected it does not mean that the hypothesis is
disproved. It only means that the sample value does not support the
hypothesis. Same is true when we accept hypothesis.
 Confidence limits
 The limits within which the hypothesis value should lie with a specified
probability are called confidence limits.
 Generally confidence limits are set up at 10%, 5% or 1% LoS, i.e. 90%,
95% or 99% confidence.
 If the sample value lies between the confidence limits, the hypothesis is
accepted; otherwise it is rejected.
Confidence Interval
• Drawing a sample gives us a mean x̄ that is our best guess at μ (for most
samples, x̄ will be close to μ).
• x̄ is a `point' estimate for the mean of the population.
• However, we can also give a range or interval estimate that takes into account the
uncertainty involved in that point estimate.
• The confidence-interval equation is: Limits = x̄ ± z·σx̄
• where x̄ is the sample mean, z is the value from the normal curve, and
σx̄ = σ/√n is the standard error of the mean.
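As a sketch of the limits formula, with hypothetical figures (x̄ = 103.75, known σ = 7.5, n = 16, 95% confidence, z = 1.96):

```python
import math

# Hypothetical figures: sample mean 103.75, known sigma = 7.5, n = 16.
x_bar, sigma, n = 103.75, 7.5, 16
z = 1.96                   # normal-curve value for 95% confidence

se = sigma / math.sqrt(n)  # standard error of the mean: 7.5 / 4 = 1.875
lower, upper = x_bar - z * se, x_bar + z * se
print(round(lower, 3), round(upper, 3))  # 95% limits: x_bar +/- z*SE
```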
Critical region
 Critical or Rejection Region:
• A region which leads us to reject Ho is called the critical or rejection region.
 Acceptance Region:
• A region which leads us to accept Ho is called the acceptance region.
 Critical Value:
• A value which separates the rejection region from the acceptance region is
called the critical value.
Type of Test
 Two type of Test

 Two Tailed Test:


When the rejection region is on both sides of the sampling distribution, the
test is called a two-tailed test.
Ho: μ = μo against H1: μ ≠ μo.
Type of Test
 Two type of Test

 One-Tailed Test:


When the rejection region is on one side of the sampling distribution, the
test is called a one-tailed test.
Ho: μ ≥ μo against H1: μ < μo (left-tailed), or
Ho: μ ≤ μo against H1: μ > μo (right-tailed).
When to apply one tailed or two tailed test
Suppose that there are two populations of bulbs, one manufactured by the
standard process (with mean μ1) and another manufactured by a new process
(with mean μ2).
(a) If we want to test whether the bulbs differ significantly: two-tailed test
(H1: μ1 ≠ μ2).
(b) If we want to test whether the bulbs produced by the new process have a
higher average life than those of the standard process: one-tailed test
(H1: μ2 > μ1).
(c) If we want to test whether the bulbs of the new process are inferior to
those of the standard process: one-tailed test (H1: μ2 < μ1).
Small Sample Test
Student t-Distribution
 The t-distribution is used when the variance of the population is unknown and
the sample size is small; the population is assumed to be approximately normal.
 The Student-t distribution has longer tails than the normal: the sample mean
computed with an unknown population variance is more inclined to take an
extreme value. If you use the normal distribution for hypothesis testing instead
of the t-distribution, the probability of error becomes bigger.
 Formula:
Suppose we have a simple random sample of size n drawn from a normal population
with mean μ and standard deviation σ. Let x̄ denote the sample mean and s the
sample standard deviation. Then the quantity

t = (x̄ − μ) / (s/√n),  where s = √( Σ(xᵢ − x̄)² / (n − 1) ),

or t = (x̄ − μ) / (s/√n) directly, where s is given,

• has a t-distribution with n − 1 degrees of freedom.


• Note that there is a different t-distribution for each sample size.
Properties of Student t-Distribution
 The graph for the Student’s t-distribution is similar to the standard normal curve.
 The mean for the Student’s t-distribution is zero and the distribution is symmetric
about zero. The variance is greater than one, but approaches one from above as the
sample size increases.
 The Student’s t-distribution has more probability in its tails than the standard
normal distribution because the spread of the t-distribution is greater than the
spread of the standard normal. So the graph of the Student’s t-distribution will be
thicker in the tails and shorter in the center than the graph of the standard normal
distribution.
 The exact shape of the Student’s t-distribution depends on the degrees of freedom.
As the degrees of freedom increase, the graph of the Student’s t-distribution becomes
more like the graph of the standard normal distribution. For n > 30, the differences
are negligible.
T-Distribution
Examples: t-Distribution
The procedure for hypothesis testing about the mean (μ)
Let μo be a given known value and suppose the variance is unknown.
Test procedure: compute t = (x̄ − μo)/(s/√n) and compare it with the tabulated
t-value at n − 1 degrees of freedom and the chosen level of significance.
Test the significance of mean
Qu.1 A soap manufacturing company was distributing a particular brand of soap
through a large number of retail shops. Before a heavy advertisement campaign,
the mean sale per week per shop was 140 dozens. After the campaign, a
sample of 26 shops was taken and the mean sales were found to be 147 dozens
with standard deviation 16. Can you consider the advertisement effective?
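A sketch of the computation for Qu.1, taking s as the sample standard deviation and using the tabulated one-tailed value t(0.05, 25) = 1.708:

```python
import math

# Qu.1 figures: n = 26 shops, sample mean 147, claimed mean 140, sample SD s = 16.
n, x_bar, mu0, s = 26, 147.0, 140.0, 16.0

# t = (x_bar - mu0) / (s / sqrt(n)), with df = n - 1 = 25.
t = (x_bar - mu0) / (s / math.sqrt(n))
print(round(t, 3))  # about 2.231

# One-tailed test (H1: mean sales increased); tabulated t(0.05, 25) = 1.708.
print(t > 1.708)  # True -> reject Ho: the advertisement appears effective
```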
Test the significance of mean
Qu.2 A random sample of size 16 from normal population showed a mean of 103.75
cm. and sum of squares of deviations from mean 843.75 square cm. Can we say that
population has a mean of 108.75 cm?
Test the significance of mean
Qu.3 Problem 27.11
Qu.4 Problem 27.12
Test of Difference between the means of two samples.

Hypotheses: We choose one of the following
situations:
(i) Ho: μ1 = μ2 against H1: μ1 ≠ μ2
(ii) Ho: μ1 ≥ μ2 against H1: μ1 < μ2
(iii) Ho: μ1 ≤ μ2 against H1: μ1 > μ2 or equivalently,
(i) Ho: μ1-μ2 = 0 against H1: μ1 - μ2 ≠ 0
(ii) Ho: μ1-μ2 ≥ 0 against H1: μ1 - μ2 < 0
(iii) Ho: μ1-μ2 ≤ 0 against H1: μ1 - μ2 > 0
Test of Difference between the means of two samples.

Samples of two types of electric light bulbs were tested for length of life and
the following data were obtained:

                  Type I         Type II

Sample size       n1 = 8         n2 = 8

Sample mean       x̄1 = 1234      x̄2 = 1036

Std deviation     s1 = 36        s2 = 40
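A sketch of the two-sample computation on the bulb data, assuming the quoted SDs use divisor n (the usual textbook convention for this problem), so the pooled variance is S² = (n1·s1² + n2·s2²)/(n1 + n2 − 2):

```python
import math

# Bulb data from the slide: n1 = n2 = 8, means 1234 and 1036, SDs 36 and 40.
n1, x1, s1 = 8, 1234.0, 36.0
n2, x2, s2 = 8, 1036.0, 40.0

# Pooled variance, assuming the given SDs were computed with divisor n.
S2 = (n1 * s1**2 + n2 * s2**2) / (n1 + n2 - 2)

# t = (x1 - x2) / sqrt(S^2 * (1/n1 + 1/n2)), df = n1 + n2 - 2 = 14.
t = (x1 - x2) / math.sqrt(S2 * (1 / n1 + 1 / n2))
print(round(t, 3))  # about 9.73

# Two-tailed tabulated value t(0.025, 14) = 2.145.
print(abs(t) > 2.145)  # True -> the two bulb types differ significantly
```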
Paired t-test for difference of mean.
• If n1=n2=n (sample size are same)
• And Two samples are not independent (or correlated).
• Samples observations are paired together.
Examples of related populations are:
1. Height of the father and height of his son.
2. Mark of the student in MATH and his mark in STAT.
3. Pulse rate of the patient before and after the medical treatment.
4. Hemoglobin level of the patient before and after the medical treatment.

Example: (effectiveness of a diet program) Suppose that we are interested in studying


the effectiveness of a certain diet program. Let the random variables X and Y are as
follows:
X = the weight of the individual before the diet program
Y= the weight of the same individual after the diet program
Populations:
1-st population (X): weights before a diet program mean = μ1
2-nd population (Y): weights after the diet program mean = μ2
Hypotheses:
Ho: the diet program has no effect on weight
H1: the diet program has an effect on weight Equivalently,
Ho: μ1 = μ2 H1: μ1 ≠ μ2
Equivalently, Ho: μ1 - μ2 = 0 H1: μ1 - μ2 ≠ 0
Equivalently, Ho: μD = 0 against H1: μD ≠ 0, where μD = μ1 - μ2.

Test Statistic: t = D̄ / (sD / √n), with n − 1 degrees of freedom,

where D̄ = ΣDᵢ/n, sD is the standard deviation of the differences, and

Di = Xi - Yi (i=1, 2, …, n)
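The paired statistic can be sketched on hypothetical before/after weights (the slide's data table did not survive extraction; these ten pairs are for illustration only):

```python
import math
import statistics

# Hypothetical before/after weights for the diet example (illustration only;
# not the slide's actual data).
before = [86, 92, 79, 101, 88, 95, 83, 90, 97, 84]
after  = [82, 90, 80,  96, 85, 94, 80, 87, 93, 83]

d = [x - y for x, y in zip(before, after)]  # D_i = X_i - Y_i
d_bar = statistics.mean(d)
s_d = statistics.stdev(d)                   # sample SD of the differences (divisor n-1)
n = len(d)

# t = D_bar / (s_D / sqrt(n)), with df = n - 1 = 9.
t = d_bar / (s_d / math.sqrt(n))
print(round(t, 3))
```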
Do these data provide sufficient evidence to allow us to conclude that the diet
program is effective? Use α = 0.05 and assume that the populations are normal.

Solution:
μ1 = the mean of weights before the diet program
μ2 = the mean of weights after the diet program Hypotheses:
Ho: μ1 = μ2 (Ho: the diet program is not effective)
H1: μ1 ≠ μ2 (HA: the diet program is effective)
Degrees of freedom: df= ν= n-1 = 10-1=9
Significance level: α=0.05
Rejection Region of Ho: Critical values: t 0.025 = - 2.262 and
2.262 Critical Region: t < - 2.262 or t > 2.262

Decision:
Since t= 2.43 ∈R.R., i.e., t=2.43 > 2.262,
we reject: Ho: μ1 = μ2 (the diet program is not effective) and
we accept: H1: μ1 ≠ μ2 (the diet program is effective)
Consequently, we conclude that the diet program is effective at
α=0.05.
Chi-Square (χ²) Test
• Non-parametric test.
• A measure of the difference between the observed and expected
frequencies of the outcomes of a set of events or variables.
• χ² depends on the size of the difference between expected and observed
values, the degrees of freedom, and the sample size.
• Can be used to test whether two variables are related or independent of
one another, or to test the goodness of fit between an observed distribution
and a theoretical distribution of frequencies.
• Can be applied only to categorical data, e.g. data containing
groups/categories of gender, marital status, inoculated status, age group, etc.
• Data are to be presented in tabular form.
Hypothesis Testing Using Chi-Square (χ²)
1. Set up Null Hypothesis (No significant difference between the observed
and expected values/No association between the mentioned attributes)
and Alternate Hypothesis.
2. Identify the degrees of freedom: n − 1 for a goodness-of-fit test, or
(r − 1)(c − 1) for a contingency table, where r = no. of rows,
c = no. of columns.
3. Test statistic: χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
• where Oᵢ = observed value(s) and Eᵢ = expected value(s).
4. Determine the critical value of χ² from the table.
5. Compare the calculated and tabulated results.
6. Make the decision: Ho is rejected or not rejected according as the
calculated test-statistic value falls in the rejection region or the
acceptance region respectively.
Problems-1Chi square test for goodness of fit
• A die is thrown 132 times with the following results:
Number turned up: 1 2 3 4 5 6
Frequency: 16 20 25 14 29 28

Test the hypothesis that the die is unbiased.
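A sketch of the goodness-of-fit computation for this problem, using the tabulated 5% value χ²(0.05, 5) = 11.07:

```python
# Problem 1: 132 throws, observed frequencies for faces 1..6.
observed = [16, 20, 25, 14, 29, 28]
expected = [sum(observed) / 6] * 6  # 22 per face if the die is unbiased

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)  # 9.0

# df = 6 - 1 = 5; tabulated chi-square value at 5% is 11.07.
print(chi2 < 11.07)  # True -> accept Ho: the die may be regarded as unbiased
```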


Problems-2 Chi square test for goodness of fit
• The theory predicts that the proportion of beans in the four groups A, B, C
and D should be 9 : 3 : 3 : 1. In an experiment among 1600 beans, the numbers
in the four groups were 882, 313, 287 and 118. Do the experimental results
support the theory?
Problems-3 Chi square test Independence of Attributes
• A certain drug was administered to 456 males out of total 720 in a certain
locality to test its efficacy against typhoid. The incidence of typhoid is
shown below. Find out the effectiveness of the drug against the disease.

                               Infection   No Infection   Total

Administered the drug             144           312         456

Without the drug                  192            72         264

Total                             336           384         720
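A sketch of the independence test on this table, using expected counts E = (row total × column total)/grand total and the tabulated 5% value χ²(0.05, 1) = 3.841:

```python
# Problem 3 contingency table: rows = drug / no drug, cols = infection / none.
table = [[144, 312],
         [192, 72]]

row_totals = [sum(r) for r in table]        # 456, 264
col_totals = [sum(c) for c in zip(*table)]  # 336, 384
grand = sum(row_totals)                     # 720

# Expected counts E_ij = (row total * column total) / grand total.
chi2 = 0.0
for i, row in enumerate(table):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand
        chi2 += (o - e) ** 2 / e
print(round(chi2, 2))  # about 113.75

# df = (2-1)*(2-1) = 1; tabulated value at 5% is 3.841.
print(chi2 > 3.841)  # True -> infection is not independent of the drug
```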
Problems-4 Chi square test Independence of Attributes
An app provides ratings to restaurants of three sizes under three rating
categories. Can we conclude that the ratings are related to the size of the
restaurant?

                    Small   Medium   Large   Total
Good:                 20      10       17      47
Okay:                 11       8        8      27
Not Recommended:      10       7        9      26
Total:                41      25       34     100
