BA unit2
Statistical Sampling
BUSINESS ANALYTICS
B.Tech(CSE) IV Year - I Semester
Open Elective - III
01 SAMPLING DISTRIBUTION
02 SAMPLING METHODS
04 SAMPLING ERROR
05 HYPOTHESIS TESTING
8. ANOVA
Sampling Methods
• Subjective Sampling Methods
o Judgment sampling: Expert judgment is used to select the sample.
o Convenience sampling: Samples are selected based on ease of access.
• Probabilistic Sampling Methods
o Simple random sampling: Each item in the population has an equal chance of being selected.
o Systematic sampling: Selects every nth item from the population.
o Stratified sampling: The population is divided into natural subsets (strata) based on characteristics (Gender, Age group, etc.).
o Cluster sampling: The population is divided into clusters, and a random sample of clusters is selected.
o Random Time Selection: Choose a random time and then select the next n items.
o Random Time Points: Choose n random times and select the next item.
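The probabilistic methods above can be illustrated with a minimal Python sketch. The function names, the population of item IDs 1 to 100, and the even/odd strata are all hypothetical choices made for illustration; they are not part of the slides.

```python
import random

def simple_random_sample(population, n, rng):
    """Every item has an equal chance of selection."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Pick a random start, then take every `step`-th item."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(population, stratum_of, frac, rng):
    """Sample the same fraction from each stratum (proportional allocation)."""
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * frac))
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every item in them."""
    chosen = rng.sample(clusters, n_clusters)
    return [item for cluster in chosen for item in cluster]

rng = random.Random(42)                    # fixed seed so the run is repeatable
population = list(range(1, 101))           # hypothetical item IDs 1..100

srs = simple_random_sample(population, 10, rng)
sys_s = systematic_sample(population, 10, rng)
strat = stratified_sample(
    population, lambda x: "even" if x % 2 == 0 else "odd", 0.1, rng
)
clusters = [population[i:i + 20] for i in range(0, 100, 20)]  # 5 clusters of 20
clus = cluster_sample(clusters, 2, rng)
```

Note the difference between the last two: stratified sampling draws a few items from every subset, while cluster sampling keeps all items from a few randomly chosen subsets.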
Point Estimate:
A point estimate is a single value derived from sample data to estimate an unknown population parameter. It is the most commonly used form of estimation in statistics.
It provides a specific numerical value as an approximation of the population parameter.
Advantages: Easy to calculate. Limitations: Lack of precision; a single value gives no indication of how far it may be from the true parameter.
Interval Estimates:
An interval estimate provides a range of values within which a population parameter is expected to lie, based on sample data. It offers
more information than a point estimate by accounting for the variability and uncertainty inherent in the estimation process.
• Confidence Intervals: A confidence interval is a range of values between which the value of the population parameter is believed to be,
along with a probability that the interval correctly estimates the true (unknown) population parameter. This probability is called the
level of confidence, denoted by 1 - α, where α is a number between 0 and 1. The confidence level is usually expressed as a percent;
common values are 90%, 95%, or 99%. (Note that if the level of confidence is 90%, then α = 0.1.) The margin of error depends on the
level of confidence and the sample size.
• Prediction Intervals: A prediction interval provides a range for predicting the value of a new observation from the same population.
This is different from a confidence interval, which provides an interval estimate of a population parameter, such as the mean or
proportion. A confidence interval is associated with the sampling distribution of a statistic, but a prediction interval is associated
with the distribution of the random variable itself.
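The difference between the two intervals can be seen in a short sketch. It assumes a large-sample normal approximation (a z critical value) for simplicity; for small samples a t critical value would normally be used instead. The data and function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def z_critical(confidence):
    """Two-sided critical value for confidence level 1 - alpha."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def confidence_interval(data, confidence=0.95):
    """Interval estimate for the population mean."""
    n = len(data)
    margin = z_critical(confidence) * stdev(data) / math.sqrt(n)
    return mean(data) - margin, mean(data) + margin

def prediction_interval(data, confidence=0.95):
    """Interval for a single new observation from the same population."""
    n = len(data)
    margin = z_critical(confidence) * stdev(data) * math.sqrt(1 + 1 / n)
    return mean(data) - margin, mean(data) + margin

# Hypothetical sample of 40 measurements
sample = [50 + (i % 7) - 3 for i in range(40)]

ci = confidence_interval(sample, 0.95)
pi = prediction_interval(sample, 0.95)
print("95% confidence interval:", ci)
print("95% prediction interval:", pi)
```

The prediction interval is always wider, because it covers the variability of an individual observation and not only the uncertainty in the estimated mean. Raising the confidence level or shrinking the sample size widens both.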
Prof. S.Adinarayana, Dept of CS&SE, College of Engineering,
Andhra University 6
SAMPLING ERROR
Sampling error is the difference between a sample statistic (e.g., sample mean) and the corresponding
population parameter (e.g., population mean) due to the fact that the sample is only a subset of the
population.
Causes of Sampling Error:
• Random Variation: Natural differences between samples due to randomness.
• Sample Size: Smaller samples tend to have larger sampling errors due to less data representing the
population.
• Sampling Method: Non-random sampling methods can introduce bias, increasing sampling error.
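The effect of sample size can be checked with a small simulation. This is only a sketch: the population (10,000 normally distributed values with mean 100 and standard deviation 15) and the sample sizes are hypothetical choices.

```python
import random
from statistics import mean

def mean_abs_sampling_error(population, n, trials, rng):
    """Average |sample mean - population mean| over repeated random samples."""
    mu = mean(population)
    errors = [abs(mean(rng.sample(population, n)) - mu) for _ in range(trials)]
    return mean(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.gauss(100, 15) for _ in range(10_000)]

err_small = mean_abs_sampling_error(population, 10, 200, rng)
err_large = mean_abs_sampling_error(population, 1000, 200, rng)
print("average error, n = 10:  ", err_small)
print("average error, n = 1000:", err_large)
```

Every sample here is drawn at random, so the error shown is due to random variation alone; the larger sample should give a noticeably smaller average error.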
Striking Balance
TYPES OF TESTING:
2. Non-Parametric Test: When some assumptions about the given population are uncertain, we use non-parametric tests.
a. Chi-Square Test: Assumes simple random sampling, a sample size greater than 50, and independent samples.
1. Formulate Hypotheses:
o Null Hypothesis (H0): A statement of no effect or no difference, which we aim to test.
o Alternative Hypothesis (H1 or Ha): A statement that there is an effect or a difference.
5. Make a Decision:
o Compare the p-value to the significance level (α).
o If p-value ≤ α, reject the null hypothesis.
o If p-value > α, do not reject the null hypothesis.
ONE-SAMPLE HYPOTHESIS TESTING
A one-sample hypothesis test is used to determine if a sample comes from a
population with a specific mean (or another parameter). It is useful when you
want to compare the sample mean to a known population mean or a
hypothesized value.
Alternative Hypothesis (H₁): The statement you want to test against the null
hypothesis. It suggests that the population mean is different from the
specified value.
Example: H₁: μ ≠ μ₀ (two-tailed test)
H₁: μ > μ₀ (right-tailed test)
H₁: μ < μ₀ (left-tailed test)
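The three forms of the alternative hypothesis lead to different p-value calculations. The sketch below uses a z-test with the sample standard deviation, which assumes a reasonably large sample; a t-test would be the usual choice for small samples. The data, the function names, and the hypothesized mean of 9.5 are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def one_sample_z_test(data, mu0, alternative="two-sided"):
    """Return (z, p_value) for H0: mu = mu0 against the given alternative."""
    n = len(data)
    z = (mean(data) - mu0) / (stdev(data) / math.sqrt(n))
    cdf = NormalDist().cdf(z)
    if alternative == "two-sided":      # H1: mu != mu0
        p = 2 * min(cdf, 1 - cdf)
    elif alternative == "greater":      # H1: mu > mu0 (right-tailed)
        p = 1 - cdf
    elif alternative == "less":         # H1: mu < mu0 (left-tailed)
        p = cdf
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p

def decide(p_value, alpha=0.05):
    """Decision rule: reject H0 when the p-value is at most alpha."""
    return "reject H0" if p_value <= alpha else "do not reject H0"

data = [9, 10, 11] * 12                 # hypothetical sample, n = 36
z, p = one_sample_z_test(data, 9.5, "two-sided")
print(z, p, decide(p))
```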
TWO-SAMPLE HYPOTHESIS TESTING FOR MEANS
Two-sample hypothesis testing is used to compare the means (or other parameters) of two independent
groups to determine if there is a statistically significant difference between them. This type of test is often
used to compare experimental and control groups, different treatment groups, or any other two
independent samples.
Suppose we want to know whether the mean weights of two different species of turtles are equal.
To test this, we will perform a two-sample t-test at significance level α = 0.05 using the following steps:
Step 5: Conclusion.
We fail to reject the null hypothesis since this p-value is not less than
our significance level α = 0.05.
We do not have sufficient evidence to say that the mean weight of turtles between
these two populations is different.
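A two-sample t-test of this kind can be sketched with the standard library alone. The version below assumes equal population variances (pooled t-test) and obtains the two-sided p-value by numerically integrating the t density with Simpson's rule. The turtle weights are hypothetical values made up for illustration, not the data behind the conclusion above.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value: 1 - 2 * (area under the density from 0 to |t|)."""
    b = abs(t_stat)
    if b == 0:
        return 1.0
    h = b / steps                        # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)

def two_sample_t_test(a, b):
    """Pooled-variance two-sample t-test; returns (t, df, p_value)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t_stat = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t_stat, df, t_two_sided_p(t_stat, df)

# Hypothetical weights for the two species
species_a = [300, 315, 320, 311, 314, 309, 300, 308, 305, 303]
species_b = [335, 329, 322, 321, 324, 319, 304, 308, 305, 311]

t_stat, df, p_value = two_sample_t_test(species_a, species_b)
alpha = 0.05
print(t_stat, df, p_value)
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

Because `math.gamma` overflows for large arguments, this sketch is only suitable for small to moderate sample sizes.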
• ANOVA (Analysis of Variance) is used when you have three or more groups and you want to compare their
means to see if they are significantly different.
• It is a statistical method that is used in a variety of research scenarios. Here are some examples of when you
might use ANOVA:
• A One-Way ANOVA is used to determine how one factor impacts a response variable.
• For example, we might want to know if three different studying techniques lead to different mean exam
scores.
• To see if there is a statistically significant difference in mean exam scores, we can conduct a one-way
ANOVA.
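The one-way ANOVA calculation for the studying-technique example can be sketched as follows. The exam scores are hypothetical. The F statistic is computed in the usual way for any number of groups; the p-value helper uses a closed form that holds only when there are exactly three groups (numerator degrees of freedom equal to 2), which is an assumption of this sketch.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups."""
    k = len(groups)
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, len(all_values) - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

def p_value_three_groups(f_stat, df_within):
    """P(F > f_stat) for an F distribution with 2 and df_within degrees
    of freedom. Valid ONLY when df_between == 2 (exactly three groups)."""
    return (1 + 2 * f_stat / df_within) ** (-df_within / 2)

# Hypothetical exam scores for three studying techniques
technique_1 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
technique_2 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
technique_3 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

f_stat, df_b, df_w = one_way_anova([technique_1, technique_2, technique_3])
p_value = p_value_three_groups(f_stat, df_w)
print(f_stat, df_b, df_w, p_value)
print("reject H0" if p_value <= 0.05 else "fail to reject H0")
```

Here the null hypothesis is that all three group means are equal; a small p-value indicates that at least one mean differs.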
CHI-SQUARE TEST FOR INDEPENDENCE
The Chi-Square Test of Independence is a statistical method used to determine if there is a significant association between two categorical
variables. Essentially, it tests whether the distribution of one variable is independent of the distribution of another variable.
CHI-SQUARE TEST FOR INDEPENDENCE usage
• A Chi-Square Test of Independence is used to determine whether or not there is a significant association
between two categorical variables.
Example
Suppose we want to know whether or not gender is associated with political party preference.
We take a simple random sample of 500 voters and survey them on their political party preference.
The following table shows the results of the survey:
The expected value for a cell is (row total × column total) / grand total. We can repeat this formula to obtain the expected value for each cell in the table:
Next, we will calculate (O-E)² / E for each cell in the table where:
• O: observed value
• E: expected value
For example, Male Republicans would have a value of: (120-115)² / 115 = 0.2174.
According to the Chi-Square Score to P Value Calculator, the p-value associated with
X² = 0.8642 and (2-1)*(3-1) = 2 degrees of freedom is 0.649198.
• We fail to reject the null hypothesis since this p-value is not less than
0.05.
• This means we do not have sufficient evidence to say that there is an
association between gender and political party preference.
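The whole calculation can be sketched in a few lines of Python. The survey table is not reproduced in these notes, so the counts below are assumed for illustration: they were chosen to be consistent with the figures quoted above (500 voters, a 2 × 3 table, 120 observed and 115 expected Male Republicans). The p-value helper relies on the fact that for 2 degrees of freedom the chi-square tail probability is exp(-X²/2); it does not apply to other table sizes.

```python
import math

def chi_square_independence(observed):
    """Return (chi2, df, expected) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        exp_row = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count
            exp_row.append(e)
            chi2 += (o - e) ** 2 / e                         # (O-E)^2 / E
        expected.append(exp_row)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df, expected

def p_value_df2(chi2):
    """P(X^2 > chi2) for exactly 2 degrees of freedom."""
    return math.exp(-chi2 / 2)

# ASSUMED counts (rows: Male, Female; columns: Republican, Democrat, Independent)
observed = [
    [120, 90, 40],
    [110, 95, 45],
]

chi2, df, expected = chi_square_independence(observed)
p_value = p_value_df2(chi2)
print(chi2, df, p_value)
print("reject H0" if p_value <= 0.05 else "fail to reject H0")
```

With these assumed counts the statistic comes out at about 0.864 and the p-value at about 0.649, in line with the values quoted above; the small difference from 0.8642 arises when the cell values are rounded before being summed.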