Module 02 - AIML Statistics
What Is a Random Variable?
A random variable is a variable whose value is unknown, or a function that
assigns values to each of an experiment's outcomes.
Random Variable:
A continuous random variable stands for any amount within a specific range or
set of points and can reflect an infinite number of potential values, such as the
average rainfall in a region.
Why is Statistical Distribution important?
What do you mean by
● Mean
● Median
● Mode
● Variance
● Standard Deviation
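As a quick illustration, here is a minimal Python sketch that computes each of these summary statistics for a small made-up data set using the standard library's statistics module:

import statistics

# Made-up data set, for illustration only
data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5, 8]

print("Mean:", statistics.mean(data))                  # arithmetic average
print("Median:", statistics.median(data))              # middle value of the sorted data
print("Mode:", statistics.mode(data))                  # most frequent value
print("Variance:", statistics.variance(data))          # average squared deviation from the mean (sample)
print("Standard deviation:", statistics.stdev(data))   # square root of the variance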
Poisson Distribution (Discrete Random Variable)
Examples –
● Number of accidents in a week
● Number of deaths in a week
P(x) = (e^(−m) · m^x) / x!,   x = 0, 1, 2, …
● The number of trials n is infinitely large.
● The probability of success p is very small.
● np = m (the mean number of occurrences).
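As a hedged illustration of this formula, the sketch below evaluates the Poisson probability for an assumed mean of m = 2 events per week; the value of m is purely illustrative.

import math

def poisson_pmf(x, m):
    # P(x) = e^(-m) * m^x / x!
    return math.exp(-m) * m**x / math.factorial(x)

m = 2  # assumed mean number of events per week
for x in range(5):
    print(f"P(X = {x}) = {poisson_pmf(x, m):.4f}")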
● Probability Sampling Methods vs Non-Probability Sampling Methods

Probability Sampling Methods:
• Probability sampling is a technique in which samples taken from a larger population are chosen based on probability theory.
• These are also known as random sampling methods.
• These are used for research which is conclusive.
• These involve a long time to get the data.

Non-Probability Sampling Methods:
• Non-probability sampling is a technique in which the researcher chooses samples based on subjective judgment rather than random selection.
• These are also called non-random sampling methods.
• These are used for research which is exploratory.
• These are easy ways to collect the data quickly.
● Convenience Sampling
● Example: A researcher wants to study the eating habits of college students and
decides to survey students in the campus cafeteria during lunch hours.
● Scenario:
• Population: All college students
• Sample: Students in the cafeteria at lunchtime
● Quota Sampling
● Example: A market researcher wants to understand the buying behavior of different
age groups. They decide to interview 20 people from each age group: 18-25, 26-35,
36-45, and 46-55.
● Scenario:
• Population: Consumers
• Sample: 20 people from each specified age group
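To make the contrast concrete, here is an illustrative Python sketch on a hypothetical population: probability (simple random) sampling gives every member an equal chance of selection, while quota sampling only fills a fixed quota of 20 people per age group.

import random

random.seed(0)

# Hypothetical population of (person_id, age_group) pairs
population = [(i, random.choice(["18-25", "26-35", "36-45", "46-55"])) for i in range(1000)]

# Probability sampling: every member has an equal chance of being chosen
random_sample = random.sample(population, 80)

# Quota sampling: non-randomly fill a quota of 20 people per age group (first 20 encountered)
quota = {"18-25": [], "26-35": [], "36-45": [], "46-55": []}
for person_id, group in population:
    if len(quota[group]) < 20:
        quota[group].append(person_id)

print("Random sample size:", len(random_sample))
print("Quota sample sizes:", {group: len(ids) for group, ids in quota.items()})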
Hypothesis Testing
Hypothesis testing is an act in statistics whereby an analyst tests an
assumption regarding a population parameter. The methodology employed by
the analyst depends on the nature of the data used and the reason for the
analysis.
● Hypothesis testing is used to assess the plausibility of a hypothesis by
using sample data.
● Such data may come from a larger population, or from a data-generating
process
Hypothesis Meaning:
An idea that is suggested as a possible explanation for something but has not yet been found to be
true or correct.
HYPOTHESIS TESTING

Null Hypothesis:
▪ State the hypothesized value of the parameter before sampling.
▪ The assumption we wish to test (or the assumption we are trying to reject).
▪ E.g. population mean µ = 20.
▪ There is no difference between Coke and Diet Coke.

Alternative Hypothesis:
▪ All possible alternatives other than the null hypothesis.
▪ E.g. µ ≠ 20, µ > 20, µ < 20.
▪ There is a difference between Coke and Diet Coke.
Normal Distribution
• The most common and the most useful continuous distribution.
• A symmetrical probability distribution where most results are located in the middle and few are spread on both sides.
• It has the shape of a bell.
• Can be entirely described by its mean and standard deviation.
Normal Distribution
Can be found practically everywhere:
• In nature.
• In engineering and industrial processes.
• In social and human sciences.
• Many everyday data sets approximately follow the normal distribution.
Normal Distribution
Normal Curve:
• Helps in calculating the probabilities for normally distributed populations.
• The probabilities are represented by the area under the normal curve.
• The total area under the curve is equal to 100% (or 1.00).
• This represents the population of the observations.
[Figure: normal curve with inflection points; 50% of the area lies on each side of the mean X.]
Normal Distribution
Empirical Rule:
• For any normally distributed data:
• 68% of the data fall within 1 standard deviation of the mean.
• 95% of the data fall within 2 standard deviations of the mean.
• 99.7% of the data fall within 3 standard deviations of the mean.
[Figure: normal curve showing the areas 68.26%, 95.44% and 99.73% within ±1, ±2 and ±3 standard deviations of the mean.]
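The empirical rule can be verified numerically from the standard normal CDF; the sketch below assumes SciPy is available.

from scipy.stats import norm

for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)   # area within k standard deviations of the mean
    print(f"Within ±{k} standard deviation(s): {area:.2%}")
# Prints approximately 68.27%, 95.45% and 99.73%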
Level of Confidence:
● If the test statistic falls within this range, the null hypothesis is taken to be true (not rejected).
● It is denoted by (1 – α).
● Level of Significance + Level of Confidence = 1
Normal Distribution
Standard Normal Distribution: Converting from ‘X’ to ‘Z’:
σ = 0.2
The X values 3.9, 4.1, 4.3, 4.5, 4.7, 4.9 and 5.1 correspond to the Z values −3, −2, −1, 0, 1, 2 and 3 (mean = 4.5).
The specification limits are at 4.0 and 5.0 mm respectively, and the corresponding z values are −2.5 and 2.5.
The X-scale is for the actual values and the Z-scale is for the standardized values.
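The conversion from X to Z uses z = (x − μ) / σ. The minimal Python sketch below reproduces the z values for the specification limits; the mean of 4.5 mm is inferred from the matching X and Z scales above.

# Converting from X to Z: z = (x - mu) / sigma
# mu = 4.5 mm is inferred from the scales above (x = 4.5 maps to z = 0); sigma = 0.2 mm
mu, sigma = 4.5, 0.2

for x in (4.0, 5.0):                        # the specification limits
    z = (x - mu) / sigma
    print(f"x = {x} mm  ->  z = {z:+.1f}")  # prints z = -2.5 and z = +2.5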
What is the use of the Standard Normal Distribution?
To determine the area under the normal distribution curve that is:
• To the right of your data point.
• To the left of the data point.
• Between two data points.
• Outside of two data points.
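Assuming SciPy is available, each of these four areas can be computed from the standard normal CDF; the points z1 and z2 below are arbitrary illustrative values.

from scipy.stats import norm

z1, z2 = -1.0, 1.5                       # assumed data points on the Z-scale

left_of_z1 = norm.cdf(z1)                # area to the left of a point
right_of_z2 = 1 - norm.cdf(z2)           # area to the right of a point
between = norm.cdf(z2) - norm.cdf(z1)    # area between two points
outside = 1 - between                    # area outside two points

print(left_of_z1, right_of_z2, between, outside)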
What is statistical significance?
● Significance is usually denoted by a p-value, or probability value.
● Statistical significance is arbitrary – it depends on the threshold, or alpha value,
chosen by the researcher. The most common threshold is p < 0.05, which means
that the data is likely to occur less than 5% of the time under the null hypothesis.
● When the p-value falls below the chosen alpha value, then we say the result of
the test is statistically significant.
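As an illustration, the sketch below converts an assumed test statistic into a two-tailed p-value and compares it with a chosen alpha (SciPy assumed).

from scipy.stats import norm

z_stat = 2.1   # assumed test statistic, for illustration only
alpha = 0.05   # chosen significance threshold

p_value = 2 * (1 - norm.cdf(abs(z_stat)))   # two-tailed p-value
verdict = "statistically significant" if p_value < alpha else "not statistically significant"
print(f"p = {p_value:.4f} -> {verdict} at alpha = {alpha}")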
Type I Error
A type I error occurs if a true null hypothesis is rejected (a “false positive”)
● The probability of making a Type I error is the significance level, or alpha (α)
● The significance level is usually set at 0.05 or 5%. This means that your results only
have a 5% chance of occurring, or less, if the null hypothesis is actually true.
● If the p value of your test is lower than the significance level, it means your results are
statistically significant and consistent with the alternative hypothesis.
● If your p value is higher than the significance level, then your results are considered
statistically non-significant
Example: Type I and Type II errors
A patient decides to get tested for COVID-19 based on mild symptoms. Two errors could potentially occur:
• Type I error (false positive): the test result says the patient has coronavirus, but they actually do not.
• Type II error (false negative): the test result says the patient does not have coronavirus, but they actually do.
Type II Error
● A type II error occurs if a false null hypothesis is not rejected (a “false negative”)
● The probability of making a Type II error is beta (β)
● In reality, your study may not have had enough statistical power to detect an effect of a certain
size.
● The Type I and Type II error rates influence each other. That’s because the
significance level (the Type I error rate) affects statistical power, which is
inversely related to the Type II error rate.
• Setting a lower significance level decreases a Type I error risk, but increases a
Type II error risk.
• Increasing the power of a test decreases a Type II error risk, but increases a Type
I error risk.
• Type I and Type II errors occur where the null and alternative sampling distributions overlap: one shaded region of the overlap represents alpha, the Type I error rate, and the other represents beta, the Type II error rate.
• By setting the Type I error rate, you indirectly influence the size of the Type II error
rate as well.
• It’s important to strike a balance between the risks of making Type I and Type II
errors. Reducing the alpha always comes at the cost of increasing beta, and vice versa.
Is a Type I or Type II error worse?
For statisticians, a Type I error is usually worse. In practical terms, however, either
type of error could be worse depending on your research context
Example: Consequences of a Type I error
Based on the incorrect conclusion that the new drug intervention is effective, over a
million patients are prescribed the medication, despite risks of severe side effects and
inadequate research on the outcomes. The consequences of this Type I error also
mean that other treatment options are rejected in favor of this intervention.
Example: Consequences of a Type II error
If a Type II error is made, the drug intervention is considered ineffective when it can
actually improve symptoms of the disease. This means that a medication with
important clinical significance doesn’t reach a large number of patients who could
tangibly benefit from it.
Definition of One-tailed Test
● One-tailed test alludes to the significance test in which the region of
rejection appears on one end of the sampling distribution.
● It represents that the estimated test parameter is greater or less than the
critical value.
● When the sample tested falls in the region of rejection, i.e. either the left or the right side as the case may be, it leads to accepting the alternative hypothesis rather than the null hypothesis.
A one-tailed test can be either left-tailed (the rejection region lies on the left) or right-tailed (the rejection region lies on the right).
Z Test Definition
● A z test is conducted on a population that follows a normal distribution with
independent data points and has a sample size that is greater than or equal to 30.
● It is used to check whether the means of two populations are equal to each other
when the population variance is known.
● The null hypothesis of a z test can be rejected if the z test statistic is statistically
significant when compared with the critical value.
One sample Z test
A one-sample z test is used to check if there is a difference between the sample mean and the
population mean when the population standard deviation is known.
The formula for the z test statistic is given as
z = (X̄ − μ) / (σ / √n)
where X̄ is the sample mean, μ is the population mean, σ is the population standard deviation and n is
the sample size.
● The average weight of a certain breed of dog is 20 kg with a standard
deviation of 2 kg. A sample of 50 dogs is taken, and the sample mean is
21 kg. Perform a Z test to determine whether the sample mean is significantly
different from the population mean at a 5% level of significance.
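A worked sketch of this example in Python (SciPy is assumed only for the critical value):

from math import sqrt
from scipy.stats import norm

mu, sigma = 20, 2     # population mean and standard deviation (kg)
x_bar, n = 21, 50     # sample mean (kg) and sample size

z = (x_bar - mu) / (sigma / sqrt(n))
z_critical = norm.ppf(1 - 0.05 / 2)     # two-tailed critical value at alpha = 0.05

print(f"z = {z:.2f}, critical value = ±{z_critical:.2f}")
# z ≈ 3.54 > 1.96, so the sample mean differs significantly from the population mean.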
Left Tailed Test:
Null Hypothesis: H0: μ=μ0
Alternate Hypothesis: H1 : μ<μ0
Decision Criteria: If the z statistic < z critical value then reject the null hypothesis.
The population average rpm of a motor is 100 rpm. A team of designers wants to change the type of
lubrication used and observe whether the rpm increases, decreases or remains the same. A sample of
30 motors was taken, and the observed mean rpm was 140 with a standard deviation of 20. Did the
change in lubrication have an effect? Take alpha as 0.05.
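A worked sketch of this example, treated as a two-tailed z test at alpha = 0.05 (the sample standard deviation of 20 is used in place of σ, as given in the problem):

from math import sqrt

mu, x_bar, s, n = 100, 140, 20, 30

z = (x_bar - mu) / (s / sqrt(n))
print(f"z = {z:.2f}")
# z ≈ 10.95, far beyond the critical values of ±1.96, so the lubrication change affected the rpm.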
Your company wants to improve sales. Past sales data indicate that the average sale was $100 per
transaction. After training your sales force, recent sales data (taken from a sample of 25 salesmen)
indicates an average sale of $130, with a standard deviation of $15. Did the training work? Test your
hypothesis at a 5% alpha level.
Soln -
Ho - The accepted hypothesis is that there is no difference in sales (μ = $100)
Ha - There is a difference (μ > $100)
sample mean(x̄) - $130.
population mean(μ) - $100
sample standard deviation(s) - $15.
Number of observations(n) - 25.
t = (x̄ − μ) / (s / √n) = (130 − 100) / (15 / √25) = 30 / 3 = 10
df = 25 − 1 = 24
From the t table, the critical value is 1.711. Since t = 10 > 1.711, we reject the null hypothesis: the training worked.
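The same calculation can be verified in Python (SciPy is assumed only for the critical value):

from math import sqrt
from scipy.stats import t as t_dist

mu, x_bar, s, n = 100, 130, 15, 25

t_stat = (x_bar - mu) / (s / sqrt(n))          # (130 - 100) / (15 / 5) = 10
t_critical = t_dist.ppf(1 - 0.05, df=n - 1)    # one-tailed critical value, df = 24

print(f"t = {t_stat:.2f}, critical value = {t_critical:.3f}")
# 10 > 1.711, so we reject the null hypothesis: the training appears to have worked.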
A two-sample t hypothesis test, also known as an independent t-test, is used to analyze
the difference between two unknown population means.
The two-sample t-test is used when two small samples (n < 30) are taken from
two different populations and compared. The test makes use of the t distribution.
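A minimal sketch of an independent two-sample t test on made-up measurements, assuming SciPy's ttest_ind is available:

from scipy.stats import ttest_ind

# Hypothetical measurements from two different populations
group_a = [12.1, 11.8, 13.0, 12.5, 11.9, 12.7]
group_b = [13.4, 13.1, 12.9, 13.8, 13.5, 13.2]

t_stat, p_value = ttest_ind(group_a, group_b)   # assumes equal variances by default
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")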
Types of Data
• Continuous (aka ratio variables): represent measures and can usually be divided into units smaller than one
(e.g. 0.75 grams).
• Discrete (aka integer variables): represent counts and usually can’t be divided into units smaller than one
(e.g. 1 tree).
• Ordinal represent data with an order (e.g. rankings).
• Nominal represent group names (e.g. brands or species names).
• Binary represent data with a yes/no or 1/0 outcome (e.g. win or lose).
Chi-square test (χ²)
A Chi-square test is a hypothesis testing method. Chi-square tests involve checking if observed frequencies
in one or more categories match expected frequencies.
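As an illustration, the sketch below runs a chi-square goodness-of-fit test on made-up category counts, assuming SciPy's chisquare function:

from scipy.stats import chisquare

observed = [18, 22, 20, 40]   # hypothetical observed counts per category
expected = [25, 25, 25, 25]   # expected counts if all categories were equally likely

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
# A small p-value suggests the observed frequencies do not match the expected frequencies.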