Module 02 - AIML Statistics

Uploaded by

Bhaskar Mulik

Probability & Statistics

Module 02
What Is a Random Variable?
A random variable is a variable whose value is unknown in advance, or equivalently a function that
assigns a value to each of an experiment's outcomes.

Two Types of Random variable:


1. Discrete random variable
2. Continuous random variable
Discrete Random Variable:
A discrete random variable is a type of random variable that has a countable
number of distinct values that can be assigned to it, such as the outcome of a coin toss.

Continuous Random Variable:
A continuous random variable can take any value within a specific range and can
therefore reflect an infinite number of potential values, such as the
average rainfall in a region.
Why is a Statistical Distribution important?
What do you mean by

● Mean
● Median
● Mode
● Variance
● Standard Deviation
Continuous Random Variable

Continuous random variables are used to model continuous phenomena or quantities,
such as time, length, mass, ... that depend on chance.

To calculate probabilities we need the probability density function (PDF) f(x):

$$ P(a \le X \le b) = \int_a^b f(x)\,dx $$

where the total area under the curve is 1.


A continuous random variable X has the following probability law:
f(x) = kx², 0 ≤ x ≤ 2
Determine k and find the probability that (i) 0.2 ≤ X ≤ 0.5
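The problem above can be checked with a short sketch. It assumes only that the total area under f(x) = kx² on [0, 2] must equal 1 and uses the antiderivative x³/3; the helper name is illustrative and not part of the original slides. Exact fractions are used so that no rounding enters the result (k = 3/8, and the probability is about 0.014625).

```python
from fractions import Fraction

def cubic_antiderivative(x):
    """Antiderivative of x**2, i.e. x**3 / 3, evaluated exactly."""
    return Fraction(x) ** 3 / 3

# Total area must be 1:  k * (F(2) - F(0)) = 1
k = 1 / (cubic_antiderivative(2) - cubic_antiderivative(0))

# P(0.2 <= X <= 0.5) = k * (F(0.5) - F(0.2))
a, b = Fraction(1, 5), Fraction(1, 2)
prob = k * (cubic_antiderivative(b) - cubic_antiderivative(a))

print("k =", k)
print("P(0.2 <= X <= 0.5) =", float(prob))
```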
Binomial Probability Distribution
● Bernoulli Trial: An experiment with a binary outcome: success (p) or failure (1-p).
● Number of Trials (n): Fixed number of independent trials.
● Number of Successes (k): Number of successes in these trials.
● Probability of Success (p): Probability of a success on an individual trial.
Problem:
If you flip a fair coin (where p=0.5) 10 times, what is the probability of getting
exactly 4 heads?
Parameters: n = 10, k = 4, p = 0.5
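A minimal sketch of this calculation, assuming the standard binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k), which the slide does not spell out. The function name is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Fair coin, 10 flips, exactly 4 heads: C(10, 4) / 2**10 = 210 / 1024
prob_4_heads = binomial_pmf(4, 10, 0.5)
print(round(prob_4_heads, 4))  # 0.2051
```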
1) Plot the binomial distribution for a success probability of 0.1 and 5 observations,
for every number of successes from 0 to 5.
2) Plot the binomial distribution for a success probability of 0.5 and 5 observations,
for every number of successes from 0 to 5.
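One possible way to approach the two plotting exercises is sketched below. To stay dependency-free it prints a text bar chart instead of opening a plot window; swapping in a plotting library is left to the reader. The function name and the bar scaling (50 characters for probability 1) are arbitrary choices, not from the slides.

```python
from math import comb

def binomial_pmf_table(n, p):
    """List of P(X = k) for k = 0..n under a binomial(n, p) model."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

for p in (0.1, 0.5):
    print(f"n = 5, p = {p}")
    for k, prob in enumerate(binomial_pmf_table(5, p)):
        bar = "#" * round(prob * 50)   # crude text bar, 50 characters = probability 1
        print(f"  k = {k}: {prob:.4f} {bar}")
```

With p = 0.1 the mass piles up at 0 successes, while p = 0.5 gives a shape symmetric about 2.5.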
The Poisson Distribution
● The Poisson distribution can arise when the probability of occurrence of an event is small
and the number of trials is very large.

Examples –
● Number of accidents in a week
● Number of deaths in a week

$$ P(x) = \frac{e^{-m} m^{x}}{x!}, \quad x = 0, 1, 2, \ldots $$

● The number of trials n is infinitely large.
● The probability p is very small.
● np = m, the mean number of occurrences.
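The sketch below illustrates the limiting relationship described above: for a large n and a small p, binomial probabilities are close to Poisson probabilities with m = np. The values n = 10,000 and p = 0.0002 are hypothetical, chosen only for illustration.

```python
from math import comb, exp, factorial

def poisson_pmf(x, m):
    """P(x) = e^(-m) * m^x / x!"""
    return exp(-m) * m**x / factorial(x)

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10_000, 0.0002   # hypothetical: many trials, rare event
m = n * p               # mean number of occurrences, here 2

for x in range(5):
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, m), 5))
```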
● Probability Sampling Methods vs Non-Probability Sampling Methods

Probability Sampling Methods | Non-probability Sampling Methods
Probability sampling is a sampling technique in which samples taken from a larger population are chosen based on probability theory. | Non-probability sampling is a technique in which the researcher chooses samples based on subjective judgment rather than random selection.
These are also known as random sampling methods. | These are also called non-random sampling methods.
These are used for research which is conclusive. | These are used for research which is exploratory.
These involve a long time to get the data. | These are easy ways to collect the data quickly.
There is an underlying hypothesis in probability sampling before the study starts. Also, the objective of this method is to validate the defined hypothesis. | The hypothesis is derived later by conducting the research study in the case of non-probability sampling.
● Numerical:
● Suppose you have a population of 100 students and you want to randomly select a
sample of 10 students.

● Population: 100 students


● Sample Size: 10 students
● Using a random number generator, you could select the following students (e.g.,
student IDs):
● 3, 12, 25, 34, 47, 58, 69, 71, 82, 95
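The selection above can be sketched with Python's standard `random` module. The seed value is an arbitrary assumption made only so the run is repeatable; the IDs drawn will differ from the ones listed in the example.

```python
import random

population = list(range(1, 101))   # student IDs 1..100
rng = random.Random(42)            # fixed seed (arbitrary) for repeatability

# Draw 10 distinct students, each equally likely to be chosen
sample = sorted(rng.sample(population, k=10))
print(sample)
```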

● Assume you have a population of 100 students, consisting of 60 males and 40
females, and you want to take a sample of 10 students such that the sample reflects
the gender ratio of the population.

● Population: 100 students (60 males, 40 females)


● Sample Size: 10 students
● Proportional sampling:
● Males: (60/100)×10 = 6 students
● Females: (40/100)×10= 4 students

● Randomly select 6 males and 4 females from their respective groups:


● Males: 2, 7, 15, 28, 33, 46
● Females: 5, 14, 22, 37
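A minimal sketch of the proportional allocation above. It assumes each group is numbered separately (males 1..60, females 1..40), which matches the IDs in the example but is not stated on the slide; the function name and seed are illustrative. Note that simple rounding of the quotas is only guaranteed to add up to the sample size when the proportions divide evenly, as they do here.

```python
import random

def proportional_stratified_sample(strata, sample_size, rng):
    """Draw from each stratum in proportion to its share of the population."""
    total = sum(len(members) for members in strata.values())
    chosen = {}
    for name, members in strata.items():
        quota = round(sample_size * len(members) / total)  # e.g. (60/100) * 10 = 6
        chosen[name] = sorted(rng.sample(members, quota))
    return chosen

strata = {
    "males": list(range(1, 61)),     # assumed numbering within the group
    "females": list(range(1, 41)),
}
stratified = proportional_stratified_sample(strata, 10, random.Random(7))
print(stratified)
```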

Numerical:
Suppose you have a population of 100 students listed in a sequence, and you want
to select a sample of 10 students.
Population: 100 students
Sample Size: 10 students

Sampling Interval: k=100/10 = 10

Randomly choose a starting point between 1 and 10 (e.g., 4)

Then select every 10th student:

4, 14, 24, 34, 44, 54, 64, 74, 84, 94
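The same procedure as a short sketch: compute the interval k, pick a random start in 1..k, then step through the list. The function name is illustrative, and the code assumes the population size is an exact multiple of the sample size, as in the example.

```python
import random

def systematic_sample(population_size, sample_size, rng):
    """Every k-th unit after a random start, with k = population_size // sample_size."""
    interval = population_size // sample_size
    start = rng.randint(1, interval)          # random starting point in 1..k
    return list(range(start, population_size + 1, interval))

print(systematic_sample(100, 10, random.Random(0)))
print(list(range(4, 101, 10)))   # a start of 4 reproduces the example above
```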



● Convenience Sampling
● Example: A researcher wants to study the eating habits of college students and
decides to survey students in the campus cafeteria during lunch hours.
● Scenario:
• Population: All college students
• Sample: Students in the cafeteria at lunchtime

● Judgmental (or Purposive) Sampling


● Example: A researcher is studying the impact of a new teaching method on high-
performing students and selects only the top 10% of students based on their
academic records.
● Scenario:
• Population: All students in a school
• Sample: Top 10% of students based on grades
● Snowball Sampling
● Example: A researcher is studying the experiences of former crime gang members.
They start with a few known former gang members and ask them to refer other
former gang members they know.
● Scenario:
• Population: Former gang members
• Sample: Initial few former gang members, expanding to others referred by them

● Quota Sampling
● Example: A market researcher wants to understand the buying behavior of different
age groups. They decide to interview 20 people from each age group: 18-25, 26-35,
36-45, and 46-55.
● Scenario:
• Population: Consumers
• Sample: 20 people from each specified age group
Hypothesis Testing
Hypothesis testing is a statistical procedure in which an analyst tests an
assumption regarding a population parameter. The methodology employed by
the analyst depends on the nature of the data used and the reason for the
analysis.
● Hypothesis testing is used to assess the plausibility of a hypothesis by
using sample data.
● Such data may come from a larger population, or from a data-generating
process.

Hypothesis Meaning:
An idea that is suggested as the possible explanation for something but has not yet been found to be
true or correct.
Null Hypothesis (H0)

● It represents a default assumption that there is no effect or no difference
between groups or variables.
● Essentially, it's a statement that any observed differences or effects in
your data are due to random chance rather than a real effect.

Example: If you're testing whether a new drug is more effective than an
existing one, the null hypothesis would be that there is no difference in
effectiveness between the two drugs.
Alternative Hypothesis (H1)
● The alternative hypothesis is the statement that opposes the null
hypothesis.
● It suggests that there is a significant effect, relationship, or difference
between groups or variables.
● In hypothesis testing, if the evidence from the data is strong enough to
reject the null hypothesis, the alternative hypothesis is accepted.
Example: If you're testing whether a new drug is more effective than an
existing one, the alternative hypothesis would be "the new drug has a
different effect (better or worse) than the existing drug."
There are two types of alternative hypotheses:
1. Two-tailed alternative hypothesis: It suggests that the parameter is different
from the null hypothesis value, but it doesn't specify a direction (greater or less
than). Example: The new drug is either more or less effective than the existing
one.
2. One-tailed alternative hypothesis: It specifies the direction of the effect.
Example: The new drug is more effective than the existing one.

HYPOTHESIS TESTING

Null hypothesis, H0
▪ State the hypothesized value of the parameter before sampling.
▪ The assumption we wish to test (or the assumption we are trying to reject).
▪ E.g. population mean µ = 20
▪ There is no difference between Coke and Diet Coke.

Alternative hypothesis, HA
▪ All possible alternatives other than the null hypothesis.
▪ E.g. µ ≠ 20, µ > 20, µ < 20
▪ There is a difference between Coke and Diet Coke.
Normal Distribution
The most common and most useful continuous distribution.
• A symmetrical probability distribution where most results are
located in the middle and few are spread on both sides.
• It has the shape of a bell.
• Can be described entirely by its mean and standard deviation.
Can be found practically everywhere:
• In nature.
• In engineering and industrial processes.
• In social and human sciences.
• Many everyday data sets approximately follow the normal distribution.
Normal Curve:
• Helps in calculating the probabilities for normally distributed
populations.
• The probabilities are represented by the area under the normal curve.
• The total area under the curve is equal to 100% (or 1.00).
• This represents the population of the observations.

We can get a rough estimate of the probability above a value, below a
value, or between any two values.

[Figure: normal curve, total probability = 100%]

Continuous Improvement Toolkit . www.citoolkit.com


Normal Curve:
• Since the normal curve is symmetrical, 50 percent of the data lie
on each side of the mean.

[Figure: normal curve with total probability = 100%, the inflection point marked, and 50% of the area on each side of the mean]
Empirical Rule:
• For any normally distributed data:
• 68% of the data fall within 1 standard deviation of the mean.
• 95% of the data fall within 2 standard deviations of the mean.
• 99.7% of the data fall within 3 standard deviations of the mean.

[Figure: normal curve marked at μ ± 1σ, μ ± 2σ and μ ± 3σ. Area within ±1σ: 68.26%; within ±2σ: 95.44%; within ±3σ: 99.73%. The successive bands from -3σ to 3σ hold 2.1%, 13.6%, 34.1%, 34.1%, 13.6% and 2.1%.]
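The percentages in the empirical rule can be reproduced from the standard normal distribution. The sketch below uses the identity P(|Z| ≤ k) = erf(k / √2); the function name is illustrative. The computed values (about 68.27%, 95.45% and 99.73%) differ from the figure labels only in the last rounded digit.

```python
from math import erf, sqrt

def within_k_sigma(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {within_k_sigma(k):.2%}")
```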
Level of Significance:
● The level of significance is the risk, expressed as a probability, of rejecting the null hypothesis when it is true.
● It is denoted by α and is generally taken as 1%, 5% or 10%.

Level of Confidence:
● If the test statistic falls within this range, the null hypothesis is not rejected.
● It is denoted by (1 – α).
● Level of Significance + Level of Confidence = 1
Standard Normal Distribution: Converting from ‘X’ to ‘Z’:

[Figure: normal curve with σ = 0.2 and specification limits XL = 4.0 (LSL) and XU = 5.0 (USL)]

x    3.9   4.1   4.3   4.5   4.7   4.9   5.1
z    -3    -2    -1    0     1     2     3

The specification limits at 4.0 mm and 5.0 mm correspond to z = -2.5 and z = 2.5 respectively.
The X-scale is for the actual values and the Z-scale is for the standardized values.
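The conversion between the two scales uses z = (x - μ) / σ. In the sketch below σ = 0.2 comes from the figure, while μ = 4.5 is an assumption read off the scale (z = 0 sits under x = 4.5); the function name is illustrative.

```python
def z_score(x, mu, sigma):
    """Standardize x: number of standard deviations between x and the mean."""
    return (x - mu) / sigma

mu, sigma = 4.5, 0.2   # mu inferred from the figure, sigma as stated

for x in (3.9, 4.0, 4.5, 5.0, 5.1):
    print(f"x = {x}  ->  z = {z_score(x, mu, sigma):+.1f}")
```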
What is the use of the Standard Normal Distribution?
To determine the area under the normal distribution curve that
is:
• To the right of your data point.
• To the left of the data point.
• Between two data points.
• Outside of two data points.

[Figure: normal curves with shaded areas relative to a single point X and to a pair of points X1 and X2]
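The four kinds of area can be computed with `statistics.NormalDist` from the Python standard library. The numbers reuse the specification-limit example (σ = 0.2, limits 4.0 and 5.0) with an assumed mean of 4.5; variable names are illustrative.

```python
from statistics import NormalDist

dist = NormalDist(mu=4.5, sigma=0.2)   # mean of 4.5 is an assumption
x1, x2 = 4.0, 5.0                      # lower and upper specification limits

left = dist.cdf(x1)                    # area to the left of x1
right = 1 - dist.cdf(x2)               # area to the right of x2
between = dist.cdf(x2) - dist.cdf(x1)  # area between x1 and x2
outside = 1 - between                  # area outside the two points

print(f"left = {left:.5f}, right = {right:.5f}")
print(f"between = {between:.5f}, outside = {outside:.5f}")
```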
What is statistical significance?
● Significance is usually denoted by a p-value, or probability value.
● Statistical significance is arbitrary – it depends on the threshold, or alpha value,
chosen by the researcher. The most common threshold is p < 0.05, which means
that the data is likely to occur less than 5% of the time under the null hypothesis.
● When the p-value falls below the chosen alpha value, then we say the result of
the test is statistically significant.

Type I Error
A type I error occurs if a true null hypothesis is rejected (a “false positive”)
● The probability of making a Type I error is the significance level, or alpha (α)
● The significance level is usually set at 0.05 or 5%. This means that your results only
have a 5% chance of occurring, or less, if the null hypothesis is actually true.
● If the p value of your test is lower than the significance level, it means your results are
statistically significant and consistent with the alternative hypothesis.
● If your p value is higher than the significance level, then your results are considered
statistically non-significant
Example: Type I and Type II errors
A patient with mild symptoms decides to get tested for COVID-19. Two errors could potentially occur:
• Type I error (false positive): the test result says the patient has coronavirus, but
the patient actually doesn’t have it.
• Type II error (false negative): the test result says the patient doesn’t have coronavirus,
but the patient actually does.
Type II Error
● A type II error occurs if a false null hypothesis is not rejected (a “false negative”)
● The probability of making a Type II error is beta (β)
● In reality, your study may not have had enough statistical power to detect an effect of a certain
size.

What is statistical power?


In statistics, power refers to the likelihood of a hypothesis test detecting a true effect if there is one. A
statistically powerful test is less likely to produce a false negative (a Type II error).
Trade-off between Type I and Type II errors

● The Type I and Type II error rates influence each other. That’s because the
significance level (the Type I error rate) affects statistical power, which is
inversely related to the Type II error rate.
• Setting a lower significance level decreases a Type I error risk, but increases a
Type II error risk.
• Increasing the power of a test decreases a Type II error risk, but increases a Type
I error risk.
• Type I and Type II errors occur where these two distributions overlap. The blue
shaded area represents alpha, the Type I error rate, and the green shaded area
represents beta, the Type II error rate.
• By setting the Type I error rate, you indirectly influence the size of the Type II error
rate as well.
• It’s important to strike a balance between the risks of making Type I and Type II
errors. For a fixed sample size, reducing alpha comes at the cost of increasing beta, and vice versa.
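The trade-off can be made concrete with a small sketch for a right-tailed one-sample z test. The effect size (0.5), σ (2.0) and n (50) are hypothetical values chosen for illustration, and the function name is not from the slides. Beta is the probability that the test statistic stays below the critical value when the true mean is shifted by the stated effect.

```python
from statistics import NormalDist

def type_ii_error_rate(alpha, effect, sigma, n):
    """Beta for a right-tailed z test of H0: mu = mu0 when the true mean is mu0 + effect."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha)
    shift = effect * n ** 0.5 / sigma      # true mean of the z statistic under H1
    return std_normal.cdf(z_critical - shift)

for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error_rate(alpha, effect=0.5, sigma=2.0, n=50)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {1 - beta:.3f}")
```

With everything else held fixed, each reduction in alpha raises beta and lowers the power.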
Is a Type I or Type II error worse?

For statisticians, a Type I error is usually worse. In practical terms, however, either
type of error could be worse depending on your research context
Example: Consequences of a Type I error
Based on the incorrect conclusion that the new drug intervention is effective, over a
million patients are prescribed the medication, despite risks of severe side effects and
inadequate research on the outcomes. The consequences of this Type I error also
mean that other treatment options are rejected in favor of this intervention.
Example: Consequences of a Type II error
If a Type II error is made, the drug intervention is considered ineffective when it can
actually improve symptoms of the disease. This means that a medication with
important clinical significance doesn’t reach a large number of patients who could
tangibly benefit from it.
Definition of One-tailed Test
● One-tailed test alludes to the significance test in which the region of
rejection appears on one end of the sampling distribution.
● It represents that the estimated test parameter is greater or less than the
critical value.
● When the sample tested falls in the region of rejection, i.e. either left or
right side, as the case may be, it leads to the acceptance of alternative
hypothesis rather than the null hypothesis.
One tailed test can be:

• Left-tailed test: When the population parameter is believed to be lower
than the assumed one, the hypothesis test carried out is the left-tailed test.
• Right-tailed test: When the population parameter is supposed to be
greater than the assumed one, the statistical test conducted is a right-tailed
test.
Definition of Two-tailed Test
● The two-tailed test is described as a hypothesis test, in which the region of
rejection or say the critical area is on both the ends of the normal
distribution.
● It determines whether the sample tested falls within or outside a certain
range of values.
● Therefore, an alternative hypothesis is accepted in place of the null
hypothesis, if the calculated value falls in either of the two tails of the
probability distribution.
BASIS OF COMPARISON | ONE-TAILED TEST | TWO-TAILED TEST
Meaning | A statistical hypothesis test in which the alternative hypothesis has only one end is known as a one-tailed test. | A significance test in which the alternative hypothesis has two ends is called a two-tailed test.
Hypothesis | Directional | Non-directional
Region of rejection | Either left or right | Both left and right
Determines | If there is a relationship between variables in a single direction. | If there is a relationship between variables in either direction.
Result | Greater or less than a certain value. | Greater or less than a certain range of values.
Sign in alternative hypothesis | > or < | ≠
Z Test
● Z test is a statistical test that is conducted on data that approximately
follows a normal distribution.
● The z test can be performed on one sample, two samples, or on
proportions for hypothesis testing.
● It checks if the means of two large samples are different or not when the
population variance is known.
● A z test can further be classified into left-tailed, right-tailed, and two-tailed
hypothesis tests depending upon the parameters of the data.
What is Z Test
● A z test is a test that is used to check if the means of two populations are different or
not provided the data follows a normal distribution. For this purpose, the null
hypothesis and the alternative hypothesis must be set up and the value of the z test
statistic must be calculated. The decision criterion is based on the z critical value

Z Test Definition
● A z test is conducted on a population that follows a normal distribution with
independent data points and has a sample size that is greater than or equal to 30.
● It is used to check whether the means of two populations are equal to each other
when the population variance is known.
● The null hypothesis of a z test can be rejected if the z test statistic is statistically
significant when compared with the critical value.
One sample Z test
A one-sample z test is used to check if there is a difference between the sample mean and the
population mean when the population standard deviation is known.
The formula for the z test statistic is given as
$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$
where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation and n is
the sample size.
● The average weight of a certain breed of dog is 20 kg with a standard
deviation of 2 kg. A sample of 50 dogs is taken, and the sample mean is
21 kg. Perform a Z test to determine if the sample mean is significantly
different from the population mean at the 5% level of significance.
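A sketch of the dog-weight problem, assuming a two-tailed test since the question asks only whether the mean is "different". The critical value and p-value come from `statistics.NormalDist`; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 20.0, 2.0          # population mean and standard deviation (kg)
n, sample_mean = 50, 21.0
alpha = 0.05

z = (sample_mean - mu) / (sigma / sqrt(n))
z_critical = NormalDist().inv_cdf(1 - alpha / 2)      # two-tailed critical value
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = abs(z) > z_critical

print(f"z = {z:.3f}, critical = {z_critical:.3f}, p = {p_value:.5f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Here z is about 3.54, well beyond the critical value of about 1.96, so under these assumptions the null hypothesis is rejected.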
Left Tailed Test:
Null Hypothesis: H0: μ=μ0
Alternate Hypothesis: H1 : μ<μ0
Decision Criteria: If the z statistic < z critical value (which is negative for a left-tailed test) then reject the null hypothesis.

Right Tailed Test:
Null Hypothesis: H0: μ=μ0
Alternate Hypothesis: H1: μ>μ0
Decision Criteria: If the z statistic > z critical value then reject the null hypothesis.

Two Tailed Test:
Null Hypothesis: H0: μ=μ0
Alternate Hypothesis: H1: μ≠μ0
Decision Criteria: If the absolute value of the z statistic > z critical value (taken at α/2) then reject the null hypothesis.
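The three decision rules can be collected into one helper, sketched below. The function name and the `tail` labels are illustrative; critical values are obtained from the standard normal distribution rather than a printed table.

```python
from statistics import NormalDist

def reject_null(z, alpha, tail):
    """Apply the z-test decision rule for a 'left', 'right' or 'two' tailed test."""
    std_normal = NormalDist()
    if tail == "left":
        return z < std_normal.inv_cdf(alpha)            # negative critical value
    if tail == "right":
        return z > std_normal.inv_cdf(1 - alpha)
    if tail == "two":
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError("tail must be 'left', 'right' or 'two'")

print(reject_null(1.7, 0.05, "right"))   # True: 1.7 exceeds about 1.645
print(reject_null(1.7, 0.05, "two"))     # False: 1.7 is below about 1.96
```

The same statistic can be significant in a one-tailed test and not in a two-tailed one, because the two-tailed test splits α across both ends.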
Two Sample Z Test
● A two sample z test is used to check if there is a difference between the means
of two samples.
T test
The One Sample T Hypothesis Test (Student’s T Test) compares the mean of a (small)
sample with a hypothesized population mean to determine whether they
are significantly different.
or
A t-test is a form of statistical hypothesis test, based on Student’s t-statistic and
t-distribution, used to find the p-value (probability), which can be used to accept or reject
the null hypothesis.
Assumptions of One Sample T Hypothesis Test
• Data is continuous and quantitative at the scale level (in other words, ratio
or interval data)
• The sample should be randomly selected from the population
• Samples are independent of each other
• Data should follow a normal probability distribution
• The dependent variable should not have extreme outliers
One Sample T Hypothesis test formula

$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} $$

● Where x̅ is the observed sample mean
• μ0 is the population mean
• s is the sample standard deviation
• n is the number of observations in the sample
Numerical:

The population average speed of a motor is 100 rpm. A team of designers wants to
change the type of lubrication used and to observe whether the rpm increases, decreases
or remains the same. A sample of 30 motors was taken, and its mean speed was observed
to be 140 rpm, with a standard deviation of 20 rpm. Did the change in lubrication have
an effect? Take alpha as 0.05.
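A sketch of the motor problem using the one-sample t statistic t = (x̄ - μ0) / (s / √n). A two-tailed test is assumed because the question allows an increase or a decrease. The critical value 2.045 (α = 0.05, df = 29) is read from a standard t table rather than computed, since the Python standard library has no t distribution.

```python
from math import sqrt

def one_sample_t(sample_mean, mu0, s, n):
    """t statistic for a one-sample t test."""
    return (sample_mean - mu0) / (s / sqrt(n))

t_stat = one_sample_t(sample_mean=140, mu0=100, s=20, n=30)
df = 30 - 1
t_critical = 2.045   # two-tailed, alpha = 0.05, df = 29 (t table value)

print(f"t = {t_stat:.3f}, df = {df}")
print("Reject H0" if abs(t_stat) > t_critical else "Fail to reject H0")
```

The statistic is about 10.95, far above the critical value, which suggests the change in lubrication did have an effect.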
Your company wants to improve sales. Past sales data indicate that the average sale was $100 per
transaction. After training your sales force, recent sales data (taken from a sample of 25 salesmen)
indicates an average sale of $130, with a standard deviation of $15. Did the training work? Test your
hypothesis at a 5% alpha level.
Soln -
H0 - The null hypothesis is that there is no difference in sales (μ = $100)
Ha - Sales have increased (μ > $100)
sample mean (x̄) - $130
population mean (μ) - $100
sample standard deviation (s) - $15
Number of observations (n) - 25
t = (130 − 100) / (15 / √25) = 30 / 3 = 10
df = 25 − 1 = 24
From the t table, the one-tailed critical value at α = 0.05 and df = 24 is 1.711

As the t-value is more than the critical value, we reject the null hypothesis.


Two Sample T Hypothesis Test

A two-sample t hypothesis test, also known as an independent t-test, is used to analyze
the difference between two unknown population means.
The two-sample t-test is used when two small samples (n < 30) are taken from
two different populations and compared. The underlying chart makes use of the t
distribution.

Assumptions of Two Sample T Hypothesis Tests

• The samples should be randomly selected from the two populations
• Samples are independent of each other
• Both sample sizes must be less than 30
• Samples collected from the populations are normally distributed
A test was conducted on two different experimental setups to check for any significant difference between
their performance. The final readings of 15 runs were sampled in the 1st setup, and
their mean value was 82 with a standard deviation of 2.4. The mean value of the 2nd setup was 84, with a
standard deviation of 1.7, using a sample of 12 observations. Determine if there is any significant difference
at the 5% level of significance.
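One way to work this problem is the pooled-variance two-sample t test, sketched below. Pooling assumes the two populations have equal variances, which the slide does not state; if that assumption is doubtful, Welch's unequal-variance form would be the alternative. The critical value 2.060 (two-tailed, α = 0.05, df = 25) is taken from a t table.

```python
from math import sqrt

def pooled_two_sample_t(mean1, s1, n1, mean2, s2, n2):
    """t statistic and degrees of freedom, assuming equal population variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df

t_stat, df = pooled_two_sample_t(82, 2.4, 15, 84, 1.7, 12)
t_critical = 2.060   # two-tailed, alpha = 0.05, df = 25 (t table value)

print(f"t = {t_stat:.3f}, df = {df}")
print("Reject H0" if abs(t_stat) > t_critical else "Fail to reject H0")
```

Under these assumptions t is about -2.44, so the difference between the two setups would be judged significant at the 5% level.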
Paired T Test
● A paired t test (also called a correlated pairs t-test, a paired samples t
test or dependent samples t test) is where you run a t test on dependent
samples. Dependent samples are essentially connected: they are tests on
the same person or thing.
For example:
• Two tests on the same person before and after training.
Types of Data in Statistics

• Continuous (aka ratio variables): represent measures and can usually be divided into units smaller than one
(e.g. 0.75 grams).
• Discrete (aka integer variables): represent counts and usually can’t be divided into units smaller than one
(e.g. 1 tree).
• Ordinal: represent data with an order (e.g. rankings).
• Nominal: represent group names (e.g. brands or species names).
• Binary: represent data with a yes/no or 1/0 outcome (e.g. win or lose).
Chi-square test (χ²)
A Chi-square test is a hypothesis testing method. Chi-square tests involve checking if observed frequencies
in one or more categories match expected frequencies.

$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$

where O = Observed Value, E = Expected Value

There are two commonly used Chi-square tests:

1) Chi-square goodness of fit test:
○ For one variable
○ Decide if one variable is likely to come from a given distribution or not
○ dof = (n-1), where n is the number of categories

2) Chi-square test of independence:
○ For two variables
○ Decide if two variables might be related or not
○ dof = (r-1)(c-1), where r and c are the numbers of rows and columns
● Using an appropriate hypothesis testing method, investigate the association
between the darkness of eye colour in father and son from the following
data.

                                   Colour of the Father’s Eye
                                   Dark    Not Dark    Total
Colour of the       Dark             48        90        138
Son’s Eye           Not Dark         80       782        862
                    Total           128       872       1000
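A minimal sketch of the chi-square test of independence for this table. Expected counts are computed as (row total × column total) / grand total, and the function name is illustrative. The critical value 3.841 (α = 0.05, dof = 1) is read from a chi-square table, and the 5% level is an assumption since the exercise does not name one.

```python
def chi_square_independence(observed):
    """Chi-square statistic and degrees of freedom for an r x c contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total   # expected count
            statistic += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

# Rows: son's eye (dark, not dark); columns: father's eye (dark, not dark)
eye_colour = [[48, 90], [80, 782]]
chi2, dof = chi_square_independence(eye_colour)
chi2_critical = 3.841   # alpha = 0.05, dof = 1 (table value)

print(f"chi-square = {chi2:.2f}, dof = {dof}")
print("Association found" if chi2 > chi2_critical else "No evidence of association")
```

The statistic comes out near 69.3, far above 3.841, which points to an association between the eye colours of father and son. The same function accepts any r × c table, so it can also be applied to the other contingency-table exercises in this section.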
● A sample of 400 undergraduate students and 400 postgraduate students
was taken to learn their opinion about autonomous colleges. 290 of the
undergraduate and 310 of the postgraduate students favoured autonomous
status. Present these facts in the form of a table and test, at the 5% level, that the
opinion regarding autonomous status of colleges is independent of the level of
classes of students.
● To test the effect of a new drug, a controlled experiment was conducted. 300
patients were given the new drug while 200 patients were given no drug. On the
basis of examination of these persons, the following results were obtained.

                       Cured    Condition Worsened    No effect    Total
Given the new drug      200             40                60        300
Not given the drug      120             30                50        200
Total                   320             70               110        500

Using the Chi-square test, analyze the effectiveness of the new drug.

● A die was thrown 132 times and the following frequencies were
observed.

No. Obtained    1    2    3    4    5    6    Total
Frequency      15   20   25   15   29   28    132

Using hypothesis testing, discuss whether the die is unbiased or not.
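The die problem is a goodness of fit test against equal expected frequencies (132 / 6 = 22 per face). The sketch below assumes a 5% level of significance, which the exercise does not specify; the critical value 11.070 (dof = 5) is read from a chi-square table.

```python
observed = [15, 20, 25, 15, 29, 28]
expected = [sum(observed) / len(observed)] * len(observed)   # 22 for every face

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1
chi2_critical = 11.070   # alpha = 0.05 (assumed), dof = 5 (table value)

print(f"chi-square = {chi2:.3f}, dof = {dof}")
print("Die appears biased" if chi2 > chi2_critical else "No evidence that the die is biased")
```

The statistic is 196 / 22, about 8.91, which is below 11.070, so at the assumed 5% level the hypothesis of an unbiased die is not rejected.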


● Theory predicts that the proportion of beans in the four groups A, B, C, D should
be 9:3:3:1. In an experiment among 1600 beans the numbers in the four groups
were 882, 313, 287 and 118. Do the experimental results support the theory?
Using hypothesis testing, provide a conclusion.
● The number of defects in a printed circuit board is hypothesised to follow
a Poisson distribution. A random sample of 60 printed boards showed the
following data.

No. of defects         0    1    2    3
Observed Frequency    32   15    9    4

Does the hypothesis of a Poisson distribution seem appropriate? Using
hypothesis testing, provide a conclusion.
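One possible treatment of this exercise is sketched below, and it rests on several assumptions not stated in the problem: the Poisson mean is estimated from the sample, the last class is read as "3 or more", and the last two classes are pooled because the expected count for "3 or more" falls below 5. One extra degree of freedom is lost for the estimated mean, giving dof = 3 - 1 - 1 = 1. The critical value 3.841 (α = 0.05 assumed, dof = 1) is from a chi-square table.

```python
from math import exp, factorial

observed = [32, 15, 9, 4]                     # boards with 0, 1, 2, 3 defects
n_boards = sum(observed)
mean_defects = sum(x * o for x, o in enumerate(observed)) / n_boards   # 0.75

# Poisson probabilities for 0, 1, 2 defects, remainder treated as "3 or more"
probs = [exp(-mean_defects) * mean_defects**x / factorial(x) for x in range(3)]
probs.append(1 - sum(probs))
expected = [n_boards * p for p in probs]

# Pool the last two classes so every expected count is at least 5
obs_pooled = observed[:2] + [sum(observed[2:])]
exp_pooled = expected[:2] + [sum(expected[2:])]

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
dof = len(obs_pooled) - 1 - 1                 # one extra loss for the estimated mean
chi2_critical = 3.841                         # alpha = 0.05 (assumed), dof = 1

print(f"mean = {mean_defects}, chi-square = {chi2:.3f}, dof = {dof}")
print("Poisson fit rejected" if chi2 > chi2_critical else "Poisson fit not rejected")
```

Under these choices the statistic is about 2.96, below 3.841, so the Poisson hypothesis would not be rejected; a different pooling choice could change the numbers.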
