Be A 65 Ads Exp 4
Experiment No.04
A.1 Aim:
To implement inferential statistics analysis on given data set.
A.2 Prerequisite:
Knowledge of Python, datasets (Kaggle), and hypothesis tests such as the T-test, ANOVA test, and so on.
A.3 Outcome:
After successful completion of this experiment, students will be able to infer the p-value
of the given data.
A.4 Theory:
Data science is an emerging technology in the corporate society and it mainly deals with the data.
Applying statistical analysis to data and getting insights from it is our main objective. A
company will store millions of records for analysis. A data scientist will collect all the required
data and conduct statistical operations to arrive at conclusions. This type of statistical analysis is
called Descriptive Statistics.
Suppose they collect a subset of data, known as Sample in statistical terminology, from the entire
data known as Population. The sample is analyzed and conclusions are drawn about the
population. This type of analysis falls under Statistical Inference (also known as Inferential
Statistics).
Sampling Methods
Sampling is necessary for Statistical Analysis because studying the entire population is rarely
practical. The required sample size depends on the maximum permissible error, the confidence
level, and the population variance/standard deviation.
Maximum permissible error is defined as the difference between actual output and predicted
output.
Confidence Level is defined as the probability that the value of a parameter falls within a
specified range of values.
Population Variance is defined as the value of variance that is calculated from population data.
There are two types of sampling methods
1. Random Sampling
2. Stratified Sampling
1. Random Sampling
In Random Sampling, every member of the population has an equal chance of being selected.
2. Stratified Sampling
In Stratified Sampling, the population is divided into groups based on characteristics.
These groups are called Strata. The sample is chosen randomly from each of these groups.
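The two sampling methods can be sketched with pandas as follows. This is a minimal illustration only: the `population` table, its `region` column, and the sample sizes are made-up assumptions, not part of the experiment's dataset.

```python
import pandas as pd

# Hypothetical population: 100 customers in three regions (made-up data).
population = pd.DataFrame({
    "customer_id": range(100),
    "region": ["north"] * 50 + ["south"] * 30 + ["east"] * 20,
})

# Random sampling: every row has the same chance of being selected.
random_sample = population.sample(n=10, random_state=0)

# Stratified sampling: draw 10% at random from each stratum (region).
stratified_sample = population.groupby("region").sample(frac=0.1, random_state=0)

print(random_sample["region"].value_counts())
print(stratified_sample["region"].value_counts())
```

The stratified sample keeps the regional proportions of the population, while the random sample may over- or under-represent a region by chance.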
Hypothesis Testing
In statistics, a hypothesis is a statement about the population. Hypothesis testing collects
evidence from a sample and, based on the evidence found, either rejects or fails to reject the
hypothesis about the population.
In each hypothesis test, there are two statements: the Null hypothesis and the Alternate hypothesis.
Null Hypothesis (H0): It is a statement that rejects the observation based on which the
hypothesis is made. You can start the hypothesis testing considering the null hypothesis to be
true. It cannot be rejected until there is evidence that suggests otherwise.
Alternate Hypothesis (Ha): It is a statement that is contradictory to the null hypothesis. If you
find enough evidence to reject the null hypothesis, then the alternative hypothesis is accepted.
If the probability of occurrence of the given data (the P-value) is less than the level of
significance (0.05), you can reject the null hypothesis.
If the probability of occurrence of the given data (the P-value) is greater than or equal to the
level of significance (0.05), you cannot reject the null hypothesis.
Step 1: State the null hypothesis, the alternate hypothesis, and the level of significance.
Step 2: Compute the test statistic and the corresponding P-value from the sample data.
Step 3: Conclude whether to reject the null hypothesis or not based on the P-value.
The following sections discuss statistical tests for hypothesis testing, including both parametric
and non-parametric tests.
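The three steps can be sketched in Python using a one-sample T-test from SciPy. This is a minimal illustration under stated assumptions: the `sample` values and the hypothesized mean of 500 are made up, and the 0.05 level of significance follows the text above.

```python
import numpy as np
from scipy import stats

# Hypothetical sample (made-up values), e.g. fill weights in grams.
sample = np.array([498.2, 501.5, 499.8, 502.1, 497.6, 500.9, 503.0, 499.1])

# Step 1: state H0 (population mean = 500), Ha (mean != 500) and alpha.
hypothesized_mean = 500.0
alpha = 0.05

# Step 2: compute the test statistic and the P-value.
t_statistic, p_value = stats.ttest_1samp(sample, popmean=hypothesized_mean)

# Step 3: compare the P-value with the level of significance.
reject_null = p_value < alpha
print("t =", t_statistic, "p-value =", p_value)
print("Reject H0" if reject_null else "Fail to reject H0")
```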
Mumbai University
TPCT’s, TERNA ENGINEERING COLLEGE (TEC), NAVI MUMBAI
Parametric Tests
The basic principle behind the parametric tests is that we have a fixed set of parameters that are
used to determine a probabilistic model that may be used in Machine Learning as well.
Parametric tests are those tests for which we have prior knowledge of the population distribution
(i.e., normal), or, if not, we can approximate it by a normal distribution with the help of the
Central Limit Theorem.
The parameters typically involved are:
• Mean
• Standard Deviation
Typical uses of parametric tests are:
• To find the confidence interval for the population mean when the standard deviation is
known.
• To determine the confidence interval for the population mean when the standard deviation
is unknown.
• To find the confidence interval for the population variance.
• To find the confidence interval for the difference of two means, with an unknown value
of standard deviation.
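A confidence interval for the population mean with an unknown standard deviation, one of the uses listed above, can be sketched as follows. The `sample` values are made up and the 95% confidence level is an assumption chosen for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample (made-up values).
sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])
confidence = 0.95

n = len(sample)
mean = sample.mean()
# Standard error of the mean, using the sample standard deviation (ddof=1).
std_error = sample.std(ddof=1) / np.sqrt(n)

# Critical t value for a two-sided interval with n - 1 degrees of freedom.
t_critical = stats.t.ppf((1 + confidence) / 2, df=n - 1)
lower = mean - t_critical * std_error
upper = mean + t_critical * std_error
print("95% confidence interval for the mean:", (lower, upper))
```

Because the population standard deviation is unknown, the interval uses the t distribution rather than the normal distribution.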
Non-parametric Tests
In Non-Parametric tests, we don’t make any assumption about the parameters of the population
we are studying. In fact, these tests don’t depend on the form of the population distribution.
Hence, no fixed set of parameters is available, and no distribution (normal distribution, etc.)
of any kind is assumed.
This is also the reason that non-parametric tests are referred to as distribution-free tests.
Non-parametric tests are gaining popularity and influence. Some reasons behind this are:
• The main reason is that non-parametric tests do not have to satisfy the strict conditions
that parametric tests require.
• The second reason is that we are not required to make assumptions about the population
on which we are doing the analysis.
• Most of the non-parametric tests available are very easy to apply and to understand, i.e.
their complexity is very low.
T-Test
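A T-test can be a:
One Sample T-test: To compare a sample mean with that of the population mean.
t = (x̄ - μ) / (s / √n)
where,
x̄ is the sample mean
μ is the population mean
s is the sample standard deviation
n is the sample size
Two Sample T-test: To compare the means of two different samples.
t = (x̄1 - x̄2) / √(S1²/n1 + S2²/n2)
where,
x̄1 is the sample mean of the first group
x̄2 is the sample mean of the second group
S1 is the sample-1 standard deviation
S2 is the sample-2 standard deviation
n1 and n2 are the sample sizes of the two groups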
• If the value of the test statistic is greater than the table value -> Reject the null
hypothesis.
• If the value of the test statistic is less than the table value -> Do not reject the null
hypothesis.
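A two-sample T-test comparing the means of two groups can be sketched as follows. The data in `group1` and `group2` are made up for illustration, and the unpooled (Welch) form of the statistic is an assumption about which variant is intended; the hand-computed value is checked against SciPy's `ttest_ind`.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (made-up values) for two groups.
group1 = np.array([23.1, 25.4, 24.8, 26.0, 22.9, 25.1])
group2 = np.array([27.3, 28.1, 26.5, 29.0, 27.8, 28.4])

# Test statistic computed by hand from means, standard deviations and sizes.
mean1, mean2 = group1.mean(), group2.mean()
s1, s2 = group1.std(ddof=1), group2.std(ddof=1)
n1, n2 = len(group1), len(group2)
t_manual = (mean1 - mean2) / np.sqrt(s1**2 / n1 + s2**2 / n2)

# The same test via SciPy; equal_var=False selects the unpooled (Welch) form.
t_scipy, p_value = stats.ttest_ind(group1, group2, equal_var=False)

print("t (manual):", t_manual)
print("t (SciPy): ", t_scipy, "p-value:", p_value)
```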
Z-Test
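One Sample Z-test: To compare a sample mean with that of the population mean.
z = (x̄ - μ) / (σ / √n)
where,
x̄ is the sample mean
μ is the population mean
σ is the population standard deviation
n is the sample size
Two Sample Z-test: To compare the means of two different samples.
z = (x̄1 - x̄2) / √(σ1²/n1 + σ2²/n2)
where,
x̄1 is the sample mean of the 1st group
x̄2 is the sample mean of the 2nd group
σ1 is the population-1 standard deviation
σ2 is the population-2 standard deviation
n1 and n2 are the sample sizes of the two groups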
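A one-sample Z-test can be sketched by computing the statistic directly and reading the P-value from the standard normal distribution. The `sample` values, the hypothesized mean, and the known population standard deviation below are all made-up assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical sample (made-up values) and assumed known population parameters.
sample = np.array([102.0, 98.5, 101.2, 99.8, 103.1, 100.4, 97.9, 101.7, 100.9])
population_mean = 100.0   # value stated by H0 (assumption)
population_std = 2.0      # assumed known population standard deviation

n = len(sample)
z_statistic = (sample.mean() - population_mean) / (population_std / np.sqrt(n))

# Two-sided P-value from the standard normal distribution.
p_value = 2 * norm.sf(abs(z_statistic))
print("z =", z_statistic, "p-value =", p_value)
```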
ANOVA
6. Types:
Chi-Square Test
7. It is also known as the “Goodness of fit test” which determines whether a particular
distribution fits the observed data or not.
8. It is calculated as: χ² = Σ (O - E)² / E, where O is an observed value and E is the
corresponding expected value.
11. Chi-square as a parametric test is used as a test for population variance based on sample
variance.
12. If we take each one of a collection of sample variances, divide them by the known population
variance and multiply these quotients by (n-1), where n means the number of items in the
sample, we get the values of chi-square.
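Both uses of chi-square described above can be sketched as follows: a goodness-of-fit test, and the variance statistic obtained from (n-1) times the sample variance divided by the population variance. All data and the assumed population variance are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical die-roll counts (made-up data): 60 rolls, faces 1..6.
observed = np.array([8, 12, 9, 11, 10, 10])
expected = np.full(6, observed.sum() / 6)  # a fair die expects 10 per face

# Goodness of fit: sum of (observed - expected)^2 / expected.
chi_square_manual = ((observed - expected) ** 2 / expected).sum()
chi_square_scipy, p_value = stats.chisquare(observed, f_exp=expected)
print("chi-square:", chi_square_manual, chi_square_scipy, "p-value:", p_value)

# Variance test statistic: (n - 1) * sample variance / population variance.
sample = np.array([4.9, 5.3, 5.1, 4.7, 5.0, 5.4, 4.8, 5.2])
assumed_population_variance = 0.04  # hypothetical value under H0
n = len(sample)
chi_square_variance = (n - 1) * sample.var(ddof=1) / assumed_population_variance
p_upper = stats.chi2.sf(chi_square_variance, df=n - 1)
print("variance chi-square:", chi_square_variance, "upper-tail p-value:", p_upper)
```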
Mann-Whitney U-Test
When consulting the significance tables, the smaller of U1 and U2 is used. The sum of the two
values is given by,
U1 + U2 = { R1 - n1(n1+1)/2 } + { R2 - n2(n2+1)/2 }
where R1 and R2 are the rank sums of the two samples and n1 and n2 are their sizes.
Knowing that R1 + R2 = N(N+1)/2 and N = n1 + n2, and doing some algebra, we find that the sum is:
U1 + U2 = n1*n2
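The identity U1 + U2 = n1*n2 can be checked numerically. The sketch below ranks two made-up samples together, computes U1 and U2 from the rank sums, and compares with SciPy's `mannwhitneyu`; the sample values are assumptions chosen to avoid ties.

```python
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

# Hypothetical samples (made-up values, no ties).
sample1 = np.array([1.1, 2.4, 3.8, 5.0, 6.3])
sample2 = np.array([2.0, 4.1, 4.6, 7.2, 8.5, 9.0])
n1, n2 = len(sample1), len(sample2)

# Rank all observations together, then sum the ranks per sample.
ranks = rankdata(np.concatenate([sample1, sample2]))
R1, R2 = ranks[:n1].sum(), ranks[n1:].sum()

U1 = R1 - n1 * (n1 + 1) / 2
U2 = R2 - n2 * (n2 + 1) / 2
print("U1 =", U1, "U2 =", U2, "U1 + U2 =", U1 + U2, "n1 * n2 =", n1 * n2)

# SciPy reports one of the two U values together with a P-value.
result = mannwhitneyu(sample1, sample2, alternative="two-sided")
print("SciPy U:", result.statistic, "p-value:", result.pvalue)
```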
Kruskal-Wallis H-test
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical.
The soft copy must be uploaded on Blackboard, or emailed to the concerned lab in-charge faculty at
the end of the practical in case there is no Blackboard access available.)
Program:
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import numpy as np
from scipy.stats import chi2

dataset = sns.load_dataset('tips')
print(dataset.head())

# Contingency table of sex vs smoker
dataset_table = pd.crosstab(dataset['sex'], dataset['smoker'])
print(dataset_table)

# Observed Values
Observed_Values = dataset_table.values
print("Observed Values :-\n", Observed_Values)

# chi2_contingency returns (statistic, p-value, degrees of freedom, expected frequencies)
val = stats.chi2_contingency(dataset_table)
# Expected Values (this assignment was missing from the listing)
Expected_Values = val[3]

no_of_rows = len(dataset_table.iloc[0:2, 0])
no_of_columns = len(dataset_table.iloc[0, 0:2])
ddof = (no_of_rows - 1) * (no_of_columns - 1)
print("Degree of Freedom:-", ddof)
alpha = 0.05

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_square = sum([(o - e) ** 2. / e for o, e in zip(Observed_Values, Expected_Values)])
chi_square_statistic = chi_square[0] + chi_square[1]
print("chi-square statistic:-", chi_square_statistic)

critical_value = chi2.ppf(q=1 - alpha, df=ddof)
print('critical_value:', critical_value)

# p-value
p_value = 1 - chi2.cdf(x=chi_square_statistic, df=ddof)
print('p-value:', p_value)
print('Significance level: ', alpha)
print('Degree of Freedom: ', ddof)

# The body of this test was cut off in the listing; the messages below are a reconstruction
if chi_square_statistic >= critical_value:
    print("Reject H0")
else:
    print("Fail to reject H0")
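# Standard error of the sample mean.
# Note: ages, ages_mean and std are assumed to be defined earlier in the
# program; their definitions are not shown in this listing.
Mu = ages_mean
Std = std
sample_avg_bp = np.average(ages)
# Standard dev of the sampling mean distribution... estimated from population
std_error_bp = Std / np.sqrt(len(ages))
print("Sample Avg BP : ", sample_avg_bp)
print("Standard Error: ", std_error_bp)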
B.2 Conclusion:
Hence, we can conclude that we have successfully implemented inferential statistics on the given
data set using the Python programming language.