
Mumbai University

TPCT’s, TERNA ENGINEERING COLLEGE (TEC), NAVI MUMBAI

(PART A: TO BE REFERRED BY STUDENTS)

Experiment No.04
A.1 Aim:
To implement inferential statistical analysis on a given data set.
A.2 Prerequisite:
Knowledge of Python, a data set (e.g., from Kaggle), and hypothesis tests such as the Z-test, T-test, ANOVA, and so on.

A.3 Outcome:
After successful completion of this experiment, students will be able to infer the p-value
of a given data set.

A.4 Theory:
Data science is an emerging field in the corporate world, and it deals mainly with data.
Applying statistical analysis to data and extracting insights from it is our main objective. A
company may store millions of records for analysis. A data scientist collects all the required
data and performs statistical operations to arrive at conclusions. This type of statistical analysis is
called Descriptive Statistics.
Suppose we collect a subset of the data, known as a Sample in statistical terminology, from the
entire data set, known as the Population. The sample is analyzed and conclusions are drawn about
the population. This type of analysis falls under Statistical Inference (also known as Inferential
Statistics).

Sampling Methods

Sampling is necessary for statistical analysis; the required sample size depends on the maximum
permissible error, the confidence level, and the population variance/standard deviation.
Maximum permissible error is defined as the difference between the actual output and the
predicted output.
Confidence level is defined as the probability that the value of a parameter falls within a
specified range of values.
Population variance is defined as the value of variance that is calculated from the population data.
There are two types of sampling methods:
1. Random Sampling
2. Stratified Sampling

1. Random Sampling

If we use random sampling, 10 students (the sample) are selected at random from 1000
students (the population). With random sampling, there is a chance of selecting the same type of
students as a sample. This may give a biased result, and such a sample is called a Biased Sample.
2. Stratified Sampling

If we use stratified sampling, the population is divided into groups based on characteristics;
these groups are called Strata. The sample is then chosen randomly from each of these groups.
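As a sketch, both sampling methods can be tried with pandas; the student population below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical population of 1000 students across three branches.
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "branch": rng.choice(["CS", "IT", "EXTC"], size=1000),
    "marks": rng.normal(60, 10, size=1000),
})

# Random sampling: every student has the same chance of being picked.
random_sample = population.sample(n=10, random_state=0)

# Stratified sampling: pick randomly *within* each stratum (branch).
stratified_sample = pd.concat(
    grp.sample(n=3, random_state=0) for _, grp in population.groupby("branch")
)
print(random_sample.shape, stratified_sample["branch"].value_counts().to_dict())
```

The stratified draw guarantees every branch is represented, which a purely random draw of 10 students does not.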

Hypothesis Testing

What is hypothesis testing?

In statistics, a hypothesis is a statement about the population, and hypothesis testing deals with
collecting enough evidence about that statement. Based on the evidence collected, the test either
accepts or rejects the hypothesis about the population.

Hypothesis testing is performed to find evidence in support of a hypothesis. Based on the
evidence found, the hypothesis can be accepted or rejected.

In every hypothesis test, there are two hypotheses: the null hypothesis and the alternate hypothesis.

Null Hypothesis (H0): A statement that contradicts the observation on which the hypothesis is
made. You start hypothesis testing by assuming the null hypothesis to be true. It cannot be
rejected until there is evidence that suggests otherwise.

Alternate Hypothesis (Ha): It is a statement that is contradictory to the null hypothesis. If you
find enough evidence to reject the null hypothesis, then the alternative hypothesis is accepted.

If the probability of occurrence of the given data is less than the level of significance (0.05),
you can reject the null hypothesis.

If the probability of occurrence of the given data is greater than or equal to the level of
significance (0.05), you cannot reject the null hypothesis.

Steps to perform hypothesis testing:

Step 1: State the null hypothesis, the alternate hypothesis, and the level of significance.

Step 2: Calculate the P-value.

Step 3: Conclude whether to reject the null hypothesis or not based on the P-value i.e.

• If P-value < significance level, then reject the null hypothesis


• If P-value >= significance level, the null hypothesis cannot be rejected

Step 4: State the conclusion.
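The four steps can be sketched with a one-sample t-test from scipy; the heights and the hypothesized mean of 165 cm are illustrative assumptions:

```python
import numpy as np
from scipy import stats

# Step 1: H0: the population mean height is 165 cm; Ha: it is not.
#         Significance level chosen before seeing the data.
alpha = 0.05
heights = np.array([162, 168, 171, 159, 166, 170, 163, 167])  # hypothetical sample

# Step 2: compute the p-value with a one-sample t-test.
t_stat, p_value = stats.ttest_1samp(heights, popmean=165)

# Steps 3 and 4: compare against alpha and state the conclusion.
if p_value < alpha:
    print("Reject H0: the mean height differs from 165 cm")
else:
    print("Fail to reject H0: no evidence the mean differs from 165 cm")
```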

Below, we discuss the statistical tests for hypothesis testing, including both parametric and
non-parametric tests.
Parametric Tests

The basic principle behind the parametric tests is that we have a fixed set of parameters that are
used to determine a probabilistic model that may be used in Machine Learning as well.

Parametric tests are those tests for which we have prior knowledge of the population distribution
(i.e., normal), or, if not, for which we can easily approximate it to a normal distribution, which is
possible with the help of the Central Limit Theorem.
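A quick sketch of the Central Limit Theorem at work: means of samples drawn from a skewed (exponential) population are approximately normal, centred on the population mean. The population here is simulated for illustration:

```python
import numpy as np

# Skewed (exponential) population with mean 2 and standard deviation 2.
rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)

# Distribution of the mean of many samples of size 50.
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(2_000)
])

# The sampling distribution of the mean is centred on the population mean,
# with standard error sigma / sqrt(n), even though the population is skewed.
print(sample_means.mean())  # close to 2
print(sample_means.std())   # close to 2 / sqrt(50)
```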

The parameters used for the normal distribution are:

• Mean
• Standard Deviation

Ultimately, whether a test is classified as parametric depends entirely on the population
assumptions. There are many parametric tests available, some of which are as follows:

• To find the confidence interval for the population mean with a known standard deviation.
• To determine the confidence interval for the population mean with an unknown standard
deviation.
• To find the confidence interval for the population variance.
• To find the confidence interval for the difference of two means, with an unknown standard
deviation.

Non-parametric Tests

In non-parametric tests, we do not make any assumptions about the parameters of the population
we are studying. In fact, these tests do not depend on the population: no fixed set of parameters
is available, and no distribution (normal or otherwise) is assumed.

This is also why non-parametric tests are referred to as distribution-free tests. Non-parametric
tests have been gaining popularity and influence, for several reasons:

• The main reason is that there is no need to satisfy the strict conditions required by
parametric tests.
• The second reason is that we are not required to make assumptions about the population
on which we are doing the analysis.
• Most of the available non-parametric tests are very easy to apply and to understand, i.e.,
their complexity is very low.

T-Test

1. It is a parametric test of hypothesis testing based on Student’s T distribution.


2. Essentially, it tests the significance of the difference of mean values when the sample size
is small (i.e., less than 30) and the population standard deviation is not available.
3. Assumptions of this test:
• Population distribution is normal.
• Samples are random and independent.
• The sample size is small.
• Population standard deviation is not known.
4. Mann-Whitney ‘U’ test is a non-parametric counterpart of the T-test.

A T-test can be:

One Sample T-test: To compare a sample mean with the population mean.

t = (x̄ − μ) / (s / √n)

where,
x̄ is the sample mean,
s is the sample standard deviation,
n is the sample size, and
μ is the population mean.
Two-Sample T-test: To compare the means of two different samples.

t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

where,
x̄1 is the sample mean of the first group,
x̄2 is the sample mean of the second group,
s1 is the standard deviation of sample 1,
s2 is the standard deviation of sample 2, and
n1, n2 are the sample sizes.

• If the value of the test statistic is greater than the table value -> reject the null
hypothesis.
• If the value of the test statistic is less than the table value -> do not reject the null
hypothesis.
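A minimal sketch of the two-sample case with scipy.stats.ttest_ind; the two batches of marks are simulated, made-up data:

```python
import numpy as np
from scipy import stats

# Two hypothetical small samples (n < 30): marks of two batches.
rng = np.random.default_rng(1)
batch_a = rng.normal(loc=60, scale=8, size=15)
batch_b = rng.normal(loc=68, scale=8, size=15)

# Two-sample (independent) t-test under the assumptions listed above.
t_stat, p_value = stats.ttest_ind(batch_a, batch_b)
print("t =", round(float(t_stat), 3), "p =", round(float(p_value), 4))
```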

Z-Test

1. It is a parametric test of hypothesis testing.


2. It is used to determine whether the means are different when the population variance is known
and the sample size is large (i.e, greater than 30).
3. Assumptions of this test:
• Population distribution is normal.
• Samples are random and independent.
• The sample size is large.
• Population standard deviation is known.

A Z-test can be:

One Sample Z-test: To compare a sample mean with the population mean.

z = (x̄ − μ) / (σ / √n)

Two Sample Z-test: To compare the means of two different samples.

z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)

where,
x̄1 is the sample mean of the 1st group,
x̄2 is the sample mean of the 2nd group,
σ1 is the population-1 standard deviation,
σ2 is the population-2 standard deviation, and
n1, n2 are the sample sizes.
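Since scipy has no one-sample z-test helper, a sketch can compute z directly from the formula above; the sample here is simulated, and σ is assumed known:

```python
import numpy as np
from scipy import stats

# Assumed known population parameters and a simulated large sample (n > 30).
mu, sigma = 100, 15
rng = np.random.default_rng(2)
sample = rng.normal(loc=103, scale=sigma, size=50)

# z = (x̄ − μ) / (σ / √n), then a two-sided p-value from the normal tail.
z = (sample.mean() - mu) / (sigma / np.sqrt(len(sample)))
p_value = 2 * stats.norm.sf(abs(z))
print("z =", round(float(z), 3), "p =", round(float(p_value), 4))
```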

ANOVA

1. Also called Analysis of Variance, it is a parametric test of hypothesis testing.


2. It is an extension of the T-Test and Z-test.
3. It is used to test the significance of the differences in the mean values among more than two
sample groups.
4. It uses the F-test to statistically test the equality of means and the relative variance between them.
5. Assumptions of this test:

• Population distribution is normal.
• Samples are random and independent.
• Homogeneity of sample variance.

6. Types:

One-way ANOVA and Two-way ANOVA
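A one-way ANOVA sketch with scipy.stats.f_oneway; the three groups of scores are hypothetical:

```python
from scipy import stats

# Hypothetical scores of three groups taught by three different methods.
group1 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
group2 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
group3 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

# One-way ANOVA: H0 says all three group means are equal.
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print("F =", round(float(f_stat), 3), "p =", round(float(p_value), 4))
```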

Chi-Square Test

1. It is a non-parametric test of hypothesis testing.


2. As a non-parametric test, chi-square can be used:
• as a test of goodness of fit.
• as a test of independence of two variables.
3. It helps in assessing the goodness of fit between a set of observed values and those expected
theoretically.
4. It makes a comparison between the expected frequencies and the observed frequencies.
5. The greater the difference, the greater is the value of chi-square.
6. If there is no difference between the expected and observed frequencies, then the value of
chi-square is equal to zero.

7. It is also known as the “Goodness of fit test” which determines whether a particular
distribution fits the observed data or not.
8. It is calculated as:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is an observed frequency and Eᵢ is the corresponding expected frequency.
9. Chi-square is also used to test the independence of two variables.


10. Conditions for chi-square test:
• Observations should be collected and recorded at random.
• All the entities in the sample must be independent.
• No group should contain very few items, say fewer than 10.
• The overall number of items should be reasonably large, normally at least 50, however
small the number of groups may be.

11. Chi-square as a parametric test is used as a test for population variance based on sample
variance.

12. If we take each of a collection of sample variances, divide it by the known population
variance, and multiply the quotient by (n−1), where n is the number of items in the sample,
we get a value of chi-square.
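A sketch of the chi-square test of independence on a hypothetical 2×2 contingency table (scipy.stats.chi2_contingency also returns the expected frequencies):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table of counts, e.g. gender vs preference.
observed = np.array([[30, 10],
                     [20, 40]])

# Test of independence between the row and column variables.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print("chi2 =", round(float(chi2), 3), "p =", round(float(p_value), 4), "dof =", dof)
```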

Mann-Whitney U-Test

1. It is a non-parametric test of hypothesis testing.


2. This test is used to investigate whether two independent samples were selected from a population
having the same distribution.
3. It is the true non-parametric counterpart of the T-test and gives the most accurate estimates of
significance, especially when sample sizes are small and the population is not normally distributed.
4. It is based on the comparison of every observation in the first sample with every observation
in the other sample.
5. The test statistic used here is “U”.
6. The maximum value of “U” is n1*n2 and the minimum value is zero.
7. It is also known as:
• Mann-Whitney Wilcoxon Test.
• Mann-Whitney Wilcoxon Rank Test.
8. Mathematically, U is given by:
U1 = R1 − n1(n1+1)/2, where n1 is the sample size for sample 1 and R1 is the sum of ranks in sample 1.
U2 = R2 − n2(n2+1)/2

When consulting the significance tables, the smaller of U1 and U2 is used. The sum of the two
values is given by
U1 + U2 = { R1 − n1(n1+1)/2 } + { R2 − n2(n2+1)/2 }
Knowing that R1 + R2 = N(N+1)/2 and N = n1 + n2, and doing some algebra, we find that the sum is
U1 + U2 = n1*n2
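A sketch with scipy.stats.mannwhitneyu on two hypothetical samples; the identity U1 + U2 = n1*n2 lets us recover the other statistic:

```python
from scipy import stats

# Hypothetical reaction times (ms) of two independent groups (n1 = n2 = 7).
group1 = [210, 215, 230, 198, 250, 205, 240]
group2 = [260, 255, 248, 270, 262, 245, 258]

# H0: the two samples come from populations with the same distribution.
u1, p_value = stats.mannwhitneyu(group1, group2, alternative="two-sided")
u2 = len(group1) * len(group2) - u1  # from U1 + U2 = n1*n2
print("U1 =", u1, "U2 =", u2, "p =", round(float(p_value), 4))
```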

Kruskal-Wallis H-test

1. It is a non-parametric test of hypothesis testing.


2. This test is used for comparing two or more independent samples of equal or different sample sizes.
3. It extends the Mann-Whitney U-Test, which is used for comparing only two groups.
4. One-way ANOVA is the parametric equivalent of this test, which is why it is also known as the
‘one-way ANOVA on ranks’.
5. It uses ranks instead of actual data.
6. It does not assume the population to be normally distributed.
7. The test statistic used here is “H”.
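A Kruskal-Wallis sketch using scipy.stats.kruskal on three hypothetical groups of ratings:

```python
from scipy import stats

# Hypothetical ratings from three independent groups (possibly non-normal).
group1 = [7, 8, 6, 9, 7]
group2 = [5, 6, 4, 5, 6]
group3 = [9, 9, 8, 10, 9]

# H0: all groups come from populations with the same distribution.
h_stat, p_value = stats.kruskal(group1, group2, group3)
print("H =", round(float(h_stat), 3), "p =", round(float(p_value), 4))
```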

PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical. The
soft copy must be uploaded on Blackboard or emailed to the concerned lab in-charge faculty at the
end of the practical in case there is no Blackboard access available.)

Roll No.: 65 Name: Ketki Kulkarni


Class: BE-A Batch: A4
Date of Experiment: Date of Submission:
Grade:

Program:
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import numpy as np

# Chi-square test of independence: sex vs smoker in the 'tips' dataset
dataset = sns.load_dataset('tips')
dataset.head()
dataset_table = pd.crosstab(dataset['sex'], dataset['smoker'])
print(dataset_table)

# Observed values
Observed_Values = dataset_table.values
print("Observed Values :-\n", Observed_Values)

# chi2_contingency returns (statistic, p-value, dof, expected frequencies)
val = stats.chi2_contingency(dataset_table)
Expected_Values = val[3]

no_of_rows = len(dataset_table.iloc[0:2, 0])
no_of_columns = len(dataset_table.iloc[0, 0:2])
ddof = (no_of_rows - 1) * (no_of_columns - 1)
print("Degree of Freedom:-", ddof)
alpha = 0.05

from scipy.stats import chi2
chi_square = sum([(o - e) ** 2. / e for o, e in zip(Observed_Values, Expected_Values)])
chi_square_statistic = chi_square[0] + chi_square[1]
critical_value = chi2.ppf(q=1 - alpha, df=ddof)
print('critical_value:', critical_value)

# p-value
p_value = 1 - chi2.cdf(x=chi_square_statistic, df=ddof)
print('p-value:', p_value)
print('Significance level: ', alpha)
print('Degree of Freedom: ', ddof)

if chi_square_statistic >= critical_value:
    print("Reject H0, there is a relationship between 2 categorical variables")
else:
    print("Retain H0, there is no relationship between 2 categorical variables")

if p_value <= alpha:
    print("Reject H0, there is a relationship between 2 categorical variables")
else:
    print("Retain H0, there is no relationship between 2 categorical variables")

# One-sample T-test on ages
ages = [10, 20, 35, 50, 28, 40, 55, 18, 16, 55, 30, 25, 43, 18, 30, 28, 14, 24,
        16, 17, 32, 35, 26, 27, 65, 18, 43, 23, 21, 20, 19, 70]
print(len(ages))
ages_mean = np.mean(ages)
print(ages_mean)
std = np.std(ages)
print("Standard Deviation:", std)

# Let's take a sample
sample_size = 10
age_sample = np.random.choice(ages, sample_size)

from scipy.stats import ttest_1samp
ttest, p_value = ttest_1samp(age_sample, 30)
print(p_value)
if p_value < 0.05:  # alpha value is 0.05 or 5%
    print("we are rejecting the null hypothesis")
else:
    print("we are accepting the null hypothesis")

# Paired T-test on weights before and after
np.random.seed(12)
ClassB_ages = stats.poisson.rvs(loc=18, mu=33, size=60)
ClassB_ages.mean()

weight1 = [25, 30, 28, 35, 28, 34, 26, 29, 30, 26, 28, 32, 31, 30, 45]
weight2 = weight1 + stats.norm.rvs(scale=5, loc=-1.25, size=15)
print(weight1)
print(weight2)

weight_df = pd.DataFrame({"weight_10": np.array(weight1),
                          "weight_20": np.array(weight2),
                          "weight_change": np.array(weight2) - np.array(weight1)})
weight_df

_, p_value = stats.ttest_rel(a=weight1, b=weight2)
print(p_value)
if p_value < 0.05:  # alpha value is 0.05 or 5%
    print("we are rejecting the null hypothesis")
else:
    print("we are accepting the null hypothesis")

# Z-test on ages
import scipy.stats as st

Mu = ages_mean
Std = std
sample_avg_bp = np.average(ages)
std_error_bp = Std / np.sqrt(len(ages))  # std dev of the sampling mean distribution, estimated from the population
print("Sample Avg BP : ", sample_avg_bp)
print("Standard Error: ", std_error_bp)

# Z_norm_deviate = (sample_mean - population_mean) / std_error_bp
Z_norm_deviate = (sample_avg_bp - Mu) / std_error_bp
print("Normal Deviate Z value :", Z_norm_deviate)

p_value = st.norm.sf(abs(Z_norm_deviate)) * 2  # two-sided, using sf (survival function)
print('p values', p_value)
if p_value > 0.05:
    print('Samples are likely drawn from the same distributions (H0 accepted)')
else:
    print('Samples are likely drawn from different distributions (reject H0)')

# One-way ANOVA on the iris dataset
df1 = sns.load_dataset('iris')
df_anova = df1[['petal_width', 'species']]
grps = pd.unique(df_anova.species.values)
d_data = {grp: df_anova['petal_width'][df_anova.species == grp] for grp in grps}

F, p = stats.f_oneway(d_data['setosa'], d_data['versicolor'], d_data['virginica'])
if p < 0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")
B.1 Observations and learning:
Inferential statistics plays a crucial role in scientific research by helping researchers draw
conclusions about a population based on data from a sample. The main role of inferential
statistics is to use the information obtained from a sample to make inferences or predictions
about the larger population from which the sample was drawn.
Inferential statistics is used to test hypotheses and make estimates about population parameters
such as mean, variance, and correlation. It helps researchers determine whether the results
obtained from a sample are likely to generalize to the larger population from which the sample
was drawn, or if they are due to chance or sampling error.
In summary, the role of inferential statistics is to help researchers make valid and reliable
inferences about a population based on data from a sample, while accounting for the inherent
variability and uncertainty associated with sampling and measurement.

B.2 Conclusion:
Hence, we can conclude that we have successfully implemented inferential statistics on the given
data set using the Python programming language.

B.3 Question of Curiosity


Q1. What is p value? How does p value help to infer data?
Answer: The p-value is a statistical measure used to determine the significance of results in a
hypothesis test. It is the probability of obtaining a result as extreme or more extreme than the
observed result, assuming the null hypothesis is true. It helps to determine whether the
observed differences between two groups are statistically significant or merely due to chance.
The p-value is compared with a pre-determined significance level to decide whether to reject
or fail to reject the null hypothesis.

Q2. Explain what a hypothesis is and how hypothesis testing is done.


Answer: A hypothesis is a statement or assumption made about a population parameter
based on available evidence. Hypothesis testing is a process of testing a hypothesis by
collecting and analyzing data from a sample of the population. It involves formulating a
research question, collecting data, choosing a significance level, calculating a test statistic
and a p-value, and making a decision about whether to accept or reject the null hypothesis.
The null hypothesis is the hypothesis that there is no significant difference between two
groups or no relationship between two variables, while the alternative hypothesis is the
hypothesis that there is a significant difference between two groups or a relationship between
two variables.


Q3. What are the types of statistical tests that can be used to infer data?
Answer: There are various types of statistical tests that can be used to infer data, including
parametric tests such as t-tests and ANOVA, non-parametric tests such as Chi-squared test,
Mann-Whitney U test, and Kruskal-Wallis test, and correlation analysis and regression
analysis. The choice of statistical test depends on the research question, type of data, and the
number of groups being compared. Researchers should choose the appropriate statistical test
based on these factors to ensure that the results are valid and reliable.
