Inferential Statistical Analysis Using
Python

Contents [ hide ]
1 Inferential Statistics

1.1 Contents
1.2 1. Z scores, Z-Test

1.2.1 1.1 Z Value

1.2.1.1 Computing z-score using default values
1.2.1.2 Computing z-score along a specified axis using degrees of freedom
1.2.1.3 Computing z-score using nan_policy

1.2.2 1.2 Z-Test

1.2.3 Example
1.2.4 T-Test
1.2.5 1.3 Two Sided One-Sample T-test

1.2.6 1.4 Independent t-Test

1.2.7 Using scipy library
1.2.8 Example

1.2.9 Using statsmodels

1.2.10 1.5 Paired t-Test

1.3 2. F-Test
1.4 3. Correlation coefficients
1.4.1 Calculating correlation coefficient using a pandas dataframe

1.4.2 4. Chi-Square Test

1.4.2.1 Importing Library

1.4.2.2 With f_obs values

1.4.2.3 With f_exp and f_obs values
1.4.2.4 With f_obs as 2d

1.4.2.5 With axis as None

1.4.2.6 With axis as 1

1.4.2.7 With ddof specified

Inferential Statistics
Inferential statistics is used to draw inferences and make predictions about a population from a given sample of data. It uses probability to reach conclusions.
There are several methods for performing inferential statistics on data. In this blog we will discuss the Z-Score, Z-Test, T-Test, F-Test, Correlation Coefficients and the Chi-Square Test for analysing data and reaching a probable conclusion based on it.

When do we use Inferential Statistics?

Inferential statistics is mainly used for drawing conclusions about data. The data can be a sample or a set of features, and since we often work with large amounts of data when building a model, inferential statistics comes in handy.

Contents
1. Z Scores, Z-Test

1.1 Z Value

1.2 Z-Test

1.3 Two-Sided One-Sample T-test

1.4 Independent t-Test

1.5 Paired T-test

2. F-Test

3. Correlation Coefficients

4. Chi-Square Test

1. Z Scores, Z-Test

1.1 Z Value
The Z-Value (Z-Score) tells how many standard deviations a value (x) lies below or above the population mean. If the Z-value is positive, the value (x) is higher than the mean; if the Z-value is negative, the value is lower than the mean.

The Z-Score is calculated as follows:

z = (X – μ) / σ

where,
X : Single data value
μ : Mean value
σ : Standard Deviation

The z-score in Python can be calculated by using scipy.stats.zscore, as scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate')

where,

a : array_like
An array-like object containing the sample data.

axis : int or None, optional
Axis along which to compute the z-scores. Default is 0.

ddof : int, optional
Degrees of freedom correction in the standard deviation calculation.
Default value is 0 (zero).

nan_policy : {'propagate', 'raise', 'omit'}, optional
This field defines how to handle an input that contains nan.
The default value is 'propagate', which returns nan.
The value 'raise' throws an error.
The value 'omit' ignores nan values and performs the calculation.

Note: when the policy is 'omit', the nan values in the input
propagate to the output, but they do not affect the z-scores
computed for the non-nan values.

Ex: a = [0.8976, 0.9989, 0.5678, 0.1234, 0.7765, 1, 1.675, 1.456]

==> Mean (μ) = sum of all the elements / N, where N = total number of elements
mean (μ) = (0.8976+0.9989+0.5678+0.1234+0.7765+1+1.675+1.456)/8 = 0.9369

==> Standard deviation (σ) = sqrt(Σ(X-μ)²/N), where X = element

standard deviation (σ) = sqrt(((0.8976-0.9369)^2+(0.9989-0.9369)^2+(0.5678-0.9369)^2+(0.1234-0.9369)^2+(0.7765-0.9369)^2+(1-0.9369)^2+(1.675-0.9369)^2+(1.456-0.9369)^2)/8) = 0.4537

==> Z-score (z) = (X-μ)/σ

z = [(0.8976-0.9369)/0.4537, (0.9989-0.9369)/0.4537, (0.5678-0.9369)/0.4537, (0.1234-0.9369)/0.4537,
(0.7765-0.9369)/0.4537, (1-0.9369)/0.4537, (1.675-0.9369)/0.4537, (1.456-0.9369)/0.4537]

Result ==> z ≈ [-0.0866, 0.1366, -0.8134, -1.7927, -0.3535, 0.1391, 1.6265, 1.1439]

Computing z-score using default values

In [2]: import numpy as np
        import scipy.stats as stats

        a = np.array([0.8976,0.9989,0.5678,0.1234,0.7765,1,1.675,1.456])
        stats.zscore(a)

Out[2]: array([-0.08660476, 0.13662837, -0.81337952, -1.79269639, -0.35347081,
               0.13905242, 1.62653867, 1.14393202])

Computing z-score along a specified axis using degrees of freedom

In [4]: a = np.array([[0.1234,0.4567,0.7890,0.9876],
                      [0.6789,0.7890,0.9987,0.6657],
                      [0.2234,0.9987,0.3345,0.5567]])

        stats.zscore(a,axis=1,ddof=1)

Out[4]: array([[-1.22576827, -0.3486311 , 0.52587439, 1.04852498],
               [-0.67641081, 0.03847117, 1.40005837, -0.76211873],
               [-0.88942498, 1.37202025, -0.56536131, 0.08276604]])

Computing z-score using nan_policy

In [15]: a = np.array([[0.1234,np.nan,0.7890,0.9876],
                       [0.6789,0.7890,0.9987,0.6657],
                       [np.nan,0.9987,0.3345,np.nan]])

         stats.zscore(a,axis=1) # default nan_policy is 'propagate', which returns nan

Out[15]: array([[ nan, nan, nan, nan],
                [-0.78105192, 0.04442268, 1.61664815, -0.88001891],
                [ nan, nan, nan, nan]])

In [16]: a = np.array([[0.1234,np.nan,0.7890,0.9876],
                       [0.6789,0.7890,0.9987,0.6657],
                       [np.nan,0.9987,0.3345,np.nan]])

         # nan_policy='raise' throws an error
         stats.zscore(a,axis=1,nan_policy='raise')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-7d7bc298bb30> in <module>
      3 [np.nan,0.9987,0.3345,np.nan]])
      4
----> 5 stats.zscore(a,axis=1,nan_policy='raise') # nan_policy='raise', throws error

~\anaconda3\lib\site-packages\scipy\stats\stats.py in zscore(a, axis, ddof, nan_policy)
   2545         return np.empty(a.shape)
   2546
-> 2547     contains_nan, nan_policy = _contains_nan(a, nan_policy)
   2548
   2549     if contains_nan and nan_policy == 'omit':

~\anaconda3\lib\site-packages\scipy\stats\stats.py in _contains_nan(a, nan_policy)
    237
    238     if contains_nan and nan_policy == 'raise':
--> 239         raise ValueError("The input contains nan values")
    240
    241     return contains_nan, nan_policy

ValueError: The input contains nan values


In [17]: a = np.array([[0.1234,np.nan,0.7890,0.9876],
                       [0.6789,0.7890,0.9987,0.6657],
                       [np.nan,0.9987,0.3345,np.nan]])

         stats.zscore(a,axis=1,nan_policy='omit') # 'omit' computes z-scores ignoring nan values

Out[17]: array([[-1.37976297, nan, 0.4211984 , 0.95856458],
                [-0.78105192, 0.04442268, 1.61664815, -0.88001891],
                [ nan, 1. , -1. , nan]])

1.2 Z-Test
The Z-Test is used to test a population mean or proportion. A Z-test can be used to test a given mean when the sample is large, meaning the length of the data is more than 30, and when the population standard deviation (and hence the variance) is known. The two-sample version of the test checks whether two sample means are approximately equal or not.

To perform a z-test the samples should be taken at random from the population and the data should be normally distributed. If the sample size is larger than 30, the data is assumed to be normally distributed; if the sample size is less than 30, the t-test is considered instead.

We check whether the obtained value is approximately equal or not by considering two hypotheses:
Null Hypothesis (H0): the value is equal to the other value; this hypothesis is accepted if the test supports it.
Alternate Hypothesis (HA): the value is not equal to the other value; this hypothesis is accepted otherwise.

For the one-sample case, the z-test statistic is calculated using the formula

z = (x̄ – μ) / (σ / √n)

where x̄ is the sample mean, μ is the population mean under H0, σ is the population standard deviation and n is the sample size.

After performing the z-test, the obtained pvalue is compared with the alpha value, which is assumed to be 0.05 in the z-score table.

If the pvalue is less than the alpha value, the Null Hypothesis is rejected, which means the Alternate Hypothesis is considered; in other words, the means of the two samples are not equal.
If the pvalue is greater than the alpha value, the Null Hypothesis is accepted; in other words, the means of the two samples are equal.

In Machine Learning, we calculate the z-test by using the ztest method from statsmodels.stats.weightstats:

statsmodels.stats.weightstats.ztest(x1, x2=None, value=0,
alternative='two-sided', usevar='pooled', ddof=1.0)

where,
x1,x2 are arrays

value : float
In the one sample case, value is the mean of x1 under the Null
hypothesis. In the two sample case, value is the difference between
mean of x1 and mean of x2 under the Null hypothesis. The test statistic
is x1_mean - x2_mean - value.

alternative : str
The alternative hypothesis, H1, has to be one of the following
‘two-sided’: H1: difference in means not equal to value
(default)
‘larger’ : H1: difference in means larger than value
‘smaller’ : H1: difference in means smaller than value

usevar : str, ‘pooled’


Currently, only ‘pooled’ is implemented. If pooled, then the
standard deviation of the samples is assumed to be the same. see
CompareMeans.ztest_ind for different options.

ddof : int
Degrees of freedom used in the calculation of the variance of the
mean estimate. In the case of comparing means this is one; however, it
can be adjusted for testing other statistics (proportion, correlation).

Returns,
tstat : float
test statistic

pvalue : float
pvalue of the z-test

Example
In [27]: import numpy as np
         import pandas as pd
         from numpy.random import randn
         from statsmodels.stats.weightstats import ztest

         x1 = [20, 30, 40, 50, 10, 20]

         z = ztest(x1, value=25) # where value is the hypothesised mean
         z

Out[27]: (0.5547001962252289, 0.5790997419539189)

The first value in the result is the test statistic and the second is the pvalue. From the above output, the pvalue of the taken data is about 0.58, which is greater than the alpha value of 0.05. Hence the Null Hypothesis is accepted, which means that the mean of the given data and the assumed mean are approximately equal.

In [42]: x1 = [20, 30, 40, 50, 10, 20]
         x2 = [11, 12, 13, 14, 15, 16]

         z = ztest(x1, x2, value=0, alternative='larger')
         z

Out[42]: (2.448717008689441, 0.007168301924196878)

From the above output, the Null Hypothesis is rejected and the Alternate Hypothesis is accepted, as the pvalue (about 0.007) is less than the alpha value.

T-Test
The T-test, also known as Student's T-test, is used to determine the difference between two groups of variables by comparing their mean values. The T-test not only determines the difference but also the significance of that difference. In other words, this test tells us whether the differences between the variable groups occurred by chance or are relevant to the data taken.

The 3 types of T-test are

1. Independent T-test
2. Paired Sample T-test
3. One-Sample T-test

One-Sample T-test: a t-test where one group's mean is compared with a single significant value, which is the population mean.

Types of One-Sample T-test are

1. One-tailed One-Sample T-test
2. Two-tailed One-Sample T-test
3. Upper-tailed One-Sample T-test
4. Lower-tailed One-Sample T-test

1.3 Two Sided One-Sample T-test

A Two-Sided (two-tailed) One-Sample T-test checks whether the sample mean differs from a hypothesised population mean in either direction; a sketch follows below.
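A minimal sketch using scipy.stats.ttest_1samp: the sample reuses x1 from the z-test example above, and the hypothesised mean of 25 is assumed for illustration.

In [ ]: import numpy as np
        from scipy import stats

        a = np.array([20, 30, 40, 50, 10, 20])

        # H0: the population mean is 25; HA: it is not (two-sided)
        stats.ttest_1samp(a, popmean=25)
        # statistic ≈ 0.555; the two-sided pvalue (≈ 0.6) is greater than
        # alpha (0.05), so the Null Hypothesis is accepted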

1.4 Independent t-Test

The Independent T-test, also known as the Two-Sample T-test, is used to test whether the means of 2 groups are equal or not. By default, the Independent T-test assumes that the two populations have equal variances.

In Machine Learning, we can perform this test

- Using the scipy library
- Using statsmodels

Using scipy library

scipy.stats.ttest_ind(a, b, axis=0, equal_var=True)

where,
a, b are the arrays of the 2 groups

axis : int or None, optional
Axis along which to compute the test. If None, compute over the whole arrays, a and b.

equal_var : bool, optional
If True (default), perform a standard independent 2-sample test that assumes equal population variances. If False, perform Welch's t-test, which does not assume equal population variance.

Returns,
statistic : float or array
The calculated t-statistic.

pvalue : float or array
The p-value.

Example
For example, we are given 2 different groups of data: Bag A has a bunch of apples and Bag B has a bunch of mangoes. We need to check whether both bags have the same means.

For this we assume 2 hypotheses:
H0 -> The means of the two bags are equal
HA -> The means of the two bags are not equal

To find which hypothesis holds, we compare the alpha value, assumed to be 0.05, with the pvalue obtained after performing the t-test. If the pvalue is less than the alpha value, HA is considered true; if the pvalue is greater than the alpha value, H0 is considered true.

Let's test the above example with a t-test using the scipy library.

In [111]: import scipy.stats as stats

          a = np.array([5,6,7,8,2,3,4,5])
          b = np.array([12,13,14,15,16,2,3,4])

          stats.ttest_ind(a, b, equal_var=True) # assuming the 2 groups have equal variances

Out[111]: Ttest_indResult(statistic=-2.2331335038240865, pvalue=0.042379219768910015)

In [112]: stats.ttest_ind(a, b, equal_var=False) # assuming the 2 groups have unequal variances

Out[112]: Ttest_indResult(statistic=-2.2331335038240865, pvalue=0.05369587840008499)

Interpreting the results with an alpha value of 0.05: under the equal-variance assumption the pvalue (0.042) is less than alpha, so the means of the two groups are not considered equal; under the unequal-variance assumption (Welch's t-test) the pvalue (0.054) is slightly above alpha, so at that threshold the Null Hypothesis would not be rejected.
Using statsmodels
statsmodels.stats.weightstats.ttest_ind(x1, x2)

where,
x1 and x2 are two array groups

Returns:
tstat : float
test statistic

pvalue : float
pvalue of the t-test

df : int or float
degrees of freedom used in the t-test

In [115]: from statsmodels.stats.weightstats import ttest_ind

          a = np.array([12,14,16,4,5,11,12,11])
          b = np.array([12,13,14,15,16,2,3,4])

          ttest_ind(a,b)

Out[115]: (0.2963188789948769, 0.7713367820262194, 14.0)

Interpreting the result from the above test, with an assumed alpha value of 0.05, the resulting pvalue is 0.77, which is greater than alpha. Hence the assumption H0 holds, i.e., the means of the 2 groups are equal.

1.5 Paired t-Test

A Paired t-Test explains the difference between two variables for the same subject. It compares one set of measurements with a second set from the same sample. This test is also known as the Dependent Sample T-test.

In simple words, this T-test measures the difference between the means of two related sets of measurements. Like the other tests, it assumes two hypotheses:
Null Hypothesis (H0): the difference between the means of the two sets is zero.
Alternate Hypothesis (HA): the difference between the means of the two sets is not zero.

In Machine Learning, the Paired T-test can be calculated by using the ttest_rel() method defined in the scipy.stats library:

scipy.stats.ttest_rel(a, b, axis=0, nan_policy='propagate',
alternative='two-sided')

where,

a, b : array_like

axis : int or None, optional


Axis along which to compute test. If None, compute over the
whole arrays, a, and b.

nan_policy : {‘propagate’, ‘raise’, ‘omit’}, optional


Defines how to handle when input contains nan. The following
options are available (default is ‘propagate’):
‘propagate’: returns nan
‘raise’: throws an error
‘omit’: performs the calculations ignoring nan values

alternative : {‘two-sided’, ‘less’, ‘greater’}, optional


Defines the alternative hypothesis. The following options
are available (default is ‘two-sided’):
‘two-sided’: the means of the distributions underlying the
samples are unequal.
‘less’: the mean of the distribution underlying the first
sample is less than the mean of the distribution underlying the second
sample.
‘greater’: the mean of the distribution underlying the first
sample is greater than the mean of the distribution underlying the
second sample.

Returns
statistic : float or array
t-statistic.

pvalue : float or array


The p-value.

In [3]: import scipy.stats as stats

        a = np.array([12, 14, 16, 4, 5, 11, 12, 11])
        b = np.array([12, 13, 14, 15, 16, 2, 3, 4])

        stats.ttest_rel(a,b)

Out[3]: Ttest_relResult(statistic=0.26355219111613715, pvalue=0.7997147761519707)

From the above result, the obtained pvalue (0.80) is greater than the assumed alpha value of 0.05. Hence we accept the Null Hypothesis H0, saying that the difference between the means of the two groups is zero.

2. F-Test
The F-Test can be applied to test for a significant difference between the variances of two populations, based on small samples drawn from those populations. The test based on this statistic is known as the F-Test.

Simply put, the F-test compares the variances of 2 groups by dividing them. The result of the F-test is always positive, because variances are always positive. If the two sample variances are s1 and s2, the statistic is F = s1^2 / s2^2.

The hypotheses for the F-test are defined as:
Null Hypothesis (H0): the variances of the two variables are equal and there is no significant difference.
Alternate Hypothesis (HA): the variances of the two variables are not equal.
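As a minimal sketch, this variance-ratio test can be computed directly with numpy and the scipy.stats.f distribution; the arrays a and b reuse the samples from the t-test examples above.

In [ ]: import numpy as np
        from scipy import stats

        a = np.array([12, 14, 16, 4, 5, 11, 12, 11])
        b = np.array([12, 13, 14, 15, 16, 2, 3, 4])

        s1, s2 = np.var(a, ddof=1), np.var(b, ddof=1)

        # Put the larger variance in the numerator so that F >= 1
        f_stat = max(s1, s2) / min(s1, s2)
        df1 = (len(a) if s1 >= s2 else len(b)) - 1  # numerator dof
        df2 = (len(b) if s1 >= s2 else len(a)) - 1  # denominator dof

        # Two-tailed p-value from the F distribution's survival function
        p_value = 2 * stats.f.sf(f_stat, df1, df2)
        print(f_stat, p_value)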

The F-Statistic, also known as the F-Value, is used in Analysis of Variance (ANOVA) and in regression models to find the significance between the means of populations by comparing variances. The F-Statistic is used in the F-test. The F-Test is similar to the T-test, except that in the F-test we check for significance among a group of variables, whereas in the T-test we check for significance between 2 variables. The F-test is used to check for similarity among the means of different variables.

For an F-test to be conducted we need to assume

- that the data taken is normally distributed
- that the numerator variance is the larger one and the denominator variance the smaller one

There are many statistics in which the F-Statistic is used, but the most widely used F-test is the Analysis Of Variance (ANOVA).

ANOVA: Analysis Of Variance, called ANOVA, is used to test the differences between the means of two or more groups. ANOVA uses the F-Statistic to calculate the difference between the means of 2 or more groups.

The hypotheses here are taken as:

Null Hypothesis (H0): the group means are equal.
Alternate Hypothesis (H1): the group means are not equal.

There are different types of ANOVA, such as:

- One-way ANOVA
- Two-way ANOVA
- Factorial ANOVA
- Repeated Measures ANOVA
- MANOVA, etc.

The most used ANOVA is One-way ANOVA, which compares group means against one independent variable to check whether the groups are alike or not.

This test is performed by using the method f_oneway from scipy.stats:

scipy.stats.f_oneway(*samples, axis=0)

where,
samples can be any number of groups or array-like variables
axis defines the axis along which the test is performed; it is optional and set to 0 (zero) by default

returns,
statistic : float
The computed F statistic of the test.

pvalue : float
The associated p-value from the F distribution.

As before, if the pvalue is less than the alpha value (0.05) the Null Hypothesis is rejected, and if the pvalue is higher than the alpha value the Null Hypothesis is accepted.

In [ ]: # Importing Libraries

In [2]: import numpy as np
        import pandas as pd
        import scipy.stats as stats

In [10]: # Creating a dataset

In [12]: cities = ["punjab","delhi","hyderabad","bangalore","mumbai"]

In [15]: # np.random.choice draws random values from the given list with the
         # given probabilities; the p list and size are truncated in the
         # original, so the values below are assumptions (p must sum to 1,
         # and size=1000 matches the outputs that follow)
         people_of_spec_city = np.random.choice(a=cities, size=1000,
                                                p=[0.05, 0.15, 0.25, 0.05, 0.5])

         people_of_spec_city
Out[15]: array(['mumbai', 'hyderabad', 'mumbai', 'hyderabad', 'mumbai', 'mumbai',
                'delhi', 'hyderabad', 'mumbai', 'mumbai', 'mumbai', 'hyderabad',
                ...,
                'mumbai', 'mumbai', 'mumbai', 'delhi', 'hyderabad', 'delhi',
                'mumbai'], dtype='<U9')

In [16]: # stats.poisson.rvs generates random numbers from a Poisson
         # distribution, where loc and mu are the distribution parameters and
         # size defines the number of values (size=1000 assumed; the original
         # line is truncated)
         population_of_spec_city = stats.poisson.rvs(loc=18, mu=30, size=1000)

         population_of_spec_city
Out[16]: array([61, 43, 54, 46, 55, 52, 43, 48, 42, 52, 55, 50, 39, 50, 54, 41, 42,
                51, 48, 58, 50, 49, 43, 48, 44, 49, 48, 49, 44, 47, 48, 58, 51, 39,
                ...,
                49, 48, 54, 45, 60, 43, 46, 57, 54, 48, 45, 49, 56, 44],
               dtype=int64)

In [19]: # Forming the DataFrame from the obtained values
         population_frame = pd.DataFrame({"city": people_of_spec_city,
                                          "population": population_of_spec_city})

         # Dividing these values into groups by the categorical variable
         groups = population_frame.groupby("city").groups

         groups

Out[19]: {'bangalore': [12, 33, 45, 75, 96, 103, 105, 134, 147, 148, ...],
          'delhi': [6, 14, 29, 32, 57, 58, 59, 61, 65, 93, ...],
          'hyderabad': [1, 3, 7, 11, 13, 15, 25, 31, 34, 36, ...],
          'mumbai': [0, 2, 4, 5, 8, 9, 10, 16, 17, 18, ...],
          'punjab': [24, 30, 88, 128, 145, 146, 162, 175, 181, 243, ...]}
In [20]: # Extract individual groups into respective variables
         punjab = population_of_spec_city[groups["punjab"]]
         bangalore = population_of_spec_city[groups["bangalore"]]
         delhi = population_of_spec_city[groups["delhi"]]
         hyderabad = population_of_spec_city[groups["hyderabad"]]
         mumbai = population_of_spec_city[groups["mumbai"]]

In [21]: # Now calculate the one-way ANOVA test for the obtained individual groups
         stats.f_oneway(punjab, bangalore, delhi, hyderabad, mumbai)

Out[21]: F_onewayResult(statistic=0.9110431706569894, pvalue=0.45674036540270235)

From the above result, the pvalue, 0.46, is greater than the alpha value (0.05). Hence we can say that there is no significant difference among the means of the different groups; they are approximately equal, and the Null Hypothesis is accepted.

3. Correlation coefficients
The correlation coefficient is a statistical measure that shows the degree to which changes in the value of one variable predict changes in the value of another. The letter r is used to represent the correlation coefficient, and r is a unit-free value between -1 and 1.

A correlation coefficient measures the relationship within the data; the strength of that relationship is obtained using the correlation coefficient formulas. The value lies between -1 and 1, where -1 represents a strong negative relationship, 1 represents a strong positive relationship and 0 (zero) represents no relationship.

Let's suppose there are two variables x and y for which the correlation coefficient needs to be found.

If the value of y goes up when the value of x goes up, x is directly proportional to y, and the correlation coefficient between x and y comes out positive (towards 1).

If the value of y goes down whenever the value of x goes up, or vice versa, x is inversely proportional to y, and the correlation coefficient between x and y comes out negative (towards -1).

If one variable changes while there is no change in the other variable (in our case x), then the correlation coefficient between these 2 variables comes out to be 0 (zero).

For example,

Positive correlation: if the quantity of milk increases, the price also increases. Negative correlation: if the price of a stock goes down, the buying of that stock increases. Zero correlation: there is no relationship between scores in video games and grades in an examination.

Before we dig into how the correlation among 2 or more variables is calculated, it is necessary to understand a term called covariance.

So, what is covariance? Covariance is a term used to describe the linear relationship between 2 variables. If the covariance is positive, the variables have a positive linear relationship, i.e., both variables change in the same direction. If the covariance is negative, the variables tend to move in opposite directions.
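As a quick illustration, covariance can be computed with numpy; the x and y arrays here are made up:

In [ ]: import numpy as np

        x = np.array([1, 2, 3, 4, 5])
        y = np.array([2, 4, 6, 8, 10])  # moves in the same direction as x

        # np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y)
        print(np.cov(x, y)[0, 1])       # positive, so x and y move together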

There are many types of correlation coefficients, but here we will discuss some of the important ones that are widely used.

1. Pearson's r
Pearson's r, also known as Pearson's product-moment correlation coefficient, is used to describe the strength of the linear relationship between 2 variables.
This correlation coefficient is used when the data follows a normal distribution, has no outliers, is not skewed and a linear relationship between the 2 variables is expected.
Pearson's r formula is as follows:

r = Σ((x - x̄)(y - ȳ)) / sqrt(Σ(x - x̄)² * Σ(y - ȳ)²)

The strength of the correlation is considered as
- weak positive correlation for 0 < r < 0.3, weak negative correlation for -0.3 < r < 0
- strong positive correlation for 0.5 < r < 1, strong negative correlation for -1 < r < -0.5
- no correlation for r = 0

2. Spearman's rho

Spearman's rho, also known as Spearman's rank correlation coefficient, is used as an alternative to Pearson's correlation coefficient. It is a rank correlation coefficient, as it uses the rankings of each variable (say, lowest to highest) to determine the strength of the relationship. Unlike Pearson's r, Spearman's rho can capture monotonic relationships, which may be non-linear. Spearman's rho formula is as follows:

ρ = 1 - (6 Σ d²) / (n (n² - 1))

where d is the difference between the ranks of each pair and n is the number of pairs.

3. Kendall's tau

Kendall's tau is used to calculate the correlation coefficient when there are 2 variables, which may be continuous variables with outliers or ordinals, but exhibit a monotonic relationship. Spearman's rho and Kendall's tau are very similar, but Kendall's tau is often preferred for better results.

These 3 correlation coefficients can be calculated in Machine Learning by using a function in pandas (a per-pair scipy sketch follows below):

DataFrame.corr(method='pearson', min_periods=1)

Parameters:
method : {'pearson', 'spearman', 'kendall'}
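For a single pair of variables, the same three coefficients can also be obtained from scipy; a minimal sketch, with made-up x and y arrays:

In [ ]: import numpy as np
        from scipy import stats

        x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
        y = np.array([2, 1, 4, 3, 7, 8, 6, 9])

        print(stats.pearsonr(x, y))    # (r, p-value)
        print(stats.spearmanr(x, y))   # (rho, p-value)
        print(stats.kendalltau(x, y))  # (tau, p-value)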

Calculating correlation coefficient using pandas dataframe


In [2]: import pandas as pd
import numpy as np

df = pd.read_csv('C:/Users/leena.ganta/Desktop/DataVedas/happyscore_
df.head()
Out[2]:
COUNTRY     ADJUSTED_SATISFACTION  AVG_SATISFACTION  STD_SATISFACTION  AVG_INCOME  MEDIAN_INCOME  INCOME_INEQUALITY  REGION                         HAPPYSCORE  GDP      COUNTRY.1
ARMENIA     37                     4.9               2.42              2096.76     1731.506667    31.445556          'Central and Eastern Europe'   4.350       0.76821  Armenia
ANGOLA      26                     4.3               3.19              1448.88     1044.240000    42.720000          'Sub-Saharan Africa'           4.033       0.75778  Angola
ARGENTINA   60                     7.1               1.91              7101.12     5109.400000    45.475556          'Latin America and Caribbean'  6.574       1.05351  Argentina
AUSTRIA     59                     7.2               2.11              19457.04    16879.620000   30.296250          'Western Europe'               7.200       1.33723  Austria
AUSTRALIA   65                     7.6               1.80              19917.00    15846.060000   35.285000          'Australia and New Zealand'    7.284       1.33358  Australia

In [55]: df.corr(method='pearson') # df.corr() gives the same result, as 'pearson' is the default method

Out[55]:
                       ADJUSTED_SATISFACTION  AVG_SATISFACTION  STD_SATISFACTION  AVG_INCOME  MEDIAN_INCOME  INCOME_INEQUALITY  HAPPYSCORE       GDP
ADJUSTED_SATISFACTION               1.000000          0.978067         -0.527553    0.728006       0.704383          -0.123835    0.901213  0.755578
AVG_SATISFACTION                    0.978067          1.000000         -0.341201    0.689043       0.661883          -0.082471    0.885988  0.776679
STD_SATISFACTION                   -0.527553         -0.341201          1.000000   -0.478206      -0.481429           0.221831   -0.457896 -0.242038
AVG_INCOME                          0.728006          0.689043         -0.478206    1.000000       0.995605          -0.382587    0.782122  0.814024
MEDIAN_INCOME                       0.704383          0.661883         -0.481429    0.995605       1.000000          -0.449053    0.760328  0.797905
INCOME_INEQUALITY                  -0.123835         -0.082471          0.221831   -0.382587      -0.449053           1.000000   -0.187222 -0.303204
HAPPYSCORE                          0.901213          0.885988         -0.457896    0.782122       0.760328          -0.187222    1.000000  0.790061
GDP                                 0.755578          0.776679         -0.242038    0.814024       0.797905          -0.303204    0.790061  1.000000

In [61]: df.corr(method='spearman') # correlation coefficient using the spearman method

Out[61]:
                       ADJUSTED_SATISFACTION  AVG_SATISFACTION  STD_SATISFACTION  AVG_INCOME  MEDIAN_INCOME  INCOME_INEQUALITY  HAPPYSCORE       GDP
ADJUSTED_SATISFACTION               1.000000          0.981629         -0.497192    0.803010       0.779671          -0.168049    0.900697  0.766098
AVG_SATISFACTION                    0.981629          1.000000         -0.354810    0.808310       0.782479          -0.137139    0.893395  0.773521
STD_SATISFACTION                   -0.497192         -0.354810          1.000000   -0.317653      -0.309697           0.182610   -0.421175 -0.275832
AVG_INCOME                          0.803010          0.808310         -0.317653    1.000000       0.990839          -0.356069    0.819542  0.960969
MEDIAN_INCOME                       0.779671          0.782479         -0.309697    0.990839       1.000000          -0.448926    0.806704  0.961583
INCOME_INEQUALITY                  -0.168049         -0.137139          0.182610   -0.356069      -0.448926           1.000000   -0.242107 -0.409767
HAPPYSCORE                          0.900697          0.893395         -0.421175    0.819542       0.806704          -0.242107    1.000000  0.793673
GDP                                 0.766098          0.773521         -0.275832    0.960969       0.961583          -0.409767    0.793673  1.000000

In [58]: df.corr(method='kendall') # correlation coefficient using the kendall method

Out[58]:
                       ADJUSTED_SATISFACTION  AVG_SATISFACTION  STD_SATISFACTION  AVG_INCOME  MEDIAN_INCOME  INCOME_INEQUALITY  HAPPYSCORE       GDP
ADJUSTED_SATISFACTION               1.000000          0.905145         -0.378239    0.614896       0.593379          -0.124810    0.741682  0.581131
AVG_SATISFACTION                    0.905145          1.000000         -0.266347    0.618270       0.593810          -0.104128    0.732966  0.591166
STD_SATISFACTION                   -0.378239         -0.266347          1.000000   -0.237797      -0.233515           0.124672   -0.320795 -0.205190
AVG_INCOME                          0.614896          0.618270         -0.237797    1.000000       0.929566          -0.229994    0.622277  0.841441
MEDIAN_INCOME                       0.593379          0.593810         -0.233515    0.929566       1.000000          -0.299779    0.614087  0.847011
INCOME_INEQUALITY                  -0.124810         -0.104128          0.124672   -0.229994      -0.299779           1.000000   -0.166762 -0.264067
HAPPYSCORE                          0.741682          0.732966         -0.320795    0.622277       0.614087          -0.166762    1.000000  0.601310
GDP                                 0.581131          0.591166         -0.205190    0.841441       0.847011          -0.264067    0.601310  1.000000

4. Chi-Square Test
The Chi-Square Test is a non-parametric test which tests the significance of the difference between observed frequencies and theoretical frequencies of a distribution, without any assumption about the distribution of the population. In simple words, the chi-square test is used to determine the difference between the expected data and the observed data.

The chi-square test is also used to determine whether a built regression model is a good fit or not by assessing train and test datasets. This test is used on categorical variables.

The two Chi-Square tests that are mostly used are:

Independence: as the name suggests, it tests the dependence of two variable sets.
Goodness of Fit: this chi-square test depicts whether the taken sample of data is a representative sample that fits the expected outcome from the taken population of data.

The formula for the chi-square test is as follows:

χ² = Σ (Oi - Ei)² / Ei

where Oi is the observed frequency and Ei is the expected frequency in each category.

In the Chi-Square Test we assume two hypotheses:

Null Hypothesis (H0): the 2 variables are independent.

Alternate Hypothesis (HA): the 2 variables are not independent.

We perform the chi-square test and arrive at one reasonable hypothesis as the solution. This is done by comparing the pvalue obtained from the test (via the chi-square table) with the alpha value, i.e., 0.05.

If the pvalue is less than alpha we accept the alternate hypothesis (HA). If the pvalue is greater than alpha then we accept the Null Hypothesis (H0).

The expected value is calculated as Ei = (Row Total * Column Total) / Total number of observations; the sketch below uses exactly this rule.
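For the independence test on a contingency table, scipy provides chi2_contingency, which computes these expected values automatically; a minimal sketch with made-up observed counts:

In [ ]: import numpy as np
        from scipy.stats import chi2_contingency

        # Hypothetical 2x3 table of observed counts
        observed = np.array([[16, 12, 16],
                             [24, 12, 32]])

        chi2, p, dof, expected = chi2_contingency(observed)
        print(chi2, p, dof)
        print(expected)  # Ei = (row total * column total) / grand total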


In Machine Learning, to perform the chi-squared test we use a method
named chisquare, which is imported from scipy.stats.

scipy.stats.chisquare(f_obs, f_exp=None, ddof=0, axis=0)

where,
f_obs - an array with the observed frequencies in each category
f_exp - an array, optional; the expected frequencies in each category. If no array is given, the categories are assumed to be equally likely.
ddof - int, optional; stands for delta degrees of freedom, an adjustment to the degrees of freedom used for obtaining the p-value. The p-value is determined using a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of observed frequencies. By default the ddof value is 0.
axis - int or None, optional; the axis along which to apply the test. If axis is None, all the f_obs values are treated as a single data set. The default value is 0.

Returns,
chisq - float or ndarray; the value is a float if axis is None or if f_obs and f_exp are 1-dimensional
p-value - float or ndarray; the value is a float if ddof and the return value chisq are scalars

Importing Library

In [37]: from scipy.stats import chisquare

With f_obs values

In [92]: f_obs=[16, 12, 16, 18, 14, 12]

         chisq = chisquare(f_obs).statistic
         pvalue = chisquare(f_obs).pvalue
         print("chisquare statistic :", chisq)
         print("p_value :", pvalue)

         chisquare statistic : 2.0
         p_value : 0.8491450360846096

In [ ]: # Since the obtained pvalue (0.85) > alpha (0.05), we accept the Null Hypothesis (H0)

With f_exp and f_obs values

In [100]: f_obs=[16, 12, 16, 18, 14, 12]
          f_exp=[16, 8, 16, 16, 16, 16]
          chisquare(f_obs, f_exp)

Out[100]: Power_divergenceResult(statistic=3.5, pvalue=0.6233876277495822)

With f_obs as 2d

In [101]: f_obs=[[16, 12, 16, 18, 14, 12],[24, 12, 32, 16, 32, 12]] # the test is applied to each column by default
          chisquare(f_obs)

Out[101]: Power_divergenceResult(statistic=array([1.6, 0., 5.33333333, 0.11764706, 7.04347826, 0.]),
          pvalue=array([0.20590321, 1., 0.02092134, 0.73160059, 0.00795544, 1.]))

In [ ]: # Some of the obtained pvalues are less than the alpha value, so for those columns the Null Hypothesis is rejected

With axis as None

In [102]: f_obs=[[16, 12, 16, 18, 14, 12],[24, 12, 32, 16, 32, 12]] # with axis=None, the test is applied to the flattened data
          chisquare(f_obs, axis=None)

Out[102]: Power_divergenceResult(statistic=33.33333333333333, pvalue=0.0004645423926184954)

With axis as 1

In [107]: f_obs=[16, 12, 16, 18, 14, 12]
          f_exp=[[16, 12, 16, 18, 14, 12],[16, 8, 16, 16, 16, 16]]
          chisquare(f_obs, f_exp, axis=1)

Out[107]: Power_divergenceResult(statistic=array([0. , 3.5]), pvalue=array([1. , 0.62338763]))

With ddof specified

In [103]: f_obs=[16, 12, 16, 18, 14, 12]
          chisquare(f_obs, ddof=1)

Out[103]: Power_divergenceResult(statistic=2.0, pvalue=0.7357588823428847)

In [104]: chisquare(f_obs, ddof=[0,1])

Out[104]: Power_divergenceResult(statistic=2.0, pvalue=array([0.84914504, 0.73575888]))

In [106]: chisquare(f_obs, ddof=[0,1,2])

Out[106]: Power_divergenceResult(statistic=2.0, pvalue=array([0.84914504, 0.73575888, 0.5724067]))

The above results with the different degrees of freedom all generate pvalues higher than the alpha value, so we accept the Null Hypothesis (H0).
