Day 1 Python Notebook

The document covers the fundamentals of Python data types, including lists, tuples, strings, sets, and dictionaries. It then loads banking customer data from a CSV file and performs exploratory data analysis on the dataset: viewing the head and tail of the data, checking data types and dimensions, and generating descriptive statistics. It outlines the key components of exploratory data analysis, such as descriptive statistics, data visualization, and measures of central tendency, dispersion, and relationships between variables, and finishes with basic hypothesis tests (two-sample and paired t-tests, one-way ANOVA, and the chi-square test of independence).


In [134]: import pandas as pd


import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [135]: # LISTS/ARRAYS ARE A FUNDAMENTAL DATA TYPE


# LISTS/ARRAYS ARE WRITTEN WITH SQUARE BRACKETS
# LISTS ARE MUTABLE - THEY CAN BE APPENDED TO AND MODIFIED IN PLACE
numlist=[2,3,5,666,223,89.5]
print(numlist)
charlist=['aaa',"AAA",'c','john']
print(charlist)
booleanlist=[True,False,False,True]
print(booleanlist)

[2, 3, 5, 666, 223, 89.5]


['aaa', 'AAA', 'c', 'john']
[True, False, False, True]
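
To illustrate the mutability mentioned in the comments above, here is a minimal sketch (the list name fruits is introduced purely for this example):

fruits=['apple','banana']     # throwaway list for the example
fruits.append('cherry')       # lists grow in place
fruits[0]='apricot'           # elements can be reassigned
print(fruits)                 # ['apricot', 'banana', 'cherry']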

In [136]: # TUPLES ARE IMMUTABLE OBJECTS


# TUPLES ARE WRITTEN WITH PARENTHESES (ROUND BRACKETS)
numtupple=(23,34,24,25)
print(numtupple)
chartupple=('aaa',"BBBB",'CCC')
print(chartupple)

(23, 34, 24, 25)


('aaa', 'BBBB', 'CCC')
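
As a quick check of the immutability noted above (a small sketch using the numtupple defined in the cell above):

print(numtupple[0])      # reading elements works: 23
# numtupple[0]=99        # uncommenting this line raises
#                        # TypeError: 'tuple' object does not support item assignment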

In [137]: # STRINGS are character based/text based


string="hello good morning"
print(string)

hello good morning

In [138]: # SETS ARE CREATED BY PLACING ITEMS INSIDE CURLY BRACKETS


# A SET CAN MIX DATA TYPES: int, float, string, tuple
# SETS ARE MUTABLE, BUT THEIR ELEMENTS MUST BE HASHABLE (IMMUTABLE)
myset={1,"Hello",(1,2,3)}
print(myset)

# NOTE: SETS ARE UNORDERED - THE PRINTED ORDER IS NOT GUARANTEED

{1, 'Hello', (1, 2, 3)}
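
A short sketch of set mutability, continuing with the myset defined above (the printed order may differ from run to run):

myset.add(99)            # elements can be added
myset.discard("Hello")   # ...and removed (discard is silent if the element is absent)
print(myset)             # e.g. {1, 99, (1, 2, 3)} - order not guaranteed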


In [139]: # DICTIONARIES - KEY/VALUE PAIRS (KEYS ARE TYPICALLY STRINGS OR INTEGERS)


# ALSO WRITTEN WITH CURLY BRACKETS
dict1={'Name':'John','Age':24,'Gender':'M'}
print(dict1)
dict1['Name']='Pete'
print(dict1)

{'Name': 'John', 'Age': 24, 'Gender': 'M'}


{'Name': 'Pete', 'Age': 24, 'Gender': 'M'}

In [140]: mydict2={}
mydict2['Key1']=[1,2]
mydict2['key2']=['G','F']
print(mydict2)

{'Key1': [1, 2], 'key2': ['G', 'F']}

In [141]: # Indexing is done using square brackets


# PYTHON INDEXING STARTS AT 0
print(numlist[0]) # Positive Indexing
print(numlist[-5]) # Negative Indexing
print(dict1.get('Name')) # DICTIONARIES - the 'get' method

print(dict1['Name']) # DICTIONARIES - square brackets

print(numtupple[3])

2
3
Pete
Pete
25
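
Slicing uses the same square-bracket syntax with start:stop (and an optional step); a minimal sketch on the lists defined earlier:

print(numlist[1:4])      # elements at indexes 1-3 -> [3, 5, 666]
print(numlist[-2:])      # last two elements -> [223, 89.5]
print(charlist[::2])     # every second element -> ['aaa', 'c']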

In [142]: # TO IMPORT DATA, GIVE THE FULL FILE PATH WITH THE


# FILE NAME & EXTENSION, IN QUOTES; ON WINDOWS, CHANGE
# BACKSLASHES TO FORWARD SLASHES (OR USE A RAW STRING)
bankchurn=pd.read_csv("/Users/rajeshprabhakar/Downloads/Churn_Modelling.csv")
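
On Windows, a raw string or forward slashes avoid the backslash-escaping problem; a sketch with a hypothetical path:

# bankchurn=pd.read_csv(r"C:\Users\me\Downloads\Churn_Modelling.csv")   # raw string
# bankchurn=pd.read_csv("C:/Users/me/Downloads/Churn_Modelling.csv")    # forward slashes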


In [143]: bankchurn.head() # FIRST 5 ROWS of DATA DEFAULT

Out[143]:
   RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  Tenure    Balance
0          1    15634602  Hargrave          619    France  Female   42       2       0.00
1          2    15647311      Hill          608     Spain  Female   41       1   83807.86
2          3    15619304      Onio          502    France  Female   42       8  159660.80
3          4    15701354      Boni          699    France  Female   39       1       0.00
4          5    15737888  Mitchell          850     Spain  Female   43       2  125510.82

In [144]: bankchurn.tail()

Out[144]:
      RowNumber  CustomerId    Surname  CreditScore Geography  Gender  Age  Tenure  Balance
9995       9996    15606229   Obijiaku          771    France    Male   39       5
9996       9997    15569892  Johnstone          516    France    Male   35      10   57369.
9997       9998    15584532        Liu          709    France  Female   36       7
9998       9999    15682355  Sabbatini          772   Germany    Male   42       3   75075.
9999      10000    15628319     Walker          792    France  Female   28       4  130142.

In [145]: bankchurn.columns

Out[145]: Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
                 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
                 'IsActiveMember', 'EstimatedSalary', 'Exited'],
                dtype='object')


In [146]: bankchurn.dtypes
# DATA TYPES
# NUMBER - int64 & float64
# CHARACTER - object
# DATE - datetime64
# BOOLEAN - bool (True/False)

Out[146]: RowNumber int64


CustomerId int64
Surname object
CreditScore int64
Geography object
Gender object
Age int64
Tenure int64
Balance float64
NumOfProducts int64
HasCrCard int64
IsActiveMember int64
EstimatedSalary float64
Exited int64
dtype: object

In [147]: bankchurn.shape # NUM of ROWS/OBSERVATIONS


# NUM of COLUMNS/VARIABLES

Out[147]: (10000, 14)

In [148]: bankchurn.describe().transpose()
# describe() - Basic DESCRIPTIVE STATISTICS

Out[148]:
                   count          mean           std          min          25%           50%
RowNumber        10000.0  5.000500e+03   2886.895680         1.00      2500.75  5.000500e+03
CustomerId       10000.0  1.569094e+07  71936.186123  15565701.00  15628528.25  1.569074e+07
CreditScore      10000.0  6.505288e+02     96.653299       350.00       584.00  6.520000e+02
Age              10000.0  3.892180e+01     10.487806        18.00        32.00  3.700000e+01
Tenure           10000.0  5.012800e+00      2.892174         0.00         3.00  5.000000e+00
Balance          10000.0  7.648589e+04  62397.405202         0.00         0.00  9.719854e+04
NumOfProducts    10000.0  1.530200e+00      0.581654         1.00         1.00  1.000000e+00
HasCrCard        10000.0  7.055000e-01      0.455840         0.00         0.00  1.000000e+00
IsActiveMember   10000.0  5.151000e-01      0.499797         0.00         0.00  1.000000e+00
EstimatedSalary  10000.0  1.000902e+05  57510.492818        11.58     51002.11  1.001939e+05
Exited           10000.0  2.037000e-01      0.402769         0.00         0.00  0.000000e+00


In [149]: # EXPLORATORY DATA ANALYSIS (EDA) - BASIC


# UNDERSTANDING OF THE DATA
# EDA = DESCRIPTIVE STATISTICS + DATA VISUALIZATION
# DESCRIPTIVE STATISTICS:
# COUNT, MINIMUM, MAXIMUM
# MEASURES OF CENTRAL TENDENCY - MEAN, MEDIAN, MODE
# MEASURES OF DISPERSION (SPREAD AROUND THE MEAN) -
# RANGE, VARIANCE, STANDARD DEVIATION, QUARTILES,
# PERCENTILES, DECILES
# MEASURES OF ASYMMETRY - SKEWNESS & KURTOSIS
# MEASURES OF RELATIONSHIP - COVARIANCE & CORRELATION
# (A SHORT SKETCH COMPUTING THESE MEASURES FOLLOWS BELOW)
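
A minimal sketch of the measures listed above, computed on the Balance column with pandas Series methods (the notebook computes most of these with NumPy further down):

bal=bankchurn.Balance
print(bal.count(),bal.min(),bal.max())              # count, minimum, maximum
print(bal.mean(),bal.median(),bal.mode()[0])        # central tendency
print(bal.max()-bal.min(),bal.var(),bal.std())      # dispersion: range, variance, std dev
print(bal.quantile([0.25,0.5,0.75]))                # quartiles
print(bal.skew(),bal.kurt())                        # asymmetry: skewness & kurtosis
print(bal.cov(bankchurn.EstimatedSalary),
      bal.corr(bankchurn.EstimatedSalary))          # relationship: covariance & correlation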

In [150]: # INDEXING OF ROWS AND COLUMNS - DATAFRAME


# CHARACTER INDEXING - BY COLUMN NAME
bankchurn.Balance.describe()

Out[150]: count 10000.000000


mean 76485.889288
std 62397.405202
min 0.000000
25% 0.000000
50% 97198.540000
75% 127644.240000
max 250898.090000
Name: Balance, dtype: float64

In [151]: bankchurn['Balance'].describe()

Out[151]: count 10000.000000


mean 76485.889288
std 62397.405202
min 0.000000
25% 0.000000
50% 97198.540000
75% 127644.240000
max 250898.090000
Name: Balance, dtype: float64


In [152]: bankchurn[['Balance','Tenure','Age']].describe()
# multi column indexing - double square brackets

Out[152]:
Balance Tenure Age

count 10000.000000 10000.000000 10000.000000

mean 76485.889288 5.012800 38.921800

std 62397.405202 2.892174 10.487806

min 0.000000 0.000000 18.000000

25% 0.000000 3.000000 32.000000

50% 97198.540000 5.000000 37.000000

75% 127644.240000 7.000000 44.000000

max 250898.090000 10.000000 92.000000

In [153]: # NUMERICAL INDEXING - BY COLUMN NUMBER


# PYTHON INDEXING STARTS AT 0
bankchurn.iloc[:,[6,7,8]].describe()
# THE ':' BEFORE THE COMMA SELECTS ALL ROWS; THE LIST
# AFTER THE COMMA SELECTS COLUMNS BY POSITION
# WITHOUT THE ':,', iloc INDEXES ROWS BY DEFAULT

Out[153]:
Age Tenure Balance

count 10000.000000 10000.000000 10000.000000

mean 38.921800 5.012800 76485.889288

std 10.487806 2.892174 62397.405202

min 18.000000 0.000000 0.000000

25% 32.000000 3.000000 0.000000

50% 37.000000 5.000000 97198.540000

75% 44.000000 7.000000 127644.240000

max 92.000000 10.000000 250898.090000

In [154]: bankchurn.iloc[100:150,4:8] # ROWS 100-149 (FIRST INDEX)


# COLUMNS 4-7 (SECOND INDEX) - SUBSETTING THE DATA FRAME

Out[154]:
Geography Gender Age Tenure

100 France Female 40 6

101 France Female 44 6

102 France Male 31 9

103 Spain Male 36 7


104 Spain Female 65 1

105 Spain Female 46 4

106 Germany Male 32 1

107 Germany Female 36 2

108 Spain Male 33 5

109 Germany Male 35 9

110 Germany Male 30 3

111 Germany Male 39 7

112 France Male 42 2

113 Spain Male 36 9

114 Germany Male 28 9

115 Germany Female 30 9

116 Germany Female 37 6

117 France Female 41 1

118 Germany Female 31 8

119 Germany Male 34 4

120 France Male 34 8

121 Spain Male 39 6

122 France Female 39 6

123 Germany Female 48 10

124 France Female 28 3

125 France Male 42 9

126 France Female 52 1

127 Germany Male 56 0

128 France Female 41 7

129 France Male 24 9

130 France Female 34 3

131 Germany Female 33 9

132 France Male 38 9

133 France Male 25 1

134 Germany Male 39 7

135 Germany Female 50 5

136 Germany Female 34 5

137 France Male 40 2


138 Spain Female 48 2

139 Spain Female 35 1

140 Germany Male 44 10

141 France Male 34 7

142 France Female 43 5

143 Spain Male 52 2

144 France Female 31 5

145 Spain Female 21 5

146 Spain Female 29 8

147 France Male 37 5

148 France Male 44 9

149 France Male 32 0
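
For contrast with the positional (iloc) selections above, the same subsets can be taken by label with .loc (a minimal sketch; note that .loc slices include both endpoints):

bankchurn.loc[:,['Age','Tenure','Balance']].describe()
bankchurn.loc[100:149,'Geography':'Tenure']    # same rows/columns as iloc[100:150,4:8]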

In [155]: # INDIVIDUAL STATISTICAL FUNCTIONS - NUMPY


print(np.min(bankchurn.Balance))
print(np.max(bankchurn.Balance))
print(np.mean(bankchurn.Balance))
print(np.median(bankchurn.Balance))
print(np.var(bankchurn.Balance))
print(np.std(bankchurn.Balance))
print(np.quantile(bankchurn.Balance,0.25))
print(np.quantile(bankchurn.Balance,0.50))
print(np.quantile(bankchurn.Balance,0.75))
print(np.quantile(bankchurn.Balance,0.80)) # 80%
print(bankchurn.Balance.skew()) # Pandas
print(bankchurn.Balance.kurt()) # Pandas

0.0
250898.09
76485.88928799961
97198.54000000001
3893046832.3731775
62394.285254125454
0.0
97198.54000000001
127644.24
133710.358
-0.14110871094154384
-1.489411767941925

In [156]: print(np.quantile(bankchurn.Balance,0.50))# Q2
print(np.percentile(bankchurn.Balance,50))# Q2

97198.54000000001
97198.54000000001


In [157]: bankchurn.dtypes

Out[157]: RowNumber int64


CustomerId int64
Surname object
CreditScore int64
Geography object
Gender object
Age int64
Tenure int64
Balance float64
NumOfProducts int64
HasCrCard int64
IsActiveMember int64
EstimatedSalary float64
Exited int64
dtype: object

In [158]: # NON-NUMERIC DATA HAS dtype object


# SOME NUMERIC COLUMNS MAY BE CATEGORICAL IN
# NATURE - 0 & 1 (0 = No, 1 = Yes)

In [159]: pd.value_counts(bankchurn.HasCrCard)

Out[159]: 1 7055
0 2945
Name: HasCrCard, dtype: int64

In [160]: pd.value_counts(bankchurn.Exited)
# 1 - Customer Exited 0 - Customer Not Exited

Out[160]: 0 7963
1 2037
Name: Exited, dtype: int64

In [161]: pd.value_counts(bankchurn.IsActiveMember)

Out[161]: 1 5151
0 4849
Name: IsActiveMember, dtype: int64

In [162]: bankchurn.columns

Out[162]: Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
                 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
                 'IsActiveMember', 'EstimatedSalary', 'Exited'],
                dtype='object')


In [163]: categorycols=['HasCrCard','IsActiveMember',
'Exited','Geography','Gender']

In [164]: for col in categorycols:


              freqcounts=pd.value_counts(bankchurn[col])
              print(freqcounts)

1 7055
0 2945
Name: HasCrCard, dtype: int64
1 5151
0 4849
Name: IsActiveMember, dtype: int64
0 7963
1 2037
Name: Exited, dtype: int64
France 5014
Germany 2509
Spain 2477
Name: Geography, dtype: int64
Male 5457
Female 4543
Name: Gender, dtype: int64

In [165]: # DATA VISUALIZATION


pd.value_counts(bankchurn.Gender).plot(
kind='pie')

Out[165]: <matplotlib.axes._subplots.AxesSubplot at 0x1a2adf6610>


In [166]: # ADVANCED PLOTS - HISTOGRAM, BOXPLOT & DENSITY


# CURVE - UNIVARIATE PLOTS
# THESE 3 PLOTS HELP IDENTIFY SKEWNESS, KURTOSIS, OUTLIERS,
# MISSING VALUES & HOW CLOSE THE DATA IS TO A NORMAL DISTRIBUTION
# HISTOGRAM - BASED ON A FREQUENCY DISTRIBUTION TABLE
# BOXPLOT - BASED ON QUARTILES
# DENSITY CURVE - BASED ON Z SCORES (STANDARD SCORES)

# BOXPLOT OUTLIER DETECTION - IQR = Q3 - Q1


# LOWER FENCE: Q1 - 1.5*IQR
# UPPER FENCE: Q3 + 1.5*IQR
# (A SHORT SKETCH OF THIS RULE FOLLOWS BELOW)
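
A minimal sketch of the IQR rule applied to the Balance column (q1, q3, iqr, lower, upper are local names introduced here):

q1=bankchurn.Balance.quantile(0.25)
q3=bankchurn.Balance.quantile(0.75)
iqr=q3-q1
lower=q1-1.5*iqr
upper=q3+1.5*iqr
outliers=bankchurn[(bankchurn.Balance<lower)|(bankchurn.Balance>upper)]
print(lower,upper,outliers.shape[0])   # fences and the number of rows flagged as outliers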

In [167]: bankchurn.Balance.plot(kind='hist',color="red")

Out[167]: <matplotlib.axes._subplots.AxesSubplot at 0x1a2d146110>

In [168]: bankchurn.EstimatedSalary.plot(kind='box',vert=False)

Out[168]: <matplotlib.axes._subplots.AxesSubplot at 0x1a2ecde950>


In [169]: bankchurn.Balance.plot(kind='density')

Out[169]: <matplotlib.axes._subplots.AxesSubplot at 0x1a3028f4d0>

In [170]: # BIVARIATE STATISTICS


# COVARIANCE & CORRELATION
bankchurn[['Balance','EstimatedSalary']].cov()
# Covariance

Out[170]:
Balance EstimatedSalary

Balance 3.893436e+09 4.592389e+07

EstimatedSalary 4.592389e+07 3.307457e+09

In [171]: bankchurn[['Balance','EstimatedSalary']].corr()
# Correlation

Out[171]:
Balance EstimatedSalary

Balance 1.000000 0.012797

EstimatedSalary 0.012797 1.000000


In [172]: bankchurn.plot(kind='scatter',x='Balance',
y='EstimatedSalary')

Out[172]: <matplotlib.axes._subplots.AxesSubplot at 0x1a2d9c6990>

In [173]: # GROUPING FUNCTION - "groupby" - A SLICING FUNCTION


# THE COLUMN BEING AGGREGATED (LEFT OF groupby) SHOULD BE NUMERICAL
# THE GROUPING COLUMN (INSIDE groupby) SHOULD BE CATEGORICAL
# A STATISTICAL FUNCTION MUST BE SPECIFIED
# 2 TYPES OF CUSTOMER - EXITED & NOT EXITED
# IS THE AVERAGE BALANCE MAINTAINED BY THESE 2 TYPES
# OF CUSTOMER THE SAME?

print(pd.value_counts(bankchurn.Exited))
bankchurn.Balance.groupby(bankchurn.Exited).mean()

0 7963
1 2037
Name: Exited, dtype: int64

Out[173]: Exited
0 72745.296779
1 91108.539337
Name: Balance, dtype: float64
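
The same aggregation can also be written with the DataFrame-level groupby syntax (a sketch; both forms give identical results):

bankchurn.groupby('Exited')['Balance'].mean()
bankchurn.groupby('Exited')['Balance'].agg(['mean','median','count'])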


In [174]: pd.crosstab(bankchurn.Exited,bankchurn.Gender,
normalize='index')
# CROSS TABULATION - FREQUENCY TABLE OF 2 CATEGORICAL
# VARIABLES

Out[174]:
Gender Female Male

Exited

0 0.427477 0.572523

1 0.559156 0.440844

In [175]: bankchurn.Balance.groupby([bankchurn.Exited,
bankchurn.Gender,
bankchurn.Geography]).mean()

Out[175]: Exited  Gender  Geography
          0       Female  France        58424.310061
                          Germany      118828.514362
                          Spain         56594.820688
                  Male    France        61774.503758
                          Germany      119896.529105
                          Spain         61871.626285
          1       Female  France        67755.162630
                          Germany      119673.872321
                          Spain         71997.673680
                  Male    France        75710.827800
                          Germany      121202.242432
                          Spain         73167.867802
          Name: Balance, dtype: float64

In [176]: bankchurn.Balance.groupby(
bankchurn.Exited).mean()

Out[176]: Exited
0 72745.296779
1 91108.539337
Name: Balance, dtype: float64


In [177]: # HYPOTHESIS TESTING - COMPARISON OF MEANS


# COMPARING 2 GROUPS - 2 SAMPLE INDEPENDENT T TEST
# COMPARING MORE THAN 2 GROUPS - ANOVA SINGLE
# FACTOR (ONE-WAY ANOVA)
# ASSUMPTIONS:
# THE GROUP MEANS SHOULD APPEAR DIFFERENT (CHECK WITH groupby FIRST)
# THE NUMERICAL VARIABLE MUST BE CONTINUOUS, CLOSE
# TO A NORMAL DISTRIBUTION, WITH NO OUTLIERS OR MISSING
# VALUES
# THE OTHER VARIABLE MUST BE CATEGORICAL WITH
# EXACTLY 2 GROUPS (T TEST) OR MORE THAN 2 GROUPS (ANOVA)

# NULL - THERE IS NO SIGNIFICANT DIFFERENCE IN THE


# AVERAGES OF GROUP 1 & GROUP 2
# ALT - THERE IS A SIGNIFICANT DIFFERENCE IN THE
# AVERAGES OF GROUP 1 & GROUP 2

# IF THE P-VALUE IS LESS THAN 0.05, REJECT NULL & ACCEPT


# ALTERNATE
# IF THE P-VALUE IS GREATER THAN 0.05, FAIL TO REJECT NULL
# (ACCEPT NULL) & REJECT ALTERNATE

# PROCESS IN PYTHON
# GROUPBY - TO IDENTIFY THE NUMBER OF GROUPS & CONFIRM
# THE MEAN DIFFERENCE
# SPLIT THE DATAFRAME INTO SUBSETS BASED ON THE
# GROUPS
# CONDUCT THE RELEVANT TEST & INFER BASED ON THE
# P-VALUE

In [178]: # SPLIT DATAFRAME INTO EXITED AND NONEXITED


exited=bankchurn[bankchurn.Exited==1]
nonexited=bankchurn[bankchurn.Exited==0]
print(exited.shape)
print(nonexited.shape)

(2037, 14)
(7963, 14)

In [179]: from scipy.stats import ttest_ind


# 2 Sample Independent T test
ttest_ind(exited.Balance,nonexited.Balance,
equal_var=False)
# SINCE p-value less than 0.05, REJECT NULL

Out[179]: Ttest_indResult(statistic=12.47128032005069, pvalue=6.318663518527793e-35)


In [180]: # IS THE AVERAGE BALANCE MAINTAINED BY MALE &


# FEMALE CUSTOMERS SAME or EQUAL?
# groupby
# split data into male & female
# Frame NULL & ALT
# Conduct Test & Infer

In [181]: bankchurn.Balance.groupby(
bankchurn.Gender).mean()

Out[181]: Gender
Female 75659.369139
Male 77173.974506
Name: Balance, dtype: float64

In [182]: # NULL - THERE IS NO SIGNIFICANT DIFFERENCE IN THE


# AVERAGE BALANCE OF MALE & FEMALE CUSTOMERS
# ALT - THERE IS A SIGNIFICANT DIFFERENCE IN THE
# AVERAGE BALANCE OF MALE & FEMALE CUSTOMERS
male=bankchurn[bankchurn.Gender=='Male']
female=bankchurn[bankchurn.Gender=='Female']

In [183]: ttest_ind(male.Balance,female.Balance,
equal_var=False)
# SINCE p-value greater than 0.05, Fail to Reject
# Null

Out[183]: Ttest_indResult(statistic=1.2095754533033731, pvalue=0.22647131768566747)

In [185]: # IS THE AVERAGE BALANCE MAINTAINED IN THE DIFFERENT


# GEOGRAPHIES THE SAME?
bankchurn.Balance.groupby(
bankchurn.Geography).mean()

Out[185]: Geography
France 62092.636516
Germany 119730.116134
Spain 61818.147763
Name: Balance, dtype: float64


In [188]: # SINCE THERE ARE MORE THAN 2 LEVELS - ONE-WAY ANOVA


# (ANOVA SINGLE FACTOR)
france=bankchurn[bankchurn.Geography=='France']
germany=bankchurn[bankchurn.Geography=='Germany']
spain=bankchurn[bankchurn.Geography=='Spain']

# NULL - THERE IS NO SIGNIFICANT DIFFERENCE IN THE


# AVERAGE BALANCE OF FRANCE, GERMANY & SPAIN
# ALT - THERE IS A SIGNIFICANT DIFFERENCE IN THE
# AVERAGE BALANCE OF FRANCE, GERMANY & SPAIN

In [189]: from scipy.stats import f_oneway


f_oneway(france.Balance,germany.Balance,
spain.Balance)
# SINCE p-value less than 0.05, REJECT NULL

Out[189]: F_onewayResult(statistic=958.4254463368385, pvalue=0.0)

In [191]: # DISTRIBUTIONS
# T TEST - STUDENT'S T DISTRIBUTION
# ANOVA SINGLE FACTOR - F DISTRIBUTION
# CHI SQUARE TEST - CHI-SQUARE DISTRIBUTION

# BERNOULLI / BINOMIAL DISTRIBUTION - BINARY


# YES/NO or TRUE/FALSE - USED IN LOGISTIC REGRESSION
# POISSON DISTRIBUTION - e.g. INSURANCE CLAIM DATA
# EXPONENTIAL DISTRIBUTION - e.g. TRAFFIC DATA
# NORMAL DISTRIBUTION or GAUSSIAN DISTRIBUTION

In [193]: # ONE NUMERIC VARIABLE, OTHER CATEGORICAL VARIABLE


# WITH EXACTLY 2 LEVELS - T TEST
# ONE NUMERIC VARIABLE, OTHER CATEGORICAL VARIABLE
# WITH MORE THAN 2 LEVELS - ANOVA SINGLE FACTOR

# BOTH VARIABLES NON-NUMERICAL (CATEGORICAL) -


# CHI SQUARE TEST OF INDEPENDENCE
# THE INPUT FOR THE CHI SQUARE TEST IS A CROSS TABULATION
# NULL - THERE IS NO ASSOCIATION/RELATIONSHIP
# BETWEEN THE TWO VARIABLES
# ALT - THERE IS AN ASSOCIATION/RELATIONSHIP
# BETWEEN THE TWO VARIABLES


In [194]: # IS THERE ASSOCIATION BETWEEN GENDER & EXITED?


pd.crosstab(bankchurn.Gender,bankchurn.Exited)

Out[194]:
Exited 0 1

Gender

Female 3404 1139

Male 4559 898

In [197]: from scipy.stats import chi2_contingency


chi2_contingency(pd.crosstab(bankchurn.Gender,
bankchurn.Exited))
# SINCE the p-value (2.248e-26) is less than 0.05, REJECT
# NULL

Out[197]: (112.91857062096116, 2.2482100097131755e-26, 1, array([[3617.5909,  925.4091],
          [4345.4091, 1111.5909]]))
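
chi2_contingency returns a 4-tuple: (test statistic, p-value, degrees of freedom, expected frequencies). Unpacking it makes the p-value easier to read (a small sketch reusing the import above):

chi2,pvalue,dof,expected=chi2_contingency(
    pd.crosstab(bankchurn.Gender,bankchurn.Exited))
print(pvalue)   # ~2.25e-26, well below 0.05 -> reject the null of no association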

In [198]: # IS THERE ASSOCIATION BETWEEN GENDER & HASCRCARD


chi2_contingency(pd.crosstab(bankchurn.Gender,
bankchurn.HasCrCard))
# SINCE p-value greater than 0.05, Fail to REJECT NULL

Out[198]: (0.30756077917984026, 0.5791808600631774, 1, array([[1337.9135, 3205.0865],
          [1607.0865, 3849.9135]]))

In [200]: # PAIRED SAMPLE T TEST - BEFORE & AFTER


# INDIA'S INFLATION BEFORE & AFTER DEMONETIZATION
befordemon=[6.59,6.13,6.46,5.30,4.14,3.35]
afterdemon=[2.23,1.86,2.62,2.61,2.21,1.09]
print(np.mean(befordemon))
print(np.mean(afterdemon))

# NULL - THERE IS NO SIGNIFICANT DIFFERENCE IN AVERAGE


# INFLATION BEFORE AND AFTER DEMONETIZATION
# ALT - THERE IS A SIGNIFICANT DIFFERENCE IN AVERAGE
# INFLATION BEFORE AND AFTER DEMONETIZATION

5.328333333333334
2.1033333333333335

In [202]: from scipy.stats import ttest_rel


ttest_rel(befordemon,afterdemon)
# SINCE p-value less than 0.05, REJECT NULL

Out[202]: Ttest_relResult(statistic=7.429383454110612, pvalue=0.0006961901453565623)


In [203]: # BEFORE & AFTER GST MONTHLY INFLATION


beforegst=[1.86,2.62,2.61,2.21,1.09,1.08]
aftergst=[2.52,2.89,3.24,3.97,4.00,5.11]
print(np.mean(beforegst))
print(np.mean(aftergst))

1.9116666666666668
3.6216666666666666

In [204]: ttest_rel(beforegst,aftergst)
# SINCE p-value less than 0.05, REJECT NULL - AVERAGE
# MONTHLY INFLATION BEFORE & AFTER GST DIFFERS SIGNIFICANTLY

Out[204]: Ttest_relResult(statistic=-2.8027263826421933, pvalue=0.03787161462210822)

In [ ]:
