Trainity Data Analytics Trainee Task 6
Email : advaitchavan135@gmail.com
Task 6: Bank Loan Case Study (Final Project - 2)
Tech Stack Used: Microsoft Excel
The analysis is done in two parts, one for each dataset:
1. Application data
2. Previous application data
The cleaned and analyzed data have been uploaded to Google Drive as Excel
sheets. Because the datasets are large, the sheets cannot be previewed in
Google Sheets online; they need to be downloaded and opened offline in
Microsoft Excel 2019.
Application Dataset – NULL values
First, the percentage of null values in each column is analyzed. Columns in which
50% or more of the values are null are dropped.
In columns with a lower share of null values, the nulls are replaced with the
mean, the median, or the most frequent category, depending on the column type.
All the column names highlighted in blue need to be dropped, as their rounded
percentage of null values is greater than or equal to 50%.
Column name   Total number of null values   Percentage of null values   Rounded percentage
OWN_CAR_AGE 202930 65.99113528 66
EXT_SOURCE_1 173379 56.38139774 56
APARTMENTS_AVG 156061 50.74972928 51
BASEMENTAREA_AVG 179943 58.51595553 59
YEARS_BUILD_AVG 204488 66.49778382 66
COMMON_AREA_AVG 214865 69.87229725 70
ELEVATORS_AVG 163891 53.29597966 53
ENTRANCES_AVG 154828 50.34876801 50
FLOORSMAX_AVG 153021 49.76114676 50
FLOORSMIN_AVG 208642 67.84862981 68
LANDAREA_AVG 182590 59.37673774 59
LIVINGAPARTMENTS_AVG 210199 68.35495316 68
LIVINGAREA_AVG 154350 50.19332642 50
NONLIVINGAPARTMENTS_AVG 213514 69.43296337 69
NONLIVINGAREA_AVG 169682 55.17916432 55
APARTMENTS_MODE 156061 50.74972928 51
BASEMENTAREA_MODE 179943 58.51595553 59
YEARS_BUILD_MODE 204488 66.49778382 66
COMMON_AREA_MODE 214865 69.87229725 70
ELEVATORS_MODE 163891 53.29597966 53
ENTRANCES_MODE 154828 50.34876801 50
FLOORSMAX_MODE 153020 49.76082156 50
FLOORSMIN_MODE 208642 67.84862981 68
LANDAREA_MODE 182590 59.37673774 59
LIVINGAPARTMENTS_MODE 210199 68.35495316 68
LIVINGAREA_MODE 154350 50.19332642 50
NONLIVINGAPARTMENTS_MODE 213514 69.43296337 69
NONLIVINGAREA_MODE 169682 55.17916432 55
APARTMENTS_MEDIAN 156061 50.74972928 51
BASEMENTAREA_MEDIAN 179943 58.51595553 59
YEARS_BUILD_MEDIAN 204488 66.49778382 66
COMMON_AREA_MEDIAN 214865 69.87229725 70
ELEVATORS_MEDIAN 163891 53.29597966 53
ENTRANCES_MEDIAN 154828 50.34876801 50
FLOORSMAX_MEDIAN 153020 49.76082156 50
FLOORSMIN_MEDIAN 208642 67.84862981 68
LANDAREA_MEDIAN 182590 59.37673774 59
LIVINGAPARTMENTS_MEDIAN 210199 68.35495316 68
LIVINGAREA_MEDIAN 154350 50.19332642 50
NONLIVINGAPARTMENTS_MEDIAN 213514 69.43296337 69
NONLIVINGAREA_MEDIAN 169682 55.17916432 55
FONDKAPREMONT_MODE 210295 68.38617155 68
HOUSETYPE_MODE 154297 50.17609126 50
WALLSMATERIAL_MODE 156341 50.84078293 51
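The percentage and the drop decision can be reproduced outside Excel. The following is a minimal Python sketch, not the workbook's actual formulas: it assumes the application dataset has 307511 rows (the Grand Total reported in the pivot tables of this report) and that a column is dropped when its rounded null percentage reaches 50. The function names are illustrative.

```python
# Null counts copied from this report's tables; TOTAL_ROWS is the Grand Total
# that the report's pivot tables give for the application dataset.
TOTAL_ROWS = 307511
null_counts = {
    "OWN_CAR_AGE": 202930,
    "EXT_SOURCE_1": 173379,
    "FLOORSMAX_AVG": 153021,
    "EXT_SOURCE_3": 60965,
}

def null_percentage(nulls, total=TOTAL_ROWS):
    """Return the share of null values in a column as a percentage."""
    return nulls / total * 100

def should_drop(nulls, total=TOTAL_ROWS, threshold=50):
    """Flag a column when its rounded null percentage reaches the threshold."""
    return round(null_percentage(nulls, total)) >= threshold

for name, nulls in null_counts.items():
    pct = null_percentage(nulls)
    print(f"{name}: {pct:.2f}% -> drop={should_drop(nulls)}")
```

Note that rounding matters at the boundary: FLOORSMAX_AVG is 49.76% null before rounding, so it is dropped only because the rounded value is compared with 50.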
Application Dataset – NULL values
All the column names highlighted in green need to be dropped, as they are
irrelevant to our analysis.
Column name   Total number of null values   Percentage of null values   Rounded percentage
FLAG_MOBIL 1 0.000325192 0
FLAG_EMPLOY_PHONE 55387 18.01138821 18
FLAG_WORK_PHONE 0 0 0
FLAG_CONT_MOBILE 0 0 0
FLAG_PHONE 0 0 0
FLAG_EMAIL 0 0 0
CNT_FAMILY_MEMBERS 2 0.000650383 0
REGION_RATING_CLENT 0 0 0
REGION_RATING_CLENT_W_CITY 0 0 0
EXT_SOURCE_3 60965 19.82530706 20
YEAR_BEGINEXPLUATATION_AVG 150008 48.78134441 49
YEAR_BEGINEXPLUATATION_MODE 150007 48.78101922 49
YEAR_BEGINEXPLUATATION_MEDIAN 150007 48.78101922 49
TOTAL_AREA_MODE 148431 48.26851722 48
EMERGENCYSTATE_MODE 145755 47.39830445 47
DAYS_LAST_PHONE_CHANGE 1 0.000325192 0
FLAG DOC 2 0 0 0
FLAG DOC 3 0 0 0
FLAG DOC 4 0 0 0
FLAG DOC 5 0 0 0
FLAG DOC 6 0 0 0
FLAG DOC 7 0 0 0
FLAG DOC 8 0 0 0
FLAG DOC 9 0 0 0
FLAG DOC 10 0 0 0
FLAG DOC 11 0 0 0
FLAG DOC 12 0 0 0
FLAG DOC 13 0 0 0
FLAG DOC 14 0 0 0
FLAG DOC 15 0 0 0
FLAG DOC 16 0 0 0
FLAG DOC 17 0 0 0
FLAG DOC 18 0 0 0
FLAG DOC 19 0 0 0
FLAG DOC 20 0 0 0
FLAG DOC 21 0 0 0
Application Dataset – NULL values
Median of AMT_ANNUITY: 24903
Median of AMT_GOODS_PRICE: 450000
Highest occurring categorical variable is ‘Unaccompanied’
[Bar chart: Total count per category; data labels 40149, 3267, 271, 866, 1770, 11370]
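Filling the missing numeric values with the median, as done above for AMT_ANNUITY and AMT_GOODS_PRICE, can be sketched in Python as follows. The sample values are hypothetical and the helper name is illustrative; the report itself performs this step in Excel.

```python
from statistics import median

def impute_with_median(values):
    """Replace None entries with the median of the non-missing values."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]

# Hypothetical sample; the real column has far more rows.
amt_annuity = [24700.5, None, 35698.5, 6750.0, None, 29686.5]
print(impute_with_median(amt_annuity))
```

The median is less sensitive to extreme values than the mean, which suits amount columns with a long right tail.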
Application Dataset – NULL values
Row Labels   Count of ORGANIZATION_TYPE
Advertising 429
Agriculture 2454
Bank 2507
Hotel 966
Housing 2958
Industry: type 1 1039
Telecom 577
Trade: type 1 348
Trade: type 2 1900
[Bar chart: Count of ORGANIZATION_TYPE for each organization type]
Google Drive link to the Excel sheet with the null-value analysis:
application_data.xlsx - Google Drive
Application Dataset – Outliers
Quartiles at AMT_INCOME_TOTAL
MIN 25650
25% 112500
50% 147150
75% 202500
MAX 117000000
Here we can observe a huge difference between the quartiles (25%, 50% and 75%)
and the maximum value, which is due to the presence of outliers.
But since total income varies from person to person, we will not remove the
outliers.
[Box plot: AMT_INCOME_TOTAL, with outliers at the extreme points, i.e. the max of 1.170x10^8]
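A quartile summary like the one above can be computed with Python's statistics module. This is a minimal sketch under two assumptions that the report does not state: the "inclusive" method is used (it corresponds to Excel's QUARTILE.INC), and outliers are flagged with the common 1.5 x IQR rule. The sample values are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return min, quartiles and max, using inclusive quartiles."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "25%": q1, "50%": q2, "75%": q3, "max": data[-1]}

def iqr_outliers(values, k=1.5):
    """List values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    s = five_number_summary(values)
    iqr = s["75%"] - s["25%"]
    low, high = s["25%"] - k * iqr, s["75%"] + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical incomes with one extreme value.
incomes = [112500, 147150, 202500, 135000, 117000000]
print(five_number_summary(incomes))
print(iqr_outliers(incomes))
```

Flagging an outlier does not by itself mean removing it; the report keeps the income outliers because income genuinely varies from person to person.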
Application Dataset – Outliers
AMT_CREDIT
Quartiles at AMT_CREDIT
MIN 45000
25% 270000
50% 513531
75% 808650
MAX 4050000
From the chart it is clear that the outliers lie around the 98% mark and near the
max side of the box plot.
Also, there is a significant difference between the 75% quartile and the max
value, which is due to the presence of outliers.
But since the amount of credit varies from person to person, we will not remove
the outliers.
Application Dataset – Outliers
DAYS_BIRTH
Quartiles at DAYS_BIRTH
MAX 25,229.00
75% 19,682.00
50% 15,750.00
25% 12,413.00
MIN 7,489.00
As seen from the box plot, there are no outliers.
The DAYS_BIRTH data is well distributed.
Application Dataset – Outliers
DAYS_EMPLOYED
Quartiles at DAYS_EMPLOYED
MAX 17912.00
75% 2760.00
50% 1213.00
25% 289.00
MIN 365243.00
There exists only one outlier value, i.e. + or - 365243.
Replace it with the median, 1213.00.
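Replacing the single outlier value with the median can be sketched as below. The sketch assumes that the sign of DAYS_EMPLOYED is discarded first (the report writes the outlier as "+ or - 365243") and that the median is taken over the remaining valid values; the sample list is hypothetical.

```python
from statistics import median

SENTINEL = 365243  # outlier value reported for DAYS_EMPLOYED

def replace_sentinel(days, sentinel=SENTINEL):
    """Replace the sentinel with the median of the valid absolute values."""
    cleaned = [abs(d) for d in days]  # assumption: sign is discarded
    valid = [d for d in cleaned if d != sentinel]
    fill = median(valid)
    return [fill if d == sentinel else d for d in cleaned]

# Hypothetical sample containing the sentinel once.
print(replace_sentinel([-289, -1213, 365243, -2760]))
```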
Application Dataset – Outliers
Google Drive link to the Excel sheet with the outlier analysis and the cleaned data:
TARGET VARIABLE
Row Labels   Count of TARGET
0 282686
1 24825
Grand Total 307511
[Pie chart: Target Variable; 0: 92%, 1: 8%]
The Target Variable pie chart shows that almost 92% of the total clients had
no problem during payment, while 8% of the clients had some problem or other.
0 → No payment issues
1 → Had some payment issues
Application Dataset – Analysis
GENDER VARIABLE
Row Labels   Count of CODE_GENDER
F 202448
M 105059
XNA 4
Grand Total 307511
[Pie chart: CODE_GENDER; F: 66%, M: 34%, XNA: 0%]
NAME_HOUSING_TYPE
Row Labels   Percentage of NAME_HOUSING_TYPE
Co-op apartment 0.36%
House / apartment 88.73%
Municipal apartment 3.64%
Office apartment 0.85%
Rented apartment 1.59%
With parents 4.83%
Grand Total 100.00%
[Bar charts: count and percentage of NAME_HOUSING_TYPE; data label 272868]
From the bar graphs of count and percentage
The bank can target those groups who do not have
their
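The percentage column for NAME_HOUSING_TYPE follows directly from the category counts. A short Python sketch, using the counts that this report lists in its NAME_HOUSING_TYPE pivot against TARGET (the Grand Total column), could look like this; the variable names are illustrative.

```python
# Grand Total counts per housing type, as listed in the report's pivot table.
housing_counts = {
    "Co-op apartment": 1122,
    "House / apartment": 272868,
    "Municipal apartment": 11183,
    "Office apartment": 2617,
    "Rented apartment": 4881,
    "With parents": 14840,
}

total = sum(housing_counts.values())
# Share of each housing type as a percentage, rounded to two decimals.
shares = {name: round(count / total * 100, 2)
          for name, count in housing_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```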
Univariate Analysis
AGE GROUP
Row Labels   Count of YEARS_BIRTH_RANGE
20-30 48869
31-40 82770
41-50 75509
51-60 67955
61-70 32408
Grand Total 307511
[Bar chart: Total count of each age group of the bank's applicants]
From the adjacent bar plot we can infer that most of the applicants belong to the
age group ‘31-40’.
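The report does not show how YEARS_BIRTH_RANGE was derived. One plausible derivation, sketched here as an assumption, converts DAYS_BIRTH to whole years with a 365-day year and then assigns the age group; the function name is hypothetical.

```python
AGE_BINS = [(20, 30), (31, 40), (41, 50), (51, 60), (61, 70)]

def age_group(days_birth):
    """Map DAYS_BIRTH (sign ignored) to an age-group label, or None."""
    years = abs(days_birth) // 365  # assumption: 365-day years, whole years
    for low, high in AGE_BINS:
        if low <= years <= high:
            return f"{low}-{high}"
    return None  # age outside the groups used in the report

# Quartile values of DAYS_BIRTH reported earlier in the document.
for days in (7489, 12413, 15750, 19682, 25229):
    print(days, age_group(days))
```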
Application Dataset – Analysis
Univariate Analysis
AGE GROUP
Count of TARGET   Column Labels
Row Labels   0   Grand Total
20-30 43276 43276
31-40 74961 74961
41-50 69784 69784
51-60 63853 63853
61-70 30812 30812
Grand Total 282686 282686
[Bar chart: Clients Age Group with no Payment issues]
From the adjacent bar plot we can infer that clients/applicants in the age group
‘31-40’ are the most numerous among those repaying the bank without payment issues.
Univariate Analysis
AGE GROUP
Count of TARGET Column Labels
Row Labels 1 Grand Total
20-30 5593 5593
31-40 7809 7809
41-50 5725 5725
51-60 4102 4102
61-70 1596 1596
Grand Total 24825 24825
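Because the age groups differ in size, raw counts and rates can tell different stories. The sketch below divides the count of clients with payment issues by the total count of each age group, using only figures from the tables in this report; it is an illustrative addition, not a step performed in the original workbook.

```python
# Total applicants and applicants with payment issues (TARGET = 1) per age group.
totals = {"20-30": 48869, "31-40": 82770, "41-50": 75509,
          "51-60": 67955, "61-70": 32408}
with_issues = {"20-30": 5593, "31-40": 7809, "41-50": 5725,
               "51-60": 4102, "61-70": 1596}

# Share of each group that had payment issues, in percent.
rates = {group: with_issues[group] / totals[group] * 100 for group in totals}

for group, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group}: {rate:.2f}%")
```

Under this measure the group with the most payment issues in absolute terms is not necessarily the group with the highest rate.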
Univariate Analysis
OCCUPATION_TYPE
Count of TARGET   Column Labels
Row Labels   0   Grand Total
Accountants 9339 9339
Cleaning staff 4206 4206
Cooking staff 5325 5325
Core staff 25832 25832
Drivers 16496 16496
High skill tech staff 10679 10679
HR staff 527 527
IT staff 492 492
Laborers 139461 139461
Low-skill Laborers 1734 1734
Managers 20043 20043
Medicine staff 7965 7965
Private service staff 2477 2477
Realty agents 692 692
Sales staff 29010 29010
Secretaries 1213 1213
Security staff 5999 5999
Waiters/barmen staff 1196 1196
Grand Total 282686 282686
[Bar chart: Clients occupation type with no payment issues]
From the above bar plot we can infer that clients with occupation_type ‘Laborers’ have the highest
count among clients with no payment issues.
Application Dataset – Analysis
Univariate Analysis
OCCUPATION_TYPE
Count of TARGET   Column Labels
Row Labels   1   Grand Total
Accountants 474 474
Cleaning staff 447 447
Cooking staff 621 621
Core staff 1738 1738
Drivers 2107 2107
High skill tech staff 701 701
HR staff 36 36
IT staff 34 34
Laborers 12116 12116
Low-skill Laborers 359 359
Managers 1328 1328
Medicine staff 572 572
Private service staff 175 175
Realty agents 59 59
Sales staff 3092 3092
Secretaries 92 92
Security staff 722 722
Waiters/barmen staff 152 152
Grand Total 24825 24825
[Bar chart: Clients occupation_type with payment issues]
From the above bar plot we can infer that clients with occupation_type ‘Laborers’ have the highest
count among clients with payment issues.
Application Dataset – Analysis
Univariate Analysis
NAME_INCOME_TYPE
Count of TARGET   Column Labels
Row Labels   0   Grand Total
Businessman 10 10
Commercial associate 66257 66257
Maternity leave 3 3
Pensioner 52380 52380
State servant 20454 20454
Student 18 18
Unemployed 14 14
Working 143550 143550
Grand Total 282686 282686
[Bar chart: NAME_INCOME_TYPE with no payment issues]
From the above bar plot we can infer that clients having income_type ‘Working’ have the
highest count among clients with no payment issues.
Application Dataset – Analysis
Univariate Analysis
NAME_INCOME_TYPE
Count of TARGET   Column Labels
Row Labels   1   Grand Total
Commercial associate 5360 5360
Maternity leave 2 2
Pensioner 2982 2982
State servant 1249 1249
Unemployed 8 8
Working 15224 15224
Grand Total 24825 24825
[Bar chart: NAME_INCOME_TYPE with payment issues]
From the above bar plot we can infer that clients having income_type ‘Working’ have the
highest count among clients with payment issues.
Application Dataset – Analysis
Univariate Analysis
AMT_TOTAL_INCOME
Count of TARGET   Column Labels
Row Labels   0   Grand Total
HIGH 7595 7595
LOW 201045 201045
MEDIUM 74046 74046
Grand Total 282686 282686
[Bar chart: AMT_TOTAL_INCOME with no payment issues]
From the above bar plot we can infer that clients in the total income range ‘LOW’
have the highest count among clients having no payment issues.
Application Dataset – Analysis
Univariate Analysis
AMT_TOTAL_INCOME
Count of TARGET   Column Labels
Row Labels   1   Grand Total
HIGH 468 468
LOW 18551 18551
MEDIUM 5806 5806
Grand Total 24825 24825
[Bar chart: AMT_TOTAL_INCOME with payment issues]
From the above bar plot we can infer that clients in the total income range ‘LOW’
have the highest count among clients having payment issues.
Application Dataset – Analysis
Univariate Analysis
CNT_FAMILY_MEMBERS
Count of CNT_FAM_MEMBERS   Column Labels
Row Labels   0   Grand Total
1 62172 62172
2 146350 146350
3 47993 47993
4 22561 22561
5 3151 3151
6 353 353
7 75 75
8 14 14
9 6 6
10 2 2
12 2 2
14 2 2
15 1 1
16 2 2
20 2 2
Grand Total 282686 282686
[Bar chart: CNT_FAMILY_MEMBERS with no payment issues]
From the above bar plot we can infer that clients with 2 family members have the
highest count among clients having no payment issues.
Application Dataset – Analysis
Univariate Analysis
CNT_FAMILY_MEMBERS
Count of CNT_FAM_MEMBERS   Column Labels
Row Labels   1   Grand Total
1 5675 5675
2 12009 12009
3 4608 4608
4 2136 2136
5 327 327
6 55 55
7 6 6
8 6 6
10 1 1
11 1 1
13 1 1
Grand Total 24825 24825
[Bar chart: CNT_FAMILY_MEMBERS with payment issues]
From the above bar plot we can infer that clients with 2 family members have the
highest count among clients having payment issues.
Application Dataset – Analysis
CODE_GENDER
[Bar chart: count of TARGET by CODE_GENDER (F, M, XNA); data labels 14170, 10655, 4]
From the above bar plot we can infer that clients with CODE_GENDER = ‘F’ have the highest
number of non-defaulters, i.e. 188278, against 14170 defaulters.
Application Dataset – Analysis
NAME_INCOME_TYPE
NAME_EDUCATION_TYPE
Count of NAME_EDUCATION_TYPE   Column Labels
Row Labels   0   1   Grand Total
Academic degree 161 3 164
Higher education 70854 4009 74863
Incomplete higher 9405 872 10277
Lower secondary 3399 417 3816
Secondary / secondary special 198867 19524 218391
Grand Total 282686 24825 307511
[Bar chart: count of TARGET (0, 1) by NAME_EDUCATION_TYPE]
From the above bar plot we can infer that clients having
NAME_EDUCATION_TYPE = ‘Secondary / secondary special’ have the highest count of
non-defaulters, i.e. 198867, against 19524 defaulters.
Application Dataset – Analysis
NAME_FAMILY_STATUS
Count of NAME_FAMILY_STATUS   Column Labels
Row Labels   0   1   Grand Total
Civil marriage 26814 2961 29775
Married 181582 14850 196432
Separated 18150 1620 19770
Single / not married 40987 4457 45444
Unknown 2 0 2
Widow 15151 937 16088
Grand Total 282686 24825 307511
[Bar chart: count of TARGET (0, 1) by NAME_FAMILY_STATUS]
From the adjacent bar plot we can infer that clients having
NAME_FAMILY_STATUS = ‘Married’ have the highest count of non-defaulters.
NAME_HOUSING_TYPE
Count of NAME_HOUSING_TYPE   Column Labels
Row Labels   0   1   Grand Total
Co-op apartment 1033 89 1122
House / apartment 251596 21272 272868
Municipal apartment 10228 955 11183
Office apartment 2445 172 2617
Rented apartment 4280 601 4881
With parents 13104 1736 14840
Grand Total 282686 24825 307511
[Bar chart: count of TARGET (0, 1) by NAME_HOUSING_TYPE]
From the above bar plot we can infer that clients having NAME_HOUSING_TYPE =
‘House / apartment’ have the highest count of non-defaulters, i.e. 251596, against
21272 defaulters.
Application Dataset – Analysis
OCCUPATION_TYPE
Count of OCCUPATION_TYPE   Column Labels
Row Labels   0   1   Grand Total
Accountants 9339 474 9813
Cleaning staff 4206 447 4653
Cooking staff 5325 621 5946
Core staff 25832 1738 27570
Drivers 16496 2107 18603
High skill tech staff 10679 701 11380
HR staff 527 36 563
IT staff 492 34 526
Laborers 139461 12116 151577
Low-skill Laborers 1734 359 2093
Managers 20043 1328 21371
Medicine staff 7965 572 8537
Private service staff 2477 175 2652
Realty agents 692 59 751
Sales staff 29010 3092 32102
Secretaries 1213 92 1305
Security staff 5999 722 6721
Waiters/barmen staff 1196 152 1348
Grand Total 282686 24825 307511
[Bar chart: count of TARGET (0, 1) by OCCUPATION_TYPE]
From the bar plot of gender against income group we can infer that females in the Low
income group form the largest number of clients with no payment issues.
Application Dataset – Analysis
[Bar chart: clients with payment issues by income range (HIGH, LOW, MEDIUM) and gender (F, M); data labels 180, 288]
From the above bar plot we can infer that females in the Low income group form the
largest number of clients with payment issues.
Application Dataset – Analysis
Count of NAME_EDUCATION_TYPE   Column Labels
Row Labels   Academic degree   Higher education   Incomplete higher   Lower secondary   Secondary / secondary special   Grand Total
HIGH 2 1397 205 74 4922 6600
LOW 0 1081 322 161 6878 8442
MEDIUM 1 1531 345 182 7724 9783
Grand Total 3 4009 872 417 19524 24825
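A two-way count such as the table above is what an Excel pivot table produces when one field is placed in Rows and another in Columns. The same cross-tabulation can be sketched in plain Python as follows; the records and field names are hypothetical and only illustrate the counting logic.

```python
from collections import Counter

def crosstab(records, row_key, col_key):
    """Count records for every (row, column) pair, like a pivot table."""
    pair_counts = Counter((r[row_key], r[col_key]) for r in records)
    table = {}
    for (row, col), n in pair_counts.items():
        table.setdefault(row, {})[col] = n
    return table

# Hypothetical records; field names are illustrative, not the sheet's headers.
clients = [
    {"income_range": "LOW", "education": "Higher education"},
    {"income_range": "LOW", "education": "Higher education"},
    {"income_range": "LOW", "education": "Lower secondary"},
    {"income_range": "HIGH", "education": "Academic degree"},
]
pivot = crosstab(clients, "income_range", "education")
print(pivot)
```

Combinations that never occur are simply absent from the result, which corresponds to an empty cell in the pivot table.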
[Bar chart: clients with no payment issues by total income range (HIGH, LOW, MEDIUM) and family status (Civil marriage, Married, Separated, Single / not married, Unknown, Widow)]
From the adjacent bar plot we can infer that clients with total_income_range ‘Low’
and family_status ‘Married’ have the highest count among clients having no
payment issues.
Application Dataset – Analysis
TOTAL_INCOME_RANGE VS FAMILY_STATUS
[Bar chart: clients with payment issues by total income range (HIGH, LOW, MEDIUM) and family status (Civil marriage, Married, Separated, Single / not married, Widow)]
From the adjacent bar plot we can infer that clients with total_income_range ‘Low’
and family_status ‘Married’ have the highest count among clients having
payment issues.
Application Dataset – Analysis
• HOUR_APPR_PROCESS_START
• WEEKDAY_APPR_PROCESS_START_PREV
• FLAG_LAST_APPL_PER_CONTRACT
• NFLAG_LAST_APPL_IN_DAY
• SK_ID_CURR
• WEEKDAY_APPR_PROCESS_START
Removing the rows with the values 'XNA' & 'XAP' for the column:
NAME_TYPE_SUITE
AMT_ANNUITY
Median of AMT_ANNUITY
21340
Previous Application Dataset – Dropping, Imputing and analyzing Null values
NAME_TYPE_SUITE
Row Labels   Count of NAME_TYPE_SUITE
Children 343
Family 3146
Group of people 39
Other_A 98
Other_B 276
Spouse, partner 1194
Unaccompanied 21206
(blank)
Grand Total 26302
[Bar chart: Count of NAME_TYPE_SUITE]
Replace blanks with ‘Unaccompanied’
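Replacing blanks with the most frequent category can be sketched in Python as below. The helper treats both None and empty strings as blanks, which is an assumption about how the blanks appear; the sample values are hypothetical.

```python
from collections import Counter

BLANKS = (None, "")  # assumption: blanks appear as None or empty strings

def impute_with_mode(values):
    """Replace blank entries with the most frequent non-blank category."""
    present = [v for v in values if v not in BLANKS]
    mode_value = Counter(present).most_common(1)[0][0]
    return [mode_value if v in BLANKS else v for v in values]

# Hypothetical sample of NAME_TYPE_SUITE values.
suite = ["Family", "", "Unaccompanied", "Unaccompanied", None]
print(impute_with_mode(suite))
```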
From the above bar plot we can infer that the loan purpose ‘Repairs’ has the highest count of loans with contract status ‘Approved’.
Previous Application Dataset – Analysis of Cleaned Data
• The Bank generally lends more to Female clients than to Male clients, as the
share of Female clients who default is lower than that of Male clients. Still, the
Bank can look for more Male clients if their credit amount is satisfactory
• Also, clients who belong to the Working class tend to pay their loans on time,
followed by clients who fall under Commercial Associate
• Clients with an education status of Secondary / Higher Secondary or more tend
to pay their loans on time, so the Bank can prefer lending to clients with such
an education status
• Clients in the Age Group 31-40 have the highest count for paying off their
loans on time, followed by clients in the Age Groups 41-60
• Clients in the LOW credit amount range tend to pay off their loans on time more
often than clients in the HIGH and MEDIUM credit ranges
• Clients living with their parents tend to pay off their loans quickly
compared to other housing types. So the Bank can lend to clients with
housing type → Living with Parents
• Clients taking a loan to purchase a new home (Home Loans) or a new car
(Car Loans), and clients whose income type is State Servant, tend to pay
their loans on time, and hence the Bank should prefer clients with such a
background
• The Bank should be more cautious when lending money to clients with the
Repairs purpose, because they have a high count of defaulters
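The comparison between Female and Male clients can be checked against the figures in this report by relating the defaulter counts to the group sizes. The sketch below is illustrative only; it uses the CODE_GENDER totals and the defaulter counts shown for F and M, and the function name is hypothetical.

```python
def default_rate(total, defaulters):
    """Return the share of clients with payment issues, in percent."""
    return defaulters / total * 100

# (total clients, clients with payment issues) per gender, from the report.
gender = {"F": (202448, 14170), "M": (105059, 10655)}

gender_rates = {g: default_rate(total, bad) for g, (total, bad) in gender.items()}
for g, rate in gender_rates.items():
    print(f"{g}: {rate:.1f}% of clients had payment issues")
```

The same helper can be applied to any of the other pivot tables (housing type, education, occupation) to compare groups of different sizes on an equal footing.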
Google Drive folder link for the analysed datasets in the form of Excel sheets.
Due to the size of the data, the Excel sheets need to be downloaded and viewed
offline: