Data Preprocessing

import pandas as pd

import numpy as np

df = pd.read_csv('/Users/nageshjadhav/Desktop/adult.csv')

df.head()

   age  workclass  fnlwgt     education  educational-num      marital-status \
0   25    Private  226802          11th                7       Never-married
1   38    Private   89814       HS-grad                9  Married-civ-spouse
2   28  Local-gov  336951    Assoc-acdm               12  Married-civ-spouse
3   44    Private  160323  Some-college               10  Married-civ-spouse
4   18          ?  103497  Some-college               10       Never-married

          occupation relationship   race  gender  capital-gain  capital-loss \
0  Machine-op-inspct    Own-child  Black    Male             0             0
1    Farming-fishing      Husband  White    Male             0             0
2    Protective-serv      Husband  White    Male             0             0
3  Machine-op-inspct      Husband  Black    Male          7688             0
4                  ?    Own-child  White  Female             0             0

   hours-per-week native-country income
0              40  United-States  <=50K
1              50  United-States  <=50K
2              40  United-States   >50K
3              40  United-States   >50K
4              30  United-States  <=50K

df.corr()

age fnlwgt educational-num capital-gain \


age 1.000000 -0.076628 0.030940 0.077229
fnlwgt -0.076628 1.000000 -0.038761 -0.003706
educational-num 0.030940 -0.038761 1.000000 0.125146
capital-gain 0.077229 -0.003706 0.125146 1.000000
capital-loss 0.056944 -0.004366 0.080972 -0.031441
hours-per-week 0.071558 -0.013519 0.143689 0.082157

capital-loss hours-per-week
age 0.056944 0.071558
fnlwgt -0.004366 -0.013519
educational-num 0.080972 0.143689
capital-gain -0.031441 0.082157
capital-loss 1.000000 0.054467
hours-per-week 0.054467 1.000000

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 48842 non-null int64
1 workclass 48842 non-null object
2 fnlwgt 48842 non-null int64
3 education 48842 non-null object
4 educational-num 48842 non-null int64
5 marital-status 48842 non-null object
6 occupation 48842 non-null object
7 relationship 48842 non-null object
8 race 48842 non-null object
9 gender 48842 non-null object
10 capital-gain 48842 non-null int64
11 capital-loss 48842 non-null int64
12 hours-per-week 48842 non-null int64
13 native-country 48842 non-null object
14 income 48842 non-null object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB

df.isna().sum()

age 0
workclass 0
fnlwgt 0
education 0
educational-num 0
marital-status 0
occupation 0
relationship 0
race 0
gender 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 0
income 0
dtype: int64
df.isin(['?']).sum()

age 0
workclass 2799
fnlwgt 0
education 0
educational-num 0
marital-status 0
occupation 2809
relationship 0
race 0
gender 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 857
income 0
dtype: int64

df.describe()

age fnlwgt educational-num capital-gain \


count 48842.000000 4.884200e+04 48842.000000 48842.000000
mean 38.643585 1.896641e+05 10.078089 1079.067626
std 13.710510 1.056040e+05 2.570973 7452.019058
min 17.000000 1.228500e+04 1.000000 0.000000
25% 28.000000 1.175505e+05 9.000000 0.000000
50% 37.000000 1.781445e+05 10.000000 0.000000
75% 48.000000 2.376420e+05 12.000000 0.000000
max 90.000000 1.490400e+06 16.000000 99999.000000

capital-loss hours-per-week
count 48842.000000 48842.000000
mean 87.502314 40.422382
std 403.004552 12.391444
min 0.000000 1.000000
25% 0.000000 40.000000
50% 0.000000 40.000000
75% 0.000000 45.000000
max 4356.000000 99.000000

df.duplicated().sum()

52

df = df.drop_duplicates()
print(df.duplicated().sum())

df['age'].value_counts()
36 1348
35 1336
33 1335
23 1325
31 1324
...
88 6
85 5
87 3
89 2
86 1
Name: age, Length: 74, dtype: int64

for i in df.columns:
    a = df[i].value_counts()
    print(f'\n\n\nvalues of {a}')

values of 36 1348
35 1336
33 1335
23 1325
31 1324
...
88 6
85 5
87 3
89 2
86 1
Name: age, Length: 74, dtype: int64

values of Private 33860


Self-emp-not-inc 3861
Local-gov 3136
? 2795
State-gov 1981
Self-emp-inc 1694
Federal-gov 1432
Without-pay 21
Never-worked 10
Name: workclass, dtype: int64

values of 203488 21
190290 19
120277 19
125892 18
126569 18
..
293579 1
114874 1
96279 1
509350 1
257302 1
Name: fnlwgt, Length: 28523, dtype: int64

values of HS-grad 15770


Some-college 10863
Bachelors 8013
Masters 2656
Assoc-voc 2060
11th 1812
Assoc-acdm 1601
10th 1389
7th-8th 954
Prof-school 834
9th 756
12th 655
Doctorate 594
5th-6th 507
1st-4th 245
Preschool 81
Name: education, dtype: int64

values of 9 15770
10 10863
13 8013
14 2656
11 2060
7 1812
12 1601
6 1389
4 954
15 834
5 756
8 655
16 594
3 507
2 245
1 81
Name: educational-num, dtype: int64
values of Married-civ-spouse 22366
Never-married 16082
Divorced 6630
Separated 1530
Widowed 1518
Married-spouse-absent 627
Married-AF-spouse 37
Name: marital-status, dtype: int64

values of Prof-specialty 6165


Craft-repair 6102
Exec-managerial 6082
Adm-clerical 5606
Sales 5501
Other-service 4919
Machine-op-inspct 3017
? 2805
Transport-moving 2355
Handlers-cleaners 2071
Farming-fishing 1485
Tech-support 1445
Protective-serv 982
Priv-house-serv 240
Armed-Forces 15
Name: occupation, dtype: int64

values of Husband 19703


Not-in-family 12557
Own-child 7569
Unmarried 5124
Wife 2331
Other-relative 1506
Name: relationship, dtype: int64

values of White 41714


Black 4683
Asian-Pac-Islander 1517
Amer-Indian-Eskimo 470
Other 406
Name: race, dtype: int64
values of Male 32614
Female 16176
Name: gender, dtype: int64

values of 0 44755
15024 513
7688 410
7298 364
99999 244
...
1111 1
7262 1
22040 1
1639 1
2387 1
Name: capital-gain, Length: 123, dtype: int64

values of 0 46508
1902 304
1977 253
1887 233
2415 72
...
2465 1
2080 1
155 1
1911 1
2201 1
Name: capital-loss, Length: 99, dtype: int64

values of 40 22773
50 4242
45 2715
60 2177
35 1934
...
69 1
87 1
94 1
82 1
79 1
Name: hours-per-week, Length: 96, dtype: int64

values of United-States 43792


Mexico 943
? 856
Philippines 294
Germany 206
Puerto-Rico 184
Canada 182
El-Salvador 155
India 151
Cuba 138
England 127
China 122
South 115
Jamaica 106
Italy 105
Dominican-Republic 103
Japan 92
Poland 87
Guatemala 86
Vietnam 86
Columbia 85
Haiti 75
Portugal 67
Taiwan 65
Iran 59
Greece 49
Nicaragua 49
Peru 46
Ecuador 45
France 38
Ireland 37
Hong 30
Thailand 30
Cambodia 28
Trinadad&Tobago 27
Laos 23
Yugoslavia 23
Outlying-US(Guam-USVI-etc) 23
Scotland 21
Honduras 20
Hungary 19
Holand-Netherlands 1
Name: native-country, dtype: int64
values of <=50K 37109
>50K 11681
Name: income, dtype: int64

df = df.replace('?',np.nan)

df.isna().sum()

age 0
workclass 2795
fnlwgt 0
education 0
educational-num 0
marital-status 0
occupation 2805
relationship 0
race 0
gender 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 856
income 0
dtype: int64

Feature Scaling Methods


1. Min Max Scaler
2. Standard Scaler
3. Max Abs Scaler
4. Robust Scaler
5. Quantile Transformer Scaler
6. Power Transformer Scaler
7. Unit Vector Scaler
# Minmax Scaling
import numpy as np
from sklearn.preprocessing import MinMaxScaler
df2 = np.array([[2, 3, 7, 30],
                [9, 4, 6, 1],
                [8, 15, 2, 40],
                [20, 10, 2, 6]])
print(df2)

[[ 2 3 7 30]
[ 9 4 6 1]
[ 8 15 2 40]
[20 10 2 6]]

import matplotlib.pyplot as plt


fig = plt.figure(figsize =(10, 7))
plt.boxplot(df2)
plt.show()
scaler = MinMaxScaler()
scaler.fit(df2)
scaled_features = scaler.transform(df2)
print(scaled_features)

[[0. 0. 1. 0.74358974]
[0.38888889 0.08333333 0.8 0. ]
[0.33333333 1. 0. 1. ]
[1. 0.58333333 0. 0.12820513]]

fig = plt.figure(figsize =(10, 7))


plt.boxplot(scaled_features)
plt.show()
#Standardization

from sklearn.preprocessing import StandardScaler


sc_X = StandardScaler()
sc_X = sc_X.fit_transform(df2)
print(sc_X)

[[-1.19319056 -1.03142125 1.20740686 0.66200869]


[-0.11547005 -0.825137 0.76834982 -1.12387522]
[-0.26943013 1.44398974 -0.98787834 1.27783073]
[ 1.57809074 0.4125685 -0.98787834 -0.8159642 ]]

fig = plt.figure(figsize =(10, 7))


plt.boxplot(sc_X)
plt.show()
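
The list above also includes Max Abs and Robust scaling, which this lab does not demonstrate. As a minimal sketch (not part of the original code), scikit-learn's RobustScaler centres each column on its median and scales by the IQR, so it is less affected by the outliers visible in the boxplots.

# Robust scaling sketch on the same toy array df2:
# subtract the per-column median, divide by the per-column IQR
from sklearn.preprocessing import RobustScaler

robust_scaled = RobustScaler().fit_transform(df2)
print(robust_scaled)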
Data cleaning
1. Handling Missing Values
2. Managing Outliers
3. Removal of unwanted observations

Handling Missing Values


1. Leave as it is
2. Filling the missing values
3. Drop them

Dropping and Filling Missing Values

Drop the rows with missing values


DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
axis: 0 or 'index' to drop rows, 1 or 'columns' to drop columns.
how: 'any' drops the row/column if ANY value is null; 'all' drops it only if ALL values are null.
thresh: an integer giving the minimum number of non-null values a row/column must contain to be kept.
subset: a list of labels that restricts the check to particular columns (when dropping rows) or rows (when dropping columns).
inplace: a boolean; if True, the DataFrame is modified in place.
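
A small illustration of these parameters on a throwaway frame (a sketch; the values are made up and not from the adult or NBA data):

# dropna sketch on a tiny example frame
import pandas as pd
import numpy as np

tmp = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6], 'c': [7, 8, 9]})
print(tmp.dropna())             # drops any row containing a NaN
print(tmp.dropna(how='all'))    # drops a row only if every value is NaN
print(tmp.dropna(thresh=2))     # keeps rows with at least 2 non-null values
print(tmp.dropna(subset=['a'])) # only column 'a' is checked for NaN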

Fill the rows with missing values


DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
value: a scalar, dict, Series or DataFrame used in place of NaN.
method: used when no value is passed; 'ffill'/'pad' propagates the last valid observation forward, while 'bfill'/'backfill' fills from the next valid observation.
axis: 0 or 'index' to fill along rows, 1 or 'columns' to fill along columns.
inplace: a boolean; if True, the DataFrame is modified in place.
limit: the maximum number of consecutive forward/backward NaN fills.
downcast: a dict specifying what dtype to downcast to, e.g. float64 to int64.
**kwargs: any other keyword arguments.
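
As with dropna, a quick sketch of fillna on a throwaway frame may help; the notebook itself uses method='ffill'/'bfill' and SimpleImputer below, so this only illustrates the value and limit parameters:

# fillna sketch on a tiny example frame
import pandas as pd
import numpy as np

tmp2 = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, 5, np.nan]})
print(tmp2.fillna(value={'a': 0, 'b': tmp2['b'].mean()}))  # per-column fill values
print(tmp2.fillna(method='ffill', limit=1))                # at most one consecutive forward fill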
data = pd.read_csv('/Users/nageshjadhav/Desktop/nba.csv')

data.head()

            Name            Team  Number Position   Age Height  Weight  \
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0

             College     Salary
0              Texas  7730337.0
1          Marquette  6796117.0
2  Boston University        NaN
3      Georgia State  1148640.0
4                NaN  5000000.0

data.isna().sum()

Name 1
Team 1
Number 1
Position 1
Age 1
Height 1
Weight 1
College 85
Salary 12
dtype: int64

data['Salary'].fillna(method='ffill', inplace=True)

data.head(10)

            Name            Team  Number Position   Age Height  Weight  \
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0
5   Amir Johnson  Boston Celtics    90.0       PF  29.0    6-9   240.0
6  Jordan Mickey  Boston Celtics    55.0       PF  21.0    6-8   235.0
7   Kelly Olynyk  Boston Celtics    41.0        C  25.0    7-0   238.0
8   Terry Rozier  Boston Celtics    12.0       PG  22.0    6-2   190.0
9   Marcus Smart  Boston Celtics    36.0       PG  22.0    6-4   220.0

             College      Salary
0              Texas   7730337.0
1          Marquette   6796117.0
2  Boston University   6796117.0
3      Georgia State   1148640.0
4                NaN   5000000.0
5                NaN  12000000.0
6                LSU   1170960.0
7            Gonzaga   2165160.0
8         Louisville   1824360.0
9     Oklahoma State   3431040.0

data['College'].fillna(method='bfill', inplace=True)

Simple Imputer
It replaces the NaN values with a specified placeholder.
missing_values: the placeholder that marks missing entries; np.nan by default.
strategy: how the replacement value is computed; one of 'mean' (default), 'median', 'most_frequent' or 'constant'.
fill_value: the constant used for the NaN entries when strategy='constant'.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)

data['Salary'] = imputer.fit_transform(data[['Salary']])

data.head()

            Name            Team  Number Position   Age Height  Weight  \
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0

             College     Salary
0              Texas  7730337.0
1          Marquette  6796117.0
2  Boston University  6796117.0
3      Georgia State  1148640.0
4                LSU  5000000.0

imputer1 = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
data['College'] = imputer1.fit_transform(data[['College']])

data.head()

            Name            Team  Number Position   Age Height  Weight  \
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0

             College     Salary
0              Texas  7730337.0
1          Marquette  6796117.0
2  Boston University  6796117.0
3      Georgia State  1148640.0
4                LSU  5000000.0

df['workclass'].unique()

array(['Private', 'Local-gov', nan, 'Self-emp-not-inc', 'Federal-gov',


'State-gov', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
dtype=object)

from sklearn.impute import SimpleImputer


imp = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
df['workclass'] = imp.fit_transform(df[['workclass']])
df['occupation'] = imp.fit_transform(df[['occupation']])
df['native-country'] = imp.fit_transform(df[['native-country']])

df.isna().sum()

age 0
workclass 0
fnlwgt 0
education 0
educational-num 0
marital-status 0
occupation 0
relationship 0
race 0
gender 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 0
income 0
dtype: int64

df

       age     workclass  fnlwgt     education  educational-num \
0       25       Private  226802          11th                7
1       38       Private   89814       HS-grad                9
2       28     Local-gov  336951    Assoc-acdm               12
3       44       Private  160323  Some-college               10
4       18       Private  103497  Some-college               10
...    ...           ...     ...           ...              ...
48837   27       Private  257302    Assoc-acdm               12
48838   40       Private  154374       HS-grad                9
48839   58       Private  151910       HS-grad                9
48840   22       Private  201490       HS-grad                9
48841   52  Self-emp-inc  287927       HS-grad                9

           marital-status         occupation relationship   race  gender \
0           Never-married  Machine-op-inspct    Own-child  Black    Male
1      Married-civ-spouse    Farming-fishing      Husband  White    Male
2      Married-civ-spouse    Protective-serv      Husband  White    Male
3      Married-civ-spouse  Machine-op-inspct      Husband  Black    Male
4           Never-married     Prof-specialty    Own-child  White  Female
...                   ...                ...          ...    ...     ...
48837  Married-civ-spouse       Tech-support         Wife  White  Female
48838  Married-civ-spouse  Machine-op-inspct      Husband  White    Male
48839             Widowed       Adm-clerical    Unmarried  White  Female
48840       Never-married       Adm-clerical    Own-child  White    Male
48841  Married-civ-spouse    Exec-managerial         Wife  White  Female

       capital-gain  capital-loss  hours-per-week native-country  income
0                 0             0              40  United-States   <=50K
1                 0             0              50  United-States   <=50K
2                 0             0              40  United-States    >50K
3              7688             0              40  United-States    >50K
4                 0             0              30  United-States   <=50K
...             ...           ...             ...            ...     ...
48837             0             0              38  United-States   <=50K
48838             0             0              40  United-States    >50K
48839             0             0              40  United-States   <=50K
48840             0             0              20  United-States   <=50K
48841         15024             0              40  United-States    >50K

[48790 rows x 15 columns]

Outlier Detection
1. Using Boxplot
2. Using Scatter plot
3. Using Z score
4. Using Inter Quartile Range
# Using Boxplot
import seaborn as sns
sns.boxplot(data['Weight'])

/Users/nageshjadhav/opt/anaconda3/envs/ISII_Lab/lib/python3.9/site-
packages/seaborn/_decorators.py:36: FutureWarning: Pass the following
variable as a keyword arg: x. From version 0.12, the only valid
positional argument will be `data`, and passing other arguments
without an explicit keyword will result in an error or
misinterpretation.
warnings.warn(

<AxesSubplot:xlabel='Weight'>

print(np.where(data['Weight']>290))

(array([405]),)
sns.boxplot(df['age'])

/Users/nageshjadhav/opt/anaconda3/envs/ISII_Lab/lib/python3.9/site-
packages/seaborn/_decorators.py:36: FutureWarning: Pass the following
variable as a keyword arg: x. From version 0.12, the only valid
positional argument will be `data`, and passing other arguments
without an explicit keyword will result in an error or
misinterpretation.
warnings.warn(

<AxesSubplot:xlabel='age'>

print(df['age'].unique())

[25 38 28 44 18 34 29 63 24 55 65 36 26 58 48 43 20 37 40 72 45 22 23 54
 32 46 56 17 39 52 21 42 33 30 47 41 19 69 50 31 59 49 51 27 57 61 64 79
 73 53 77 80 62 35 68 66 75 60 67 71 70 90 81 74 78 82 83 85 76 84 89 88
 87 86]

np.where(df['age'] > 78)


#c = np.where(df['age'] > 80)
#t = c[0]
#p = []
#for i in t:
# p.append(df.age[i])
# print(p)
(array([ 193, 234, 898, 925, 950, 1078, 1397, 1833, 2084,
2289, 2981, 3495, 3667, 4454, 4645, 4657, 6401, 6576,
6756, 6914, 6958, 6975, 6978, 7159, 7169, 7413, 7418,
7538, 7546, 7936, 8205, 8312, 8426, 8954, 8981, 9017,
9037, 9080, 9278, 9768, 9887, 10038, 10198, 10222, 10734,
11286, 11325, 11407, 11834, 11868, 11878, 11937, 12057, 12226,
12443, 13022, 13954, 14029, 14259, 14295, 14427, 14564, 14587,
14736, 15084, 15094, 15404, 15930, 15958, 15998, 16101, 16143,
16246, 16350, 16498, 16706, 17194, 17316, 17444, 18211, 18578,
19029, 19166, 19181, 19485, 19612, 19810, 20050, 20236, 20343,
20382, 20992, 21106, 21542, 21561, 21640, 21676, 22270, 22443,
22484, 22502, 22709, 22894, 23018, 23751, 23990, 24142, 24445,
24650, 24700, 24791, 24963, 25075, 25231, 25241, 25737, 26388,
26474, 26809, 27363, 27502, 27776, 27796, 27994, 28259, 28714,
28755, 29093, 29238, 29288, 29289, 29557, 29958, 30190, 30366,
30421, 30865, 30971, 31016, 31163, 31615, 31921, 32151, 32561,
32782, 33021, 33160, 33867, 34294, 34398, 34529, 34534, 34670,
34816, 34980, 35087, 35300, 35427, 35435, 35467, 35744, 35750,
35770, 35943, 36001, 36082, 36503, 36675, 36717, 36736, 36737,
36862, 37078, 37132, 37205, 37594, 37751, 38062, 38085, 38468,
38726, 39139, 39142, 39703, 40144, 40271, 40287, 40482, 40524,
40639, 40804, 41406, 41546, 41640, 42253, 42483, 42971, 44035,
44415, 44701, 44958, 45184, 45829, 45959, 47261, 47663, 47927,
48045, 48067, 48086, 48507, 48597, 48688, 48723, 48754]),)

print(df.age[193])

79

sorted_df = df.sort_values(by=['age'], ascending=True)


Q1 = np.percentile(sorted_df['age'], 25)
Q3 = np.percentile(sorted_df['age'], 75)
IQR = Q3 - Q1
print(IQR)

20.0

lwr_bound = Q1-(1.5*IQR)
upr_bound = Q3+(1.5*IQR)
print(lwr_bound, upr_bound)

-2.0 78.0

outliers = []
for i in df['age']:
    if (i < lwr_bound or i > upr_bound):
        outliers.append(i)

print(outliers)
print(len(outliers))
[79, 80, 90, 79, 80, 81, 82, 83, 81, 85, 80, 90, 81, 84, 81, 89, 81,
83, 81, 82, 80, 90, 81, 83, 80, 90, 90, 84, 80, 80, 80, 81, 90, 85,
90, 81, 81, 80, 80, 79, 81, 80, 88, 87, 90, 79, 83, 79, 80, 90, 79,
79, 81, 81, 90, 82, 90, 87, 81, 88, 80, 81, 80, 81, 90, 88, 89, 84,
80, 80, 83, 79, 81, 79, 90, 80, 81, 90, 88, 90, 90, 80, 90, 81, 82,
79, 81, 80, 83, 90, 90, 79, 81, 90, 80, 90, 90, 79, 79, 84, 90, 80,
90, 81, 83, 84, 81, 79, 85, 82, 79, 80, 90, 90, 90, 84, 80, 90, 90,
79, 84, 90, 79, 90, 90, 90, 82, 81, 90, 84, 79, 81, 82, 81, 80, 90,
80, 84, 82, 79, 90, 84, 90, 83, 79, 81, 80, 79, 80, 79, 80, 90, 90,
80, 90, 90, 81, 83, 82, 90, 90, 81, 80, 80, 90, 79, 80, 82, 85, 80,
79, 90, 81, 79, 80, 79, 81, 82, 88, 90, 82, 88, 84, 83, 79, 86, 90,
90, 82, 83, 81, 79, 90, 80, 81, 79, 84, 84, 79, 90, 80, 81, 81, 81,
90, 87, 90, 80, 80, 82, 90, 90, 85, 82, 81]
215
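
The detection list above also mentions the Z score, which the notebook does not apply. A minimal sketch on the same 'age' column, using the common (but arbitrary) threshold of 3 standard deviations:

# Z-score outlier detection sketch: flag ages more than 3 std devs from the mean
z = (df['age'] - df['age'].mean()) / df['age'].std()
z_outliers = df['age'][np.abs(z) > 3]
print(len(z_outliers))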

Handling Outliers
1. Trimming/Remove the outliers
2. Quantile based flooring and capping
3. Mean/Median imputation
median = np.median(df['age'])  # replace the outliers with the median age
print(median)

c = df['age'].values
for i in outliers:
    c = np.where(c == i, 37, c)  # 37 is the median printed above; replacements accumulate in c
print("New array: ", c)
print(c.shape)

37.0
New array: [25 38 28 ... 58 22 52]
(48790,)
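
Option 2 above, quantile-based flooring and capping, is not shown in the notebook; a short sketch that clips 'age' to its 10th and 90th percentiles (the percentile choices are illustrative):

# Flooring and capping sketch: clip ages to the 10th/90th percentile
low, high = df['age'].quantile(0.10), df['age'].quantile(0.90)
capped_age = df['age'].clip(lower=low, upper=high)
print(capped_age.min(), capped_age.max())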

df['income'].value_counts()

<=50K 37109
>50K 11681
Name: income, dtype: int64

sns.countplot(df['income'])

/Users/nageshjadhav/opt/anaconda3/envs/ISII_Lab/lib/python3.9/site-
packages/seaborn/_decorators.py:36: FutureWarning: Pass the following
variable as a keyword arg: x. From version 0.12, the only valid
positional argument will be `data`, and passing other arguments
without an explicit keyword will result in an error or
misinterpretation.
warnings.warn(

<AxesSubplot:xlabel='income', ylabel='count'>
Handling Categorical Data
1. Replacing values
2. Encoding labels
3. One-Hot encoding
print(df['gender'].unique())

['Male' 'Female']

df['gender'] = df['gender'].replace('Male',1)
df['gender'] = df['gender'].replace('Female',0)

df['income'].unique()

array(['<=50K', '>50K'], dtype=object)

df['marital-status'] = df['marital-status'].str.replace('Never-married', 'NotMarried')
df['marital-status'] = df['marital-status'].str.replace('Married-AF-spouse', 'Married')
df['marital-status'] = df['marital-status'].str.replace('Married-civ-spouse', 'Married')
df['marital-status'] = df['marital-status'].str.replace('Married-spouse-absent', 'NotMarried')
df['marital-status'] = df['marital-status'].str.replace('Separated', 'Separated')
df['marital-status'] = df['marital-status'].str.replace('Divorced', 'Separated')
df['marital-status'] = df['marital-status'].str.replace('Widowed', 'Widowed')

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['income'] = le.fit_transform(df['income'])
df['race'] = le.fit_transform(df['race'])
df['marital-status'] = le.fit_transform(df['marital-status'])
print(df['income'].unique())

[0 1]

df['education'].unique()

array(['11th', 'HS-grad', 'Assoc-acdm', 'Some-college', '10th',


'Prof-school', '7th-8th', 'Bachelors', 'Masters', 'Doctorate',
'5th-6th', 'Assoc-voc', '9th', '12th', '1st-4th', 'Preschool'],
dtype=object)

df['education'] = df['education'].replace('Preschool', 'dropout')


df['education'] = df['education'].replace('10th', 'dropout')
df['education'] = df['education'].replace('11th', 'dropout')
df['education'] = df['education'].replace('12th', 'dropout')
df['education'] = df['education'].replace('1st-4th', 'dropout')
df['education'] = df['education'].replace('5th-6th', 'dropout')
df['education'] = df['education'].replace('7th-8th', 'dropout')
df['education'] = df['education'].replace('9th', 'dropout')
df['education'] = df['education'].replace('HS-Grad', 'HighGrad')
df['education'] = df['education'].replace('HS-grad', 'HighGrad')
df['education'] = df['education'].replace('Some-college',
'CommunityCollege')
df['education'] = df['education'].replace('Assoc-acdm',
'CommunityCollege')
df['education'] = df['education'].replace('Assoc-voc',
'CommunityCollege')
df['education'] = df['education'].replace('Bachelors', 'Bachelors')
df['education'] = df['education'].replace('Masters', 'Masters')
df['education'] = df['education'].replace('Prof-school', 'Masters')
df['education'] = df['education'].replace('Doctorate', 'Doctorate')

df['education'].unique()

array(['dropout', 'HighGrad', 'CommunityCollege', 'Masters',


'Bachelors',
'Doctorate'], dtype=object)

df['education'] = le.fit_transform(df['education'])
df['education'].unique()

array([5, 3, 1, 4, 0, 2])
df['marital-status'].unique()

array([1, 0, 3, 2])
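
The third option listed under handling categorical data, one-hot encoding, is not applied here. A minimal sketch with pandas get_dummies on the columns that are still text at this point (the column selection is an assumption, not from the original):

# One-hot encoding sketch: turn each category into its own 0/1 column
encoded = pd.get_dummies(df, columns=['workclass', 'occupation', 'relationship', 'native-country'])
print(encoded.shape)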

df.head()

   age  workclass  fnlwgt  education  educational-num  marital-status \
0   25    Private  226802          5                7               1
1   38    Private   89814          3                9               0
2   28  Local-gov  336951          1               12               0
3   44    Private  160323          1               10               0
4   18    Private  103497          1               10               1

          occupation relationship  race  gender  capital-gain  capital-loss \
0  Machine-op-inspct    Own-child     2       1             0             0
1    Farming-fishing      Husband     4       1             0             0
2    Protective-serv      Husband     4       1             0             0
3  Machine-op-inspct      Husband     2       1          7688             0
4     Prof-specialty    Own-child     4       0             0             0

   hours-per-week native-country  income
0              40  United-States       0
1              50  United-States       0
2              40  United-States       1
3              40  United-States       1
4              30  United-States       0

X = df.drop('income',axis=1)

y = df.income

print('Shape of X (input)', X.shape)


print('Shape of y (output or label)', y.shape)

Shape of X (input) (48790, 14)


Shape of y (output or label) (48790,)

from sklearn.model_selection import train_test_split


Xtrain, Xtest, ytrain,ytest = train_test_split(X,y,
test_size = 0.2,
random_state=42,
shuffle=True)

print('Shape of Training (X)', Xtrain.shape)


print('Shape of Training (y)', ytrain.shape)
print('Shape of Test (X)', Xtest.shape)
print('Shape of Test (y)', ytest.shape)

Shape of Training (X) (39032, 14)


Shape of Training (y) (39032,)
Shape of Test (X) (9758, 14)
Shape of Test (y) (9758,)
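
Since the income classes are imbalanced (see the value counts earlier), passing stratify=y keeps the class ratio identical in the train and test sets; a hedged variant of the call above:

# Stratified variant of the same split (sketch): preserves the <=50K / >50K ratio
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
                                                test_size=0.2,
                                                random_state=42,
                                                shuffle=True,
                                                stratify=y)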

Random Sampling
yes = df[df['income']==1]
no = df[df['income']==0]

print(yes.shape)
print(no.shape)

(11681, 15)
(37109, 15)

no_sample = no.sample(n=11681)

no_sample.shape

(11681, 15)

sampled_df = pd.concat([yes,no_sample],axis=0)

sampled_df.shape

(23362, 15)
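
Note that sample() draws a different subset on every run. A small sketch (with hypothetical variable names, not part of the original run) showing how random_state makes the undersampling reproducible and how shuffling mixes the two classes:

# Reproducible undersampling sketch
no_sample_seeded = no.sample(n=11681, random_state=42)
balanced_df = pd.concat([yes, no_sample_seeded], axis=0).sample(frac=1, random_state=42)
print(balanced_df['income'].value_counts())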

sns.countplot(sampled_df['income'])

/Users/nageshjadhav/opt/anaconda3/envs/ISII_Lab/lib/python3.9/site-
packages/seaborn/_decorators.py:36: FutureWarning: Pass the following
variable as a keyword arg: x. From version 0.12, the only valid
positional argument will be `data`, and passing other arguments
without an explicit keyword will result in an error or
misinterpretation.
warnings.warn(

<AxesSubplot:xlabel='income', ylabel='count'>
sampled_df.head()

    age         workclass  fnlwgt  education  educational-num  marital-status \
2    28         Local-gov  336951          1               12               0
3    44           Private  160323          1               10               0
7    63  Self-emp-not-inc  104626          4               15               0
10   65           Private  184454          3                9               0
14   48           Private  279724          3                9               0

           occupation relationship  race  gender  capital-gain  capital-loss \
2     Protective-serv      Husband     4       1             0             0
3   Machine-op-inspct      Husband     2       1          7688             0
7      Prof-specialty      Husband     4       1          3103             0
10  Machine-op-inspct      Husband     4       1          6418             0
14  Machine-op-inspct      Husband     4       1          3103             0

    hours-per-week native-country  income
2               40  United-States       1
3               40  United-States       1
7               32  United-States       1
10              40  United-States       1
14              48  United-States       1
