Data Preprocessing

Uploaded by

chatgptbolte


import pandas as pd

import numpy as np

df = pd.read_csv('/Users/nageshjadhav/Desktop/adult.csv')

df.head()

   age  workclass  fnlwgt     education  educational-num      marital-status  \
0   25    Private  226802          11th                7       Never-married
1   38    Private   89814       HS-grad                9  Married-civ-spouse
2   28  Local-gov  336951    Assoc-acdm               12  Married-civ-spouse
3   44    Private  160323  Some-college               10  Married-civ-spouse
4   18          ?  103497  Some-college               10       Never-married

          occupation  relationship   race  gender  capital-gain  capital-loss  \
0  Machine-op-inspct     Own-child  Black    Male             0             0
1    Farming-fishing       Husband  White    Male             0             0
2    Protective-serv       Husband  White    Male             0             0
3  Machine-op-inspct       Husband  Black    Male          7688             0
4                  ?     Own-child  White  Female             0             0

   hours-per-week  native-country  income
0              40   United-States   <=50K
1              50   United-States   <=50K
2              40   United-States    >50K
3              40   United-States    >50K
4              30   United-States   <=50K

df.select_dtypes(include='number').corr()  # correlate the numeric columns only

age fnlwgt educational-num capital-gain \

age 1.000000 -0.076628 0.030940 0.077229
fnlwgt -0.076628 1.000000 -0.038761 -0.003706
educational-num 0.030940 -0.038761 1.000000 0.125146
capital-gain 0.077229 -0.003706 0.125146 1.000000
capital-loss 0.056944 -0.004366 0.080972 -0.031441
hours-per-week 0.071558 -0.013519 0.143689 0.082157

capital-loss hours-per-week
age 0.056944 0.071558
fnlwgt -0.004366 -0.013519
educational-num 0.080972 0.143689
capital-gain -0.031441 0.082157
capital-loss 1.000000 0.054467
hours-per-week 0.054467 1.000000

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 48842 non-null int64
1 workclass 48842 non-null object
2 fnlwgt 48842 non-null int64
3 education 48842 non-null object
4 educational-num 48842 non-null int64
5 marital-status 48842 non-null object
6 occupation 48842 non-null object
7 relationship 48842 non-null object
8 race 48842 non-null object
9 gender 48842 non-null object
10 capital-gain 48842 non-null int64
11 capital-loss 48842 non-null int64
12 hours-per-week 48842 non-null int64
13 native-country 48842 non-null object
14 income 48842 non-null object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB

df.isna().sum()

age 0
workclass 0
fnlwgt 0
education 0
educational-num 0
marital-status 0
occupation 0
relationship 0
race 0
gender 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 0
income 0
dtype: int64
df.isin(['?']).sum()

age 0
workclass 2799
fnlwgt 0
education 0
educational-num 0
marital-status 0
occupation 2809
relationship 0
race 0
gender 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 857
income 0
dtype: int64

df.describe()

age fnlwgt educational-num capital-gain \

count 48842.000000 4.884200e+04 48842.000000 48842.000000
mean 38.643585 1.896641e+05 10.078089 1079.067626
std 13.710510 1.056040e+05 2.570973 7452.019058
min 17.000000 1.228500e+04 1.000000 0.000000
25% 28.000000 1.175505e+05 9.000000 0.000000
50% 37.000000 1.781445e+05 10.000000 0.000000
75% 48.000000 2.376420e+05 12.000000 0.000000
max 90.000000 1.490400e+06 16.000000 99999.000000

capital-loss hours-per-week
count 48842.000000 48842.000000
mean 87.502314 40.422382
std 403.004552 12.391444
min 0.000000 1.000000
25% 0.000000 40.000000
50% 0.000000 40.000000
75% 0.000000 45.000000
max 4356.000000 99.000000

df.duplicated().sum()

df = df.drop_duplicates()
print(df.duplicated().sum())

df['age'].value_counts()
36 1348
35 1336
33 1335
23 1325
31 1324
...
88 6
85 5
87 3
89 2
86 1
Name: age, Length: 74, dtype: int64

for i in df.columns:
    a = df[i].value_counts()
    print(f'\n\n\nvalues of {a}')

values of 36 1348
35 1336
33 1335
23 1325
31 1324
...
88 6
85 5
87 3
89 2
86 1
Name: age, Length: 74, dtype: int64

values of Private 33860

Self-emp-not-inc 3861
Local-gov 3136
? 2795
State-gov 1981
Self-emp-inc 1694
Federal-gov 1432
Without-pay 21
Never-worked 10
Name: workclass, dtype: int64

values of 203488 21
190290 19
120277 19
125892 18
126569 18
..
293579 1
114874 1
96279 1
509350 1
257302 1
Name: fnlwgt, Length: 28523, dtype: int64

values of HS-grad 15770

Some-college 10863
Bachelors 8013
Masters 2656
Assoc-voc 2060
11th 1812
Assoc-acdm 1601
10th 1389
7th-8th 954
Prof-school 834
9th 756
12th 655
Doctorate 594
5th-6th 507
1st-4th 245
Preschool 81
Name: education, dtype: int64

values of 9 15770
10 10863
13 8013
14 2656
11 2060
7 1812
12 1601
6 1389
4 954
15 834
5 756
8 655
16 594
3 507
2 245
1 81
Name: educational-num, dtype: int64
values of Married-civ-spouse 22366
Never-married 16082
Divorced 6630
Separated 1530
Widowed 1518
Married-spouse-absent 627
Married-AF-spouse 37
Name: marital-status, dtype: int64

values of Prof-specialty 6165

Craft-repair 6102
Exec-managerial 6082
Adm-clerical 5606
Sales 5501
Other-service 4919
Machine-op-inspct 3017
? 2805
Transport-moving 2355
Handlers-cleaners 2071
Farming-fishing 1485
Tech-support 1445
Protective-serv 982
Priv-house-serv 240
Armed-Forces 15
Name: occupation, dtype: int64

values of Husband 19703

Not-in-family 12557
Own-child 7569
Unmarried 5124
Wife 2331
Other-relative 1506
Name: relationship, dtype: int64

values of White 41714

Black 4683
Asian-Pac-Islander 1517
Amer-Indian-Eskimo 470
Other 406
Name: race, dtype: int64
values of Male 32614
Female 16176
Name: gender, dtype: int64

values of 0 44755
15024 513
7688 410
7298 364
99999 244
...
1111 1
7262 1
22040 1
1639 1
2387 1
Name: capital-gain, Length: 123, dtype: int64

values of 0 46508
1902 304
1977 253
1887 233
2415 72
...
2465 1
2080 1
155 1
1911 1
2201 1
Name: capital-loss, Length: 99, dtype: int64

values of 40 22773
50 4242
45 2715
60 2177
35 1934
...
69 1
87 1
94 1
82 1
79 1
Name: hours-per-week, Length: 96, dtype: int64

values of United-States 43792

Mexico 943
? 856
Philippines 294
Germany 206
Puerto-Rico 184
Canada 182
El-Salvador 155
India 151
Cuba 138
England 127
China 122
South 115
Jamaica 106
Italy 105
Dominican-Republic 103
Japan 92
Poland 87
Guatemala 86
Vietnam 86
Columbia 85
Haiti 75
Portugal 67
Taiwan 65
Iran 59
Greece 49
Nicaragua 49
Peru 46
Ecuador 45
France 38
Ireland 37
Hong 30
Thailand 30
Cambodia 28
Trinadad&Tobago 27
Laos 23
Yugoslavia 23
Outlying-US(Guam-USVI-etc) 23
Scotland 21
Honduras 20
Hungary 19
Holand-Netherlands 1
Name: native-country, dtype: int64
values of <=50K 37109
>50K 11681
Name: income, dtype: int64

df = df.replace('?',np.nan)

df.isna().sum()

age 0
workclass 2795
fnlwgt 0
education 0
educational-num 0
marital-status 0
occupation 2805
relationship 0
race 0
gender 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 856
income 0
dtype: int64
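The '?' placeholders can also be turned into NaN while the file is being read. The sketch below assumes a tiny in-memory CSV with made-up rows (not the real adult.csv) and uses the na_values parameter of pd.read_csv.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the '?' placeholder in adult.csv.
csv_text = (
    "age,workclass,occupation\n"
    "25,Private,Machine-op-inspct\n"
    "18,?,?\n"
)

# na_values tells pandas to treat '?' as missing while parsing.
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')

print(toy.isna().sum())
```

With this approach the separate df.replace('?', np.nan) step is not needed.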

Feature Scaling Methods

1. Min Max Scaler
2. Standard Scaler
3. Max Abs Scaler
4. Robust Scaler
5. Quantile Transformer Scaler
6. Power Transformer Scaler
7. Unit Vector Scaler
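The notebook demonstrates only the first two scalers. As a rough sketch of two more from the list, the block below applies MaxAbsScaler and RobustScaler from sklearn.preprocessing to the same 4x4 matrix used in the notebook's scaling cells; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, RobustScaler

# Same toy matrix the notebook uses for its scaling demos.
toy = np.array([[2, 3, 7, 30],
                [9, 4, 6, 1],
                [8, 15, 2, 40],
                [20, 10, 2, 6]], dtype=float)

# MaxAbsScaler divides each column by its largest absolute value.
max_abs_scaled = MaxAbsScaler().fit_transform(toy)

# RobustScaler subtracts the column median and divides by the column IQR,
# so a single extreme value has less influence than with min-max scaling.
robust_scaled = RobustScaler().fit_transform(toy)

print(max_abs_scaled)
print(robust_scaled)
```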
# Minmax Scaling
import numpy as np
from sklearn.preprocessing import MinMaxScaler
df2=np.array([[2, 3, 7, 30],
[9, 4, 6, 1],
[8, 15, 2, 40],
[20, 10, 2, 6]])
print(df2)

[[ 2 3 7 30]
[ 9 4 6 1]
[ 8 15 2 40]
[20 10 2 6]]

import matplotlib.pyplot as plt

fig = plt.figure(figsize =(10, 7))
plt.boxplot(df2)
plt.show()
scaler = MinMaxScaler()
scaler.fit(df2)
scaled_features = scaler.transform(df2)
print(scaled_features)

[[0. 0. 1. 0.74358974]
[0.38888889 0.08333333 0.8 0. ]
[0.33333333 1. 0. 1. ]
[1. 0.58333333 0. 0.12820513]]

fig = plt.figure(figsize =(10, 7))

plt.boxplot(scaled_features)
plt.show()
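Min-max scaling maps each column to [0, 1] with (x - min) / (max - min). The short check below recomputes the scaled matrix by hand and compares it with MinMaxScaler; it is only a sanity check on the same toy matrix, with illustrative variable names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy = np.array([[2, 3, 7, 30],
                [9, 4, 6, 1],
                [8, 15, 2, 40],
                [20, 10, 2, 6]], dtype=float)

# Column-wise minimum and maximum.
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)

# Manual min-max formula applied to every column at once.
manual_minmax = (toy - col_min) / (col_max - col_min)

sklearn_minmax = MinMaxScaler().fit_transform(toy)
print(np.allclose(manual_minmax, sklearn_minmax))
```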
#Standardization

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
sc_X = sc_X.fit_transform(df2)
print(sc_X)

[[-1.19319056 -1.03142125 1.20740686 0.66200869]

[-0.11547005 -0.825137 0.76834982 -1.12387522]
[-0.26943013 1.44398974 -0.98787834 1.27783073]
[ 1.57809074 0.4125685 -0.98787834 -0.8159642 ]]

fig = plt.figure(figsize =(10, 7))

plt.boxplot(sc_X)
plt.show()
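Standardization rescales each column to zero mean and unit variance with z = (x - mean) / std. The sketch below reproduces the StandardScaler output by hand; note that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[2, 3, 7, 30],
                [9, 4, 6, 1],
                [8, 15, 2, 40],
                [20, 10, 2, 6]], dtype=float)

# Column-wise mean and population standard deviation (ddof=0).
col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)

# Manual z-score for every entry.
manual_z = (toy - col_mean) / col_std

sklearn_z = StandardScaler().fit_transform(toy)
print(np.allclose(manual_z, sklearn_z))
```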
Data cleaning
1. Handling Missing Values
2. Managing Outliers
3. Removal of unwanted observations

Handling Missing Values

1. Leave as it is
2. Filling the missing values
3. Drop them

Filling Missing Values

Drop the rows with missing values

DataFrameName.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
axis: Whether rows or columns are dropped. Use 0 or 'index' to drop rows and 1 or 'columns' to drop columns.
how: Either 'any' or 'all'. 'any' drops the row/column if any value is null; 'all' drops it only if all values are null.
thresh: An integer giving the minimum number of non-null values a row/column must contain to be kept.
subset: A list of labels that limits the null check to those columns (or rows, when dropping columns).
inplace: A boolean; if True, the change is made in the data frame itself.
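To make the dropna parameters concrete, here is a small sketch on a made-up three-column frame (the data and variable names are hypothetical). Each call is shown separately because how and thresh are alternatives.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely null, rows 0 and 3 are partly null.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [np.nan, np.nan, 6.0, 8.0],
    'c': [9.0, np.nan, 11.0, 12.0],
})

any_dropped = toy.dropna(how='any')      # keep rows with no nulls at all
all_dropped = toy.dropna(how='all')      # drop only rows that are fully null
thresh_kept = toy.dropna(thresh=2)       # keep rows with at least 2 non-null values
subset_kept = toy.dropna(subset=['a'])   # only look at column 'a' when deciding

print(list(any_dropped.index), list(all_dropped.index))
print(list(thresh_kept.index), list(subset_kept.index))
```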

Fill the rows with missing values

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
value: A scalar, dictionary, Series or DataFrame used to fill in place of NaN.
method: Used when no value is passed. 'ffill' (alias 'pad') carries the last valid value forward; 'bfill' (alias 'backfill') fills with the next valid value. Recent pandas releases deprecate this argument in favour of the ffill() and bfill() methods.
axis: The axis to fill along. Use 0 or 'index' for rows and 1 or 'columns' for columns.
inplace: A boolean; if True, the change is made in the data frame itself.
limit: An integer giving the maximum number of consecutive NaN values to fill forward/backward.
downcast: A dict specifying which dtype to downcast to, for example float64 to int64.
**kwargs: Any other keyword arguments.
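The fill options described above can be sketched on a small made-up frame before they are applied to the NBA data. The column names and values here are hypothetical; ffill() and bfill() are the method-style equivalents of method='ffill' and method='bfill'.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in a numeric and a text column.
toy = pd.DataFrame({
    'salary': [100.0, np.nan, np.nan, 400.0],
    'college': ['Texas', None, 'LSU', None],
})

# A dict passed as value fills each column with its own constant.
filled_const = toy.fillna({'salary': 0.0, 'college': 'Unknown'})

# Forward fill, but fill at most one consecutive NaN.
forward = toy['salary'].ffill(limit=1)

# Backward fill: each NaN takes the next valid value.
backward = toy['salary'].bfill()

print(filled_const)
print(forward.tolist(), backward.tolist())
```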
data = pd.read_csv('/Users/nageshjadhav/Desktop/nba.csv')

data.head()

            Name            Team  Number  Position   Age  Height  Weight  \
0  Avery Bradley  Boston Celtics     0.0        PG  25.0     6-2   180.0
1    Jae Crowder  Boston Celtics    99.0        SF  25.0     6-6   235.0
2   John Holland  Boston Celtics    30.0        SG  27.0     6-5   205.0
3    R.J. Hunter  Boston Celtics    28.0        SG  22.0     6-5   185.0
4  Jonas Jerebko  Boston Celtics     8.0        PF  29.0    6-10   231.0

             College     Salary
0              Texas  7730337.0
1          Marquette  6796117.0
2  Boston University        NaN
3      Georgia State  1148640.0
4                NaN  5000000.0

data.isna().sum()

Name 1
Team 1
Number 1
Position 1
Age 1
Height 1
Weight 1
College 85
Salary 12
dtype: int64

data['Salary'] = data['Salary'].ffill()  # forward fill: copy the last valid salary down

data.head(10)

            Name            Team  Number  Position   Age  Height  Weight  \
0  Avery Bradley  Boston Celtics     0.0        PG  25.0     6-2   180.0
1    Jae Crowder  Boston Celtics    99.0        SF  25.0     6-6   235.0
2   John Holland  Boston Celtics    30.0        SG  27.0     6-5   205.0
3    R.J. Hunter  Boston Celtics    28.0        SG  22.0     6-5   185.0
4  Jonas Jerebko  Boston Celtics     8.0        PF  29.0    6-10   231.0
5   Amir Johnson  Boston Celtics    90.0        PF  29.0     6-9   240.0
6  Jordan Mickey  Boston Celtics    55.0        PF  21.0     6-8   235.0
7   Kelly Olynyk  Boston Celtics    41.0         C  25.0     7-0   238.0
8   Terry Rozier  Boston Celtics    12.0        PG  22.0     6-2   190.0
9   Marcus Smart  Boston Celtics    36.0        PG  22.0     6-4   220.0

             College      Salary
0              Texas   7730337.0
1          Marquette   6796117.0
2  Boston University   6796117.0
3      Georgia State   1148640.0
4                NaN   5000000.0
5                NaN  12000000.0
6                LSU   1170960.0
7            Gonzaga   2165160.0
8         Louisville   1824360.0
9     Oklahoma State   3431040.0

data['College'] = data['College'].bfill()  # backward fill: copy the next valid college up

Simple Imputer
SimpleImputer replaces missing values with a value chosen by the given strategy.
missing_values: The placeholder for the missing values that have to be imputed. The default is NaN.
strategy: How the replacement value is computed. It can be 'mean' (default), 'median', 'most_frequent' or 'constant'.
fill_value: The constant used to replace missing values when strategy='constant'.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)

data['Salary'] = imputer.fit_transform(data[['Salary']])

data.head()

            Name            Team  Number  Position   Age  Height  Weight  \
0  Avery Bradley  Boston Celtics     0.0        PG  25.0     6-2   180.0
1    Jae Crowder  Boston Celtics    99.0        SF  25.0     6-6   235.0
2   John Holland  Boston Celtics    30.0        SG  27.0     6-5   205.0
3    R.J. Hunter  Boston Celtics    28.0        SG  22.0     6-5   185.0
4  Jonas Jerebko  Boston Celtics     8.0        PF  29.0    6-10   231.0

             College     Salary
0              Texas  7730337.0
1          Marquette  6796117.0
2  Boston University  6796117.0
3      Georgia State  1148640.0
4                LSU  5000000.0

imputer1 = SimpleImputer(strategy='most_frequent',
missing_values=np.nan)
data['College'] = imputer1.fit_transform(data[['College']])

data.head()

            Name            Team  Number  Position   Age  Height  Weight  \
0  Avery Bradley  Boston Celtics     0.0        PG  25.0     6-2   180.0
1    Jae Crowder  Boston Celtics    99.0        SF  25.0     6-6   235.0
2   John Holland  Boston Celtics    30.0        SG  27.0     6-5   205.0
3    R.J. Hunter  Boston Celtics    28.0        SG  22.0     6-5   185.0
4  Jonas Jerebko  Boston Celtics     8.0        PF  29.0    6-10   231.0

             College     Salary
0              Texas  7730337.0
1          Marquette  6796117.0
2  Boston University  6796117.0
3      Georgia State  1148640.0
4                LSU  5000000.0
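The cells above use the 'mean' and 'most_frequent' strategies. The remaining two, 'median' and 'constant', can be sketched the same way on a small made-up frame; the column values and the 'Unknown' fill value are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one gap in a numeric and one in a text column.
toy = pd.DataFrame({
    'Age': [25.0, np.nan, 27.0, 22.0],
    'College': ['Texas', np.nan, 'LSU', 'Texas'],
})

# 'median' fills the gap with the median of the observed ages.
median_imputer = SimpleImputer(strategy='median')
age_filled = median_imputer.fit_transform(toy[['Age']]).ravel()

# 'constant' fills every gap with the value given in fill_value.
constant_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
college_filled = constant_imputer.fit_transform(toy[['College']]).ravel()

print(age_filled)
print(college_filled)
```

fit_transform returns a two-dimensional array, so ravel() is used here to flatten the single column before further use.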

df['workclass'].unique()

array(['Private', 'Local-gov', nan, 'Self-emp-not-inc', 'Federal-gov',

'State-gov', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
dtype=object)

from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
df['workclass'] = imp.fit_transform(df[['workclass']])
df['occupation'] = imp.fit_transform(df[['occupation']])
df['native-country'] = imp.fit_transform(df[['native-country']])

df.isna().sum()

       age     workclass  fnlwgt     education  educational-num  \
0       25       Private  226802          11th                7
1       38       Private   89814       HS-grad                9
2       28     Local-gov  336951    Assoc-acdm               12
3       44       Private  160323  Some-college               10
4       18       Private  103497  Some-college               10
...    ...           ...     ...           ...              ...
48837   27       Private  257302    Assoc-acdm               12
48838   40       Private  154374       HS-grad                9
48839   58       Private  151910       HS-grad                9
48840   22       Private  201490       HS-grad                9
48841   52  Self-emp-inc  287927       HS-grad                9

           marital-status         occupation  relationship   race  gender  \
0           Never-married  Machine-op-inspct     Own-child  Black    Male
1      Married-civ-spouse    Farming-fishing       Husband  White    Male
2      Married-civ-spouse    Protective-serv       Husband  White    Male
3      Married-civ-spouse  Machine-op-inspct       Husband  Black    Male
4           Never-married     Prof-specialty     Own-child  White  Female
...                   ...                ...           ...    ...     ...
48837  Married-civ-spouse       Tech-support          Wife  White  Female
48838  Married-civ-spouse  Machine-op-inspct       Husband  White    Male
48839             Widowed       Adm-clerical     Unmarried  White  Female
48840       Never-married       Adm-clerical     Own-child  White    Male
48841  Married-civ-spouse    Exec-managerial          Wife  White  Female

       capital-gain  capital-loss  hours-per-week  native-country  income
0                 0             0              40   United-States   <=50K
1                 0             0              50   United-States   <=50K
2                 0             0              40   United-States    >50K
3              7688             0              40   United-States    >50K
4                 0             0              30   United-States   <=50K
...             ...           ...             ...             ...     ...
48837             0             0              38   United-States   <=50K
48838             0             0              40   United-States    >50K
48839             0             0              40   United-States   <=50K
48840             0             0              20   United-States   <=50K
48841         15024             0              40   United-States    >50K

[48790 rows x 15 columns]

Outlier Detection
1. Using Boxplot
2. Using Scatter plot
3. Using Z score
4. Using Inter Quartile Range
# Using Boxplot
import seaborn as sns
sns.boxplot(x=data['Weight'])

<AxesSubplot:xlabel='Weight'>

print(np.where(data['Weight']>290))

(array([405]),)
sns.boxplot(x=df['age'])

<AxesSubplot:xlabel='age'>

print(df['age'].unique())

[25 38 28 44 18 34 29 63 24 55 65 36 26 58 48 43 20 37 40 72 45 22 23
54
32 46 56 17 39 52 21 42 33 30 47 41 19 69 50 31 59 49 51 27 57 61 64
79
73 53 77 80 62 35 68 66 75 60 67 71 70 90 81 74 78 82 83 85 76 84 89
88
87 86]

np.where(df['age'] > 78)

#c = np.where(df['age'] > 80)
#t = c[0]
#p = []
#for i in t:
# p.append(df.age[i])
# print(p)
(array([ 193, 234, 898, 925, 950, 1078, 1397, 1833, 2084,
2289, 2981, 3495, 3667, 4454, 4645, 4657, 6401, 6576,
6756, 6914, 6958, 6975, 6978, 7159, 7169, 7413, 7418,
7538, 7546, 7936, 8205, 8312, 8426, 8954, 8981, 9017,
9037, 9080, 9278, 9768, 9887, 10038, 10198, 10222, 10734,
11286, 11325, 11407, 11834, 11868, 11878, 11937, 12057, 12226,
12443, 13022, 13954, 14029, 14259, 14295, 14427, 14564, 14587,
14736, 15084, 15094, 15404, 15930, 15958, 15998, 16101, 16143,
16246, 16350, 16498, 16706, 17194, 17316, 17444, 18211, 18578,
19029, 19166, 19181, 19485, 19612, 19810, 20050, 20236, 20343,
20382, 20992, 21106, 21542, 21561, 21640, 21676, 22270, 22443,
22484, 22502, 22709, 22894, 23018, 23751, 23990, 24142, 24445,
24650, 24700, 24791, 24963, 25075, 25231, 25241, 25737, 26388,
26474, 26809, 27363, 27502, 27776, 27796, 27994, 28259, 28714,
28755, 29093, 29238, 29288, 29289, 29557, 29958, 30190, 30366,
30421, 30865, 30971, 31016, 31163, 31615, 31921, 32151, 32561,
32782, 33021, 33160, 33867, 34294, 34398, 34529, 34534, 34670,
34816, 34980, 35087, 35300, 35427, 35435, 35467, 35744, 35750,
35770, 35943, 36001, 36082, 36503, 36675, 36717, 36736, 36737,
36862, 37078, 37132, 37205, 37594, 37751, 38062, 38085, 38468,
38726, 39139, 39142, 39703, 40144, 40271, 40287, 40482, 40524,
40639, 40804, 41406, 41546, 41640, 42253, 42483, 42971, 44035,
44415, 44701, 44958, 45184, 45829, 45959, 47261, 47663, 47927,
48045, 48067, 48086, 48507, 48597, 48688, 48723, 48754]),)

print(df['age'].iloc[193])  # np.where returns positions, so look the row up by position

sorted_df = df.sort_values(by=['age'], ascending=True)

Q1 = np.percentile(sorted_df['age'], 25)
Q3 = np.percentile(sorted_df['age'], 75)
IQR = Q3 - Q1
print(IQR)

20.0

lwr_bound = Q1-(1.5*IQR)
upr_bound = Q3+(1.5*IQR)
print(lwr_bound, upr_bound)

-2.0 78.0

outliers = []
for i in df['age']:
    if i < lwr_bound or i > upr_bound:
        outliers.append(i)

print(outliers)
print(len(outliers))
[79, 80, 90, 79, 80, 81, 82, 83, 81, 85, 80, 90, 81, 84, 81, 89, 81,
83, 81, 82, 80, 90, 81, 83, 80, 90, 90, 84, 80, 80, 80, 81, 90, 85,
90, 81, 81, 80, 80, 79, 81, 80, 88, 87, 90, 79, 83, 79, 80, 90, 79,
79, 81, 81, 90, 82, 90, 87, 81, 88, 80, 81, 80, 81, 90, 88, 89, 84,
80, 80, 83, 79, 81, 79, 90, 80, 81, 90, 88, 90, 90, 80, 90, 81, 82,
79, 81, 80, 83, 90, 90, 79, 81, 90, 80, 90, 90, 79, 79, 84, 90, 80,
90, 81, 83, 84, 81, 79, 85, 82, 79, 80, 90, 90, 90, 84, 80, 90, 90,
79, 84, 90, 79, 90, 90, 90, 82, 81, 90, 84, 79, 81, 82, 81, 80, 90,
80, 84, 82, 79, 90, 84, 90, 83, 79, 81, 80, 79, 80, 79, 80, 90, 90,
80, 90, 90, 81, 83, 82, 90, 90, 81, 80, 80, 90, 79, 80, 82, 85, 80,
79, 90, 81, 79, 80, 79, 81, 82, 88, 90, 82, 88, 84, 83, 79, 86, 90,
90, 82, 83, 81, 79, 90, 80, 81, 79, 84, 84, 79, 90, 80, 81, 81, 81,
90, 87, 90, 80, 80, 82, 90, 90, 85, 82, 81]
215
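The Z-score method from the list above is not demonstrated in the notebook. A minimal sketch, assuming a made-up age sample and the common (but arbitrary) cutoff of 3 standard deviations, could look like this:

```python
import numpy as np

# Hypothetical sample: ages 20..59 repeated five times, plus one extreme value.
ages = np.array(list(range(20, 60)) * 5 + [150], dtype=float)

# Z-score: distance from the mean in units of standard deviation.
z_scores = (ages - ages.mean()) / ages.std()

# Flag values more than 3 standard deviations from the mean (assumed cutoff).
outlier_mask = np.abs(z_scores) > 3

print(ages[outlier_mask])
```

Unlike the IQR rule, the Z-score depends on the mean and standard deviation, which are themselves pulled by extreme values, so the two methods can flag different points.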

Handling Outliers
1. Trimming/Remove the outliers
2. Quantile based flooring and capping
3. Mean/Median imputation
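median = np.median(df['age'])  # Replace with median
print(median)

# Replace every age outside the IQR bounds with the median in one vectorized step.
is_outlier = (df['age'] < lwr_bound) | (df['age'] > upr_bound)
c = np.where(is_outlier, int(median), df['age'])
print("New array: ", c)
print(c.shape)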

37.0
New array: [25 38 28 ... 58 22 52]
(48790,)

df['income'].value_counts()

<=50K 37109
>50K 11681
Name: income, dtype: int64

sns.countplot(x=df['income'])

<AxesSubplot:xlabel='income', ylabel='count'>
Handling Categorical Data
1. Replacing values
2. Encoding labels
3. One-Hot encoding
print(df['gender'].unique())

['Male' 'Female']

df['gender'] = df['gender'].replace('Male',1)
df['gender'] = df['gender'].replace('Female',0)

df['income'].unique()

array(['<=50K', '>50K'], dtype=object)

# Collapse the seven marital-status categories into four broader groups.
# 'Separated' and 'Widowed' keep their names, so they need no entry.
marital_map = {
    'Never-married': 'NotMarried',
    'Married-AF-spouse': 'Married',
    'Married-civ-spouse': 'Married',
    'Married-spouse-absent': 'NotMarried',
    'Divorced': 'Separated',
}
df['marital-status'] = df['marital-status'].replace(marital_map)

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['income'] = le.fit_transform(df['income'])
df['race'] = le.fit_transform(df['race'])
df['marital-status'] = le.fit_transform(df['marital-status'])
print(df['income'].unique())

[0 1]

df['education'].unique()

array(['11th', 'HS-grad', 'Assoc-acdm', 'Some-college', '10th',

'Prof-school', '7th-8th', 'Bachelors', 'Masters', 'Doctorate',
'5th-6th', 'Assoc-voc', '9th', '12th', '1st-4th', 'Preschool'],
dtype=object)

# Group the 16 education levels into six broader categories.
# 'Bachelors', 'Masters' and 'Doctorate' keep their names, so they need no entry.
education_map = {
    'Preschool': 'dropout', '1st-4th': 'dropout', '5th-6th': 'dropout',
    '7th-8th': 'dropout', '9th': 'dropout', '10th': 'dropout',
    '11th': 'dropout', '12th': 'dropout',
    'HS-grad': 'HighGrad',
    'Some-college': 'CommunityCollege',
    'Assoc-acdm': 'CommunityCollege',
    'Assoc-voc': 'CommunityCollege',
    'Prof-school': 'Masters',
}
df['education'] = df['education'].replace(education_map)

df['education'].unique()

array(['dropout', 'HighGrad', 'CommunityCollege', 'Masters',

'Bachelors',
'Doctorate'], dtype=object)

df['education'] = le.fit_transform(df['education'])
df['education'].unique()

array([5, 3, 1, 4, 0, 2])
df['marital-status'].unique()

array([1, 0, 3, 2])

df.head()

   age  workclass  fnlwgt  education  educational-num  marital-status  \
0   25    Private  226802          5                7               1
1   38    Private   89814          3                9               0
2   28  Local-gov  336951          1               12               0
3   44    Private  160323          1               10               0
4   18    Private  103497          1               10               1

          occupation  relationship  race  gender  capital-gain  capital-loss  \
0  Machine-op-inspct     Own-child     2       1             0             0
1    Farming-fishing       Husband     4       1             0             0
2    Protective-serv       Husband     4       1             0             0
3  Machine-op-inspct       Husband     2       1          7688             0
4     Prof-specialty     Own-child     4       0             0             0

   hours-per-week  native-country  income
0              40   United-States       0
1              50   United-States       0
2              40   United-States       1
3              40   United-States       1
4              30   United-States       0
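One-hot encoding, the third option in the categorical-data list, is not shown in the notebook. Label encoding assigns integer codes in alphabetical order, which a model may read as a ranking; one-hot encoding avoids that by creating one 0/1 column per category. A minimal sketch with pd.get_dummies on a made-up frame (hypothetical rows and variable names):

```python
import pandas as pd

# Hypothetical rows using category names that appear in the workclass column.
toy = pd.DataFrame({
    'age': [25, 38, 28, 44],
    'workclass': ['Private', 'Local-gov', 'Private', 'State-gov'],
})

# One indicator column per workclass value; other columns are left untouched.
encoded = pd.get_dummies(toy, columns=['workclass'], dtype=int)

print(encoded)
```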

X = df.drop('income',axis=1)

y = df.income

print('Shape of X (input)', X.shape)

print('Shape of y (output or label)', y.shape)

Shape of X (input) (48790, 14)

Shape of y (output or label) (48790,)

from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain,ytest = train_test_split(X,y,
test_size = 0.2,
random_state=42,
shuffle=True)

print('Shape of Training (X)', Xtrain.shape)

print('Shape of Training (y)', ytrain.shape)
print('Shape of Test (X)', Xtest.shape)
print('Shape of Test (y)', ytest.shape)

Shape of Training (X) (39032, 14)

Shape of Training (y) (39032,)
Shape of Test (X) (9758, 14)
Shape of Test (y) (9758,)
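Because the income classes are imbalanced (37109 vs 11681), a plain random split can shift the class ratio slightly between the training and test sets. One option, not used in the notebook, is the stratify argument of train_test_split. The sketch below uses made-up data with an 80/20 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 80 of class 0 and 20 of class 1.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print(y_train.mean(), y_test.mean())
```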

Random Sampling
yes = df[df['income']==1]
no = df[df['income']==0]

print(yes.shape)
print(no.shape)

(11681, 15)
(37109, 15)

no_sample = no.sample(n=len(yes), random_state=42)  # undersample the majority class to the minority count

no_sample.shape

(11681, 15)

sampled_df = pd.concat([yes,no_sample],axis=0)

sampled_df.shape

(23362, 15)

sns.countplot(x=sampled_df['income'])

<AxesSubplot:xlabel='income', ylabel='count'>
sampled_df.head()

    age         workclass  fnlwgt  education  educational-num  marital-status  \
2    28         Local-gov  336951          1               12               0
3    44           Private  160323          1               10               0
7    63  Self-emp-not-inc  104626          4               15               0
10   65           Private  184454          3                9               0
14   48           Private  279724          3                9               0

           occupation  relationship  race  gender  capital-gain  capital-loss  \
2     Protective-serv       Husband     4       1             0             0
3   Machine-op-inspct       Husband     2       1          7688             0
7      Prof-specialty       Husband     4       1          3103             0
10  Machine-op-inspct       Husband     4       1          6418             0
14  Machine-op-inspct       Husband     4       1          3103             0

    hours-per-week  native-country  income
2               40   United-States       1
3               40   United-States       1
7               32   United-States       1
10              40   United-States       1
14              48   United-States       1
