Numpy
import numpy as np
x1 = np.array([4, 3, 4, 4, 8, 4])
x1
array([4, 3, 4, 4, 8, 4])
Array Slicing
Now, we'll learn how to access multiple elements or a range of elements from an array.
x = np.arange(10)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
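To make this concrete, here are a few common slices of the array x defined above (results shown as comments):
x[:5]      # array([0, 1, 2, 3, 4]) - first five elements
x[4:7]     # array([4, 5, 6]) - elements from index 4 up to, but not including, 7
x[::2]     # array([0, 2, 4, 6, 8]) - every other element
x[::-1]    # array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0]) - all elements, reversed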
Array Concatenation
Often, we need to combine different arrays. So, instead of typing each of their elements manually, you can use array concatenation to handle such tasks easily.
#Using its axis parameter, you can define row-wise or column-wise concatenation
grid = np.array([[1, 2, 3], [4, 5, 6]])
np.concatenate([grid, grid], axis=1)
array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])
Until now, we used the concatenation function on arrays of equal dimension. But what if you need to combine a 2D array with a 1D array? In such situations, np.concatenate might not be the best option to use. Instead, you can use np.vstack or np.hstack to do the task. Let's see how!
x = np.array([3,4,5])
grid = np.array([[1,2,3],[17,18,19]])
np.vstack([x,grid])
array([[ 3,  4,  5],
       [ 1,  2,  3],
       [17, 18, 19]])
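np.hstack stacks arrays horizontally in the same way; for instance, appending a column to grid (a quick illustration):
z = np.array([[9], [9]])
np.hstack([grid, z])
array([[ 1,  2,  3,  9],
       [17, 18, 19,  9]])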
Also, we can split the arrays based on pre-defined positions. Let's see how!
x = np.arange(10)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
x1,x2,x3 = np.split(x,[3,6])
print(x1, x2, x3)
[0 1 2] [3 4 5] [6 7 8 9]
grid = np.arange(16).reshape((4,4))
grid
upper,lower = np.vsplit(grid,[2])
print (upper, lower)
(array([[0, 1, 2, 3],
[4, 5, 6, 7]]), array([[ 8, 9, 10, 11],
[12, 13, 14, 15]]))
In addition to the functions we learned above, the numpy library provides several other mathematical functions such as sum, divide, multiply, abs, power, mod, sin, cos, tan, log, var, min, mean, and max, which can be used to perform basic arithmetic calculations. Feel free to refer to the numpy documentation for more information on such functions.
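For instance, a few of these applied to the array x from above:
np.sum(x)              # 45
np.mean(x)             # 4.5
np.min(x), np.max(x)   # (0, 9)
np.power(x, 2)         # array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])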
Let's move on to pandas now. Make sure you follow each line below, because it'll help you in doing data manipulation using pandas.
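The pandas examples start from a small data frame of countries and their ranks; it was presumably built along these lines (the values are read off the output below):
import pandas as pd
data = pd.DataFrame({'Country': ['Russia', 'Colombia', 'Chile', 'Equador', 'Nigeria'],
                     'Rank': [121, 40, 100, 130, 11]})
data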
    Country  Rank
0    Russia   121
1  Colombia    40
2     Chile   100
3   Equador   130
4   Nigeria    11
data.describe()
             Rank
count    5.000000
mean    80.400000
std     52.300096
min     11.000000
25%     40.000000
50%    100.000000
75%    121.000000
max    130.000000
Remember, the describe() method computes summary statistics for integer / double variables. To get complete information about the data set, we can use the info() function.
#Among other things, it shows the data set has 5 rows and 2 columns with their respective names.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
Country 5 non-null object
Rank 5 non-null int64
dtypes: int64(1), object(1)
memory usage: 152.0+ bytes
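The next examples work on a frame of groups and ounces; a construction consistent with the output below would be:
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data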
  group  ounces
0     a     4.0
1     a     3.0
2     a    12.0
3     b     6.0
4     b     7.5
5     b     8.0
6     c     3.0
7     c     5.0
8     c     6.0
#Let's sort the data frame by ounces - inplace=True will make changes to the data
data.sort_values(by=['ounces'],ascending=True,inplace=False)
  group  ounces
1     a     3.0
6     c     3.0
0     a     4.0
7     c     5.0
3     b     6.0
8     c     6.0
4     b     7.5
5     b     8.0
2     a    12.0
We can sort the data by not just one column but multiple columns as well.
data.sort_values(by=['group','ounces'],ascending=[True,False],inplace=False)
  group  ounces
2     a    12.0
0     a     4.0
1     a     3.0
5     b     8.0
4     b     7.5
3     b     6.0
8     c     6.0
7     c     5.0
6     c     3.0
Often, we get data sets with duplicate rows, which is nothing but noise. Therefore, before training the
model, we need to make sure we get rid of such inconsistencies in the data set. Let's see how we
can remove duplicate rows.
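The frame used here can be built as follows (values taken from the output below):
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [3, 2, 1, 3, 3, 4, 4]})
data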
    k1  k2
0  one   3
1  one   2
2  one   1
3  two   3
4  two   3
5  two   4
6  two   4
#sort values
data.sort_values(by='k2')
k1 k2
2 one 1
1 one 2
0 one 3
3 two 3
4 two 3
5 two 4
6 two 4
data.drop_duplicates()
    k1  k2
0  one   3
1  one   2
2  one   1
3  two   3
5  two   4
Here, we removed duplicates based on matching row values across all columns. Alternatively, we
can also remove duplicates based on a particular column. Let's remove duplicate values from the k1
column.
data.drop_duplicates(subset='k1')
k1 k2
0 one 3
3 two 3
Now, we will learn to categorize rows based on a predefined criterion. This happens a lot in data processing, where you need to categorize a variable. For example, say we have a column with country names and we want to create a new variable 'continent' based on these country names. In such situations, we take the steps shown below.
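The food data used below was presumably constructed along these lines; the rows not visible in the surviving output (pulled pork, honey ham) and their ounce values are an assumption based on the mapping dictionary further down:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef',
                              'Bacon', 'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data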
       food  ounces
0     bacon     4.0
2     bacon    12.0
3  Pastrami     6.0
5     Bacon     8.0
6  pastrami     3.0
Now, we want to create a new variable which indicates the type of animal that acts as the source of the food. To do that, first we'll create a dictionary that maps each food to an animal. Then, we'll use the map function to apply the dictionary to the food column. Let's see how it is done.
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
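With the dictionary in place, the mapping itself would look something like this (a sketch; the lower-casing makes the food values match the dictionary keys):
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
data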
def meat_2_animal(series):
    if series['food'] == 'bacon':
        return 'pig'
    elif series['food'] == 'pulled pork':
        return 'pig'
    elif series['food'] == 'pastrami':
        return 'cow'
    elif series['food'] == 'corned beef':
        return 'cow'
    elif series['food'] == 'honey ham':
        return 'pig'
    else:
        return 'salmon'
#another way of doing it is to convert the food values to lower case and apply the function
lower = lambda x: x.lower()
data['food'] = data['food'].apply(lower)
data['animal2'] = data.apply(meat_2_animal, axis='columns')
data
(The frame now carries the new animal and animal2 columns; for example, row 4 - corned beef, 7.5 ounces - maps to cow in both.)
Another way to create a new variable is by using the assign function. As you keep discovering new functions in this tutorial, you'll realize how powerful pandas is.
data.assign(new_variable = data['ounces']*10)
(Output: assign returns a copy with new_variable = ounces * 10; for example, row 8 - nova lox, 6.0 ounces, salmon - gets new_variable 60.0.)
We frequently find missing values in our data set. A quick method for imputing a missing value is to fill it with some constant or arbitrary number. Besides missing values, you may also find lots of outliers in your data set, which might require replacing. Let's see how we can replace values.
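A minimal sketch of replace(), assuming a series in which the sentinel value -999 marks missing entries (the series and value are illustrative):
s = pd.Series([1.0, -999.0, 2.0, -999.0, 3.0])
s.replace(-999, np.nan, inplace=True)
s   # the -999 entries are now NaN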
Now, let's learn how to rename column names and axis (row names).
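The frame used for renaming isn't constructed in the text; one way to build it, consistent with the values shown below, is:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data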
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
#Using rename function
data.rename(index={'Ohio': 'SanF'}, columns={'one': 'one_p', 'two': 'two_p'}, inplace=True)
data
          one_p  two_p  three  four
SanF          0      1      2     3
Colorado      4      5      6     7
New York      8      9     10    11
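The next output shows the row labels in upper case; presumably a string function was passed to rename, along these lines:
data.rename(index=str.upper, inplace=True)
data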
          one_p  two_p  three  four
SANF          0      1      2     3
COLORADO      4      5      6     7
NEW YORK      8      9     10    11
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
We'll divide the ages into bins such as 18-25, 26-35, 36-60, and 60 and above.
#Understand the output - '(' means the value is excluded from the bin, ']' means the value is included
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100],
(35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
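The counts below refer to named bins, so the categories were presumably given labels along these lines (label names read off the output):
bin_names = ['Youth', 'YoungAdult', 'MiddleAge', 'Senior']
new_cats = pd.cut(ages, bins, labels=bin_names)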
pd.value_counts(new_cats)
Youth         5
MiddleAge     3
YoungAdult    3
Senior        1
dtype: int64
#we can also calculate their cumulative sum
pd.value_counts(new_cats).cumsum()
Youth          5
MiddleAge      8
YoungAdult    11
Senior        12
dtype: int64
Let's proceed and learn about grouping data and creating pivots in pandas. It's an immensely
important data analysis method which you'd probably have to use on every data set you work with.
(A data frame with two numeric columns of random values, data1 and data2, and two key columns, key1 and key2, used for the grouping examples.)
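The frame itself isn't constructed in the surviving text; a minimal sketch of such a frame and a simple groupby (the key2 values and the random numbers are illustrative):
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df.groupby('key1')['data1'].mean()   # mean of data1 within each key1 group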
Pandas also supports label-based selection with loc. The following outputs come from a data frame of random numbers indexed by dates, with columns A to D, from which rows are selected by their index labels and columns by name:

                   A         B         C         D
2013-01-01  1.030816 -1.276989  0.837720 -1.490111
2013-01-02 -1.070215 -0.209129  0.604572 -1.743058
2013-01-03  1.524227  1.863575  1.291378  1.300696
2013-01-04  0.918203 -0.158800 -0.964063 -1.990779
2013-01-05  0.089731  0.114854 -0.585815  0.298772
2013-01-06  0.222260  0.435183 -0.045748  0.049898
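A sketch of how such a frame and the label-based selections could be written (the construction and the exact slices are assumptions; the random values will differ from those shown):
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df.loc[:, ['A', 'B']]                       # all rows, columns A and B
df.loc['20130102':'20130104', ['A', 'B']]   # a label-based row slice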
#returns specific rows and columns using lists of row and column indexes
df.iloc[[1,5],[0,2]]
                   A         C
2013-01-02 -1.070215  0.604572
2013-01-06  0.222260 -0.045748
Similarly, we can do Boolean indexing based on column values as well. This helps in filtering a data
set based on a pre-defined condition.
df[df.A > 1]
                   A         B         C         D
2013-01-01  1.030816 -1.276989  0.837720 -1.490111
2013-01-03  1.524227  1.863575  1.291378  1.300696
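Before the next output, a new column E is added to the frame; judging from the values shown, it was assigned roughly like this:
df['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df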
                   A         B         C         D      E
2013-01-01  1.030816 -1.276989  0.837720 -1.490111    one
2013-01-02 -1.070215 -0.209129  0.604572 -1.743058    one
2013-01-03  1.524227  1.863575  1.291378  1.300696    two
2013-01-04  0.918203 -0.158800 -0.964063 -1.990779  three
2013-01-05  0.089731  0.114854 -0.585815  0.298772   four
2013-01-06  0.222260  0.435183 -0.045748  0.049898  three
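The next output keeps only the rows whose E value is 'one' or 'three', which suggests a filter along these lines (an assumption):
df[df['E'].isin(['one', 'three'])]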
                   A         B         C         D      E
2013-01-01  1.030816 -1.276989  0.837720 -1.490111    one
2013-01-02 -1.070215 -0.209129  0.604572 -1.743058    one
2013-01-04  0.918203 -0.158800 -0.964063 -1.990779  three
2013-01-06  0.222260  0.435183 -0.045748  0.049898  three
We can also use the query method to select rows based on a criterion. Let's see how!
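The query call that produced the next output isn't preserved; the rows shown are consistent with a condition such as A greater than C, for example:
df.query('A > C')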
                   A         B         C         D
2013-01-01  1.030816 -1.276989  0.837720 -1.490111
2013-01-03  1.524227  1.863575  1.291378  1.300696
2013-01-04  0.918203 -0.158800 -0.964063 -1.990779
2013-01-05  0.089731  0.114854 -0.585815  0.298772
2013-01-06  0.222260  0.435183 -0.045748  0.049898
#using OR condition
df.query('A < B | C > A')
                   A         B         C         D
2013-01-02 -1.070215 -0.209129  0.604572 -1.743058
2013-01-05  0.089731  0.114854 -0.585815  0.298772
2013-01-06  0.222260  0.435183 -0.045748  0.049898
Pivot tables are extremely useful in analyzing data using a customized tabular format. I think, among
other things, Excel is popular because of the pivot table option. It offers a super-quick way to
analyze data.
  group  ounces
0     a     4.0
1     a     3.0
2     a    12.0
3     b     6.0
4     b     7.5
5     b     8.0
6     c     3.0
7     c     5.0
8     c     6.0
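As a small illustrative sketch, a pivot on the group / ounces frame above could look like this:
data.pivot_table(values='ounces', index='group', aggfunc='mean')
         ounces
group
a      6.333333
b      7.166667
c      4.666667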
Up to now, we've become familiar with the basics of the pandas library using toy examples. Now, we'll take up a real-life data set and use our newly gained knowledge to explore it.
We see that the train data has 32561 rows and 15 columns. Out of these 15 columns, 6 have integer classes and the rest have object (or character) classes. Similarly, we can check the test data. An alternative way of quickly checking the number of rows and columns is the shape attribute, shown below.
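For example (assuming the test set was loaded into a frame called test):
print(train.shape)   # (32561, 15)
print(test.shape)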
[train.head() output: the first five rows of the data set, with columns age, workclass, fnlwgt, education, education.num, marital.status, occupation, relationship, race, sex, capital.gain, capital.loss, hours.per.week, native.country and target.]
Now, let's check the missing values (if present) in this data.
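The counts below are the per-column null totals, obtained with:
train.isnull().sum()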
age 0
workclass 1836
fnlwgt 0
education 0
education.num 0
marital.status 0
occupation 1843
relationship 0
race 0
sex 0
capital.gain 0
capital.loss 0
hours.per.week 0
native.country 583
target 0
dtype: int64
cat = train.select_dtypes(include=['O'])
cat.apply(pd.Series.nunique)
workclass 8
education 16
marital.status 7
occupation 14
relationship 6
race 5
sex 2
native.country 41
target 2
dtype: int64
Since missing values were found in three of the character variables (workclass, occupation and native.country), let's impute these missing values with their respective modes.
#Workclass
train.workclass.value_counts(sort=True)
train.workclass.fillna('Private',inplace=True)
#Occupation
train.occupation.value_counts(sort=True)
train.occupation.fillna('Prof-specialty',inplace=True)
#Native Country
train['native.country'].value_counts(sort=True)
train['native.country'].fillna('United-States',inplace=True)
train.isnull().sum()
age 0
workclass 0
fnlwgt 0
education 0
education.num 0
marital.status 0
occupation 0
relationship 0
race 0
sex 0
capital.gain 0
capital.loss 0
hours.per.week 0
native.country 0
target 0
dtype: int64
Now, we'll check the target variable to investigate if this data is imbalanced or not.
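A quick way to check the class balance is to normalize the value counts of the target, for example:
train.target.value_counts() / train.shape[0]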
We see that 75% of the data set belongs to <=50K class. This means that even if we take a rough
guess of target prediction as <=50K, we'll get 75% accuracy. Isn't that amazing? Let's create a cross
tab of the target variable with education. With this, we'll try to understand the influence of education
on the target variable.
pd.crosstab(train.education, train.target,margins=True)/train.shape[0]
target           <=50K      >50K       All
education
10th          0.026750  0.001904  0.028654
11th          0.034243  0.001843  0.036086
12th          0.012285  0.001013  0.013298
1st-4th       0.004975  0.000184  0.005160
5th-6th       0.009736  0.000491  0.010227
7th-8th       0.018611  0.001228  0.019840
9th           0.014957  0.000829  0.015786
Assoc-acdm    0.024631  0.008139  0.032769
Assoc-voc     0.031357  0.011087  0.042443
Bachelors     0.096250  0.068210  0.164461
Doctorate     0.003286  0.009398  0.012684
HS-grad       0.271060  0.051442  0.322502
Masters       0.023464  0.029452  0.052916
Preschool     0.001566  0.000000  0.001566
Prof-school   0.004699  0.012991  0.017690
Some-college  0.181321  0.042597  0.223918
All           0.759190  0.240810  1.000000
We see that, out of the 75% of people with <=50K salary, 27% are high school graduates, which makes sense as people with lower levels of education are expected to earn less. On the other hand, out of the 25% of people with >50K salary, 6% hold a Bachelors degree and 5% are high school graduates. This pattern deserves a closer look, so we'll have to consider more variables before coming to a conclusion.
If you've come this far, you might be curious to get a taste of building your first machine learning
model. In the coming week we'll share an exclusive tutorial on machine learning in python. However,
let's get a taste of it here.
We'll use the famous and formidable scikit-learn library. Scikit-learn expects data in numeric format, so we'll have to convert the character variables into numeric ones. We'll use the LabelEncoder class. In label encoding, each unique value of a variable gets assigned a number. For example, say a variable color has four values ['red', 'green', 'blue', 'pink'].
Label encoding this variable assigns the classes in sorted order: blue = 0, green = 1, pink = 2, red = 3.
from sklearn import preprocessing

for x in train.columns:
    if train[x].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[x].values))
        train[x] = lbl.transform(list(train[x].values))
train.head()
[train.head() output after label encoding: all 15 columns, including the target, now contain numeric codes.]
As we can see, all the variables have been converted to numeric, including the target variable.
#<=50K = 0 and >50K = 1
train.target.value_counts()
0 24720
1 7841
Name: target, dtype: int64
from sklearn.model_selection import train_test_split

y = train['target']
del train['target']
X = train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
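The classifier whose parameters are echoed below was presumably created and fit roughly like this (n_estimators and max_depth taken from the printed parameters):
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=500, max_depth=6)
clf.fit(X_train, y_train)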
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=6, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
clf.predict(X_test)
Now, let's make predictions on the test set and check the model's accuracy, as sketched below.
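A sketch of scoring the predictions with scikit-learn's accuracy_score:
from sklearn.metrics import accuracy_score
predictions = clf.predict(X_test)
accuracy_score(y_test, predictions)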