Numpy
import numpy as np
x1 = np.array([4, 3, 4, 4, 8, 4])
x1
array([4, 3, 4, 4, 8, 4])
Array Slicing
Now, we'll learn how to access multiple elements or a range of elements from an array.
x = np.arange(10)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
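To make this concrete, here are a few common slices of the array x defined above (results shown as comments):
x[:5]      # array([0, 1, 2, 3, 4]) - first five elements
x[4:7]     # array([4, 5, 6]) - elements from index 4 up to, but not including, 7
x[::2]     # array([0, 2, 4, 6, 8]) - every other element
x[::-1]    # array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0]) - all elements, reversed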
Array Concatenation
Often, we need to combine different arrays. So, instead of typing each of their elements manually, you can use array concatenation to handle such tasks easily.
#Using its axis parameter, you can define row-wise or column-wise concatenation
grid = np.array([[1, 2, 3], [4, 5, 6]])
np.concatenate([grid, grid], axis=1)
array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])
Until now, we used the concatenation function on arrays of equal dimension. But what if you need to combine a 2D array with a 1D array? In such situations, np.concatenate might not be the best option to use. Instead, you can use np.vstack or np.hstack to do the task. Let's see how!
x = np.array([3,4,5])
grid = np.array([[1,2,3],[17,18,19]])
np.vstack([x,grid])
array([[ 3,  4,  5],
       [ 1,  2,  3],
       [17, 18, 19]])
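np.hstack stacks arrays horizontally in the same way; for instance, appending a column to grid (a quick illustration):
z = np.array([[9], [9]])
np.hstack([grid, z])
array([[ 1,  2,  3,  9],
       [17, 18, 19,  9]])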
Also, we can split the arrays based on pre-defined positions. Let's see how!
x = np.arange(10)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
x1,x2,x3 = np.split(x,[3,6])
print(x1, x2, x3)
[0 1 2] [3 4 5] [6 7 8 9]
grid = np.arange(16).reshape((4,4))
grid
upper,lower = np.vsplit(grid,[2])
print (upper, lower)
(array([[0, 1, 2, 3],
[4, 5, 6, 7]]), array([[ 8, 9, 10, 11],
[12, 13, 14, 15]]))
In addition to the functions we learned above, the numpy library provides several other mathematical functions such as sum, divide, multiply, abs, power, mod, sin, cos, tan, log, var, min, mean, and max, which can be used to perform basic arithmetic calculations. Feel free to refer to the numpy documentation for more information on such functions.
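For instance, a few of these applied to the array x from above:
np.sum(x)              # 45
np.mean(x)             # 4.5
np.min(x), np.max(x)   # (0, 9)
np.power(x, 2)         # array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])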
Let's move on to pandas now. Make sure you follow each line below, because it'll help you in doing data manipulation using pandas.
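The pandas examples start from a small data frame of countries and their ranks; it was presumably built along these lines (the values are read off the output below):
import pandas as pd
data = pd.DataFrame({'Country': ['Russia', 'Colombia', 'Chile', 'Equador', 'Nigeria'],
                     'Rank': [121, 40, 100, 130, 11]})
data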
    Country  Rank
0    Russia   121
1  Colombia    40
2     Chile   100
3   Equador   130
4   Nigeria    11
data.describe()
             Rank
count    5.000000
mean    80.400000
std     52.300096
min     11.000000
25%     40.000000
50%    100.000000
75%    121.000000
max    130.000000
Remember, the describe() method computes summary statistics for integer / double variables. To get complete information about the data set, we can use the info() function.
#Among other things, it shows the data set has 5 rows and 2 columns with their respective names.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
Country 5 non-null object
Rank 5 non-null int64
dtypes: int64(1), object(1)
memory usage: 152.0+ bytes
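The next examples work on a frame of groups and ounces; a construction consistent with the output below would be:
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data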
  group  ounces
0     a     4.0
1     a     3.0
2     a    12.0
3     b     6.0
4     b     7.5
5     b     8.0
6     c     3.0
7     c     5.0
8     c     6.0
#Let's sort the data frame by ounces - inplace=True will make changes to the data
data.sort_values(by=['ounces'],ascending=True,inplace=False)
  group  ounces
1     a     3.0
6     c     3.0
0     a     4.0
7     c     5.0
3     b     6.0
8     c     6.0
4     b     7.5
5     b     8.0
2     a    12.0
We can sort the data by not just one column but multiple columns as well.
data.sort_values(by=['group','ounces'],ascending=[True,False],inplace=False)
  group  ounces
2     a    12.0
0     a     4.0
1     a     3.0
5     b     8.0
4     b     7.5
3     b     6.0
8     c     6.0
7     c     5.0
6     c     3.0
Often, we get data sets with duplicate rows, which is nothing but noise. Therefore, before training the
model, we need to make sure we get rid of such inconsistencies in the data set. Let's see how we
can remove duplicate rows.
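The frame used here can be built as follows (values taken from the output below):
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [3, 2, 1, 3, 3, 4, 4]})
data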
    k1  k2
0  one   3
1  one   2
2  one   1
3  two   3
4  two   3
5  two   4
6  two   4
#sort values
data.sort_values(by='k2')
k1 k2
2 one 1
1 one 2
0 one 3
3 two 3
4 two 3
5 two 4
6 two 4
data.drop_duplicates()
    k1  k2
0  one   3
1  one   2
2  one   1
3  two   3
5  two   4
Here, we removed duplicates based on matching row values across all columns. Alternatively, we
can also remove duplicates based on a particular column. Let's remove duplicate values from the k1
column.
data.drop_duplicates(subset='k1')
k1 k2
0 one 3
3 two 3
Now, we will learn to categorize rows based on a predefined criterion. This happens a lot in data processing, where you need to categorize a variable. For example, say we have a column with country names and we want to create a new variable 'continent' based on these country names. In such situations, we take the steps shown below.
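The food data used below was presumably constructed along these lines; the rows not visible in the surviving output (pulled pork, honey ham) and their ounce values are an assumption based on the mapping dictionary further down:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef',
                              'Bacon', 'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data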
       food  ounces
0     bacon     4.0
2     bacon    12.0
3  Pastrami     6.0
5     Bacon     8.0
6  pastrami     3.0
Now, we want to create a new variable which indicates the type of animal that acts as the source of the food. To do that, first we'll create a dictionary that maps each food to an animal. Then, we'll use the map function to apply the dictionary to the food column. Let's see how it is done.
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
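With the dictionary in place, the mapping itself would look something like this (a sketch; the lower-casing makes the food values match the dictionary keys):
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
data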
def meat_2_animal(series):
    if series['food'] == 'bacon':
        return 'pig'
    elif series['food'] == 'pulled pork':
        return 'pig'
    elif series['food'] == 'pastrami':
        return 'cow'
    elif series['food'] == 'corned beef':
        return 'cow'
    elif series['food'] == 'honey ham':
        return 'pig'
    else:
        return 'salmon'
#another way of doing it is to convert the food values to lower case and apply the function
lower = lambda x: x.lower()
data['food'] = data['food'].apply(lower)
data['animal2'] = data.apply(meat_2_animal, axis='columns')
data
(The frame now carries the new animal and animal2 columns; for example, row 4 - corned beef, 7.5 ounces - maps to cow in both.)
Another way to create a new variable is by using the assign function. As you keep discovering new functions in this tutorial, you'll realize how powerful pandas is.
data.assign(new_variable = data['ounces']*10)
(Output: assign returns a copy with new_variable = ounces * 10; for example, row 8 - nova lox, 6.0 ounces, salmon - gets new_variable 60.0.)
We frequently find missing values in our data set. A quick method for imputing a missing value is to fill it with some constant or arbitrary number. Besides missing values, you may also find lots of outliers in your data set, which might require replacing. Let's see how we can replace values.
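A minimal sketch of replace(), assuming a series in which the sentinel value -999 marks missing entries (the series and value are illustrative):
s = pd.Series([1.0, -999.0, 2.0, -999.0, 3.0])
s.replace(-999, np.nan, inplace=True)
s   # the -999 entries are now NaN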
Now, let's learn how to rename column names and axis (row names).
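The frame used for renaming isn't constructed in the text; one way to build it, consistent with the values shown below, is:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data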
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
#Using rename function
data.rename(index={'Ohio': 'SanF'}, columns={'one': 'one_p', 'two': 'two_p'}, inplace=True)
data
          one_p  two_p  three  four
SanF          0      1      2     3
Colorado      4      5      6     7
New York      8      9     10    11
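The next output shows the row labels in upper case; presumably a string function was passed to rename, along these lines:
data.rename(index=str.upper, inplace=True)
data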
          one_p  two_p  three  four
SANF          0      1      2     3
COLORADO      4      5      6     7
NEW YORK      8      9     10    11
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
We'll divide the ages into bins such as 18-25, 26-35, 36-60, and 60 and above.
#Understand the output - '(' means the value is excluded from the bin, ']' means the value is included
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100],
(35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
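The counts below refer to named bins, so the categories were presumably given labels along these lines (label names read off the output):
bin_names = ['Youth', 'YoungAdult', 'MiddleAge', 'Senior']
new_cats = pd.cut(ages, bins, labels=bin_names)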
pd.value_counts(new_cats)
Youth         5
MiddleAge     3
YoungAdult    3
Senior        1
dtype: int64
#we can also calculate their cumulative sum
pd.value_counts(new_cats).cumsum()
Youth          5
MiddleAge      8
YoungAdult    11
Senior        12
dtype: int64
Let's proceed and learn about grouping data and creating pivots in pandas. It's an immensely
important data analysis method which you'd probably have to use on every data set you work with.
(A data frame with two numeric columns of random values, data1 and data2, and two key columns, key1 and key2, used for the grouping examples.)
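The frame itself isn't constructed in the surviving text; a minimal sketch of such a frame and a simple groupby (the key2 values and the random numbers are illustrative):
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df.groupby('key1')['data1'].mean()   # mean of data1 within each key1 group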
Pandas also supports label-based selection with loc. The following outputs come from a data frame of random numbers indexed by dates, with columns A to D, from which rows are selected by their index labels and columns by name:

                   A         B         C         D
2013-01-01  1.030816 -1.276989  0.837720 -1.490111
2013-01-02 -1.070215 -0.209129  0.604572 -1.743058
2013-01-03  1.524227  1.863575  1.291378  1.300696
2013-01-04  0.918203 -0.158800 -0.964063 -1.990779
2013-01-05  0.089731  0.114854 -0.585815  0.298772
2013-01-06  0.222260  0.435183 -0.045748  0.049898
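A sketch of how such a frame and the label-based selections could be written (the construction and the exact slices are assumptions; the random values will differ from those shown):
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df.loc[:, ['A', 'B']]                       # all rows, columns A and B
df.loc['20130102':'20130104', ['A', 'B']]   # a label-based row slice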
#returns specific rows and columns using lists of row and column indexes
df.iloc[[1,5],[0,2]]
                   A         C
2013-01-02 -1.070215  0.604572
2013-01-06  0.222260 -0.045748
Similarly, we can do Boolean indexing based on column values as well. This helps in filtering a data
set based on a pre-defined condition.
df[df.A > 1]
                   A         B         C         D
2013-01-01  1.030816 -1.276989  0.837720 -1.490111
2013-01-03  1.524227  1.863575  1.291378  1.300696
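Before the next output, a new column E is added to the frame; judging from the values shown, it was assigned roughly like this:
df['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df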
                   A         B         C         D      E
2013-01-01  1.030816 -1.276989  0.837720 -1.490111    one
2013-01-02 -1.070215 -0.209129  0.604572 -1.743058    one
2013-01-03  1.524227  1.863575  1.291378  1.300696    two
2013-01-04  0.918203 -0.158800 -0.964063 -1.990779  three
2013-01-05  0.089731  0.114854 -0.585815  0.298772   four
2013-01-06  0.222260  0.435183 -0.045748  0.049898  three
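The next output keeps only the rows whose E value is 'one' or 'three', which suggests a filter along these lines (an assumption):
df[df['E'].isin(['one', 'three'])]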
                   A         B         C         D      E
2013-01-01  1.030816 -1.276989  0.837720 -1.490111    one
2013-01-02 -1.070215 -0.209129  0.604572 -1.743058    one
2013-01-04  0.918203 -0.158800 -0.964063 -1.990779  three
2013-01-06  0.222260  0.435183 -0.045748  0.049898  three
We can also use the query method to select rows based on a criterion. Let's see how!
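The query call that produced the next output isn't preserved; the rows shown are consistent with a condition such as A greater than C, for example:
df.query('A > C')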
                   A         B         C         D
2013-01-01  1.030816 -1.276989  0.837720 -1.490111
2013-01-03  1.524227  1.863575  1.291378  1.300696
2013-01-04  0.918203 -0.158800 -0.964063 -1.990779
2013-01-05  0.089731  0.114854 -0.585815  0.298772
2013-01-06  0.222260  0.435183 -0.045748  0.049898
#using OR condition
df.query('A < B | C > A')
                   A         B         C         D
2013-01-02 -1.070215 -0.209129  0.604572 -1.743058
2013-01-05  0.089731  0.114854 -0.585815  0.298772
2013-01-06  0.222260  0.435183 -0.045748  0.049898
Pivot tables are extremely useful in analyzing data using a customized tabular format. I think, among
other things, Excel is popular because of the pivot table option. It offers a super-quick way to
analyze data.
  group  ounces
0     a     4.0
1     a     3.0
2     a    12.0
3     b     6.0
4     b     7.5
5     b     8.0
6     c     3.0
7     c     5.0
8     c     6.0
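As a small illustrative sketch, a pivot on the group / ounces frame above could look like this:
data.pivot_table(values='ounces', index='group', aggfunc='mean')
         ounces
group
a      6.333333
b      7.166667
c      4.666667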
Up to now, we've become familiar with the basics of the pandas library using toy examples. Now, we'll take up a real-life data set and use our newly gained knowledge to explore it.
We see that the train data has 32561 rows and 15 columns. Out of these 15 columns, 6 have integer classes and the rest have object (or character) classes. Similarly, we can check the test data. An alternative way of quickly checking the number of rows and columns is the shape attribute, shown below.
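For example (assuming the test set was loaded into a frame called test):
print(train.shape)   # (32561, 15)
print(test.shape)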
[train.head() output: the first five rows of the data set, with columns age, workclass, fnlwgt, education, education.num, marital.status, occupation, relationship, race, sex, capital.gain, capital.loss, hours.per.week, native.country and target.]
Now, let's check the missing values (if present) in this data.
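The counts below are the per-column null totals, obtained with:
train.isnull().sum()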
age 0
workclass 1836
fnlwgt 0
education 0
education.num 0
marital.status 0
occupation 1843
relationship 0
race 0
sex 0
capital.gain 0
capital.loss 0
hours.per.week 0
native.country 583
target 0
dtype: int64
cat = train.select_dtypes(include=['O'])
cat.apply(pd.Series.nunique)
workclass 8
education 16
marital.status 7
occupation 14
relationship 6
race 5
sex 2
native.country 41
target 2
dtype: int64
Since missing values were found in three of the character variables (workclass, occupation and native.country), let's impute these missing values with their respective modes.
#Workclass
train.workclass.value_counts(sort=True)
train.workclass.fillna('Private',inplace=True)
#Occupation
train.occupation.value_counts(sort=True)
train.occupation.fillna('Prof-specialty',inplace=True)
#Native Country
train['native.country'].value_counts(sort=True)
train['native.country'].fillna('United-States',inplace=True)
train.isnull().sum()
age 0
workclass 0
fnlwgt 0
education 0
education.num 0
marital.status 0
occupation 0
relationship 0
race 0
sex 0
capital.gain 0
capital.loss 0
hours.per.week 0
native.country 0
target 0
dtype: int64
Now, we'll check the target variable to investigate if this data is imbalanced or not.
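A quick way to check the class balance is to normalize the value counts of the target, for example:
train.target.value_counts() / train.shape[0]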
We see that 75% of the data set belongs to <=50K class. This means that even if we take a rough
guess of target prediction as <=50K, we'll get 75% accuracy. Isn't that amazing? Let's create a cross
tab of the target variable with education. With this, we'll try to understand the influence of education
on the target variable.
pd.crosstab(train.education, train.target,margins=True)/train.shape[0]
target           <=50K      >50K       All
education
10th          0.026750  0.001904  0.028654
11th          0.034243  0.001843  0.036086
12th          0.012285  0.001013  0.013298
1st-4th       0.004975  0.000184  0.005160
5th-6th       0.009736  0.000491  0.010227
7th-8th       0.018611  0.001228  0.019840
9th           0.014957  0.000829  0.015786
Assoc-acdm    0.024631  0.008139  0.032769
Assoc-voc     0.031357  0.011087  0.042443
Bachelors     0.096250  0.068210  0.164461
Doctorate     0.003286  0.009398  0.012684
HS-grad       0.271060  0.051442  0.322502
Masters       0.023464  0.029452  0.052916
Preschool     0.001566  0.000000  0.001566
Prof-school   0.004699  0.012991  0.017690
Some-college  0.181321  0.042597  0.223918
All           0.759190  0.240810  1.000000
We see that, out of the 75% of people with <=50K salary, 27% are high school graduates, which makes sense as people with lower levels of education are expected to earn less. On the other hand, out of the 25% of people with >50K salary, 6% hold a Bachelors degree and 5% are high school graduates. This pattern deserves a closer look, so we'll have to consider more variables before coming to a conclusion.
If you've come this far, you might be curious to get a taste of building your first machine learning
model. In the coming week we'll share an exclusive tutorial on machine learning in python. However,
let's get a taste of it here.
We'll use the famous and formidable scikit-learn library. Scikit-learn expects data in numeric format, so we'll have to convert the character variables into numeric ones. We'll use the LabelEncoder class. In label encoding, each unique value of a variable gets assigned a number. For example, say a variable color has four values ['red', 'green', 'blue', 'pink'].
Label encoding this variable assigns the classes in sorted order: blue = 0, green = 1, pink = 2, red = 3.
from sklearn import preprocessing

for x in train.columns:
    if train[x].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[x].values))
        train[x] = lbl.transform(list(train[x].values))
train.head()
[train.head() output after label encoding: all 15 columns, including the target, now contain numeric codes.]
As we can see, all the variables have been converted to numeric, including the target variable.
#<=50K = 0 and >50K = 1
train.target.value_counts()
0 24720
1 7841
Name: target, dtype: int64
from sklearn.model_selection import train_test_split

y = train['target']
del train['target']
X = train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
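The classifier whose parameters are echoed below was presumably created and fit roughly like this (n_estimators and max_depth taken from the printed parameters):
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=500, max_depth=6)
clf.fit(X_train, y_train)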
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=6, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
clf.predict(X_test)
Now, let's make predictions on the test set and check the model's accuracy, as sketched below.
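A sketch of scoring the predictions with scikit-learn's accuracy_score:
from sklearn.metrics import accuracy_score
predictions = clf.predict(X_test)
accuracy_score(y_test, predictions)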