DSP Lab Manual
Data Science using Python (Sasi Institute of Technology and Engineering)

1. Creating a NumPy Array
a. Basic ndarray
b. Array of zeros
c. Array of ones
d. Random numbers in ndarray
e. An array of your choice
f. Identity matrix in NumPy
g. Evenly spaced ndarray

Creating a NumPy Array


Basic ndarray
NumPy arrays are very easy to create given the complex problems they solve. To create a
very basic ndarray, you use the np.array() method. All you have to pass are the values of the
array as a list:
import numpy as np

np.array([1,2,3,4],dtype=np.float32)

Here we passed integer values, but because we specified np.float32 in the dtype argument,
NumPy stores them as 32-bit floats. The dtype argument lets you control the type of data in the array.

Output:

array([1., 2., 3., 4.], dtype=float32)

Since NumPy arrays can contain only homogeneous datatypes, values will be upcast if the types
do not match:
np.array([1,2.0,3,4])
Output:
array([1., 2., 3., 4.])
Here, NumPy has upcast integer values to float values.

NumPy arrays can be multi-dimensional too.


np.array([[1,2,3,4],[5,6,7,8]])
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
Here, we created a 2-dimensional array of values.
Array of zeros
NumPy lets you create an array of all zeros using the np.zeros() method. All you have to do is
pass the shape of the desired array:
np.zeros(5)
array([0., 0., 0., 0., 0.])
The one above is a 1-D array while the one below is a 2-D array:
np.zeros((2,3))
array([[0., 0., 0.],
[0., 0., 0.]])

Array of ones
You could also create an array of all 1s using the np.ones() method:
np.ones(5,dtype=np.int32)
array([1, 1, 1, 1, 1])

Random numbers in ndarrays


Another very commonly used method to create ndarrays is the np.random.rand() method. It creates
an array of a given shape with random values from [0,1):
# random
np.random.rand(2,3)
array([[0.95580785, 0.98378873, 0.65133872],
[0.38330437, 0.16033608, 0.13826526]])

An array of your choice


Or, in fact, you can create an array filled with any given value using the np.full() method. Just
pass in the shape of the desired array and the value you want:
np.full((2,2),7)
array([[7, 7],
[7, 7]])

Identity matrix in NumPy
Another great method is np.eye() that returns an array with 1s along its diagonal
and 0s everywhere else.
An Identity matrix is a square matrix that has 1s along its main diagonal and 0s everywhere
else. Below is an Identity matrix of shape 3 x 3.
Note: A square matrix has an N x N shape. This means it has the same number of rows and
columns.
# identity matrix
np.eye(3)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
However, NumPy gives you the flexibility to change the diagonal along which the values have to
be 1s. You can either move it above the main diagonal:
# not an identity matrix
np.eye(3,k=1)
array([[0., 1., 0.],
[0., 0., 1.],
[0., 0., 0.]])
Or move it below the main diagonal:
np.eye(3,k=-2)
array([[0., 0., 0.],
[0., 0., 0.],
[1., 0., 0.]])
Note: A matrix is called the Identity matrix only when the 1s are along the main diagonal and
not any other diagonal!

Evenly spaced ndarray


You can quickly get an evenly spaced array of numbers using the np.arange() method:
np.arange(5)
array([0, 1, 2, 3, 4])
The start, end and step size of the interval of values can be explicitly defined by passing in three
numbers as arguments for these values respectively. A point to be noted here is that the interval
is defined as [start,end) where the last number will not be included in the array:
np.arange(2,10,2)
array([2, 4, 6, 8])

Alternate elements were printed because the step-size was defined as 2. Notice that 10 was not
printed because the end of the interval is excluded.

np.arange(1,10,3)
array([1, 4, 7])

Another similar function is np.linspace(), but instead of step size, it takes in the number of
samples that need to be retrieved from the interval. A point to note here is that the last number is
included in the values returned unlike in the case of np.arange().
np.linspace(0,1,5)
array([0. , 0.25, 0.5 , 0.75, 1. ])
Great! Now you know how to create arrays using NumPy. But it's also important to know the
shape of the array.
2. The Shape and Reshaping of NumPy Array
a. Dimensions of NumPy array
b. Shape of NumPy array
c. Size of NumPy array
d. Reshaping a NumPy array
e. Flattening a NumPy array
f. Transpose of a NumPy array

The Shape and Reshaping of NumPy Arrays


Once you have created your ndarray, the next thing you would want to do is check the number of
axes, shape, and the size of the ndarray.

a. Dimensions of NumPy arrays


You can easily determine the number of dimensions or axes of a NumPy array using
the ndim attribute:
# number of axis
import numpy as np
a = np.array([[5,10,15],[20,25,20]])
print('Array :','\n',a)
print('Dimensions :','\n',a.ndim)
Array :
[[ 5 10 15]
[20 25 20]]
Dimensions :
2
This array has two dimensions: 2 rows and 3 columns.

b. Shape of NumPy array


The shape is an attribute of the NumPy array that tells you how many elements there are
along each dimension. You can further index the shape tuple returned by the ndarray to get the value
along each dimension:
import numpy as np
a = np.array([[1,2,3],[4,5,6]])
print('Array :','\n',a)
print('Shape :','\n',a.shape)
print('Rows = ',a.shape[0])
print('Columns = ',a.shape[1])
Array :
[[1 2 3]
[4 5 6]]
Shape :
(2, 3)
Rows = 2
Columns = 3
c. Size of NumPy array
You can determine how many values there are in the array using the size attribute. It just
multiplies the number of rows by the number of columns in the ndarray:
# size of array
import numpy as np
a = np.array([[5,10,15],[20,25,20]])
print('Size of array :',a.size)
print('Manual determination of size of array :',a.shape[0]*a.shape[1])
Size of array : 6
Manual determination of size of array : 6

d. Reshaping a NumPy array


Reshaping a ndarray can be done using the np.reshape() method. It changes the shape of the
ndarray without changing the data within the ndarray:
# reshape
import numpy as np
a = np.array([3,6,9,12])
np.reshape(a,(2,2))
array([[ 3, 6],
[ 9, 12]])
Here, I reshaped the ndarray from a 1-D to a 2-D ndarray.
While reshaping, if you are unsure about the length of any one of the axes, just pass in -1. NumPy
automatically calculates that length when it sees a -1:
import numpy as np
a = np.array([3,6,9,12,18,24])
print('Three rows :','\n',np.reshape(a,(3,-1)))
print('Three columns :','\n',np.reshape(a,(-1,3)))
Three rows :
[[ 3 6]
[ 9 12]
[18 24]]
Three columns :
[[ 3 6 9]
[12 18 24]]

e. Flattening a NumPy array


Sometimes when you have a multidimensional array and want to collapse it to a single-
dimensional array, you can either use the flatten() method or the ravel() method:

import numpy as np
a = np.ones((2,2))
b = a.flatten()
c = a.ravel()
print('Original shape :', a.shape)
print('Array :','\n', a)
print('Shape after flatten :',b.shape)
print('Array :','\n', b)
print('Shape after ravel :',c.shape)
print('Array :','\n', c)
Original shape : (2, 2)
Array :
[[1. 1.]
[1. 1.]]
Shape after flatten : (4,)
Array :
[1. 1. 1. 1.]
Shape after ravel : (4,)
Array :
[1. 1. 1. 1.]

But an important difference between flatten() and ravel() is that the former returns a copy of the
original array while the latter returns a reference to the original array. This means any changes
made to the array returned from ravel() will also be reflected in the original array while this will
not be the case with flatten().
b[0] = 0
print(a)
[[1. 1.]
[1. 1.]]
The change made was not reflected in the original array.
c[0] = 0
print(a)
[[0. 1.]
[1. 1.]]
But here, the changed value is also reflected in the original ndarray.
What is happening here is that flatten() creates a Deep copy of the ndarray while ravel() creates
a Shallow copy of the ndarray.
Deep copy means that a completely new ndarray is created in memory and the ndarray object
returned by flatten() is now pointing to this memory location. Therefore, any changes made here
will not be reflected in the original ndarray.
A Shallow copy, on the other hand, returns a reference to the original memory location. Meaning
the object returned by ravel() is pointing to the same memory location as the original ndarray
object. So, definitely, any changes made to this ndarray will also be reflected in the original
ndarray too.
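A quick way to check this difference is to test whether the returned arrays share memory with the original. The following small sketch uses np.shares_memory(), which is not part of the original text:

import numpy as np

a = np.ones((2,2))
b = a.flatten()   # deep copy: new memory
c = a.ravel()     # view (when possible): same memory as a

print(np.shares_memory(a, b))   # False - flatten() copied the data
print(np.shares_memory(a, c))   # True  - ravel() returned a view of a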
f. Transpose of a NumPy array
Another very interesting reshaping method of NumPy is the transpose() method. It takes the
input array and swaps its rows and columns:
import numpy as np
a = np.array([[1,2,3],
              [4,5,6]])
b = np.transpose(a)
print('Original','\n','Shape',a.shape,'\n',a)
print('Transpose:','\n','Shape',b.shape,'\n',b)
Original
Shape (2, 3)
[[1 2 3]
[4 5 6]]
Transpose:
Shape (3, 2)
[[1 4]
[2 5]
[3 6]]
On transposing a 2 x 3 array, we got a 3 x 2 array. Transpose has a lot of significance in linear
algebra.
3. Expanding and Squeezing a NumPy Array
a. Expanding a NumPy array
b. Squeezing a NumPy array
c. Sorting in NumPy Arrays

Expanding and Squeezing a NumPy Array


a. Expanding a NumPy array
You can add a new axis to an array using the expand_dims() method by providing the array and
the axis along which to expand:
# expand dimensions
a = np.array([1,2,3])
b = np.expand_dims(a,axis=0)
c = np.expand_dims(a,axis=1)
print('Original:','\n','Shape',a.shape,'\n',a)
print('Expand along columns:','\n','Shape',b.shape,'\n',b)
print('Expand along rows:','\n','Shape',c.shape,'\n',c)
Original:
Shape (3,)
[1 2 3]
Expand along columns:
Shape (1, 3)
[[1 2 3]]
Expand along rows:
Shape (3, 1)
[[1]
[2]
[3]]
b. Squeezing a NumPy array
On the other hand, if you instead want to reduce the axis of the array, use the squeeze() method.
It removes the axis that has a single entry. This means if you have created a 2 x 2 x 1 matrix,
squeeze() will remove the third dimension from the matrix:
# squeeze
a = np.array([[[1,2,3],
               [4,5,6]]])
b = np.squeeze(a, axis=0)
print('Original','\n','Shape',a.shape,'\n',a)
print('Squeeze array:','\n','Shape',b.shape,'\n',b)
Original
Shape (1, 2, 3)
[[[1 2 3]
[4 5 6]]]
Squeeze array:
Shape (2, 3)
[[1 2 3]
[4 5 6]]
However, if you already had a 2 x 3 matrix, using squeeze() with axis=0 in that case would give you an error,
because axis 0 does not have a size of one:
# squeeze
a = np.array([[1,2,3],
              [4,5,6]])
b = np.squeeze(a, axis=0)
print('Original','\n','Shape',a.shape,'\n',a)
print('Squeeze array:','\n','Shape',b.shape,'\n',b)
NumPy raises a ValueError here, since only axes of length one can be squeezed out.
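For comparison, calling squeeze() without the axis argument simply returns an array with no size-1 dimensions unchanged. A small sketch (not from the original text):

import numpy as np
a = np.array([[1,2,3],
              [4,5,6]])
print(np.squeeze(a).shape)   # (2, 3) - nothing to squeeze, so the array is returned as-is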

c. Sorting in NumPy arrays


For any programmer, the time complexity of any algorithm is of prime importance. Sorting is an
important and very basic operation that you might well use on a daily basis as a data scientist.
So, it is important to use a good sorting algorithm with minimum time complexity.
The NumPy library is a legend when it comes to sorting elements of an array. It has a range of
sorting functions that you can use to sort your array elements. It has implemented quicksort,
heapsort, mergesort, and timsort for you under the hood when you use the sort() method:
a = np.array([1,4,2,5,3,6,8,7,9])
np.sort(a, kind='quicksort')
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
You can even sort the array along any axis you desire:
a = np.array([[5,6,7,4],
              [9,2,3,7]])
# sort along the column
print('Sort along column :','\n',np.sort(a, kind='mergesort', axis=1))
# sort along the row
print('Sort along row :','\n',np.sort(a, kind='mergesort', axis=0))
Sort along column :
[[4 5 6 7]
[2 3 7 9]]
Sort along row :
[[5 2 3 4]
[9 6 7 7]]
4. Indexing and Slicing of NumPy Array
a. Slicing 1-D NumPy arrays
b. Slicing 2-D NumPy arrays
c. Slicing 3-D NumPy arrays
d. Negative slicing of NumPy arrays

a. Slicing 1-D NumPy arrays

Slicing means retrieving elements from one index to another index. All we have to do is to pass

the starting and ending point in the index like this: [start: end].

However, you can even take it up a notch by passing the step-size. What is that? Well, suppose

you wanted to print every other element from the array, you would define your step-size as 2,

meaning get the element 2 places away from the present index.

Incorporating all this into a single index would look something like this: [start:end:step-size].

a = np.array([1,2,3,4,5,6])
print(a[1:5:2])
[2 4]

Notice that the last element did not get considered. This is because slicing includes the start

index but excludes the end index.

A way around this is to write the next higher index to the final index value you want to retrieve:

a = np.array([1,2,3,4,5,6])
print(a[1:6:2])
[2 4 6]

If you don’t specify the start or end index, it is taken as 0 or array size, respectively, as default.

And the step-size by default is 1.

a = np.array([1,2,3,4,5,6])
print(a[:6:2])
print(a[1::2])
print(a[1:6:])
[1 3 5]
[2 4 6]
[2 3 4 5 6]

b. Slicing 2-D NumPy arrays

Now, a 2-D array has rows and columns so it can get a little tricky to slice 2-D arrays. But once

you understand it, you can slice any dimension array!

Before learning how to slice a 2-D array, let’s have a look at how to retrieve an element from

a 2-D array:

a = np.array([[1,2,3],
[4,5,6]])
print(a[0,0])
print(a[1,2])
print(a[1,0])
1
6
4

Here, we provided the row value and column value to identify the element we wanted to extract.

While in a 1-D array, we were only providing the column value since there was only 1 row.

So, to slice a 2-D array, you need to mention the slices for both the row and the column:

a = np.array([[1,2,3],[4,5,6]])
# print first row values
print('First row values :','\n',a[0:1,:])
# with step-size for columns
print('Alternate values from first row:','\n',a[0:1,::2])
#
print('Second column values :','\n',a[:,1::2])
print('Arbitrary values :','\n',a[0:1,1:3])
First row values :
[[1 2 3]]
Alternate values from first row:
[[1 3]]
Second column values :
[[2]
[5]]
Arbitrary values :
[[2 3]]

c. Slicing 3-D NumPy arrays

So far we haven’t seen a 3-D array. Let’s first visualize what a 3-D array looks like:

a = np.array([[[1,2],[3,4],[5,6]],# first axis array


[[7,8],[9,10],[11,12]],# second axis array
[[13,14],[15,16],[17,18]]])# third axis array
# 3-D array
print(a)
[[[ 1 2]
[ 3 4]
[ 5 6]]
[[ 7 8]
[ 9 10]
[11 12]]

[[13 14]
[15 16]
[17 18]]]

In addition to the rows and columns, as in a 2-D array, a 3-D array also has a depth axis where it

stacks one 2-D array behind the other. So, when you are slicing a 3-D array, you also need to

mention which 2-D array you are slicing. This usually comes as the first value in the index:

# value
print('First array, first row, first column value :','\n',a[0,0,0])
print('First array last column :','\n',a[0,:,1])
print('First two rows for second and third arrays :','\n',a[1:,0:2,0:2])
First array, first row, first column value :
1
First array last column :
[2 4 6]
First two rows for second and third arrays :
[[[ 7 8]
[ 9 10]]

[[13 14]
[15 16]]]

If in case you wanted the values as a single dimension array, you can always use the flatten()

method to do the job!

print('Printing as a single array :','\n',a[1:,0:2,0:2].flatten())


Printing as a single array :
[ 7 8 9 10 13 14 15 16]

d. Negative slicing of NumPy arrays

An interesting way to slice your array is to use negative slicing. Negative slicing prints elements

from the end rather than the beginning. Have a look below:

a = np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print(a[:,-1])
[ 5 10]

Here, the last values for each row were printed. If, however, we wanted to extract from the end,

we would have to explicitly provide a negative step-size otherwise the result would be an

empty list.

print(a[:,-1:-3:-1])
[[ 5 4]
[10 9]]

Having said that, the basic logic of slicing remains the same, i.e. the end index is never

included in the output.

An interesting use of negative slicing is to reverse the original array.

a = np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print('Original array :','\n',a)
print('Reversed array :','\n',a[::-1,::-1])
Original array :
[[ 1 2 3 4 5]
[ 6 7 8 9 10]]
Reversed array :
[[10 9 8 7 6]
[ 5 4 3 2 1]]
You can also use the flip() method to reverse an ndarray.

a = np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print('Original array :','\n',a)
print('Reversed array vertically :','\n',np.flip(a,axis=1))
print('Reversed array horizontally :','\n',np.flip(a,axis=0))
Original array :
[[ 1 2 3 4 5]
[ 6 7 8 9 10]]
Reversed array vertically :
[[ 5 4 3 2 1]
[10 9 8 7 6]]
Reversed array horizontally :
[[ 6 7 8 9 10]
[ 1 2 3 4 5]]
5. Stacking and Concatenating Numpy Arrays
a. Stacking ndarrays
b. Concatenating ndarrays
c. Broadcasting in Numpy Array

Stacking and Concatenating NumPy arrays

a. Stacking ndarrays

You can create a new array by combining existing arrays. This you can do in two ways:

- Either combine the arrays vertically (i.e. along the rows) using the vstack() method,
  thereby increasing the number of rows in the resulting array
- Or combine the arrays in a horizontal fashion (i.e. along the columns) using the hstack() method, thereby
  increasing the number of columns in the resultant array

a = np.arange(0,5)
b = np.arange(5,10)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Vertical stacking :','\n',np.vstack((a,b)))
print('Horizontal stacking :','\n',np.hstack((a,b)))
Array 1 :
[0 1 2 3 4]
Array 2 :
[5 6 7 8 9]
Vertical stacking :
[[0 1 2 3 4]
[5 6 7 8 9]]
Horizontal stacking :
[0 1 2 3 4 5 6 7 8 9]

A point to note here is that the axis along which you are combining the arrays should have the
same size, otherwise you are bound to get an error! In the example below, np.vstack() raises a
ValueError because the two 1-D arrays have lengths 5 and 4 (np.hstack() would actually still work
here, since it simply joins 1-D arrays end to end):

a = np.arange(0,5)
b = np.arange(5,9)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Vertical stacking :','\n',np.vstack((a,b)))
print('Horizontal stacking :','\n',np.hstack((a,b)))

Another interesting way to combine arrays is using the dstack() method. It combines array

elements index by index and stacks them along the depth axis:

a = [[1,2],[3,4]]
b = [[5,6],[7,8]]
c = np.dstack((a,b))
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Dstack :','\n',c)
print(c.shape)
Array 1 :
[[1, 2], [3, 4]]
Array 2 :
[[5, 6], [7, 8]]
Dstack :
[[[1 5]
[2 6]]

[[3 7]
[4 8]]]
(2, 2, 2)

b. Concatenating ndarrays

While stacking arrays is one way of combining old arrays to get a new one, you could also use

the concatenate() method where the passed arrays are joined along an existing axis:

a = np.arange(0,5).reshape(1,5)
b = np.arange(5,10).reshape(1,5)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Concatenate along rows :','\n',np.concatenate((a,b),axis=0))
print('Concatenate along columns :','\n',np.concatenate((a,b),axis=1))
Array 1 :
[[0 1 2 3 4]]
Array 2 :
[[5 6 7 8 9]]
Concatenate along rows :
[[0 1 2 3 4]
[5 6 7 8 9]]
Concatenate along columns :
[[0 1 2 3 4 5 6 7 8 9]]

The drawback of this method is that the original array must have the axis along which you want

to combine. Otherwise, get ready to be greeted by an error.

Another very useful function is the append() method, which adds new elements to the end of an

ndarray. This is obviously useful when you already have an existing ndarray but want to add new

values to it.

# append values to ndarray


a = np.array([[1,2],
[3,4]])
np.append(a,[[5,6]], axis=0)
array([[1, 2],
[3, 4],
[5, 6]])

c. Broadcasting in NumPy arrays

Broadcasting is one of the best features of ndarrays. It lets you perform arithmetic operations

between ndarrays of different sizes or between an ndarray and a simple number!

Broadcasting essentially stretches the smaller ndarray so that it matches the shape of the larger

ndarray:
a = np.arange(10,20,2)
b = np.array([[2],[2]])
print('Adding two different size arrays :','\n',a+b)
print('Multiplying an ndarray and a number :',a*2)
Adding two different size arrays :
[[12 14 16 18 20]
[12 14 16 18 20]]
Multiplying an ndarray and a number : [20 24 28 32 36]

Its working can be thought of as stretching or making copies of the smaller operand (for example,
treating the scalar 2 as [2, 2, 2, 2, 2]) to match the shape of the larger ndarray and then performing
the operation element-wise. But no such copies are actually made; it is just a way of thinking about
how broadcasting works.

This is very useful because it is more efficient to multiply an array with a scalar value rather

than another array! It is important to note that two ndarrays can broadcast together only when

they are compatible.

Ndarrays are compatible when, comparing their shapes dimension by dimension:

1. The sizes along that dimension are equal, or

2. One of the ndarrays has a size of 1 along that dimension. The one having a size of 1 is
broadcast to meet the size requirements of the larger ndarray

In case the arrays are not compatible, you will get a ValueError.

a = np.ones((3,3))
b = np.array([2])
a+b
array([[3., 3., 3.],
[3., 3., 3.],
[3., 3., 3.]])

Here, the second ndarray was stretched, hypothetically, to a 3 x 3 shape, and then the result was

calculated.
6. Perform following operations using pandas

a. Creating dataframe
b. concat()
c. Setting conditions
d. Adding a new column

Pandas is one of the most popular and powerful data science libraries in Python. It can be
considered a stepping stone for any aspiring data scientist who prefers to code in Python.
Even though the library is easy to get started with, it can certainly handle a wide variety of data
manipulation tasks. This makes Pandas one of the handiest data science libraries in the developer
community. Pandas basically allows the manipulation of large datasets and data frames. It can
also be considered one of the most efficient statistical tools for mathematical computations on
tabular data.

a. Creating dataframe

Let’s start off by creating a small sample dataset to try out various operations with Pandas. In

this tutorial, we shall create a Football data frame that stores the record of 4 players each from

Euro Cup 2020’s finalists – England and Italy.

import pandas as pd
# Create team data
data_england = {'Name': ['Kane', 'Sterling', 'Saka', 'Maguire'], 'Age': [27, 26, 19, 28]}
data_italy = {'Name': ['Immobile', 'Insigne', 'Chiellini', 'Chiesa'], 'Age': [31, 30, 36, 23]}

# Create Dataframe
df_england = pd.DataFrame(data_england)
df_italy = pd.DataFrame(data_italy)

The England data frame looks something like this:

       Name  Age
0      Kane   27
1  Sterling   26
2      Saka   19
3   Maguire   28

The Italy data frame looks something like this:

        Name  Age
0   Immobile   31
1    Insigne   30
2  Chiellini   36
3     Chiesa   23

b. The concat() function

Let’s start by concatenating our two data frames. The word “concatenate” means to “link

together in series”. Now that we have created two data frames, let’s try and “concat” them.

We do this by implementing the concat() function.

frames = [df_england, df_italy]


both_teams = pd.concat(frames)
both_teams

The result looks something like this:

        Name  Age
0       Kane   27
1   Sterling   26
2       Saka   19
3    Maguire   28
0   Immobile   31
1    Insigne   30
2  Chiellini   36
3     Chiesa   23

A similar operation could also be done using the append() function.

Try doing:
df_england.append(df_italy)

You'll get the same result! (Note that DataFrame.append() was deprecated in pandas 1.4 and removed
in pandas 2.0, so on newer versions stick with pd.concat().)

Now, imagine you wanted to label your original data frames with the associated countries of

these players. You can do this by setting specific keys to your data frames.

Try doing:

pd.concat(frames, keys=["England", "Italy"])

And our result looks like this:

                 Name  Age
England 0        Kane   27
        1    Sterling   26
        2        Saka   19
        3     Maguire   28
Italy   0    Immobile   31
        1     Insigne   30
        2   Chiellini   36
        3      Chiesa   23

c. Setting conditions in Pandas

Conditional statements basically define conditions on data frame columns. There may be situations
where you have to filter out rows by applying certain column conditions (numeric or non-numeric).
For example, in an Employee data frame, you might have to list out a bunch of people whose salary
is more than Rs. 50000. Also, you might want to filter the people who live in New Delhi, or whose
name starts with “A” (a small sketch of this hypothetical example follows below). Let’s then see a
hands-on example.
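Here is a minimal sketch of that hypothetical Employee example; the DataFrame contents and column names below are made up purely for illustration:

import pandas as pd

employees = pd.DataFrame({
    'Name': ['Amit', 'Bhavna', 'Arjun', 'Divya'],
    'City': ['New Delhi', 'Mumbai', 'New Delhi', 'Chennai'],
    'Salary': [65000, 48000, 52000, 40000]
})

# people whose salary is more than Rs. 50000
print(employees[employees['Salary'] > 50000])

# people who live in New Delhi
print(employees[employees['City'] == 'New Delhi'])

# people whose name starts with "A"
print(employees[employees['Name'].str.startswith('A')])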

Imagine we want to filter experienced players from our squad. Let’s say, we want to filter those

players whose age is greater than or equal to 30. In such case, try doing:

both_teams[both_teams["Age"] >= 30]


Hmm! Looks like Italians are more experienced lads.

Now, let’s try to do some string filtration. We want to filter those players whose name starts with
“S”. This can be done with pandas’ str.startswith() method. Let’s try:

both_teams[both_teams["Name"].str.startswith('S')]

This returns the rows for Sterling and Saka.

d. Adding a new column

Let’s try adding more data to our df_england data frame.

club = ['Tottenham', 'Man City', 'Arsenal', 'Man Utd']


# 'Associated Clubs' is our new column name
df_england['Associated Clubs'] = club
df_england

This will add a new column ‘Associated Clubs’ to England’s data frame.

Name Age Associated Clubs


0 Kane 27 Tottenham
1 Sterling 26 Man City
2 Saka 19 Arsenal
3 Maguire 28 Man Utd

Let’s try to repeat implementing the concat function after updating the data for England.

frames = [df_england, df_italy]


both_teams = pd.concat(frames)
both_teams
Name Age Associated Clubs
0 Kane 27 Tottenham
1 Sterling 26 Man City
2 Saka 19 Arsenal
3 Maguire 28 Man Utd
0 Immobile 31 NaN
1 Insigne 30 NaN
2 Chiellini 36 NaN
3 Chiesa 23 NaN

Now, this is interesting! Pandas seem to have automatically appended the NaN values in the

rows where ‘Associated Clubs’ weren’t explicitly mentioned. In this case, we had only updated

‘Associated Clubs’ data on England. The corresponding values for Italy were set to NaN.
7. Perform following operations using pandas
a. Filling NaN with string
b. Sorting based on column values
c. groupby()

a. Filling NaN with string

Now, what if, instead of NaN, we want to include some other text? Let’s try adding “No Data
Found” instead of NaN values.

both_teams['Associated Clubs'].fillna('No Data Found', inplace=True)


both_teams
Name Age Associated Clubs
0 Kane 27 Tottenham
1 Sterling 26 Man City
2 Saka 19 Arsenal
3 Maguire 28 Man Utd
0 Immobile 31 No Data Found
1 Insigne 30 No Data Found
2 Chiellini 36 No Data Found
3 Chiesa 23 No Data Found

b. Sorting based on column values

Sorting operation is straightforward in Pandas. Sorting basically allows the data frame to be

ordered by numbers or alphabets (in either increasing or decreasing order). Let’s try and sort the

players according to their names.

both_teams.sort_values('Name')
Name Age Associated Clubs
2 Chiellini 36 No Data Found
3 Chiesa 23 No Data Found
0 Immobile 31 No Data Found
1 Insigne 30 No Data Found
0 Kane 27 Tottenham
3 Maguire 28 Man Utd
2 Saka 19 Arsenal
1 Sterling 26 Man City

Fair enough, we sorted the data frame according to the names of the players. We did this by

implementing the sort_values() function.

Let’s sort them by ages:

both_teams.sort_values('Age')
Name Age Associated Clubs
2 Saka 19 Arsenal
3 Chiesa 23 No Data Found
1 Sterling 26 Man City
0 Kane 27 Tottenham
3 Maguire 28 Man Utd
1 Insigne 30 No Data Found
0 Immobile 31 No Data Found
2 Chiellini 36 No Data Found

Ah, yes! Arsenal’s Bukayo Saka is the youngest lad out there!

Can we also sort by the oldest players? Absolutely!

both_teams.sort_values('Age', ascending=False)
Name Age Associated Clubs
2 Chiellini 36 No Data Found
0 Immobile 31 No Data Found
1 Insigne 30 No Data Found
3 Maguire 28 Man Utd
0 Kane 27 Tottenham
1 Sterling 26 Man City
3 Chiesa 23 No Data Found
2 Saka 19 Arsenal

c. Group by

Grouping is arguably the most important feature of Pandas. A groupby() function simply

groups a particular column. Let’s see a simple example by creating a new data frame.

a={
'UserID': ['U1001', 'U1002', 'U1001', 'U1001', 'U1003'],
'Transaction': [500, 300, 200, 300, 700]
}
df_a = pd.DataFrame(a)
df_a
UserID Transaction
0 U1001 500
1 U1002 300
2 U1001 200
3 U1001 300
4 U1003 700

Notice, we have two columns – UserID and Transaction. You can also see a repeating UserID

(U1001). Let’s apply a groupby() function to it.

df_a.groupby('UserID').sum()
Transaction
UserID
U1001 1000
U1002 300
U1003 700

The function grouped the rows by UserID and summed the transactions for each ID.

If you want to unravel a particular UserID, just try mentioning the value name through

get_group().

df_a.groupby('UserID').get_group('U1001')
UserID Transaction
0 U1001 500
2 U1001 200
3 U1001 300

And this is how we grouped our UserIDs and also checked for a particular ID name.
8. Read the following file formats using pandas
a. Text files
b. CSV files
c. Excel files
d. JSON files

a. Reading Text Files

Text files are one of the most common file formats to store data. Python makes it very easy to

read data from text files.

Python provides the open() function to read files that take in the file path and the file access

mode as its parameters. For reading a text file, the file access mode is ‘r’. I have mentioned the

other access modes below:

- ‘w’ – writing to a file
- ‘r+’ or ‘w+’ – read and write to a file
- ‘a’ – appending to an already existing file
- ‘a+’ – append to a file after reading

Python provides us with three functions to read data from a text file:

1. read(n) – This function reads n bytes (characters, in text mode) from the file, or the complete
contents of the file if no number is specified
2. readline(n) – This function reads at most n bytes from the file, but never more
than one line of text
3. readlines() – This function reads all the remaining lines of the file and, unlike read(),
returns them as a list of strings

Let us see how these functions differ in reading a text file:

# read text file

with open(r'./Importing files/Analytics Vidhya.txt','r') as f:
    print(f.read())
The read() function imported all the data in the file in the correct structured form.

By providing a number in the read() function, we would instead extract only the specified number of
bytes from the file.

# read text file

with open(r'./Importing files/Analytics Vidhya.txt','r') as f:
    print(f.readline())

Using readline(), only a single line from the text file was extracted.

# read text file

with open(r'./Importing files/Analytics Vidhya.txt','r') as f:
    print(f.readlines())

Here, the readlines() function extracted all the text file data as a list of lines.
b. Reading CSV Files in Python

Ah, the good old CSV format. A CSV (or Comma Separated Value) file is the most common

type of file that a data scientist will ever work with. These files use a “,” as a delimiter to

separate the values and each row in a CSV file is a data record.

These are useful for transferring data from one application to another, and that is probably the reason
why they are so commonplace in the world of data science.

If you look at them in Notepad, you will notice that the values are separated by commas.

The Pandas library makes it very easy to read CSV files using the read_csv() function:
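A minimal call might look like this (the file name below is an assumption; any CSV path will do):

import pandas as pd

# read a CSV file into a DataFrame and look at the first rows
df = pd.read_csv(r'./Importing files/sample.csv')
df.head()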

But CSV can run into problems if the values contain commas. This can be overcome by using

different delimiters to separate information in the file, like ‘\t’ or ‘;’, etc. These can also be

imported with the read_csv() function by specifying the delimiter in the parameter value as

shown below while reading a TSV (Tab Separated Values) file:


import pandas as pd

df = pd.read_csv(r'./Importing files/Employee.txt',delimiter='\t')

df

c. Reading Excel Files

Most of you will be quite familiar with Excel files and why they are so widely used to store

tabular data. So I’m going to jump right to the code and import an Excel file in Python using

Pandas.

Pandas has a very handy function called read_excel() to read Excel files:

# read Excel file into a DataFrame

df = pd.read_excel(r'./Importing files/World_city.xlsx')

# print values

df
But an Excel file can contain multiple sheets, right? So how can we access them?

For this, we can use the Pandas’ ExcelFile() function to print the names of all the sheets in the

file:

# read Excel sheets in pandas

xl = pd.ExcelFile(r'./Importing files/World_city.xlsx')

# print sheet name

xl.sheet_names

After doing that, we can easily read data from any sheet we wish by providing its name in

the sheet_name parameter in the read_excel() function:

# read Europe sheet

df = pd.read_excel(r'./Importing files/World_city.xlsx',sheet_name='Europe')

df

d. Working with JSON Files in Python

JSON (JavaScript Object Notation)


JSON files are a lightweight and human-readable way to store and exchange data. They are easy
for machines to parse and generate, and the format is based on the JavaScript
programming language.

JSON files store data within {} similar to how a dictionary stores it in Python. But their major

benefit is that they are language-independent, meaning they can be used with any programming

language – be it Python, C or even Java!

This is how a JSON file looks: a set of key-value pairs enclosed in curly braces.

Python provides a json module to read JSON files. You can open JSON files just like simple text
files. However, the read function, in this case, is replaced by the json.load() function, which returns
the parsed content as a Python dictionary.

Once you have done that, you can easily convert it into a Pandas dataframe using

the pandas.DataFrame() function:

import json
# open json file

with open('./Importing files/sample_json.json','r') as file:
    data = json.load(file)

# json dictionary

print(type(data))

# loading into a DataFrame

df_json = pd.DataFrame(data)

df_json

But you can even load the JSON file directly into a dataframe using

the pandas.read_json() function as shown below:

# reading directly into a DataFrame using pd.read_json()

path = './Importing files/sample_json.json'

df = pd.read_json(path)

df
9. Read the following file formats
a. Pickle files
b. Image files using PIL
c. Multiple files using Glob
d. Importing data from database
a. Reading Data from Pickle Files in Python

Pickle files are used to store the serialized form of Python objects. This means objects like list,

set, tuple, dict, etc. are converted to a character stream before being stored on the disk. This

allows you to continue working with the objects later on. These are particularly useful when you

have trained your machine learning model and want to save them to make predictions later on.

So, if you serialized the files before saving them, you need to de-serialize them before you use

them in your Python programs. This is done using the pickle.load() function in the pickle

module. But when you open the pickle file with Python’s open() function, you need to provide

the ‘rb’ parameter to read the binary file.

import pickle

with open('./Importing files/sample_pickle.pkl','rb') as file:
    data = pickle.load(file)

# pickle data

print(type(data))

df_pkl = pd.DataFrame(data)

df_pkl
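For completeness, this is roughly how such a pickle file could be written in the first place. The object and file name below are made-up assumptions for illustration:

import pickle

sample = {'model': 'my_trained_model', 'accuracy': 0.75}   # any picklable Python object (assumed)

# 'wb' opens the file for writing in binary mode
with open('./Importing files/my_object.pkl','wb') as file:
    pickle.dump(sample, file)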

b. Reading Image Files using PIL

The advent of Convolutional Neural Networks (CNN) has opened the flood gates to working in

the computer vision domain and solving problems like object detection, object classification,

generating new images and what not!

But before you jump on to working with these problems, you need to know how to open your

images in Python. Let’s see how we can do that by retrieving images from the webpage that we

stored in our local folder.

You will need the Python PIL (Python Image Library) for this job.

Simply call the open() function in the Image module of PIL and pass in the path to your image:

from PIL import Image

# filename = r'C:\Users\Dell\Desktop\Analytics Vidhya\Delhi\1.jpg'

filename = r'./Importing files/Delhi/1.jpg'

Image.open(filename)


Voila! We have our image to work with! And isn’t my Delhi just beautiful?

c. Reading Multiple Files using Glob

And now, what if you want to read multiple files in one go? That’s quite a common challenge in

data science projects.

Python’s Glob module lets you traverse through multiple files in the same location.

Using glob.glob(), we can import all the files from our local folder that match a special pattern.

These filename patterns can be made using different wildcards like “*” (for matching multiple

characters), “?” (for matching any single character), or ‘[0-9]’ (for matching any number). Let’s

see glob in action below.

When importing multiple .py files from the same directory as your Python script, we can use

the “*” wildcard:


for i in glob.glob('.\Importing files\*.py'):
    print(i)

When importing only a 5 character long Python file, we can use the “?” wildcard:

for i in glob.glob('.\Importing files\?????.py'):
    print(i)

When importing an image file containing a number in the filename, we can use the “[0-

9]” wildcard:

for i in glob.glob('./Importing files/test_image[0-9].png'):
    print(i)

Earlier, we imported a few images from the Wikipedia page on Delhi and saved them in a local

folder. I will retrieve these images using the glob module and then display them using

the PIL library:

import glob
import matplotlib.pyplot as plt
from PIL import Image

filepath = r'./Importing files/Delhi'

images = glob.glob(filepath + '/*.jpg')

for i in images[:3]:
    im = Image.open(i)
    plt.imshow(im)
    plt.show()
d. Importing Data from a Database using Python

When you are working on a real-world project, you would need to connect your program to a

database to retrieve data. There is no way around it (that’s why learning SQL is an important part

of your data science journey).

Data in databases is stored in the form of tables and these systems are known as Relational

database management systems (RDBMS). However, connecting to RDBMS and retrieving the

data from it can prove to be quite a challenging task. Here’s the good news – we can easily do

this using Python’s built-in modules!

One of the most popular RDBMS is SQLite. It has many plus points:

1. Lightweight database and hence it is easy to use in embedded software


2. 35% faster reading and writing compared to the File System
3. No intermediary server required. Reading and writing are done directly from the
database files on the disk
4. Cross-platform database file format. This means a file written on one machine can
be copied to and used on a different machine with a different architecture
There are many more reasons for its popularity. But for now, let’s connect with an SQLite database

and retrieve our data!

You will need to import the sqlite3 module to use SQLite. Then, you need to work through the

following steps to access your data:

1. Create a connection with the database connect(). You need to pass the name of your
database to access it. It returns a Connection object
2. Once you have done that, you need to create a cursor object using the cursor() function.
This will allow you to implement SQL commands with which you can manipulate your
data
3. You can execute the commands in SQL by calling the execute() function on the cursor
object. Since we are retrieving data from the database, we will use the SELECT
statement and store the query in an object
4. Store the data from the object into a dataframe by calling either the fetchone() function, for one
row, or the fetchall() function, for all the rows, on the object

And just like that, you have retrieved the data from the database into a Pandas dataframe!

A good practice is to save/commit your transactions using the commit() function even if you are

only reading the data.

import pandas as pd
import sqlite3

# open engine connection
con = sqlite3.connect('./Importing files/sample_test.db')

# create a cursor object
cur = con.cursor()

# Perform query: rs
rs = cur.execute('select * from TEST')

# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())

# Commit the transaction and close the connection
con.commit()
con.close()

# Print head of DataFrame df
df
10. Demonstrate web scraping using python

Web Scraping using Python

Web Scraping refers to extracting large amounts of data from the web. This is important for a

data scientist who has to analyze large amounts of data.

Python provides a very handy module called requests to retrieve data from any website.

The requests.get() function takes in a URL as its parameter and returns the HTML response as

its output. The way it works is summarized in the following steps:

1. It packages the Get request to retrieve data from webpage


2. Sends the request to the server
3. Receives the HTML response and stores in a response object

For this example, I want to show you a bit about my city – Delhi. So, I will retrieve data from the

Wikipedia page on Delhi:

import requests

# url = "https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27fb7eb73c70b91257d413147b69"
url = "https://en.wikipedia.org/wiki/Delhi"

# response object
resp = requests.get(url)

# using the text attribute of the response object, return the HTML of the webpage as a string
text = resp.text

print(text)

But as you can see, the data is not very readable. The tree-like structure of the HTML content

retrieved by our request is not very comprehensible. To improve this readability, Python has

another wonderful library called BeautifulSoup.

BeautifulSoup is a Python library for parsing the tree-like structure of HTML and extracting data
from the HTML document.

You can find out more about BeautifulSoup here.

Right, let’s see the wonder of BeautifulSoup.

To make it work, we need to pass the text response from the request object
to BeautifulSoup(), which creates its own object ("soup" in this case). BeautifulSoup() parses the
tree-like structure of the HTML document, and calling prettify() on the resulting object renders it
in a nicely indented, readable form:

import requests
from bs4 import BeautifulSoup

# url
# url = "https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27fb7eb73c70b91257d413147b69"
url = "https://en.wikipedia.org/wiki/Delhi"

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the response
print(soup.prettify())
You must have noticed the difference in the output. We have a more structured output in this

case!

Now, we can extract the title of the webpage through the title attribute of our soup object:

title = soup.title

title


The webpage has a lot of pictures of the famous monuments in Delhi and other things related to

Delhi. Let’s try and store these in a local folder.

We will need the Python urllib library to retrieve the URLs of the images that we want to store.
Its urllib.request module is used for opening and reading URLs. Calling
the urlretrieve() function from this module allows us to download the object denoted by a URL to a
local file:

import urllib

# function to save image from the passed URL
def download_img(url, i):
    # folder = r'C:\Users\Dell\Desktop\Analytics Vidhya\Delhi\\'
    folder = r'./Importing files/Delhi/'

    # define the file path to store images
    filepath = folder + str(i) + '.jpg'

    # retrieve the image from the URL and save it in the folder
    urllib.request.urlretrieve(url, filepath)

The images are stored in the “img” tag in HTML. These can be found by calling find_all() on

the soup object. After this, we can iterate over the image and get its source by calling

the get() function on the image object. The rest is handled by our download function:

images = soup.find_all('img')

i = 1

for image in images[2:10]:
    try:
        download_img('https:' + image.get('src'), i)
        i = i + 1
    except:
        continue
11. Perform following preprocessing techniques on loan prediction dataset
a. Feature Scaling
b. Feature Standardization
c. Label Encoding
d. One Hot Encoding
Available Data set

For this article, I have used a subset of the Loan Prediction data set (observations with missing
values are dropped). You can download the final training and testing data sets from
here: https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/

Now, let's get started by importing the important packages and the data set.

# Importing pandas
>> import pandas as pd
# Importing training data set
>> X_train=pd.read_csv('X_train.csv')
>> Y_train=pd.read_csv('Y_train.csv')
# Importing testing data set
>> X_test=pd.read_csv('X_test.csv')
>> Y_test=pd.read_csv('Y_test.csv')

Let's take a closer look at our data set.

>> print (X_train.head())


Loan_ID Gender Married Dependents Education Self_Employed
15 LP001032 Male No 0 Graduate No
248 LP001824 Male Yes 1 Graduate No
590 LP002928 Male Yes 0 Graduate No
246 LP001814 Male Yes 2 Graduate No
388 LP002244 Male Yes 0 Graduate No

ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term


15 4950 0.0 125.0 360.0
248 2882 1843.0 123.0 480.0
590 3000 3416.0 56.0 180.0
246 9703 0.0 112.0 360.0
388 2333 2417.0 136.0 360.0

Credit_History Property_Area
15 1.0 Urban
248 1.0 Semiurban
590 1.0 Semiurban
246 1.0 Urban
388 1.0 Urban

a. Feature Scaling

Feature scaling is a method to limit the range of variables so that they can be compared on
common grounds. It is performed on continuous variables. Let's plot the distributions of all the
continuous variables in the data set.

>> import matplotlib.pyplot as plt
>> X_train[X_train.dtypes[(X_train.dtypes=="float64")|(X_train.dtypes=="int64")].index.values].hist(figsize=[11,11])
After looking at these plots, we infer that ApplicantIncome and CoapplicantIncome are on a similar
scale (roughly 0 to 50,000 dollars) whereas LoanAmount is in thousands and ranges from 0 to 600.
The story for Loan_Amount_Term is completely different from the other variables, because its unit is
months as opposed to the other variables, where the unit is dollars.
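The scaling transformation itself is not shown above. A minimal sketch of how it could be done, together with the kNN fit referred to as "Exercise 1" in the next section, might look like this; the use of MinMaxScaler and KNeighborsClassifier here is an assumption, not the original exercise:

# Min-max scaling the continuous features and fitting a kNN classifier (sketch)
>> from sklearn.preprocessing import MinMaxScaler
>> from sklearn.neighbors import KNeighborsClassifier
>> from sklearn.metrics import accuracy_score
>> min_max=MinMaxScaler()
# scale the numeric columns to the [0, 1] range
>> X_train_minmax=min_max.fit_transform(X_train[['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
>> X_test_minmax=min_max.transform(X_test[['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
# fit a kNN classifier on the scaled features and check its accuracy
>> knn=KNeighborsClassifier(n_neighbors=5)
>> knn.fit(X_train_minmax,Y_train.values.ravel())
>> accuracy_score(Y_test,knn.predict(X_test_minmax))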

b. Feature Standardization
Before jumping to this section, I suggest you complete Exercise 1 first.

In the previous section, we worked on the Loan_Prediction data set and fitted a kNN learner on
the data set. After scaling down the data, we got an accuracy of 75%, which is considerably good.
I tried the same exercise with Logistic Regression and got the following results:

Before Scaling : 61%

After Scaling : 63%

The accuracy we got after scaling is close to the prediction we would make by guessing, which is
not a very impressive achievement. So, what is happening here? Why hasn't the accuracy
increased by a satisfactory amount, as it did for kNN?

Here is the answer:

In logistic regression, each feature is assigned a weight or coefficient (Wi). If there is a feature
with a relatively large range and it is insignificant in the objective function, then logistic regression
will itself assign a very low value to its coefficient, thus neutralizing the dominant effect of that
particular feature, whereas a distance-based method such as kNN does not have this built-in
strategy and therefore requires scaling.

Aren't we forgetting something? Our logistic model is still predicting with an accuracy almost as
close to a guess.

Now, I'll be introducing a new concept here called standardization. Many machine learning
algorithms in sklearn require standardized data, which means having zero mean and unit
variance.

Standardization (or Z-score normalization)


It is the process where the features are rescaled so that they have the properties of a standard
normal distribution with μ=0 and σ=1, where μ is the mean (average) and σ is the standard
deviation from the mean. Standard scores (also called z-scores) of the samples are calculated as
follows:

z = (x - μ) / σ

Elements such as the l1 and l2 regularizers in linear models (logistic regression comes under this
category) and the RBF kernel in SVM assume, in the learner's objective function, that all the
features are centered around zero and have variance of the same order.

Features having a larger order of variance would dominate the objective function, as happened
in the previous section with the feature having a large range. As we saw in Exercise 1, without any
preprocessing on the data the accuracy was 61%; let's standardize our data and apply
logistic regression to it. Sklearn provides scale to standardize the data.

# Standardizing the train and test data

>> from sklearn.preprocessing import scale
>> from sklearn.metrics import accuracy_score
>> X_train_scale=scale(X_train[['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
>> X_test_scale=scale(X_test[['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
# Fitting logistic regression on our standardized data set
>> from sklearn.linear_model import LogisticRegression
>> log=LogisticRegression(penalty='l2',C=.01)
>> log.fit(X_train_scale,Y_train)
# Checking the model's accuracy
>> accuracy_score(Y_test,log.predict(X_test_scale))
Out : 0.75
We again reached the maximum score that was attained using kNN after scaling. This means
standardizing the data when using an estimator with l1 or l2 regularization helps us increase the
accuracy of the prediction model. Other learners, like kNN with a Euclidean distance measure,
k-means, SVM, perceptrons, neural networks, linear discriminant analysis and principal component
analysis, may also perform better with standardized data.

Though, I suggest you understand your data and what kind of algorithm you are going to
apply to it; over time you will be able to judge whether to standardize your data or not.

Note: Choosing between scaling and standardizing is a confusing choice; you have to dive
deeper into your data and the learner you are going to use to reach a decision. For starters, you
can try both methods and check the cross-validation score to make a choice.

c. Label Encoding

In the previous sections, we did the pre-processing for continuous numeric features. But our data set
has other features too, such as Gender, Married, Dependents, Self_Employed and Education. All
these categorical features have string values. For example, Gender has two levels,
either Male or Female. Let's feed these features into our logistic regression model.

# Fitting a logistic regression model on whole data


>> log=LogisticRegression(penalty='l2',C=.01)
>> log.fit(X_train,Y_train)
# Checking the model's accuracy
>> accuracy_score(Y_test,log.predict(X_test))
Out : ValueError: could not convert string to float: Semiurban

We got an error saying that it cannot convert a string to a float. So what's actually happening here
is that learners like logistic regression, distance-based methods such as kNN, support vector
machines, tree-based methods, etc. in sklearn need numeric arrays. Features having string values
cannot be handled by these learners.

Sklearn provides a very efficient tool for encoding the levels of a categorical feature into
numeric values. LabelEncoder encodes labels with values between 0 and n_classes-1.

Let's encode all the categorical features.

# Importing LabelEncoder and initializing it

>> from sklearn.preprocessing import LabelEncoder
>> le=LabelEncoder()
# Iterating over all the common columns in train and test
>> for col in X_test.columns.values:
       # Encoding only categorical variables
       if X_test[col].dtypes=='object':
           # Using the whole data to form an exhaustive list of levels
           # (Series.append was removed in pandas 2.0; use pd.concat([X_train[col], X_test[col]]) there)
           data=X_train[col].append(X_test[col])
           le.fit(data.values)
           X_train[col]=le.transform(X_train[col])
           X_test[col]=le.transform(X_test[col])

All our categorical features are encoded. You can look at your updated data set

using X_train.head(). We are going to take a look at Gender frequency distribution before and

after the encoding.

Before : Male 318


Female 66
Name: Gender, dtype: int64
After : 1 318
0 66
Name: Gender, dtype: int64

Now that we are done with label encoding, let's now run a logistic regression model on the data
set with both categorical and continuous features.

# Standardizing the features


>> X_train_scale=scale(X_train)
>> X_test_scale=scale(X_test)
# Fitting the logistic regression model
>> log=LogisticRegression(penalty='l2',C=.01)
>> log.fit(X_train_scale,Y_train)
# Checking the models accuracy
>> accuracy_score(Y_test,log.predict(X_test_scale))
Out : 0.75

It's working now. But the accuracy is still the same as we got with logistic regression after
standardization of the numeric features. This means the categorical features we added are not very
significant in our objective function.

d. One-Hot Encoding

One-Hot Encoding transforms each categorical feature with n possible values into n binary

features, with only one active.

Most ML algorithms either learn a single weight for each feature or compute distances
between samples. Algorithms like linear models (such as logistic regression) belong to the
first category.
Let's take a look at an example from the loan_prediction data set. The feature Dependents has 4
possible values: 0, 1, 2 and 3+, which are then encoded without loss of generality to 0, 1, 2 and 3.
We then have a weight "W" assigned to this feature in a linear classifier, which makes a decision
based on the constraint W*Dependents + K > 0, i.e. by comparing W*Dependents against a single threshold.

Let f(w)= W*Dependents

Possible values that can be attained by the equation are 0, W, 2W and 3W. A problem with this
equation is that the single weight "W" cannot make a decision based on four choices. It can reach a
decision in the following ways:

- All levels lead to the same decision (all of them < K or vice versa)
- 3:1 division of the levels (decision boundary at f(w) > 2W)
- 2:2 division of the levels (decision boundary at f(w) > W)

Here we can see that we are losing many different possible decisions, such as the case where
"0" and "2W" should be given the same label while "3W" and "W" are the odd ones out.

This problem can be solved by One-Hot Encoding, as it effectively changes the dimensionality of
the feature "Dependents" from one to four; thus every value in the feature "Dependents" will
have its own weight. The updated equation for the decision would be f'(w) < K,

where f'(w) = W1*D_0 + W2*D_1 + W3*D_2 + W4*D_3

All four new variables have boolean values (0 or 1).

The same thing happens with distance-based methods such as kNN. Without encoding, the distance
between the "0" and "1" values of Dependents is 1 whereas the distance between "0" and "3+" is 3,
which is not desirable, as both distances should be similar. After encoding, the values become
new features (the sequence of columns is 0, 1, 2, 3+): [1,0,0,0] and [0,0,0,1] (initially we were finding
the distance between "0" and "3+"); now the distance is √2.
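As a quick numeric check of the √2 claim above (a small sketch using NumPy, not part of the original text):

import numpy as np

d0  = np.array([1,0,0,0])   # Dependents == "0" after one-hot encoding
d3p = np.array([0,0,0,1])   # Dependents == "3+" after one-hot encoding
print(np.linalg.norm(d0 - d3p))   # 1.4142... = sqrt(2)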

For tree-based methods, the same situation (more than two values in a feature) may affect the outcome to some extent, but if methods such as random forests are deep enough they can handle categorical variables without one-hot encoding.
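As an illustration of that point, a random forest can be fit directly on the label-encoded features. This is only a sketch, assuming the label-encoded X_train, Y_train, X_test and Y_test from the previous section are still in memory; the n_estimators and max_depth values are arbitrary choices, not tuned settings:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# trees split on thresholds, so the integer codes from label encoding are usable as-is
rf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
rf.fit(X_train, Y_train)
print(accuracy_score(Y_test, rf.predict(X_test)))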

Now, let's take a look at the implementation of one-hot encoding with various algorithms.

First, let's create a logistic regression model for classification without one-hot encoding.

# We are using the scaled variables, as we saw in the previous section
# that scaling affects algorithms with an l1 or l2 regularizer
>> X_train_scale=scale(X_train)
>> X_test_scale=scale(X_test)
# Fitting a logistic regression model
>> log=LogisticRegression(penalty='l2',C=1)
>> log.fit(X_train_scale,Y_train)
# Checking the model's accuracy
>> accuracy_score(Y_test,log.predict(X_test_scale))
Out : 0.73958333333333337

Now we are going to encode the data.

>> from sklearn.preprocessing import OneHotEncoder
>> enc=OneHotEncoder(sparse=False)
>> X_train_1=X_train
>> X_test_1=X_test
>> columns=['Gender', 'Married', 'Dependents', 'Education','Self_Employed',
            'Credit_History', 'Property_Area']
>> for col in columns:
       # creating an exhaustive list of all possible categorical values
       data=X_train[[col]].append(X_test[[col]])
       enc.fit(data)
       # Fitting One Hot Encoding on train data
       temp = enc.transform(X_train[[col]])
       # Changing the encoded features into a data frame with new column names
       temp=pd.DataFrame(temp,columns=[(col+"_"+str(i)) for i in data[col]
                         .value_counts().index])
       # In side-by-side concatenation the index values should be the same
       # Setting the index values similar to the X_train data frame
       temp=temp.set_index(X_train.index.values)
       # adding the new One Hot Encoded variables to the train data frame
       X_train_1=pd.concat([X_train_1,temp],axis=1)
       # Fitting One Hot Encoding on test data
       temp = enc.transform(X_test[[col]])
       # changing it into a data frame and adding column names
       temp=pd.DataFrame(temp,columns=[(col+"_"+str(i)) for i in data[col]
                         .value_counts().index])
       # Setting the index for proper concatenation
       temp=temp.set_index(X_test.index.values)
       # adding the new One Hot Encoded variables to the test data frame
       X_test_1=pd.concat([X_test_1,temp],axis=1)

Now, let's apply a logistic regression model to the one-hot encoded data.

# Standardizing the data set


>> X_train_scale=scale(X_train_1)
>> X_test_scale=scale(X_test_1)
# Fitting a logistic regression model
>> log=LogisticRegression(penalty='l2',C=1)
>> log.fit(X_train_scale,Y_train)
# Checking the model's accuracy
>> accuracy_score(Y_test,log.predict(X_test_scale))
Out : 0.75

Here we again get an accuracy of 0.75, the maximum we have obtained so far. In this case the logistic regression regularization parameter C is 1, whereas earlier we used C=0.01.


12. Perform following visualizations using matplotlib
a. Bar Graph
b. Pie Chart
c. Box Plot
d. Histogram
e. Line Chart and Subplots
f. Scatter Plot

Data Visualization: It is a way to express your data in a visual context so that patterns, correlations and trends in the data can be easily understood. Data visualization helps in finding hidden insights by providing a skin to your raw data (the skeleton).

In this section, we will be using multiple datasets to show exactly how things work. The base dataset will be the iris dataset, which we will read from a CSV file. We will create the rest of the datasets as needed.

Let's import all the libraries required for these visualizations:

import math,os,random
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stat
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

Reading the data (Main focused dataset)

iris = pd.read_csv('../input/iris-flower-dataset/IRIS.csv') # iris dataset


iris_feat = iris.iloc[:,:-1]
iris_species = iris.iloc[:,-1]

Let's start exploring the different ways to visualize data.

Scatter Plot
These are the charts/plots used to observe and display relationships between variables using Cartesian coordinates. The values (x: first variable, y: second variable) of the variables are represented by dots. Scatter plots are also known as scattergrams, scatter graphs, scatter charts, or scatter diagrams. They are best suited for situations where the dependent variable can take multiple values for a given value of the independent variable.

Scatter Plot with Matplotlib

plt.scatter(iris_feat['sepal_length'],iris_feat['petal_length'],alpha=1) # alpha changes the transparency
#Adding the aesthetics
plt.title('Scatter Plot')
plt.xlabel('sepal_length')
plt.ylabel('petal_length')
#Show the plot
plt.show()

Scatter plot for Multivariate Analysis

colors = {'Iris-setosa':'r', 'Iris-virginica':'g', 'Iris-versicolor':'b'}
# create a figure and axis
fig, ax = plt.subplots()
# plot the points of each species in its own colour so that a legend can be drawn
for species, colour in colors.items():
    subset = iris_feat[iris_species == species]
    ax.scatter(subset['sepal_length'], subset['sepal_width'], color=colour, label=species)
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
ax.legend()
plt.show()

Two common issues with the use of scatter plots are overplotting and the interpretation of correlation as causation.

Overplotting occurs when there are too many data points to plot; the points overlap and it becomes hard to identify any relationship between them.

Correlation does not mean that the changes observed in one variable are responsible for the changes in another variable, so any conclusions drawn from a correlation should be treated carefully.
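One common way to reduce overplotting is to lower the marker opacity (alpha) and size so that dense regions remain readable. A small sketch using synthetic data (not the iris set):

import numpy as np
import matplotlib.pyplot as plt

# synthetic, heavily overlapping points
x = np.random.normal(0, 1, 5000)
y = x + np.random.normal(0, 0.5, 5000)
# low alpha and small markers let the density show through
plt.scatter(x, y, alpha=0.1, s=8)
plt.title('Reducing overplotting with alpha and marker size')
plt.show()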

Line Plot

A line plot is a graph used to represent continuous data points along a number line. Line plots are created by first plotting the data points on the Cartesian plane and then joining those points with line segments. Line plots can display data points for both single-variable and multi-variable analysis.

Line Plot in Matplotlib

# get columns to plot
columns = iris_feat.columns
# create x data
x_data = range(0, iris.shape[0])
# create figure and axis
fig, ax = plt.subplots()
# plot each column
for column in columns:
    ax.plot(x_data, iris[column], label=column)
# set title and legend
ax.set_title('Iris Dataset')
ax.legend()
plt.show()

Seaborn Implementation of Line Graphs

# Seaborn Implementation
df = pd.DataFrame({
    'A': [1,3,2,7,9,6,8,10],
    'B': [2,4,1,8,10,3,11,12],
    'C': ['a','a','a','a','b','b','b','b']
})
sns.lineplot(
    data=df,
    x="A", y="B", hue="C", style="C",
    markers=True, dashes=False
)
Histograms

Histograms are used to represent the frequency distribution of continuous variables. The width of each bar represents a bin interval and its height represents the frequency of values falling in that bin. To create a histogram you need to divide the range of values into non-overlapping bins. Histograms allow the inspection of data for its underlying distribution, outliers and skewness.

Histograms in Matplotlib

fig, ax = plt.subplots()
# plot histogram
ax.hist(iris_feat['sepal_length'])
# set title and labels
ax.set_title('sepal_length')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
Taller bars signify that more values are concentrated in that interval. Histograms help in understanding the frequency distribution of the data points for that feature.
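The number of bins (or their exact edges) can be controlled with the bins argument; a short sketch on the same column, with 20 bins instead of matplotlib's default of 10:

fig, ax = plt.subplots()
# 20 equal-width, non-overlapping bins
ax.hist(iris_feat['sepal_length'], bins=20, edgecolor='black')
ax.set_title('sepal_length with 20 bins')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
plt.show()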

LINE HISTOGRAMS (MODIFICATION TO HISTOGRAMS)

Line histograms are a modification of the standard histogram, used to understand and represent the distribution of a single feature across its data points. A line histogram has a density curve passing through the histogram.

Matplotlib Implementation

#Creating the dataset


test_data = np.random.normal(0, 1, (500, ))
density = stat.gaussian_kde(test_data)
#Creating the line histogram
n, x, _ = plt.hist(test_data, bins=np.linspace(-3, 3, 50), histtype=u'step', density=True)
plt.plot(x, density(x))
#Adding the aesthetics
plt.title('Title')
plt.xlabel('X axis')
plt.ylabel('Y axis')
#Show the plot
plt.show()

A normal histogram is bell-shaped, with most of the frequency counts focused in the middle and diminishing tails. The orange line passing through the histogram represents the Gaussian kernel density estimate of the data points.

Bar Chart

Bar charts are best suited for visualizing categorical data because they let you easily compare feature values by the size (length) of the bars. There are 2 types of bar charts depending on their orientation (vertical or horizontal), and 3 types based on their representation, as shown below.

1. NORMAL BAR CHART

Matplotlib Implementation

df = iris.groupby('species')['sepal_length'].sum().to_frame().reset_index()
#Creating the bar chart
plt.bar(df['species'],df['sepal_length'],
        color=['cornflowerblue','lightseagreen','steelblue'])
#Adding the aesthetics
plt.title('Bar Chart')
plt.xlabel('Species')
plt.ylabel('sepal_length')
#Show the plot
plt.show()
With the above chart, we can clearly see the difference in the sum of sepal_length for each iris species.

GROUPED BAR CHART

These bar charts allow us to compare multiple categorical features. Let's see an example.

Matplotlib Implementation

mid_term_marks=[random.uniform(0,50) for i in range(5)]
end_term_marks=[random.uniform(0,100) for i in range(5)]
fig=plt.figure(figsize=(10,6))
students=['students A','students B','students C','students D','students E']
x_pos_mid=list(range(1,6))
x_pos_end=[i+0.4 for i in x_pos_mid]
graph_mid_term=plt.bar(x_pos_mid, mid_term_marks,
                       color='indianred',label='midterm',width=0.4)
graph_end_term=plt.bar(x_pos_end, end_term_marks,
                       color='steelblue',label='endterm',width=0.4)
plt.xticks([i+0.2 for i in x_pos_mid],students)
plt.title('students names')
plt.ylabel('Scores')
plt.legend()
plt.show()
STACKED BAR CHART

Matplotlib Implementation

df = pd.DataFrame(columns=["A","B","C","D"],
                  data=[["E",1,2,0],
                        ["F",3,1,3],
                        ["G",1,2,1]])
df.plot.bar(x='A', y=["B","C","D"], stacked=True, alpha=0.8,
            color=['steelblue','darkorange','mediumseagreen'])
plt.title('Title')
#Show the plot
plt.show()
Pie Plot

A pie plot is a circular representation of data in terms of relative proportions. The pie chart is divided into slices whose sizes reflect the numerical proportion of each category.

#Creating the dataset


students = ['A','B','C','D','E']
scores = [30,40,50,10,5]
#Creating the pie chart
plt.pie(scores, explode=[0,0.1,0,0,0], labels = students,
colors = ['#EF8FFF','#ff6347','#B0E0E6','#7B68EE','#483D8B'])
#Show the plot
plt.show()
By removing the explode argument from the above function call you can create a completely joined pie chart.

Box Plot

This is one of the methods most used by data scientists. A box plot is a way of displaying the distribution of data based on the five-number summary. It gives information about the outliers and how spread out the data is from the center, tells whether the data is symmetric, and shows how tightly grouped or skewed the data is. Box plots are also known as box-and-whisker plots.

sepal_length = iris_feat['sepal_length']
petal_length = iris_feat['petal_length']
petal_width = iris_feat['petal_width']
sepal_width = iris_feat['sepal_width']
data = [sepal_length , petal_length , petal_width , sepal_width]
fig1, ax1 = plt.subplots()
ax1.set_title('Basic Plot')
ax1.boxplot(data)
plt.show()
The dots or bubbles beyond the whiskers of the 4th boxplot are outliers (by matplotlib's default, points more than 1.5 times the interquartile range away from the box). The line inside each box depicts the median of the data points for that variable.
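The five numbers summarized by each box (minimum, first quartile, median, third quartile, maximum) can also be computed directly; a small sketch for one of the columns plotted above:

import numpy as np

values = iris_feat['sepal_width']
# 0th, 25th, 50th, 75th and 100th percentiles give the five-number summary
five_num = np.percentile(values, [0, 25, 50, 75, 100])
print(dict(zip(['min', 'Q1', 'median', 'Q3', 'max'], five_num)))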
