Data Science Python Lab
np.array([1,2,3,4],dtype=np.float32)
Here the values passed in are integers, but the dtype argument lets you specify the type of data the array should hold, so they are stored as 32-bit floats.
Output:
array([1., 2., 3., 4.], dtype=float32)
Since NumPy arrays can contain only homogeneous datatypes, values will be upcast if the types
do not match:
np.array([1,2.0,3,4])
Output:
array([1., 2., 3., 4.])
Here, NumPy has upcast integer values to float values.
Array of ones
You could also create an array of all 1s using the np.ones() method:
np.ones(5,dtype=np.int32)
array([1, 1, 1, 1, 1])
Identity matrix in NumPy
Another great method is np.eye() that returns an array with 1s along its diagonal
and 0s everywhere else.
An Identity matrix is a square matrix that has 1s along its main diagonal and 0s everywhere
else. Below is an Identity matrix of shape 3 x 3.
Note: A square matrix has an N x N shape. This means it has the same number of rows and
columns.
# identity matrix
np.eye(3)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
However, NumPy gives you the flexibility to change the diagonal along which the values have to
be 1s. You can either move it above the main diagonal:
# not an identity matrix
np.eye(3,k=1)
array([[0., 1., 0.],
[0., 0., 1.],
[0., 0., 0.]])
Or move it below the main diagonal:
np.eye(3,k=-2)
array([[0., 0., 0.],
[0., 0., 0.],
[1., 0., 0.]])
Note: A matrix is called the Identity matrix only when the 1s are along the main diagonal and
not any other diagonal!
np.arange(1,10,3)
Output:
array([1, 4, 7])
Every third element was printed because the step-size was defined as 3. Notice that 10 was not printed, because np.arange() excludes the end value of the interval.
Another similar function is np.linspace(), but instead of step size, it takes in the number of
samples that need to be retrieved from the interval. A point to note here is that the last number is
included in the values returned unlike in the case of np.arange().
np.linspace(0,1,5)
array([0. , 0.25, 0.5 , 0.75, 1. ])
Great! Now you know how to create arrays using NumPy. But it's also important to know the shape of the array.
2. The Shape and Reshaping of NumPy Array
a. Dimensions of NumPy array
b. Shape of NumPy array
c. Size of NumPy array
d. Reshaping a NumPy array
e. Flattening a NumPy array
f. Transpose of a NumPy array
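As a quick reference for items (a)–(d) before the flatten example below, here is a minimal sketch; the array values are illustrative, not from the original lab:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.ndim)           # number of dimensions: 2
print(a.shape)          # shape: (2, 3)
print(a.size)           # total number of elements: 6
print(a.reshape(3, 2))  # same data rearranged into 3 rows and 2 columns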
import numpy as np
a = np.ones((2,2))
b = a.flatten()
c = a.ravel()
print('Original shape :', a.shape)
print('Array :','\n', a)
print('Shape after flatten :',b.shape)
print('Array :','\n', b)
print('Shape after ravel :',c.shape)
print('Array :','\n', c)
Original shape : (2, 2)
Array :
[[1. 1.]
[1. 1.]]
Shape after flatten : (4,)
Array :
[1. 1. 1. 1.]
Shape after ravel : (4,)
Array :
[1. 1. 1. 1.]
But an important difference between flatten() and ravel() is that the former returns a copy of the
original array while the latter returns a reference to the original array. This means any changes
made to the array returned from ravel() will also be reflected in the original array while this will
not be the case with flatten().
b[0] = 0
print(a)
[[1. 1.]
[1. 1.]]
The change made was not reflected in the original array.
c[0] = 0
print(a)
[[0. 1.]
[1. 1.]]
But here, the changed value is also reflected in the original ndarray.
What is happening here is that flatten() creates a Deep copy of the ndarray while ravel() creates
a Shallow copy of the ndarray.
Deep copy means that a completely new ndarray is created in memory and the ndarray object
returned by flatten() is now pointing to this memory location. Therefore, any changes made here
will not be reflected in the original ndarray.
A Shallow copy, on the other hand, returns a reference to the original memory location. Meaning
the object returned by ravel() is pointing to the same memory location as the original ndarray
object. So, definitely, any changes made to this ndarray will also be reflected in the original
ndarray too.
f. Transpose of a NumPy array
Another very interesting reshaping method of NumPy is the transpose() method. It takes the
input array and swaps the rows with the column values, and the column values with the values of
the rows:
import numpy as np
a = np.array([[1,2,3],
[4,5,6]])
b = np.transpose(a)
print('Original','\n','Shape',a.shape,'\n',a)
print('Expand along columns:','\n','Shape',b.shape,'\n',b)
Original
Shape (2, 3)
[[1 2 3]
[4 5 6]]
Expand along columns:
Shape (3, 2)
[[1 4]
[2 5]
[3 6]]
On transposing a 2 x 3 array, we got a 3 x 2 array. Transpose has a lot of significance in linear
algebra.
3. Expanding and Squeezing a NumPy Array
a. Expanding a NumPy array
b. Squeezing a NumPy array
c. Sorting in NumPy Arrays
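Although this part of the lab moves straight on to slicing, a minimal sketch of the three operations named above (expanding, squeezing and sorting) could look like this; the example values are assumptions for illustration:
import numpy as np
a = np.array([3, 1, 2])
expanded = np.expand_dims(a, axis=0)   # shape (1, 3): a new axis of length 1 is added
squeezed = np.squeeze(expanded)        # shape (3,): axes of length 1 are removed
print(expanded.shape, squeezed.shape)
print(np.sort(a))                      # [1 2 3]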
Slicing means retrieving elements from one index to another index. All we have to do is to pass
the starting and ending point in the index like this: [start: end].
However, you can even take it up a notch by passing the step-size. What is that? Well, suppose
you wanted to print every other element from the array, you would define your step-size as 2,
meaning get the element 2 places away from the present index.
Incorporating all this into a single index would look something like this: [start:end:step-size].
a = np.array([1,2,3,4,5,6])
print(a[1:5:2])
[2 4]
Notice that the last element did not get considered. This is because slicing includes the start index but excludes the end index.
A way around this is to write the next higher index to the final index value you want to retrieve:
a = np.array([1,2,3,4,5,6])
print(a[1:6:2])
[2 4 6]
If you don’t specify the start or end index, it is taken as 0 or array size, respectively, as default.
a = np.array([1,2,3,4,5,6])
print(a[:6:2])
print(a[1::2])
print(a[1:6:])
[1 3 5]
[2 4 6]
[2 3 4 5 6]
Now, a 2-D array has rows and columns, so it can get a little tricky to slice 2-D arrays. But once you understand the pattern, it is just as straightforward.
Before learning how to slice a 2-D array, let’s have a look at how to retrieve an element from
a 2-D array:
a = np.array([[1,2,3],
[4,5,6]])
print(a[0,0])
print(a[1,2])
print(a[1,0])
1
6
4
Here, we provided the row value and column value to identify the element we wanted to extract.
While in a 1-D array, we were only providing the column value since there was only 1 row.
So, to slice a 2-D array, you need to mention the slices for both the row and the column:
a = np.array([[1,2,3],[4,5,6]])
# print first row values
print('First row values :','\n',a[0:1,:])
# with step-size for columns
print('Alternate values from first row:','\n',a[0:1,::2])
# second column values with step-size
print('Second column values :','\n',a[:,1::2])
print('Arbitrary values :','\n',a[0:1,1:3])
First row values :
[[1 2 3]]
Alternate values from first row:
[[1 3]]
Second column values :
[[2]
[5]]
Arbitrary values :
[[2 3]]
So far we haven’t seen a 3-D array. Let’s first create one and visualize what it looks like:
a = np.array([[[ 1, 2],[ 3, 4],[ 5, 6]],
[[ 7, 8],[ 9,10],[11,12]],
[[13,14],[15,16],[17,18]]])
print(a)
[[[ 1  2]
[ 3  4]
[ 5  6]]
[[ 7  8]
[ 9 10]
[11 12]]
[[13 14]
[15 16]
[17 18]]]
In addition to the rows and columns, as in a 2-D array, a 3-D array also has a depth axis where it
stacks one 2-D array behind the other. So, when you are slicing a 3-D array, you also need to
mention which 2-D array you are slicing. This usually comes as the first value in the index:
# value
print('First array, first row, first column value :','\n',a[0,0,0])
print('First array last column :','\n',a[0,:,1])
print('First two rows for second and third arrays :','\n',a[1:,0:2,0:2])
First array, first row, first column value :
1
First array last column :
[2 4 6]
First two rows for second and third arrays :
[[[ 7 8]
[ 9 10]]
[[13 14]
[15 16]]]
If you wanted the values as a single-dimension array, you can always use the flatten() method.
An interesting way to slice your array is to use negative slicing. Negative slicing prints elements
from the end rather than the beginning. Have a look below:
a = np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print(a[:,-1])
[ 5 10]
Here, the last values for each row were printed. If, however, we wanted to extract from the end,
we would have to explicitly provide a negative step-size otherwise the result would be an
empty list.
print(a[:,-1:-3:-1])
[[ 5 4]
[10 9]]
Having said that, the basic logic of slicing remains the same, i.e. the end index is never included.
a = np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print('Original array :','\n',a)
print('Reversed array :','\n',a[::-1,::-1])
Original array :
[[ 1 2 3 4 5]
[ 6 7 8 9 10]]
Reversed array :
[[10 9 8 7 6]
[ 5 4 3 2 1]]
You can also use the flip() method to reverse an ndarray.
a = np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print('Original array :','\n',a)
print('Reversed array vertically :','\n',np.flip(a,axis=1))
print('Reversed array horizontally :','\n',np.flip(a,axis=0))
Original array :
[[ 1 2 3 4 5]
[ 6 7 8 9 10]]
Reversed array vertically :
[[ 5 4 3 2 1]
[10 9 8 7 6]]
Reversed array horizontally :
[[ 6 7 8 9 10]
[ 1 2 3 4 5]]
5. Stacking and Concatenating Numpy Arrays
a. Stacking ndarrays
b. Concatenating ndarrays
c. Broadcasting in Numpy Array
a. Stacking ndarrays
You can create a new array by combining existing arrays. This you can do in two ways:
Either combine the arrays vertically (i.e. along the rows) using the vstack() method,
thereby increasing the number of rows in the resulting array
Or combine the arrays in a horizontal fashion (i.e. along the columns) using the hstack(), thereby
increasing the number of columns in the resultant array
a = np.arange(0,5)
b = np.arange(5,10)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Vertical stacking :','\n',np.vstack((a,b)))
print('Horizontal stacking :','\n',np.hstack((a,b)))
Array 1 :
[0 1 2 3 4]
Array 2 :
[5 6 7 8 9]
Vertical stacking :
[[0 1 2 3 4]
[5 6 7 8 9]]
Horizontal stacking :
[0 1 2 3 4 5 6 7 8 9]
A point to note here is that the axis along which you are combining the arrays should have the same size, otherwise you will get an error:
a = np.arange(0,5)
b = np.arange(5,9)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Vertical stacking :','\n',np.vstack((a,b)))
print('Horizontal stacking :','\n',np.hstack((a,b)))
Another interesting way to combine arrays is using the dstack() method. It combines array
elements index by index and stacks them along the depth axis:
a = [[1,2],[3,4]]
b = [[5,6],[7,8]]
c = np.dstack((a,b))
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Dstack :','\n',c)
print(c.shape)
Array 1 :
[[1, 2], [3, 4]]
Array 2 :
[[5, 6], [7, 8]]
Dstack :
[[[1 5]
[2 6]]
[[3 7]
[4 8]]]
(2, 2, 2)
b. Concatenating ndarrays
While stacking arrays is one way of combining old arrays to get a new one, you could also use
the concatenate() method where the passed arrays are joined along an existing axis:
a = np.arange(0,5).reshape(1,5)
b = np.arange(5,10).reshape(1,5)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Concatenate along rows :','\n',np.concatenate((a,b),axis=0))
print('Concatenate along columns :','\n',np.concatenate((a,b),axis=1))
Array 1 :
[[0 1 2 3 4]]
Array 2 :
[[5 6 7 8 9]]
Concatenate along rows :
[[0 1 2 3 4]
[5 6 7 8 9]]
Concatenate along columns :
[[0 1 2 3 4 5 6 7 8 9]]
The drawback of this method is that the original arrays must already contain the axis along which you want to concatenate them.
Another very useful function is the append method that adds new elements to the end of an ndarray. This is obviously useful when you already have an existing ndarray but want to add new
values to it.
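As a rough illustration (the values here are assumed, not from the lab):
a = np.array([1, 2, 3])
b = np.append(a, [4, 5])   # returns a new array; the original is unchanged
print(b)                   # [1 2 3 4 5]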
Broadcasting is one of the best features of ndarrays. It lets you perform arithmetic operations between arrays of different shapes.
Broadcasting essentially stretches the smaller ndarray so that it matches the shape of the larger
ndarray:
a = np.arange(10,20,2)
b = np.array([[2],[2]])
print('Adding two different size arrays :','\n',a+b)
print('Multiplying an ndarray and a number :',a*2)
Adding two different size arrays :
[[12 14 16 18 20]
[12 14 16 18 20]]
Multiplying an ndarray and a number : [20 24 28 32 36]
Its working can be thought of as stretching or making copies of the smaller operand — for the scalar multiplication above, the number 2 is conceptually repeated as [2, 2, 2, 2, 2] — to match the shape of the ndarray, and then performing the operation element-wise. But no such copies are actually made; it is just a way of thinking about how broadcasting works.
This is very useful because it is more efficient to multiply an array with a scalar value than to build another full-sized array first! It is important to note that two ndarrays can broadcast together only when their shapes are compatible, i.e. each trailing dimension is either equal in both arrays or equal to 1 in one of them.
In case the arrays are not compatible, you will get a ValueError.
a = np.ones((3,3))
b = np.array([2])
a+b
array([[3., 3., 3.],
[3., 3., 3.],
[3., 3., 3.]])
Here, the second ndarray was stretched, hypothetically, to a 3 x 3 shape, and then the result was
calculated.
6. Perform following operations using pandas
a. Creating dataframe
b. concat()
c. Setting conditions
d. Adding a new column
Pandas is one of the most popular and powerful data science libraries in Python. It can be
considered as the stepping stone for any aspiring data scientist who prefers to code in Python.
Even though the library is easy to get started with, it can handle a wide variety of data manipulation tasks. This makes Pandas one of the handiest data science libraries in the developer community. Pandas basically allows the manipulation of large datasets and data frames. It can also be considered one of the most efficient statistical tools for mathematical computations on
tabular data.
a. Creating dataframe
Let’s start off by creating a small sample dataset to try out various operations with Pandas. In
this tutorial, we shall create a Football data frame that stores the record of 4 players each from England and Italy:
import pandas as pd
# Create team data
data_england = {'Name': ['Kane', 'Sterling', 'Saka', 'Maguire'], 'Age': [27, 26, 19, 28]}
data_italy = {'Name': ['Immobile', 'Insigne', 'Chiellini', 'Chiesa'], 'Age': [31, 30, 36, 23]}
# Create Dataframe
df_england = pd.DataFrame(data_england)
df_italy = pd.DataFrame(data_italy)
Let’s start by concatenating our two data frames. The word “concatenate” means to “link
together in series”. Now that we have created two data frames, let’s try and “concat” them.
Try doing:
df_england.append(df_italy)
Now, imagine you wanted to label your original data frames with the associated countries of
these players. You can do this by setting specific keys to your data frames.
Try doing:
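The code for this step is missing in this copy. A likely reconstruction uses pd.concat() with the keys argument (note that DataFrame.append() is deprecated in recent pandas versions in favour of pd.concat()). The both_teams frame referenced by the later examples appears to be a plain concatenation, so both variants are sketched:
# labelled concatenation, as suggested above
pd.concat([df_england, df_italy], keys=['England', 'Italy'])
# plain concatenation, assumed to be the both_teams frame used in the later cells
both_teams = pd.concat([df_england, df_italy])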
Conditional statements basically define conditions for data frame columns. There may be
situations where you have to filter out various data by applying certain column conditions
(numeric or non-numeric). For eg: In an Employee data frame, you might have to list out a
bunch of people whose salary is more than Rs. 50000. Also, you might want to filter the people
who live in New Delhi, or whose name starts with “A”. Let’s see a hands-on example.
Imagine we want to filter experienced players from our squad. Let’s say, we want to filter those
players whose age is greater than or equal to 30. In such case, try doing:
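A minimal sketch of that filter, assuming the both_teams frame from the concatenation step:
both_teams[both_teams['Age'] >= 30]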
Now, let’s try to do some string filtration. We want to filter those players whose name starts with
“S”. This implementation can be done by pandas’ startswith() function. Let’s try:
both_teams[both_teams["Name"].str.startswith('S')]
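The code that adds the club information is not present in this copy; a plausible reconstruction, using the club names that appear in the later outputs, is:
df_england['Associated Clubs'] = ['Tottenham', 'Man City', 'Arsenal', 'Man Utd']
df_england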
This will add a new column ‘Associated Clubs’ to England’s data frame.
Let’s try to repeat implementing the concat function after updating the data for England.
Now, this is interesting! Pandas seem to have automatically appended the NaN values in the
rows where ‘Associated Clubs’ weren’t explicitly mentioned. In this case, we had only updated
‘Associated Clubs’ data on England. The corresponding values for Italy were set to NaN.
7. Perform following operations using pandas
a. Filling NaN with string
b. Sorting based on column values
c. groupby()
Now, what if, instead of NaN, we want to include some other text? Let’s try adding “No Data Found” instead:
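A minimal sketch, assuming both_teams is rebuilt with pd.concat() after updating df_england:
both_teams = pd.concat([df_england, df_italy])
both_teams = both_teams.fillna('No Data Found')
both_teams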
Sorting operation is straightforward in Pandas. Sorting basically allows the data frame to be ordered by numbers or alphabets (in either increasing or decreasing order). Let’s try and sort the data frame by player names:
both_teams.sort_values('Name')
Name Age Associated Clubs
2 Chiellini 36 No Data Found
3 Chiesa 23 No Data Found
0 Immobile 31 No Data Found
1 Insigne 30 No Data Found
0 Kane 27 Tottenham
3 Maguire 28 Man Utd
2 Saka 19 Arsenal
1 Sterling 26 Man City
Fair enough, we sorted the data frame according to the names of the players. We did this by passing the column label to sort_values(). Let’s now sort by age:
both_teams.sort_values('Age')
Name Age Associated Clubs
2 Saka 19 Arsenal
3 Chiesa 23 No Data Found
1 Sterling 26 Man City
0 Kane 27 Tottenham
3 Maguire 28 Man Utd
1 Insigne 30 No Data Found
0 Immobile 31 No Data Found
2 Chiellini 36 No Data Found
Ah, yes! Arsenal’s Bukayo Saka is the youngest lad out there!
both_teams.sort_values('Age', ascending=False)
Name Age Associated Clubs
2 Chiellini 36 No Data Found
0 Immobile 31 No Data Found
1 Insigne 30 No Data Found
3 Maguire 28 Man Utd
0 Kane 27 Tottenham
1 Sterling 26 Man City
3 Chiesa 23 No Data Found
2 Saka 19 Arsenal
c. Group by
Grouping is arguably the most important feature of Pandas. The groupby() function groups rows by the values in a particular column so that aggregations can be applied to each group. Let’s see a simple example by creating a new data frame.
a={
'UserID': ['U1001', 'U1002', 'U1001', 'U1001', 'U1003'],
'Transaction': [500, 300, 200, 300, 700]
}
df_a = pd.DataFrame(a)
df_a
UserID Transaction
0 U1001 500
1 U1002 300
2 U1001 200
3 U1001 300
4 U1003 700
Notice, we have two columns – UserID and Transaction. You can also see a repeating UserID (U1001). Let’s group the rows by UserID and sum the transactions:
df_a.groupby('UserID').sum()
Transaction
UserID
U1001 1000
U1002 300
U1003 700
The function grouped the similar UserIDs and took the sum of those IDs.
If you want to unravel a particular UserID, just try mentioning the value name through
get_group().
df_a.groupby('UserID').get_group('U1001')
UserID Transaction
0 U1001 500
2 U1001 200
3 U1001 300
And this is how we grouped our UserIDs and also checked for a particular ID name.
8. Read the following file formats using pandas
a. Text files
b. CSV files
c. Excel files
d. JSON files
a. Reading Text Files in Python
Text files are one of the most common file formats to store data. Python makes it very easy to read data from such files.
Python provides the open() function to read files; it takes in the file path and the file access mode as its parameters. For reading a text file, the file access mode is ‘r’.
Python provides us with three functions to read data from a text file:
1. read(n) – This function reads n bytes from the text file, or reads the complete contents of the file if no number is specified
2. readline(n) – This function reads at most n bytes from the file, but never more than one line of text
3. readlines() – This function reads the complete file but, unlike read(), returns the contents as a list of lines (keeping the newline characters)
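The call that opens the file is missing in this copy; a minimal sketch with a hypothetical file name:
# 'sample.txt' is a placeholder path, not from the original lab
f = open('sample.txt', 'r')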
print(f.read())
The read() function imported all the data in the file in its original structured form.
By providing a number in the read() function, we were able to extract only the specified number of bytes from the file.
print(f.readline())
Using readline(), only a single line from the text file was extracted.
print(f.readlines())
Here, the readlines() function extracted all of the text file’s lines in a list format.
b. Reading CSV Files in Python
Ah, the good old CSV format. A CSV (or Comma Separated Value) file is the most common
type of file that a data scientist will ever work with. These files use a “,” as a delimiter to
separate the values and each row in a CSV file is a data record.
These are useful to transfer data from one application to another, which is probably the reason why CSV files are so widely used.
If you look at a CSV file in Notepad, you will notice that the values are separated by commas:
The Pandas library makes it very easy to read CSV files using the read_csv() function:
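For example (the file name below is a placeholder, not from the original lab):
import pandas as pd
df = pd.read_csv('./Importing files/Employee.csv')
df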
But CSV can run into problems if the values contain commas. This can be overcome by using different delimiters to separate information in the file, like ‘\t’ or ‘;’, etc. These can also be imported with the read_csv() function by specifying the delimiter in the parameter value, as shown below:
df = pd.read_csv(r'./Importing files/Employee.txt',delimiter='\t')
df
c. Reading Excel Files in Python
Most of you will be quite familiar with Excel files and why they are so widely used to store
tabular data. So I’m going to jump right to the code and import an Excel file in Python using
Pandas.
Pandas has a very handy function called read_excel() to read Excel files:
df = pd.read_excel(r'./Importing files/World_city.xlsx')
# print values
df
But an Excel file can contain multiple sheets, right? So how can we access them?
For this, we can use the Pandas’ ExcelFile() function to print the names of all the sheets in the
file:
xl = pd.ExcelFile(r'./Importing files/World_city.xlsx')
xl.sheet_names
After doing that, we can easily read data from any sheet we wish by providing its name in the sheet_name parameter:
df = pd.read_excel(r'./Importing files/World_city.xlsx',sheet_name='Europe')
df
d. Reading JSON Files in Python
JSON (JavaScript Object Notation) files are easy for humans to read and for machines to parse and generate, and they are based on the JavaScript programming language.
JSON files store data within {} similar to how a dictionary stores it in Python. But their major benefit is that they are language-independent, meaning they can be used with any programming language.
Python provides a json module to read JSON files. You can read JSON files just like simple text
files. However, the read function, in this case, is replaced by json.load() function that returns a
JSON dictionary.
Once you have done that, you can easily convert it into a Pandas dataframe using the DataFrame() function:
import json
# open json file ('sample.json' is a placeholder path, not from the original lab)
file = open('sample.json', 'r')
data = json.load(file)
# json dictionary
print(type(data))
df_json = pd.DataFrame(data)
df_json
But you can even load the JSON file directly into a dataframe using the read_json() function:
df = pd.read_json(path)
df
9. Read the following file formats
a. Pickle files
b. Image files using PIL
c. Multiple files using Glob
d. Importing data from database
a. Reading Data from Pickle Files in Python
Pickle files are used to store the serialized form of Python objects. This means objects like list,
set, tuple, dict, etc. are converted to a character stream before being stored on the disk. This
allows you to continue working with the objects later on. These are particularly useful when you
have trained your machine learning model and want to save them to make predictions later on.
So, if you serialized the files before saving them, you need to de-serialize them before you use them in your Python programs. This is done using the pickle.load() function in the pickle module. But when you open the pickle file with Python’s open() function, you need to provide the access mode ‘rb’ (read binary), since the contents are a byte stream.
import pickle
# open the pickle file in binary mode ('data.pkl' is a placeholder path)
file = open('data.pkl', 'rb')
data = pickle.load(file)
# pickle data
print(type(data))
df_pkl = pd.DataFrame(data)
df_pkl
b. Reading Image Files using PIL
The advent of Convolutional Neural Networks (CNN) has opened the flood gates to working in the computer vision domain and solving problems like object detection and object classification.
But before you jump on to working with these problems, you need to know how to open your images in Python. Let’s see how we can do that with an image saved in a local folder.
You will need the Python PIL (Python Image Library) for this job.
Simply call the open() function in the Image module of PIL and pass in the path to your image:
Image.open(filename)
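A minimal, self-contained sketch (the image path is a placeholder):
from PIL import Image
# open and display an image; the file name is an assumed example
img = Image.open('./Importing files/test_image1.png')
img.show()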
c. Reading Multiple Files using Glob
And now, what if you want to read multiple files in one go? That’s quite a common challenge in data science projects.
Python’s Glob module lets you traverse through multiple files in the same location.
Using glob.glob(), we can import all the files from our local folder that match a special pattern.
These filename patterns can be made using different wildcards like “*” (for matching multiple
characters), “?” (for matching any single character), or ‘[0-9]’ (for matching any number). Let’s see a few examples.
When importing multiple .py files from the same directory as your Python script, we can use the “*” wildcard (the paths below are placeholders):
import glob
for i in glob.glob('./*.py'):
    print(i)
When importing only a 5 character long Python file, we can use the “?” wildcard:
for i in glob.glob('./?????.py'):
    print(i)
When importing an image file containing a number in the filename, we can use the “[0-9]” wildcard:
for i in glob.glob('./Importing files/test_image[0-9].png'):
print(i)
Earlier, we imported a few images from the Wikipedia page on Delhi and saved them in a local folder. Let’s retrieve these images using the glob module and then display them using Matplotlib:
import glob
import matplotlib.pyplot as plt
from PIL import Image
filepath = r'./Importing files/Delhi'
images = glob.glob(filepath + '/*.jpg')
for i in images[:3]:
    im = Image.open(i)
    plt.imshow(im)
    plt.show()
d. Importing Data from a Database using Python
When you are working on a real-world project, you would need to connect your program to a
database to retrieve data. There is no way around it (that’s why learning SQL is an important part of any data science role).
Data in databases is stored in the form of tables and these systems are known as Relational
database management systems (RDBMS). However, connecting to RDBMS and retrieving the
data from it can prove to be quite a challenging task. Here’s the good news – we can easily do this with Python.
One of the most popular RDBMS is SQLite. It has many plus points: it is lightweight, serverless, and comes bundled with Python’s standard library.
You will need to import the sqlite3 module to use SQLite. Then, you need to work through the following steps to retrieve the data:
1. Create a connection with the database using connect(). You need to pass the name of your database to access it. It returns a Connection object
2. Once you have done that, you need to create a cursor object using the cursor() function.
This will allow you to implement SQL commands with which you can manipulate your
data
3. You can execute the commands in SQL by calling the execute() function on the cursor
object. Since we are retrieving data from the database, we will use the SELECT
statement and store the query in an object
4. Store the data from the object into a dataframe by either calling fetchone(), for one row, or fetchall(), for all the rows, on the object
And just like that, you have retrieved the data from the database into a Pandas dataframe!
A good practice is to save/commit your transactions using the commit() function even if you are only reading the data.
import pandas as pd
import sqlite3
# create a connection and cursor ('sample.db' is a placeholder database name)
con = sqlite3.connect('sample.db')
cur = con.cursor()
# Perform query: rs
rs = cur.execute('select * from TEST')
df = pd.DataFrame(rs.fetchall())
# Commit and close connection
con.commit()
con.close()
10. Web Scraping using Python
Web Scraping refers to extracting large amounts of data from the web. This is important for a data scientist, since the data you need is not always available in convenient, ready-made files.
Python provides a very handy module called requests to retrieve data from any website.
The requests.get() function takes in a URL as its parameter and returns the HTML response as a response object:
For this example, I want to show you a bit about my city – Delhi. So, I will retrieve data from its Wikipedia page:
import requests
# url = "https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27fb7eb73c70b91257d413147b69"
url = "https://en.wikipedia.org/wiki/Delhi"
# response object
resp = requests.get(url)
# using the text attribute of the response object, return the HTML of the webpage as a string
text = resp.text
print(text)
But as you can see, the data is not very readable. The tree-like structure of the HTML content retrieved by our request is not very comprehensible. To improve this readability, Python has the BeautifulSoup library.
BeautifulSoup is a Python library for parsing the tree-like structure of HTML and extracting data from it.
To make it work, we need to pass the text response from the request object to BeautifulSoup(), which creates its own object – “soup” in this case. Calling prettify() on this object returns the HTML in a nicely indented, readable form:
import requests
from bs4 import BeautifulSoup
# url = "https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27fb7eb73c70b91257d413147b69"
url = "https://en.wikipedia.org/wiki/Delhi"
# Package the request, send the request and catch the response: r
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
You must have noticed the difference in the output. We have a more structured output in this
case!
Now, we can extract the title of the webpage by accessing the title attribute of our soup object:
title = soup.title
title
The webpage has a lot of pictures of the famous monuments in Delhi and other things related to the city. Let’s try to download a few of them.
We will need the Python urllib library to retrieve the URLs of the images that we want to store. Its urllib.request module is used for opening and reading URLs; calling its urlretrieve() function allows us to download the object denoted by the URL to a
local file:
import urllib.request
# download helper used below; the save path is an assumption for illustration
def download_img(url, i):
    filepath = './Importing files/Delhi/image_' + str(i) + '.jpg'
    # retrieve the image from the URL and save it in the folder
    urllib.request.urlretrieve(url, filepath)
The images are stored in the “img” tag in HTML. These can be found by calling find_all() on
the soup object. After this, we can iterate over the image and get its source by calling
the get() function on the image object. The rest is handled by our download function:
images = soup.find_all('img')
i = 1
for image in images:
    try:
        download_img('https:' + image.get('src'), i)
        i = i + 1
    except:
        continue
11. Perform following preprocessing techniques on loan prediction dataset
a. Feature Scaling
b. Feature Standardization
c. Label Encoding
d. One Hot Encoding
Available Data set
For this article, I have used a subset of the Loan Prediction data set (observations with missing values are dropped). You can download the final training and testing data sets from here: https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/
Now, let's get started by importing the important packages and the data set.
# Importing pandas
>> import pandas as pd
# Importing training data set
>> X_train=pd.read_csv('X_train.csv')
>> Y_train=pd.read_csv('Y_train.csv')
# Importing testing data set
>> X_test=pd.read_csv('X_test.csv')
>> Y_test=pd.read_csv('Y_test.csv')
Credit_History Property_Area
15 1.0 Urban
248 1.0 Semiurban
590 1.0 Semiurban
246 1.0 Urban
388 1.0 Urban
a. Feature Scaling
Feature scaling is the method to limit the range of variables so that they can be compared on common grounds. It is performed on continuous variables. If we plot the distribution of all the continuous variables, we see that ApplicantIncome and CoapplicantIncome are in a similar range (0–50000$) whereas LoanAmount is in thousands and ranges from 0 to 600$. The story for Loan_Amount_Term is completely different from the other variables because its unit is months, as opposed to the other variables where the unit is dollars.
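The scaling step itself is not shown in this copy; a minimal sketch using scikit-learn's MinMaxScaler (the column list is an assumption based on the standard Loan Prediction schema) would be:
from sklearn.preprocessing import MinMaxScaler
# scale every numeric feature to the [0, 1] range; fit on train, reuse the same scaler on test
num_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']
min_max = MinMaxScaler()
X_train_minmax = min_max.fit_transform(X_train[num_cols])
X_test_minmax = min_max.transform(X_test[num_cols])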
b. Feature Standardization
Before jumping to this section I suggest you to complete Exercise 1.
In the previous section, we worked on the Loan_Prediction data set and fitted a kNN learner on
the data set. After scaling down the data, we got an accuracy of 75%, which is considerably good. I tried the same exercise with Logistic Regression and got the following result:
The accuracy we got after scaling is close to the prediction we would make by guessing, which is not a very impressive achievement. So, what is happening here? Why hasn’t the accuracy improved?
In logistic regression, each feature is assigned a weight or coefficient (Wi). If there is a feature
with relatively large range and it is insignificant in the objective function then logistic regression
will itself assign a very low value to its co-efficient, thus neutralizing the dominant effect of that
particular feature, whereas a distance-based method such as kNN has no such inbuilt mechanism and therefore needs the features to be on comparable scales.
Aren’t we forgetting something? Our logistic model is still predicting with an accuracy barely better than a guess.
Now, I’ll be introducing a new concept here called standardization. Many machine learning
algorithms in sklearn require standardized data, meaning zero mean and unit variance.
Standardization rescales each feature so that it follows a standard normal distribution with μ=0 and σ=1, where μ is the mean (average) and σ is the standard deviation from the mean. Standard scores (also called z-scores) of the samples are calculated as follows:
z = (x − μ) / σ
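A minimal sketch of standardization with scikit-learn's StandardScaler (the column list is the same assumption as in the scaling sketch above):
from sklearn.preprocessing import StandardScaler
# transform each feature to zero mean and unit variance
num_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[num_cols])
X_test_scaled = scaler.transform(X_test[num_cols])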
Elements such as l1 ,l2 regularizer in linear models (logistic comes under this category) and RBF
kernel in SVM, in the objective functions of learners, assume that all the features are centered around zero and have variance of the same order.
Features having a larger order of variance would dominate the objective function, as happened in the previous section with the feature having a large range. As we saw in Exercise 1, without any preprocessing the accuracy was 61%; let's standardize our data and apply logistic regression again. Standardizing the data when using an estimator having l1 or l2 regularization helps us to increase
the accuracy of the prediction model. Other learners like kNN with euclidean distance measure,
k-means, SVM, perceptron, neural networks, linear discriminant analysis and principal component analysis may also perform better with standardized data.
Though, I suggest you understand your data and what kind of algorithm you are going to apply on it; over time you will be able to judge whether to standardize your data or not.
Note : Choosing between scaling and standardizing is a confusing choice, you have to dive
deeper in your data and learner that you are going to use to reach the decision. For starters, you
can try both the methods and check cross validation score for making a choice.
c. Label Encoding
In previous sections, we did the pre-processing for continuous numeric features. But, our data set
has other features too such as Gender, Married, Dependents, Self_Employed and Education. All
these categorical features have string values. For example, Gender has two levels
either Male or Female. Let's feed these features into our logistic regression model.
We got an error saying that it cannot convert string to float. So, what’s actually happening here
is learners like logistic regression, distance based methods such as kNN, support vector
machines,
tree based methods, etc. in sklearn need numeric arrays. Features having string values cannot be handled directly by these learners.
Sklearn provides a very efficient tool for encoding the levels of categorical features into numeric values: LabelEncoder encodes labels with values between 0 and n_classes-1.
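A minimal sketch (the list of categorical columns is an assumption based on the features named above):
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# encode each string-valued column to integers 0 .. n_classes-1
for col in ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']:
    X_train[col] = le.fit_transform(X_train[col])
    X_test[col] = le.transform(X_test[col])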
All our categorical features are encoded. You can look at your updated data set
using X_train.head(). We could also compare the Gender frequency distribution before and after label encoding.
Now that we are done with label encoding, let's run a logistic regression model on the data set.
It's working now. But the accuracy is still the same as we got with logistic regression after standardization of the numeric features. This means the categorical features we added are not very
d. One-Hot Encoding
One-Hot Encoding transforms each categorical feature with n possible values into n binary features, each representing one of those values.
Most of the ML algorithms either learn a single weight for each feature or it computes distance
between the samples. Algorithms like linear models (such as logistic regression) belongs to the
first category.
Lets take a look at an example from loan_prediction data set. Feature Dependents have 4
possible values 0,1,2 and 3+ which are then encoded without loss of generality to 0,1,2 and 3.
We then have a weight “W” assigned for this feature in a linear classifier, which will make a decision based on a constraint of the form W*Dependents < K.
The possible values that can be attained by the expression are 0, W, 2W and 3W. A problem with this equation is that the single weight “W” cannot distinguish between all four choices. It can reach a decision only in the following ways:
All values lead to the same decision (all of them < K, or vice versa)
3:1 division of the levels (decision boundary at f(w) > 2W)
2:2 division of the levels (decision boundary at f(w) > W)
Here we can see that we are losing many different possible decisions, such as the case where “0” and “2W” should be given the same label while “3W” and “W” are the odd ones out.
One-hot encoding increases the number of features for “Dependents” from one to four, so every value of the feature “Dependents” will have its own weight. The updated equation for the decision would be f'(w) < K.
The same thing happens with distance based methods such as kNN. Without encoding, distance
between “0” and “1” values of Dependents is 1 whereas distance between “0” and “3+” will be 3,
which is not desirable as both distances should be similar. After encoding, the values become new features (the sequence of columns is 0, 1, 2, 3+): “0” becomes [1,0,0,0] and “3+” becomes [0,0,0,1], so the distance between any pair of distinct levels is the same, √2.
For tree based methods, the same situation (more than two values in a feature) might affect the outcome to some extent, but methods like random forests, if deep enough, can handle categorical levels without one-hot encoding.
Now, let's take a look at the implementation of one-hot encoding with various algorithms.
Lets create a logistic regression model for classification without one-hot encoding.
Here, again, we got the maximum accuracy of 0.75 that we have gotten so far. In this case, one-hot encoding has not changed the result by much.
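The encoding step itself is not shown in this copy; a minimal sketch with pandas, applied to the Dependents feature discussed above, would be:
# expand 'Dependents' into one binary column per level
X_train_ohe = pd.get_dummies(X_train, columns=['Dependents'])
X_test_ohe = pd.get_dummies(X_test, columns=['Dependents'])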
12. Data Visualization
Data Visualization: It is a way to express your data in a visual context so that patterns, correlations and trends between the data can be easily understood. Data Visualization helps in exploring the data and communicating insights from it.
In this article, we will be using multiple datasets to show exactly how things work. The base dataset will be the iris dataset, which we will import from sklearn. We will create the rest of the datasets ourselves.
Let's import all the libraries which are required for doing the visualizations:
import math,os,random
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stat
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
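The cell that loads the iris data does not appear in this copy; a plausible reconstruction, defining the iris and iris_feat frames used below, is:
from sklearn.datasets import load_iris
# build a DataFrame of the four iris features plus the species name
data = load_iris()
iris_feat = pd.DataFrame(data.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
iris = iris_feat.copy()
iris['species'] = [data.target_names[t] for t in data.target]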
Scatter Plot
These are the charts/plots that are used to observe and display relationships between
variables using Cartesian Coordinates. The values (x: first variable , y: second variable)
of the variables are represented by dots. Scatter plots are also known as scattergrams,
scatter graphs, scatter charts , or scatter diagrams. It is best suited for situations where
the dependent variable can have multiple values for the independent variable.
ax.legend()
plt.show()
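The scatter-plot code above survives only partially in this copy (only the legend and show calls remain); a minimal sketch plotting two iris features, chosen here as an assumption, might be:
fig, ax = plt.subplots()
# each point is one flower; x and y are two of the iris features
ax.scatter(iris_feat['sepal_length'], iris_feat['sepal_width'], label='iris samples')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
ax.legend()
plt.show()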
Two common issues with the use of scatter plots are – overplotting and the interpretation of
causation as correlation.
Overplotting occurs when there are too many data points to plot, which results in the overlapping of different data points and makes it hard to identify any relationship between them.
Correlation does not mean that the changes observed in one variable are responsible for changes in another; causation cannot be inferred from a scatter plot alone.
Line Plot
A line plot is a graph that is used for the representation of continuous data points on a number line. Line plots are created by first plotting the data points on the Cartesian plane and then joining those points with a line. Line plots can help display data points that change over a continuous interval or time span.
# Seaborn Implementation
df = pd.DataFrame({
'A': [1,3,2,7,9,6,8,10],
'B': [2,4,1,8,10,3,11,12],
'C': ['a','a','a','a','b','b','b','b']
})
sns.lineplot(
    data=df,
    x="A", y="B", hue="C", style="C",
    markers=True, dashes=False
)
Histograms
The width of each bar in a histogram represents an interval (bin) and its height represents the frequency of values falling in that bin. To create a histogram you need to create bins of the interval which are not overlapping. Histograms allow the inspection of data for its underlying distribution, outliers and skewness.
Histograms in Matplotlib
fig, ax = plt.subplots()
# plot histogram
ax.hist(iris_feat['sepal_length'])
# set title and labels
ax.set_title('sepal_length')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
The bins with taller bars signify that more values are concentrated there. Histograms help in understanding the frequency distribution of the overall data points.
Line histograms are a modification of the standard histogram used to understand and represent the distribution of a single feature: a smooth density line is drawn over the bars.
Matplotlib Implementation
A normal histogram is a bell-shaped histogram with most of the frequency counts focused in the middle and diminishing tails. The orange line passing through the histogram represents the estimated density of the feature.
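The code for this figure is not present in this copy; one common way to draw a histogram with a density line, sketched here with seaborn (an assumption, since the heading says Matplotlib), is:
# histogram of sepal_length with a kernel density estimate drawn on top
sns.histplot(iris_feat['sepal_length'], kde=True)
plt.show()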
Bar Chart
Bar charts are best suited for the visualization of categorical data because they allow
you to easily see the difference between feature values by measuring the size(length) of
the bars. There are 2 types of bar charts depending upon their orientation (i.e. vertical or horizontal). Moreover, there are 3 types of bar charts based on their representation: simple, grouped and stacked.
Matplotlib Implementation
df = iris.groupby('species')['sepal_length'].sum().to_frame().reset_index()
#Creating the bar chart
plt.bar(df['species'], df['sepal_length'], color=['cornflowerblue','lightseagreen','steelblue'])
#Adding the aesthetics
plt.title('Bar Chart')
plt.xlabel('Species')
plt.ylabel('sepal_length')
#Show the plot
plt.show()
With the above image, we can clearly see the difference in the sum of sepal_length for each species.
These bar charts allow us to compare multiple categorical features. Let's see an example.
Matplotlib Implementation
df = pd.DataFrame(columns=["A", "B", "C", "D"],
                  data=[["E",1,2,0],
                        ["F",3,1,3],
                        ["G",1,2,1]])
df.plot.bar(x='A', y=["B", "C", "D"], stacked=True, alpha=0.8, color=['steelblue','darkorange','mediumseagreen'])
plt.title('Title')
#Show the plot
plt.show()
Pie Plot
A pie plot is a circular representation of data that shows relative proportions. The chart is divided into slices, one per category, and the size of each slice corresponds to that category's share of the whole.
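No pie-chart code appears in this copy; a minimal sketch counting iris species (an illustrative choice, not necessarily the original figure) would be:
counts = iris['species'].value_counts()
# one slice per species, sized by its share of the rows
plt.pie(counts, labels=counts.index, autopct='%1.1f%%')
plt.title('Pie Plot')
plt.show()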
Box Plot
This is one of the most used methods by data scientists. A box plot is a way of displaying the distribution of data based on the five-number summary (minimum, first quartile, median, third quartile, maximum). It basically gives information about the outliers and how spread out the data is from the center. It can tell whether the data is symmetric, and how tightly grouped or skewed it is. Box plots are also a standard way of spotting outliers:
sepal_length = iris_feat['sepal_length']
petal_length = iris_feat['petal_length']
petal_width = iris_feat['petal_width']
sepal_width = iris_feat['sepal_width']
data = [sepal_length , petal_length , petal_width , sepal_width]
fig1, ax1 = plt.subplots()
ax1.set_title('Basic Plot')
ax1.boxplot(data)
plt.show()
The dots or bubbles outside the fourth box plot (sepal_width) are the outliers. The line inside each box depicts the median of that feature.