Python For Data Science
*About Python
Python provides various data types and data structures for storing and processing
data. For handling single values, there are
data types like int, float, str, and bool. For handling data in groups, Python
provides data structures like list, tuple, dictionary,
set, etc.
Python has a wide range of libraries and built-in functions which aid in rapid
development of applications. Python libraries are
collections of pre-written code that perform specific tasks. This eliminates the
need to rewrite the code from scratch.
Faster application development – Libraries promote code reusability and help the
developers save time and focus on building the
functional logic.
Enhance code efficiency – Use of pre-tested libraries enhances the quality and
stability of the application.
Achieve code modularization – Libraries can be coupled or decoupled based on
requirement.
Over the last two decades, Python has emerged as a first-choice tool for tasks that
involve scientific computing, including the
analysis and visualization of large datasets. Python has gained popularity,
particularly in the field of data science, because of
its large and active ecosystem of third-party libraries.
A few of the popular libraries in data science are NumPy, Pandas, Matplotlib and
Scikit-Learn.
*Business scenario
XYZ Custom Cars is an automobile restoration company based in New York, USA. This
company is renowned for restoring vintage and
muscle cars. Their team takes great pride in each of their projects, no matter how
big or small. They offer paint jobs, frame
build-ups, engine restoration, body work, etc. They are also involved in buying and
reselling cars.
Every car owner who comes to XYZ Custom Cars gets a document drafted that
contains important information about the car.
This information relates to the car’s performance and manufacturing. The company
maintains this database diligently.
The data consists of features like acceleration, horsepower, region, model year,
etc. The board of directors thinks that this
data can help generate insights about their projects. These
insights would help restore similar cars with similar
standards and procedures. They would also help predict better reselling
prices in the future. In short, these insights would help
generate greater revenue for the company through cost cutting and a
data-driven approach to its processes.
Category
Description
Fuel efficient
Muscle Cars
SUV
Big sized cars designed for high performance, long distance trips and family
comfort
High horsepower, High weight
Racecar
This categorization would allow the company to place specialized mechanics and
equipment in specific workstations, thereby creating a hassle-free and efficient work
atmosphere. Another interesting application would be predicting the fuel efficiency of
cars after restoration, based on the available data, to minimize field testing.
The defined business scenarios will be used to understand the following Python
libraries:
1. Numpy
2. Pandas
3. Matplotlib
4. Scikit-learn
To perform operations on the elements of a List, one needs to iterate through the List.
For example, suppose five extra marks need to be awarded
to all the entries in the student marks list. The following approach can be used to
achieve this:
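A minimal sketch of such a loop, assuming the student marks list holds the five marks used elsewhere in this material (the variable name is illustrative):

```python
# Marks of five students (values assumed for illustration)
student_marks = [78, 92, 36, 64, 89]

# Visit every element by index and add five extra marks
for i in range(len(student_marks)):
    student_marks[i] += 5

print(student_marks)  # [83, 97, 41, 69, 94]
```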
It can be observed that a loop is used. The code is lengthy and becomes
computationally expensive as the
size of the List increases.
Data Science is a field that utilizes scientific methods and algorithms to generate
insights from data. These insights can
then be made actionable and applied across a broad range of application domains.
Data Science deals with large datasets. Operating
on such data with lists and loops is time consuming and computationally expensive.
Let us understand why Python Lists can become a bottleneck if they are used for
large data.
Consider that about 1 million numbers from two different lists must be added element by element.
%%time
#Used to calculate total operation time (Jupyter cell magic)
list1 = list(range(1,1000000))
list2 = list(range(2,1000001))
list3 = []
#Adding the two lists element by element using a loop
for i in range(len(list1)):
    list3.append(list1[i]+list2[i])
Let us understand how Numpy can solve the same problem in minimal time.
%%time
#Used to calculate total operation time
#Importing Numpy
import numpy as np
#Creating a numpy array of 1 million numbers
a = np.arange(1,1000000)
b = np.arange(2,1000001)
c = a+b
It can be observed that the same operation was completed in 12 milliseconds
with Numpy, compared to 395 milliseconds with the
Python List. The exact timings vary from machine to machine, but as the data size and
the complexity of operations increase, the
difference between the performance of Numpy and Python Lists widens.
In Data Science, there are millions of records to be dealt with. The performance
limitations of Python Lists can
be managed by using advanced Python libraries like Numpy.
*Introduction to Numpy
Numerical Python (Numpy) is a Python library that is used for numeric and scientific
operations. It serves as a building block for
many libraries available in Python.
*Getting Started
*Importing Numpy
The Numpy library needs to be imported into the environment before it can be used, as
shown below. 'np' is the standard alias used for Numpy.
import numpy as np
Numpy object creation
A Numpy array can be created by using the array() function. The array() function in Numpy
returns an array object of type ndarray.
Student ID    Marks
1             78
2             92
3             36
4             64
5             89
import numpy as np
student_marks_arr = np.array([78, 92, 36, 64, 89])
student_marks_arr
The following dataset has been provided by XYZ Custom Cars. This data comes in
CSV file format.
There are various columns in this dataset. Each column contains multiple values.
These values can be represented as lists of items.
Since each column contains homogeneous values, Numpy arrays can be used to represent
them.
Let us understand how to represent the car ‘horsepower’ values in a Numpy array.
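As a small sketch, the five sample horsepower values used later in this material can be stored in a Numpy array as follows. In practice the full column would be read from the CSV file; the short list here is only a stand-in.

```python
import numpy as np

# Five sample horsepower values (a stand-in for the full CSV column)
horsepower = [130, 165, 150, 150, 140]

# Creating a Numpy array from the horsepower list
horsepower_arr = np.array(horsepower)

print(horsepower_arr)        # [130 165 150 150 140]
print(type(horsepower_arr))  # <class 'numpy.ndarray'>
```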
*Shape of ndarray
The numpy.ndarray.shape attribute returns a tuple that describes the shape of the array.
For example:
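A minimal illustration, using a hypothetical 2D array with 3 rows and 5 values per row (the values themselves are arbitrary):

```python
import numpy as np

# Hypothetical 2D array: 3 rows, 5 elements in each row
arr_2d = np.array([[1, 2, 3, 4, 5],
                   [6, 7, 8, 9, 10],
                   [11, 12, 13, 14, 15]])

print(arr_2d.shape)  # (3, 5)
```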
Here, 3 represents the number of rows and 5 represents the number of elements in
each row.
*'dtype' of ndarray
'dtype' refers to the data type of the data contained in the array. Numpy supports
multiple data types like integer, float, string, boolean, etc.
Below is an example of using the dtype property to identify the data type of the elements
in an array.
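A short sketch using the sample horsepower values. The exact integer type reported (for example int64 or int32) depends on the platform and Numpy version, so no output is shown for the integer array.

```python
import numpy as np

# Integer array: dtype is a platform-dependent integer type
horsepower_arr = np.array([130, 165, 150, 150, 140])
print(horsepower_arr.dtype)

# Float array: Python floats are stored as float64 by default
float_arr = np.array([1.5, 2.0, 3.25])
print(float_arr.dtype)  # float64
```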
Changing dtype
Numpy dtype can be changed as per requirements. For example, an array of integers
can be converted to float.
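One way to do this is the astype() method, which returns a new array with the requested dtype and leaves the original untouched. A minimal sketch with the sample horsepower values:

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# astype() returns a converted copy; horsepower_arr itself is not modified
horsepower_float = horsepower_arr.astype(float)

print(horsepower_float)        # [130. 165. 150. 150. 140.]
print(horsepower_float.dtype)  # float64
```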
The elements in the ndarray are accessed using an index within square brackets
[ ]. In Numpy, both positive and negative indices
can be used to access elements in the ndarray. Positive indices start from the
beginning of the array, while negative indices start
from the end of the array. Array indexing starts from 0 in positive indexing and
from -1 in negative indexing.
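A brief sketch of positive and negative indexing on the sample horsepower values:

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# Positive indices count from the start (0 is the first element)
print(horsepower_arr[0])   # 130
print(horsepower_arr[1])   # 165

# Negative indices count from the end (-1 is the last element)
print(horsepower_arr[-1])  # 140
print(horsepower_arr[-2])  # 150
```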
*Slicing
Syntax: array_name[start : end] – index starts at ‘start’ and ends at ‘end - 1’.
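A short sketch of this syntax on the sample horsepower values; omitting 'start' or 'end' slices from the beginning or up to the end of the array:

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# Elements at indices 1, 2 and 3 (index 4 is excluded)
print(horsepower_arr[1:4])  # [165 150 150]

# First two elements
print(horsepower_arr[:2])   # [130 165]

# Last two elements
print(horsepower_arr[-2:])  # [150 140]
```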
Problem Statement:
The engineers at XYZ Custom Cars want to know about the mean and median of
horsepower.
Solution:
The mean and median can be calculated with the help of the following code:
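A minimal sketch using np.mean() and np.median() on five sample horsepower values (the full dataset column would be used in practice):

```python
import numpy as np

# Creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
# Creating a Numpy array from the horsepower list
horsepower_arr = np.array(horsepower)

mean_hp = np.mean(horsepower_arr)
median_hp = np.median(horsepower_arr)

print("Mean horsepower:", mean_hp)      # Mean horsepower: 147.0
print("Median horsepower:", median_hp)  # Median horsepower: 150.0
```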
Problem Statement:
The engineers at XYZ Custom Cars want to know about the minimum and maximum
horsepower.
Solution:
The min and max can be calculated with the help of the following code:
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
print("Minimum horsepower: ", np.min(horsepower_arr))
print("Maximum horsepower: ", np.max(horsepower_arr))
*Querying/searching in an array
Problem Statement:
The engineers at XYZ Custom Cars want to know the horsepower values that are
greater than or equal to 150.
Solution:
The 'where' function can be used for this requirement. Given a condition, the 'where'
function returns the indexes of the array where the condition is satisfied. Using
these indexes, the respective values from the array can be obtained.
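A minimal sketch with the five sample horsepower values. np.where() with a single condition returns a tuple of index arrays, one per dimension.

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# Indexes where the condition holds (a tuple with one index array for a 1D input)
idx = np.where(horsepower_arr >= 150)
print(idx)

# Using the indexes to fetch the matching values
print(horsepower_arr[idx])  # [165 150 150]
```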
*Filter data
Problem Statement:
The engineers at XYZ Custom Cars want to create a separate array consisting of
the horsepower values greater than 135.
Solution:
Getting some elements out of an existing array based on certain conditions and
creating a new array out of them is called filtering.
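One common way to filter is boolean masking: the condition produces an array of True/False values, and indexing with it keeps only the elements marked True. A sketch on the sample horsepower values (variable names are illustrative):

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# Boolean mask: True where horsepower is greater than 135
filter_mask = horsepower_arr > 135

# Indexing with the mask creates a new array of the matching values
filtered_arr = horsepower_arr[filter_mask]
print(filtered_arr)  # [165 150 150 140]
```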
*Sorting an array
Problem Statement:
The engineers at XYZ Custom Cars want the horsepower in sorted order.
Solution:
The Numpy array can be sorted by passing the array to the function np.sort(array) or
by calling array.sort().
So, what is the difference between these two functions, given that they serve the
same purpose?
The difference is that array.sort() sorts the original array in place,
whereas np.sort(array) returns a sorted copy and leaves the original array unchanged.
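The difference can be seen in a short sketch on the sample horsepower values:

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# np.sort() returns a sorted copy; the original order is preserved
sorted_copy = np.sort(horsepower_arr)
print(sorted_copy)     # [130 140 150 150 165]
print(horsepower_arr)  # [130 165 150 150 140]

# ndarray.sort() sorts in place and returns None
horsepower_arr.sort()
print(horsepower_arr)  # [130 140 150 150 165]
```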
*Vectorized operations
Mathematical operations can be performed on Numpy arrays. Numpy makes use of
optimized, pre-compiled code to perform mathematical operations on each array
element. This eliminates the need for explicit loops, thereby enhancing
performance. This process is called vectorization.
Numpy provides various mathematical functions such as sum(), add(), subtract(), log(),
sin(), etc., which use vectorization.
Subject        Marks
English        78
Mathematics    92
Physics        36
Chemistry      64
Biology        89
Problem Statement:
Calculate the sum of all the marks.
Solution:
The sum() function can be used, which internally uses vectorization.
student_marks_arr = np.array([78, 92, 36, 64, 89])
print(np.sum(student_marks_arr))
Problem Statement:
Award extra marks in subjects as follows:
English: +2
Mathematics: +2
Physics: +5
Chemistry: +10
Biology: +2
Solution:
Below is the solution to the problem:
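A minimal sketch, assuming the extra marks are stored in a second array in the same subject order (English, Mathematics, Physics, Chemistry, Biology); the name extra_marks_arr is illustrative:

```python
import numpy as np

# Marks in English, Mathematics, Physics, Chemistry, Biology
student_marks_arr = np.array([78, 92, 36, 64, 89])
# Extra marks for the same subjects, in the same order
extra_marks_arr = np.array([2, 2, 5, 10, 2])

# Element-wise (vectorized) addition, no loop required
updated_marks_arr = student_marks_arr + extra_marks_arr
print(updated_marks_arr)  # [80 94 41 74 91]
```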
*Broadcasting
"Broadcasting" refers to how Numpy handles arrays with different shapes
during arithmetic operations. The smaller array is stretched (conceptually copied)
across the larger array so that the shapes match.
import numpy as np
# Array 1
array1=np.array([5, 10, 15])
# Array 2
array2=np.array([5])
array3= array1 * array2
array3
In this example, array2 is stretched or copied to match array1 during the
arithmetic operation, resulting in a new array, array3, with the same shape as array1.
In the first operation, the shape of the first array is 1x3 and the shape of the second
array is 1x1. Hence, according to the broadcasting
rules, the second array gets stretched to match the shape of the first array, and the
shape of the resulting array is 1x3.
In the second operation, the shape of the first array is 3x3 and the shape of the second
array is 1x3. Hence, according to the broadcasting
rules, the second array gets stretched to match the shape of the first array, and the
shape of the resulting array is 3x3.
In the third operation, the shape of the first array is 3x1 and the shape of the second
array is 1x3. Hence, according to the broadcasting
rules, both the first and second arrays get stretched, and the shape of the resulting
array is 3x3.
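The three shape combinations described above can be reproduced with a short sketch. The array values are arbitrary; only the shapes matter here.

```python
import numpy as np

# First operation: 1x3 with 1x1 -> 1x3
a1 = np.array([[1, 2, 3]])
b1 = np.array([[10]])
result1 = a1 + b1
print(result1.shape)  # (1, 3)

# Second operation: 3x3 with 1x3 -> 3x3
a2 = np.ones((3, 3))
b2 = np.array([[1, 2, 3]])
result2 = a2 + b2
print(result2.shape)  # (3, 3)

# Third operation: 3x1 with 1x3 -> both are stretched to 3x3
a3 = np.array([[1], [2], [3]])
b3 = np.array([[10, 20, 30]])
result3 = a3 + b3
print(result3.shape)  # (3, 3)
```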
*Broadcasting - demo
Consider the following table of marks scored by four students in two
different subjects:
Students    Chemistry    Physics
Subodh      67           45
Ram         90           92
Abdul       66           72
John        32           40
The teacher of these students wants to represent their marks in an array. To do so,
the marks can be stored using a 4x2 array as follows:
Problem Statement:
Now the teacher wants to award extra five marks in Chemistry and extra ten marks in
Physics.
Solution:
#Marks of 4 students in 2 subjects (Chemistry, Physics)
students_marks = np.array([[67, 45],[90, 92],[66, 72],[32, 40]])
#Broadcasting
students_marks += [5,10]
students_marks
The students' marks array is a 2D array of shape 4x2. The marks to be added are in
the form of a 1D array with 2 elements, which is treated as shape 1x2. According to
the broadcasting rules, the marks to
be added get stretched to match the shape of the students' marks array, and the shape of
the resulting array is 4x2.
Certain basic operations and manipulations can be carried out on images using Numpy
and scikit-image package. Scikit-image is an
image processing package.
Importing an image:
#Importing path and skimage i/o library
import os.path
from skimage.io import imread
from skimage import data_dir
#reading the astronaut image
img = imread(os.path.join(data_dir, 'astronaut.png'))
print(img)
Properties of image
Let us understand the type, dimensions and shape of the image.
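A minimal sketch of inspecting these attributes. To keep the example self-contained, a blank array with the same shape and dtype as the astronaut image (512x512 pixels, 3 colour channels, uint8) is used as a stand-in; with scikit-image installed, the img read above can be inspected in the same way.

```python
import numpy as np

# Stand-in for the astronaut image (assumption: same shape and dtype, all-black pixels)
img = np.zeros((512, 512, 3), dtype=np.uint8)

print(type(img))  # <class 'numpy.ndarray'>
print(img.ndim)   # 3 (rows, columns, colour channels)
print(img.shape)  # (512, 512, 3)
print(img.dtype)  # uint8
```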
So far, you have become familiar with how to retrieve the basic attributes of the
image. Let us proceed to some
examples of indexing and selection on images.
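Because an image is an ndarray, a region can be selected with ordinary slicing on the row and column axes. The sketch below uses a blank stand-in image and hypothetical row and column bounds; the bounds are placeholders, not the actual coordinates of any object in the astronaut image.

```python
import numpy as np

# Stand-in image (assumption: same shape and dtype as the astronaut image)
img = np.zeros((512, 512, 3), dtype=np.uint8)

# A single pixel: row 0, column 0 -> its three colour channel values
print(img[0, 0])  # [0 0 0]

# Hypothetical region: rows 0-299, columns 360-479, all colour channels
img_slice = img[0:300, 360:480]
print(img_slice.shape)  # (300, 120, 3)
```

Note that basic slicing returns a view, so assigning values to img_slice also changes the corresponding pixels of img.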
In this case, the image has been sliced corresponding to the rocket from the
original image.
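#Importing matplotlib to display the image
import matplotlib.pyplot as plt
#img_slice is assumed to be the region sliced from the original image earlier
#Setting pixels whose red channel value lies between 100 and 150 to 0
img_slice[np.greater_equal(img_slice[:,:,0],100) &
          np.less_equal(img_slice[:,:,0],150)] = 0
plt.figure()
plt.imshow(img_slice)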
The place where the sliced rocket image was originally present is now filled with
black because 0 is assigned to the
values corresponding to the sliced image.
*Summary
Arange
This method returns evenly spaced values between the given intervals excluding the
end limit. The values are generated based on
the step value and by default, the step value is 1.
#step value = 2
np.arange(0,10,2)
Linspace
This method returns the given number of evenly spaced values, between the given
intervals. By default, the number of values
between a given interval is 50.
#Number of values = 3
print(np.linspace(0,10,3))
Zeros
Returns an array of given shape filled with zeros.
#1D
np.zeros(5)
#2D
np.zeros([2,3])
Ones
Returns an array of given shape filled with ones.
#1D
np.ones(3)
#2D
np.ones([2,1])
Full:
Returns an array of given shape, filled with given value, irrespective of datatype.
#number=5, value=8
np.full(5,8)
#shape=[3,3], value=numpy
np.full([3,3],'numpy')
Eye
Returns an identity matrix for the given shape.
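A short sketch of np.eye(); passing a second argument gives a non-square result with ones on the main diagonal:

```python
import numpy as np

# 3x3 identity matrix
identity_3 = np.eye(3)
print(identity_3)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]

# 2 rows, 3 columns, ones on the main diagonal
rect = np.eye(2, 3)
print(rect.shape)  # (2, 3)
```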
*Random
NumPy has numerous ways to create random number arrays. Random numbers can be
created for the required length, from a uniform distribution by just passing the
value of required length to the random.rand function.
The choice() method takes an array as a parameter and randomly returns the values
based on the size.
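A minimal sketch of both functions. The outputs are random, so no expected values are shown; the sample horsepower array is reused here purely for illustration.

```python
import numpy as np

# Five random values drawn from a uniform distribution over [0, 1)
random_values = np.random.rand(5)
print(random_values)

# Randomly pick three values from an existing array (with replacement by default)
horsepower_arr = np.array([130, 165, 150, 150, 140])
picked = np.random.choice(horsepower_arr, size=3)
print(picked)
```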
*Introduction to pandas
The steps involved to perform data analysis using Pandas are as follows:
Grouping operations
Sorting operations
Masking operations
Merging operations
Concatenating operations
Visualizing data
The next step is to visualize the data to get a clear picture of the various
relationships among the data. The following plots can help visualize the data:
Scatter plot
Box plot
Bar plot
Histogram and many more
Generating Insights
All the above steps help generate insights about the data.
*Why Pandas?
Pandas is one of the most popular data wrangling and analysis tools because it:
To get started with Pandas, Numpy and Pandas need to be imported as shown below:
#Importing libraries
#python library for numerical and scientific computing. pandas is built on top of numpy
import numpy as np
#importing pandas
import pandas as pd
Student ID    Marks
1             78
2             92
3             36
4             64
5             89
The pandas series object can be used to represent this data in a meaningful manner.
Series is created using the following syntax:
Syntax:
dtype – This represents the data type used in the series (optional
parameter).
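As a sketch, the pandas Series constructor accepts the data along with optional index and dtype arguments. Here only data and dtype are passed, so pandas assigns the default integer index starting at 0; the choice of int64 is an assumption for illustration.

```python
import pandas as pd

# Creating a series from the student marks; the index defaults to 0, 1, 2, ...
series = pd.Series(data=[78, 92, 36, 64, 89], dtype="int64")
print(series)
```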
As shown in the above output, the series object provides the values along with
their index attributes.
series.values
series[1]
Slicing a series:
series[1:3]