Python For Data Science

This document provides an introduction to using Python libraries like NumPy, Pandas, Matplotlib and Scikit-Learn for data analysis and machine learning. It discusses the need for libraries to improve code efficiency and modularity over writing code from scratch. NumPy is introduced as a fundamental library for scientific computing that allows faster operations on large datasets using ndarrays compared to native Python lists. The document demonstrates creating 1D and 2D NumPy arrays from lists of data to represent columns of a car dataset.

Uploaded by

Mohit Malghade

*By the end of this course, you will be able to :

• Explain the need for Python libraries

• Use Numpy to work with arrays

• Use Pandas to load, explore, manipulate, analyze and process data

• Derive statistical outcomes of a real dataset

• Visualize data

• Create a machine learning model for predictive analysis

*About Python

Python is an open-source, general-purpose programming language. It supports both
structured and object-oriented styles of programming. It can be used to develop a
wide range of applications, including web applications, data analytics and machine
learning applications.

Python provides various data types and data structures for storing and processing
data. For handling single values, there are data types like int, float, str and
bool. For handling data in groups, Python provides data structures like list, tuple,
dictionary and set.

Python has a wide range of libraries and built-in functions which aid in rapid
development of applications. Python libraries are collections of pre-written code
that perform specific tasks. This eliminates the need to rewrite the code from
scratch.

*Why Python Libraries?

Let us consider the following scenario:

John is a software developer. His project requires developing an application that
connects to various database servers like MySQL, PostgreSQL, MongoDB etc. To
implement this requirement from scratch, John needs to invest time and effort to
understand the underlying architectures of the respective databases. Instead, John
can choose to use pre-defined libraries to perform the database operations, which
abstract the complexities involved.

Use of libraries will help John in the following ways:

• Faster application development – Libraries promote code reusability and help
developers save time and focus on building the functional logic.

• Enhance code efficiency – Use of pre-tested libraries enhances the quality and
stability of the application.

• Achieve code modularization – Libraries can be coupled or decoupled based on
requirement.

Over the last two decades, Python has emerged as a first-choice tool for tasks that
involve scientific computing, including the analysis and visualization of large
datasets. Python has gained popularity, particularly in the field of data science,
because of its large and active ecosystem of third-party libraries.

A few of the popular libraries in data science are NumPy, Pandas, Matplotlib and
Scikit-Learn.

Let us proceed to understand these libraries in detail.

*Business scenario

XYZ Custom Cars is an automobile restoration company based in New York, USA. The
company is renowned for restoring vintage and muscle cars. Their team takes great
pride in each of their projects, no matter how big or small. They offer paint jobs,
frame build-ups, engine restoration, body work etc. They are also involved in buying
and reselling cars.

Every car owner that comes to XYZ Custom Cars gets a document drafted that contains
important information about the car. This information is related to the car's
performance and manufacturing. The company maintains this database with proper
diligence.

The data consists of features like acceleration, horsepower, region, model year,
etc. The board of directors thinks that this data can help generate insights about
their projects. These insights would help restore similar cars with similar
standards and procedures. They would also help predict better reselling prices in
future. In short, these insights would help generate greater revenue for the company
through cost cutting and a data-driven approach to their process.

For example, the company may be interested in setting up different workstations
that cater to specific categories of cars as follows:

Category | Description | Features coming into play
Fuel efficient | Cars designed with low power and high fuel efficiency | High MPG, low horsepower, low weight
Muscle cars | Intermediate-sized cars designed for high performance | High displacement, high horsepower, moderate weight
SUV | Big cars designed for high performance, long-distance trips and family comfort | High horsepower, high weight
Racecar | Cars specifically designed for race tracks | High horsepower, low weight, high acceleration

This would allow the company to place specialized mechanics and equipment in
specific workstations, thereby creating a hassle free and efficient work
atmosphere. Another interesting thing would be predicting the fuel efficiency of
cars after restoration based on the available data to minimize field testing.

The business scenario defined above will be used to understand the following Python
libraries:

1. NumPy

• To get a mathematical and structural understanding of the data
• To build a base for Pandas, the data manipulation library
• To get familiar with terms like arrays, axis, vectorization etc.

2. Pandas

• To read the data
• To explore the data
• To do operations on the data
• To manipulate the data
• To draw simple visualizations for self-consumption
• To generate insights

3. Matplotlib

• To visualize the data
• To generate deep insights
• To present the data to the leadership
• To get a visual understanding of various features

4. Scikit-Learn

• To get data ready for model building
• To build predictive models
• To evaluate the models
• To interpret the model results and present them to the leadership

*Revisiting Python List

A Python List can be used to store a group of elements together in a sequence. It
can contain heterogeneous elements.

Following are some examples of List:

item_list = ['Bread', 'Milk', 'Eggs', 'Butter', 'Cocoa']
student_marks = [78, 47, 96, 55, 34]
hetero_list = [1, 2, 3.0, 'text', True, 3+2j]

To perform operations on the List elements, one needs to iterate through the List.
For example, if five extra marks need to be awarded to all the entries in the
student marks list, the following approach can be used:

student_marks = [78, 47, 96, 55, 34]
for i in range(len(student_marks)):
    student_marks[i]+=5
print(student_marks)

It can be observed that a loop is needed. The code is lengthy and becomes
computationally expensive as the size of the List increases.

Data Science is a field that utilizes scientific methods and algorithms to generate
insights from data. These insights can then be made actionable and applied across a
broad range of application domains. Data Science deals with large datasets.
Operating on such data with lists and loops is time-consuming and computationally
expensive.

*Comparing Python List and Numpy performance.

Let us understand why Python Lists can become a bottleneck if they are used for
large data.

Consider that about 1 million numbers from two different lists must be added
element by element.

%%time
#Used to calculate total operation time
list1 = list(range(1,1000000))
list2 = list(range(2,1000001))
list3 = []
for i in range(len(list1)):
    list3.append(list1[i]+list2[i])
Note: Time taken will be different in different systems.

Let us understand, how Numpy can solve the same in minimal time.

Note: Ignore the syntax and focus on only the output.

%%time
#Used to calculate total operation time
#Importing Numpy
import numpy as np
#Creating two numpy arrays of about 1 million numbers each
a = np.arange(1,1000000)
b = np.arange(2,1000001)
c = a+b

In the run reported here, the Numpy version completed in 12 milliseconds, compared
with 395 milliseconds for the Python List version. As the data size and the
complexity of operations increase, the performance gap between Numpy and Python
Lists widens.

In Data Science, there are millions of records to be dealt with. The performance
limitations faced by using Python List can
be managed by usage of advanced Python libraries like Numpy.
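
The %%time magic used above is only available inside Jupyter/IPython. As a rough sketch, the same comparison can be run as a plain script with the standard time.perf_counter function. The variable names below are illustrative, and the measured times will differ from machine to machine.

```python
import time
import numpy as np

n = 1_000_000

# Element-wise addition with Python lists and an explicit loop
list1 = list(range(n))
list2 = list(range(1, n + 1))
start = time.perf_counter()
list3 = []
for i in range(n):
    list3.append(list1[i] + list2[i])
list_seconds = time.perf_counter() - start

# The same addition with NumPy arrays (vectorized, no Python-level loop)
a = np.arange(n)
b = np.arange(1, n + 1)
start = time.perf_counter()
c = a + b
numpy_seconds = time.perf_counter() - start

print(f"list: {list_seconds:.4f} s, numpy: {numpy_seconds:.4f} s")
```

Both approaches produce the same values; only the time taken differs.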

*Introduction to Numpy

NumPy (Numerical Python) is a Python library that is used for numeric and scientific
operations. It serves as a building block for many other libraries available in
Python.

Data structures in Numpy

The main data structure of NumPy is the ndarray, or n-dimensional array.

The ndarray is a multidimensional container of elements of the same type. It can
easily deal with matrix and vector operations.
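
Because every element of an ndarray has the same type, NumPy converts mixed input to one common type. The short sketch below (with made-up values) shows this by inspecting the kind of data type NumPy chooses.

```python
import numpy as np

ints = np.array([1, 2, 3])          # all integers
mixed = np.array([1, 2, 3.0])       # one float: every element is stored as a float
texts = np.array([1, 'two', 3.0])   # one string: every element is stored as a string

print(ints.dtype.kind)   # 'i' (integer)
print(mixed.dtype.kind)  # 'f' (float)
print(texts.dtype.kind)  # 'U' (Unicode string)
```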

*Getting Started

Importing Numpy
The Numpy library needs to be imported into the environment before it can be used,
as shown below. 'np' is the standard alias used for Numpy.

import numpy as np

Numpy object creation
A Numpy array can be created by using the array() function. The array() function
returns an array object of type ndarray.

Syntax: np.array(object, dtype)

object – A Python object (for example, a list)

dtype – desired data type of the array elements (for example, integer); optional

Example: Consider the following marks scored by five students (one row per Student
ID in the original table):

Marks: 78, 92, 36, 64, 89

These marks can be represented in a one-dimensional Numpy array as shown below:

import numpy as np
student_marks_arr = np.array([78, 92, 36, 64, 89])
student_marks_arr

This is one way to create a simple one-dimensional array.

*Numpy object creation demo- 1D array

The following dataset has been provided by XYZ Custom Cars. This data comes in a
csv file format.

There are various columns in this dataset. Each column contains multiple values.
These values can be represented as lists of items. Since each column contains
homogeneous values, Numpy arrays can be used to represent them.

Let us understand how to represent the car 'horsepower' values in a Numpy array.

#creating a list of 5 horsepower values


horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
horsepower_arr

*Numpy object creation demo- 2D array

How can multiple columns be represented together?

This can be achieved by creating the Numpy array from a List of Lists.

Let us understand how to represent the car 'mpg', 'horsepower' and 'acceleration'
values in a Numpy array.

#creating a list of lists of 5 mpg, horsepower and acceleration values
car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318,
304, 302]]
#creating a numpy array from car_attributes list
car_attributes_arr = np.array(car_attributes)
car_attributes_arr

The example demonstrates that a Numpy array created from a List of Lists is a
two-dimensional array.

*Shape of ndarray

The shape attribute of an ndarray (numpy.ndarray.shape) returns a tuple that
describes the shape of the array.

For example:

• a one-dimensional array having 10 elements has the shape (10,)

• a two-dimensional array having 10 elements distributed evenly in two rows has
the shape (2,5)

Let us find the shape of the car attributes array.

#creating a list of lists of mpg, horsepower and acceleration values
car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318,
304, 302]]
#creating a numpy array from attributes list
car_attributes_arr = np.array(car_attributes)
car_attributes_arr.shape

The result is (3, 5). Here, 3 represents the number of rows and 5 represents the
number of elements in each row.
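
The two shapes mentioned in the examples above, (10,) and (2,5), can be reproduced with a short sketch. It uses np.arange and the reshape method, which are not introduced at this point in the course, purely to build the two arrays.

```python
import numpy as np

one_d = np.arange(10)          # ten elements: 0 through 9
two_d = one_d.reshape(2, 5)    # the same ten elements arranged as 2 rows x 5 columns

print(one_d.shape)  # (10,)
print(two_d.shape)  # (2, 5)
print(two_d.ndim)   # 2  (number of dimensions)
print(two_d.size)   # 10 (total number of elements)
```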

*'dtype' of ndarray

'dtype' refers to the data type of the elements contained in the array. Numpy
supports multiple data types like integer, float, string, boolean etc.

Below is an example of using the dtype attribute to identify the data type of the
elements in an array.

#creating a list of lists of 5 mpg, horsepower and acceleration values


car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318,
304, 302]]
#creating a numpy array from attributes list
car_attributes_arr = np.array(car_attributes)
car_attributes_arr.dtype

Changing dtype
Numpy dtype can be changed as per requirements. For example, an array of integers
can be converted to float.

Below is an example of passing dtype as an argument to the np.array() function so
that the integer elements are stored as floats.

#creating a list of lists of 5 mpg, horsepower and acceleration values


car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318,
304, 302]]
#converting dtype
car_attributes_arr = np.array(car_attributes, dtype = 'float')
print(car_attributes_arr)
print(car_attributes_arr.dtype)
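
When the array already exists, its data type can also be converted with the astype method of ndarray. This is a minimal sketch of that alternative; astype is not used elsewhere in this material.

```python
import numpy as np

car_attributes = [[18, 15, 18, 16, 17],
                  [130, 165, 150, 150, 140],
                  [307, 350, 318, 304, 302]]
int_arr = np.array(car_attributes)

# astype returns a new array; int_arr itself is left unchanged
float_arr = int_arr.astype('float64')

print(int_arr.dtype.kind)   # 'i' (integer)
print(float_arr.dtype)      # float64
```
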
*Accessing Numpy arrays

The elements of an ndarray are accessed using an index within square brackets [ ].
In Numpy, both positive and negative indices can be used to access elements.
Positive indices count from the beginning of the array, while negative indices count
from the end of the array. Indexing starts from 0 for positive indices and from -1
for negative indices.

Below are some examples of accessing data from numpy arrays:

1. Accessing element from 1D array.


#creating an array of cars
cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320',
'plymouth satellite', 'amc rebel sst', 'ford torino'])
#accessing the second car from the array
cars[1]

2. Accessing elements from a 2D array


#Creating a 2D array consisting of car names and horsepower
#(mixing strings and integers makes Numpy store every element as a string)
car_names = ['chevrolet chevelle malibu', 'buick skylark 320',
'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])
car_hp_arr

#Accessing car names
car_hp_arr[0]

#Accessing horsepower
car_hp_arr[1]

#Accessing second car - 0 represents 1st row and 1 represents 2nd element of the row
car_hp_arr[0,1]

#Accessing name of last car using negative indexing
car_hp_arr[0,-1]
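
Positive and negative indices can also be combined on a purely numeric 2D array. The sketch below reuses the car attribute values from the earlier demos; the variable names on the left are illustrative.

```python
import numpy as np

car_attributes_arr = np.array([[18, 15, 18, 16, 17],
                               [130, 165, 150, 150, 140],
                               [307, 350, 318, 304, 302]])

first_row_last = car_attributes_arr[0, -1]   # last element of the first row
last_row_first = car_attributes_arr[-1, 0]   # first element of the last row
same_element = car_attributes_arr[2, 0]      # the same element, using a positive row index

print(first_row_last, last_row_first, same_element)  # 17 307 307
```
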

*Slicing

Slicing is a way to access and obtain subsets of ndarray in Numpy.

Syntax: array_name[start : end] – index starts at ‘start’ and ends at ‘end - 1’.

1.Slicing from 1D array


#creating an array of cars
cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320',
'plymouth satellite', 'amc rebel sst', 'ford torino'])
#accessing a subset of cars from the array
cars[1:4]

2. Slicing from a 2D array


#Creating a 2D array consisting of car names, horsepower and acceleration
car_names = ['chevrolet chevelle malibu', 'buick skylark 320',
'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
acceleration = [18, 15, 18, 16, 17]
car_hp_acc_arr = np.array([car_names, horsepower, acceleration])
car_hp_acc_arr

#Accessing name and horsepower
car_hp_acc_arr[0:2]

#Accessing name and horsepower of last two cars
car_hp_acc_arr[0:2, 3:5]

#Accessing name, horsepower and acceleration of first three cars
car_hp_acc_arr[0:3, 0:3]
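
One point worth keeping in mind when slicing: a basic slice of a NumPy array is a view that shares memory with the original array, so changing the slice also changes the original. The sketch below illustrates this and shows copy() as a way to get an independent array; the variable names are illustrative.

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

view = horsepower_arr[1:4]                 # basic slice: shares memory with the original
independent = horsepower_arr[1:4].copy()   # explicit copy: separate memory

view[0] = 999          # also changes horsepower_arr[1]
independent[1] = -1    # does not affect horsepower_arr

print(horsepower_arr)  # [130 999 150 150 140]
print(independent)     # [165  -1 150]
```
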

*Mean and Median

Problem Statement:
The engineers at XYZ Custom Cars want to know about the mean and median of
horsepower.

Solution:
The mean and median can be calculated with the help of following code:

#creating a list of 5 horsepower values


horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#mean horsepower
print("Mean horsepower = ",np.mean(horsepower_arr))

#creating a list of 5 horsepower values


horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#median horsepower
print("Median horsepower = ",np.median(horsepower_arr))
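
For a 2D array, np.mean and np.median accept an axis argument that selects the direction of the calculation. The following sketch, using the mpg and horsepower rows from the earlier demos, computes one mean per row (axis=1) and one median per column (axis=0).

```python
import numpy as np

car_attributes_arr = np.array([[18, 15, 18, 16, 17],
                               [130, 165, 150, 150, 140]])

row_means = np.mean(car_attributes_arr, axis=1)          # one value per row
column_medians = np.median(car_attributes_arr, axis=0)   # one value per column

print(row_means)        # [ 16.8 147. ]
print(column_medians)   # [74.  90.  84.  83.  78.5]
```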

*Min and max

Problem Statement:
The engineers at XYZ Custom Cars want to know about the minimum and maximum
horsepower.

Solution:
The min and max can be calculated with the help of following code:
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
print("Minimum horsepower: ", np.min(horsepower_arr))
print("Maximum horsepower: ", np.max(horsepower_arr))

Finding the index of minimum and maximum values:


'argmin()' and 'argmax()' return the index of minimum and maximum values in an
array respectively.
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
print("Index of Minimum horsepower: ", np.argmin(horsepower_arr))
print("Index of Maximum horsepower: ", np.argmax(horsepower_arr))

*Querying/searching in an array

Problem Statement:
The engineers at XYZ Custom Cars want to know the horsepower values that are
greater than or equal to 150.

Solution:
The 'where' function can be used for this requirement. Given a condition, the
'where' function returns the indexes of the array elements that satisfy the
condition. Using these indexes, the respective values from the array can be obtained.

#creating a list of 5 horsepower values


horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
x = np.where(horsepower_arr >= 150)
print(x) # gives the indices
# With the indices , we can find those values
horsepower_arr[x]

*Filter data

Problem Statement:
The Engineers at XYZ Custom Cars want to create a separate array consisting of
filtered values of horsepower greater than 135.

Solution:
Getting some elements out of an existing array based on certain conditions and
creating a new array out of them is called filtering.

The following code can be used to accomplish this:

#creating a list of 5 horsepower values


horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#creating filter array
filter_arr = horsepower_arr > 135
newarr = horsepower_arr[filter_arr]
print(filter_arr)
print(newarr)
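
Filters can also combine several conditions. As a small sketch, the element-wise operators & (and) and | (or) join boolean arrays; each comparison must be wrapped in parentheses. The thresholds below are made up for illustration.

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# Parentheses are required because & and | bind more tightly than comparisons
mid_range = horsepower_arr[(horsepower_arr > 135) & (horsepower_arr < 160)]
extremes = horsepower_arr[(horsepower_arr <= 130) | (horsepower_arr >= 165)]

print(mid_range)  # [150 150 140]
print(extremes)   # [130 165]
```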

*Sorting an array

Problem Statement:
The engineers at XYZ Custom Cars want the horsepower in sorted order.

Solution:
The Numpy array can be sorted either by passing it to the function np.sort(array) or
by calling the method array.sort().

So, what is the difference between the two, given that both are used for sorting?

#creating a list of 5 horsepower values


horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#using sort(array)
print('original array: ', horsepower_arr)
print('Sorted array: ', np.sort(horsepower_arr))
print('original array after sorting: ', horsepower_arr)

#creating a list of 5 horsepower values


horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#using array.sort()
print('original array: ', horsepower_arr)
horsepower_arr.sort()
print('original array after sorting: ', horsepower_arr)

The difference is that array.sort() sorts the original array in place, whereas
np.sort(array) returns a sorted copy and leaves the original array unchanged.

*Vectorized operations

Mathematical operations can be performed on Numpy arrays. Numpy makes use of
optimized, pre-compiled code to perform mathematical operations on each array
element. This eliminates the need for loops, thereby enhancing performance. This
process is called vectorization.

Numpy provides various mathematical functions such as sum(), add(), subtract(),
log(), sin() etc. which use vectorization.

Consider an example of marks scored by a student:

Subject | Marks
English | 78
Mathematics | 92
Physics | 36
Chemistry | 64
Biology | 89

Problem Statement:
Calculate the sum of all the marks.

Solution:
The sum() function can be used, which internally uses vectorization.

student_marks_arr = np.array([78, 92, 36, 64, 89])
print(np.sum(student_marks_arr))

Problem Statement:
Award extra marks in subjects as follows:

• English: +2
• Mathematics: +2
• Physics: +5
• Chemistry: +10
• Biology: +2

Solution:
Below is the solution to the problem:

#extra marks in the order English, Mathematics, Physics, Chemistry, Biology
additional_marks = [2, 2, 5, 10, 2]
student_marks_arr += additional_marks
student_marks_arr

Also, the same operation can be performed as shown below:

student_marks_arr = np.array([78, 92, 36, 64, 89])
student_marks_arr = np.add(student_marks_arr, additional_marks)
student_marks_arr

Both of the above methods use vectorization internally, eliminating the need for
loops.

Other arithmetic operations can also be performed in a similar manner.

In addition to arithmetic operations, several other mathematical operations


like exponents, logarithms and trigonometric functions are also available in Numpy.
This makes Numpy a very useful tool for
scientific computing.
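
As a brief sketch of such operations, the code below applies a few vectorized functions to the marks array and compares one of them with an equivalent loop. The specific calculations (percentages, square roots, subtracting 5) are illustrative only.

```python
import numpy as np

student_marks_arr = np.array([78, 92, 36, 64, 89])

percentages = student_marks_arr / 100            # element-wise division
roots = np.sqrt(student_marks_arr)               # element-wise square root
difference = np.subtract(student_marks_arr, 5)   # same as student_marks_arr - 5

# Equivalent loop-based computation, shown for comparison only
looped = [mark - 5 for mark in student_marks_arr.tolist()]

print(difference)  # [73 87 31 59 84]
```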

*Broadcasting

"Broadcasting" refers to how Numpy handles arrays with different shapes during
arithmetic operations. The smaller array behaves as if it were stretched, or copied,
across the larger array so that the shapes match.

For example, considering the following arithmetic operations across 2 arrays:

import numpy as np
# Array 1
array1=np.array([5, 10, 15])
# Array 2
array2=np.array([5])
array3= array1 * array2
array3

In this example, array2 is stretched to match array1 during the arithmetic
operation, resulting in a new array, array3, with the same shape as array1.

Broadcasting can be understood through the following three operations:

• In the first operation, the shape of the first array is 1x3 and the shape of the
second array is 1x1. Hence, according to the broadcasting rules, the second array
gets stretched to match the shape of the first array, and the shape of the resulting
array is 1x3.

• In the second operation, the shape of the first array is 3x3 and the shape of the
second array is 1x3. Hence, according to the broadcasting rules, the second array
gets stretched to match the shape of the first array, and the shape of the resulting
array is 3x3.

• In the third operation, the shape of the first array is 3x1 and the shape of the
second array is 1x3. Hence, according to the broadcasting rules, both arrays get
stretched, and the shape of the resulting array is 3x3.
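
The three shape combinations described above can be checked with a short sketch. The array values here are made up for illustration and are not taken from the original diagram; only the shapes follow the text.

```python
import numpy as np

row = np.array([[1, 2, 3]])             # shape (1, 3)
single = np.array([[10]])               # shape (1, 1)
matrix = np.arange(9).reshape(3, 3)     # shape (3, 3)
column = np.array([[0], [10], [20]])    # shape (3, 1)

first = row + single      # (1, 3) with (1, 1) -> (1, 3)
second = matrix + row     # (3, 3) with (1, 3) -> (3, 3)
third = column + row      # (3, 1) with (1, 3) -> (3, 3)

print(first.shape, second.shape, third.shape)  # (1, 3) (3, 3) (3, 3)
print(third)
```

In the third case each array is stretched along the axis where its length is 1, which is why the result has three rows and three columns.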

*Broadcasting - demo

Consider the following table consisting of marks scored by four students in two
different subjects:

Students | Chemistry | Physics
Subodh | 67 | 45
Ram | 90 | 92
Abdul | 66 | 72
John | 32 | 40

The teacher of these students wants to represent their marks in an array. To do so,
the marks can be stored in a 4x2 array as follows:

#Marks of 4 students in 2 subjects
students_marks = np.array([[67, 45],[90, 92],[66, 72],[32, 40]])
students_marks

Problem Statement:
Now the teacher wants to award five extra marks in Chemistry and ten extra marks in
Physics.

Solution:
#Marks of 4 students in 2 subjects
students_marks = np.array([[67, 45],[90, 92],[66, 72],[32, 40]])
#Broadcasting
students_marks += [5,10]
students_marks

The students' marks array is a 2D array of shape 4x2. The marks to be added form a
1D array of two elements, which broadcasting treats as shape 1x2. According to the
broadcasting rules, the marks to be added get stretched to match the shape of the
students' marks array, and the shape of the resulting array is 4x2.

*Image as a Numpy matrix

Images are stored as arrays of hundreds, thousands or even millions of picture
elements, called pixels. Therefore, an image can be treated as a Numpy array, since
it can be represented as a matrix of pixels.

Certain basic operations and manipulations can be carried out on images using Numpy
and the scikit-image package. Scikit-image is an image processing package.

The package is imported as skimage.

Importing an image:
#Importing path and skimage i/o library
import os.path
from skimage.io import imread
from skimage import data_dir
#reading the astronaut image
img = imread(os.path.join(data_dir, 'astronaut.png'))

#Using matplotlib.pyplot to visualize the image


import matplotlib.pyplot as plt
plt.imshow(img)

To view as a matrix, the below command must be followed:

print(img)
Properties of image
Let us understand the type, dimensions and shape of the image.

print('Type of image: ', type(img))


print('Dimensions of image: ', img.ndim)
print('Shape of image:', img.shape)

*Indexing and selection

So far, you have become familiar with how to retrieve the basic attributes of an
image. Let us proceed to some examples of indexing and selection on images.

Cutting the rocket out of the image


#Importing path and skimage i/o library
import os.path
from skimage.io import imread
from skimage import data_dir
#reading the astronaut image
img = imread(os.path.join(data_dir, 'astronaut.png'))
#Slicing out the rocket
img_slice = img.copy()
img_slice = img_slice[0:300,360:480]
plt.figure()
plt.imshow(img_slice)

In this case, the image has been sliced corresponding to the rocket from the
original image.

Assigning the values corresponding to the sliced image as 0:

img[0:300,360:480,:] = 0
plt.imshow(img)

The place where the sliced rocket image was present initially is now filled with
black, because 0 is assigned to the values corresponding to the sliced image.

Setting to 0 the pixels of the slice whose first-channel (red) value lies between
100 and 150:

img_slice[np.greater_equal(img_slice[:,:,0],100) &
np.less_equal(img_slice[:,:,0],150)] = 0
plt.figure()
plt.imshow(img_slice)

Replacing the 'rocket' back to its original place:

img[0:300,360:480,:] = img_slice
plt.imshow(img)

In the above picture, the black region from the previous step is replaced with the
sliced 'rocket'.
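
The same slice, blank-out and restore steps can be tried without scikit-image by using a synthetic array in place of the photograph. This is only a sketch: the uniform grey "image" below is an assumption standing in for the astronaut picture, with the same 512 x 512 x 3 layout.

```python
import numpy as np

# Stand-in for an RGB image: 512 rows x 512 columns x 3 colour channels
img = np.full((512, 512, 3), 200, dtype=np.uint8)

# Copy the region first so that later changes to img do not affect it
img_slice = img[0:300, 360:480].copy()

img[0:300, 360:480, :] = 0                 # black out the region in the image
blacked_out = img[0:300, 360:480].max()    # largest value left in the region

img[0:300, 360:480, :] = img_slice         # restore the saved region

print(img_slice.shape)   # (300, 120, 3)
print(blacked_out)       # 0
print(img.min())         # 200
```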

*Summary

So far, you have learnt the following key features of Numpy:

• Numpy offers multi-dimensional arrays.

• It provides array operations that are better than Python list operations in terms
of speed, efficiency and ease of writing code.

• Numpy provides fast and convenient operations in the form of vectorization and
broadcasting.

• Numpy offers additional capabilities to perform Linear Algebra and scientific
computing. This is out of scope of this module.

There is another Python library called Pandas, built on top of Numpy, for analysis
and manipulation of tabular data. Let us proceed to learn more about it.

*Additional methods to create numpy arrays - Arange and Linspace

Arange
This method returns evenly spaced values within the given interval, excluding the
end limit. The values are generated based on the step value; by default, the step
value is 1.

#start and end limit


np.arange(0,10000)

#step value = 2
np.arange(0,10,2)

Linspace
This method returns a given number of evenly spaced values over the given interval,
including both end points. By default, 50 values are generated.

#Generating values between 0 and 10


arr = np.linspace(0,10)
print(arr)
print('Length of arr: ',len(arr))

#Number of values = 3
print(np.linspace(0,10,3))
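
The difference between the two methods is easiest to see side by side. In this small sketch, arange is given a step and leaves out the end value, while linspace is given a count and includes it.

```python
import numpy as np

by_step = np.arange(0, 10, 2)      # step of 2, end value 10 excluded
by_count = np.linspace(0, 10, 5)   # 5 values, end value 10 included

print(by_step)                   # [0 2 4 6 8]
print(by_count)                  # [ 0.   2.5  5.   7.5 10. ]
print(len(np.linspace(0, 10)))   # 50 (the default number of values)
```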

*Zeroes and Ones

Zeros
Returns an array of given shape filled with zeros.

#1D
np.zeros(5)

#2D
np.zeros([2,3])

Ones
Returns an array of given shape filled with ones.

#1D
np.ones(3)

#2D
np.ones([2,1])

*Full and Eye

Full:
Returns an array of given shape, filled with given value, irrespective of datatype.

#number=5, value=8
np.full(5,8)

#shape=[3,3], value='numpy'
np.full([3,3],'numpy')

Eye
Returns an identity matrix of the given size.

#3x3 identity matrix
np.eye(3)

*Random

Random
NumPy has numerous ways to create random number arrays. Random numbers can be
created for the required length, from a uniform distribution by just passing the
value of required length to the random.rand function.

#generating 5 random numbers from a uniform distribution


np.random.rand(5)

Note: Output might not be same as it is randomly generated.

Similarly, to generate random numbers from a standard Normal distribution, use the
random.randn function.

Random numbers of type 'integer' can also be generated using the random.randint
function. Below is an example of creating five random integers from 1 up to, but not
including, 10.

#random integer values low=1, high=10 (exclusive), number of values=5
np.random.randint(1,10, size=5)

Similarly, two-dimensional arrays of random numbers can also be created by passing


the shape instead of number of values.

#random integer values high=100, shape = (3,5)


x = np.random.randint(100, size=(3, 5))
print(x)
print(type(x))

To generate a random number from a predefined set of values present in an array,


the choice() method can be used.

The choice() method takes an array as a parameter and randomly returns the values
based on the size.

#returns a single random value from the array


x = np.random.choice([9, 3, 7, 5])
print(x)

#returns 3*5 random values from the array


x = np.random.choice([9, 3, 7, 5], size=(3, 5)) # sampling to create an nd-array
print(x)
print(type(x))
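
Since random output changes on every run, it is often useful to make it repeatable. One way, sketched below, is NumPy's Generator interface created with np.random.default_rng and a fixed seed; this interface is an alternative to the np.random functions used above, and the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

# Integers from 1 up to, but not including, 10
draw_a = rng_a.integers(1, 10, size=5)
draw_b = rng_b.integers(1, 10, size=5)

print(draw_a)
print((draw_a == draw_b).all())  # True
```
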

*Introduction to pandas

Pandas is an open-source library for real-world data analysis in Python. It is
built on top of Numpy. Using Pandas, data can be cleaned, transformed, manipulated
and analyzed. It is suited for different kinds of data, including tabular data as in
a SQL table or an Excel spreadsheet, time series data, and observational or
statistical datasets.

The steps involved to perform data analysis using Pandas are as follows:

*Steps in data analysis

Reading the data
The first step is to read the data. Data can be obtained in multiple formats, such
as '.csv', '.json', '.xlsx' etc.

[Examples of an Excel file, a JSON (JavaScript Object Notation) file and a CSV
(comma-separated values) file were shown here.]
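
As a minimal sketch of the reading step, the code below loads CSV data with pd.read_csv. To keep the example self-contained, the CSV text is held in a string and wrapped in io.StringIO; the three rows and the column names are hypothetical and are not the actual XYZ Custom Cars file.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk
csv_text = """name,horsepower,mpg
chevrolet chevelle malibu,130,18
buick skylark 320,165,15
plymouth satellite,150,18
"""

# read_csv accepts a file path or a file-like object
cars_df = pd.read_csv(io.StringIO(csv_text))

print(cars_df.shape)           # (3, 3)
print(list(cars_df.columns))   # ['name', 'horsepower', 'mpg']
```

With a real file, the path would be passed instead, for example pd.read_csv('cars.csv'), where the file name is again only a placeholder.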

*Steps in data analysis

Exploring the data
The next step is to explore the data. Exploring data helps to:

• know the shape (number of rows and columns) of the data
• understand the nature of the data by obtaining subsets of the data
• identify missing values and treat them accordingly
• get insights about the data using descriptive statistics

Performing operations on the data
Some of the operations supported by Pandas for data manipulation are as follows:

• Grouping operations
• Sorting operations
• Masking operations
• Merging operations
• Concatenating operations

Visualizing data
The next step is to visualize the data to get a clear picture of the various
relationships among the data. The following plots can help visualize the data:

• Scatter plot
• Box plot
• Bar plot
• Histogram and many more

Generating insights
All the above steps help in generating insights about the data.

*Why Pandas?

Pandas is one of the most popular data wrangling and analysis tools because it:

• has the capability to load huge volumes of data easily
• provides extremely streamlined forms of data representation
• can handle heterogeneous data, has an extensive set of data manipulation features
and makes data flexible and customizable

*Getting started with Pandas

To get started with Pandas, Numpy and Pandas need to be imported as shown below:

#Importing libraries
#python library for numerical and scientific computing. pandas is built on top of numpy
import numpy as np
#importing pandas
import pandas as pd

In a nutshell, Pandas objects are advanced versions of NumPy structured arrays in
which the rows and columns are identified with labels instead of simple integer
indices.

The basic data structures of Pandas are Series and DataFrame.

*Pandas Series object

A Series is a one-dimensional labelled array. It supports different data types like
integer, float, string etc. Let us understand more about Series with the following
example.

Consider the scenario where the marks of five students (one row per Student ID in
the original table) are given as follows:

Marks: 78, 92, 36, 64, 89

The Pandas Series object can be used to represent this data in a meaningful manner.
A Series is created using the following syntax:

Syntax:

pd.Series(data, index, dtype)

data – It can be a list, a list of lists or even a dictionary.

index – The index can be explicitly defined for the values if required.

dtype – This represents the data type used in the series (optional parameter).

series = pd.Series(data = [78, 92, 36, 64, 89])


series

As shown in the above output, the series object provides the values along with
their index attributes.

Series.values provides the values.

series.values

Series.index provides the index.


series.index

Accessing data in series:


Data can be accessed by the associated index using [ ].

series[1]

Slicing a series:
series[1:3]
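
To connect the three parameters in the syntax above, here is a small sketch that passes all of them. The index labels 's1' to 's5' are hypothetical, chosen only for illustration, and float64 is an arbitrary choice of dtype.

```python
import pandas as pd

# The index labels below are hypothetical, chosen only for illustration
marks = pd.Series(data=[78, 92, 36, 64, 89],
                  index=['s1', 's2', 's3', 's4', 's5'],
                  dtype='float64')

print(marks['s2'])             # 92.0
print(marks.values.tolist())   # [78.0, 92.0, 36.0, 64.0, 89.0]
print(list(marks.index))       # ['s1', 's2', 's3', 's4', 's5']
```
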

*Custom index in series
