Python For Data Science

This document provides an introduction to using Python libraries like NumPy, Pandas, Matplotlib and Scikit-Learn for data analysis and machine learning. It discusses the need for libraries to improve code efficiency and modularity over writing code from scratch. NumPy is introduced as a fundamental library for scientific computing that allows faster operations on large datasets using ndarrays compared to native Python lists. The document demonstrates creating 1D and 2D NumPy arrays from lists of data to represent columns of a car dataset.

Uploaded by

Mohit Malghade

*By the end of this course, you will be able to:

• Explain the need for Python libraries

• Use Numpy to work with arrays

• Use Pandas to load, explore, manipulate, analyze and process data

• Derive statistical outcomes of a real dataset

• Visualize data

• Create a machine learning model for predictive analysis

*About Python

Python is an open-source, general-purpose programming language. It supports both structured and object-oriented styles of programming. It can be used for developing a wide range of applications, including web applications, data analytics, machine learning applications, etc.

Python provides various data types and data structures for storing and processing data. For handling single values, there are data types like int, float, str, and bool. For handling data in groups, Python provides data structures like list, tuple, dictionary, set, etc.

Python has a wide range of libraries and built-in functions which aid in rapid development of applications. Python libraries are collections of pre-written code that perform specific tasks. This eliminates the need to rewrite the code from scratch.

*Why Python Libraries?

Let us consider the following scenario:

John is a software developer. His project requires developing an application that connects to various database servers like MySQL, PostgreSQL, MongoDB, etc. To implement this requirement from scratch, John needs to invest his time and effort to understand the underlying architectures of the respective databases. Instead, John can choose to use pre-defined libraries to perform the database operations, which abstract the complexities involved.

Use of libraries will help John in the following ways:

• Faster application development – Libraries promote code reusability and help developers save time and focus on building the functional logic.
• Enhanced code efficiency – Use of pre-tested libraries enhances the quality and stability of the application.
• Code modularization – Libraries can be coupled or decoupled based on requirements.

Over the last two decades, Python has emerged as a first-choice tool for tasks that involve scientific computing, including the analysis and visualization of large datasets. Python has gained popularity, particularly in the field of data science, because of its large and active ecosystem of third-party libraries.

A few of the popular libraries in data science are NumPy, Pandas, Matplotlib and Scikit-Learn.

Let us proceed to understand these libraries in detail.

*Business scenario

XYZ Custom Cars is an automobile restoration company based in New York, USA. This company is renowned for restoring vintage and muscle cars. Their team takes great pride in each of their projects, no matter how big or small. They offer paint jobs, frame build-ups, engine restoration, body work, etc. They are also involved in buying and reselling cars.

Every car owner who comes to XYZ Custom Cars gets documentation drafted that contains important information about the car. This information is related to the car’s performance and manufacturing. The company maintains this database with proper diligence.

The data consists of features like acceleration, horsepower, region, model year, etc. The board of directors thinks that this data can help generate insights about their projects. These insights would help restore similar cars with similar standards and procedures. They would also help predict better reselling prices in the future. In short, these insights would help generate greater revenue for the company through cost cutting and a data-driven approach to their process.

For example, the company may be interested in setting up different workstations that cater to specific categories of cars as follows:

Category: Fuel efficient
Description: Cars designed with low power and high fuel efficiency
Features coming into play: High MPG, low horsepower, low weight

Category: Muscle cars
Description: Intermediate sized cars designed for high performance
Features coming into play: High displacement, high horsepower, moderate weight

Category: SUV
Description: Big sized cars designed for high performance, long distance trips and family comfort
Features coming into play: High horsepower, high weight

Category: Racecar
Description: Cars specifically designed for race tracks
Features coming into play: High horsepower, low weight, high acceleration

This would allow the company to place specialized mechanics and equipment in specific workstations, thereby creating a hassle-free and efficient work atmosphere. Another useful application would be predicting the fuel efficiency of cars after restoration based on the available data, to minimize field testing.

The defined business scenarios will be used to understand the following Python
libraries:

1. Numpy

To get the mathematical and structural understanding of such data

To build a base for Pandas, the data manipulation library
To get familiar with terms like arrays, axis, vectorization etc.

2. Pandas

To read the data

To explore the data
To do operations on the data
To manipulate the data
To draw simple visualizations for self-consumption
To generate insights

3. Matplotlib

To visualize the data

To generate deep insights
To present the data to the leadership
To get a visual understanding of various features

4. Scikit-Learn

To get data ready for model building

To build predictive models
To evaluate the models
To infer the model results and present to the leadership

*Revisiting Python List

A Python list can be used to store a group of elements together in a sequence. It can contain heterogeneous elements.

Following are some examples of List:

item_list = ['Bread', 'Milk', 'Eggs', 'Butter', 'Cocoa']
student_marks = [78, 47, 96, 55, 34]
hetero_list = [1, 2, 3.0, 'text', True, 3+2j]

To perform operations on the list elements, one needs to iterate through the list. For example, if five extra marks need to be awarded to all the entries in the student marks list, the following approach can be used:

student_marks = [78, 47, 96, 55, 34]

for i in range(len(student_marks)):
    student_marks[i] += 5
print(student_marks)

Note that this approach relies on a loop. The code is lengthy and becomes computationally expensive as the size of the list increases.

Data Science is a field that utilizes scientific methods and algorithms to generate insights from data. These insights can then be made actionable and applied across a broad range of application domains. Data Science deals with large datasets. Operating on such data with lists and loops is time consuming and computationally expensive.

*Comparing Python List and Numpy performance.

Let us understand why Python Lists can become a bottleneck if they are used for
large data.

Consider adding, element by element, two lists of about 1 million numbers each.

%%time
#Used to calculate total operation time
list1 = list(range(1,1000000))
list2 = list(range(2,1000001))
list3 = []
for i in range(len(list1)):
    list3.append(list1[i]+list2[i])

Note: Time taken will be different in different systems.

Let us understand how Numpy can solve the same problem in far less time.

Note: Ignore the syntax and focus on only the output.

%%time
#Used to calculate total operation time
#Importing Numpy
import numpy as np
#Creating a numpy array of 1 million numbers
a = np.arange(1,1000000)
b = np.arange(2,1000001)
c = a+b
It can be observed that the same operation completed in 12 milliseconds with Numpy, compared to 395 milliseconds with the Python list. As the data size and the complexity of operations increase, the difference between the performance of Numpy and Python lists widens.
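
The %%time cell magic used above only works inside Jupyter/IPython. As a rough sketch for a plain Python script, the same comparison can be timed with time.perf_counter from the standard library. The variable names here are illustrative, and the measured times will differ from system to system.

```python
import time
import numpy as np

n = 1_000_000

# Element-wise addition with Python lists and an explicit loop
list1 = list(range(1, n + 1))
list2 = list(range(2, n + 2))
start = time.perf_counter()
list3 = []
for i in range(len(list1)):
    list3.append(list1[i] + list2[i])
list_seconds = time.perf_counter() - start

# The same addition with Numpy arrays (no Python-level loop)
a = np.arange(1, n + 1)
b = np.arange(2, n + 2)
start = time.perf_counter()
c = a + b
numpy_seconds = time.perf_counter() - start

print(f"Python list loop: {list_seconds * 1000:.1f} ms")
print(f"Numpy addition:   {numpy_seconds * 1000:.1f} ms")
```

Both approaches produce the same sums; only the time taken differs.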

In Data Science, there are millions of records to be dealt with. The performance
limitations faced by using Python List can
be managed by usage of advanced Python libraries like Numpy.

*Introduction to Numpy

Numpy (Numerical Python) is a Python library that is used for numeric and scientific operations. It serves as a building block for many libraries available in Python.

Data structures in Numpy

The main data structure of NumPy is the ndarray or n-dimensional array.

The ndarray is a multidimensional container of elements of the same type. It can easily deal with matrix and vector operations.

*Getting Started

Importing Numpy
The Numpy library needs to be imported into the environment before it can be used, as shown below. 'np' is the standard alias used for Numpy.

import numpy as np

Numpy object creation
A Numpy array can be created by using the array() function. The array() function returns an array object of type ndarray.

Syntax: np.array(object, dtype)

object – A Python object (for example, a list)

dtype – The desired data type of the array elements (optional; for example, integer)

Example: Consider the following marks scored by students:

Student ID    Marks
1             78
2             92
3             36
4             64
5             89

These marks can be represented in a one-dimensional Numpy array as shown below:

import numpy as np
student_marks_arr = np.array([78, 92, 36, 64, 89])
student_marks_arr

This is one way to create a simple one-dimensional array.

*Numpy object creation demo- 1D array

The following dataset has been provided by XYZ Custom Cars. This data comes in a
csv file format.

There are various columns in this dataset. Each column contains multiple values. These values can be represented as lists of items. Since each column contains homogeneous values, Numpy arrays can be used to represent them.

Let us understand how to represent the car 'horsepower' values in a Numpy array.

#creating a list of 5 horsepower values

horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
horsepower_arr

*Numpy object creation demo- 2D array

How can multiple columns be represented together?

This can be achieved by creating the Numpy array from a list of lists.

Let us understand how to represent the car 'mpg', 'horsepower', and 'acceleration' values in a Numpy array.

#creating a list of lists of 5 mpg, horsepower and acceleration values

car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318,
304, 302]]
#creating a numpy array from car_attributes list
car_attributes_arr = np.array(car_attributes)
car_attributes_arr
The example demonstrates that the Numpy array created using the List of Lists
results in a two-dimensional array.

*Shape of ndarray

The shape attribute (numpy.ndarray.shape) returns a tuple that describes the shape of the array.

For example:

• A one-dimensional array having 10 elements will have a shape of (10,)
• A two-dimensional array having 10 elements distributed evenly in two rows will have a shape of (2,5)
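
The two shapes just described can be checked with a short sketch. The arrays here are generated with np.arange and reshape purely for illustration; they are not part of the car dataset.

```python
import numpy as np

# One-dimensional array with 10 elements
one_d = np.arange(10)
print(one_d.shape)   # (10,)

# The same 10 elements arranged evenly in two rows
two_d = one_d.reshape(2, 5)
print(two_d.shape)   # (2, 5)
```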
Let us find out the shape of the car attributes array.

#creating a list of lists of mpg, horsepower and acceleration values

car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318,
304, 302]]
#creating a numpy array from attributes list
car_attributes_arr = np.array(car_attributes)
car_attributes_arr.shape

The shape is (3, 5). Here, 3 represents the number of rows and 5 represents the number of elements in each row.

*'dtype' of ndarray

'dtype' refers to the data type of the data contained in the array. Numpy supports multiple data types like integer, float, string, boolean, etc.

Below is an example of using the dtype property to identify the data type of elements in an array.

#creating a list of lists of 5 mpg, horsepower and acceleration values
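
A minimal sketch of reading the dtype attribute, reusing the car attributes list from the previous sections. The exact integer type that is reported (for example int64) depends on the platform.

```python
import numpy as np

# List of lists of 5 mpg, horsepower and acceleration values
car_attributes = [[18, 15, 18, 16, 17],
                  [130, 165, 150, 150, 140],
                  [307, 350, 318, 304, 302]]
car_attributes_arr = np.array(car_attributes)

# dtype is an attribute of the array, not a function call
print(car_attributes_arr.dtype)
```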

Changing dtype

The dtype of a Numpy array can be changed as required. For example, an array of integers can be converted to float.

Below is an example of using dtype as an argument of the np.array() function to convert the data type of elements from integer to float.

#creating a list of lists of 5 mpg, horsepower and acceleration values

car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318,
304, 302]]
#converting dtype
car_attributes_arr = np.array(car_attributes, dtype = 'float')
print(car_attributes_arr)
print(car_attributes_arr.dtype)
*Accessing Numpy arrays

The elements in the ndarray are accessed using an index within square brackets [ ]. In Numpy, both positive and negative indices can be used to access elements in the ndarray. Positive indices count from the beginning of the array, starting at 0, while negative indices count from the end of the array, starting at -1.

Below are some examples of accessing data from numpy arrays:

1. Accessing element from 1D array.

#creating an array of cars
cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino'])
#accessing the second car from the array
cars[1]

2. Accessing elements from a 2D array

#Creating a 2D array consisting of car names and horsepower
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])
car_hp_arr

#Creating a 2D array consisting of car names and horsepower
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])
#Accessing second car - 0 represents 1st row and 1 represents 2nd element of the row
car_hp_arr[0,1]

#Creating a 2D array consisting of car names and horsepower
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])
#Accessing name of last car using negative indexing
car_hp_arr[0,-1]
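
One point worth keeping in mind with arrays like car_hp_arr: an ndarray holds elements of a single type, so when names and numbers are combined, Numpy stores the horsepower values as strings. The sketch below uses a shortened two-car version of the data to show this, and converts the horsepower row back to integers before doing arithmetic.

```python
import numpy as np

# Shortened version of the car data, for illustration only
car_names = ['chevrolet chevelle malibu', 'buick skylark 320']
horsepower = [130, 165]
car_hp_arr = np.array([car_names, horsepower])

# 'U' means every element is stored as a Unicode string
print(car_hp_arr.dtype.kind)
print(repr(car_hp_arr[1, 0]))

# Convert the horsepower row back to integers before calculating
hp_values = car_hp_arr[1].astype(int)
print(hp_values.mean())
```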

*Slicing

Slicing is a way to access and obtain subsets of ndarray in Numpy.

Syntax: array_name[start : end] – index starts at ‘start’ and ends at ‘end - 1’.

1. Slicing from a 1D array

#creating an array of cars
cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino'])
#accessing a subset of cars from the array
cars[1:4]

2. Slicing from a 2D array

#Creating a 2D array consisting of car names, horsepower and acceleration
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
acceleration = [18, 15, 18, 16, 17]
car_hp_acc_arr = np.array([car_names, horsepower, acceleration])
car_hp_acc_arr

#Creating a 2D array consisting of car names, horsepower and acceleration
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
acceleration = [18, 15, 18, 16, 17]
car_hp_acc_arr = np.array([car_names, horsepower, acceleration])
#Accessing name and horsepower of last two cars
car_hp_acc_arr[0:2, 3:5]

#Creating a 2D array consisting of car names, horsepower and acceleration
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
acceleration = [18, 15, 18, 16, 17]
car_hp_acc_arr = np.array([car_names, horsepower, acceleration])
#Accessing name, horsepower and acceleration of first three cars
car_hp_acc_arr[0:3, 0:3]

*Mean and Median

Problem Statement:
The engineers at XYZ Custom Cars want to know about the mean and median of
horsepower.

Solution:
The mean and median can be calculated with the help of following code:

#creating a list of 5 horsepower values

horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#mean horsepower
print("Mean horsepower = ",np.mean(horsepower_arr))

#creating a list of 5 horsepower values

horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#median horsepower
print("Median horsepower = ",np.median(horsepower_arr))

*Min and max

Problem Statement:
The engineers at XYZ Custom Cars want to know about the minimum and maximum
horsepower.

Solution:
The min and max can be calculated with the help of following code:
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
print("Minimum horsepower: ", np.min(horsepower_arr))
print("Maximum horsepower: ", np.max(horsepower_arr))

Finding the index of minimum and maximum values:

'argmin()' and 'argmax()' return the index of minimum and maximum values in an
array respectively.
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
print("Index of Minimum horsepower: ", np.argmin(horsepower_arr))
print("Index of Maximum horsepower: ", np.argmax(horsepower_arr))

*Querying/searching in an array

Problem Statement:
The engineers at XYZ Custom Cars want to know which horsepower values are greater than or equal to 150.

Solution:
The 'where' function can be used for this requirement. Given a condition, the 'where' function returns the indices of the array at which the condition is satisfied. Using these indices, the respective values can be obtained from the array.

#creating a list of 5 horsepower values

horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
x = np.where(horsepower_arr >= 150)
print(x) # gives the indices
# With the indices , we can find those values
horsepower_arr[x]

*Filter data

Problem Statement:
The Engineers at XYZ Custom Cars want to create a separate array consisting of
filtered values of horsepower greater than 135.

Solution:
Getting some elements out of an existing array based on certain conditions and
creating a new array out of them is called filtering.

The following code can be used to accomplish this:

#creating a list of 5 horsepower values

horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#creating filter array
filter_arr = horsepower_arr > 135
newarr = horsepower_arr[filter_arr]
print(filter_arr)
print(newarr)

*Sorting an array

Problem Statement:
The engineers at XYZ Custom Cars want the horsepower in sorted order.

Solution:
The Numpy array can be sorted either by passing it to the np.sort(array) function or by calling the array.sort() method.

Both sort the array, so what is the difference between them?

#creating a list of 5 horsepower values

horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#using sort(array)
print('original array: ', horsepower_arr)
print('Sorted array: ', np.sort(horsepower_arr))
print('original array after sorting: ', horsepower_arr)

#creating a list of 5 horsepower values

horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#using array.sort()
print('original array: ', horsepower_arr)
horsepower_arr.sort()
print('original array after sorting: ', horsepower_arr)

The difference is that array.sort() sorts the original array in place, whereas np.sort(array) returns a sorted copy and leaves the original array unchanged.

*Vectorized operations

Mathematical operations can be performed on Numpy arrays. Numpy makes use of optimized, pre-compiled code to perform mathematical operations on each array element. This eliminates the need for explicit loops, thereby enhancing performance. This process is called vectorization.

Numpy provides various mathematical functions such as sum(), add(), subtract(), log(), sin(), etc., which use vectorization.

Consider an example of marks scored by a student:

Subject        Marks
English        78
Mathematics    92
Physics        36
Chemistry      64
Biology        89

Problem Statement:
Calculate the sum of all the marks.

Solution:
The sum() function can be used, which internally uses vectorization.
student_marks_arr = np.array([78, 92, 36, 64, 89])
print(np.sum(student_marks_arr))

Problem Statement:
Award extra marks in subjects as follows:

English: +2

Mathematics: +2

Physics: +5

Chemistry: +10

Biology: +2

Solution:
Below is the solution to the problem:

additional_marks = [2, 2, 5, 10, 2]

student_marks_arr += additional_marks
student_marks_arr
Also, the same operation can be performed as shown below:

student_marks_arr = np.array([78, 92, 36, 64, 89])

student_marks_arr = np.add(student_marks_arr, additional_marks)
student_marks_arr
Both the above methods use vectorization internally, eliminating the need for loops.

Other arithmetic operations can also be performed in a similar manner.

In addition to arithmetic operations, several other mathematical operations like exponents, logarithms and trigonometric functions are also available in Numpy. This makes Numpy a very useful tool for scientific computing.
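
As a small illustration of these other operations, the sketch below applies an exponent, a logarithm and a trigonometric function to whole arrays at once. The input values are made up for the example.

```python
import numpy as np

values = np.array([1.0, 2.0, 4.0, 8.0])

# Exponent: square every element
squares = values ** 2

# Logarithm: base-2 logarithm of every element
logs = np.log2(values)

# Trigonometric function: sine of angles given in radians
angles = np.array([0.0, np.pi / 2, np.pi])
sines = np.sin(angles)

print(squares)
print(logs)
print(sines)
```

No loop is written in any of the three cases; each function operates on every element of the array.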

*Broadcasting

"Broadcasting" refers to how Numpy handles arrays with different shapes during arithmetic operations. The smaller array is stretched (conceptually copied) across the larger array so that the two have compatible shapes.

For example, considering the following arithmetic operations across 2 arrays:

import numpy as np
# Array 1
array1=np.array([5, 10, 15])
# Array 2
array2=np.array([5])
array3= array1 * array2
array3

In this example, the array2 is being stretched or copied to match array1 during the
arithmetic operation resulting in new array array3 with the same shape as array1.

The following diagram explains broadcasting:

In the first operation, the shape of first array is 1x3 and the shape of second
array is 1x1. Hence, according to broadcasting
rules, the second array gets stretched to match the shape of first array and the
shape of the resulting array is 1x3.
In the second operation, the shape of first array is 3x3 and the shape of second
array is 1x3. Hence, according to broadcasting
rules, the second array gets stretched to match the shape of first array and the
shape of the resulting array is 3x3.
In the third operation, the shape of first array is 3x1 and the shape of second
array is 1x3. Hence, according to broadcasting
rules, both first and second arrays get stretched and the shape of the resulting
array is 3x3.
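
The three operations described above can be reproduced with a short sketch. The array contents here are assumptions chosen for illustration; only the shapes follow the description.

```python
import numpy as np

row = np.array([0, 1, 2])        # 3 elements, treated as 1x3

# First operation: 1x3 combined with a single value -> 1x3
result1 = row + np.array([5])

# Second operation: 3x3 combined with 1x3 -> 3x3
grid = np.ones((3, 3))
result2 = grid + row

# Third operation: 3x1 combined with 1x3 -> both stretched to 3x3
column = row.reshape(3, 1)
result3 = column + row

print(result1)
print(result2)
print(result3)
```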

*Broadcasting - demo

Consider the following table of marks scored by four students in two different subjects:

Students    Chemistry    Physics
Subodh      67           45
Ram         90           92
Abdul       66           72
John        32           40

The teacher of these students wants to represent their marks in an array. To do so,
the marks can be stored using a 4x2 array as follows:

#Marks of 4 students in 2 subjects

students_marks = np.array([[67, 45],[90, 92],[66, 72],[32, 40]])
students_marks

Problem Statement:
Now the teacher wants to award extra five marks in Chemistry and extra ten marks in
Physics.

Solution:
#Marks of 4 students in 2 subjects
students_marks = np.array([[67, 45],[90, 92],[66, 72],[32, 40]])
#Broadcasting
students_marks += [5,10]
students_marks

The students' marks array is a 2D array of shape 4x2. The marks to be added form a 1D array with 2 elements, which is treated as shape 1x2. According to the broadcasting rules, the marks to be added get stretched to match the shape of the students' marks array, and the shape of the resulting array is 4x2.

*Image as a Numpy matrix

Images are stored as arrays of hundreds, thousands or even millions of picture elements called pixels. Therefore, images can also be treated as Numpy arrays, as they can be represented as matrices of pixels.

Certain basic operations and manipulations can be carried out on images using Numpy and the scikit-image package. Scikit-image is an image processing package.

The package is imported as skimage.

Importing an image:
#Importing path and skimage i/o library
import os.path
from skimage.io import imread
from skimage import data_dir
#reading the astronaut image
img = imread(os.path.join(data_dir, 'astronaut.png'))

#Using matplotlib.pyplot to visualize the image

import matplotlib.pyplot as plt
plt.imshow(img)

To view the image as a matrix, print it:

print(img)
Properties of image
Let us understand the type, dimensions and shape of the image.

print('Type of image: ', type(img))

print('Dimensions of image: ', img.ndim)
print('Shape of image:', img.shape)

*Indexing and selection

So far, you have become familiar with how to retrieve the basic attributes of the image. Let us proceed to some examples of indexing and selection on images.

Cutting the rocket out of the image

#Importing path and skimage i/o library
import os.path
from skimage.io import imread
from skimage import data_dir
#reading the astronaut image
img = imread(os.path.join(data_dir, 'astronaut.png'))
#Slicing out the rocket
img_slice = img.copy()
img_slice = img_slice[0:300,360:480]
plt.figure()
plt.imshow(img_slice)

In this case, the image has been sliced corresponding to the rocket from the
original image.

Assigning the values corresponding to the sliced image as 0:

img[0:300,360:480,:] = 0
plt.imshow(img)

#Setting to 0 every pixel of the slice whose first-channel (red) value lies between 100 and 150
img_slice[np.greater_equal(img_slice[:,:,0],100) & np.less_equal(img_slice[:,:,0],150)] = 0
plt.figure()
plt.imshow(img_slice)
The place where the sliced rocket image was present initially, is now filled with
black color because 0 is assigned to the
values corresponding to the sliced image.

Replacing the ‘rocket’ back to its original place:

img[0:300,360:480,:] = img_slice
plt.imshow(img)
For the above picture, the black image in the previous step is replaced with the
sliced ‘rocket’.
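
The same slice, blank out and restore steps can also be tried without an image file. The sketch below uses a small synthetic array in place of the astronaut image; the 4x6 size, the gray value 128 and the region that is cut out are all assumptions made for illustration.

```python
import numpy as np

# A hypothetical 4x6 RGB "image" in which every pixel is mid-gray (128)
img = np.full((4, 6, 3), 128, dtype=np.uint8)

# Cut out a region; copy() makes the patch independent of img
patch = img[0:2, 1:3].copy()

# Assign 0 to the region in the original image (it turns black)
img[0:2, 1:3, :] = 0
blacked_sum = int(img[0:2, 1:3].sum())

# Put the saved patch back in its original place
img[0:2, 1:3, :] = patch

print(patch.shape)
print(blacked_sum)
```

Without copy(), the slice would be a view of img, and setting the region to 0 would also blank the patch.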

*Summary

So far, you have learnt the following key features of Numpy:

• Numpy offers multi-dimensional arrays.
• It provides array operations that are better than Python list operations in terms of speed, efficiency and ease of writing code.
• Numpy provides fast and convenient operations in the form of vectorization and broadcasting.
• Numpy offers additional capabilities to perform linear algebra and scientific computing. This is out of scope of this module.

For analysis and manipulation of tabular data, there is another Python library called Pandas, which is built on top of Numpy. Let us proceed to learn more about it.

*Additional methods to create numpy arrays - Arange and Linspace

Arange
This method returns evenly spaced values between the given intervals excluding the
end limit. The values are generated based on
the step value and by default, the step value is 1.

#start and end limit

np.arange(0,10000)

#step value = 2
np.arange(0,10,2)

Linspace
This method returns the given number of evenly spaced values over the given interval. By default, both end points are included and the number of values is 50.

#Generating values between 0 and 10

arr = np.linspace(0,10)
print(arr)
print('Length of arr: ',len(arr))

#Number of values = 3
print(np.linspace(0,10,3))

*Zeroes and Ones

Zeros
Returns an array of given shape filled with zeros.

#1D
np.zeros(5)

#2D
np.zeros([2,3])

Ones
Returns an array of given shape filled with ones.

#1D
np.ones(3)

#2D
np.ones([2,1])

*Full and Eye

Full:
Returns an array of given shape, filled with given value, irrespective of datatype.

#number=5, value=8
np.full(5,8)

#shape=[3,3], value=numpy
np.full([3,3],'numpy')

Eye
Returns an identity matrix for the given shape.

#3x3 identity matrix

np.eye(3)

*Random

Random
NumPy has numerous ways to create random number arrays. Random numbers can be
created for the required length, from a uniform distribution by just passing the
value of required length to the random.rand function.

#generating 5 random numbers from a uniform distribution

np.random.rand(5)

Note: Output might not be same as it is randomly generated.

Similarly, to generate random numbers from a Normal distribution, use the random.randn function.
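
A brief sketch of random.randn follows. The seed call is an addition for illustration: fixing the seed makes the same numbers appear on every run, which is useful when results need to be reproduced.

```python
import numpy as np

# Fix the seed so that repeated runs give the same numbers (optional)
np.random.seed(0)

# 5 random numbers drawn from the standard Normal distribution
normal_values = np.random.randn(5)
print(normal_values)

# A 2x3 array of values from the same distribution
normal_matrix = np.random.randn(2, 3)
print(normal_matrix.shape)
```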

Random numbers of type 'integer' can also be generated using the random.randint function. Below is an example of creating five random integers from 1 (inclusive) to 10 (exclusive).

#random integer values low=1, high=10, number of values=5

np.random.randint(1,10, size=5)

Similarly, two-dimensional arrays of random numbers can also be created by passing the shape instead of the number of values.

#random integer values high=100, shape = (3,5)

x = np.random.randint(100, size=(3, 5))
print(x)
print(type(x))

To generate a random number from a predefined set of values present in an array, the choice() method can be used.

The choice() method takes an array as a parameter and randomly returns values from it based on the size.

#returns a single random value from the array

x = np.random.choice([9, 3, 7, 5])
print(x)

#returns 3*5 random values from the array

x = np.random.choice([9, 3, 7, 5], size=(3, 5)) # sampling to create an nd-array
print(x)
print(type(x))

*Introduction to pandas

Pandas is an open-source library for real-world data analysis in Python. It is built on top of Numpy. Using Pandas, data can be cleaned, transformed, manipulated, and analyzed. It is suited to different kinds of data, including tabular data as in a SQL table or an Excel spreadsheet, time series data, and observational or statistical datasets.

The steps involved to perform data analysis using Pandas are as follows:

*Steps in data analysis

Reading the data

The first step is to read the data. There are multiple formats in which data can be
obtained such as '.csv', '.json', '.xlsx' etc.

Below are the examples:

Example of an excel file:

Example of a json (javascript object notation) file:

Example of a csv (comma separated values) file:

*Steps in data analysis

Exploring the data

The next step is to explore the data. Exploring data helps to:

• know the shape (number of rows and columns) of the data
• understand the nature of the data by obtaining subsets of the data
• identify missing values and treat them accordingly
• get insights about the data using descriptive statistics

Performing operations on the data

Some of the operations supported by Pandas for data manipulation are as follows:

• Grouping operations
• Sorting operations
• Masking operations
• Merging operations
• Concatenating operations

Visualizing data

The next step is to visualize the data to get a clear picture of the various relationships among the data. The following plots can help visualize the data:

• Scatter plot
• Box plot
• Bar plot
• Histogram, and many more

Generating insights

All the above steps help in generating insights about the data.
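
As a rough preview of the reading, exploring and operating steps, the sketch below runs them on a tiny in-memory CSV. The five rows reuse the car names, mpg and horsepower values from the Numpy examples; the column names and the use of io.StringIO in place of a real file are assumptions made for illustration.

```python
import io
import pandas as pd

# A tiny CSV held in memory instead of a file on disk
csv_text = """car_name,mpg,horsepower
chevrolet chevelle malibu,18,130
buick skylark 320,15,165
plymouth satellite,18,150
amc rebel sst,16,150
ford torino,17,140
"""

# Reading the data
cars = pd.read_csv(io.StringIO(csv_text))

# Exploring the data
print(cars.shape)                     # number of rows and columns
print(cars.head(2))                   # a subset of the data
print(cars.isnull().sum())            # missing values per column
print(cars['horsepower'].describe())  # descriptive statistics

# Performing operations: sorting and grouping
by_hp = cars.sort_values('horsepower', ascending=False)
mean_mpg = cars.groupby('horsepower')['mpg'].mean()
print(by_hp)
print(mean_mpg)
```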

*Why Pandas?

Pandas is one of the most popular data wrangling and analysis tools because it:

• can load large volumes of data easily
• provides streamlined forms of data representation
• can handle heterogeneous data
• has an extensive set of data manipulation features
• makes data flexible and customizable

*Getting started with Pandas

To get started with Pandas, both Numpy and Pandas need to be imported as shown below:

#Importing libraries
#numpy is the Python library for numerical and scientific computing; pandas is built on top of numpy
import numpy as np
#importing pandas
import pandas as pd

In a nutshell, Pandas objects are advanced versions of NumPy structured arrays in which the rows and columns are identified with labels instead of simple integer indices.

The basic data structures of Pandas are Series and DataFrame.
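
To make the difference between integer positions and labels concrete, here is a small sketch. The row and column labels, and the second column of values, are hypothetical.

```python
import numpy as np
import pandas as pd

values = np.array([[78, 85], [92, 88]])

# A NumPy array is addressed by integer position only.
print(values[1, 0])

# A DataFrame wraps the same values with row and column labels.
# The labels used here are hypothetical.
df = pd.DataFrame(values, index=["S1", "S2"], columns=["maths", "science"])
print(df.loc["S2", "maths"])

# Each column of a DataFrame is a Series that shares the row labels.
maths = df["maths"]
print(type(maths))
```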

*Pandas Series object

A Series is a one-dimensional labelled array. It supports different data types such as
integer, float, string etc. Let us understand more about Series with the following
example.
Consider the scenario where marks of students are given as shown in the following
table:

Student ID

Marks

The pandas series object can be used to represent this data in a meaningful manner.
Series is created using the following syntax:

Syntax:

pd.Series(data, index, dtype)

data – It can be a list, a list of lists or even a dictionary.

index – The index can be explicitly defined for the different values, if required.

dtype – This represents the data type used in the series (optional
parameter).
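
As a brief sketch of how the three parameters fit together, the snippet below builds a Series from a list and from a dictionary. The labels `S1` to `S3` are hypothetical student IDs.

```python
import pandas as pd

# data as a list, with an explicit index and dtype (labels are hypothetical).
from_list = pd.Series(data=[78, 92, 36], index=["S1", "S2", "S3"], dtype="float64")

# data as a dictionary: the keys become the index.
from_dict = pd.Series(data={"S1": 78, "S2": 92, "S3": 36})

print(from_list)
print(from_dict)
```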

series = pd.Series(data = [78, 92, 36, 64, 89])

series

As shown in the above output, the series object provides the values along with
their index attributes.

Series.values provides the values.

series.values

Series.index provides the index.

series.index
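
Put together as a runnable sketch, the two attributes can be inspected like this. When no index is passed, pandas assigns a default integer index starting at 0.

```python
import pandas as pd

series = pd.Series(data=[78, 92, 36, 64, 89])

print(series.values)  # NumPy array holding the marks
print(series.index)   # default index: RangeIndex(start=0, stop=5, step=1)
```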

Accessing data in series:

Data can be accessed by the associated index using [ ].

series[1]

Slicing a series:
series[1:3]
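
A short runnable sketch of both access patterns on the same marks data follows. Note that a slice excludes its stop position, so `series[1:3]` returns two values.

```python
import pandas as pd

series = pd.Series(data=[78, 92, 36, 64, 89])

second = series[1]    # value stored under index label 1
middle = series[1:3]  # positions 1 and 2; the stop position 3 is excluded

print(second)           # 92
print(middle.tolist())  # [92, 36]
```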

*Custom index in series
