Introduction To Python (Part III)

The document discusses important Python libraries including NumPy, Pandas, Matplotlib, and Scikit-learn. NumPy is used for numerical computing and contains functionality for multidimensional arrays. Pandas is used for data manipulation and analysis. Matplotlib is primarily used for scientific plotting. Common tasks covered include loading and manipulating data, creating arrays, data visualization, and handling outliers and missing values.

Uploaded by

Subhradeep Pal

INTRODUCTION TO PYTHON

(PART III)

Presenter: Prof. Amit Kumar Das

Assistant Professor,
Dept. of Computer Science and Engg.,
Institute of Engineering & Management.
IMPORTANT LIBRARIES IN PYTHON
• scikit-learn
• NumPy – Fundamental package for scientific computing
• SciPy – Package providing mathematical functions and statistical distributions
• matplotlib – Primary library supporting scientific plotting, e.g. line diagrams, histograms, scatter plots
• pandas – Primary library providing data manipulation functionalities
BASIC PYTHON LIBRARIES - NUMPY
The NumPy package provides multidimensional arrays, high-level mathematical functions (e.g. linear algebra and Fourier transform operations), random number generators, etc.
In scikit-learn, the NumPy array is the primary data structure used to input data. Any data used needs to be converted to a NumPy array.
numpy.array(object, dtype, copy, order, subok, ndmin)
dtype is the desired data type for the array. If it is not given, the type is determined as the minimum type required to hold the objects in the sequence.

• empty - Return a new uninitialized array of given shape
• full - Return a new array of given shape filled with a given value
• ones - Return a new array of given shape filled with ones
• zeros - Return a new array of given shape filled with zeros
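A minimal sketch of the dtype rule and of full, which the examples that follow do not exercise. The variable names are illustrative and not taken from the slides.

```python
import numpy as np

# Without dtype, NumPy infers a type that can hold every element.
ints = np.array([1, 2, 3])                  # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])               # one float promotes the array to float64
forced = np.array([1, 2, 3], dtype=float)   # an explicit dtype overrides inference

# full builds an array of the given shape filled with a single value.
sevens = np.full((2, 3), 7)

print(ints.dtype.kind, mixed.dtype, forced.dtype)
print(sevens)
```

Note that empty only allocates memory, so its initial contents are arbitrary until values are assigned.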
BASIC PYTHON LIBRARIES - NUMPY
# Defining an array variable with data ...
import numpy as np

arr1 = np.empty((2,3))

arr2 = np.array([[10,2,3], [23,45,67]])

print(arr2)
[[10 2 3]
[23 45 67]]

# Create an array of 1s ...

arr3 = np.ones((2,3))
[[ 1., 1., 1.],
[ 1., 1., 1.]]

# Create an array of 0s ...

arr4 = np.zeros((2,3), dtype=int)
[[0, 0, 0],
[0, 0, 0]]

# Create an array with random numbers ...

np.random.random((2,2))
[[ 0.47448072, 0.49876875],
[ 0.29531478, 0.48425055]]
BASIC PYTHON LIBRARIES – NUMPY (CONTD.)
# Defining 1-D array variable with data ...
var2 = np.empty(4)
var2[0] = 5.67
var2[1] = 2
var2[2] = 56
var2[3] = 304
print(var2)
[ 5.67 2. 56. 304. ]
print(var2.shape) # Returns the shape (size along each dimension) of the array ...
(4,)
print(var2.size) # Returns the number of elements in the array ...
4
# Defining 2-D array variable with data ...
var3 = np.empty((2,3))
var3[0][0] = 5.67
var3[0][1] = 2
var3[0][2] = 56
var3[1][0] = .09
var3[1][1] = 132
var3[1][2] = 1056
print(var3)
[[ 5.67000000e+00 2.00000000e+00 5.60000000e+01]
[ 9.00000000e-02 1.32000000e+02 1.05600000e+03]]
[Note: Same result will be obtained with dtype=float]
print(var3.shape)
(2, 3)
BASIC PYTHON LIBRARIES – NUMPY (CONTD.)
# Same declaration and assignments with an integer dtype; fractional parts are truncated ...
var3 = np.empty((2,3), dtype=int)
[[ 5, 2, 56],
[ 0, 132, 1056]]
print(var3[1]) # Returns a row of an array ...
[ 0 132 1056]
print(var3[[0, 1]]) # Returns multiple rows of an array ...
[[ 5 2 56]
[ 0 132 1056]]
print(var3[:, 2]) # Returns a column of an array ...
[ 56 1056]
print(var3[:, [1, 2]]) # Returns multiple columns of an array ...
[[ 2 56]
[ 132 1056]]
print(var3[1][2]) # Returns a cell value of an array ...
1056
print(var3[1, 2]) # Returns a cell value of an array ...
1056
print(np.transpose(var3)) # Returns transpose of an array ...
[[ 5 0]
[ 2 132]
[ 56 1056]]
print(var3.reshape(3,2)) # Returns a re-shaped array ...
[[ 5 2]
[ 56 0]
[ 132 1056]]
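The integer version of var3 above stores 5 and 0 where 5.67 and 0.09 were assigned. A small sketch of that behaviour, using a made-up array: values assigned into an integer array are truncated toward zero, not rounded, and no warning is raised.

```python
import numpy as np

ints = np.zeros(3, dtype=int)
ints[0] = 5.67    # stored as 5
ints[1] = 0.09    # stored as 0
ints[2] = -2.9    # stored as -2 (truncation toward zero, not floor)
print(ints)

# If rounding is wanted instead, round first and then convert.
rounded = np.round(np.array([5.67, 0.09, -2.9])).astype(int)
print(rounded)
```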
BASIC PYTHON LIBRARIES – NUMPY (CONTD.)
Create and concatenate arrays:
import numpy as np

arr1 = np.empty((2,3), dtype=int)

arr1[0][0] = 5.67
arr1[0][1] = 2
arr1[0][2] = 56
arr1[1][0] = .09
arr1[1][1] = 132
arr1[1][2] = 1056

[[ 5, 2, 56],
[ 0, 132, 1056]]

arr2 = np.empty((1,3), dtype=int)

arr2[0][0] = 37
arr2[0][1] = 2.193
arr2[0][2] = 5609

[[ 37, 2, 5609]]
BASIC PYTHON LIBRARIES – NUMPY (CONTD.)
arr_concat = np.concatenate((arr1, arr2), axis = 0)
print(arr_concat)

[[ 5 2 56]
[ 0 132 1056]
[ 37 2 5609]]
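The call above stacks rows (axis = 0), so the arrays must have the same number of columns. A short sketch of the column-wise case (axis = 1), using a made-up second array that is not part of the slides:

```python
import numpy as np

left = np.array([[5, 2, 56], [0, 132, 1056]])
right = np.array([[7], [9]])    # hypothetical extra column; row counts must match

wide = np.concatenate((left, right), axis=1)
print(wide)
print(wide.shape)
```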

var2.min() # Returns minimum value stored in an array ...

2.0

var2.max() # Returns maximum value stored in an array ...

304.0

var2.cumsum() # Returns cumulative sum of the values stored in an array ...

array([ 5.67, 7.67, 63.67, 367.67])

var2.mean() # Returns mean or average value stored in an array ...

91.917500000000004

var2.std() # Returns standard deviation of values stored in an array ...

124.2908299865682
BASIC PYTHON LIBRARIES – PANDAS
pandas is a Python package providing fast and flexible functionalities designed to work with
“relational” or “labeled” data.
import pandas as pd # “pd” is just an alias for pandas

data = pd.read_csv("auto-mpg.csv") # Loads data from a .csv file

type(data) # To find the type of the data set object loaded

pandas.core.frame.DataFrame

data.shape # To find the dimensions i.e. number of rows and columns of the data set loaded

(398, 9)

nrow_count = data.shape[0] # To find just the number of rows

print(nrow_count)
398

ncol_count = data.shape[1] # To find just the number of columns

print(ncol_count)
9
BASIC PYTHON LIBRARIES – PANDAS (CONTD.)
data.columns # To get the columns of a dataframe

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',

'acceleration', 'model year', 'origin', 'car name'],
dtype='object')

# To change the column names of a dataframe e.g. ‘mpg’ in this case …

data.columns = ['miles_per_gallon', 'cylinders', 'displacement',
'horsepower', 'weight', 'acceleration', 'model year', 'origin',
'car name']

data.columns # To get the revised column names of the dataframe ...

Index(['miles_per_gallon', 'cylinders', ...], dtype='object')

data.rename(columns={'displacement': 'disp'}, inplace=True) # To rename a single column in place

BASIC PYTHON LIBRARIES – PANDAS (CONTD.)
data.head() # By default displays top 5 rows

data.head(3) # To display the top 3 rows

data.tail() # By default displays bottom 5 rows

data.tail(3) # To display the bottom 3 rows

data.at[200,'cylinders'] # Will return the cell value at row label 200 and column ‘cylinders’ of the data frame
6
Alternatively, we can use the following code:

data.loc[200,'cylinders']

data_cyl = data.loc[: , "car name"]

data_cyl.head()
0 chevrolet chevelle malibu
1 buick skylark 320
2 plymouth satellite
3 amc rebel sst
4 ford torino
Name: car name, dtype: object
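The accessors used above select by label. A minimal sketch contrasting label-based and position-based selection on a made-up three-row frame (the frame and its index labels are hypothetical stand-ins for the auto-mpg data):

```python
import pandas as pd

cars = pd.DataFrame(
    {
        "cylinders": [8, 4, 6],
        "car name": ["ford torino", "mazda glc", "amc rebel sst"],
    },
    index=[10, 20, 30],   # non-default labels make the difference visible
)

by_label = cars.at[20, "cylinders"]     # row labelled 20
by_position = cars.iat[1, 0]            # second row, first column
same_cell = cars.loc[20, "cylinders"]   # .loc also accepts single labels
names = cars.loc[:, "car name"]         # a whole column as a Series
first_two = cars.iloc[:2]               # first two rows by position

print(by_label, by_position, same_cell)
print(names)
print(first_two)
```

With the default 0-based index, as in the loaded data set, the label 200 refers to the 201st row.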
BASIC PYTHON LIBRARIES – PANDAS (CONTD.)
Find missing values in a data set:
import numpy as np
import pandas as pd

# Creation of a data set with missing values ...

var1 = [np.nan, np.nan, np.nan, 10.1, 12, 123.14, 0.121]
var2 = [40.2, 11.78, 7801, 0.25, 34.2, np.nan, np.nan]
var3 = [1234, np.nan, 34.5, np.nan, 78.25, 14.5, np.nan]
df = pd.DataFrame({'Attr_1': var1, 'Attr_2': var2, 'Attr_3': var3})
print(df)
Attr_1 Attr_2 Attr_3
0 NaN 40.20 1234.00
1 NaN 11.78 NaN
2 NaN 7801.00 34.50
3 10.100 0.25 NaN
4 12.000 34.20 78.25
5 123.140 NaN 14.50
6 0.121 NaN NaN
# Find missing values in a data set
miss_val = df[df['Attr_1'].isnull()]
print(miss_val)
Attr_1 Attr_2 Attr_3
0 NaN 40.20 1234.0
1 NaN 11.78 NaN
2 NaN 7801.00 34.5
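Beyond filtering on one attribute, it is often useful to count missing values per column and to find every row with at least one gap. A short sketch on the same data set as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Attr_1": [np.nan, np.nan, np.nan, 10.1, 12, 123.14, 0.121],
    "Attr_2": [40.2, 11.78, 7801, 0.25, 34.2, np.nan, np.nan],
    "Attr_3": [1234, np.nan, 34.5, np.nan, 78.25, 14.5, np.nan],
})

missing_per_column = df.isnull().sum()                # number of NaN values in each column
rows_with_any_missing = df[df.isnull().any(axis=1)]   # rows with at least one NaN
complete_rows = df.dropna()                           # rows with no NaN at all

print(missing_per_column)
print(len(rows_with_any_missing), len(complete_rows))
```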
BASIC PYTHON LIBRARIES – PANDAS (CONTD.)
>>> np.mean(data["mpg"])   # Mean (outputs shown rounded to six decimal places)
23.514573
>>> np.median(data["mpg"])   # Median
23.0
>>> np.var(data["mpg"])   # Variance
60.936119
>>> np.std(data["mpg"])   # Standard deviation
7.806159
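One point worth checking when mixing the two libraries: NumPy's var and std divide by n by default (ddof=0), while the pandas methods of the same name divide by n - 1 (ddof=1). A small sketch on made-up values, not the auto-mpg column:

```python
import numpy as np
import pandas as pd

mpg = pd.Series([18.0, 15.0, 36.0, 26.0, 30.0])   # hypothetical values

population_std = np.std(mpg.to_numpy())   # NumPy default: ddof=0
sample_std = mpg.std()                    # pandas default: ddof=1
matched = mpg.std(ddof=0)                 # pandas with the NumPy convention

print(population_std, sample_std, matched)
```

Passing ddof explicitly in both libraries avoids small, hard-to-trace differences between results.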
BASIC PYTHON LIBRARIES – MATPLOTLIB
Constructing a box plot for the Iris data set
• Popular data set in machine learning
• Consists of 3 different types of iris flower - Setosa, Versicolour, and Virginica
• 4 columns - Sepal Length, Sepal Width, Petal Length and Petal Width
• First import the datasets module from scikit-learn

>>> from sklearn import datasets

# import some data to play with
>>> iris = datasets.load_iris()
>>> import matplotlib.pyplot as plt
>>> X = iris.data[:, :4]
>>> plt.boxplot(X)
>>> plt.show()
BASIC PYTHON LIBRARIES – MATPLOTLIB (CONTD.)
Box plot for Iris data set (all features):
BASIC PYTHON LIBRARIES – MATPLOTLIB (CONTD.)
>>> plt.boxplot(X[:, 1])
>>> plt.show()

Box plot for Iris data set (single feature)

BASIC PYTHON LIBRARIES – MATPLOTLIB (CONTD.)
Histogram
>>> import matplotlib.pyplot as plt
>>> X = iris.data[:, :1]   # Sepal length (first feature)
>>> plt.hist(X)
>>> plt.xlabel('Sepal length')
>>> plt.show()
BASIC PYTHON LIBRARIES – MATPLOTLIB (CONTD.)

Scatterplot of Iris data set : Sepal length vs. Petal length

>>> X = iris.data[:, :4] # We take the first 4 features

>>> y = iris.target
>>> plt.scatter(X[:, 2], X[:, 0], c=y, cmap=plt.cm.Set1,
edgecolor='k')
>>> plt.xlabel('Petal length')
>>> plt.ylabel('Sepal length')
>>> plt.show()
BASIC PYTHON LIBRARIES – MATPLOTLIB (CONTD.)
Scatterplot of Iris data set : Sepal length vs. Petal length
DATA PRE-PROCESSING

Mainly deals with two things –

• Handling outliers
• Remediating missing values

Primary measures for remediating outliers and missing values are:

• Removing specific rows containing outliers / missing values
• Imputing the value (i.e. outlier / missing value) with a standard statistical measure, e.g. mean, median or mode for that attribute
• Estimating the value (i.e. outlier / missing value) based on the value of the attribute in similar records and replacing it with the estimated value
• Capping the values within 1.5 × IQR limits
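Capping is the one measure in this list that the later examples do not demonstrate. A minimal sketch, assuming the usual limits of Q1 - 1.5 × IQR and Q3 + 1.5 × IQR; the helper name cap_outliers and the sample values are illustrative only:

```python
import pandas as pd

def cap_outliers(series):
    """Clip values to the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    quart1 = series.quantile(0.25)
    quart3 = series.quantile(0.75)
    iqr = quart3 - quart1   # inter-quartile range
    return series.clip(lower=quart1 - 1.5 * iqr, upper=quart3 + 1.5 * iqr)

values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])   # 95.0 is the outlier
capped = cap_outliers(values)
print(capped)
```

Unlike row removal, capping keeps every record and only pulls extreme values back to the limit.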
DATA PRE-PROCESSING (CONTD.)
>>> df = pd.read_csv("auto-mpg.csv")

Finding missing values in a data set:

>>> miss_val = df[df['horsepower'].isnull()]

>>> print(miss_val)
DATA PRE-PROCESSING (CONTD.)
Finding Outliers (Option 1) :
>>> import matplotlib.pyplot as plt
>>> X = data["mpg"]
>>> plt.boxplot(X)
>>> plt.show()

>>> outliers = plt.boxplot(X)["fliers"][0].get_data()[1]

>>> outliers
array([ 46.6])
DATA PRE-PROCESSING (CONTD.)
Finding Outliers (Option 2) :
def find_outlier(ds, col):
    quart1 = ds[col].quantile(0.25)
    quart3 = ds[col].quantile(0.75)
    IQR = quart3 - quart1 # Inter-quartile range
    low_val = quart1 - 1.5*IQR
    high_val = quart3 + 1.5*IQR
    ds = ds.loc[(ds[col] < low_val) | (ds[col] > high_val)]
    return ds
>>> outliers = find_outlier(data, "mpg")
>>> outliers
mpg cylinders displacement horsepower weight acceleration \
322 46.6 4 86.0 65.0 2110 17.9

model year origin car name

322 80 3 mazda glc
DATA PRE-PROCESSING (CONTD.)
Removing records with missing values / outliers:
We can drop the rows with missing values using the code below (axis=1 drops columns instead). dropna returns a new data frame, so the result is assigned back.

>>> data = data.dropna(axis=0, how='any')

In a similar way, outlier values can be removed.

def remove_outlier(ds, col):
    quart1 = ds[col].quantile(0.25)
    quart3 = ds[col].quantile(0.75)
    IQR = quart3 - quart1 # Inter-quartile range
    low_val = quart1 - 1.5*IQR
    high_val = quart3 + 1.5*IQR
    df_out = ds.loc[(ds[col] > low_val) & (ds[col] < high_val)]
    return df_out

>>> data = remove_outlier(data, "mpg")

DATA PRE-PROCESSING (CONTD.)
Imputing standard values:
First, the affected rows are identified and the missing value of the attribute is replaced with the mean value of the attribute.

>>> hp_mean = np.mean(data['horsepower'])

>>> imputedrows = data[data['horsepower'].isnull()]
>>> imputedrows = imputedrows.replace(np.nan, hp_mean)

Then the rows of the data set that have no missing horsepower value are kept apart.

>>> missval_removed_rows = data.dropna(subset=['horsepower'])

Then join back the imputed rows and the remaining part of the data set.

>>> data_mod = pd.concat([missval_removed_rows, imputedrows], ignore_index=True)

In a similar way, outlier values can be imputed.
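A compact sketch of both imputations on made-up values (not the auto-mpg data): fillna replaces missing values in one step, and mask replaces the entries flagged as outliers. Using the median of the non-outlier values as the replacement is an assumption here; the mean or mode could be used in the same way.

```python
import numpy as np
import pandas as pd

# Missing values: replace NaN with the mean of the observed values.
hp = pd.Series([130.0, 165.0, np.nan, 150.0, np.nan, 95.0], name="horsepower")
hp_filled = hp.fillna(hp.mean())   # mean() skips NaN by default

# Outliers: flag values outside the 1.5 x IQR limits, then replace them.
mpg = pd.Series([18.0, 15.0, 16.0, 17.0, 15.0, 46.6], name="mpg")
quart1, quart3 = mpg.quantile(0.25), mpg.quantile(0.75)
iqr = quart3 - quart1
is_outlier = (mpg < quart1 - 1.5 * iqr) | (mpg > quart3 + 1.5 * iqr)
mpg_imputed = mpg.mask(is_outlier, mpg[~is_outlier].median())

print(hp_filled)
print(mpg_imputed)
```

Both calls preserve the original row order, so no separate split and re-join of the data set is needed.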

THANK YOU &
STAY TUNED!
