Week 4- Introduction to Python #3


PYTHON LIBRARIES: NUMPY AND PANDAS

American University of Sharjah

Prepared by Dr Tamer Shanableh, CSE


Material mainly based on “Python for Programmers” by Paul Deitel and Harvey Deitel, Pearson;
Illustrated edition, ISBN-10 : 0135224330
Python Libraries

Python has many software libraries that can be imported into your program.

A software library is a collection of pre-written code, so that programmers do not reinvent the wheel.

You have previously used “import math”, where math is the name of the math library in Python.

Popular libraries in Python:

• Python Libraries for Data Science
• Python Libraries for Data Processing and Model Deployment
   • 1) Pandas
   • 2) NumPy
   • 3) SciPy
   • 4) scikit-learn
   • 5) PyCaret
   • 6) TensorFlow
   • 7) OpenCV
• Python Libraries for Data Mining and Data Scraping
   • 8) SQLAlchemy
   • 9) Scrapy
   • 10) BeautifulSoup
• Python Libraries for Data Visualization
   • 11) Matplotlib
   • 12) ggplot
   • 13) Plotly
   • 14) Altair
   • 15) seaborn

Source: https://www.projectpro.io/article/top-5-libraries-for-data-science-in-python/196
Importing libraries

Import the whole library:

import numpy
myarr = numpy.array([1,2,3,4])

OR: Import the whole library with an alias:

import numpy as np
myarr = np.array([1,2,3,4])
Importing a specific object

OR: Import a specific submodule, function or object from a library:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Create a line plot
plt.plot(x, y)
plt.show()  # display the figure (needed when running as a script)
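
The import styles described so far can be compared side by side on the standard-library math module mentioned earlier. This is only a minimal sketch; the variable names root_1, root_2 and root_3 are illustrative and do not come from the slides.

```python
# Style 1: import the whole library and use its full name
import math
root_1 = math.sqrt(16)

# Style 2: import the whole library under an alias
import math as m
root_2 = m.sqrt(16)

# Style 3: import one specific function from the library
from math import sqrt
root_3 = sqrt(16)

# All three calls reach the same function
print(root_1, root_2, root_3)
```

Style 3 keeps the call short, but the library name no longer appears at the call site, so it is usually reserved for a few frequently used names.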
The NumPy Library

• NumPy is a popular open-source library in Python for data science and AI
• A standard way for working with numeric data in Python
• It can be used for creating and manipulating N-dimensional arrays
   • 1D for lists of numbers
   • 3D for images (R,G,B)
   • 4D for videos (a sequence of 3D images)
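
As a rough illustration of these dimensions, the sketch below builds zero-filled arrays and prints their ndim and shape. The sizes (5 numbers, a 4x6 image, 10 frames) are arbitrary assumptions chosen for the example.

```python
import numpy as np

numbers = np.zeros(5)              # 1D: a list of 5 numbers
image = np.zeros((4, 6, 3))        # 3D: height x width x (R, G, B)
video = np.zeros((10, 4, 6, 3))    # 4D: 10 frames, each one a 3D image

for name, arr in [("numbers", numbers), ("image", image), ("video", video)]:
    print(name, "-> ndim:", arr.ndim, "shape:", arr.shape)
```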
Creating NumPy Arrays

• Start by importing the numpy library

import numpy as np

# Create a 1D array
numpy_array = np.array([10, 20, 30])

# Create a 1D array from a list of numbers
data = [10, 20, 30, 40, 50]
numpy_array = np.array(data)
Numpy 2D arrays

Think of 2D arrays as an “Array of Arrays”

import numpy as np

arr_2D = np.array([
    [10, 20, 30, 4],
    [2, 8, 2, 4],
    [30, 12, 67, 44],
    [24, 10, 32, 0]
])
print(arr_2D)
print('Shape: ', arr_2D.shape)  # prints the dimensions of the array
Reshaping NumPy Arrays

• You can use the NumPy reshape function to transform a 1D array into a multidimensional array (filled row by row)
• Example: we can reshape a 12-element 1D array into a 4x3 2D array
• Reshaping a 12-element 1D array into a 4x4 2D array does not work and generates an error, because a 4x4 array needs 16 elements.

import numpy as np
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
print('arr contains: \n', arr)

arr_2D = arr.reshape(4,3)
print('arr_2D contains: \n', arr_2D)
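
The failing 4x4 case can be observed safely with try/except. The sketch below also uses -1, which asks NumPy to work out one dimension from the number of elements; the flag reshape_failed is an illustrative name, not part of the slides.

```python
import numpy as np

arr = np.arange(1, 13)   # the 12 values 1..12

# -1 lets NumPy infer the row count: 12 elements / 3 columns = 4 rows
arr_2D = arr.reshape(-1, 3)
print(arr_2D.shape)

# A 4x4 array needs 16 elements, so NumPy raises a ValueError
try:
    arr.reshape(4, 4)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print('Reshape failed:', err)
```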
Transposing NumPy Arrays

• You can use the NumPy transpose function to swap the rows and columns of a 2D array
• The first row becomes the first column, the second row becomes the second column, and so forth

import numpy as np
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
print('arr contains: \n', arr)
arr_2D = arr.reshape(4,3)
print('arr_2D contains: \n', arr_2D)
#------------------------------------
arr_2D_transposed = np.transpose(arr_2D)
print('arr_2D_transposed contains: \n', arr_2D_transposed)
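
A short check of what transposing does to the shape, sketched with the .T attribute, which gives the same result as np.transpose for a 2D array. The variable name transposed is illustrative.

```python
import numpy as np

arr_2D = np.arange(1, 13).reshape(4, 3)   # 4 rows, 3 columns
transposed = arr_2D.T                     # shorthand for np.transpose(arr_2D)

print(arr_2D.shape, '->', transposed.shape)
print('First row of arr_2D:       ', arr_2D[0])
print('First column of transposed:', transposed[:, 0])
```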
NumPy Sorting (FYI)

#Numpy Example: sort method
import numpy as np

arr_2D = np.array([
    [10, 20, 30, 4],
    [2, 8, 2, 4],
    [30, 12, 67, 44],
    [24, 10, 32, 0]
])
print(arr_2D)

#Use the function np.sort(name of array, axis to sort: None|0|1)
rst = np.sort(arr_2D,axis=None)
print('sort the whole Array: \n', rst)

# Sort row-wise
rst = np.sort(arr_2D,axis=1)
print('Row-wise sorting: \n',rst)

# Sort column-wise
rst = np.sort(arr_2D,axis=0)
print('Column-wise sorting: \n',rst)
NumPy Calculation Functions

• We will use the sum, min, max, mean, std and var functions on NumPy arrays

import numpy as np
grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])
print('The grades are: \n', grades)

# The result is stored in total (not sum) so that Python's built-in sum() stays usable
total = grades.sum(axis=1)  # row-wise
print('Summation row-wise:\n', total)

total = grades.sum(axis=0)  # col-wise
print('Summation col-wise:\n', total)

total = grades.sum(axis=None)  # all
print('Summation of all grades:\n', total)
NumPy Calculation Functions

• We will use the sum, min, max, mean, std and var functions on NumPy arrays

import numpy as np
grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])
print('The grades are: \n', grades)

# The result is stored in lowest (not min) so that Python's built-in min() stays usable
lowest = grades.min(axis=1)  # row-wise
print('min row-wise:\n', lowest)

lowest = grades.min(axis=0)  # col-wise
print('min col-wise:\n', lowest)

lowest = grades.min(axis=None)  # all
print('min of all grades:\n', lowest)
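
The remaining functions (max, mean, std and var) take the same axis argument. The sketch below applies them to the same grades array; overall_std and overall_var are illustrative names.

```python
import numpy as np

grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])

print('max row-wise: ', grades.max(axis=1))
print('mean col-wise:', grades.mean(axis=0))
print('mean of all grades:', grades.mean())

# NumPy's std and var divide by N by default (population statistics)
overall_std = grades.std()
overall_var = grades.var()
print('std of all grades:', overall_std)
print('var of all grades:', overall_var)
```

The variance is the square of the standard deviation, so overall_std ** 2 matches overall_var up to rounding.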
Indexing and Slicing - 1

# You can access individual elements and individual rows in a NumPy array

import numpy as np
grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])
print('The grades are: \n', grades)

#Select one grade using: grades[row index, col index]
print('grades[0,0] = ', grades[0,0])
print('grades[1,2] = ', grades[1,2])

#Select one row of grades using: grades[row index]
print('grades[3] = ', grades[3])
Indexing and Slicing - 2

#You can select multiple rows from a NumPy array

import numpy as np
grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])

print('The grades are: \n', grades)

#Select multiple sequential rows of grades using: grades[row index from : row index to]
print('grades[0:2] = \n', grades[0:2])  #up to but not including row 2
Indexing and Slicing - 3

• You can select a subset of columns in NumPy arrays
• grades[:,0] means select all rows, column 0
• grades[:, 0:2] means select all rows, columns 0,1 (up to but not including 2)

import numpy as np
grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])
print('The grades are: \n', grades)

print('First column | grades[:,0] = \n', grades[:,0])
print('Last 2 columns | grades[:, 1:3] = \n', grades[:,1:3])

Adapted from https://www.w3resource.com/python-exercises/numpy/python-numpy-exercise-104.php
Indexing and Slicing - 4

• NumPy arrays, like Python lists, allow negative indices, which count from the end
• One particularly important case is accessing the last column using the negative column index -1

import numpy as np
grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])
print('The grades are: \n', grades)

print('First column | grades[:,0] = \n', grades[:,0])
print('Last column | grades[:, -1] = \n', grades[:,-1])
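
Negative indices can also be combined with slices. The sketch below shows a few such selections on the same grades array; the variable names are illustrative.

```python
import numpy as np

grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])

last_row = grades[-1]            # same as grades[3]
last_two_cols = grades[:, -2:]   # same as grades[:, 1:3] when there are 3 columns
all_but_last = grades[:, :-1]    # every column except the last one

print('Last row:', last_row)
print('Last two columns:\n', last_two_cols)
print('All but the last column:\n', all_but_last)
```

The pattern grades[:, :-1] and grades[:, -1] is a common way to separate feature columns from a final label column.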

Pandas: Series and DataFrames


Pandas Series and DataFrames | 1

• NumPy arrays are optimized for homogeneous numeric data
• However, in ML applications, we need:
   ◦ Support for heterogeneous types (e.g. numeric and strings)
   ◦ Support for missing data
   ◦ Support for headers and indices (see next slide)

• Pandas is the commonly used library for dealing with such data
• It provides support for:
   ◦ Series: for 1D collections (enhanced 1D array)
   ◦ DataFrames: for 2D collections (enhanced 2D array)
Pandas Series and DataFrames | 2

[Figure: layout of a Series (index, value) and of a DataFrame (index, then one header per column). The first column is called the “index”; the rest of the columns are called “values”.]
Pandas Series | 1

• An enhanced 1D array
• Can be indexed using integers like NumPy or strings

import pandas as pd
grades = pd.Series([87, 100, 94])
print('Grades Series:\n',grades)
print('First grade: ',grades[0])

Output (index and value):

0     87
1    100
2     94
First grade: 87
Pandas Series | 2

• Provides statistical functions like count, mean, min, max and std
• For a full numerical summary you can use the describe function

import pandas as pd
grades = pd.Series([87, 100, 94])
print('Grades Series:\n',grades)
print('Count: ', grades.count())
print('Mean: ' , grades.mean())
print('Min: ' , grades.min())
print('Max: ' , grades.max())
print('Std: ' , grades.std())

# for an overall summary you can use:
print('Description:\n',grades.describe())
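
One detail worth knowing when comparing results with NumPy: by default pandas computes the sample standard deviation (dividing by N-1), while NumPy divides by N. The sketch below shows the difference on the same three grades; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [87, 100, 94]
grades = pd.Series(values)

pandas_std = grades.std()                    # pandas default: ddof=1 (divide by N-1)
numpy_std = np.std(values)                   # NumPy default: ddof=0 (divide by N)
numpy_sample_std = np.std(values, ddof=1)    # ask NumPy for the sample version

print('pandas std:        ', pandas_std)
print('NumPy std (ddof=0):', numpy_std)
print('NumPy std (ddof=1):', numpy_sample_std)
```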
Series with a Custom Index

• You can use custom indices with the index argument

import pandas as pd
grades = pd.Series([87, 100, 94],
                   index=['First', 'Second', 'final'])
print(grades)

Output (index and value):

First      87
Second    100
final      94
Accessing Series Using String Indices

• In the previous example, a Series with custom indices can be accessed via square brackets [ ] containing a custom index value:

import pandas as pd
grades = pd.Series([87, 100, 94], index=['First', 'Second', 'final'])
print('Grade of first = ',grades['First']) # or, by position:
print('Grade of first = ',grades.iloc[0]) # recent pandas versions deprecate grades[0] on a string index

#--You can also access all values and all indices
print('Series values are: ', grades.values)
print('Series indices are: ', grades.index)

Output:
Grade of first = 87
Grade of first = 87
Series values are: [ 87 100 94]
Series indices are: Index(['First', 'Second', 'final'], dtype='object')
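
To keep label-based and position-based access clearly apart, pandas offers the .loc and .iloc indexers. The following is a small sketch on the same Series; the variable names are illustrative.

```python
import pandas as pd

grades = pd.Series([87, 100, 94], index=['First', 'Second', 'final'])

by_label = grades.loc['Second']               # look up by index label
by_position = grades.iloc[1]                  # look up by integer position
first_two = grades.iloc[0:2]                  # positional slice: up to but not including 2
up_to_second = grades.loc['First':'Second']   # label slice: 'Second' is included

print(by_label, by_position)
print(first_two)
print(up_to_second)
```

Note that label slices include the end label, unlike positional slices.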
DataFrames

• Enhanced 2D arrays
• Can have custom indices and headers
• Each column in a DataFrame is a Series
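
That each column is a Series can be checked directly. The sketch below uses a small hand-made DataFrame; the data and the name age_column are illustrative only.

```python
import pandas as pd

# Small illustrative DataFrame
df = pd.DataFrame({'Age': [22, 35, 58], 'Gender': ['male', 'male', 'female']})

age_column = df['Age']
print(type(age_column))     # the column is a pandas Series
print(age_column.mean())    # so Series functions work on it
```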


Creating DataFrames From Files

• Pandas provides a read_csv() function to read data stored as a .csv file into a pandas DataFrame.

• Pandas supports many different file formats including csv and excel:
• myDataFrame = pd.read_csv("myfile.csv")
• Or myDataFrame = pd.read_excel("myfile.xlsx")

• To save data from DataFrames to files use:
• myDataFrame.to_csv("myOutputFile.csv")
• Or myDataFrame.to_excel("myOutputFile.xlsx")

• After reading a file, you can display the first 5 rows using myDataFrame.head() and the last 5 rows using myDataFrame.tail()
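
A write-then-read round trip can be sketched as follows. The data is made up for the example, and a temporary folder is used so that no file is left behind; index=False is an optional argument that stops to_csv from writing the row index as an extra column.

```python
import os
import tempfile

import pandas as pd

# Illustrative data only
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [22, 35, 58]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'myOutputFile.csv')
    df.to_csv(path, index=False)    # write without the row index
    df_read = pd.read_csv(path)     # read the same file back

print(df_read.head())     # first rows (up to 5)
print(df_read.tail(2))    # last 2 rows
```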
Creating DataFrames From Files in Colab (FYI)

[Screenshots: uploading a file through the Colab file browser (“Click to upload a file”, “I uploaded this file”)]

df2.to_csv('testFileToWrite.csv') # this will create an output file with a .csv extension


Creating DataFrames From Internet Files | 1

• We will use the Iris sample data, which contains information on 150 Iris flowers, 50 from each of three Iris species: Setosa, Versicolour, and Virginica.
• Each flower is characterized by five attributes, four of which are numeric measurements:
1. sepal_length in centimeters
2. sepal_width in centimeters
3. petal_length in centimeters
4. petal_width in centimeters

The fifth attribute is the flower's type, which is the last column in the DataFrame: (Setosa, Versicolour, Virginica)

Data is available online at: https://archive.ics.uci.edu/dataset/53/iris


Iris Flowers Dataset
Creating DataFrames From Internet Files | 2

import pandas as pd

#The argument header=None says that this dataset does not contain a header yet, so we will add one next
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)

#You can then add column headers
data.columns=['sepal_length','sepal_width','petal_length','petal_width','class']

#And display the first 5 rows to make sure that the reading is successful
data.head()
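
The same steps can be tried without an internet connection. In the sketch below, a short text string stands in for the downloaded file (an assumption made for the example), and the names argument of read_csv supplies the headers during reading instead of assigning data.columns afterwards.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: three rows in the same comma-separated layout
raw_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# header=None: the text has no header row; names=: the headers to use
sample = pd.read_csv(io.StringIO(raw_text), header=None, names=column_names)
print(sample.head())
```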
Creating DataFrames From Internet Files | 3


The output:

[Screenshot: the first five rows of the DataFrame, as displayed by data.head()]
Accessing DataFrame’s Columns and Rows | 1

#Access one column using a header’s name
print('petal_length columns:\n',data['petal_length'])

#Access one row using the .iloc function
print('\n\nFirst row:')
print(data.iloc[0])

Output:

petal_length columns:
0      1.4
1      1.4
2      1.3
3      1.5
4      1.4
...
145    5.2
146    5.0
147    5.2
148    5.4
149    5.1

First row:
sepal_length            5.1
sepal_width             3.5
petal_length            1.4
petal_width             0.2
class           Iris-setosa
Accessing DataFrame’s Columns and Rows | 2

#Access a sequential slice of rows using the .iloc function
print('\n\nFirst 5 rows:')
print(data.iloc[0:5]) # up to but not including 5

First 5 rows:
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
Accessing DataFrame’s Columns and Rows | 3

#Access a sequential slice of rows and columns using the .iloc function
print('\n\nFirst 5 rows and first 2 columns:')

#print up to but not including row 5, up to but not including col 2
#.iloc[ rows from:to , cols from:to ]
print(data.iloc[0:5 , 0:2 ])

First 5 rows and first 2 columns:
   sepal_length  sepal_width
0           5.1          3.5
1           4.9          3.0
2           4.7          3.2
3           4.6          3.1
4           5.0          3.6
Accessing DataFrame’s Columns and Rows | 4

#Access a sequential slice of rows and a list of columns using the .iloc function
print('\n\nFirst 5 rows; columns 0, 1 and the last column:')

#print up to but not including row 5, and cols 0,1 and the last column
#.iloc[ rows from:to , [cols indices] ]
print(data.iloc[0:5 , [0,1,-1]])

   sepal_length  sepal_width        class
0           5.1          3.5  Iris-setosa
1           4.9          3.0  Iris-setosa
2           4.7          3.2  Iris-setosa
3           4.6          3.1  Iris-setosa
4           5.0          3.6  Iris-setosa
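
Alongside .iloc (integer positions), pandas also has .loc, which selects by row labels and column names. The sketch below compares the two on a hand-typed copy of the first five Iris rows (three columns only, to keep it short); the variable names are illustrative.

```python
import pandas as pd

# First five Iris rows, typed in by hand (three of the five columns)
data = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal_width':  [3.5, 3.0, 3.2, 3.1, 3.6],
    'class':        ['Iris-setosa'] * 5,
})

by_position = data.iloc[0:3, [0, -1]]                  # rows 0..2, first and last column
by_label = data.loc[0:2, ['sepal_length', 'class']]    # row labels 0..2 (end label included)

print(by_position)
print(by_label)
```

Both selections return the same three rows, because .loc slices include the end label while .iloc slices stop before the end position.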
DataFrames Boolean Indexing | 1

• Pandas provides a powerful selection feature called Boolean indexing
• That is, you can use a Boolean expression that returns True/False to filter a DataFrame
• Let us start by extracting the numeric data from our DataFrame:
data_numeric = data.iloc[:, 0:4]
data_numeric.head()

   sepal_length  sepal_width  petal_length  petal_width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
DataFrames Boolean Indexing | 2

# from the previous slide
data_numeric = data.iloc[:, 0:4]
#Filter the dataFrame, locate values >= 5.0
rst = data_numeric[data_numeric >= 5.0]
rst

• Pandas checks every element to determine whether its value is greater than or equal to 5.0
• If True, the element is kept in the new DataFrame (rst in the example above).
• Elements for which the condition is False are represented as NaN (not a number) in the new DataFrame

     sepal_length  sepal_width  petal_length  petal_width
0             5.1          NaN           NaN          NaN
1             NaN          NaN           NaN          NaN
2             NaN          NaN           NaN          NaN
3             NaN          NaN           NaN          NaN
4             5.0          NaN           NaN          NaN
...           ...          ...           ...          ...
145           6.7          NaN           5.2          NaN
146           6.3          NaN           5.0          NaN
147           6.5          NaN           5.2          NaN
148           6.2          NaN           5.4          NaN
149           5.9          NaN           5.1          NaN
150 rows × 4 columns
DataFrames Boolean Indexing | 3

• In a Boolean expression you can use:
• AND which is the & operator
• OR which is the | operator
data_numeric = data.iloc[:, 0:4]
rst = data_numeric[data_numeric >= 5.0]
rst.head()

#Other examples (data_numeric >= 3.0) AND (data_numeric <= 5.0):
rst = data_numeric[(data_numeric >= 3.0) & (data_numeric <= 5.0)]
rst.head()

#Other examples (data_numeric < 3.0) OR (data_numeric > 5.0):
rst = data_numeric[(data_numeric < 3.0) | (data_numeric > 5.0)]
rst.head()
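
The element-wise filtering can be followed on a tiny DataFrame. In this sketch the values are the first three sepal rows typed in by hand, and mask, filtered and same are illustrative names. Each comparison needs its own parentheses, because & and | bind more tightly than >= and <=.

```python
import pandas as pd

df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width':  [3.5, 3.0, 3.2],
})

# A DataFrame of True/False values with the same shape as df
mask = (df >= 3.0) & (df <= 5.0)
print(mask)

filtered = df[mask]      # values where mask is False become NaN
same = df.where(mask)    # where() expresses the same operation explicitly
print(filtered)
```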
DataFrames Boolean Indexing | 4

• You can use the .loc indexer to filter rows according to a Boolean criterion
import pandas as pd
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)
data.columns=['sepal_length','sepal_width','petal_length','petal_width','class']

#Select rows where sepal_length >= 5.0
rst = data.loc[ data.sepal_length >= 5.0 ]
print('Select row where sepal_length >= 5.0')
print(rst.head())

#Select rows where sepal_length >= 5.0 AND sepal_width >= 3.5
rst = data.loc[ (data.sepal_length >= 5.0) & (data.sepal_width >= 3.5)]
print('Select row where sepal_length >= 5.0 & data.sepal_width >= 3.5')
print(rst.head())
DataFrames Boolean Indexing | 4

Output:

Select row where sepal_length >= 5.0
    sepal_length  sepal_width  petal_length  petal_width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa

Select row where sepal_length >= 5.0 & data.sepal_width >= 3.5
    sepal_length  sepal_width  petal_length  petal_width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
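
Row filtering keeps the original index labels, which is why the index in the output above jumps (0, 4, 5, ...). The sketch below reproduces this on the first five rows typed in by hand; the 3.6 threshold and the variable names are choices made for the example.

```python
import pandas as pd

data = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal_width':  [3.5, 3.0, 3.2, 3.1, 3.6],
})

# Rows where the condition is True are kept whole; other rows are dropped
long_sepals = data.loc[data['sepal_length'] >= 5.0]
long_and_wide = data.loc[(data['sepal_length'] >= 5.0) & (data['sepal_width'] >= 3.6)]

print(long_sepals)       # index labels 0 and 4 survive
print(long_and_wide)     # only index label 4

# Optional: renumber the rows 0, 1, ... after filtering
renumbered = long_sepals.reset_index(drop=True)
print(renumbered)
```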
DataFrames Statistics | 1

• Similar to Series, you can use the describe() function to print out statistics.
• In DataFrames, the statistics are calculated by column (for the numeric columns only).

data.describe()

       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
DataFrames Statistics | 2

• Similar to Series, you can use mean(), min(), max(), std() and var()
• In DataFrames, the statistics are calculated by column (for the numeric columns only).

# numeric_only=True skips the text column 'class'; recent pandas versions raise an error for mean() and std() without it
print('Avg per col:')
print(data.mean(numeric_only=True))
print('Std per col:')
print(data.std(numeric_only=True))
print('Min per col:')
print(data.min(numeric_only=True))
print('Max per col:')
print(data.max(numeric_only=True))

Output:

Avg per col:
sepal_length    5.843333
sepal_width     3.054000
petal_length    3.758667
petal_width     1.198667

Std per col:
sepal_length    0.828066
sepal_width     0.433594
petal_length    1.764420
petal_width     0.763161
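
The per-column behaviour can be checked on a small hand-made DataFrame. In this sketch, select_dtypes('number') is one way to keep only the numeric columns, and axis=1 switches from one result per column to one result per row; the data and variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width':  [3.5, 3.0, 3.2],
    'class':        ['Iris-setosa', 'Iris-setosa', 'Iris-setosa'],
})

col_means = data.mean(numeric_only=True)       # one mean per numeric column
numeric_part = data.select_dtypes('number')    # drop the text column
row_means = numeric_part.mean(axis=1)          # one mean per row

print(col_means)
print(row_means)
```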

DataFrames <-> NumPy | 1

• There are cases where you need to convert a DataFrame into a NumPy array and vice versa
• This is needed in machine learning tasks like classification and regression that you will study next
• Let us start by converting a DataFrame into a NumPy array using the to_numpy() function
import pandas as pd
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)
data.columns=['sepal_length','sepal_width','petal_length','petal_width', 'class']

#Convert a dataFrame into a numPy array
numpy_from_dataFrame = data.to_numpy()
#print(numpy_from_dataFrame)

#OR: Convert the first 4 columns of a dataFrame into a numPy array
numpy_from_dataFrame = data.iloc[:, 0:4].to_numpy()
#print(numpy_from_dataFrame)
DataFrames <-> NumPy | 2

Output of data.to_numpy():

[[5.1 3.5 1.4 0.2 'Iris-setosa']
 [4.9 3.0 1.4 0.2 'Iris-setosa']
 [4.7 3.2 1.3 0.2 'Iris-setosa']
 [4.6 3.1 1.5 0.2 'Iris-setosa']
 [5.0 3.6 1.4 0.2 'Iris-setosa']
 [5.4 3.9 1.7 0.4 'Iris-setosa']
 [4.6 3.4 1.4 0.3 'Iris-setosa']
 [5.0 3.4 1.5 0.2 'Iris-setosa']
 [4.4 2.9 1.4 0.2 'Iris-setosa']
 [4.9 3.1 1.5 0.1 'Iris-setosa']
 [5.4 3.7 1.5 0.2 'Iris-setosa']
 [4.8 3.4 1.6 0.2 'Iris-setosa']
 …

Output of data.iloc[:, 0:4].to_numpy():

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 …
DataFrames <-> NumPy | 3

• To convert a NumPy array into a DataFrame we can use the constructor pd.DataFrame()
• Notice how you can add column names (which are the headers), using the argument columns=[…]

# numpy_from_dataFrame must be the full 5-column array from data.to_numpy() to match the 5 column names
dataFrame_from_numpy = pd.DataFrame(numpy_from_dataFrame,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])

dataFrame_from_numpy.head()
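
The round trip can be sketched on two hand-typed rows. The example also prints the array dtype: when numbers and text are mixed, to_numpy() falls back to the generic object type, while the numeric-only slice gives float64. Variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'sepal_length': [5.1, 4.9],
    'sepal_width':  [3.5, 3.0],
    'class':        ['Iris-setosa', 'Iris-setosa'],
})

mixed_array = data.to_numpy()                   # numbers and text together
numeric_array = data.iloc[:, 0:2].to_numpy()    # numeric columns only
print(mixed_array.dtype, numeric_array.dtype)

# Rebuild a DataFrame; the number of names must match the number of array columns
rebuilt = pd.DataFrame(numeric_array, columns=['sepal_length', 'sepal_width'])
print(rebuilt)
```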
Other Ways of Creating DataFrames – 1

import pandas as pd
df = pd.DataFrame(
    {
        "Name":["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", "Bonnell, Miss. Elizabeth"],
        "Age":[22, 35, 58],
        "Gender":["male","male", "female"]
    }
)
print(df)
df.describe()

Output:

                       Name  Age  Gender
0   Braund, Mr. Owen Harris   22    male
1  Allen, Mr. William Henry   35    male
2  Bonnell, Miss. Elizabeth   58  female

             Age
count   3.000000
mean   38.333333
std    18.230012
min    22.000000
25%    28.500000
50%    35.000000
75%    46.500000
max    58.000000

https://pandas.pydata.org/
Other Ways of Creating DataFrames – 2

#You can create a DataFrame from an existing dictionary as follows
import pandas as pd
my_dictionary={
    "Name": [
        "Dr. Sami Batata",
        "Prof. Marwa Halawah",
        "Mr. Fawzi Kamal"
    ],
    "Age": [29, 40, 60],
    "Gender": ["male", "female", "male"]
}
df = pd.DataFrame(my_dictionary)
print(df)

The dictionary’s keys become the column names (headers). The values become the element values in the corresponding column.

                  Name  Age  Gender
0      Dr. Sami Batata   29    male
1  Prof. Marwa Halawah   40  female
2      Mr. Fawzi Kamal   60    male
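
A related option, sketched here as an aside, is to build the DataFrame from a list of dictionaries, one per row. The set_index call then turns the Name column into the row index; the variable names are illustrative.

```python
import pandas as pd

# One dictionary per row; the keys become the column names
records = [
    {"Name": "Dr. Sami Batata", "Age": 29, "Gender": "male"},
    {"Name": "Prof. Marwa Halawah", "Age": 40, "Gender": "female"},
    {"Name": "Mr. Fawzi Kamal", "Age": 60, "Gender": "male"},
]
df_from_records = pd.DataFrame(records)
print(df_from_records)

# Optionally use one column as the row index, then look rows up by name
df_by_name = df_from_records.set_index("Name")
print(df_by_name.loc["Prof. Marwa Halawah", "Age"])
```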
