Week 4- Introduction to Python #3
PANDAS
import numpy
myarr = numpy.array([1,2,3,4])
import numpy as np
myarr = np.array([1,2,3,4])
Importing a specific object
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
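A minimal sketch of importing a specific object, assuming the slide refers to the from ... import ... form. It reuses the x and y lists above; the names x_arr, y_arr and ratio are illustrative and do not come from the slides.

```python
# Import only the array function instead of the whole numpy module
from numpy import array

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# array can now be called directly, without the numpy. or np. prefix
x_arr = array(x)
y_arr = array(y)

# Element-wise division works on arrays (it would fail on plain lists)
ratio = y_arr / x_arr
print(ratio)
```

With this form only the imported name is available, so a call such as np.sort would still need its own import.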
import numpy as np
# Create a 1D array
numpy_array = np.array( [10,20,30] )
import numpy as np
arr_2D = np.array([
[10, 20, 30, 4],
[2, 8, 2, 4],
[30, 12, 67, 44],
[24, 10, 32, 0]
])
print(arr_2D)
print('Shape: ', arr_2D.shape)  # prints the dimensions of the array
Reshaping NumPy Arrays
import numpy as np
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
print('arr contains: \n', arr)
arr_2D = arr.reshape(4,3)
print('arr_2D contains: \n', arr_2D)
Transposing NumPy Arrays
• You can use the np.transpose function to swap the rows and columns of a 2D array
• The first row becomes the first column, the second row becomes the second column, and so forth
import numpy as np
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
print('arr contains: \n', arr)
arr_2D = arr.reshape(4,3)
print('arr_2D contains: \n', arr_2D)
#------------------------------------
arr_2D_transposed = np.transpose(arr_2D)
print('arr_2D_transposed contains: \n',
arr_2D_transposed)
NumPy Sorting (FYI)
# Sort column-wise
rst = np.sort(arr_2D,axis=0)
print('Column-wise sorting:\n', rst)
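A self-contained sketch of the same idea, assuming arr_2D is the 4x4 array defined earlier in these slides. It contrasts column-wise sorting (axis=0) with row-wise sorting (axis=1); the names by_column and by_row are illustrative.

```python
import numpy as np

arr_2D = np.array([
    [10, 20, 30, 4],
    [2, 8, 2, 4],
    [30, 12, 67, 44],
    [24, 10, 32, 0],
])

# axis=0 sorts the values inside each column independently
by_column = np.sort(arr_2D, axis=0)
# axis=1 sorts the values inside each row independently
by_row = np.sort(arr_2D, axis=1)

print('Column-wise sorting:\n', by_column)
print('Row-wise sorting:\n', by_row)
```

np.sort returns a sorted copy, so arr_2D itself is left unchanged.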
NumPy Calculation Functions
• We will use the sum, min, max, mean, std and var functions on NumPy arrays
import numpy as np
grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])
print('The grades are: \n', grades)
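The calls themselves might look like the sketch below, which applies the six functions to the grades array. The variable names and the use of the axis argument are illustrative assumptions, not taken from the slides.

```python
import numpy as np

grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])

# Statistics over the whole array
total = grades.sum()
lowest = grades.min()
highest = grades.max()
average = grades.mean()
spread_std = grades.std()   # population standard deviation (ddof=0 by default)
spread_var = grades.var()   # population variance (ddof=0 by default)
print('sum:', total, 'min:', lowest, 'max:', highest)
print('mean:', average, 'std:', spread_std, 'var:', spread_var)

# axis=0 works down the rows and gives one result per column
column_means = grades.mean(axis=0)
# axis=1 works across the columns and gives one result per row
row_means = grades.mean(axis=1)
print('Mean of each column:', column_means)
print('Mean of each row:', row_means)
```

The same axis argument is accepted by sum, min, max, std and var.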
• Pandas is the commonly used library for dealing with such data
• It provides support for:
  ◦ Series: for 1D collections (enhanced 1D array)
  ◦ DataFrames: for 2D collections (enhanced 2D array)
Pandas Series and DataFrames | 2
[Figure: a Series has an index column and a value column; a DataFrame has an index column and several columns, each with a header]
• A Series is an enhanced 1D array
• It can be indexed using integers (like a NumPy array) or strings
import pandas as pd
grades = pd.Series([87, 100, 94])
print('Grades Series:\n',grades)
print('First grade: ',grades[0])
import pandas as pd
grades = pd.Series([87, 100, 94],
index=['First', 'Second', 'final'])
print(grades)
Output:
First 87
Second 100
final 94
Accessing Series Using String Indices
• In the previous example, the Series has custom indices, so a value can be accessed via square brackets [ ] containing a custom index value:
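import pandas as pd
grades = pd.Series([87, 100, 94], index=['First', 'Second', 'final'])
print('Grade of first = ',grades['First']) # access by custom index, or
print('Grade of first = ',grades.iloc[0]) # access by integer position
print('Series values are: ',grades.values)
print('Series indices are: ',grades.index)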
Output:
Grade of first = 87
Grade of first = 87
Series values are: [ 87 100 94]
Series indices are: Index(['First', 'Second', 'final'],
dtype='object')
DataFrames
• Enhanced 2D arrays
[Figure: a DataFrame with an index column and four columns, each with a header]
• Pandas provides a read_csv() function to read data stored as a .csv file into a pandas DataFrame.
• Pandas supports many different file formats, including CSV and Excel:
• To read a CSV file: myDataFrame = pd.read_csv("myfile.csv")
• To read an Excel file: myDataFrame = pd.read_excel("myfile.xlsx")
• To write a DataFrame to an Excel file: myDataFrame.to_excel("myOutputFile.xlsx")
• After reading a file, you can display the first 5 rows using myDataFrame.head() and the last 5 rows using myDataFrame.tail()
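A small sketch of reading CSV data and inspecting it with head() and tail(). To stay runnable without a file on disk, it assumes a made-up CSV text passed through io.StringIO in place of "myfile.csv"; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for "myfile.csv"
csv_text = """name,score
a,87
b,100
c,94
d,77
e,90
f,81
g,82
"""

# read_csv accepts a file path or a file-like object such as StringIO
myDataFrame = pd.read_csv(io.StringIO(csv_text))

first_rows = myDataFrame.head()   # first 5 rows
last_rows = myDataFrame.tail()    # last 5 rows
print(first_rows)
print(last_rows)
```

Both functions also take a row count, for example myDataFrame.head(3) for the first 3 rows.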
Creating DataFrames From Files in Colab (FYI)
• We will use the Iris sample data, which contains information on 150 Iris flowers, 50 from each of three Iris species: Setosa, Versicolour, and Virginica.
• Each flower is characterized by five attributes:
1. sepal_length in centimeters
2. sepal_width in centimeters
3. petal_length in centimeters
4. petal_width in centimeters
5. class: the type the flower belongs to (Setosa, Versicolour, Virginica), which is the last column in the DataFrame
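import pandas as pd
#Read the Iris data from the internet (the file has no header row)
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
#And display the first 5 rows to make sure that the reading is successful
data.head()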
Creating DataFrames From Internet Files | 3
The output: [table showing the first 5 rows of the Iris DataFrame]
Accessing DataFrame’s Columns and Rows | 1
#Access one column using a header's name
print('petal_length columns:\n',data['petal_length'])
#Access one row by position using .iloc
print('\n\nFirst row:')
print(data.iloc[0])
Output:
petal_length columns:
0      1.4
1      1.4
2      1.3
3      1.5
4      1.4
...
145    5.2
146    5.0
147    5.2
148    5.4
149    5.1

First row:
sepal_length            5.1
sepal_width             3.5
petal_length            1.4
petal_width             0.2
class           Iris-setosa
Accessing DataFrame’s Columns and Rows | 2
print('\n\nFirst 5 rows:')
#Print rows 0 up to but not including row 5, and columns 0, 1 and the last column
#.iloc[ rows from:to , [column indices] ]
print(data.iloc[0:5 , [0,1,-1]])
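Because .iloc and .loc are easy to mix up, the sketch below contrasts them on a small hand-typed DataFrame (an illustrative stand-in for the Iris data, so no download is needed). The names df, by_position and by_label are assumptions for this example.

```python
import pandas as pd

# Small illustrative DataFrame with the default integer index 0..3
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6],
    'sepal_width': [3.5, 3.0, 3.2, 3.1],
    'class': ['Iris-setosa'] * 4,
})

# .iloc selects by integer position; the end of a slice is excluded
by_position = df.iloc[0:2, [0, -1]]
# .loc selects by label; the end of a slice is included
by_label = df.loc[0:2, ['sepal_length', 'class']]

print(by_position)   # rows 0 and 1
print(by_label)      # rows 0, 1 and 2
```

With the default integer index the row labels and positions coincide, which is why the only visible difference here is whether the end of the slice is included.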
• With Boolean indexing, you can use .loc to filter rows according to a Boolean condition
import pandas as pd
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
#Select rows where sepal_length >= 5.0 AND sepal_width >= 3.5
rst = data.loc[ (data.sepal_length >= 5.0) & (data.sepal_width >= 3.5)]
print('Select row where sepal_length >= 5.0 & data.sepal_width >= 3.5')
print(rst.head())
DataFrames Boolean Indexing | 4
Select row where sepal_length >= 5.0 & data.sepal_width >= 3.5
    sepal_length  sepal_width  petal_length  petal_width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
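The same pattern extends to OR and NOT conditions. The sketch below uses a small hand-typed DataFrame as an illustrative stand-in for the Iris data; the names df, both, either and negated are assumptions for this example.

```python
import pandas as pd

# Small illustrative DataFrame (values typed in by hand)
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 5.0, 5.4, 4.6],
    'sepal_width': [3.5, 3.0, 3.6, 3.9, 3.1],
})

# AND: both conditions must hold; each condition needs its own parentheses
both = df.loc[(df.sepal_length >= 5.0) & (df.sepal_width >= 3.5)]
# OR: at least one condition must hold
either = df.loc[(df.sepal_length < 5.0) | (df.sepal_width >= 3.9)]
# NOT: invert a condition with ~
negated = df.loc[~(df.sepal_length >= 5.0)]

print(both)
print(either)
print(negated)
```

The operators &, | and ~ are used instead of and, or and not because the comparison is made element by element over whole columns.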
DataFrames Statistics | 1
• Similar to Series, you can use the describe() function to print out summary statistics.
• You can also use the mean(), min(), max(), std() and var() functions.
• In DataFrames, the statistics are calculated by column (for the numeric columns only).
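A short sketch of column-wise statistics, using the five rows shown in the Boolean indexing output above as a hand-typed DataFrame. One caveat, stated as an assumption about the installed version: in pandas 2.0 and later, mean(), std() and var() need numeric_only=True when the DataFrame also holds a text column such as class, whereas describe() skips non-numeric columns by default.

```python
import pandas as pd

# Five rows copied by hand from the filtered Iris output
df = pd.DataFrame({
    'sepal_length': [5.1, 5.0, 5.4, 5.4, 5.8],
    'sepal_width': [3.5, 3.6, 3.9, 3.7, 4.0],
    'class': ['Iris-setosa'] * 5,
})

# describe() summarises each numeric column (count, mean, std, min, quartiles, max)
summary = df.describe()
print(summary)

# numeric_only=True restricts the calculation to the numeric columns
column_means = df.mean(numeric_only=True)
column_stds = df.std(numeric_only=True)
print(column_means)
print(column_stds)
```

Each result has one entry per numeric column, so the class column does not appear in any of them.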
• There are cases where you need to convert a DataFrame into a NumPy array and vice versa
• This is needed in machine learning tasks like classification and regression that you will study next
• Let us start by converting a DataFrame into a NumPy array using the to_numpy() function
import pandas as pd
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
#Convert the DataFrame into a NumPy array
numpy_from_dataFrame = data.to_numpy()
#Convert the NumPy array back into a DataFrame
dataFrame_from_numpy = pd.DataFrame(numpy_from_dataFrame,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
dataFrame_from_numpy.head()
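The round trip can also be tried without a download. The sketch below uses a small hand-typed DataFrame as an illustrative stand-in and highlights one detail worth knowing: when numeric and text columns are converted together, the resulting array has dtype object. The names df, mixed, numeric and rebuilt are assumptions for this example.

```python
import pandas as pd

# Small illustrative DataFrame mixing numeric and text columns
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width': [3.5, 3.0, 3.2],
    'class': ['Iris-setosa'] * 3,
})

# Mixed column types: every element is stored as a generic Python object
mixed = df.to_numpy()
print(mixed.dtype)

# Numeric columns only: the array keeps a numeric dtype
numeric = df[['sepal_length', 'sepal_width']].to_numpy()
print(numeric.dtype)

# Back to a DataFrame, supplying the column names again
rebuilt = pd.DataFrame(mixed, columns=df.columns)
print(rebuilt.head())
```

Selecting only the numeric columns before calling to_numpy() is a common choice when the array is meant for numerical work.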
Other Ways of Creating DataFrames – 1