Numpy - Pandas - Lab - Jupyter Notebook
Numpy
Numpy is a general-purpose array-processing package. It provides a high-performance multidimensional array
object, and tools for working with these arrays. It is the fundamental package for scientific computing with
Python.
In [116]:
import numpy as np
a=np.array([1,2,3])
In [117]:
print(a)
[1 2 3]
In [118]:
type(a)
Out[118]:
numpy.ndarray
In [119]:
a=np.array([(1,2,3),(4,5,6)])
print(a)
[[1 2 3]
[4 5 6]]
In [120]:
b = np.array([4,5,6])
In [121]:
c = np.add(a,b)
In [122]:
c
Out[122]:
array([[ 5, 7, 9],
[ 8, 10, 12]])
Advantages:
Less memory, faster, and more convenient.
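The memory advantage can be sketched by comparing the bytes used by a plain Python list of integers with an equivalent NumPy array (a minimal sketch; exact sizes are CPython/NumPy implementation details and vary by platform):

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores pointers to boxed int objects; the array stores raw int64s.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
array_bytes = np_array.nbytes

print(list_bytes, array_bytes)
```

The array needs 8 bytes per element, while the list pays for both the pointer and the boxed integer object it points to.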
6/12/2021 Numpy_Pandas_Lab - Jupyter Notebook
In [123]:
import time
import numpy as np
size_of_vec = 10000000
def pure_python_version():
    t1 = time.time()
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]
    return time.time() - t1

def numpy_version():
    t1 = time.time()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.time() - t1
t1 = pure_python_version()
t2 = numpy_version()
print(t1, t2)
4.500564098358154 0.06101179122924805
Pandas
Pandas is the most popular Python library for data analysis. It provides highly optimized performance, with back-end source code written purely in C or Python.
Series:
A Series is a one-dimensional (1-D) array defined in pandas that can store any data type.
e.g.: an Excel sheet with columns 'name', 'age', 'height', where each column can hold many values.
In [124]:
import pandas as pd
In [125]:
s = pd.Series([1, 2, 3, 4, 5, 2], index=['a', 'b', 'c', 'd', 'e', 'f'])
In [126]:
s
Out[126]:
a 1
b 2
c 3
d 4
e 5
f 2
dtype: int64
In [128]:
#2D array
Data =[[2, 3, 4], [5, 6, 7]]
s_a = pd.Series(Data)
In [129]:
s_a
Out[129]:
0 [2, 3, 4]
1 [5, 6, 7]
dtype: object
DataFrames:
A DataFrame is a two-dimensional (2-D) data structure defined in pandas, which consists of rows and columns.
In [130]:
df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df1
Out[130]:
col1 col2
0 1 3
1 2 4
In [131]:
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2
Out[131]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
In [132]:
df = pd.read_csv('RegularSeasonCompactResults.csv')
In [133]:
df.to_csv('save.csv')
Now that we have our dataframe in the variable df, let's look at what it contains. We can use the function head() to see the first few rows of the dataframe (or the function tail() to see the last few rows).
In [134]:
df.head()
Out[134]:
In [135]:
df.tail()
Out[135]:
We can see the dimensions of the dataframe using the shape attribute.
In [136]:
df.shape
Out[136]:
(145289, 8)
We can also extract all the column names as a list using the columns attribute, and the row labels with the index attribute.
In [137]:
df.columns.tolist()
Out[137]:
In order to get a better idea of the type of data we are dealing with, we can call the describe() function to see statistics like the mean, min, etc. for each column of the dataset.
In [138]:
df.describe()
Out[138]:
Okay, so now let's look at the information we want to extract from the dataframe. Let's say I wanted to know the max value of a certain column. The function max() will show you the maximum value of each column.
In [139]:
df.max()
Out[139]:
Season 2016
Daynum 132
Wteam 1464
Wscore 186
Lteam 1464
Lscore 150
Wloc N
Numot 6
dtype: object
Then, if you'd like to specifically get the max value for a particular column, you pass in the name of the column
using the bracket indexing operator
In [140]:
df['Wscore'].max()
Out[140]:
186
In [141]:
df['Lscore'].mean()
Out[141]:
64.49700940883343
But what if that's not enough? Let's say we want to actually see the game (row) where this max score happened. We can call the argmax() function to identify the row index.
In [142]:
df['Wscore'].argmax()
Out[142]:
24970
One of the most useful functions that you can call on certain columns in a dataframe is the value_counts()
function. It shows how many times each item appears in the column. This particular command shows the
number of games in each season
In [143]:
df['Season'].value_counts()
Out[143]:
2016 5369
2014 5362
2015 5354
2013 5320
2010 5263
2012 5253
2009 5249
2011 5246
2008 5163
2007 5043
2006 4757
2005 4675
2003 4616
2004 4571
2002 4555
2000 4519
2001 4467
1999 4222
1998 4167
1997 4155
1992 4127
1991 4123
1996 4122
1995 4077
1994 4060
1990 4045
1989 4037
1993 3982
1988 3955
1987 3915
1986 3783
1985 3737
Accessing Values
iloc
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean
array.
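As a quick sketch of the boolean-array form, on a hypothetical toy DataFrame (the names `toy` and `mask` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy DataFrame standing in for the season data.
toy = pd.DataFrame({'score': [81, 77, 63, 70]})
mask = [True, False, True, False]  # keep positions 0 and 2
print(toy.iloc[mask])
```

The boolean array must be the same length as the axis being indexed.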
In [144]:
df.iloc[:3]
Out[144]:
In [145]:
df.shape
Out[145]:
(145289, 8)
In [146]:
df.iloc[0]
Out[146]:
Season 1985
Daynum 20
Wteam 1228
Wscore 81
Lteam 1328
Lscore 64
Wloc N
Numot 0
In [147]:
df.head(1)
Out[147]:
In [148]:
df.iloc[0,:]
Out[148]:
Season 1985
Daynum 20
Wteam 1228
Wscore 81
Lteam 1328
Lscore 64
Wloc N
Numot 0
In [149]:
df.tail(1)
Out[149]:
In [150]:
df.iloc[-1,:]
Out[150]:
Season 2016
Daynum 132
Wteam 1386
Wscore 87
Lteam 1433
Lscore 74
Wloc N
Numot 0
In [151]:
df.iloc[[df['Wscore'].argmax()]]
Out[151]:
Then, in order to get attributes about the game, we need to use the iloc[] function. Iloc is definitely one of the
more important functions. The main idea is that you want to use it whenever you have the integer index of a
certain row that you want to access. As per Pandas documentation, iloc is an "integer-location based indexing
for selection by position."
Let's take this a step further. Let's say you want to know the game with the highest scoring winning team (this is
what we just calculated), but you then want to know how many points the losing team scored.
In [153]:
df.iloc[[df['Wscore'].argmax()]]['Lscore']
Out[153]:
24970 140
When you see data displayed in the above format, you're dealing with a Pandas Series object, not a dataframe
object.
In [154]:
type(df.iloc[[df['Wscore'].argmax()]]['Lscore'])
Out[154]:
pandas.core.series.Series
In [155]:
type(df.iloc[[df['Wscore'].argmax()]])
Out[155]:
pandas.core.frame.DataFrame
To summarize the 3 data structures in Pandas: a Series is 1-D, a DataFrame is 2-D, and a Panel is 3-D (haven't ever really used Panels yet).
When you want to access values in a Series, you'll want to just treat the Series like a Python dictionary, so
you'd access the value according to its key (which is normally an integer index)
In [156]:
df.iloc[[df['Wscore'].argmax()]]['Lscore'][24970]
Out[156]:
140
The other really important function in Pandas is the loc function. Contrary to iloc, which is integer-based indexing, loc is a "Purely label-location based indexer for selection by label". Since all the games are ordered from 0 to 145288, iloc and loc are going to be pretty interchangeable in this type of dataset.
In [157]:
df.iloc[:5]
Out[157]:
In [158]:
df.loc[:5]
Out[158]:
Notice the slight difference in that iloc is exclusive of the second number, while loc is inclusive.
Below is an example of how you can use loc to achieve the same task as we did previously with iloc.
In [159]:
df.loc[df['Wscore'].argmax(), 'Lscore']
Out[159]:
140
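The inclusive/exclusive slicing difference is easiest to see on a hypothetical toy DataFrame with a default 0..4 index (`toy` is a made-up name for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30, 40, 50]})  # default index 0..4
print(len(toy.iloc[:3]))  # 3 rows: positions 0, 1, 2 (end excluded)
print(len(toy.loc[:3]))   # 4 rows: labels 0, 1, 2, 3 (end included)
```

With a default RangeIndex the labels and positions coincide, which is exactly why the two indexers look interchangeable on this dataset.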
A faster version uses the at[] indexer. at[] is really useful whenever you know the row label and the column label of the particular value that you want to get.
In [160]:
df.at[df['Wscore'].argmax(), 'Lscore']
Out[160]:
140
Sorting
Let's say that we want to sort the dataframe in increasing order for the scores of the losing team
In [161]:
df.sort_values('Lscore').head()
Out[161]:
We can also filter rows with a boolean condition. For example, games where the winning team scored more than 150 points:
In [162]:
df[df['Wscore'] > 150]
Out[162]:
This also works if you have multiple conditions. Let's say we want to find out when the winning team scores more than 150 points and the losing team scores below 100.
In [163]:
df[(df['Wscore'] > 150) & (df['Lscore'] < 100)]
Out[163]:
Grouping
Another important function in Pandas is groupby(). It allows you to group entries by certain attributes (e.g. grouping entries by Wteam number) and then perform operations on them. The following command groups all the entries (games) with the same Wteam number and finds the mean score for each group.
In [164]:
df.groupby('Wteam')['Wscore'].mean().head()
Out[164]:
Wteam
1101 78.111111
1102 69.893204
1103 75.839768
1104 75.825944
1105 74.960894
This next command groups all the games with the same Wteam number and counts how many times that specific team won at home, on the road, or at a neutral site.
In [165]:
df.groupby('Wteam')['Wloc'].value_counts().head(9)
Out[165]:
Wteam Wloc
1101 H 12
A 3
N 3
1102 H 204
A 73
N 32
1103 H 324
A 153
N 41
Each dataframe has a values attribute which is useful because it basically displays your dataframe in a numpy
array style format
In [166]:
df.values
Out[166]:
...,
Now, you can simply just access elements like you would in an array.
In [167]:
df.values[0][0]
Out[167]:
1985
Dataframe Iteration
In order to iterate through dataframes, we can use the iterrows() function. Below is an example of what the first two rows look like. Each row in iterrows is a Series object.
In [168]:
for index, row in df.iterrows():
    print(row)
    if index == 1:
        break
Season 1985
Daynum 20
Wteam 1228
Wscore 81
Lteam 1328
Lscore 64
Wloc N
Numot 0
Season 1985
Daynum 25
Wteam 1106
Wscore 77
Lteam 1354
Lscore 70
Wloc H
Numot 0
The enumerate() method adds a counter to an iterable and returns it in the form of an enumerate object. This enumerate object can then be used directly in for loops or converted into a list of tuples using the list() method.
In [213]:
l1 = ["eat","sleep","repeat"]
s1 = "geek"
In [214]:
for ele in enumerate(l1):
    print(ele)
for count, ele in enumerate(l1, 100):
    print(count, ele)
(0, 'eat')
(1, 'sleep')
(2, 'repeat')
100 eat
101 sleep
102 repeat
In [169]:
df[['Wscore', 'Lscore']].head()
Out[169]:
Wscore Lscore
0 81 64
1 77 70
2 63 56
3 70 54
4 86 74
Notice that you can achieve the same result by using the loc function. loc is a very versatile function that can help you in a lot of accessing and extracting tasks.
In [170]:
df.loc[1:5, ['Wscore', 'Lscore']]
Out[170]:
Wscore Lscore
1 77 70
2 63 56
3 70 54
4 86 74
5 79 78
Note the difference in the return types when you use single brackets versus double brackets.
In [171]:
type(df['Wscore'])
Out[171]:
pandas.core.series.Series
In [172]:
type(df[['Wscore']])
Out[172]:
pandas.core.frame.DataFrame
You've seen before that you can access columns through df['col name']. You can access rows by using slicing
operations.
In [173]:
df[0:3]
Out[173]:
In [174]:
df.iloc[0:3,:]
Out[174]:
In [175]:
df.isnull().sum()
Out[175]:
Season 0
Daynum 0
Wteam 0
Wscore 0
Lteam 0
Lscore 0
Wloc 0
Numot 0
dtype: int64
If you do end up having missing values in your datasets, be sure to get familiar with these two functions.
dropna() - This function allows you to drop all (or some) of the rows that have missing values.
fillna() - This function allows you to replace the missing values with the value that you pass in.
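A minimal sketch of both functions on a hypothetical toy DataFrame with one missing value (the dataset in this notebook has none, so a made-up frame is used here):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})
dropped = toy.dropna()   # drops row 1, which contains the NaN
filled = toy.fillna(0)   # replaces the NaN with the value passed in
print(dropped.shape, filled.shape)
```

Note that both return new DataFrames; the original is left unchanged unless you reassign it.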
In [176]:
df.head()
Out[176]:
In [177]:
import numpy as np
Pivot Table
pivot_table() creates a spreadsheet-style pivot table as a DataFrame.
In [178]:
df2 = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
                    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
                    "C": ["small", "large", "large", "small", "small", "large", "small", "small", "large"],
                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
In [179]:
df2
Out[179]:
A B C D E
In [180]:
table = pd.pivot_table(df2, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum)
In [181]:
table
Out[181]:
C large small
Merge
Merge DataFrame or named Series objects with a database-style join
In [182]:
left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(left)
print(right)
id Name subject_id
0 1 Alex sub1
1 2 Amy sub2
2 3 Allen sub4
3 4 Alice sub6
4 5 Ayoung sub5
id Name subject_id
0 1 Billy sub2
1 2 Brian sub4
2 3 Bran sub3
3 4 Bryce sub6
4 5 Betty sub5
In [183]:
pd.merge(left,right,on='id')
Out[183]:
In [184]:
pd.merge(left,right,on=['id','subject_id'])
Out[184]:
(INNER) JOIN: Returns records that have matching values in both tables
LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right
table
RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left
table
FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table
In [185]:
pd.merge(left, right, on='subject_id', how='inner')
Out[185]:
In [186]:
pd.merge(left, right, on='subject_id', how='left')
Out[186]:
In [187]:
pd.merge(left, right, on='subject_id', how='right')
Out[187]:
In [188]:
pd.merge(left, right, on='subject_id', how='outer')
Out[188]:
In [189]:
df2 = df.drop_duplicates()
df2.shape
Out[189]:
(145289, 8)
In [190]:
df2 = df.sample(10)
In [191]:
df2.shape
Out[191]:
(10, 8)
Concat
Pandas provides various facilities for easily combining Series, DataFrame, and Panel objects.
The concat() function (in the main pandas namespace) does all of the heavy lifting of performing concatenation operations along an axis, while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False)
In [192]:
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
In [193]:
pd.concat([one,two])
Out[193]:
1 Alex sub1 98
2 Amy sub2 90
3 Allen sub4 87
4 Alice sub6 69
5 Ayoung sub5 78
1 Billy sub2 89
2 Brian sub4 80
3 Bran sub3 79
4 Bryce sub6 97
5 Betty sub5 88
In [194]:
pd.concat([one,two],keys=['x','y'])
Out[194]:
1 Alex sub1 98
2 Amy sub2 90
x 3 Allen sub4 87
4 Alice sub6 69
5 Ayoung sub5 78
1 Billy sub2 89
2 Brian sub4 80
y 3 Bran sub3 79
4 Bryce sub6 97
5 Betty sub5 88
In [195]:
#The index of the resultant is duplicated; each index is repeated. If the resultant object has to follow its own indexing, set ignore_index to True.
pd.concat([one,two],keys=['x','y'],ignore_index=True)
Out[195]:
0 Alex sub1 98
1 Amy sub2 90
2 Allen sub4 87
3 Alice sub6 69
4 Ayoung sub5 78
5 Billy sub2 89
6 Brian sub4 80
7 Bran sub3 79
8 Bryce sub6 97
9 Betty sub5 88
In [196]:
#If two objects need to be added along axis=1 (columns), then the new columns will be appended.
pd.concat([one,two],axis=1)
Out[196]:
In [197]:
#row wise
pd.concat([one,two],axis=0)
Out[197]:
1 Alex sub1 98
2 Amy sub2 90
3 Allen sub4 87
4 Alice sub6 69
5 Ayoung sub5 78
1 Billy sub2 89
2 Brian sub4 80
3 Bran sub3 79
4 Bryce sub6 97
5 Betty sub5 88
Melt
Pandas melt() helps to transform columns into rows (wide format to long format).
In [201]:
import pandas as pd
d1 = {"Name": ["Pavan", "Balaji", "Ravi"], "ID": [1, 2, 3], "Role": ["CEO", "Editor", "Author"]}
df = pd.DataFrame(d1)
print(df)
Name ID Role
0 Pavan 1 CEO
1 Balaji 2 Editor
2 Ravi 3 Author
In [202]:
df_melted = pd.melt(df, id_vars=["ID"])
print(df_melted)
ID variable value
0 1 Name Pavan
1 2 Name Balaji
2 3 Name Ravi
3 1 Role CEO
4 2 Role Editor
5 3 Role Author
Chunk
The chunksize parameter of read_csv() helps to read large volumes of data in smaller pieces.
In [215]:
import pandas as pd
for chunk in pd.read_csv("RegularSeasonCompactResults.csv", chunksize=10):
    print(chunk)
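Each chunk is an ordinary DataFrame, so a large file can be aggregated without ever loading it all at once. A sketch, using a small hypothetical file ('tiny.csv', created here) in place of the real dataset:

```python
import pandas as pd

# A hypothetical small file stands in for the real dataset.
pd.DataFrame({'score': range(25)}).to_csv('tiny.csv', index=False)

total = 0
for chunk in pd.read_csv('tiny.csv', chunksize=10):  # 10 rows per chunk
    total += chunk['score'].sum()
print(total)  # same result as summing the whole column at once
```

This pattern (read a chunk, update a running result, discard the chunk) keeps memory usage bounded by the chunk size rather than the file size.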