Numpy - Pandas - Lab - Jupyter Notebook

The document discusses NumPy and Pandas, two popular Python libraries for scientific computing and data analysis. NumPy provides multidimensional arrays and tools for working with them, while Pandas provides data structures and operations for manipulating structured and time series data. The document demonstrates basic NumPy operations like array creation and arithmetic. It also shows how to create Pandas Series and DataFrames, describe their data, and extract statistics and values.


Numpy
Numpy is a general-purpose array-processing package. It provides a high-performance multidimensional array
object, and tools for working with these arrays. It is the fundamental package for scientific computing with
Python.

In [116]:

import numpy as np
a=np.array([1,2,3])

In [117]:

print(a)

[1 2 3]

In [118]:

type(a)

Out[118]:

numpy.ndarray

In [119]:

a=np.array([(1,2,3),(4,5,6)])
print(a)

[[1 2 3]

[4 5 6]]

In [120]:

b = np.array([4,5,6])

In [121]:

c = np.add(a,b)

In [122]:

c

Out[122]:

array([[ 5, 7, 9],

[ 8, 10, 12]])

Advantages:
Less Memory, Fast, Convenient

In [123]:

import time
import numpy as np

size_of_vec = 10000000

def pure_python_version():
    t1 = time.time()
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]
    return time.time() - t1

def numpy_version():
    t1 = time.time()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.time() - t1

t1 = pure_python_version()
t2 = numpy_version()
print(t1, t2)

4.500564098358154 0.06101179122924805
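The timing comparison above illustrates the "Fast" claim. As a small added sketch (not part of the original lab) for the "Less Memory" claim, you can compare the footprint of a plain Python list with an equivalent NumPy array using the standard sys.getsizeof() and the ndarray nbytes attribute:

import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n)

# a list stores pointers to separate int objects; count both
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
# an ndarray stores its elements in one contiguous typed buffer
array_bytes = np_arr.nbytes

print("list :", list_bytes, "bytes")
print("array:", array_bytes, "bytes")

On a typical 64-bit CPython build the list side comes out roughly four times larger, which is where the memory advantage comes from.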

Pandas
Pandas is the most popular Python library for data analysis. It provides highly optimized
performance, with back-end source code written purely in C or Python.

Series:

A Series is a one-dimensional (1-D) array defined in pandas that can be used to store any data type.
For example, an Excel sheet with columns 'name', 'age', 'height', where each column can hold many values.

In [124]:

import pandas as pd

In [125]:

#Create series with Data, and Index


#1 Scalar value which can be integerValue, string
#2 Python Dictionary which can be Key, Value pair
#3 Ndarray
Data = [1,2,3,4,5,2]
Index = ['a','b','c','d','e','f']
a = pd.Series(Data, index = Index)


In [126]:

a

Out[126]:

a 1

b 2

c 3

d 4

e 5

f 2

dtype: int64

In [127]:

dictionary ={'a':1, 'b':2, 'c':3, 'd':4, 'e':5}


s_d = pd.Series(dictionary)

In [128]:

#2D array
Data =[[2, 3, 4], [5, 6, 7]]
s_a = pd.Series(Data)

In [129]:

s_a

Out[129]:

0 [2, 3, 4]

1 [5, 6, 7]

dtype: object

DataFrames:

A DataFrame is a two-dimensional (2-D) data structure defined in pandas which consists of rows and columns.

In [130]:

d = {'col1': [1, 2], 'col2': [3, 4]}


df = pd.DataFrame(data=d)
df

Out[130]:

col1 col2

0 1 3

1 2 4


In [131]:

import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2

Out[131]:

a b c

0 1 2 3

1 4 5 6

2 7 8 9

In [132]:

df = pd.read_csv('RegularSeasonCompactResults.csv')

In [133]:

df.to_csv('save.csv')

Now that we have our dataframe in our variable df, let's look at what it contains. We can use the function head()
to see the first couple rows of the dataframe (or the function tail() to see the last few rows).

In [134]:

df.head()

Out[134]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

0 1985 20 1228 81 1328 64 N 0

1 1985 25 1106 77 1354 70 H 0

2 1985 25 1112 63 1223 56 H 0

3 1985 25 1165 70 1432 54 H 0

4 1985 25 1192 86 1447 74 H 0


In [135]:

df.tail()

Out[135]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

145284 2016 132 1114 70 1419 50 N 0

145285 2016 132 1163 72 1272 58 N 0

145286 2016 132 1246 82 1401 77 N 1

145287 2016 132 1277 66 1345 62 N 0

145288 2016 132 1386 87 1433 74 N 0

We can see the dimensions of the dataframe using the shape attribute

In [136]:

df.shape

Out[136]:

(145289, 8)

We can also extract all the column names as a list by using the columns attribute, and we can extract
the row labels with the index attribute

In [137]:

df.columns.tolist()

Out[137]:

['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc', 'Numot']
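The index attribute mentioned above is not exercised in this notebook; a minimal sketch of what it returns for this dataset (the bounds below follow from the df.shape output shown later):

df.index
# RangeIndex(start=0, stop=145289, step=1)

df.index.tolist()[:5]
# [0, 1, 2, 3, 4]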

In order to get a better idea of the type of data that we are dealing with, we can call the describe() function to
see statistics like mean, min, etc. about each column of the dataset.


In [138]:

df.describe()

Out[138]:

Season Daynum Wteam Wscore Lteam Lsco

count 145289.000000 145289.000000 145289.000000 145289.000000 145289.000000 145289.0000

mean 2001.574834 75.223816 1286.720646 76.600321 1282.864064 64.4970

std 9.233342 33.287418 104.570275 12.173033 104.829234 11.3806

min 1985.000000 0.000000 1101.000000 34.000000 1101.000000 20.0000

25% 1994.000000 47.000000 1198.000000 68.000000 1191.000000 57.0000

50% 2002.000000 78.000000 1284.000000 76.000000 1280.000000 64.0000

75% 2010.000000 103.000000 1379.000000 84.000000 1375.000000 72.0000

max 2016.000000 132.000000 1464.000000 186.000000 1464.000000 150.0000

Okay, so now let's look at information that we want to extract from the dataframe. Let's say I wanted to know
the max value of a certain column. The function max() will show you the maximum values of all columns

In [139]:

df.max()

Out[139]:

Season 2016

Daynum 132

Wteam 1464

Wscore 186

Lteam 1464

Lscore 150

Wloc N

Numot 6

dtype: object

Then, if you'd like to specifically get the max value for a particular column, you pass in the name of the column
using the bracket indexing operator

In [140]:

df['Wscore'].max()

Out[140]:

186

Similarly, if you'd like to find the mean of the losing teams' score:


In [141]:

df['Lscore'].mean()

Out[141]:

64.49700940883343

But what if that's not enough? Let's say we want to actually see the game(row) where this max score happened.
We can call the argmax() function to identify the row index

In [142]:

df['Wscore'].argmax()

Out[142]:

24970

One of the most useful functions that you can call on certain columns in a dataframe is the value_counts()
function. It shows how many times each item appears in the column. This particular command shows the
number of games in each season


In [143]:

df['Season'].value_counts()

Out[143]:

2016 5369

2014 5362

2015 5354

2013 5320

2010 5263

2012 5253

2009 5249

2011 5246

2008 5163

2007 5043

2006 4757

2005 4675

2003 4616

2004 4571

2002 4555

2000 4519

2001 4467

1999 4222

1998 4167

1997 4155

1992 4127

1991 4123

1996 4122

1995 4077

1994 4060

1990 4045

1989 4037

1993 3982

1988 3955

1987 3915

1986 3783

1985 3737

Name: Season, dtype: int64

Accessing Values

iloc

Purely integer-location based indexing for selection by position.

.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean
array.
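The boolean-array form is not demonstrated in this lab; a minimal sketch of what it would look like on the same df, assuming a mask with one entry per row:

mask = (df['Wscore'] > 150).values   # plain NumPy boolean array, one entry per row
df.iloc[mask].head()                 # keeps only the rows where the mask is True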


In [144]:

df.iloc[:3]

Out[144]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

0 1985 20 1228 81 1328 64 N 0

1 1985 25 1106 77 1354 70 H 0

2 1985 25 1112 63 1223 56 H 0

In [145]:

df.shape

Out[145]:

(145289, 8)

In [146]:

df.iloc[0]

Out[146]:

Season 1985

Daynum 20

Wteam 1228

Wscore 81

Lteam 1328

Lscore 64

Wloc N

Numot 0

Name: 0, dtype: object

In [147]:

df.head(1)

Out[147]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

0 1985 20 1228 81 1328 64 N 0


In [148]:

df.iloc[0,:]

Out[148]:

Season 1985

Daynum 20

Wteam 1228

Wscore 81

Lteam 1328

Lscore 64

Wloc N

Numot 0

Name: 0, dtype: object

In [149]:

df.tail(1)

Out[149]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

145288 2016 132 1386 87 1433 74 N 0

In [150]:

df.iloc[-1,:]

Out[150]:

Season 2016

Daynum 132

Wteam 1386

Wscore 87

Lteam 1433

Lscore 74

Wloc N

Numot 0

Name: 145288, dtype: object

In [151]:

df.iloc[[df['Wscore'].argmax()]]

Out[151]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

24970 1991 68 1258 186 1109 140 H 0


Then, in order to get attributes about the game, we need to use the iloc[] function. Iloc is definitely one of the
more important functions. The main idea is that you want to use it whenever you have the integer index of a
certain row that you want to access. As per Pandas documentation, iloc is an "integer-location based indexing
for selection by position."

Let's take this a step further. Let's say you want to know the game with the highest scoring winning team (this is
what we just calculated), but you then want to know how many points the losing team scored.

In [153]:

df.iloc[[df['Wscore'].argmax()]]['Lscore']

Out[153]:

24970 140

Name: Lscore, dtype: int64

When you see data displayed in the above format, you're dealing with a Pandas Series object, not a dataframe
object.

In [154]:

type(df.iloc[[df['Wscore'].argmax()]]['Lscore'])

Out[154]:

pandas.core.series.Series

In [155]:

type(df.iloc[[df['Wscore'].argmax()]])

Out[155]:

pandas.core.frame.DataFrame

As a summary, pandas has three core data structures: Series (1-D), DataFrame (2-D), and Panel (3-D; I haven't ever really used Panels yet)

When you want to access values in a Series, you'll want to just treat the Series like a Python dictionary, so
you'd access the value according to its key (which is normally an integer index)


In [156]:

df.iloc[[df['Wscore'].argmax()]]['Lscore'][24970]

Out[156]:

140

The other really important function in Pandas is the loc function. Contrary to iloc, which is an integer based
indexing, loc is a "purely label-location based indexer for selection by label". Since all the games are ordered
from 0 to 145288, iloc and loc are going to be pretty interchangeable in this type of dataset

In [157]:

df.iloc[:5]

Out[157]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

0 1985 20 1228 81 1328 64 N 0

1 1985 25 1106 77 1354 70 H 0

2 1985 25 1112 63 1223 56 H 0

3 1985 25 1165 70 1432 54 H 0

4 1985 25 1192 86 1447 74 H 0

In [158]:

df.loc[:5]

Out[158]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

0 1985 20 1228 81 1328 64 N 0

1 1985 25 1106 77 1354 70 H 0

2 1985 25 1112 63 1223 56 H 0

3 1985 25 1165 70 1432 54 H 0

4 1985 25 1192 86 1447 74 H 0

5 1985 25 1218 79 1337 78 H 0

Notice the slight difference in that iloc is exclusive of the second number, while loc is inclusive.

Below is an example of how you can use loc to achieve the same task as we did previously with iloc

In [159]:

df.loc[df['Wscore'].argmax(), 'Lscore']

Out[159]:

140


A faster version uses the at() function. at() is really useful whenever you know the row label and the column
label of the particular value that you want to get.

In [160]:

df.at[df['Wscore'].argmax(), 'Lscore']

Out[160]:

140

Sorting
Let's say that we want to sort the dataframe in increasing order for the scores of the losing team

In [161]:

df.sort_values('Lscore').head()

Out[161]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

100027 2008 66 1203 49 1387 20 H 0

49310 1997 66 1157 61 1204 21 H 0

89021 2006 44 1284 41 1343 21 A 0

85042 2005 66 1131 73 1216 22 H 0

103660 2009 26 1326 59 1359 22 H 0

Filtering Rows Conditionally


Now, let's say we want to find all of the rows that satisfy a particular condition. For example, I want to find all of
the games where the winning team scored more than 150 points. The idea behind this command is that you want to
access the column 'Wscore' of the dataframe df (df['Wscore']), find which entries are above 150 (df['Wscore'] >
150), and then return only those specific rows in a dataframe format (df[df['Wscore'] > 150]).


In [162]:

df[df['Wscore'] > 150]

Out[162]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

5269 1986 75 1258 151 1109 107 H 0

12046 1988 40 1328 152 1147 84 H 0

12355 1988 52 1328 151 1173 99 N 0

16040 1989 40 1328 152 1331 122 H 0

16853 1989 68 1258 162 1109 144 A 0

17867 1989 92 1258 181 1109 150 H 0

19653 1990 30 1328 173 1109 101 H 0

19971 1990 38 1258 152 1109 137 A 0

20022 1990 40 1116 166 1109 101 H 0

22145 1990 97 1258 157 1362 115 H 0

23582 1991 26 1318 152 1258 123 N 0

24341 1991 47 1328 172 1258 112 H 0

24970 1991 68 1258 186 1109 140 H 0

25656 1991 84 1106 151 1212 97 H 0

28687 1992 54 1261 159 1319 86 H 0

35023 1993 112 1380 155 1341 91 A 0

40060 1995 32 1375 156 1341 114 H 0

52600 1998 33 1395 153 1410 87 H 0

This also works if you have multiple conditions. Let's say we want to find out when the winning team scores
more than 150 points and when the losing team scores below 100.


In [163]:

df[(df['Wscore'] > 150) & (df['Lscore'] < 100)]

Out[163]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

12046 1988 40 1328 152 1147 84 H 0

12355 1988 52 1328 151 1173 99 N 0

25656 1991 84 1106 151 1212 97 H 0

28687 1992 54 1261 159 1319 86 H 0

35023 1993 112 1380 155 1341 91 A 0

52600 1998 33 1395 153 1410 87 H 0

Grouping
Another important function in Pandas is groupby(). This is a function that allows you to group entries by certain
attributes (e.g. grouping entries by Wteam number) and then perform operations on them. The following
function groups all the entries (games) with the same Wteam number and finds the mean for each group.

In [164]:

df.groupby('Wteam')['Wscore'].mean().head()

Out[164]:

Wteam

1101 78.111111

1102 69.893204

1103 75.839768

1104 75.825944

1105 74.960894

Name: Wscore, dtype: float64

This next command groups all the games with the same Wteam number and counts how many times that
specific team won at home, on the road, or at a neutral site


In [165]:

df.groupby('Wteam')['Wloc'].value_counts().head(9)

Out[165]:

Wteam Wloc

1101 H 12

A 3

N 3

1102 H 204

A 73

N 32

1103 H 324

A 153

N 41

Name: Wloc, dtype: int64

Each dataframe has a values attribute which is useful because it basically displays your dataframe in a numpy
array style format

In [166]:

df.values

Out[166]:

array([[1985, 20, 1228, ..., 64, 'N', 0],

[1985, 25, 1106, ..., 70, 'H', 0],

[1985, 25, 1112, ..., 56, 'H', 0],

...,

[2016, 132, 1246, ..., 77, 'N', 1],

[2016, 132, 1277, ..., 62, 'N', 0],

[2016, 132, 1386, ..., 74, 'N', 0]], dtype=object)

Now, you can simply just access elements like you would in an array.

In [167]:

df.values[0][0]

Out[167]:

1985

Dataframe Iteration
In order to iterate through dataframes, we can use the iterrows() function. Below is an example of what the first
two rows look like. Each row in iterrows is a Series object


In [168]:

for index, row in df.iterrows():
    print(row)
    if index == 1:
        break

Season 1985

Daynum 20

Wteam 1228

Wscore 81

Lteam 1328

Lscore 64

Wloc N

Numot 0

Name: 0, dtype: object

Season 1985

Daynum 25

Wteam 1106

Wscore 77

Lteam 1354

Lscore 70

Wloc H

Numot 0

Name: 1, dtype: object

The enumerate() function adds a counter to an iterable and returns it in the form of an enumerate object. This enumerate
object can then be used directly in for loops or be converted into a list of tuples using the list() method.

In [213]:

l1 = ["eat", "sleep", "repeat"]
s1 = "geek"

# creating enumerate objects
obj1 = enumerate(l1)
obj2 = enumerate(s1)

print("Return type:", type(obj1))
print(list(enumerate(l1)))

# changing start index to 2 from 0
print(list(enumerate(s1, 2)))

Return type: <class 'enumerate'>

[(0, 'eat'), (1, 'sleep'), (2, 'repeat')]

[(2, 'g'), (3, 'e'), (4, 'e'), (5, 'k')]


In [214]:

# enumerate function in loops
l1 = ["eat", "sleep", "repeat"]

# printing the tuples in the object directly
for ele in enumerate(l1):
    print(ele)

# changing the start index and printing count and element separately
for count, ele in enumerate(l1, 100):
    print(count, ele)

(0, 'eat')

(1, 'sleep')

(2, 'repeat')

100 eat

101 sleep

102 repeat

Extracting Rows and Columns


The bracket indexing operator is one way to extract certain columns from a dataframe.

In [169]:

df[['Wscore', 'Lscore']].head()

Out[169]:

Wscore Lscore

0 81 64

1 77 70

2 63 56

3 70 54

4 86 74

Notice that you can achieve the same result by using the loc function. loc is a very versatile function that
can help you in a lot of accessing and extracting tasks.


In [170]:

df.loc[1:, ['Wscore', 'Lscore']].head()

Out[170]:

Wscore Lscore

1 77 70

2 63 56

3 70 54

4 86 74

5 79 78

Note the difference in the return types when you use single brackets and when you use double brackets.

In [171]:

type(df['Wscore'])

Out[171]:

pandas.core.series.Series

In [172]:

type(df[['Wscore']])

Out[172]:

pandas.core.frame.DataFrame

You've seen before that you can access columns through df['col name']. You can access rows by using slicing
operations.

In [173]:

df[0:3]

Out[173]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

0 1985 20 1228 81 1328 64 N 0

1 1985 25 1106 77 1354 70 H 0

2 1985 25 1112 63 1223 56 H 0

Here's an equivalent using iloc


In [174]:

df.iloc[0:3,:]

Out[174]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

0 1985 20 1228 81 1328 64 N 0

1 1985 25 1106 77 1354 70 H 0

2 1985 25 1112 63 1223 56 H 0

In [175]:

df.isnull().sum()

Out[175]:

Season 0

Daynum 0

Wteam 0

Wscore 0

Lteam 0

Lscore 0

Wloc 0

Numot 0

dtype: int64

If you do end up having missing values in your datasets, be sure to get familiar with these two functions.

dropna() - This function allows you to drop all (or some) of the rows that have missing values.
fillna() - This function allows you to replace the missing values with a value that you pass in.
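This dataset has no missing values, so here is a small illustrative sketch (the df_miss frame is made up for this example, not part of the original lab) of how the two functions behave:

import numpy as np
import pandas as pd

df_miss = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})

df_miss.dropna()                 # keeps only the rows with no NaN (here just the first row)
df_miss.fillna(0)                # replaces every NaN with 0
df_miss.fillna(df_miss.mean())   # or fill each column's NaN with that column's mean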

In [176]:

df.head()

Out[176]:

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

0 1985 20 1228 81 1328 64 N 0

1 1985 25 1106 77 1354 70 H 0

2 1985 25 1112 63 1223 56 H 0

3 1985 25 1165 70 1432 54 H 0

4 1985 25 1192 86 1447 74 H 0

In [177]:

import numpy as np


In [178]:

df2 = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                          "bar", "bar", "bar", "bar"],
                    "B": ["one", "one", "one", "two", "two",
                          "one", "one", "two", "two"],
                    "C": ["small", "large", "large", "small",
                          "small", "large", "small", "small",
                          "large"],
                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

In [179]:

df2

Out[179]:

A B C D E

0 foo one small 1 2

1 foo one large 2 4

2 foo one large 2 5

3 foo two small 3 5

4 foo two small 3 6

5 bar one large 4 6

6 bar one small 5 8

7 bar two small 6 9

8 bar two large 7 9

In [180]:

table = pd.pivot_table(df2,values='D', index=['A'],columns=['C'], aggfunc=np.mean)

In [181]:

table

Out[181]:

C large small

bar 5.5 5.500000

foo 2.0 2.333333

Merge
Merge DataFrame or named Series objects with a database-style join


In [182]:

# import the pandas library
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(left)
print(right)

id Name subject_id

0 1 Alex sub1

1 2 Amy sub2

2 3 Allen sub4

3 4 Alice sub6

4 5 Ayoung sub5

id Name subject_id

0 1 Billy sub2

1 2 Brian sub4

2 3 Bran sub3

3 4 Bryce sub6

4 5 Betty sub5

In [183]:

pd.merge(left,right,on='id')

Out[183]:

id Name_x subject_id_x Name_y subject_id_y

0 1 Alex sub1 Billy sub2

1 2 Amy sub2 Brian sub4

2 3 Allen sub4 Bran sub3

3 4 Alice sub6 Bryce sub6

4 5 Ayoung sub5 Betty sub5

In [184]:

pd.merge(left,right,on=['id','subject_id'])

Out[184]:

id Name_x subject_id Name_y

0 4 Alice sub6 Bryce

1 5 Ayoung sub5 Betty



(INNER) JOIN: Returns records that have matching values in both tables

LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right
table

RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left
table

FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table

In [185]:

pd.merge(left, right, on='subject_id', how='left')

Out[185]:

id_x Name_x subject_id id_y Name_y

0 1 Alex sub1 NaN NaN

1 2 Amy sub2 1.0 Billy

2 3 Allen sub4 2.0 Brian

3 4 Alice sub6 4.0 Bryce

4 5 Ayoung sub5 5.0 Betty

In [186]:

pd.merge(left, right, on='subject_id', how='right')

Out[186]:

id_x Name_x subject_id id_y Name_y

0 2.0 Amy sub2 1 Billy

1 3.0 Allen sub4 2 Brian

2 NaN NaN sub3 3 Bran

3 4.0 Alice sub6 4 Bryce

4 5.0 Ayoung sub5 5 Betty

In [187]:

pd.merge(left, right, how='outer', on='subject_id')

Out[187]:

id_x Name_x subject_id id_y Name_y

0 1.0 Alex sub1 NaN NaN

1 2.0 Amy sub2 1.0 Billy

2 3.0 Allen sub4 2.0 Brian

3 4.0 Alice sub6 4.0 Bryce

4 5.0 Ayoung sub5 5.0 Betty

5 NaN NaN sub3 3.0 Bran


In [188]:

pd.merge(left, right, on='subject_id', how='inner')

Out[188]:

id_x Name_x subject_id id_y Name_y

0 2 Amy sub2 1 Billy

1 3 Allen sub4 2 Brian

2 4 Alice sub6 4 Bryce

3 5 Ayoung sub5 5 Betty

In [189]:

df2 = df.drop_duplicates()
df2.shape

Out[189]:

(145289, 8)

In [190]:

df2 = df.sample(10)

In [191]:

df2.shape

Out[191]:

(10, 8)

Concat
Pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects

The concat() function (in the main pandas namespace) does all of the heavy lifting of performing concatenation
operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the
other axes

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False)
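The join argument's set logic is not demonstrated below; a minimal self-contained sketch (the frames a and b are made up for this example) of the difference between the default union and join='inner':

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'y': [3, 4]}, index=[1, 2])

pd.concat([a, b], axis=1)                 # outer (union) of the indexes: rows 0, 1, 2, with NaN where missing
pd.concat([a, b], axis=1, join='inner')   # intersection of the indexes: only row 1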


In [192]:

one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])

two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])

In [193]:

pd.concat([one,two])

Out[193]:

Name subject_id Marks_scored

1 Alex sub1 98

2 Amy sub2 90

3 Allen sub4 87

4 Alice sub6 69

5 Ayoung sub5 78

1 Billy sub2 89

2 Brian sub4 80

3 Bran sub3 79

4 Bryce sub6 97

5 Betty sub5 88


In [194]:

#dividing based on Keys


pd.concat([one,two],keys=['x','y'])

Out[194]:

Name subject_id Marks_scored

1 Alex sub1 98

2 Amy sub2 90

x 3 Allen sub4 87

4 Alice sub6 69

5 Ayoung sub5 78

1 Billy sub2 89

2 Brian sub4 80

y 3 Bran sub3 79

4 Bryce sub6 97

5 Betty sub5 88

In [195]:

# The index of the resultant is duplicated; each index is repeated.
# If the resultant object has to follow its own indexing, set ignore_index to True.
pd.concat([one,two],keys=['x','y'],ignore_index=True)

Out[195]:

Name subject_id Marks_scored

0 Alex sub1 98

1 Amy sub2 90

2 Allen sub4 87

3 Alice sub6 69

4 Ayoung sub5 78

5 Billy sub2 89

6 Brian sub4 80

7 Bran sub3 79

8 Bryce sub6 97

9 Betty sub5 88


In [196]:

# If two objects need to be added along axis=1 (columns), then the new columns will be appended.
pd.concat([one,two],axis=1)

Out[196]:

Name subject_id Marks_scored Name subject_id Marks_scored

1 Alex sub1 98 Billy sub2 89

2 Amy sub2 90 Brian sub4 80

3 Allen sub4 87 Bran sub3 79

4 Alice sub6 69 Bryce sub6 97

5 Ayoung sub5 78 Betty sub5 88

In [197]:

#row wise
pd.concat([one,two],axis=0)

Out[197]:

Name subject_id Marks_scored

1 Alex sub1 98

2 Amy sub2 90

3 Allen sub4 87

4 Alice sub6 69

5 Ayoung sub5 78

1 Billy sub2 89

2 Brian sub4 80

3 Bran sub3 79

4 Bryce sub6 97

5 Betty sub5 88

Melt
Pandas melt() reshapes a DataFrame from wide to long format: the value columns are unpivoted into variable/value rows, while the id_vars columns stay fixed.


In [201]:

import pandas as pd
d1 = {"Name": ["Pavan", "Balaji", "Ravi"], "ID": [1, 2, 3], "Role": ["CEO", "Editor", "Author"]}
df = pd.DataFrame(d1)
print(df)

Name ID Role

0 Pavan 1 CEO

1 Balaji 2 Editor

2 Ravi 3 Author

In [202]:

df_melted = pd.melt(df, id_vars=["ID"], value_vars=["Name", "Role"])

print(df_melted)

ID variable value

0 1 Name Pavan

1 2 Name Balaji

2 3 Name Ravi

3 1 Role CEO

4 2 Role Editor

5 3 Role Author

Chunk
The chunksize parameter of read_csv() lets pandas read large volumes of data in smaller pieces (chunks) instead of loading the whole file into memory at once.


In [215]:

import pandas as pd
for chunk in pd.read_csv("RegularSeasonCompactResults.csv", chunksize=10):
print(chunk)

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

0 1985 20 1228 81 1328 64 N 0

1 1985 25 1106 77 1354 70 H 0

2 1985 25 1112 63 1223 56 H 0

3 1985 25 1165 70 1432 54 H 0

4 1985 25 1192 86 1447 74 H 0

5 1985 25 1218 79 1337 78 H 0

6 1985 25 1228 64 1226 44 N 0

7 1985 25 1242 58 1268 56 N 0

8 1985 25 1260 98 1133 80 H 0

9 1985 25 1305 97 1424 89 H 0

Season Daynum Wteam Wscore Lteam Lscore Wloc Numot

10 1985 25 1307 103 1288 71 H 0

11 1985 25 1344 75 1438 71 N 0

12 1985 25 1374 91 1411 72 H 0

13 1985 25 1412 70 1397 65 N 0

14 1985 25 1417 87 1225 58 H 0

15 1985 26 1116 65 1368 62 H 0

16 1985 26 1120 92 1391 50 H 0

17 1985 26 1135 65 1306 60 A 0


In [ ]:
