12 IP Unit 1 Python Pandas I (Part 3 Dataframes) Notes
INTRODUCTION TO DATAFRAME
A DataFrame is a two-dimensional, labeled, heterogeneous data structure that is both
value-mutable and size-mutable. It is one of the most widely used and most important
data structures in pandas. The data in a DataFrame is arranged in a tabular fashion in
rows and columns, so it has both row labels and column labels. Each column can hold a
different type of value, such as numeric, string, boolean, etc.
For example:
ID NAME DEPT SEX EXPERIENCE
101 JOHN ENT M 12
104 SMITH ORTHOPEDIC M 5
107 GEORGE CARDIOLOGY M 10
109 LARA SKIN F 3
113 GEORGE MEDICINE F 9
115 JOHNSON ORTHOPEDIC M 10
The above table describes data of doctors in the form of rows and columns. Here the
vertical subsets are columns and the horizontal subsets are rows. The column labels are
ID, NAME, DEPT, SEX and EXPERIENCE, and the row labels are 101, 104, 107, 109, 113 and
115.
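The doctor table above can be built as a DataFrame. The following is a minimal sketch, assuming that ID is used as the row index (as the row labels above suggest) and that the variable name doctors is only illustrative:

```python
import pandas as pd

# Doctor records from the table above; ID is used as the row label (assumption).
doctors = pd.DataFrame(
    {
        "NAME": ["JOHN", "SMITH", "GEORGE", "LARA", "GEORGE", "JOHNSON"],
        "DEPT": ["ENT", "ORTHOPEDIC", "CARDIOLOGY", "SKIN", "MEDICINE", "ORTHOPEDIC"],
        "SEX": ["M", "M", "M", "F", "F", "M"],
        "EXPERIENCE": [12, 5, 10, 3, 9, 10],
    },
    index=pd.Index([101, 104, 107, 109, 113, 115], name="ID"),
)

print(doctors)
print(doctors.dtypes)  # each column keeps its own data type
```

The dtypes line shows that the text columns and the numeric column are stored with different types, which is what heterogeneous means here.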
CREATING DATAFRAME
SYNTAX:
import pandas as pd
pd.DataFrame(data, index, columns)
where
data: takes various forms like series, list, constants/scalar values, dictionary, another
dataframe
index: specifies the index/row labels to be used for the resulting frame. The labels must
be hashable and should be unique; for list data their number must equal the number of
rows. Default is np.arange(n) if no index is passed.
columns: specifies the column labels to be used for the resulting frame. The labels must
be hashable and should be unique; for list data their number must equal the number of
columns. Default is np.arange(n) if no column labels are passed.
EXAMPLE 1
import pandas as pd
dfempty = pd.DataFrame()
print (dfempty)
OUTPUT:
Empty DataFrame
Columns: []
Index: []
EXAMPLE 2
import pandas as pd
l1=[{101:"Amit",102:"Binay",103:"Chahal"}, {102:"Arjun",103:"Fazal"}]
df=pd.DataFrame(l1)
print(df)
OUTPUT:
101 102 103
0 Amit Binay Chahal
1 NaN Arjun Fazal
Explanation:
Here, the dictionary keys are treated as column labels and row labels take default
values starting from zero. The values corresponding to each key are treated as rows.
The number of rows is equal to the number of dictionaries present in the list. There
are two rows in the above dataframe as there are two dictionaries in the list. In the
second row the value corresponding to key 101 is NaN because 101 key is missing in
the second dictionary.
Note: If the value for a particular key is missing NaN (Not a Number) is inserted at
that place.
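The missing value described in the note can be checked directly. This is a small sketch that reuses the list from Example 2; the variable name rows is illustrative:

```python
import pandas as pd

rows = [{101: "Amit", 102: "Binay", 103: "Chahal"}, {102: "Arjun", 103: "Fazal"}]
df = pd.DataFrame(rows)

print(df.isna())        # True marks the position where NaN was inserted
print(df.isna().sum())  # number of missing values in each column
```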
EXAMPLE 3
import pandas as pd
a=[[1,"Amit"],[2,"Chetan"],[3,"Rajat"],[4,"Vimal"]]
b=pd.DataFrame(a)
print(b)
OUTPUT:
0 1
0 1 Amit
1 2 Chetan
2 3 Rajat
3 4 Vimal
Here, row and column labels take default values. Lists form the rows of the
dataframe. Column labels and row index can also be changed in the following way:
b=pd.DataFrame(a,index=['I','II','III','IV'],columns=['Rollno','Name'])
Now the output will be:
OUTPUT:
Rollno Name
I 1 Amit
II 2 Chetan
III 3 Rajat
IV 4 Vimal
EXAMPLE 4
import pandas as pd
d1={"Rollno":[1,2,3], "Total":[350.5,400,420],
"Percentage":[70,80,84]}
df1=pd.DataFrame(d1)
print(df1)
OUTPUT
Rollno Total Percentage
0 1 350.5 70
1 2 400.0 80
2 3 420.0 84
Explanation:
Here, the dictionary keys are treated as column labels and row labels take default
values starting from zero.
EXAMPLE 5
import pandas as pd
d2={"Rollno":pd.Series([1,2,3,4]), "Total":pd.Series([350.5,400,420]),
"Percentage":pd.Series([70,80,84,80])}
df2=pd.DataFrame(d2)
print(df2)
OUTPUT
Rollno Total Percentage
0 1 350.5 70
1 2 400.0 80
2 3 420.0 84
3 4 NaN 80
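When the Series in the dictionary have different lengths, the shorter ones are padded with NaN. A short sketch of how this can be verified, using the same dictionary as above:

```python
import pandas as pd

d2 = {
    "Rollno": pd.Series([1, 2, 3, 4]),
    "Total": pd.Series([350.5, 400, 420]),
    "Percentage": pd.Series([70, 80, 84, 80]),
}
df2 = pd.DataFrame(d2)

print(len(df2))              # 4 rows: the union of the Series indexes 0 to 3
print(df2["Total"].isna())   # only the last row of Total is missing
```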
The index and columns parameters can be used to change the default row and
column labels. Consider a dataframe created in example 3 above:
0 1
0 1 Amit
1 2 Chetan
2 3 Rajat
3 4 Vimal
Suppose we want to have "Roll No." and "Name" as the column labels and numbers
from 1 to 4 as the row labels.
EXAMPLE 6:
import pandas as pd
a=[[1,"Amit"],[2,"Chetan"],[3,"Rajat"],[4,"Vimal"]]
b=pd.DataFrame(a,index=[1,2,3,4],columns=["Roll No.","Name"])
print(b)
OUTPUT:
Roll No. Name
1 1 Amit
2 2 Chetan
3 3 Rajat
4 4 Vimal
EXPLANATION:
Here you can see that using index parameter the row labels are changed to 1,2,3 and
4. Similarly using columns parameter the column labels are changed to "Roll No." and
"Name".
One can easily change the default order of row labels to user defined row labels using
the index parameter. It can be used to select only desired rows instead of all rows.
Also, the columns parameter in the DataFrame() method can be used to change the
sequence of DataFrame columns or to display only selected columns.
EXAMPLE 7
import pandas as pd
d2={"Rollno":pd.Series([1,2,3,4],index=[1,2,3,4]),
"Total":pd.Series([350.5,400,420],index=[1,2,4]),
"Percentage":pd.Series([70,80,84,80],index=[1,2,3,4])
}
df3=pd.DataFrame(d2)
print(df3)
OUTPUT:
Rollno Total Percentage
1 1 350.5 70
2 2 400.0 80
3 3 NaN 84
4 4 420.0 80
EXPLANATION
Here, the dictionary keys are treated as column labels and the row labels are a union
of all the series indexes passed to create the DataFrame. Every DataFrame column is
a Series object.
EXAMPLE 8
>>>df4=pd.DataFrame(d2, columns=["Rollno","Total"])
>>>df4
OUTPUT
Rollno Total
1 1 350.5
2 2 400.0
3 3 NaN
4 4 420.0
EXPLANATION
Here, only two columns, Rollno and Total, are there in df4. All rows are displayed.
EXAMPLE 9
>>>df5=pd.DataFrame(d2, index=[1,2,3], columns=["Rollno","Percentage"])
>>>df5
OUTPUT:
Rollno Percentage
1 1 70
2 2 80
3 3 84
EXPLANATION
Here, in df5 dataframe only two columns Rollno and Percentage are stored and displayed.
Only rows with index values 1,2 and 3 are displayed.
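The index and columns parameters can also reorder columns or ask for a label that does not exist in the data. The sketch below assumes the dictionary d2 from Example 7; the name df6 and the label 5 are illustrative only:

```python
import pandas as pd

d2 = {
    "Rollno": pd.Series([1, 2, 3, 4], index=[1, 2, 3, 4]),
    "Total": pd.Series([350.5, 400, 420], index=[1, 2, 4]),
    "Percentage": pd.Series([70, 80, 84, 80], index=[1, 2, 3, 4]),
}

# Columns are listed in a new order; row label 5 is not present in any Series,
# so that row is filled with NaN.
df6 = pd.DataFrame(d2, index=[2, 5], columns=["Percentage", "Rollno"])
print(df6)
```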
DATAFRAME ATTRIBUTES
1 1 350.5 70
2 2 400.0 80
3 3 420.0 84
4 4 356.0 80
5 5 434.0 87
6 6 398.0 79
Attributes
1. df1.size
OUTPUT
18
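EXPLANATION:
The size attribute returns the total number of elements, i.e. number of rows x number of
columns (6 x 3 = 18).
2. df1.shape
OUTPUT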
(6, 3)
3. df1.axes
OUTPUT
OUTPUT
5. df1.columns
OUTPUT
6. df1.values
OUTPUT
[[ 1. 350.5 70. ]
[ 2. 400. 80. ]
[ 3. 420. 84. ]
[ 4. 356. 80. ]
[ 5. 434. 87. ]
[ 6. 398. 79. ]]
7. df1.empty
OUTPUT:
False
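The attributes listed above can be tried together on one DataFrame. This is a sketch only: the column labels Rollno, Total and Percentage are assumed, since the header row of the table is not shown in this section, and dtypes is included as one more commonly used attribute.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Rollno": [1, 2, 3, 4, 5, 6],                          # assumed column label
        "Total": [350.5, 400.0, 420.0, 356.0, 434.0, 398.0],   # assumed column label
        "Percentage": [70, 80, 84, 80, 87, 79],                # assumed column label
    },
    index=[1, 2, 3, 4, 5, 6],
)

print(df1.size)     # total number of elements (rows x columns)
print(df1.shape)    # tuple of (rows, columns)
print(df1.axes)     # list holding the row index and the column index
print(df1.index)    # row labels
print(df1.columns)  # column labels
print(df1.dtypes)   # data type of each column
print(df1.values)   # the data as a 2D NumPy array
print(df1.empty)    # True only when the DataFrame has no elements
```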
ROW/COLUMN OPERATIONS
SELECTING A PARTICULAR COLUMN
EXAMPLE
EXAMPLE 10
>>> df1["Grade"] = ["C","B","A"]
>>> df1
OUTPUT
Rollno Total Percentage Grade
0 1 350.5 70 C
1 2 400.0 80 B
2 3 420.0 84 A
Note: If the column already exists by the same label then the values of that column
are updated by the new values.
IMPORTANT:
A column can also be added using the loc[] indexer.
A new row can be added to a dataframe using the loc[] indexer.
OUTPUT
0 1 350.5 70 C
1 2 400.0 80 B
2 3 420.0 84 A
EXAMPLE 12
>>> df1.loc[3]=[4,480.0,96,"A"]
>>> df1
OUTPUT
Rollno Total Percentage Grade
0 1 350.5 70 C
1 2 400.0 80 B
2 3 420.0 84 A
3 4 480.0 96 A
Note: If the row exists by the same index then the values of that row are updated
by the new values.
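Adding a column and a row with loc[], and updating an existing row, can be sketched as follows. The values in the updated row are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Rollno": [1, 2, 3], "Total": [350.5, 400.0, 420.0], "Percentage": [70, 80, 84]}
)

# Add a column: all rows (:) and a new column label.
df1.loc[:, "Grade"] = ["C", "B", "A"]

# Add a row: 3 is a new row label, so the DataFrame grows by one row.
df1.loc[3] = [4, 480.0, 96, "A"]

# Label 3 now exists, so this assignment updates the row instead of adding one.
df1.loc[3] = [4, 485.0, 97, "A"]  # illustrative values

print(df1)
```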
DELETING ROW/COLUMN
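The drop() method is used to delete rows or columns from a DataFrame. It returns a new
DataFrame; the original is left unchanged unless the result is assigned back.
Important: For rows, axis is set to 0 and for columns, axis is set to 1.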
Consider dataframe df1, to delete the column Grade the command would be:
EXAMPLE 13
>>> df1.drop("Grade", axis=1)
OUTPUT
Rollno Total Percentage
0 1 350.5 70
1 2 400.0 80
2 3 420.0 84
3 4 480.0 96
EXAMPLE 14
>>> df1.drop(2, axis=0)
OUTPUT
Rollno Total Percentage Grade
0 1 350.5 70 C
1 2 400.0 80 B
3 4 480.0 96 A
0 1 350.5 70 C
3 4 480.0 96 A
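The last output above keeps only rows 0 and 3. One way to get such a result is to pass a list of row labels to drop(); this is a sketch and may not be the exact command the notes used. The names no_grade and two_rows are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Rollno": [1, 2, 3, 4],
        "Total": [350.5, 400.0, 420.0, 480.0],
        "Percentage": [70, 80, 84, 96],
        "Grade": ["C", "B", "A", "A"],
    }
)

no_grade = df1.drop("Grade", axis=1)   # delete one column
two_rows = df1.drop([1, 2], axis=0)    # delete several rows by label

print(two_rows)
print(df1.shape)  # df1 itself is unchanged because drop() returns a new DataFrame
```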
RENAMING ROW/COLUMN
The DataFrame.rename() method is used to rename the labels of rows and columns
in a DataFrame.
Amit 1 350.5 70
Bunty 2 400.0 80
Chetan 3 420.0 84
Reena 4 356.0 80
OUTPUT
Roll Tot Per
Amit 1 350.5 70
Bunty 2 400.0 80
Chetan 3 420.0 84
Reena 4 356.0 80
To rename all the rows labels the command would be:
EXAMPLE 17
OUTPUT
Amit 1 350.5 70
Bunty 2 400.0 80
Chetan 3 420.0 84
Reena 4 356.0 80
Note: If a new label is not given for an old label, the old label is left as it is. Also,
if the rename() method is given labels that do not exist in the DataFrame, they are simply
ignored.
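The commands for the rename examples are not shown above, so the following is only a sketch of how rename() is typically called with a dictionary of old and new labels. The labels Marks and "Amit K" are made up for illustration.

```python
import pandas as pd

df2 = pd.DataFrame(
    {"Rollno": [1, 2, 3, 4], "Total": [350.5, 400.0, 420.0, 356.0], "Percentage": [70, 80, 84, 80]},
    index=["Amit", "Bunty", "Chetan", "Reena"],
)

# Rename columns. "Marks" does not exist in df2, so that entry is ignored.
renamed_cols = df2.rename(
    columns={"Rollno": "Roll", "Total": "Tot", "Percentage": "Per", "Marks": "M"}
)

# Rename only one row label; the other row labels are left as they are.
renamed_rows = df2.rename(index={"Amit": "Amit K"})

print(renamed_cols)
print(renamed_rows)
```

Like drop(), rename() returns a new DataFrame here, so df2 keeps its original labels.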
ACCESSING DATAFRAMES
The data present in the DataFrames can be accessed using indexing and slicing.
INDEXING
Indexing in DataFrames is of two types:
Label based Indexing
Boolean Indexing.
The loc[] indexer is used for label based indexing. It accepts row/column labels inside
square brackets.
SYNTAX
dataframe_name.loc[startrow:endrow, startcolumn:endcolumn]
To access a row, just give the row name/label: DF.loc[row label,:]
Make sure not to miss the COLON AFTER COMMA.
EXAMPLE: To display the row with row label 2, the command is:
df1.loc[2,:]
or
df1.loc[2]
EXAMPLE 18
>>> df1.loc[2]
OUTPUT
Rollno 3
Total 420
Percentage 84
Grade A
Here, you can see that passing a single row label returns that particular row as a Series.
Similarly, passing a single column label returns that particular column as a Series. For
example, to display the column with column label Rollno the command is:
>>> df1.loc[:,"Rollno"]
OUTPUT
0 1
1 2
2 3
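Boolean indexing, the second type of indexing listed earlier, selects the rows for which a condition is True. A minimal sketch, assuming a small df1 with the column labels used in the examples; the name mask is illustrative:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Rollno": [1, 2, 3], "Total": [350.5, 400.0, 420.0], "Percentage": [70, 80, 84]}
)

mask = df1["Percentage"] >= 80   # Series of True/False values, one per row
print(mask)
print(df1.loc[mask])             # rows where the condition is True
print(df1.loc[mask, "Rollno"])   # the same rows, but only the Rollno column
```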
1 1 350.5 70
2 2 400.0 80
3 3 420.0 84
4 4 356.0 80
5 5 434.0 87
6 6 398.0 79
SLICING ROWS
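Note: Slicing with df1[start:stop] selects rows by position (counted from 0), not by row
label, and the stop position is not included.
df1[2:4]
3 3 420.0 84
4 4 356.0 80
df1[:4]
1 1 350.5 70
2 2 400.0 80
3 3 420.0 84
4 4 356.0 80
df1[::3]
1 1 350.5 70
4 4 356.0 80
df1[:: -3]
6 6 398.0 79
3 3 420.0 84
df1[3:]
Rollno Total Per
4 4 356.0 80
5 5 434.0 87
6 6 398.0 79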
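Because the row labels of df1 start at 1 while positions start at 0, it is easy to confuse the two. The sketch below compares a plain slice (by position) with a loc[] slice (by label); the column labels are assumed:

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Rollno": [1, 2, 3, 4, 5, 6],
        "Total": [350.5, 400.0, 420.0, 356.0, 434.0, 398.0],
        "Percentage": [70, 80, 84, 80, 87, 79],
    },
    index=[1, 2, 3, 4, 5, 6],
)

by_position = df1[2:4]     # rows at positions 2 and 3, i.e. labels 3 and 4
by_label = df1.loc[2:4]    # rows with labels 2, 3 and 4 (end label included)

print(by_position.index.tolist())  # [3, 4]
print(by_label.index.tolist())     # [2, 3, 4]
```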
SLICING COLUMNS
df1['Total']
or
df1.Total
1 350.5
2 400.0
3 420.0
4 356.0
5 434.0
6 398.0
1 1 350.5
2 2 400.0
3 3 420.0
4 4 356.0
5 5 434.0
6 6 398.0
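The two-column output above has no command next to it. One common way to select several columns is to pass a list of column labels, sketched here; this may not be the exact command the notes used.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Rollno": [1, 2, 3, 4, 5, 6],
        "Total": [350.5, 400.0, 420.0, 356.0, 434.0, 398.0],
        "Percentage": [70, 80, 84, 80, 87, 79],
    },
    index=[1, 2, 3, 4, 5, 6],
)

one_col = df1["Total"]               # single label: returns a Series
two_cols = df1[["Rollno", "Total"]]  # list of labels: returns a DataFrame
one_col_df = df1[["Total"]]          # list with one label: one-column DataFrame

print(type(one_col).__name__, type(two_cols).__name__, type(one_col_df).__name__)
```

Note that df1.Total only works when the column label is a valid Python name and does not clash with an existing DataFrame attribute, so df1['Total'] is the safer form.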
The loc[] indexer is used to retrieve a group of rows and columns by labels or by a
boolean array. It accepts index labels only; if the labels exist in the DataFrame, it
returns the matching rows, columns, or a smaller DataFrame. In a label slice, both the
start label and the end label are included.
SYNTAX:
dataframe_name.loc[start_row : end_row, start_column : end_column]
EXAMPLE 22
>>> df1.loc[1:2]
OUTPUT
1 2 400.0 80
2 3 420.0 84
EXAMPLE 23
>>> df1.loc[:,"Rollno":"Total"]
OUTPUT
Rollno Total
0 1 350.5
1 2 400.0
2 3 420.0
EXAMPLE 24
>>> df2.loc['Amit':'Chetan']
OUTPUT
Amit 1 350.5 70
Bunty 2 400.0 80
Chetan 3 420.0 84
EXAMPLE 25
>>> df2.loc['Amit':'Chetan','Rollno':'Total']
OUTPUT
Rollno Total
Amit 1 350.5
Bunty 2 400.0
Chetan 3 420.0
EXAMPLE 26
>>> df2.loc['Amit':'Chetan',['Rollno','Percentage']]
OUTPUT
Rollno Percentage
Amit 1 70
Bunty 2 80
Chetan 3 84
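Examples 24 to 26 can be reproduced with the sketch below, which assumes df2 is the DataFrame with name labels used in the renaming section. The last selection, with a list of row labels in a different order, is an extra illustration and not from the notes.

```python
import pandas as pd

df2 = pd.DataFrame(
    {"Rollno": [1, 2, 3, 4], "Total": [350.5, 400.0, 420.0, 356.0], "Percentage": [70, 80, 84, 80]},
    index=["Amit", "Bunty", "Chetan", "Reena"],
)

# Label slices include both ends: Amit to Chetan, Rollno to Total.
part = df2.loc["Amit":"Chetan", "Rollno":"Total"]

# Lists of labels pick exactly the named rows/columns, in the order given.
picked = df2.loc[["Reena", "Amit"], ["Percentage"]]

print(part)
print(picked)
```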
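The iloc[] indexer is used for integer position based indexing. Positions start from 0
and, unlike loc[], the end position of a slice is not included.
SYNTAX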
dataframe_name.iloc[start_row : end_row, start_column : end_column]
Amit 1 350.5 70
Bimal 2 400.0 80
Chetan 3 420.0 84
Harshit 4 356.0 80
Seema 5 434.0 87
Ravi 6 398.0 79
Command
df1.loc['Amit':'Chetan']
OUTPUT
Amit 1 350.5 70
Bimal 2 400.0 80
Chetan 3 420.0 84
df1.loc[:,"Rollno":"Total"]
OUTPUT
Rollno Total
Amit 1 350.5
Bimal 2 400.0
Chetan 3 420.0
Harshit 4 356.0
Seema 5 434.0
Ravi 6 398.0
Here, all rows and columns with labels Rollno to Total are displayed.
df1.loc['Amit':'Chetan','Rollno':'Percentage']
OUTPUT
Amit 1 350.5 70
Bimal 2 400.0 80
Chetan 3 420.0 84
Here, rows from Amit to Chetan and columns from Rollno to Percentage are
displayed.
df1.loc['Amit':'Chetan',['Rollno','Percentage']]
OUTPUT
Rollno Percentage
Amit 1 70
Bimal 2 80
Chetan 3 84
Here, rows from Amit to Chetan and columns Rollno and Percentage are displayed.
df1.iloc[1:3]
OUTPUT
Bimal 2 400.0 80
Chetan 3 420.0 84
OUTPUT
Amit 1 350.5 70
Chetan 3 420.0 84
Seema 5 434.0 87
df1.iloc[:,1:3]
OUTPUT
Total Percentage
Amit 350.5 70
Bimal 400.0 80
Chetan 420.0 84
Harshit 356.0 80
Seema 434.0 87
Ravi 398.0 79
Here, all rows and the columns at positions 1 and 2 are displayed. Position 3, the end
of the slice, is not included.
df1.iloc[1:2,1:3]
OUTPUT
Total Percentage
Bimal 400.0 80
Here, row 1 and columns 1 and 2 are displayed.
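The iloc[] selections can be checked with the sketch below. The output above that shows Amit, Chetan and Seema has no command next to it; a step slice such as iloc[::2] is one way to obtain those rows and is shown here only as an assumption.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Rollno": [1, 2, 3, 4, 5, 6],
        "Total": [350.5, 400.0, 420.0, 356.0, 434.0, 398.0],
        "Percentage": [70, 80, 84, 80, 87, 79],
    },
    index=["Amit", "Bimal", "Chetan", "Harshit", "Seema", "Ravi"],
)

rows_1_2 = df1.iloc[1:3]        # positions 1 and 2: Bimal and Chetan
every_2nd = df1.iloc[::2]       # positions 0, 2, 4 (assumed command for that output)
cols_1_2 = df1.iloc[:, 1:3]     # all rows, columns at positions 1 and 2
one_cell_row = df1.iloc[1:2, 1:3]  # row at position 1, columns at positions 1 and 2

print(rows_1_2)
print(every_2nd)
print(cols_1_2)
print(one_cell_row)
```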