Block 1-Data Handling Using Pandas DataFrame
Informatics Practices
DataFrame
Sometimes we need to work on multiple columns at a time, i.e., we
need to process tabular data, for example, the result of a class,
the items in a restaurant's menu, the reservation chart of a train, etc.
Pandas stores such tabular data in a DataFrame. A DataFrame is a
two-dimensional labelled data structure, like a table in MySQL. It
contains rows and columns, and therefore has both a row index and a
column index. Each column can hold a different type of value, such
as numeric, string, boolean, etc., as in the tables of a database.
Basic Features of DataFrame
1. It has two indexes: a row index and a column index.
2. The row index is known as the index and the column index is known as
the column name.
3. Index labels can be numbers, letters or strings.
4. Columns may be of different types.
5. Its size can be changed (size-mutable).
6. We can change column/row values (value-mutable).
Creating a DataFrame
Empty DataFrame
Columns: []
Index: []
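The output above is what pandas prints for a DataFrame that has no rows and no columns. A minimal sketch of how such an empty DataFrame could be created (the variable name dFrameEmpty is only illustrative):

```python
import pandas as pd

# Calling the constructor with no arguments gives an empty DataFrame
dFrameEmpty = pd.DataFrame()
print(dFrameEmpty)

# An empty DataFrame has zero rows and zero columns
print(dFrameEmpty.shape)
print(dFrameEmpty.empty)
```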
OUTPUT:
Dictionary is:
{'Student': ['a', 'b', 'c'], 'Marks': [12, 14, 15]}
DataFrame is:
  Student  Marks
0       a     12
1       b     14
2       c     15
OUTPUT:
Dictionary is:
{'Student': [1, 'b', 'abc'], 'Marks': [12, 'AB', 15]}
DataFrame is:
  Student Marks
0       1    12
1       b    AB
2     abc    15
# EG:3 Creating a DataFrame from a dictionary of lists [specifying the index]
import pandas as pd
D1={"Student":[1,"b","abc"],"Marks":[12,"AB",15]}
print("Dictionary is:\n",D1)
Df1=pd.DataFrame(D1,index=["A","B","C"])
print("DataFrame is:")
print(Df1)
OUTPUT:
Dictionary is:
{'Student': [1, 'b', 'abc'], 'Marks': [12, 'AB', 15]}
DataFrame is:
  Student Marks
A       1    12
B       b    AB
C     abc    15
import pandas as pd
D1={"Sales":{"Name":"a","Age":10},"Marketing":{"Name":"b","Age":20}}
print("Dictionary is:\n",D1)
Df1=pd.DataFrame(D1)
print("DataFrame is:")
print(Df1)
OUTPUT:
Dictionary is:
{'Sales': {'Name': 'a', 'Age': 10}, 'Marketing': {'Name': 'b',
'Age': 20}}
DataFrame is:
     Sales Marketing
Name     a         b
Age     10        20
When a DataFrame is created from a list of dictionaries, each
dictionary becomes one row, so a list of two dictionaries gives a
DataFrame with two rows. The number of columns is equal to the number
of distinct keys across all the dictionaries in the list. If the second
dictionary has three keys and the first has only two of them, there are
three columns. Also, note that NaN (Not a Number) is inserted wherever
a dictionary has no value for a column.
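A minimal sketch of this behaviour, using hypothetical data (the names listDict and dFrameListDict and the values are illustrative assumptions, not taken from the original example):

```python
import pandas as pd

# Hypothetical data: the second dictionary has one extra key ('c')
listDict = [{'a': 10, 'b': 20}, {'a': 5, 'b': 10, 'c': 20}]
dFrameListDict = pd.DataFrame(listDict)
print(dFrameListDict)

# Two dictionaries give two rows; distinct keys a, b, c give three columns
print(dFrameListDict.shape)

# The first dictionary has no 'c', so that cell holds NaN
print(dFrameListDict['c'].isna().tolist())
```

Because NaN is a floating-point value, a column that receives NaN is stored as float even when the values supplied were integers.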
For example:
>>> dictForUnion = { 'Series1' : pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e']),
...     'Series2' : pd.Series([10,20,-10,-50,100], index = ['z', 'y', 'a', 'c', 'e']),
...     'Series3' : pd.Series([10,20,-10,-50,100], index = ['z', 'y', 'a', 'c', 'e']) }
>>> dFrameUnion = pd.DataFrame(dictForUnion)
>>> dFrameUnion
   Series1  Series2  Series3
a      1.0    -10.0    -10.0
b      2.0      NaN      NaN
c      3.0    -50.0    -50.0
d      4.0      NaN      NaN
e      5.0    100.0    100.0
y      NaN     20.0     20.0
z      NaN     10.0     10.0
1. To access a row:
Dft.loc[<rowlabel>,:]
Eg: print(ResultDF.loc['Maths',:])
output:
Arnab 90
Ramit 92
Samridhi 89
Riya 81
Mallika 94
Name: Maths, dtype: int64
Output:
         Arnab  Ramit  Samridhi
Maths       90     92        89
Science     91     81        91
Index:
print(ResultDF.Arnab['Maths'])
o/p : 90
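The access patterns above can be tried on a DataFrame rebuilt from the ResultDF marks printed in this section. This is only a sketch: building ResultDF from a dictionary of lists is an assumption, and the two-row, three-column selection is one possible way to obtain a subset like the one shown above.

```python
import pandas as pd

# Marks copied from the ResultDF output shown in this section
ResultDF = pd.DataFrame(
    {'Arnab': [90, 91, 97],
     'Ramit': [92, 81, 96],
     'Samridhi': [89, 91, 88],
     'Riya': [81, 71, 67],
     'Mallika': [94, 95, 99]},
    index=['Maths', 'Science', 'Hindi'])

# Access one row by its label: the result is a Series named 'Maths'
mathsRow = ResultDF.loc['Maths', :]
print(mathsRow)

# Access several rows and columns by passing lists of labels
subset = ResultDF.loc[['Maths', 'Science'], ['Arnab', 'Ramit', 'Samridhi']]
print(subset)

# Access a single value: column first, then row label
print(ResultDF.Arnab['Maths'])

# The same value using loc[row label, column label]
print(ResultDF.loc['Maths', 'Arnab'])
```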
>>> ResultDF['Preeti']=[89,78,76]
>>> ResultDF
         Arnab  Ramit  Samridhi  Riya  Mallika  Preeti
Maths       90     92        89    81       94      89
Science     91     81        91    71       95      78
Hindi       97     96        88    67       99      76
Assigning values to a new column label that does not exist will
create a new column at the end. If the column already exists in the
DataFrame then the assignment statement will update the values of
the already existing column, for example:
>>> ResultDF['Ramit']=[99, 98, 78]
>>> ResultDF
>>> ResultDF.loc['Maths']=0
>>> ResultDF
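The effect of these assignments can be checked with a short sketch. ResultDF is rebuilt here from the marks printed earlier in this section; building it from a dictionary of lists is an assumption.

```python
import pandas as pd

# Marks copied from the ResultDF output shown in this section
ResultDF = pd.DataFrame(
    {'Arnab': [90, 91, 97],
     'Ramit': [92, 81, 96],
     'Samridhi': [89, 91, 88],
     'Riya': [81, 71, 67],
     'Mallika': [94, 95, 99]},
    index=['Maths', 'Science', 'Hindi'])

# 'Preeti' does not exist yet, so a new column is added at the end
ResultDF['Preeti'] = [89, 78, 76]

# 'Ramit' already exists, so its values are replaced
ResultDF['Ramit'] = [99, 98, 78]

# Assigning a single value to a row label sets every column of that row
ResultDF.loc['Maths'] = 0
print(ResultDF)
```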
If we try to add a row with fewer values than the number of
columns in the DataFrame, it results in a ValueError, with the error
message: ValueError: cannot set a row with mismatched columns.
Similarly, if we try to add a column with fewer values than the
number of rows in the DataFrame, it results in a ValueError, with the
error message: ValueError: Length of values does not match length
of index.
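Both errors can be reproduced with a small sketch. The DataFrame df, the row label 'English' and the list errors are hypothetical names used only for illustration, and the exact wording of the messages may differ slightly between pandas versions.

```python
import pandas as pd

# Small DataFrame with 3 rows and 3 columns
df = pd.DataFrame(
    {'Arnab': [90, 91, 97],
     'Ramit': [92, 81, 96],
     'Samridhi': [89, 91, 88]},
    index=['Maths', 'Science', 'Hindi'])

errors = []

# New row with only 2 values although there are 3 columns
try:
    df.loc['English'] = [85, 86]
except ValueError as err:
    errors.append(str(err))

# New column with only 2 values although there are 3 rows
try:
    df['Preeti'] = [89, 78]
except ValueError as err:
    errors.append(str(err))

for message in errors:
    print("ValueError:", message)
```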
If the DataFrame has more than one row with the same label, the
DataFrame.drop() method will delete all the matching rows from it.
For example, consider the following DataFrame:
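A minimal sketch of this behaviour, using a small hypothetical DataFrame in which the row label 'Hindi' appears twice (the names df and dfDropped and the marks are illustrative assumptions):

```python
import pandas as pd

# Hypothetical DataFrame with a repeated row label
df = pd.DataFrame(
    {'Arnab': [90, 91, 97, 97],
     'Ramit': [92, 81, 96, 96]},
    index=['Maths', 'Science', 'Hindi', 'Hindi'])
print(df)

# drop() returns a new DataFrame; axis=0 means the label refers to rows
dfDropped = df.drop('Hindi', axis=0)

# Both rows labelled 'Hindi' are gone
print(dfDropped)
```

Note that drop() leaves the original DataFrame unchanged unless the result is assigned back to it.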
>>> ResultDF=ResultDF.rename({'Maths':'Sub1',
... 'Science':'Sub2','English':'Sub3',
... 'Hindi':'Sub4'}, axis='index')
>>> print(ResultDF)
1] ResultDF['Arnab']>90
# the condition is applied to only one column of the DataFrame and
# returns a True/False result for each row of that column
o/p
Maths      False
Science     True
Hindi       True
Name: Arnab, dtype: bool
<dataframe obj>[condition]
Eg: print(ResultDF[ResultDF['Arnab']>90])
o/p :
         Arnab  Ramit  Samridhi  Riya  Mallika
Science     91     81        91    71       95
Hindi       97     96        88    67       99
Or
<dataframe obj>.loc[condition]
Eg: print(ResultDF.loc[ResultDF['Arnab']>90])
o/p
         Arnab  Ramit  Samridhi  Riya  Mallika
Science     91     81        91    71       95
Hindi       97     96        88    67       99
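The two filtering forms above can be compared in a short sketch. ResultDF is rebuilt from the marks printed in this section (its construction from a dictionary of lists is an assumption), and the variable name mask is illustrative.

```python
import pandas as pd

# Marks copied from the ResultDF output shown in this section
ResultDF = pd.DataFrame(
    {'Arnab': [90, 91, 97],
     'Ramit': [92, 81, 96],
     'Samridhi': [89, 91, 88],
     'Riya': [81, 71, 67],
     'Mallika': [94, 95, 99]},
    index=['Maths', 'Science', 'Hindi'])

# The comparison gives a boolean Series with one True/False per row
mask = ResultDF['Arnab'] > 90
print(mask)

# Only the rows where the mask is True are kept
filtered = ResultDF[mask]
print(filtered)

# Filtering through loc gives the same rows
filteredLoc = ResultDF.loc[mask]
print(filtered.equals(filteredLoc))
```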
import pandas as pd
Days=["Mon", "Tue", "Wed", "Thur", "Fri"]
Classes = [6,0,3,0,8]
Dc= {"Days":Days, 'No. of Classes':Classes}
Clasdf= pd.DataFrame(Dc,index=[True,False,True,False,True])
<df>.loc[True]
<df>.loc[False]
<df>.loc[1]
<df>.loc[0]
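A minimal sketch of selecting rows through a Boolean index, using the Clasdf data above. Only the True and False labels are shown; the names withClasses and noClasses are illustrative, and how the integer labels 1 and 0 are treated may depend on the pandas version.

```python
import pandas as pd

Days = ["Mon", "Tue", "Wed", "Thur", "Fri"]
Classes = [6, 0, 3, 0, 8]
Dc = {"Days": Days, "No. of Classes": Classes}

# Each row is labelled True or False instead of 0, 1, 2, ...
Clasdf = pd.DataFrame(Dc, index=[True, False, True, False, True])

# All rows whose index label is True
withClasses = Clasdf.loc[True]
print(withClasses)

# All rows whose index label is False
noClasses = Clasdf.loc[False]
print(noClasses)
```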