Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
100% found this document useful (2 votes)
204 views

Chapter-2 Python Pandas

This chapter discusses pandas, a popular Python library for data analysis. It covers pandas data structures like Series and DataFrame, as well as common operations like selecting/accessing data, descriptive statistics, pivoting, sorting, grouping, and aggregating. Functions like loc, iloc, describe(), hist(), groupby(), agg(), and transform() are explained. The chapter aims to teach readers how to efficiently work with data frames in pandas.

Uploaded by

Swarnim Jain
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
100% found this document useful (2 votes)
204 views

Chapter-2 Python Pandas

This chapter discusses pandas, a popular Python library for data analysis. It covers pandas data structures like Series and DataFrame, as well as common operations like selecting/accessing data, descriptive statistics, pivoting, sorting, grouping, and aggregating. Functions like loc, iloc, describe(), hist(), groupby(), agg(), and transform() are explained. The chapter aims to teach readers how to efficiently work with data frames in pandas.

Uploaded by

Swarnim Jain
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 33

Chapter -2

Python Pandas

Introduction :

Pandas is one of the most preferred and widely used data


science libraries. It offer efficient data structure which are not
only powerful , but also very convenient and flexible.

In this chapter we will learn :

pivoting, sorting and aggregation.

Descriptive statistics – histogram and quantiles, various


functions, reindexing and altering labels.

Pandas provide two basic data structures – Series and


DataFrame.

Series – It represents a one-dimensional array of indexed


data.

DataFrame – It store data in two-dimensional way.

Some attributes of DataFrame are :

1. index – to display index (row labels) of the DataFrame.

2. columns – to display column labels of the DataFrame.


3. axes – Return a list representing both the axes – row and
columns.

4. size – Return number of elements in DataFame

5. shape – Return dimension of DataFrame

6. values – Return DataFrame in the form of Numpy Array.

Selecting/Accessing a Column :
Selecting/Accessing a Subset from a DataFrame using
Row/Column Names :

Syntax :

<DataFrameObject>.loc[<startrow>:<endrow>]

To access selective columns use :

<DF object>.loc[:,<start column>:<end column>]


To access range of columns from a range of rows , use :

<DataFrameObject>.loc[<startrow>:<endrow>,
<startcolumn>:<endcolumn>]

Obtaining a Subset/Slice from a DataFrame using


Row/Columns Numeric Index/Position

<DF object>.iloc[<start row index>: <end row index>,


<start col index> : <end column index>]

Selecting/Accessing Individual Values :


<DF object>.<column>[<row name or row numeric index>]

Adding and Deleting Columns in DataFrames

Deleting Columns : use del statement to delete a column

Del<DF object>[<column name>]

Descriptive Statistics with Pandas :

Pandas include many useful statistical functions.

Reference dataframe namely sal_df

Functions min() and max() : The min() and max() functions


find out the minimum or maximum values respectively form a
given set of data.

Parameters :

axis : (0 or 1) by default, minimum and maximum is


calculated along axis 0.
Functions mode() , mean(), median() :

mode() – Returns the mode value (i.e., the value that appears
most often) from a set of values.

Parameters :

axis : axis 0 or ‘index’ get mode of each column

axis 1 or ‘columns’ : get mode of each row

mean() – Returns the computed mean(average) form a set of


values.

median() – returns the middle number form a set of numbers.


(2000 10000 12000 13000)

(10000+12000) =22000, 22000/2 = 11000

(2000 6000 7000 7000)

(6000 + 7000) = 13000, 13000/2 = 6500


Functions count( ) and sum( )

count( ) – counts the non-NA entries for each row or column.

sum( ) – returns the sum of values for the requested axis.

Applying Functions on a Subset of Dataframe :

Sometimes , we need to apply a function on a selective column


or row or a subset of the data frame.

Applying Function on a Column of a DataFrame:


To apply a function on a column, write –

<dataframe>[<column name>]

Applying function on Multiple Columns of a DataFrame :

<dataframe>[[<column name>, <column name>,….]]

Applying Function on a row of a DataFrame :

<dateframe>.loc[<row index>, :]

Applying Functions on a range of rows of a DataFrame:

<dataframe>.loc[<start row> : <end row>, :]


Applying functions to a subset of the DataFrame :

<dataframe>.loc[<start row>:<end row>, :


<start column> : <end column>]

Advanced Operations on DataFrame :

Three advanced operations are :

1. pivoting
2. sorting
3. aggregation
1. Pivoting : Pivoting is actually a summary technique that
works on tabular data (i.e., data in rows and columns).
Pivoting technique rearranges the data from rows and columns,
by possibly aggregating data from multiple sources, in a report
form (with rows transferred to columns) so that data can be
viewed in a different perspective.

Real life example :

An online tutoring company maintains its data about tutors and


online classes in the following table.

Using pivot Function :


Cells in the pivoted table which do not have a matching entry
in the original one are set with NaN.

Now change the rows and columns , i.e. the index and columns
arguments

We can skip the values argument:

The above data is for one quarter only. The online tutoring
company has data for the entire year as shown below :
The index i.e. , the rows are specified as ‘Tutor’ and the
columns as ‘Country’. There are multiple entries of tutor which
are very much different for same country.

Consider Tahira’s entries

Tutor Classes Quarter Country


Tahira 28 1 USA
Tahira 36 2 USA
Tahira 24 3 Barzil
Tahira 36 4 Japan

Try to create a row for tutor Tahira from the above data with
columns as Country.

USA Brazil Japan


Tahira 24 36

Using pivot_table( ) function :

Example-1 Considering the tutoring company data, compute


total classes per tutor.
Example-2 considering the tutoring company data, compute
number of countries (count) per tutor.

Example -3 Considering the tutoring company data, compute


total classes by country.

Example-4 Considering the tutoring company data, compute


total classes on two fields, tutor and country wise.

Example 2.5 Considering the tutoring company data, compute


average classes on two fields
Sorting : Sorting refers to arranging values in a particular
order.

sort_values( ) – function arrange the values in ascending or


descending order.
Creating Histogram :

Histogram – A histogram is a plot that lets you discover, and


show the underlying frequency distribution(shape) of a set of
continuous data.

hist( ) function of pandas is used to create histogram.

Consider the following histogram that has been computed using


the following datasets containing age of 20 people.

37 28 38 44 53 69 74 53 35 38 66 46 24 45 92 48 51 62 58 57
Bin Frequency Age included in Bin

20-30 2 28,24

30-40 4 37,38,35,38

40-50 4 44,46,45,48

50-60 5 53,53,51,56,57

60-70 3 69,66,92

70-80 1 74

80-90 0 --

90-100 1 92
Function Application : It means that a function (a library
function or user defined function ) may be applied on a
dataframe in multiple ways:

(a) on the whole dataframe


(b) row-wise or column-wise
(c) on individual elements, i.e. element-wise

for the above mentioned three types of function application,


Pandas offers following three functions :

(a) pipe() – dataframe wise function application.


(b) apply() – row-wise/column-wise function application.
(c) applymap() – individual elements-wise function application.

(a) pipe( ) function : The piping of functions through pipe( )


basically means the chaining of functions in the order they are
executed.
pipe() Example 1 Function add( ) followed by multiply( )
applied on a dataframe.

The apply and applymap() functions :

apply () – apply is a series function, so it applies the given


function to one row or one column of the dataframe (as single
row/columns of a dataframe is equivalent to a series).

Syntax - <dataframe>.apply(<funcname>,axis = 0)

axis 0 or 1 default 0 , axis along with the function is applied.

If axis is 0 or ‘index’ : function is applied on each column

If axis is 1 or ‘columns’ : function is applied on each row.

applymap() – is an element function, so it applies the


given function to each individual elements, separately.

Syntax - <dataframe>.applymap(<funcname>)
To apply, apply() row-wise write :

<dataframe>.apply(<func>, axis = 1)

NOTE – The apply() will apply the function on individual


columns/rows, only if the passed function name is a Series
function. If you pass a single value function, then apply() will
behave like applymap()
Function groupby( ) :

The duplicate values in the same filed are grouped together to


form groups, e.g. for creating Tutor wise groups :

All the rows having Tutor as Tahira will be clubbed to form


Tahira group.

groupby() function - is used to create group for the


duplicate values in the same filed.
The groupby() creates the group internally and does not
display the grouped data by default.
Grouping on Multiple Columns : Create a group for Tutor
and for each tutor group, a Country-wise subgroup :

Example : to get a group having tutor name as ‘Anusha’ and


Country as ‘UK’ write :

Aggregation via groupby () :

agg( ) method – aggregates the data of the dataframe using


one or more operations over the specified axis.
Syntax - <dataframe>.agg(func, axis = 0)

mean :-

36,40,30,32

(36+40+30+32)/4 =
34.5

median :-

30 32 36 40

n/2 = 4/2 = 2

(n/2) + 1 = (4/2) + 1
=3

(32 + 36) / 2 = 34

Sum : (36+40+30+32)
= 138

We may combine the groupby( ) and agg( ) in single command


:
The transform( ) function : This function transforms the
aggregate data by repeating the summary result for each row
of the group and make the result have the same shape as
original data.

Q- What if we want to add this aggregate data to the


dataframe itself?
Reindexing and Altering Labels : The methods provided by
Pandas for reindexing and relabeling are :

1. rename( ) – simply rename the index and/or column labels


in a dataframe.
(ii) reindex( ) – specify the new order of existing indexes
and column labels, and/or also create new indexes/column
labels.

(a) Reordering the existing indexes using reindex( )

Adding indexes :
(iii) reindex_like() – for creating indexes/column-labels
based on other dataframe object.

<dataframe>.reindex_like(other)
Solved Problems :

1. Consider the following code to create two dataframes with


similar values. What will be printed by the code given below ?
Justify your answer.

import pandas as pd

df1 = pd.DataFrame([1,2,3])

df2 = pd.DataFrame([[1,2,3]])

print("df1")

print(df1)

print("df2")

print(df2)

Ans :

You might also like