Data Munging in Python Using Pandas PDF

1) The document discusses using Python and the Pandas library to analyze and "munge" or clean a Titanic passenger dataset. It contains missing and erroneous values that need cleaning. 2) The author extracts title/salutation values from names, identifies the most common titles (Mr, Mrs, Miss, Master), and groups rare titles into an "Others" category. Boxplots show age varies by title. 3) A pivot table is created with median age values for each combination of passenger class, gender, and title. A function fills missing age values using the appropriate median from the pivot table.

Uploaded by

Teodor von Burg

10/6/2016

Data Munging in Python Using Pandas

Time flies by! I see Jenika (my daughter) running around the entire house and my office now. She
still slips and trips, but she is now independent enough to explore the world and figure out new
stuff on her own. I hope I have been able to inspire similar confidence in the use of Python for
data analysis among the followers of this series.
For those who have been following, here is a pair of shoes for you to start running!

By the end of this tutorial, you will also have all the tools necessary to perform any data analysis
by yourself using Python.

Recap: Getting the basics right
In the previous posts in this series, we downloaded and set up a Python installation, got
introduced to several useful libraries and data structures, and finally started with an exploratory
analysis in Python (using Pandas).
In this tutorial, we will continue our journey from where we left off in the last tutorial: we have a
reasonable idea about the characteristics of the dataset we are working on. If you have not gone
through the previous article in the series, kindly do so before proceeding further.

http://www.analyticsvidhya.com/blog/2014/09/datamungingpythonusingpandasbabystepspython/

Data munging: recap of the need
During our exploration of the data, we found a few problems in the dataset, which need to be solved
before the data is ready for a good model. This exercise is typically referred to as Data Munging.
Here are the problems we are already aware of:
1. About 20% (177 out of 891) of values in Age are missing. We expect age to play an important role
and hence would want to estimate this in some manner.
2. While looking at the distributions, we saw that Fare seemed to contain extreme values at either
end: a few tickets were probably provided free or contained a data entry error. On the other hand,
$512 sounds like a very high fare for booking a ticket.
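
A quick way to look at both ends of such a column is sketched below. The fares are invented for illustration (they are not read from the Titanic file), and the 95th percentile is only an assumed cut-off for "extreme", not a rule from this tutorial.

```python
import pandas as pd

# Invented fares: two zero-priced tickets and one very large value.
fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.55, 0.0, 512.33])

# Tickets recorded with a fare of exactly zero.
n_free = (fares == 0).sum()

# Largest fare in the sample.
highest = fares.max()

# Fares above the 95th percentile (an assumed threshold for "extreme").
threshold = fares.quantile(0.95)
extreme = fares[fares > threshold]

print(n_free, highest)
print(extreme)
```

Whether such values are errors or genuine fares still needs a judgment call; the check only points out where to look.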

In addition to these problems with numerical fields, we should also look at the non-numerical fields,
i.e. Name, Ticket and Cabin, to see if they contain any useful information.

Check missing values in the dataset
Let us look at Cabin to start with. A first glance at the variable leaves us with the impression that
there are too many NaNs in the dataset. So, let us check the number of nulls/NaNs in the dataset:

sum(df['Cabin'].isnull())

This command tells us the number of missing values, as isnull() returns True (counted as 1 by
sum()) if the value is null. The output is 687, which is a lot of missing values (about 77% of the
891 rows). So, we'll need to drop this variable.
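
The same check can be run over every column at once. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from the Titanic file) and relies on the fact that summing booleans counts True as 1.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic DataFrame (values are invented).
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks missing cells with True; sum() counts them per column.
missing_counts = toy.isnull().sum()

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

Looking at the share rather than the raw count makes it easier to compare columns when deciding which ones to drop.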

Next, let us look at the variable Ticket. Ticket looks to have a mix of numbers and text and doesn't
seem to contain any information, so we will drop Ticket as well.

df = df.drop(['Ticket', 'Cabin'], axis=1)

How to fill missing values in Age?
There are numerous ways to fill the missing values of Age, the simplest being replacement by the
mean, which can be done with the following code:

import numpy as np
meanAge = np.mean(df.Age)  # NaN values are skipped when averaging a pandas Series
df.Age = df.Age.fillna(meanAge)
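
To see what this replacement does, here is a minimal sketch on a made-up Series (the ages are invented). It also fills with the median for comparison; that comparison is an addition for illustration, since a single very high value pulls the mean more than the median.

```python
import numpy as np
import pandas as pd

# Invented ages with two gaps and one unusually old passenger.
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan, 80.0])

# mean() and median() skip NaN by default.
mean_age = ages.mean()      # (20 + 30 + 40 + 80) / 4
median_age = ages.median()  # midpoint of 30 and 40

filled_with_mean = ages.fillna(mean_age)
filled_with_median = ages.fillna(median_age)

print(filled_with_mean.tolist())
print(filled_with_median.tolist())
```

Either way, every missing passenger receives the same value, which is the limitation the rest of this section tries to reduce.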

The other extreme could be to build a supervised learning model to predict age on the basis of other
variables and then use age along with other variables to predict survival.
Since the purpose of this tutorial is to bring out the steps in data munging, I'll rather take an
approach which lies somewhere in between these 2 extremes. The key hypothesis is that the
salutations in Name, Gender and Pclass combined can provide us with the information required to
fill in the missing values to a large extent.
Here are the steps required to work on this hypothesis:
Step 1: Extracting salutations from Name

Let us define a function which extracts the salutation from a Name written in this format:
Family_Name, Salutation. First Name

def name_extract(word):
    return word.split(',')[1].split('.')[0].strip()

This function takes a Name, splits it by a comma (,) and keeps the part after the comma, then splits
that by a dot (.), keeps the part before the dot and removes the surrounding whitespace. The output
of calling the function with 'Jain, Mr. Kunal' would be Mr, and with 'Jain, Miss. Jenika' it would
be Miss.
Next, we apply this function to the entire column using the apply() function and convert the
outcome to a new DataFrame df2:

df2 = pd.DataFrame({'Salutation': df['Name'].apply(name_extract)})
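
Before running the extraction on the full column, it can help to try it on a handful of names. The sketch below repeats name_extract so that it runs on its own and uses invented names in the same format. It also shows a vectorised alternative with Series.str.extract; the regular expression there is an assumption about the name format, not part of the original code.

```python
import pandas as pd

def name_extract(word):
    # Text between the first comma and the following dot is the salutation.
    return word.split(',')[1].split('.')[0].strip()

# Invented names in the "Family_Name, Salutation. First Name" format.
names = pd.Series(['Jain, Mr. Kunal', 'Jain, Miss. Jenika', 'Doe, Dr. Jane'])

by_apply = names.apply(name_extract)

# Same idea without a Python-level function: capture what lies between
# the comma and the next dot (the pattern is an assumption about the format).
by_regex = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

print(by_apply.tolist())
print(by_apply.value_counts())
```

value_counts() gives a tally of each salutation directly from the Series, without needing a merge first.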

Once we have the Salutations, let us look at their distribution. We use the good old groupby after
mergingtheDataFramedf2withDataFramedf:

df = pd.merge(df, df2, left_index=True, right_index=True)  # merges on index
temp1 = df.groupby('Salutation').PassengerId.count()
print(temp1)

Following is the output:

Salutation
Capt              1
Col               2
Don               1
Dr                7
Jonkheer          1
Lady              1
Major             2
Master           40
Miss            182
Mlle              2
Mme               1
Mr              517
Mrs             125
Ms                1
Rev               6
Sir               1
the Countess      1
dtype: int64

As you can see, there are 4 main Salutations (Mr, Mrs, Miss and Master); all others are far fewer in
number. Hence, we will combine all the remaining salutations under a single salutation, Others. In
order to do so, we take the same approach as we did to extract Salutation: define a function, apply
it to the Salutation column, store the outcome in a new DataFrame and then merge it with the old
DataFrame:

def group_salutation(old_salutation):
    # Keep the 4 common salutations; everything else becomes 'Others'.
    if old_salutation == 'Mr':
        return 'Mr'
    elif old_salutation == 'Mrs':
        return 'Mrs'
    elif old_salutation == 'Master':
        return 'Master'
    elif old_salutation == 'Miss':
        return 'Miss'
    else:
        return 'Others'

df3 = pd.DataFrame({'New_Salutation': df['Salutation'].apply(group_salutation)})
df = pd.merge(df, df3, left_index=True, right_index=True)
# size() counts rows per group; count() has no other column to count in df3.
temp1 = df3.groupby('New_Salutation').size()
print(temp1)
df.boxplot(column='Age', by='New_Salutation')

Following is the outcome for the distribution of New_Salutation and the variation of Age across them:

Step 2: Creating a simple grid (Class x Gender) x Salutation

Similarly, plotting the distribution of Age by Sex and Class shows a sloping trend:

So, we create a pivot table, which provides us with median values for all the cells mentioned above.
Next, we define a function which returns the values of these cells, and apply it to fill the missing
values of Age:
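
The following is only a sketch of how such a table and fill function could look, not the author's code. It runs on a small invented DataFrame; the column names Pclass, Sex, Age and New_Salutation follow the ones used above, while fill_age and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented rows; column names mirror those used in this tutorial.
toy = pd.DataFrame({
    'Pclass':         [1, 1, 1, 3, 3, 3, 3],
    'Sex':            ['male', 'male', 'male', 'female', 'female', 'female', 'male'],
    'New_Salutation': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss', 'Master'],
    'Age':            [40.0, 50.0, np.nan, 18.0, 22.0, np.nan, 4.0],
})

# Median Age for every (Pclass, Sex) row and New_Salutation column.
table = toy.pivot_table(values='Age', index=['Pclass', 'Sex'],
                        columns='New_Salutation', aggfunc='median')

def fill_age(row):
    # Hypothetical helper: look up the median for this passenger's cell.
    return table.loc[(row['Pclass'], row['Sex']), row['New_Salutation']]

# Replace only the missing ages; known ages are left untouched.
missing = toy['Age'].isnull()
toy.loc[missing, 'Age'] = toy[missing].apply(fill_age, axis=1)

print(table)
print(toy)
```

In this sketch a cell with no known ages would still yield NaN (or a lookup error if the combination never appears in the table), so on real data it is worth checking for remaining nulls after the fill.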
