Data Munging in Python Using Pandas PDF
Time flies by! I see Jenika (my daughter) running around in the entire house and my office now. She
still slips and trips but is now independent enough to explore the world and figure out new stuff on her own.
I hope I have been able to inspire similar confidence in the use of Python for data analysis in
the followers of this series.
For those who have been following, here is a pair of shoes for you to start running!
By the end of this tutorial, you will also have the tools necessary to perform a data analysis by
yourself using Python.
Recap: Getting the basics right
In the previous posts in this series, we downloaded and set up a Python installation, got
introduced to several useful libraries and data structures, and finally started with an exploratory
analysis in Python (using Pandas).
In this tutorial, we will continue our journey from where we left off in the last tutorial: we have a
reasonable idea about the characteristics of the dataset we are working on. If you have not gone
through the previous article in the series, kindly do so before proceeding further.
http://www.analyticsvidhya.com/blog/2014/09/datamungingpythonusingpandasbabystepspython/
1/7
10/6/2016
Data munging: recap of the need
During our exploration of the data, we found a few problems in the dataset, which need to be solved
before the data is ready for a good model. This exercise is typically referred to as data munging. Here
are the problems we are already aware of:
1. About 31% (277 out of 891) of values in Age are missing. We expect age to play an important role and
hence would want to estimate this in some manner.
2. While looking at the distributions, we saw that Fare seemed to contain extreme values at either end: a
few tickets were probably provided free or contained data entry errors. On the other hand, $512 sounds
like a very high fare for booking a ticket.
In addition to these problems with numerical fields, we should also look at the non-numerical fields,
i.e. Name, Ticket and Cabin, to see if they contain any useful information.
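As a rough sketch of how such extremes can be spotted, the snippet below inspects a small made-up Fare column. The values and the name `toy` are illustrative assumptions, not the tutorial's data.

```python
import pandas as pd

# Hypothetical fares, chosen only to illustrate both extremes.
toy = pd.DataFrame({"Fare": [0.0, 7.25, 8.05, 13.0, 26.55, 512.33]})

# describe() summarises count, mean, spread, quartiles, min and max.
print(toy["Fare"].describe())

# Count tickets recorded with a fare of zero.
free_tickets = (toy["Fare"] == 0).sum()

# List the two largest fares to eyeball the upper tail.
top_fares = toy["Fare"].nlargest(2).tolist()
print(free_tickets, top_fares)
```

Whether a zero or a very high fare is an error or a genuine value is a judgment call that the summary alone cannot settle.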
Check missing values in the dataset
Let us look at Cabin to start with. A first glance at the variable leaves us with the impression that there
are too many NaNs in it. So, let us check the number of nulls/NaNs in this column:
sum(df['Cabin'].isnull())
This command tells us the number of missing values: isnull() returns True for each null value, and True
counts as 1 when summed. The output is 687, which is a lot of missing values. So, we'll need to drop this variable.
Next, let us look at the variable Ticket. Ticket looks to have a mix of numbers and text and doesn't seem to
contain any information, so we will drop Ticket as well.
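The same check can be run on every column at once. The sketch below uses a small made-up frame; the values and the name `toy` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# The single-column form gives the same number for Cabin.
cabin_missing = toy["Cabin"].isnull().sum()
print(cabin_missing)
```

Using the `.sum()` method instead of the built-in `sum()` gives the same count and stays within pandas.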
df = df.drop(['Ticket', 'Cabin'], axis=1)
How to fill missing values in Age?
There are numerous ways to fill the missing values of Age, the simplest being replacement by the
mean, which can be done with the following code:
meanAge = np.mean(df.Age)
df.Age = df.Age.fillna(meanAge)
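To see what mean replacement does, here is a self-contained sketch on a four-row toy column. The data and the names `toy` and `mean_age` are made up for illustration, and the sketch uses the `Series.mean()` method.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0, np.nan]})

# Series.mean() skips NaN by default, so the mean is taken over 20 and 40.
mean_age = toy["Age"].mean()

# Replace every missing age with that single value.
toy["Age"] = toy["Age"].fillna(mean_age)
print(toy)
```

Every missing row receives the same value, which is why this approach is simple but coarse.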
The other extreme could be to build a supervised learning model to predict age on the basis of other
variables and then use age along with other variables to predict survival.
Since the purpose of this tutorial is to bring out the steps in data munging, I'll instead take an
approach that lies somewhere in between these 2 extremes. The key hypothesis is that the
salutations in Name, Gender and Pclass combined can provide us with the information required to fill in
the missing values to a large extent.
Here are the steps required to work on this hypothesis:
Step 1: Extracting salutations from Name
Let us define a function, which extracts the salutation from a Name written in this format:
Family_Name, Salutation. First Name
def name_extract(word):
    return word.split(',')[1].split('.')[0].strip()
This function takes a Name, splits it at the comma (,), takes the part after the comma, splits that at the
dot (.) and strips the surrounding whitespace from the first piece. Calling the function with
"Jain, Mr. Kunal" returns Mr, and with "Jain, Miss. Jenika" it returns Miss.
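The two examples just mentioned can be checked directly. The function is repeated here so that the snippet runs on its own; `examples` and `salutations` are names introduced only for this sketch.

```python
def name_extract(word):
    # "Family_Name, Salutation. First Name" -> "Salutation"
    return word.split(',')[1].split('.')[0].strip()

examples = ["Jain, Mr. Kunal", "Jain, Miss. Jenika"]
salutations = [name_extract(name) for name in examples]
print(salutations)  # ['Mr', 'Miss']
```

Note that a name without a comma would raise an IndexError, so this helper assumes every Name follows the format above.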
Next, we apply this function to the entire column using the apply() function and convert the outcome to a
new DataFrame df2:
df2 = pd.DataFrame({'Salutation': df['Name'].apply(name_extract)})
Once we have the Salutations, let us look at their distribution. We use the good old groupby after
merging the DataFrame df2 with the DataFrame df:
df = pd.merge(df, df2, left_index=True, right_index=True)  # merges on index
temp1 = df.groupby('Salutation').PassengerId.count()
print(temp1)
Following is the output:
Salutation
Capt              1
Col               2
Don               1
Dr                7
Jonkheer          1
Lady              1
Major             2
Master           40
Miss            182
Mlle              2
Mme               1
Mr              517
Mrs             125
Ms                1
Rev               6
Sir               1
the Countess      1
dtype: int64
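A table of this kind can also be produced without the intermediate DataFrame and merge, by assigning the new column directly and calling value_counts(). This is an alternative sketch, not the approach used in this tutorial, and the four passengers below are invented for illustration.

```python
import pandas as pd

# Hypothetical passengers; only the Name format matters here.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["Jain, Mr. Kunal", "Jain, Miss. Jenika",
             "Doe, Mr. John", "Roe, Dr. Jane"],
})

def name_extract(word):
    return word.split(',')[1].split('.')[0].strip()

# Assign the extracted salutation as a new column on the same frame.
toy["Salutation"] = toy["Name"].apply(name_extract)

# value_counts() counts each distinct salutation; sort_index() orders them by name.
counts = toy["Salutation"].value_counts().sort_index()
print(counts)
```

Direct assignment also avoids the duplicate column suffixes that a repeated merge would produce if the cell were run twice.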
As you can see, there are 4 main salutations (Mr, Mrs, Miss and Master); all the others are far fewer in
number. Hence, we will combine all the remaining salutations under a single salutation, Others. In
order to do so, we take the same approach as we did to extract Salutation: define a function, apply
it to a column, store the outcome in a new DataFrame and then merge it with the old DataFrame:
def group_salutation(old_salutation):
    # Keep the four common salutations; map everything else to 'Others'.
    if old_salutation in ('Mr', 'Mrs', 'Master', 'Miss'):
        return old_salutation
    return 'Others'

df3 = pd.DataFrame({'New_Salutation': df['Salutation'].apply(group_salutation)})
df = pd.merge(df, df3, left_index=True, right_index=True)
# Count on the merged frame: df3 has no other column left to count after grouping.
temp1 = df.groupby('New_Salutation').PassengerId.count()
print(temp1)
df.boxplot(column='Age', by='New_Salutation')
Following is the outcome for the distribution of New_Salutation and the variation of Age across them:
Step 2: Creating a simple grid (Class x Gender) x Salutation
Similarly, plotting the distribution of age by Sex & Class shows a sloping pattern:
So, we create a pivot table, which provides us with median values for all the cells mentioned above. Next,
we define a function, which returns the values of these cells, and apply it to fill the missing values of
age:
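A minimal sketch of this step follows, on a made-up six-row frame. The column names Pclass, Sex, New_Salutation and Age follow the text above; the data, the name `table` and the helper `fill_age` are assumptions introduced for illustration and may differ from the author's code.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two ages are missing.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 3],
    "Sex": ["male", "male", "female", "female", "female", "female"],
    "New_Salutation": ["Mr", "Mr", "Miss", "Miss", "Miss", "Mrs"],
    "Age": [40.0, np.nan, 20.0, 30.0, np.nan, 35.0],
})

# Median Age for every (Pclass, Sex) x New_Salutation cell; NaN ages are ignored.
table = toy.pivot_table(values="Age", index=["Pclass", "Sex"],
                        columns="New_Salutation", aggfunc="median")

def fill_age(row):
    # Look up the median Age of the cell this row belongs to.
    return table.loc[(row["Pclass"], row["Sex"]), row["New_Salutation"]]

# Apply the lookup only to rows where Age is missing.
missing = toy["Age"].isnull()
toy.loc[missing, "Age"] = toy[missing].apply(fill_age, axis=1)
print(toy)
```

A cell with no known ages at all has no median, so rows in such a cell would stay NaN and need a fallback such as the overall median.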