12 Useful Pandas Techniques in Python for Data Manipulation
Introduction
Python is fast becoming the preferred language for data scientists, and for good reasons. It provides the larger ecosystem of a programming language and the depth of good scientific computation libraries. If you are starting to learn Python, have a look at the learning path on Python.

Among its scientific computation libraries, I found Pandas to be the most useful for data science operations. Pandas, along with scikit-learn, provides almost the entire stack needed by a data scientist. This article focuses on providing 12 ways for data manipulation in Python. I've also shared some tips & tricks which will allow you to work faster.

I would recommend that you look at the codes for data exploration before going ahead. To help you understand better, I've taken a dataset to perform these operations and manipulations.

DataSet: I've used the dataset of the Loan Prediction problem. Download the dataset and get started.

Let's get started

I'll start by importing modules and loading the dataset into the Python environment:
http://www.analyticsvidhya.com/blog/2016/01/12pandastechniquespythondatamanipulation/
import pandas as pd
import numpy as np
data = pd.read_csv("train.csv", index_col="Loan_ID")
#1 Boolean Indexing

What do you do if you want to filter the values of a column based on conditions from another set of columns? For instance, we want a list of all females who are not graduates and got a loan. Boolean indexing can help here. You can use the following code:
data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") & (data["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]
Read More: Pandas Selecting and Indexing
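Since train.csv isn't bundled with this article, here is a minimal, self-contained sketch of the same filter on a small hypothetical dataframe (the values below are made up to mirror the loan data):

```python
import pandas as pd

# Hypothetical stand-in for the loan data
df = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Female"],
    "Education": ["Not Graduate", "Graduate", "Not Graduate", "Not Graduate"],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Each condition is wrapped in parentheses and combined with & inside .loc
subset = df.loc[(df["Gender"] == "Female")
                & (df["Education"] == "Not Graduate")
                & (df["Loan_Status"] == "Y"),
                ["Gender", "Education", "Loan_Status"]]
print(subset)
```

Only the two rows satisfying all three conditions survive, and only the three listed columns are returned.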
#2 Apply Function

It is one of the commonly used functions for playing with data and creating new variables. apply returns some value after passing each row/column of a data frame through some function. The function can be either a default or a user-defined one. For instance, here it can be used to find the number of missing values in each row and column.
# Create a new function:
def num_missing(x):
    return sum(x.isnull())

# Applying per column:
print("Missing values per column:")
print(data.apply(num_missing, axis=0))  # axis=0 applies the function to each column

# Applying per row:
print("\nMissing values per row:")
print(data.apply(num_missing, axis=1).head())  # axis=1 applies the function to each row
Thus we get the desired result.

Note: the head() function is used in the second output because it contains many rows.

Read More: Pandas Reference (apply)
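To see the axis argument in isolation, here is a self-contained sketch on a tiny hypothetical dataframe:

```python
import pandas as pd
import numpy as np

# Hypothetical data with a few missing entries
df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, 6]})

def num_missing(x):
    return sum(x.isnull())

per_col = df.apply(num_missing, axis=0)  # one count per column
per_row = df.apply(num_missing, axis=1)  # one count per row
print(per_col)
print(per_row)
```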
#3 Imputing missing values

fillna() does it in one go. It is used for updating missing values with the overall mean/mode/median of the column. Let's impute the 'Gender', 'Married' and 'Self_Employed' columns with their respective modes.
# First we import a function to determine the mode
from scipy.stats import mode
mode(data['Gender'])

Output: ModeResult(mode=array(['Male'], dtype=object), count=array([489]))
This returns both the mode and the count. Remember that the mode can be an array, as there can be multiple values with a high frequency. We will take the first one by default, always using:

mode(data['Gender']).mode[0]
Now we can fill the missing values and check using technique #2.

# Impute the values:
data['Gender'].fillna(mode(data['Gender']).mode[0], inplace=True)
data['Married'].fillna(mode(data['Married']).mode[0], inplace=True)
data['Self_Employed'].fillna(mode(data['Self_Employed']).mode[0], inplace=True)

# Now check the # missing values again to confirm:
print(data.apply(num_missing, axis=0))
Hence, it is confirmed that the missing values have been imputed. Please note that this is the most primitive form of imputation. Other sophisticated techniques include modeling the missing values using grouped averages (mean/mode/median). I'll cover that part in my next articles.

Read More: Pandas Reference (fillna)
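Newer SciPy versions are stricter about object arrays, so if scipy.stats.mode gives trouble, pandas' own Series.mode is a drop-in alternative. A sketch on hypothetical values:

```python
import pandas as pd
import numpy as np

# Hypothetical column with missing entries
s = pd.Series(["Male", "Male", np.nan, "Female", np.nan])

# Series.mode() returns the most frequent value(s); take the first one
fill_value = s.mode()[0]
s_filled = s.fillna(fill_value)
print(s_filled)
```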
#4 Pivot Table

Pandas can be used to create MS Excel style pivot tables. For instance, in this case, a key column is 'LoanAmount', which has missing values. We can impute it using the mean amount of each 'Gender', 'Married' and 'Self_Employed' group. The mean 'LoanAmount' of each group can be determined as:
# Determine pivot table
impute_grps = data.pivot_table(values=["LoanAmount"], index=["Gender","Married","Self_Employed"], aggfunc=np.mean)
print(impute_grps)

Read More: Pandas Reference (Pivot Table)
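A self-contained sketch of the same grouped-mean pivot on a small hypothetical dataframe (aggfunc="mean" is equivalent to np.mean here, and NaNs are skipped):

```python
import pandas as pd
import numpy as np

# Hypothetical loan-like data
df = pd.DataFrame({
    "Gender":     ["Male", "Male", "Female", "Female"],
    "Married":    ["Yes",  "Yes",  "No",     "No"],
    "LoanAmount": [100.0,  200.0,  120.0,    np.nan],
})

# Mean LoanAmount per (Gender, Married) group
pivot = df.pivot_table(values=["LoanAmount"], index=["Gender", "Married"], aggfunc="mean")
print(pivot)
```

The result is indexed by the (Gender, Married) combinations, which is the MultiIndex discussed in the next technique.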
#5 Multi-Indexing

If you notice the output of step #4, it has a strange property. Each index is made up of a combination of 3 values. This is called Multi-Indexing. It helps in performing operations really fast.

Continuing the example from #4, we have the values for each group, but they have not been imputed yet. This can be done using the various techniques learned till now.
# Iterate only through rows with missing LoanAmount
for i, row in data.loc[data['LoanAmount'].isnull(), :].iterrows():
    ind = tuple([row['Gender'], row['Married'], row['Self_Employed']])
    data.loc[i, 'LoanAmount'] = impute_grps.loc[ind].values[0]

# Now check the # missing values again to confirm:
print(data.apply(num_missing, axis=0))
Note:
1. Multi-index requires a tuple for defining groups of indices in the loc statement. This tuple is used in the function.
2. The .values[0] suffix is required because, by default, a series element is returned which has an index not matching with that of the dataframe. In this case, a direct assignment gives an error.
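The same loop, shrunk to a self-contained sketch on a hypothetical two-level group, shows both notes in action (the tuple lookup and the .values[0] suffix):

```python
import pandas as pd
import numpy as np

# Hypothetical data: row 2 has a missing LoanAmount
df = pd.DataFrame({
    "Gender":     ["Male", "Male", "Male", "Female"],
    "Married":    ["Yes",  "No",   "No",   "No"],
    "LoanAmount": [100.0,  200.0,  np.nan, 120.0],
})
groups = df.pivot_table(values=["LoanAmount"], index=["Gender", "Married"], aggfunc="mean")

# A tuple addresses one row of the MultiIndex; .values[0] strips the index
for i, row in df.loc[df["LoanAmount"].isnull(), :].iterrows():
    ind = (row["Gender"], row["Married"])
    df.loc[i, "LoanAmount"] = groups.loc[ind].values[0]
print(df)
```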
#6 Crosstab

This function is used to get an initial feel (view) of the data. Here, we can validate some basic hypotheses. For instance, in this case, 'Credit_History' is expected to affect the loan status significantly. This can be tested using cross-tabulation as shown below:

pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True)
These are absolute numbers. But percentages can be more intuitive for making quick insights. We can do this using the apply function:

def percConvert(ser):
    return ser / float(ser.iloc[-1])  # divide by the row total (the "All" margin)

pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True).apply(percConvert, axis=1)
Now, it is evident that people with a credit history have much higher chances of getting a loan: 80% of people with a credit history got a loan, as compared to only 9% without one.

But that's not it. It tells an interesting story. Since I know that having a credit history is super important, what if I predict the loan status to be Y for those with a credit history and N otherwise? Surprisingly, we'll be right 82 + 378 = 460 times out of 614, which is a whopping 75%!

I won't blame you if you're wondering why we need statistical models at all. But trust me, increasing the accuracy by even 0.001% beyond this mark is a challenging task. Would you take this challenge?

Note: 75% is on the train set. The test set will be slightly different, but close. Also, I hope this gives some intuition into why even a 0.05% increase in accuracy can result in a jump of 500 ranks on the Kaggle leaderboard.

Read More: Pandas Reference (crosstab)
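A self-contained sketch of the counts-to-percentages step on hypothetical data (using ser.iloc[-1] to reach the trailing "All" margin positionally):

```python
import pandas as pd

# Hypothetical data: credit history strongly tied to loan status
df = pd.DataFrame({
    "Credit_History": [1, 1, 1, 0, 0],
    "Loan_Status":    ["Y", "Y", "N", "N", "N"],
})
ct = pd.crosstab(df["Credit_History"], df["Loan_Status"], margins=True)

# Convert each row to fractions of its row total (the "All" column)
def percConvert(ser):
    return ser / float(ser.iloc[-1])

pct = ct.apply(percConvert, axis=1)
print(pct)
```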
#7 Merge DataFrames

Merging dataframes becomes essential when we have information coming from different sources to be collated. Consider a hypothetical case where the average property rates (INR per sq. meter) are available for different property types. Let's define a dataframe as:
prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'], columns=['rates'])
prop_rates
Now we can merge this information with the original dataframe as:

data_merged = data.merge(right=prop_rates, how='inner', left_on='Property_Area', right_index=True, sort=False)
data_merged.pivot_table(values='Credit_History', index=['Property_Area','rates'], aggfunc=len)
The pivot table validates a successful merge operation. Note that the 'values' argument is irrelevant here because we are simply counting the values.

Read More: Pandas Reference (merge)
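The same column-to-index merge, sketched end to end on a hypothetical stand-in for the loan data (left_on names a column on the left; right_index=True joins against the right frame's index):

```python
import pandas as pd

# Hypothetical subset of the loan data
data_small = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban"],
    "Credit_History": [1, 0, 1, 1],
})
prop_rates = pd.DataFrame([1000, 5000, 12000],
                          index=["Rural", "Semiurban", "Urban"],
                          columns=["rates"])

# Each row picks up the rate for its Property_Area from prop_rates' index
merged = data_small.merge(right=prop_rates, how="inner",
                          left_on="Property_Area", right_index=True, sort=False)
print(merged)
```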
#8 Sorting DataFrames

Pandas allows easy sorting based on multiple columns. This can be done as:

data_sorted = data.sort_values(['ApplicantIncome','CoapplicantIncome'], ascending=False)
data_sorted[['ApplicantIncome','CoapplicantIncome']].head(10)

Note: the Pandas sort function is now deprecated. We should use sort_values instead.

Read More: Pandas Reference (sort_values)
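A quick self-contained sketch on hypothetical incomes, showing how the second column breaks ties in the first:

```python
import pandas as pd

df = pd.DataFrame({
    "ApplicantIncome":   [3000, 5000, 5000, 2000],
    "CoapplicantIncome": [1000,  500, 2000,    0],
})
# Sort by ApplicantIncome first, then CoapplicantIncome, both descending
df_sorted = df.sort_values(["ApplicantIncome", "CoapplicantIncome"], ascending=False)
print(df_sorted)
```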
#9 Plotting (Boxplot & Histogram)

Many of you might be unaware that boxplots and histograms can be plotted directly in Pandas, and calling matplotlib separately is not necessary. It's just a 1-line command. For instance, if we want to compare the distribution of 'ApplicantIncome' by 'Loan_Status':

import matplotlib.pyplot as plt
%matplotlib inline
data.boxplot(column="ApplicantIncome", by="Loan_Status")

data.hist(column="ApplicantIncome", by="Loan_Status", bins=30)

This shows that income is not a big deciding factor on its own, as there is no appreciable difference between the people who received and were denied the loan.

Read More: Pandas Reference (hist) | Pandas Reference (boxplot)
#10 Cut function for binning

Sometimes numerical values make more sense if clustered together. For example, if we're trying to model traffic (# cars on road) with time of the day (minutes), the exact minute of an hour might not be that relevant for predicting traffic as compared to the actual period of the day, like 'Morning', 'Afternoon', 'Evening', 'Night', 'Late Night'. Modeling traffic this way will be more intuitive and will avoid overfitting.

Here we define a simple function which can be reused for binning any variable fairly easily.
# Binning:
def binning(col, cut_points, labels=None):
    # Define min and max values:
    minval = col.min()
    maxval = col.max()

    # Create list by adding min and max to cut_points
    break_points = [minval] + cut_points + [maxval]

    # If no labels provided, use default labels 0 ... (n-1)
    if not labels:
        labels = range(len(cut_points) + 1)

    # Binning using the cut function of pandas
    colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
    return colBin

# Binning LoanAmount:
cut_points = [90, 140, 190]
labels = ["low", "medium", "high", "very high"]
data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
print(pd.value_counts(data["LoanAmount_Bin"], sort=False))

Read More: Pandas Reference (cut)
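The core pd.cut call, sketched on a hypothetical handful of loan amounts so the bin edges are easy to verify by eye (include_lowest=True puts the minimum itself into the first bin):

```python
import pandas as pd

loan_amounts = pd.Series([50, 100, 150, 200, 90])
cut_points = [90, 140, 190]
# Breaks become [50, 90, 140, 190, 200]: min + cut points + max
break_points = [loan_amounts.min()] + cut_points + [loan_amounts.max()]
labels = ["low", "medium", "high", "very high"]

bins = pd.cut(loan_amounts, bins=break_points, labels=labels, include_lowest=True)
print(bins)
```

Note that 90 lands in "low": each bin is right-inclusive, so the first bin covers [50, 90] and the second (90, 140].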
#11 Coding nominal data

Often, we find a case where we've got to modify the categories of a nominal variable. This can be due to various reasons:

1. Some algorithms (like Logistic Regression) require all inputs to be numeric. So nominal variables are mostly coded as 0, 1 ... (n-1).
2. Sometimes a category might be represented in 2 ways. For e.g. temperature might be recorded as 'High', 'Medium', 'Low', 'H', 'low'. Here, both 'High' and 'H' refer to the same category. Similarly, in 'Low' and 'low' there is only a difference of case. But Python would read them as different levels.
3. Some categories might have very low frequencies, and it's generally a good idea to combine them.

Here I've defined a generic function which takes a dictionary as input and codes the values using the replace function in Pandas.
# Define a generic function using the Pandas replace function
def coding(col, codeDict):
    colCoded = pd.Series(col, copy=True)
    for key, value in codeDict.items():
        colCoded.replace(key, value, inplace=True)
    return colCoded

# Coding Loan_Status as Y=1, N=0:
print('Before Coding:')
print(pd.value_counts(data["Loan_Status"]))
data["Loan_Status_Coded"] = coding(data["Loan_Status"], {'N': 0, 'Y': 1})
print('\nAfter Coding:')
print(pd.value_counts(data["Loan_Status_Coded"]))
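For a fixed dictionary like this, Series.map is a concise alternative to repeated replace() calls; a sketch on hypothetical values:

```python
import pandas as pd

status = pd.Series(["Y", "N", "Y", "Y"])

# map applies the dict in one pass; any value missing from the dict
# becomes NaN, which makes typos in the categories easy to spot
coded = status.map({"N": 0, "Y": 1})
print(coded.value_counts())
```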