12 Useful Pandas Techniques in Python for Data Manipulation
Introduction
Python is fast becoming the preferred language for data scientists, and for good reasons. It provides the larger ecosystem of a programming language and the depth of good scientific computation libraries. If you are starting to learn Python, have a look at the learning path on Python.

Among its scientific computation libraries, I found Pandas to be the most useful for data science operations. Pandas, along with scikit-learn, provides almost the entire stack needed by a data scientist. This article focuses on providing 12 ways for data manipulation in Python. I've also shared some tips & tricks which will allow you to work faster.

I would recommend that you look at the codes for data exploration before going ahead. To help you understand better, I've taken a dataset to perform these operations and manipulations.

DataSet: I've used the dataset of the Loan Prediction problem. Download the dataset and get started.

Let's get started

I'll start by importing modules and loading the dataset into the Python environment:
http://www.analyticsvidhya.com/blog/2016/01/12pandastechniquespythondatamanipulation/
import pandas as pd
import numpy as np
data = pd.read_csv("train.csv", index_col="Loan_ID")
#1 Boolean Indexing

What do you do if you want to filter the values of a column based on conditions from another set of columns? For instance, we want a list of all females who are not graduates and got a loan. Boolean indexing can help here. You can use the following code:
data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") & (data["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]
Read More: Pandas Selecting and Indexing
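Since train.csv isn't bundled with this article, here is a minimal, self-contained sketch of the same filter on a small hypothetical dataframe (the values below are made up to mirror the loan data):

```python
import pandas as pd

# Hypothetical stand-in for the loan data
df = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Female"],
    "Education": ["Not Graduate", "Graduate", "Not Graduate", "Not Graduate"],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Each condition is wrapped in parentheses and combined with & inside .loc
subset = df.loc[(df["Gender"] == "Female")
                & (df["Education"] == "Not Graduate")
                & (df["Loan_Status"] == "Y"),
                ["Gender", "Education", "Loan_Status"]]
print(subset)
```

Only the two rows satisfying all three conditions survive, and only the three listed columns are returned.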
#2 Apply Function

It is one of the commonly used functions for playing with data and creating new variables. apply returns some value after passing each row/column of a data frame through some function. The function can be either a default or a user-defined one. For instance, here it can be used to find the number of missing values in each row and column.
# Create a new function:
def num_missing(x):
    return sum(x.isnull())

# Applying per column:
print("Missing values per column:")
print(data.apply(num_missing, axis=0))  # axis=0 applies the function to each column

# Applying per row:
print("\nMissing values per row:")
print(data.apply(num_missing, axis=1).head())  # axis=1 applies the function to each row
Thus we get the desired result.

Note: the head() function is used in the second output because it contains many rows.

Read More: Pandas Reference (apply)
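To see the axis argument in isolation, here is a self-contained sketch on a tiny hypothetical dataframe:

```python
import pandas as pd
import numpy as np

# Hypothetical data with a few missing entries
df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, 6]})

def num_missing(x):
    return sum(x.isnull())

per_col = df.apply(num_missing, axis=0)  # one count per column
per_row = df.apply(num_missing, axis=1)  # one count per row
print(per_col)
print(per_row)
```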
#3 Imputing missing values

fillna() does it in one go. It is used for updating missing values with the overall mean/mode/median of the column. Let's impute the 'Gender', 'Married' and 'Self_Employed' columns with their respective modes.
# First we import a function to determine the mode
from scipy.stats import mode
mode(data['Gender'])

Output: ModeResult(mode=array(['Male'], dtype=object), count=array([489]))
This returns both the mode and the count. Remember that the mode can be an array, as there can be multiple values with a high frequency. We will take the first one by default, always using:

mode(data['Gender']).mode[0]
Now we can fill the missing values and check using technique #2.

# Impute the values:
data['Gender'].fillna(mode(data['Gender']).mode[0], inplace=True)
data['Married'].fillna(mode(data['Married']).mode[0], inplace=True)
data['Self_Employed'].fillna(mode(data['Self_Employed']).mode[0], inplace=True)

# Now check the # missing values again to confirm:
print(data.apply(num_missing, axis=0))
Hence, it is confirmed that the missing values have been imputed. Please note that this is the most primitive form of imputation. Other sophisticated techniques include modeling the missing values using grouped averages (mean/mode/median). I'll cover that part in my next articles.

Read More: Pandas Reference (fillna)
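Newer SciPy versions are stricter about object arrays, so if scipy.stats.mode gives trouble, pandas' own Series.mode is a drop-in alternative. A sketch on hypothetical values:

```python
import pandas as pd
import numpy as np

# Hypothetical column with missing entries
s = pd.Series(["Male", "Male", np.nan, "Female", np.nan])

# Series.mode() returns the most frequent value(s); take the first one
fill_value = s.mode()[0]
s_filled = s.fillna(fill_value)
print(s_filled)
```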
#4 Pivot Table

Pandas can be used to create MS Excel style pivot tables. For instance, in this case, a key column is 'LoanAmount', which has missing values. We can impute it using the mean amount of each 'Gender', 'Married' and 'Self_Employed' group. The mean 'LoanAmount' of each group can be determined as:
# Determine pivot table
impute_grps = data.pivot_table(values=["LoanAmount"], index=["Gender","Married","Self_Employed"], aggfunc=np.mean)
print(impute_grps)

Read More: Pandas Reference (Pivot Table)
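A self-contained sketch of the same grouped-mean pivot on a small hypothetical dataframe (aggfunc="mean" is equivalent to np.mean here, and NaNs are skipped):

```python
import pandas as pd
import numpy as np

# Hypothetical loan-like data
df = pd.DataFrame({
    "Gender":     ["Male", "Male", "Female", "Female"],
    "Married":    ["Yes",  "Yes",  "No",     "No"],
    "LoanAmount": [100.0,  200.0,  120.0,    np.nan],
})

# Mean LoanAmount per (Gender, Married) group
pivot = df.pivot_table(values=["LoanAmount"], index=["Gender", "Married"], aggfunc="mean")
print(pivot)
```

The result is indexed by the (Gender, Married) combinations, which is the MultiIndex discussed in the next technique.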
#5 Multi-Indexing

If you notice the output of step #4, it has a strange property. Each index is made up of a combination of 3 values. This is called Multi-Indexing. It helps in performing operations really fast.

Continuing the example from #4, we have the values for each group, but they have not been imputed yet. This can be done using the various techniques learned till now.
# Iterate only through rows with missing LoanAmount
for i, row in data.loc[data['LoanAmount'].isnull(), :].iterrows():
    ind = tuple([row['Gender'], row['Married'], row['Self_Employed']])
    data.loc[i, 'LoanAmount'] = impute_grps.loc[ind].values[0]

# Now check the # missing values again to confirm:
print(data.apply(num_missing, axis=0))
Note:
1. Multi-index requires a tuple for defining groups of indices in the loc statement. This tuple is used in the function.
2. The .values[0] suffix is required because, by default, a series element is returned which has an index not matching with that of the dataframe. In this case, a direct assignment gives an error.
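The same loop, shrunk to a self-contained sketch on a hypothetical two-level group, shows both notes in action (the tuple lookup and the .values[0] suffix):

```python
import pandas as pd
import numpy as np

# Hypothetical data: row 2 has a missing LoanAmount
df = pd.DataFrame({
    "Gender":     ["Male", "Male", "Male", "Female"],
    "Married":    ["Yes",  "No",   "No",   "No"],
    "LoanAmount": [100.0,  200.0,  np.nan, 120.0],
})
groups = df.pivot_table(values=["LoanAmount"], index=["Gender", "Married"], aggfunc="mean")

# A tuple addresses one row of the MultiIndex; .values[0] strips the index
for i, row in df.loc[df["LoanAmount"].isnull(), :].iterrows():
    ind = (row["Gender"], row["Married"])
    df.loc[i, "LoanAmount"] = groups.loc[ind].values[0]
print(df)
```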
#6 Crosstab

This function is used to get an initial feel (view) of the data. Here, we can validate some basic hypotheses. For instance, in this case, 'Credit_History' is expected to affect the loan status significantly. This can be tested using cross-tabulation as shown below:

pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True)
These are absolute numbers. But percentages can be more intuitive for making quick insights. We can do this using the apply function:

def percConvert(ser):
    return ser / float(ser.iloc[-1])  # divide by the row total (the "All" margin)

pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True).apply(percConvert, axis=1)
Now, it is evident that people with a credit history have much higher chances of getting a loan: 80% of people with a credit history got a loan, as compared to only 9% without one.

But that's not it. It tells an interesting story. Since I know that having a credit history is super important, what if I predict the loan status to be Y for those with a credit history and N otherwise? Surprisingly, we'll be right 82 + 378 = 460 times out of 614, which is a whopping 75%!

I won't blame you if you're wondering why we need statistical models at all. But trust me, increasing the accuracy by even 0.001% beyond this mark is a challenging task. Would you take this challenge?

Note: 75% is on the train set. The test set will be slightly different, but close. Also, I hope this gives some intuition into why even a 0.05% increase in accuracy can result in a jump of 500 ranks on the Kaggle leaderboard.

Read More: Pandas Reference (crosstab)
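A self-contained sketch of the counts-to-percentages step on hypothetical data (using ser.iloc[-1] to reach the trailing "All" margin positionally):

```python
import pandas as pd

# Hypothetical data: credit history strongly tied to loan status
df = pd.DataFrame({
    "Credit_History": [1, 1, 1, 0, 0],
    "Loan_Status":    ["Y", "Y", "N", "N", "N"],
})
ct = pd.crosstab(df["Credit_History"], df["Loan_Status"], margins=True)

# Convert each row to fractions of its row total (the "All" column)
def percConvert(ser):
    return ser / float(ser.iloc[-1])

pct = ct.apply(percConvert, axis=1)
print(pct)
```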
#7 Merge DataFrames

Merging dataframes becomes essential when we have information coming from different sources to be collated. Consider a hypothetical case where the average property rates (INR per sq. meter) are available for different property types. Let's define a dataframe as:
prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'], columns=['rates'])
prop_rates
Now we can merge this information with the original dataframe as:

data_merged = data.merge(right=prop_rates, how='inner', left_on='Property_Area', right_index=True, sort=False)
data_merged.pivot_table(values='Credit_History', index=['Property_Area','rates'], aggfunc=len)
The pivot table validates a successful merge operation. Note that the 'values' argument is irrelevant here because we are simply counting the values.

Read More: Pandas Reference (merge)
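The same column-to-index merge, sketched end to end on a hypothetical stand-in for the loan data (left_on names a column on the left; right_index=True joins against the right frame's index):

```python
import pandas as pd

# Hypothetical subset of the loan data
data_small = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban"],
    "Credit_History": [1, 0, 1, 1],
})
prop_rates = pd.DataFrame([1000, 5000, 12000],
                          index=["Rural", "Semiurban", "Urban"],
                          columns=["rates"])

# Each row picks up the rate for its Property_Area from prop_rates' index
merged = data_small.merge(right=prop_rates, how="inner",
                          left_on="Property_Area", right_index=True, sort=False)
print(merged)
```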
#8 Sorting DataFrames

Pandas allows easy sorting based on multiple columns. This can be done as:

data_sorted = data.sort_values(['ApplicantIncome','CoapplicantIncome'], ascending=False)
data_sorted[['ApplicantIncome','CoapplicantIncome']].head(10)

Note: the Pandas sort function is now deprecated. We should use sort_values instead.

Read More: Pandas Reference (sort_values)
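A quick self-contained sketch on hypothetical incomes, showing how the second column breaks ties in the first:

```python
import pandas as pd

df = pd.DataFrame({
    "ApplicantIncome":   [3000, 5000, 5000, 2000],
    "CoapplicantIncome": [1000,  500, 2000,    0],
})
# Sort by ApplicantIncome first, then CoapplicantIncome, both descending
df_sorted = df.sort_values(["ApplicantIncome", "CoapplicantIncome"], ascending=False)
print(df_sorted)
```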
#9 Plotting (Boxplot & Histogram)

Many of you might be unaware that boxplots and histograms can be plotted directly in Pandas, and calling matplotlib separately is not necessary. It's just a 1-line command. For instance, if we want to compare the distribution of 'ApplicantIncome' by 'Loan_Status':

import matplotlib.pyplot as plt
%matplotlib inline
data.boxplot(column="ApplicantIncome", by="Loan_Status")

data.hist(column="ApplicantIncome", by="Loan_Status", bins=30)

This shows that income is not a big deciding factor on its own, as there is no appreciable difference between the people who received and were denied the loan.

Read More: Pandas Reference (hist) | Pandas Reference (boxplot)
#10 Cut function for binning

Sometimes numerical values make more sense if clustered together. For example, if we're trying to model traffic (# cars on road) with time of the day (minutes), the exact minute of an hour might not be that relevant for predicting traffic as compared to the actual period of the day, like 'Morning', 'Afternoon', 'Evening', 'Night', 'Late Night'. Modeling traffic this way will be more intuitive and will avoid overfitting.

Here we define a simple function which can be reused for binning any variable fairly easily.
# Binning:
def binning(col, cut_points, labels=None):
    # Define min and max values:
    minval = col.min()
    maxval = col.max()

    # Create list by adding min and max to cut_points
    break_points = [minval] + cut_points + [maxval]

    # If no labels provided, use default labels 0 ... (n-1)
    if not labels:
        labels = range(len(cut_points) + 1)

    # Binning using the cut function of pandas
    colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
    return colBin

# Binning LoanAmount:
cut_points = [90, 140, 190]
labels = ["low", "medium", "high", "very high"]
data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
print(pd.value_counts(data["LoanAmount_Bin"], sort=False))

Read More: Pandas Reference (cut)
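The core pd.cut call, sketched on a hypothetical handful of loan amounts so the bin edges are easy to verify by eye (include_lowest=True puts the minimum itself into the first bin):

```python
import pandas as pd

loan_amounts = pd.Series([50, 100, 150, 200, 90])
cut_points = [90, 140, 190]
# Breaks become [50, 90, 140, 190, 200]: min + cut points + max
break_points = [loan_amounts.min()] + cut_points + [loan_amounts.max()]
labels = ["low", "medium", "high", "very high"]

bins = pd.cut(loan_amounts, bins=break_points, labels=labels, include_lowest=True)
print(bins)
```

Note that 90 lands in "low": each bin is right-inclusive, so the first bin covers [50, 90] and the second (90, 140].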
#11 Coding nominal data

Often, we find a case where we've got to modify the categories of a nominal variable. This can be due to various reasons:

1. Some algorithms (like Logistic Regression) require all inputs to be numeric. So nominal variables are mostly coded as 0, 1 ... (n-1).
2. Sometimes a category might be represented in 2 ways. For e.g. temperature might be recorded as 'High', 'Medium', 'Low', 'H', 'low'. Here, both 'High' and 'H' refer to the same category. Similarly, in 'Low' and 'low' there is only a difference of case. But Python would read them as different levels.
3. Some categories might have very low frequencies, and it's generally a good idea to combine them.

Here I've defined a generic function which takes a dictionary as input and codes the values using the replace function in Pandas.
# Define a generic function using the Pandas replace function
def coding(col, codeDict):
    colCoded = pd.Series(col, copy=True)
    for key, value in codeDict.items():
        colCoded.replace(key, value, inplace=True)
    return colCoded

# Coding Loan_Status as Y=1, N=0:
print('Before Coding:')
print(pd.value_counts(data["Loan_Status"]))
data["Loan_Status_Coded"] = coding(data["Loan_Status"], {'N': 0, 'Y': 1})
print('\nAfter Coding:')
print(pd.value_counts(data["Loan_Status_Coded"]))
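For a fixed dictionary like this, Series.map is a concise alternative to repeated replace() calls; a sketch on hypothetical values:

```python
import pandas as pd

status = pd.Series(["Y", "N", "Y", "Y"])

# map applies the dict in one pass; any value missing from the dict
# becomes NaN, which makes typos in the categories easy to spot
coded = status.map({"N": 0, "Y": 1})
print(coded.value_counts())
```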