Complete Guide To Create A Time Series Forecast (With Codes in Python) PDF
Introduction
Time Series (referred to as TS from now on) is considered to be one of the lesser-known skills in the analytics space (even I had little clue about it a couple of days back). But as our inaugural Mini Hackathon is based on it, I set myself on a journey to learn the basic steps for solving a Time Series problem, and here I am sharing the same with you. These will definitely help you get a decent model in our hackathon today.
Before going through this article, I highly recommend reading A Complete Tutorial on Time Series Modeling in R, which is like a prequel to this article. It focuses on fundamental concepts and is based on R, whereas I will focus on using these concepts to solve a problem end-to-end, along with code in Python. Many resources exist for TS in R but very few for Python, so I'll be using Python in this article.
Our journey will go through the following steps:
1. What makes Time Series Special?
2. Loading and Handling Time Series in Pandas
3. How to Check Stationarity of a Time Series?
4. How to make a Time Series Stationary?
5. Forecasting a Time Series
http://www.analyticsvidhya.com/blog/2016/02/timeseriesforecastingcodespython/
21/6/2016
1. What makes Time Series Special?
As the name suggests, TS is a collection of data points collected at constant time intervals. These are analyzed to determine the long-term trend so as to forecast the future or perform some other form of analysis. But what makes a TS different from, say, a regular regression problem? There are 2 things:
1. It is time dependent. So the basic assumption of a linear regression model, that the observations are independent, doesn't hold in this case.
2. Along with an increasing or decreasing trend, most TS have some form of seasonality, i.e. variations specific to a particular time frame. For example, if you look at the sales of a woolen jacket over time, you will invariably find higher sales in winter seasons.
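To make the first point concrete, here is a minimal sketch on synthetic data (the two series and the seed are my own assumptions, not part of the AirPassengers example). It compares the lag-1 autocorrelation of independent draws with that of a series where each value builds on the previous one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Independent observations: each value is drawn on its own.
independent = pd.Series(rng.normal(size=500))

# Time-dependent observations: each value is the previous one plus a random shock.
dependent = pd.Series(rng.normal(size=500)).cumsum()

# Lag-1 autocorrelation: correlation of the series with itself shifted by one step.
independent_r = independent.autocorr(lag=1)
dependent_r = dependent.autocorr(lag=1)
print(independent_r, dependent_r)
```

The first number should sit close to zero and the second close to one, which is exactly the dependence that a plain regression model assumes away.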
Because of the inherent properties of a TS, there are various steps involved in analyzing it. These are discussed in detail below. Let's start by loading a TS object in Python. We'll be using the popular AirPassengers data set, which can be downloaded here.
Please note that the aim of this article is to familiarize you with the various techniques used for TS in general. The example considered here is just for illustration, and I will focus on covering a breadth of topics rather than making a very accurate forecast.
2. Loading and Handling Time Series in Pandas
Pandas has dedicated support for handling TS objects, particularly the datetime64[ns] dtype, which stores time information and allows us to perform some operations really fast. Let's start by firing up the required libraries:
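import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Jupyter/IPython magic: render plots inline (omit when running as a plain script)
%matplotlib inline

# Default figure size (width, height) in inches
rcParams['figure.figsize'] = 15, 6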
Now, we can load the dataset and look at some initial rows and the data types of the columns:
data = pd.read_csv('AirPassengers.csv')
print(data.head())
print('\nData Types:')
print(data.dtypes)
The data contains a particular month and the number of passengers travelling in that month. But this is still not read as a TS object, as the data types are object and int. In order to read the data as a time series, we have to pass special arguments to the read_csv command:
from datetime import datetime

# Parse strings such as '1949-01' into datetime objects
dateparse = lambda dates: datetime.strptime(dates, '%Y-%m')
# Note: date_parser is deprecated since pandas 2.0; there, date_format='%Y-%m' replaces it
data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month', date_parser=dateparse)
print(data.head())
Let's understand the arguments one by one:
1. parse_dates: This specifies the column which contains the date-time information. As we saw above, the column name is 'Month'.
2. index_col: A key idea behind using Pandas for TS data is that the index has to be the variable depicting date-time information. So this argument tells pandas to use the 'Month' column as the index.
3. date_parser: This specifies a function which converts an input string into a datetime variable. By default Pandas reads data in the format 'YYYY-MM-DD HH:MM:SS'. If the data is not in this format, the format has to be defined manually. Something similar to the dateparse function defined here can be used for this purpose.
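If you prefer not to rely on a parser function, the same result can be reached by converting the column after loading. The following is a small sketch under my own assumptions: the inline CSV text stands in for the real file, and pd.to_datetime with an explicit format replaces the date_parser argument.

```python
import io
import pandas as pd

# A small inline sample in the same 'YYYY-MM' layout as the Month column
# (stand-in for AirPassengers.csv so the sketch runs on its own).
csv_text = "Month,#Passengers\n1949-01,112\n1949-02,118\n1949-03,132\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)  # Month is read as text, not as dates

# Convert the strings explicitly, then make the dates the index.
raw['Month'] = pd.to_datetime(raw['Month'], format='%Y-%m')
sample = raw.set_index('Month')
print(sample.index)  # a DatetimeIndex
```

Doing the conversion as a separate step makes it easy to inspect the raw strings first when a file uses an unusual date layout.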
Now we can see that the data has a time object as the index and #Passengers as the column. We can cross-check the datatype of the index with the following command:
data.index
Notice the dtype='datetime64[ns]', which confirms that it is a datetime object. As a personal preference, I would convert the column into a Series object to avoid referring to column names every time I use the TS. Please feel free to keep it as a dataframe if that works better for you.
ts = data['#Passengers']
ts.head(10)
Before going further, I'll discuss some indexing techniques for TS data. Let's start by selecting a particular value in the Series object. This can be done in the following 2 ways:
# 1. Specify the index as a string constant:
ts['1949-01-01']
# 2. Import the datetime library and use the 'datetime' function:
from datetime import datetime
ts[datetime(1949, 1, 1)]
Both would return the value 112, which can also be confirmed from the previous output. Suppose we want all the data up to May 1949. This can be done in 2 ways:
# 1. Specify the entire range:
ts['1949-01-01':'1949-05-01']
# 2. Use ':' if one of the indices is at the ends:
ts[:'1949-05-01']
Both would yield the same output: the five monthly values from January to May 1949.
There are 2 things to note here:
1. Unlike numeric indexing, the end index is included here. For instance, if we index a list as a[:5], it returns the values at indices [0, 1, 2, 3, 4]. But here the index '1949-05-01' was included in the output.
2. The indices have to be sorted for ranges to work. If you randomly shuffle the index, this won't work.
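The difference between the two slicing rules is easy to check on a toy series. This is a sketch with a hypothetical eight-month series of my own (not the passenger data); it contrasts positional slicing, which excludes the end, with label slicing, which includes it.

```python
import pandas as pd

# Hypothetical monthly series, used only to illustrate the slicing rules.
idx = pd.date_range('1949-01-01', periods=8, freq='MS')
s = pd.Series(range(8), index=idx)

# Positional slicing stops BEFORE position 4, so May (position 4) is left out.
positional = s.iloc[:4]

# Label slicing runs up to AND INCLUDING the label, so May is kept.
by_label = s.loc[:'1949-05-01']

print(len(positional), len(by_label))
print(s.index.is_monotonic_increasing)  # range slicing relies on a sorted index
```

Checking is_monotonic_increasing before slicing is a cheap way to catch the shuffled-index problem mentioned in the second point.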
Consider another instance where you need all the values of the year 1949. This can be done as:
ts['1949']
The month part was omitted. Similarly, if you want all the days of a particular month, the day part can be omitted.
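Since our data is monthly, the "all days of a month" case cannot be tried on it directly. Here is a sketch on a hypothetical daily series (my own assumption) that shows both forms of partial string selection.

```python
import pandas as pd

# Hypothetical daily series spanning two years.
idx = pd.date_range('1949-01-01', '1950-12-31', freq='D')
daily = pd.Series(range(len(idx)), index=idx)

year_1949 = daily.loc['1949']       # month and day omitted: every day in 1949
march_1949 = daily.loc['1949-03']   # day omitted: every day in March 1949

print(len(year_1949), len(march_1949))
```

I have used .loc here to make it explicit that the selection is by label; the plain bracket form shown above behaves the same way on a Series.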
Now, let's move on to analyzing the TS.
3. How to Check Stationarity of a Time Series?
A TS is said to be stationary if its statistical properties, such as mean and variance, remain constant over time. But why is it important? Most TS models work on the assumption that the TS is stationary. Intuitively, we can say that if a TS has a particular behaviour over time, there is a very high probability that it will follow the same in the future. Also, the theories related to stationary series are more mature and easier to implement as compared to non-stationary series.
Stationarity is defined using very strict criteria. However, for practical purposes we can assume the series to be stationary if it has constant statistical properties over time, i.e. the following:
1. constant mean
2. constant variance
3. an autocovariance that does not depend on time.
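A quick way to get a feel for the "constant mean" condition is to compare the mean of the first and second halves of a series. The sketch below uses two synthetic series of my own (plain noise, and the same noise with a linear drift added); it is an informal check, not a statistical test.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
n = 1000
noise = rng.normal(size=n)                 # constant mean and variance
trending = noise + 0.01 * np.arange(n)     # mean drifts upward with time

def half_means(x):
    """Return the mean of the first half and of the second half of x."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

noise_first, noise_second = half_means(noise)
trend_first, trend_second = half_means(trending)
print(noise_first, noise_second)
print(trend_first, trend_second)
```

For the noise series the two halves should agree closely, while for the drifting series the second half is clearly higher, which violates the first condition.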
I'll skip the details as it is very clearly defined in this article. Let's move on to the ways of testing stationarity. First and foremost is to simply plot the data and analyze it visually. The data can be plotted using the following command:
plt.plot(ts)
It is clearly evident that there is an overall increasing trend in the data along with some seasonal variations. However, it might not always be possible to make such visual inferences (we'll see such cases later). So, more formally, we can check stationarity using the following:
1. Plotting Rolling Statistics: We can plot the moving average or moving variance and see if it varies with time. By moving average/variance I mean that at any instant 't', we'll take the average/variance of the last year, i.e. the last 12 months. But again, this is more of a visual technique.
2. Dickey-Fuller Test: This is one of the statistical tests for checking stationarity. Here the null hypothesis is that the TS is non-stationary. The test results comprise a Test Statistic and some Critical Values for different confidence levels. If the Test Statistic is less than the Critical Value, we can reject the null hypothesis and say that the series is stationary. Refer to this article for details.
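To show what the test statistic actually measures, here is a simplified sketch of the basic Dickey-Fuller regression written with NumPy only. It is my own stripped-down version (intercept, no lagged difference terms, no p-value), not the statsmodels routine used in this article, and the comparison value of about -2.86 is the approximate large-sample 5% critical value for this intercept-only form.

```python
import numpy as np

def simple_df_statistic(y):
    """t-statistic of rho in the regression: diff(y)[t] = alpha + rho * y[t-1] + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])    # intercept and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])    # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)              # covariance of the estimates
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(42)
white_noise = rng.normal(size=300)             # stationary by construction
random_walk = rng.normal(size=300).cumsum()    # non-stationary by construction

stat_noise = simple_df_statistic(white_noise)
stat_walk = simple_df_statistic(random_walk)
print(stat_noise, stat_walk)
```

The white-noise statistic comes out strongly negative, far below the critical value, so the null of non-stationarity is rejected. The random-walk statistic is much closer to zero.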
These concepts might not sound very intuitive at this point. I recommend going through the prequel article. If you're interested in some theoretical statistics, you can refer to Introduction to Time Series and Forecasting by Brockwell and Davis. The book is a bit stats-heavy, but if you have the skill to read between the lines, you can understand the concepts and tangentially touch the statistics.
Back to checking stationarity: we'll be using the rolling statistics plots along with the Dickey-Fuller test results a lot, so I have defined a function which takes a TS as input and generates them for us. Please note that I've plotted standard deviation instead of variance to keep the unit similar to the mean.
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
    # Determine rolling statistics over a 12-month window
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()

    # Plot rolling statistics
    plt.plot(timeseries, color='blue', label='Original')
    plt.plot(rolmean, color='red', label='Rolling Mean')
    plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    # Perform Dickey-Fuller test
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)
The code is pretty straightforward. Please feel free to discuss the code in the comments if you face challenges in grasping it.
Let's run it for our input series:
test_stationarity(ts)
Though the variation in standard deviation is small, the mean is clearly increasing with time, so this is not a stationary series. Also, the test statistic is way more than the critical values. Note that the signed values should be compared and not the absolute values.
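The point about signed values is easy to get wrong, so here is a tiny sketch. The helper name and all the numbers are made up for illustration; they are not the output of the run above.

```python
def rejects_null(test_statistic, critical_value):
    """Reject non-stationarity only if the statistic lies below the critical value."""
    return test_statistic < critical_value

critical_5pct = -2.9  # made-up critical value for illustration

print(rejects_null(0.8, critical_5pct))    # 0.8 is greater than -2.9: not rejected
print(rejects_null(-3.5, critical_5pct))   # -3.5 is less than -2.9: rejected
print(rejects_null(3.5, critical_5pct))    # larger in absolute value, but still not rejected
```

The last case is the trap: comparing absolute values would wrongly suggest stationarity for a large positive statistic.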
Next, we'll discuss the techniques that can be used to take this TS towards stationarity.
4. How to make a Time Series Stationary?
Though the stationarity assumption is made in many TS models, almost no practical time series are stationary. So statisticians have figured out ways to make series stationary, which we'll discuss now. Actually, it's almost impossible to make a series perfectly stationary, but we try to take it as close as possible.
Let's understand what makes a TS non-stationary. There are 2 major reasons behind the non-stationarity of a TS:
1. Trend: varying mean over time. For example, in this case we saw that on average, the number of passengers was growing over time.
2. Seasonality: variations at specific time frames. For example, people might have a tendency to buy cars in a particular month because of pay increments or festivals.
The underlying principle is to model or estimate the trend and seasonality in the series and remove those from the series to get a stationary series. Then statistical forecasting techniques can be implemented on this series. The final step would be to convert the forecasted values back into the original scale by applying the trend and seasonality constraints back.
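The "remove, then put back" idea can be sketched as a round trip. The example below is an illustration under my own assumptions: a hypothetical series with steady percentage growth, a log transform, and first differencing (one of the techniques covered later in this article) as the trend remover. No forecasting is done here; the point is only that every step can be reversed.

```python
import numpy as np
import pandas as pd

# Hypothetical series growing by 2% per month.
idx = pd.date_range('2000-01-01', periods=24, freq='MS')
original = pd.Series(100 * 1.02 ** np.arange(24), index=idx)

# Forward: log transform, then first difference to remove the trend.
logged = np.log(original)
differenced = logged.diff().dropna()

# Backward: put the first log value back in front, cumulative-sum the
# differences to rebuild the log series, then exponentiate to the original scale.
restored_log = pd.concat([logged.iloc[:1], differenced]).cumsum()
restored = np.exp(restored_log)

print(np.allclose(restored.values, original.values))
```

In a real forecast the differenced values would come from a model instead of from the data, but the same cumulative sum and exponent would bring them back to passenger counts.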
Note: I'll be discussing a number of methods. Some might work well in this case and others might not. But the idea is to get the hang of all the methods and not focus on just the problem at hand.
Let's start by working on the trend part.
Estimating & Eliminating Trend
One of the first tricks to reduce trend can be transformation. For example, in this case we can clearly see that there is a significant positive trend. So we can apply a transformation which penalizes higher values more than smaller values. This can be taking a log, square root, cube root, etc. Let's take a log transform here for simplicity:
ts_log = np.log(ts)
plt.plot(ts_log)
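To see why a log transform "penalizes" higher values, consider a toy sequence that doubles at every step (my own made-up numbers). The raw steps keep growing, while the steps on the log scale are all the same size.

```python
import numpy as np

values = np.array([100.0, 200.0, 400.0, 800.0])   # doubles at every step

steps = np.diff(values)               # raw increments keep getting larger
log_steps = np.diff(np.log(values))   # increments on the log scale are constant

print(steps)
print(log_steps)
```

Each log increment equals log(2), so growth by a constant percentage turns into a straight line after the transform.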
In this simpler case, it is easy to see a forward trend in the data. But it's not very intuitive in the presence of noise. So we can use some techniques to estimate or model this trend and then remove it from the series. There can be many ways of doing it, and some of the most commonly used are:
1. Aggregation: taking the average for a time period, like monthly/weekly averages
2. Smoothing: taking rolling averages
3. Polynomial Fitting: fitting a regression model
I will discuss smoothing here and you should try other techniques as well, which might work out for other problems. Smoothing refers to taking rolling estimates, i.e. considering the past few instances. There can be various ways, but I will discuss two of those here.
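Since I won't cover aggregation and polynomial fitting in detail, here is a rough sketch of both on a synthetic series. The series (a linear trend plus a 12-month cycle) and all the names are my own assumptions, and a straight-line fit is only the simplest possible polynomial.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: linear trend plus a 12-month cycle.
idx = pd.date_range('2000-01-01', periods=48, freq='MS')
t = np.arange(48)
series = pd.Series(0.5 * t + 3 * np.sin(2 * np.pi * t / 12), index=idx)

# Aggregation: one average per calendar year smooths out the monthly cycle.
yearly_mean = series.groupby(series.index.year).mean()

# Polynomial fitting: a degree-1 least-squares fit as the trend estimate.
slope, intercept = np.polyfit(t, series.to_numpy(), deg=1)
detrended = series - (slope * t + intercept)

print(yearly_mean)
print(slope)
```

The yearly averages rise steadily and the fitted slope lands near the true 0.5 per month, so either estimate could be subtracted from the series to remove the trend.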
Moving average
In this approach, we take the average of 'k' consecutive values depending on the frequency of the time series. Here we can take the average over the past 1 year, i.e. the last 12 values. Pandas has specific functions defined for determining rolling statistics.
moving_avg = ts_log.rolling(window=12).mean()
plt.plot(ts_log)
plt.plot(moving_avg, color='red')
The red line shows the rolling mean. Let's subtract this from the original series. Note that since we are taking the average of the last 12 values, the rolling mean is not defined for the first 11 values. This can be observed as:
ts_log_moving_avg_diff = ts_log - moving_avg
ts_log_moving_avg_diff.head(12)
Notice the first 11 being NaN. Let's drop these NaN values and check the plots to test stationarity.
ts_log_moving_avg_diff.dropna(inplace=True)
test_stationarity(ts_log_moving_avg_diff)
This looks like a much better series. The rolling values appear to be varying slightly but there is no specific trend. Also, the test statistic is smaller than the 5% critical value, so we can say with 95% confidence that this is a stationary series.
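Why does subtracting a rolling mean remove a trend? A small sketch on a hypothetical pure-trend series (my own toy data) makes it visible: for a straight line, the 12-value rolling mean always lags the current value by the same amount, so the difference is a constant.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a pure linear trend of 2 units per step.
trend_only = pd.Series(2.0 * np.arange(36))

rolling_mean = trend_only.rolling(window=12).mean()
residual = (trend_only - rolling_mean).dropna()

missing = int(rolling_mean.isna().sum())   # windows with fewer than 12 values
print(missing)
print(residual.unique())
```

The rolling mean is missing for the first 11 positions, and after dropping them the residual is flat, which is the trend-free behaviour we are after.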
However, a drawback of this particular approach is that the time period has to be strictly defined. In this case we can take yearly averages, but in complex situations like forecasting a stock price, it's difficult to come up with a number. So we take a 'weighted moving average' where more recent values are given a higher weight. There can be many techniques for assigning weights. A popular one is the exponentially weighted moving average, where weights are assigned to all the previous values with a decay factor. Find details here. This can be implemented in Pandas as:
expwighted_avg = ts_log.ewm(halflife=12).mean()
plt.plot(ts_log)
plt.plot(expwighted_avg, color='red')
Note that here the parameter 'halflife' is used to define the amount of exponential decay. This is just an assumption here and would depend largely on the business domain. Other parameters like span and center of mass can also be used to define decay, which are discussed in the link shared above.
Now, let's remove this from the series and check stationarity:
ts_log_ewma_diff = ts_log - expwighted_avg
test_stationarity(ts_log_ewma_diff)
This TS has even smaller variations in mean and standard deviation. Also, the test statistic is smaller than the 1% critical value, which is better than the previous case. Note that in this case there will be no missing values, as all values from the start are given weights. So it'll work even with no previous values.
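To make the halflife parameter less of a black box, here is a sketch that recomputes the last exponentially weighted value by hand. The five-value series is my own toy data, and the check relies on the pandas default adjust=True, where an observation that is 'age' steps old gets weight 0.5 ** (age / halflife).

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0])
halflife = 2
ewm_mean = x.ewm(halflife=halflife).mean()   # pandas default: adjust=True

# Manual check of the last value: the weight halves every 'halflife' steps back.
ages = np.arange(len(x))[::-1]               # 4, 3, 2, 1, 0 (0 = most recent)
weights = 0.5 ** (ages / halflife)
manual_last = (weights * x.to_numpy()).sum() / weights.sum()

print(ewm_mean.iloc[-1], manual_last)
print(int(ewm_mean.isna().sum()))            # no missing values at the start
```

The two numbers agree, and the very first output equals the first observation, which is why no rows need to be dropped with this method.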
Eliminating Trend and Seasonality
The simple trend reduction techniques discussed before don't work in all cases, particularly the ones with high seasonality. Let's discuss two ways of removing trend and seasonality:
1. Differencing: taking the difference with a particular time lag
2. Decomposition: modeling both trend and seasonality and removing them from the model.
Differencing
One of the most common methods of dealing with both trend and seasonality is differencing. In this technique, we take the difference of the observation at a particular instant with that at the previous instant. This mostly works well in improving stationarity. First order differencing can be done in Pandas as:
ts_log_diff = ts_log - ts_log.shift()
plt.plot(ts_log_diff)
This appears to have reduced the trend considerably. Let's verify using our plots:
ts_log_diff.dropna(inplace=True)
test_stationarity(ts_log_diff)
We can see that the mean and std have small variations with time. Also, the Dickey-Fuller test statistic is less than the 10% critical value, thus the TS is stationary with 90% confidence. We can also take second or third order differences, which might get even better results in certain applications. I leave it to you to try them out.
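If you want a starting point for trying higher-order and lagged differences, here is a sketch on two toy series of my own: a quadratic trend, where one difference is not enough but two are, and a repeating three-step pattern, where differencing at the length of the cycle removes it.

```python
import numpy as np
import pandas as pd

t = np.arange(10, dtype=float)
quadratic = pd.Series(t ** 2)            # hypothetical series with a curved trend

first_diff = quadratic.diff().dropna()           # same as quadratic - quadratic.shift()
second_diff = quadratic.diff().diff().dropna()   # difference of the differences

# Hypothetical repeating pattern with a cycle length of 3.
cyclic = pd.Series(np.tile([1.0, 5.0, 3.0], 4))
lag_diff = cyclic.diff(3).dropna()               # difference with a time lag of 3

print(first_diff.tolist())
print(second_diff.tolist())
print(lag_diff.tolist())
```

The first difference still climbs, the second difference is constant, and the lag-3 difference of the cyclic series is all zeros. For monthly data with a yearly pattern, the matching lag would be 12.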
Decomposing
In this approach, both trend and seasonality are modeled separately and the remaining part of the series is returned. I'll skip the statistics and come to the results:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts_log)
trend = decomposition.trend
seasonal = decomposition.seasonal