Complete Guide To Create A Time Series Forecast (With Codes in Python) PDF

The document provides a complete guide to creating a time series forecast using Python code. It discusses key steps like loading time series data into Pandas, checking for stationarity, making the series stationary if needed, and forecasting. The guide loads airline passenger data, checks it visually and through statistical tests for trends and non-stationarity, discusses options for making it stationary like differencing, and provides code examples for forecasting. The overall goal is to familiarize readers with techniques for time series analysis and forecasting in Python.

Uploaded by

Teodor von Burg
Copyright
© © All Rights Reserved

21/6/2016

Complete guide to create a Time Series Forecast (with Codes in Python)

Introduction
Time Series (referred to as TS from now on) is considered to be one of the lesser-known skills in the
analytics space (even I had little clue about it a couple of days back). But as you know, our
inaugural Mini Hackathon is based on it, so I set myself on a journey to learn the basic steps for
solving a Time Series problem, and here I am sharing the same with you. These will definitely help
you get a decent model in our hackathon today.

Before going through this article, I highly recommend reading A Complete Tutorial on Time Series
Modeling in R, which is like a prequel to this article. It focuses on fundamental concepts and is based
on R, whereas I will focus on using these concepts to solve a problem end-to-end, along with code in
Python. Many resources exist for TS in R but very few for Python, so I'll be using Python in
this article.
Our journey will go through the following steps:
1. What makes Time Series Special?
2. Loading and Handling Time Series in Pandas
3. How to Check Stationarity of a Time Series?
4. How to make a Time Series Stationary?
5. Forecasting a Time Series

http://www.analyticsvidhya.com/blog/2016/02/timeseriesforecastingcodespython/

1. What makes Time Series Special?
As the name suggests, a TS is a collection of data points collected at constant time intervals. These
are analyzed to determine the long-term trend so as to forecast the future or perform some other
form of analysis. But what makes a TS different from, say, a regular regression problem? There are 2
things:
1. It is time dependent. So the basic assumption of a linear regression model that the observations are
independent doesn't hold in this case.
2. Along with an increasing or decreasing trend, most TS have some form of seasonality, i.e.
variations specific to a particular time frame. For example, if you look at the sales of a woolen jacket over
time, you will invariably find higher sales in winter seasons.
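
To make the first point concrete, here is a minimal sketch that measures how strongly a series is correlated with its own past. The series is synthetic (a made-up trend plus a 12-month cycle plus noise, not the AirPassengers data), and the names ts_demo and shuffled are only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (an assumption for illustration only):
# linear trend + 12-month seasonal cycle + random noise.
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=120, freq="MS")
months = np.arange(120)
values = 100 + 2.0 * months + 15 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 5, 120)
ts_demo = pd.Series(values, index=index)

# Correlation of the series with itself shifted by 1 month and by 12 months.
lag1_corr = ts_demo.autocorr(lag=1)
lag12_corr = ts_demo.autocorr(lag=12)

# Shuffling destroys the time ordering, so the lag-1 correlation should be near zero.
shuffled = pd.Series(rng.permutation(ts_demo.to_numpy()), index=index)
shuffled_corr = shuffled.autocorr(lag=1)

print(f"lag-1: {lag1_corr:.2f}, lag-12: {lag12_corr:.2f}, shuffled lag-1: {shuffled_corr:.2f}")
```

A high autocorrelation in the ordered series, against a value near zero after shuffling, is exactly the dependence between observations that a plain regression model assumes away.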

Because of the inherent properties of a TS, there are various steps involved in analyzing it. These
are discussed in detail below. Let's start by loading a TS object in Python. We'll be using the popular
AirPassengers data set which can be downloaded here.
Please note that the aim of this article is to familiarize you with the various techniques used for TS in
general. The example considered here is just for illustration, and I will focus on covering a breadth of
topics rather than on making a very accurate forecast.

2. Loading and Handling Time Series in Pandas
Pandas has dedicated libraries for handling TS objects, particularly the datetime64[ns] class, which
stores time information and allows us to perform some operations really fast. Let's start by firing up
the required libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter/IPython magic: draw plots inline in the notebook
%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 6

Now, we can load the data set and look at some initial rows and the data types of the columns:

data = pd.read_csv('AirPassengers.csv')
print(data.head())
print('\nData Types:')
print(data.dtypes)

The data contains a particular month and the number of passengers travelling in that month. But this is
still not read as a TS object, as the data types are object and int. In order to read the data as a time
series, we have to pass special arguments to the read_csv command:

from datetime import datetime

# Convert strings such as '1949-01' into datetime objects
dateparse = lambda dates: datetime.strptime(dates, '%Y-%m')
# Note: date_parser is deprecated since pandas 2.0; there, date_format='%Y-%m' can be passed instead
data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month', date_parser=dateparse)
print(data.head())

Let's understand the arguments one by one:
1. parse_dates: This specifies the column which contains the date-time information. As we saw above, the
column name is 'Month'.
2. index_col: A key idea behind using Pandas for TS data is that the index has to be the variable depicting
date-time information. So this argument tells pandas to use the 'Month' column as the index.
3. date_parser: This specifies a function which converts an input string into a datetime variable. By default
Pandas reads data in the format 'YYYY-MM-DD HH:MM:SS'. If the data is not in this format, the format has
to be manually defined. Something similar to the dateparse function defined here can be used for this
purpose.
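
As an alternative that does not depend on the date_parser argument, the same result can be sketched by reading the file first and converting the column afterwards with pd.to_datetime. The CSV text below is a placeholder with made-up values in the same two-column layout, and the names csv_text, raw and data_demo are only for illustration.

```python
import io
import pandas as pd

# Placeholder rows in the same layout as the article's file (values are made up).
csv_text = "Month,#Passengers\n2000-01,10\n2000-02,12\n2000-03,15\n"

# Read the raw text, then convert the 'Month' strings with an explicit format.
raw = pd.read_csv(io.StringIO(csv_text))
raw["Month"] = pd.to_datetime(raw["Month"], format="%Y-%m")

# Use the parsed dates as the index, as the article recommends for TS data.
data_demo = raw.set_index("Month")
print(data_demo.index)
```

Converting after loading keeps the format string in one visible place and behaves the same way across pandas versions.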

Now we can see that the data has a time object as index and #Passengers as the column. We can
cross-check the data type of the index with the following command:

data.index

Notice the dtype='datetime64[ns]', which confirms that it is a datetime object. As a personal
preference, I would convert the column into a Series object to avoid referring to column names
every time I use the TS. Please feel free to keep it as a DataFrame if that works better for you.

ts = data['#Passengers']
ts.head(10)

Before going further, I'll discuss some indexing techniques for TS data. Let's start by selecting a
particular value in the Series object. This can be done in the following 2 ways:

#1. Specify the index as a string constant:
ts['1949-01-01']

#2. Import the datetime library and use the 'datetime' function:
from datetime import datetime
ts[datetime(1949,1,1)]

Both would return the value 112, which can also be confirmed from the previous output. Suppose we
want all the data up to May 1949. This can be done in 2 ways:

#1. Specify the entire range:
ts['1949-01-01':'1949-05-01']

#2. Use ':' if one of the indices is at the ends:
ts[:'1949-05-01']

Both would yield the same output: the values from January 1949 through May 1949.

There are 2 things to note here:
1. Unlike numeric indexing, the end index is included here. For instance, if we index a list as a[:5], then it
would return the values at indices [0,1,2,3,4]. But here the index '1949-05-01' was included in the
output.
2. The indices have to be sorted for ranges to work. If you randomly shuffle the index, this won't work.
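
Both notes can be checked on a tiny example. This is a sketch on a made-up six-month series (the name ts_demo and its values are placeholders); it uses .loc to make the label-based slicing explicit and sort_index() to restore the order after a shuffle.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=6, freq="MS")
ts_demo = pd.Series([10, 12, 15, 11, 14, 18], index=index)

# Positional slicing on a list: the end position 5 is excluded.
a = list(range(10))
positional = a[:5]

# Label-based slicing on a date index: the end label is included.
label_slice = ts_demo.loc[:"2000-05-01"]

# After shuffling, the index is no longer sorted; sort it before slicing by range.
shuffled = ts_demo.sample(frac=1, random_state=0)
restored = shuffled.sort_index()

print(len(positional), len(label_slice), restored.index.is_monotonic_increasing)
```

Checking index.is_monotonic_increasing before slicing by a date range is a cheap way to catch an unsorted index early.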

Consider another instance where you need all the values of the year 1949. This can be done as:

ts['1949']

The month part was omitted. Similarly, if you want all days of a particular month, the day part can be
omitted.
Now, let's move on to analyzing the TS.

3. How to Check Stationarity of a Time Series?
A TS is said to be stationary if its statistical properties, such as mean and variance, remain constant
over time. But why is it important? Most of the TS models work on the assumption that the TS is
stationary. Intuitively, we can say that if a TS has a particular behaviour over time, there is a very high
probability that it will follow the same in the future. Also, the theories related to stationary series are
more mature and easier to implement as compared to non-stationary series.
Stationarity is defined using a very strict criterion. However, for practical purposes we can assume the
series to be stationary if it has constant statistical properties over time, i.e. the following:
1. constant mean
2. constant variance
3. an autocovariance that does not depend on time.
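
Stated a little more formally, these three conditions are usually called weak (or covariance) stationarity. As a sketch in standard notation, with X_t denoting the value of the series at time t (the symbols \mu, \sigma^2, \gamma and h are notation introduced here, not taken from the article):

```latex
\begin{aligned}
\mathbb{E}[X_t] &= \mu && \text{for all } t,\\
\operatorname{Var}(X_t) &= \sigma^2 && \text{for all } t,\\
\operatorname{Cov}(X_t, X_{t+h}) &= \gamma(h) && \text{for all } t \text{ and each lag } h.
\end{aligned}
```

The third line says that the covariance between two observations depends only on how far apart they are (the lag h), not on where in time they sit.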

I'll skip the details as it is very clearly defined in this article. Let's move on to the ways of testing
stationarity. First and foremost is to simply plot the data and analyze it visually. The data can be plotted
using the following command:

plt.plot(ts)

It is clearly evident that there is an overall increasing trend in the data along with some seasonal
variations. However, it might not always be possible to make such visual inferences (we'll see such
cases later). So, more formally, we can check stationarity using the following:
1. Plotting Rolling Statistics: We can plot the moving average or moving variance and see if it varies
with time. By moving average/variance I mean that at any instant 't', we'll take the average/variance of
the last year, i.e. the last 12 months. But again, this is more of a visual technique.
2. Dickey-Fuller Test: This is one of the statistical tests for checking stationarity. Here the null hypothesis
is that the TS is non-stationary. The test results comprise a Test Statistic and some Critical
Values for different confidence levels. If the Test Statistic is less than the Critical Value, we can
reject the null hypothesis and say that the series is stationary. Refer to this article for details.
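
The rolling-statistics idea in point 1 can be sketched numerically as well as visually. The two series below are synthetic (one with a made-up upward trend, one without), and rolling_mean_drift is a hypothetical helper written for this illustration, not a pandas function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
index = pd.date_range("2000-01-01", periods=96, freq="MS")

# Synthetic data for illustration: one series with a trend, one without.
trending = pd.Series(50 + 1.5 * np.arange(96) + rng.normal(0, 3, 96), index=index)
flat = pd.Series(50 + rng.normal(0, 3, 96), index=index)

def rolling_mean_drift(series, window=12):
    """Difference between the last and the first defined rolling mean."""
    rolmean = series.rolling(window=window).mean().dropna()
    return rolmean.iloc[-1] - rolmean.iloc[0]

trend_drift = rolling_mean_drift(trending)
flat_drift = rolling_mean_drift(flat)
print(f"trending: {trend_drift:.1f}, flat: {flat_drift:.1f}")
```

A rolling mean that ends far from where it started is the numeric counterpart of a red line that climbs across the plot.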

These concepts might not sound very intuitive at this point. I recommend going through the prequel
article. If you're interested in some theoretical statistics, you can refer to Introduction to Time Series
and Forecasting by Brockwell and Davis. The book is a bit stats-heavy, but if you have the skill to
read between the lines, you can understand the concepts and tangentially touch the statistics.
Back to checking stationarity: we'll be using the rolling statistics plots along with Dickey-Fuller test
results a lot, so I have defined a function which takes a TS as input and generates them for us.
Please note that I've plotted standard deviation instead of variance to keep the unit similar to the mean.

from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries):

    # Determine rolling statistics over a 12-month window
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()

    # Plot rolling statistics:
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    # Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

The code is pretty straightforward. Please feel free to discuss the code in the comments if you face
challenges in grasping it.
Let's run it for our input series:

test_stationarity(ts)

Though the variation in standard deviation is small, the mean is clearly increasing with time, so this is
not a stationary series. Also, the test statistic is much larger than the critical values. Note that
the signed values should be compared and not the absolute values.
Next, we'll discuss the techniques that can be used to take this TS towards stationarity.

4. How to make a Time Series Stationary?
Though the stationarity assumption is made in many TS models, almost no practical time series is
stationary. So statisticians have figured out ways to make series stationary, which we'll discuss now.
Actually, it's almost impossible to make a series perfectly stationary, but we try to take it as close as
possible.

Let's understand what makes a TS non-stationary. There are 2 major reasons behind the
non-stationarity of a TS:
1. Trend: varying mean over time. For example, in this case we saw that on average, the number of
passengers was growing over time.
2. Seasonality: variations at specific time frames. For example, people might have a tendency to buy cars in a
particular month because of pay increments or festivals.
The underlying principle is to model or estimate the trend and seasonality in the series and remove
those from the series to get a stationary series. Then statistical forecasting techniques can be
implemented on this series. The final step would be to convert the forecasted values into the original
scale by applying the trend and seasonality constraints back.
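
The "remove, model, then put back" round trip can be sketched with two of the transformations this article covers, a log transform and first-order differencing. The series is synthetic (a made-up 2% monthly growth), the modelling step is left out, and the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=24, freq="MS")
original = pd.Series(100.0 * 1.02 ** np.arange(24), index=index)  # synthetic growth

# Forward: log transform, then first-order differencing.
logged = np.log(original)
stationary_part = logged.diff().dropna()

# ... a forecasting model would be fitted to `stationary_part` here ...

# Backward: the cumulative sum undoes the differencing, the first log value
# restores the level, and exp undoes the log.
restored_log = stationary_part.cumsum() + logged.iloc[0]
restored = np.exp(restored_log)

print(np.allclose(restored.to_numpy(), original.iloc[1:].to_numpy()))
```

Each forward step has to be undone in reverse order, which is why it helps to keep the starting value of the log series around.
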
Note: I'll be discussing a number of methods. Some might work well in this case and others might
not. But the idea is to get the hang of all the methods and not focus on just the problem at hand.
Let's start by working on the trend part.

Estimating & Eliminating Trend
One of the first tricks to reduce trend can be transformation. For example, in this case we can
clearly see that there is a significant positive trend. So we can apply a transformation which
penalizes higher values more than smaller values. This can be taking a log, square root, cube root,
etc. Let's take a log transform here for simplicity:

ts_log = np.log(ts)
plt.plot(ts_log)

In this simpler case, it is easy to see a forward trend in the data. But it's not very intuitive in the presence
of noise. So we can use some techniques to estimate or model this trend and then remove it from
the series. There can be many ways of doing it, and some of the most commonly used are:
1. Aggregation: taking the average for a time period, like monthly/weekly averages
2. Smoothing: taking rolling averages
3. Polynomial Fitting: fitting a regression model

I will discuss smoothing here, and you should try the other techniques as well, which might work out for
other problems. Smoothing refers to taking rolling estimates, i.e. considering the past few instances.
There can be various ways, but I will discuss two of those here.
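
Since polynomial fitting is not covered further here, this is a minimal sketch of it, assuming a straight-line trend and using np.polyfit on a synthetic series (the data and the names series, fitted_trend and detrended are made up for illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
index = pd.date_range("2000-01-01", periods=60, freq="MS")
t = np.arange(60)

# Synthetic series: straight-line trend plus noise.
series = pd.Series(5.0 + 0.03 * t + rng.normal(0, 0.05, 60), index=index)

# Fit a degree-1 polynomial by least squares; coefficients come back
# highest degree first, i.e. (slope, intercept).
slope, intercept = np.polyfit(t, series.to_numpy(), deg=1)
fitted_trend = pd.Series(intercept + slope * t, index=index)

# Subtracting the fitted trend leaves the residual fluctuations.
detrended = series - fitted_trend
print(f"estimated slope: {slope:.4f}")
```

A higher deg would let the fitted trend bend, at the risk of absorbing some of the seasonal pattern into the trend estimate.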

Moving average
In this approach, we take the average of 'k' consecutive values depending on the frequency of the time
series. Here we can take the average over the past 1 year, i.e. the last 12 values. Pandas has specific
functions defined for determining rolling statistics.

moving_avg = ts_log.rolling(window=12).mean()
plt.plot(ts_log)
plt.plot(moving_avg, color='red')

The red line shows the rolling mean. Let's subtract this from the original series. Note that since we
are taking the average of the last 12 values, the rolling mean is not defined for the first 11 values. This can be
observed as:

ts_log_moving_avg_diff = ts_log - moving_avg
ts_log_moving_avg_diff.head(12)

Notice the first 11 being NaN. Let's drop these NaN values and check the plots to test stationarity.

ts_log_moving_avg_diff.dropna(inplace=True)
test_stationarity(ts_log_moving_avg_diff)

This looks like a much better series. The rolling values appear to be varying slightly, but there is no
specific trend. Also, the test statistic is smaller than the 5% critical value, so we can reject the null
hypothesis at the 5% level and say with 95% confidence that this is a stationary series.
However, a drawback of this particular approach is that the time period has to be strictly defined. In
this case we can take yearly averages, but in complex situations like forecasting a stock price, it's
difficult to come up with a number. So we take a 'weighted moving average' where more recent
values are given a higher weight. There can be many techniques for assigning weights. A popular one
is the exponentially weighted moving average, where weights are assigned to all the previous values
with a decay factor. Find details here. This can be implemented in Pandas as:

expwighted_avg = ts_log.ewm(halflife=12).mean()
plt.plot(ts_log)
plt.plot(expwighted_avg, color='red')

Note that here the parameter 'halflife' is used to define the amount of exponential decay. This is just
an assumption here and would depend largely on the business domain. Other parameters like span
and center of mass can also be used to define decay, which are discussed in the link shared above.
Now, let's remove this from the series and check stationarity:

ts_log_ewma_diff = ts_log - expwighted_avg
test_stationarity(ts_log_ewma_diff)

This TS has even smaller variations in mean and standard deviation. Also, the test
statistic is smaller than the 1% critical value, which is better than the previous case. Note that in
this case there will be no missing values, as all values from the start are given weights. So it'll work
even with no previous values.
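
To see what the half-life means in numbers, here is a small sketch that rebuilds an exponentially weighted average by hand and compares it with pandas. It uses the relation documented by pandas, alpha = 1 - exp(-ln(2) / halflife), and the simpler recursive form (adjust=False). The call in this article uses the default adjust=True, which weights the first few values slightly differently. The input values are made up.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 4.0, 3.0, 5.0, 7.0])  # made-up numbers
halflife = 2

# Smoothing factor implied by the half-life.
alpha = 1 - np.exp(-np.log(2) / halflife)

# Recursive form: y_0 = x_0, then y_t = (1 - alpha) * y_{t-1} + alpha * x_t.
manual = [values.iloc[0]]
for x in values.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

from_pandas = values.ewm(halflife=halflife, adjust=False).mean()
print(np.allclose(from_pandas.to_numpy(), manual))
```

After halflife steps the weight of an observation has decayed to half, since (1 - alpha) ** halflife equals 0.5. The first output equals the first input, which is why no values go missing at the start.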

Eliminating Trend and Seasonality
The simple trend reduction techniques discussed before don't work in all cases, particularly the ones
with high seasonality. Let's discuss two ways of removing trend and seasonality:
1. Differencing: taking the difference with a particular time lag
2. Decomposition: modeling both trend and seasonality and removing them from the model.

Differencing

One of the most common methods of dealing with both trend and seasonality is differencing. In this
technique, we take the difference of the observation at a particular instant with that at the previous
instant. This mostly works well in improving stationarity. First-order differencing can be done in
Pandas as:

ts_log_diff = ts_log - ts_log.shift()
plt.plot(ts_log_diff)

This appears to have reduced the trend considerably. Let's verify using our plots:

ts_log_diff.dropna(inplace=True)
test_stationarity(ts_log_diff)

We can see that the mean and standard deviation vary only slightly with time. Also, the Dickey-Fuller
test statistic is less than the 10% critical value, thus the TS is stationary with 90% confidence. We
can also take second or third order differences, which might get even better results in certain
applications. I leave it to you to try them out.
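
As a starting point for those experiments, here is a sketch of second-order differencing and of differencing at a seasonal lag, using the pandas diff method. Both series are synthetic and chosen so the result is easy to verify: a quadratic trend and a pure 12-month cycle.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=36, freq="MS")
t = np.arange(36)

# Quadratic trend: one difference leaves a linear trend,
# a second difference leaves a constant.
quadratic = pd.Series(t.astype(float) ** 2, index=index)
first_diff = quadratic.diff()
second_diff = quadratic.diff().diff().dropna()

# Pure 12-month pattern: differencing at lag 12 subtracts the value
# from the same month one year earlier.
seasonal = pd.Series(10 * np.sin(2 * np.pi * t / 12), index=index)
seasonal_diff = seasonal.diff(12).dropna()

print(second_diff.unique(), seasonal_diff.abs().max())
```

Each round of differencing costs observations at the start of the series (one per order, twelve for the seasonal lag), so the NaN values need dropping before testing stationarity.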

Decomposing
In this approach, both trend and seasonality are modeled separately and the remaining part of the
series is returned. I'll skip the statistics and come to the results:

from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts_log)

trend = decomposition.trend
seasonal = decomposition.seasonal
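
To give a feel for what such a decomposition separates, here is a simplified additive decomposition written by hand in pandas. It is a sketch on synthetic data (a made-up linear trend plus a 12-month sine pattern) and is not necessarily the exact algorithm statsmodels uses. The names trend_demo, seasonal_demo and residual_demo are chosen to avoid clashing with the variables above.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=72, freq="MS")
t = np.arange(72)
true_seasonal = 2.0 * np.sin(2 * np.pi * t / 12)
series = pd.Series(10 + 0.1 * t + true_seasonal, index=index)

# Trend: centered moving average over one full year (a 12-term mean,
# averaged over 2 neighbours and shifted back to the window's center).
trend_demo = series.rolling(window=12).mean().rolling(window=2).mean().shift(-6)

# Seasonal: average detrended value for each calendar month.
detrended = series - trend_demo
seasonal_demo = detrended.groupby(detrended.index.month).transform("mean")

# Residual: what is left after removing trend and seasonality.
residual_demo = series - trend_demo - seasonal_demo
print(residual_demo.dropna().abs().max())
```

The centered average leaves six undefined values at each end, so the residual is only available for the middle of the series.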
