Time Series Using Python
import pandas as pd
# Import the data
df = pd.read_csv("Blog_Orders.csv")
df['Date'] = pd.to_datetime(df['Date'])
# Set the date as index
df = df.set_index('Date')
# Select the proper time period for weekly aggregation
df = df['2017-01-02':'2019-12-29'].resample('W').sum()
df.head()
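Since Blog_Orders.csv is not included here, the following is a minimal sketch of what the weekly aggregation does, using a small made-up daily dataset (the data and the names daily and weekly are assumptions for illustration only). It suggests why the selected period runs from a Monday (2017-01-02) to a Sunday (2019-12-29): with pandas' default 'W' frequency, weeks end on Sunday, so this range covers only complete weeks.

```python
import pandas as pd

# Hypothetical stand-in for Blog_Orders.csv: one order per day for two full weeks.
dates = pd.date_range("2017-01-02", "2017-01-15", freq="D")  # a Monday through a Sunday
daily = pd.DataFrame({"Date": dates, "Orders": 1}).set_index("Date")

# 'W' groups days into weeks ending on Sunday; each row is labeled with that Sunday.
weekly = daily.loc["2017-01-02":"2017-01-15"].resample("W").sum()
print(weekly)
```

Each weekly row is the sum of seven daily values, labeled with the Sunday that closes the week.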
Examine and Prepare Your Dataset for Modeling
Check the Data for Common Time Series Patterns:
• It’s important to check any time series for patterns that can affect the results and inform
which forecasting model to use. Common patterns include trend, seasonality, and cycles.
• Most time series will contain one or more of these patterns, but probably not all of them. It’s
still a good idea to check for each, since they can affect the performance of the model and
may even require different modeling approaches.
• Two useful methods for finding these patterns are visualization and decomposition.
Visualize the Data
• The first step is simply to plot the dataset. This example uses the matplotlib package. Since a
general trend is easier to see in the mean, I plot both the original weekly data (blue line) and the
data resampled to a monthly average (orange line).
• By changing the 'M' (month) in y.resample('M'), you can plot the mean over different
aggregation periods. For example, if you have a very long history of data, you might plot the yearly
average by changing 'M' to 'Y'.
import warnings
import matplotlib.pyplot as plt
y = df['Orders']
fig, ax = plt.subplots(figsize=(20, 6))
ax.plot(y, marker='.', linestyle='-', linewidth=0.5, label='Weekly')
ax.plot(y.resample('M').mean(), marker='o', markersize=8, linestyle='-', label='Monthly Mean Resample')
ax.set_ylabel('Orders')
ax.legend();
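As a rough sketch of switching the aggregation period, the snippet below computes monthly and yearly means for a synthetic weekly series (y_demo and the helper resample_mean are hypothetical names, not part of the original example). One caveat worth knowing: pandas 2.2 renamed the month-end and year-end aliases from 'M' and 'Y' to 'ME' and 'YE', so the helper tries the newer spelling first and falls back to the older one.

```python
import numpy as np
import pandas as pd

def resample_mean(series, new_alias, old_alias):
    """Mean per period, trying the newer pandas alias first, then the older one."""
    try:
        return series.resample(new_alias).mean()
    except ValueError:  # older pandas does not know 'ME' / 'YE'
        return series.resample(old_alias).mean()

# Synthetic weekly series covering 2017-01-08 through 2019-12-29 (156 Sundays).
idx = pd.date_range("2017-01-08", periods=156, freq="W")
y_demo = pd.Series(np.arange(156, dtype=float), index=idx, name="Orders")

monthly_mean = resample_mean(y_demo, "ME", "M")  # one value per calendar month
yearly_mean = resample_mean(y_demo, "YE", "Y")   # one value per calendar year

print(len(monthly_mean), len(yearly_mean))
```

Either result can be passed to ax.plot in place of y.resample('M').mean() in the plot above.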
Decompose the Data
• Looking at the graph of the sales data above, we can see a general
increasing trend with no clear pattern of seasonal or cyclical changes.
• The next step is to decompose the data to reveal the structure that a
single line plot hides.
• A useful Python function, seasonal_decompose in the statsmodels
package, decomposes the data into four components:
• Observed
• Trend
• Seasonal
• Residual
import statsmodels.api as sm
# Plot the observed, trend, seasonal, and residual components
def seasonal_decompose(y):
    decomposition = sm.tsa.seasonal_decompose(y, model='additive', extrapolate_trend='freq')
    fig = decomposition.plot()
    fig.set_size_inches(14, 7)
    plt.show()
seasonal_decompose(y)
Looking at the four decomposed components, we can tell that our sales
dataset has an overall increasing trend as well as a yearly seasonality.
Which model you choose depends on the components present in your
dataset, such as trend, seasonality, or cycles.
Check for Stationarity
• Next, we need to check whether the dataset is stationary or not. A dataset
is stationary if its statistical properties like mean, variance, and
autocorrelation do not change over time.
• Most time series datasets related to business activity are not stationary
since there are usually all sorts of non-stationary elements like trends and
economic cycles.
• But since most time series forecasting models assume stationarity, and
rely on mathematical transformations related to it, to make predictions, we
need to ‘stationarize’ the time series as part of the process of fitting a model.
• Two common methods to check for stationarity are Visualization and the
Augmented Dickey-Fuller (ADF) Test. Python makes both approaches
easy:
Visualization - Check for Stationarity
This method plots the rolling statistics (mean and standard deviation) to show at a glance whether
they change substantially over time.
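The code for this plot is not shown in this walkthrough, so the following is only a sketch of how such a check might look. The function name plot_rolling_statistics, its window=12 default, and the synthetic series are assumptions; it is meant to play the role of the test_stationarity helper that is called later.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_rolling_statistics(timeseries, title, window=12):
    """Plot a series together with its rolling mean and rolling standard deviation."""
    rolling_mean = timeseries.rolling(window=window).mean()
    rolling_std = timeseries.rolling(window=window).std()

    fig, ax = plt.subplots(figsize=(16, 4))
    ax.plot(timeseries, label=title)
    ax.plot(rolling_mean, label="Rolling mean")
    ax.plot(rolling_std, label="Rolling std")
    ax.legend()
    return fig, rolling_mean, rolling_std

# Synthetic upward-trending weekly series (illustration only).
rng = np.random.default_rng(0)
idx = pd.date_range("2017-01-08", periods=156, freq="W")
y_demo = pd.Series(100 + 2.0 * np.arange(156) + rng.normal(0, 5, 156), index=idx)

fig, rolling_mean, rolling_std = plot_rolling_statistics(y_demo, "synthetic data")
```

On a trending series like this one, the rolling mean climbs steadily, which is the visual signature of non-stationarity. Binding the function to the name test_stationarity would let the later calls in this walkthrough use it.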
For stationary data, neither the mean nor the standard deviation changes much
over time. But in this case, since the y-axis has such a large scale, we cannot
confidently judge whether our data is stationary simply by viewing the above
graph. Therefore, we should run a second test of stationarity.
Augmented Dickey-Fuller Test
• The ADF test is a statistical hypothesis test whose null hypothesis is that the series has
a unit root (that is, it is non-stationary). We compare the test statistic with the critical
values or, equivalently, the p-value with a chosen significance level.
• Using this test, we can determine whether the processed data is stationary or not
at different levels of confidence.
# Augmented Dickey-Fuller Test
from statsmodels.tsa.stattools import adfuller
Looking at both the visualization and ADF test, we can tell that our
sample sales data is non-stationary.
Make the Data Stationary - Detrending
• To proceed with our time series analysis, we need to stationarize the dataset. There
are many approaches to stationarizing data, but we’ll use detrending, differencing,
and then a combination of the two.
• This detrending method removes the underlying trend in the time series by subtracting
the rolling mean and dividing by the rolling standard deviation:
# Detrending
y_detrend = (y - y.rolling(window=12).mean()) / y.rolling(window=12).std()
test_stationarity(y_detrend, 'de-trended data')
ADF_test(y_detrend, 'de-trended data')
After running both checks again, the results show that the data is now stationary, as indicated
by the relatively smooth rolling mean and rolling standard deviation and by the ADF test.
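To see what this rolling standardization does, here is a small sketch on a synthetic trending series (y_demo and y_detrend_demo are made-up names and data). Two properties are worth noticing: the first window - 1 values are NaN because the rolling window is not yet full, and each remaining value is a z-score relative to its own window, so the upward drift disappears.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series with a strong upward trend plus noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2017-01-08", periods=156, freq="W")
y_demo = pd.Series(100 + 2.0 * np.arange(156) + rng.normal(0, 5, 156), index=idx)

# Same formula as above: subtract the rolling mean, divide by the rolling std.
rolling = y_demo.rolling(window=12)
y_detrend_demo = (y_demo - rolling.mean()) / rolling.std()

print("NaNs at the start:", y_detrend_demo.isna().sum())
print("raw range:", y_demo.max() - y_demo.min())
print("detrended range:", y_detrend_demo.max() - y_detrend_demo.min())
```

Because of the leading NaNs, any function that consumes the detrended series (such as an ADF test) needs to drop them first, for example with dropna().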
Differencing
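This method removes the underlying seasonal or cyclical patterns in the time
series. Since the sample dataset has a 12-month seasonality, a 12-lag
difference is used. Note that y was resampled to weekly frequency earlier, so
shift(12) subtracts the value from 12 weeks before; a lag of 52 would match a
yearly cycle on weekly data.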
# Differencing
y_12lag = y - y.shift(12)
test_stationarity(y_12lag,'12 lag differenced data')
ADF_test(y_12lag,'12 lag differenced data')
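As a minimal illustration of why lagged differencing removes a repeating pattern, the sketch below builds a synthetic series (made-up data, hypothetical names) with a linear trend and a pattern that repeats exactly every 12 steps. Subtracting the value 12 steps earlier cancels the repeating part and leaves only the trend's change over 12 steps.

```python
import numpy as np
import pandas as pd

# Synthetic series: linear trend plus a pattern that repeats every 12 steps.
t = np.arange(48)
seasonal = np.tile(np.arange(12, dtype=float), 4)
y_demo = pd.Series(0.5 * t + seasonal)

# 12-lag difference, written the same way as above.
y_12lag_demo = y_demo - y_demo.shift(12)

# The first 12 values have no earlier counterpart and become NaN;
# the rest are constant because the seasonal pattern cancels out.
print(y_12lag_demo.dropna().unique())

# pandas offers the same operation as a built-in method.
same_as_diff = np.allclose(y_12lag_demo, y_demo.diff(12), equal_nan=True)
print(same_as_diff)
```

The lag only cancels the pattern when it matches the true period of the seasonality, so it should be chosen in the same units as the series' frequency.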