Time Series Analysis With Python
In this tutorial, we work with a daily time series from Open Power System Data (OPSD) for Germany, which has
been rapidly expanding its renewable energy production in recent years. The data set includes country-wide
totals of electricity consumption, wind power production, and solar power production for 2006-2017. You can
download the data here.
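Electricity production and consumption are reported as daily totals in gigawatt-hours (GWh). The columns of
the data file used in this tutorial are:
Date: the date of each observation
Consumption: electricity consumption in GWh
Wind: wind power production in GWh
Solar: solar power production in GWh
We will explore how electricity consumption and production in Germany have varied over time, using questions
such as: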
How do wind and solar power production vary with seasons of the year?
What are the long-term trends in electricity consumption, solar power, and wind power?
How do wind and solar power production compare with electricity consumption, and how has this ratio
changed over time?
Time series data structures
In pandas, a single point in time is represented as a Timestamp. We can use the to_datetime() function to
create Timestamps from strings in a wide variety of date/time formats.
import pandas as pd
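A minimal sketch of the calls this passage describes. The first input string is an illustrative choice; the second is the ambiguous date discussed in the text.

```python
import pandas as pd

# An unambiguous date and time (illustrative input)
ts_full = pd.to_datetime('2018-01-15 3:45pm')
# An ambiguous date: pandas reads it as month/day/year by default
ts_ambiguous = pd.to_datetime('7/8/1952')

print(ts_full)
print(ts_ambiguous)
```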
to_datetime() automatically infers a date/time format based on the input. In the example above, the
ambiguous date '7/8/1952' is assumed to be month/day/year and is interpreted as July 8, 1952.
We can use the dayfirst parameter to tell pandas to interpret the date as August 7, 1952 instead.
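As a short sketch, the same string can be parsed with dayfirst=True. The variable name is chosen here for illustration.

```python
import pandas as pd

# dayfirst=True makes pandas read '7/8/1952' as day/month/year
ts_dayfirst = pd.to_datetime('7/8/1952', dayfirst=True)
print(ts_dayfirst)
```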
If we supply a list or array of strings as input to to_datetime(), it returns a sequence of date/time values in a
DatetimeIndex object, which is the core data structure that powers much of pandas time series functionality.
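A small sketch of passing a list of strings. The dates are made up for illustration and all share one format, since recent pandas versions infer the format from the first element and may reject mixed formats.

```python
import pandas as pd

# Illustrative date strings, all in ISO year-month-day format
dates = pd.to_datetime(['2018-01-05', '2018-02-05', '2018-03-05'])

print(type(dates).__name__)
print(dates)
```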
In the DatetimeIndex above, the data type datetime64[ns] indicates that the underlying data is stored as 64-
bit integers, in units of nanoseconds (ns). This data structure allows pandas to compactly store large
sequences of date/time values and efficiently perform vectorized operations using NumPy datetime64
arrays.
If we're dealing with a sequence of strings all in the same date/time format, we can explicitly specify it
with the format parameter.
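A minimal sketch of the format parameter, assuming strings in month/day/two-digit-year form. The strings themselves are illustrative.

```python
import pandas as pd

# Illustrative strings in month/day/two-digit-year form
dates = pd.to_datetime(['2/25/10', '8/6/17', '12/15/12'], format='%m/%d/%y')
print(dates)
```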
To work with time series data in pandas, we use a DatetimeIndex as the index for our DataFrame (or Series).
Let's see how to do this with our OPSD data set. First, we use the read_csv() function to read the data into a
DataFrame, and then display its shape.
opsd_daily = pd.read_csv('opsd_germany_daily.csv')
opsd_daily.shape
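Let's inspect the first and last rows with head() and tail() to see how the data looks, and then check the
data type of each column.
opsd_daily.dtypes
By default, read_csv() does not parse dates, so the Date column holds strings rather than date/time values.
We convert it with to_datetime().
opsd_daily['Date'] = pd.to_datetime(opsd_daily['Date'])
Now that the Date column is the correct data type, let's set it as the DataFrame's index.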
opsd_daily = opsd_daily.set_index('Date')
opsd_daily.index
Alternatively, we can consolidate the above steps into a single line, using the index_col and parse_dates
parameters of the read_csv() function. This is often a useful shortcut.
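The single-line version might look like the sketch below. To keep it runnable without the data file, it reads from an in-memory string; the column names follow this tutorial, but the numbers are made up.

```python
import io
import pandas as pd

# Hypothetical stand-in for opsd_germany_daily.csv; the values are made up
csv_text = """Date,Consumption,Wind,Solar
2017-01-01,1100.0,300.0,30.0
2017-01-02,1400.0,280.0,25.0
2017-01-03,1450.0,310.0,20.0
"""

# index_col=0 uses the first column (Date) as the index;
# parse_dates=True asks read_csv to parse that index as dates
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(demo.index)
```

With the real file, the equivalent call would be pd.read_csv('opsd_germany_daily.csv', index_col=0, parse_dates=True).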
Another useful aspect of the DatetimeIndex is that the individual date/time components are all available as
attributes such as year, month, day, and so on. Let's add a few more columns to opsd_daily, containing the
year, month, and weekday name.
opsd_daily['Year'] = opsd_daily.index.year
opsd_daily['Month'] = opsd_daily.index.month
opsd_daily['Weekday Name'] = opsd_daily.index.day_name()
# Display a random sampling of 5 rows
opsd_daily.sample(5, random_state=0)
Time-based indexing
One of the most powerful and convenient features of pandas time series is time-based indexing: using
dates and times to intuitively organize and access our data. With time-based indexing, we can use date/time
formatted strings to select data in our DataFrame with the loc accessor. The indexing works similarly to
standard label-based indexing with loc, but with a few additional features.
For example, we can select data for a single day using a string such as '2017-08-10'.
opsd_daily.loc['2017-08-10']
We can also select a slice of days, such as '2014-01-20':'2014-01-22'. As with regular label-based indexing
with loc, the slice is inclusive of both endpoints.
opsd_daily.loc['2014-01-20':'2014-01-22']
Another very handy feature of pandas time series is partial-string indexing, where we can select all
date/times which partially match a given string. For example, we can select the entire year 2006 with
opsd_daily.loc['2006'], or the entire month of February 2012 with opsd_daily.loc['2012-02'].
opsd_daily.loc['2006']
opsd_daily.loc['2012-02']
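The selections above can be tried without the data file. The sketch below builds a hypothetical daily DataFrame over the same 2006-2017 span, with placeholder values, and checks how many rows each kind of selection returns.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering 2006-2017; the values are just a running count
idx = pd.date_range('2006-01-01', '2017-12-31', freq='D')
demo = pd.DataFrame({'Consumption': np.arange(len(idx), dtype=float)}, index=idx)

one_day = demo.loc['2017-08-10']                  # a single row, returned as a Series
three_days = demo.loc['2014-01-20':'2014-01-22']  # slice includes both endpoints
year_2006 = demo.loc['2006']                      # partial-string match on the year
feb_2012 = demo.loc['2012-02']                    # partial-string match on the month

print(len(three_days), len(year_2006), len(feb_2012))  # 3 365 29
```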
Visualizing time series data
With pandas and matplotlib, we can easily visualize our time series data. In this section, we'll cover a few examples
and some useful customizations for our time series plots. First, let's import matplotlib and seaborn.
import matplotlib.pyplot as plt
import seaborn as sns
# Use seaborn style defaults and set the default figure size to 11 inches wide by 4 inches high
sns.set(rc={'figure.figsize':(11, 4)})
Let's create a line plot of the full time series of Germany's daily electricity consumption, using the DataFrame's plot()
method.
opsd_daily['Consumption'].plot(linewidth=0.5);
We can see that the plot() method has chosen pretty good tick locations (every two years) and labels (the
years) for the x-axis, which is helpful. However, with so many data points, the line plot is crowded and hard to
read. Let's plot the data as dots instead, and also look at the Solar and Wind time series.
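# One subplot per column, drawn as dots; the styling values are illustrative choices
axes = opsd_daily[['Consumption', 'Solar', 'Wind']].plot(marker='.', alpha=0.5, linestyle='None', figsize=(11, 9), subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')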
Seasonality can also occur on other time scales. The plot above suggests there may be some weekly
seasonality in Germany's electricity consumption, corresponding with weekdays and weekends. Let's plot the
time series in a single year to investigate further.
ax = opsd_daily.loc['2017', 'Consumption'].plot()
To better visualize the weekly seasonality in electricity consumption in the plot above, it would be nice to
have vertical gridlines on a weekly time scale (instead of on the first day of each month). We can customize
our plot with matplotlib.dates, so let's import that module.
import matplotlib.dates as mdates
Because date/time ticks are handled a bit differently in matplotlib.dates compared with the DataFrame's
plot() method, let's create the plot directly in matplotlib. Then we use mdates.WeekdayLocator() and
mdates.MONDAY to set the x-axis ticks to the Monday of each week. We also use
mdates.DateFormatter() to improve the formatting of the tick labels, using strftime-style format codes.
fig, ax = plt.subplots()
# Plot a two-month slice (an illustrative choice) so that the weekly ticks stay readable
ax.plot(opsd_daily.loc['2017-01':'2017-02', 'Consumption'], marker='o', linestyle='-')
# Set x-axis major ticks to a weekly interval, on Mondays
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MONDAY))
# Format x-tick labels as abbreviated month name and day number
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));
Now we have vertical gridlines and nicely formatted tick labels on each Monday, so we can easily tell which
days are weekdays and weekends.
Date formatter
We can use the DateFormatter class to customize the formatting of the tick labels on the x-axis. Here are
a few examples of its format codes:
%m displays the zero-padded month number (e.g., 01, 02, ..., 12)
%d displays the zero-padded day number (e.g., 01, 02, ..., 31)
%H displays the hour as a zero-padded decimal number (e.g., 00, 01, ..., 23)
%M displays the minute as a zero-padded decimal number (e.g., 00, 01, ..., 59)
%S displays the second as a zero-padded decimal number (e.g., 00, 01, ..., 59)
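DateFormatter takes strftime-style format strings, so one way to preview what a code produces is to call strftime() on an ordinary datetime object. The sketch below uses an arbitrary moment chosen for illustration; %Y (four-digit year) is not in the list above and is included only to build a full timestamp.

```python
from datetime import datetime

# An arbitrary moment chosen for illustration: 6 February 2017, 09:05:03
moment = datetime(2017, 2, 6, 9, 5, 3)

# Preview each code from the list above
for code in ['%m', '%d', '%H', '%M', '%S']:
    print(code, '->', moment.strftime(code))

# Codes can be combined with literal separators
combined = moment.strftime('%Y-%m-%d %H:%M:%S')
print(combined)  # 2017-02-06 09:05:03
```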
Seasonality
Next, let's further explore the seasonality of our data with box plots, using seaborn's boxplot() function to
group the data by different time periods and display the distributions for each group. We'll first group the
data by month, to visualize yearly seasonality.
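A box plot draws, for each group, the quartiles of the values in that group. As a rough numerical companion to the plots, the sketch below computes those quartiles directly with groupby(). It runs on made-up data, not the OPSD values: the winter peak and weekend dip are assumptions built into the synthetic series so that the grouping has something to show.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Consumption column: a winter peak plus a weekend dip
idx = pd.date_range('2016-01-01', '2017-12-31', freq='D')
day_of_year = idx.dayofyear.to_numpy()
seasonal = 150 * np.cos(2 * np.pi * (day_of_year - 1) / 365.25)
weekend_dip = np.where(idx.dayofweek.to_numpy() >= 5, -250, 0)
demo = pd.DataFrame({'Consumption': 1300 + seasonal + weekend_dip}, index=idx)
demo['Month'] = demo.index.month
demo['Weekday Name'] = demo.index.day_name()

# Quartiles per month: the numbers a box plot draws as its box and median line
by_month = demo.groupby('Month')['Consumption'].quantile([0.25, 0.5, 0.75]).unstack()
# Median per day of the week
by_weekday = demo.groupby('Weekday Name')['Consumption'].median()

print(by_month.round(1))
print(by_weekday.round(1))
```

With the real data, a call such as sns.boxplot(data=opsd_daily, x='Month', y='Consumption') would draw the corresponding boxes, assuming the Month column added earlier.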
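Grouping the consumption time series by day of the week in the same way reveals the weekly seasonality. As
expected, electricity consumption is significantly higher on weekdays than on weekends. The low outliers
on weekdays are presumably during holidays.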
This section has provided a brief introduction to time series seasonality. As we will see later, applying a rolling
window to the data can also help to visualize seasonality on different time scales.
Frequencies
When the data points of a time series are uniformly spaced in time (e.g., hourly, daily, monthly, etc.), the time
series can be associated with a frequency in pandas. For example, let's use the date_range() function to create
a sequence of uniformly spaced dates from 1998-03-10 through 1998-03-15 at daily frequency.
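pd.date_range('1998-03-10', '1998-03-15', freq='D')
OUTPUT:
DatetimeIndex(['1998-03-10', '1998-03-11', '1998-03-12', '1998-03-13',
               '1998-03-14', '1998-03-15'],
              dtype='datetime64[ns]', freq='D')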
The resulting DatetimeIndex has an attribute freq with a value of 'D', indicating daily frequency. Available
frequencies in pandas include hourly ('H'), calendar daily ('D'), business daily ('B'), weekly ('W'), monthly
('M'), quarterly ('Q'), annual ('A'), and many others. (Recent pandas releases, 2.2 and later, deprecate some
of these spellings in favor of 'h', 'ME', 'QE', and 'YE'.) Frequencies can also be specified as multiples of
any of the base frequencies, for example '5D' for every five days.
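As a brief sketch of a multiple of a base frequency, the call below asks for three dates spaced five days apart. The start date is reused from the text; the number of periods is an arbitrary choice.

```python
import pandas as pd

# Three dates spaced five days apart
every_five_days = pd.date_range('1998-03-10', periods=3, freq='5D')
print(every_five_days)
```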
Now let's take another look at the DatetimeIndex of our opsd_daily time series.
opsd_daily.index
We can see that it has no frequency (freq=None). This makes sense, since the index was created from a
sequence of dates in our CSV file, without explicitly specifying any frequency for the time series.
If we know that our data should be at a specific frequency, we can use the DataFrame's asfreq() method to
assign a frequency. If any date/times are missing in the data, new rows will be added for those date/times,
which are either empty (NaN), or filled according to a specified data filling method such as forward filling or
interpolation.
To see how this works, let's create a new DataFrame which contains only the Consumption data for Feb 3, 6,
and 8, 2013.
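times_sample = pd.to_datetime(['2013-02-03', '2013-02-06', '2013-02-08'])
# Select the specified dates and just the Consumption column
consum_sample = opsd_daily.loc[times_sample, ['Consumption']].copy()
consum_sample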
Now we use the asfreq() method to convert the DataFrame to daily frequency, with a column for unfilled
data, and a column for forward filled data.
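# Convert the data to daily frequency, without filling any missing values
consum_freq = consum_sample.asfreq('D')
# Add a second column with the missing values forward filled
consum_freq['Consumption - Forward Fill'] = consum_sample['Consumption'].asfreq('D', method='ffill')
consum_freq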
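Interpolation, mentioned earlier as another way to fill the new rows, is not one of the options of asfreq()'s method parameter. One approach, sketched below, is to call interpolate() on the result. The consumption values here are made up so the example runs without the data file.

```python
import pandas as pd

# Hypothetical consumption values for the three dates used in the text
times_sample = pd.to_datetime(['2013-02-03', '2013-02-06', '2013-02-08'])
demo = pd.DataFrame({'Consumption': [100.0, 130.0, 150.0]}, index=times_sample)

daily = demo.asfreq('D')      # the three missing days appear as NaN
filled = daily.interpolate()  # linear interpolation between the known values
print(filled)
```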