Time Series Analysis With Python

This document discusses analyzing time series data from Germany's electricity consumption, wind power, and solar power production from 2006-2017 using pandas. It provides an overview of pandas time series data structures like DatetimeIndex and how to create a time series DataFrame from the data. It also demonstrates various time-based indexing and visualization techniques for exploring patterns in the data like seasonality and trends over time.

Uploaded by

Chit Surela
Copyright
© All Rights Reserved

Time Series

This tutorial uses a daily time series from Open Power System Data (OPSD) for Germany, which has been
rapidly expanding its renewable energy production in recent years. The data set includes country-wide totals
of electricity consumption, wind power production, and solar power production for 2006-2017. You can
download the data here.

Electricity production and consumption are reported as daily totals in gigawatt-hours (GWh). The columns of
the data file are:

Date — The date (yyyy-mm-dd format)

Consumption — Electricity consumption in GWh

Wind — Wind power production in GWh

Solar — Solar power production in GWh

Wind+Solar — Sum of wind and solar power production in GWh


We will explore how electricity consumption and production in Germany have varied over time, using pandas
time series tools to answer questions such as:

When is electricity consumption typically highest and lowest?

How do wind and solar power production vary with seasons of the year?

What are the long-term trends in electricity consumption, solar power, and wind power?

How do wind and solar power production compare with electricity consumption, and how has this ratio
changed over time?

Time series data structures

In pandas, a single point in time is represented as a Timestamp. We can use the to_datetime() function to
create Timestamps from strings in a wide variety of date/time formats.

import pandas as pd

pd.to_datetime('2018-01-15 3:45pm') #Timestamp('2018-01-15 15:45:00')

pd.to_datetime('7/8/1952') #Timestamp('1952-07-08 00:00:00')

to_datetime() automatically infers a date/time format based on the input. In the example above, the
ambiguous date '7/8/1952' is assumed to be month/day/year and is interpreted as July 8, 1952.
We can use the dayfirst parameter to tell pandas to interpret the date as August 7, 1952 instead.

pd.to_datetime('7/8/1952', dayfirst=True) #Timestamp('1952-08-07 00:00:00')
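As a quick check of the behaviour described above, the following sketch parses the same ambiguous string both ways. It relies only on to_datetime() and its dayfirst parameter; the variable names are illustrative.

```python
import pandas as pd

# Default: an ambiguous date is read as month/day/year
month_first = pd.to_datetime('7/8/1952')

# dayfirst=True: the same string is read as day/month/year
day_first = pd.to_datetime('7/8/1952', dayfirst=True)

print(month_first.month, month_first.day)  # 7 8
print(day_first.month, day_first.day)      # 8 7
```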

If we supply a list or array of strings as input to to_datetime(), it returns a sequence of date/time values in a
DatetimeIndex object, which is the core data structure that powers much of pandas time series functionality.

# Note: in pandas 2.0 and later, a list that mixes formats also needs format='mixed'

pd.to_datetime(['2018-01-05', '7/8/1952', 'Oct 10, 1995'])

OUTPUT : DatetimeIndex(['2018-01-05', '1952-07-08', '1995-10-10'], dtype='datetime64[ns]', freq=None)

In the DatetimeIndex above, the data type datetime64[ns] indicates that the underlying data is stored as
64-bit integers, in units of nanoseconds (ns). This data structure allows pandas to compactly store large
sequences of date/time values and efficiently perform vectorized operations using NumPy datetime64
arrays.

If we're dealing with a sequence of strings all in the same date/time format, we can explicitly specify it
with the format parameter.

pd.to_datetime(['2/25/10', '8/6/17', '12/15/12'], format='%m/%d/%y')

OUTPUT : DatetimeIndex(['2010-02-25', '2017-08-06', '2012-12-15'], dtype='datetime64[ns]', freq=None)
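When a format is given explicitly, strings that do not match it are rejected rather than guessed at. The sketch below assumes a made-up list with one malformed entry and uses the errors='coerce' option of to_datetime() to turn that entry into NaT (not a time) instead of raising an error.

```python
import pandas as pd

# Hypothetical input: the last entry does not follow the %m/%d/%y format
raw_dates = ['2/25/10', '8/6/17', 'unknown']

# errors='coerce' replaces unparseable entries with NaT instead of raising
parsed = pd.to_datetime(raw_dates, format='%m/%d/%y', errors='coerce')

print(parsed)
print(parsed.isna())  # marks which entries could not be parsed
```

Whether to coerce or to let the error surface depends on the data: coercing is convenient for exploration, while an exception is often safer in a pipeline.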


Creating a time series DataFrame

To work with time series data in pandas, we use a DatetimeIndex as the index for our DataFrame (or Series).
Let's see how to do this with our OPSD data set. First, we use the read_csv() function to read the data into a
DataFrame, and then display its shape.

opsd_daily = pd.read_csv('opsd_germany_daily.csv')

opsd_daily.shape

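Let's inspect the first and last rows with head() and tail(), and check the data type of each column.

opsd_daily.head(3)

opsd_daily.tail(3)

opsd_daily.dtypes

The Date column is read in as strings rather than date/time values, so we first convert it with to_datetime().

opsd_daily['Date'] = pd.to_datetime(opsd_daily['Date'])

Now that the Date column is the correct data type, let's set it as the DataFrame's index.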

opsd_daily = opsd_daily.set_index('Date')

opsd_daily.index

Alternatively, we can consolidate the above steps into a single line, using the index_col and parse_dates
parameters of the read_csv() function. This is often a useful shortcut.

opsd_daily = pd.read_csv('opsd_germany_daily.csv', index_col=0, parse_dates=True)
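To try the one-line version without the real file, the sketch below reads a few rows from an in-memory string. The rows are made up and only mimic the column layout listed earlier; io.StringIO stands in for the CSV file.

```python
import io
import pandas as pd

# Made-up rows in the same column layout as the OPSD file (values are illustrative only)
csv_text = """Date,Consumption,Wind,Solar,Wind+Solar
2017-01-01,1100.0,300.0,30.0,330.0
2017-01-02,1400.0,250.0,20.0,270.0
2017-01-03,1450.0,280.0,25.0,305.0
"""

# index_col=0 uses the first column as the index; parse_dates=True parses it as dates
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(sample.index).__name__)  # DatetimeIndex
print(sample.index.name)            # Date
```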


Now that our DataFrame's index is a DatetimeIndex, we can use all of pandas' powerful time-based indexing
to wrangle and analyze our data, as we shall see in the following sections.

Another useful aspect of the DatetimeIndex is that the individual date/time components are all available as
attributes such as year, month, day, and so on. Let's add a few more columns to opsd_daily, containing the
year, month, and weekday name.

# Add columns with year, month, and weekday name

opsd_daily['Year'] = opsd_daily.index.year

opsd_daily['Month'] = opsd_daily.index.month

opsd_daily['Weekday Name'] = opsd_daily.index.day_name()

# Display a random sampling of 5 rows

opsd_daily.sample(5, random_state=0)
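The same idea can be tried on a small synthetic index. The frame below is a stand-in for opsd_daily with placeholder values. In current pandas the weekday name comes from the day_name() method; the older weekday_name attribute was removed in pandas 1.0.

```python
import pandas as pd

# Synthetic daily index standing in for the OPSD dates; values are placeholders
idx = pd.date_range('2017-01-01', periods=7, freq='D')
demo = pd.DataFrame({'Consumption': range(7)}, index=idx)

demo['Year'] = demo.index.year
demo['Month'] = demo.index.month
# day_name() is a method call, unlike the year and month attributes
demo['Weekday Name'] = demo.index.day_name()

print(demo)
```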

Time-based indexing

One of the most powerful and convenient features of pandas time series is time-based indexing: using
dates and times to intuitively organize and access our data. With time-based indexing, we can use date/time
formatted strings to select data in our DataFrame with the loc accessor. The indexing works similarly to
standard label-based indexing with loc, but with a few additional features.

For example, we can select data for a single day using a string such as '2017-08-10'.

opsd_daily.loc['2017-08-10']

We can also select a slice of days, such as '2014-01-20':'2014-01-22'. As with regular label-based indexing
with loc, the slice is inclusive of both endpoints.

opsd_daily.loc['2014-01-20':'2014-01-22']

Another very handy feature of pandas time series is partial-string indexing, where we can select all
date/times which partially match a given string. For example, we can select the entire year 2006 with
opsd_daily.loc['2006'], or the entire month of February 2012 with opsd_daily.loc['2012-02'].

opsd_daily.loc['2006']

opsd_daily.loc['2012-02']
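The three selection styles can be compared side by side on a synthetic frame. The sketch below uses a made-up daily series for 2016-2017 in place of opsd_daily, so the row counts can be checked against the calendar.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for opsd_daily: one made-up value per day for 2016-2017
idx = pd.date_range('2016-01-01', '2017-12-31', freq='D')
demo = pd.DataFrame({'Consumption': np.arange(len(idx), dtype=float)}, index=idx)

one_day = demo.loc['2017-08-10']                  # a single row, returned as a Series
three_days = demo.loc['2017-01-20':'2017-01-22']  # slice includes both endpoints
leap_year = demo.loc['2016']                      # partial string: the whole year
february = demo.loc['2017-02']                    # partial string: one month

print(len(three_days), len(leap_year), len(february))  # 3 366 28
```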

Visualizing time series data

With pandas and matplotlib, we can easily visualize our time series data. In this section, we'll cover a few examples
and some useful customizations for our time series plots. First, let's import matplotlib and seaborn.

import matplotlib.pyplot as plt

import seaborn as sns

# Use seaborn style defaults and set the default figure size to 11 inches wide by 4 inches high

sns.set(rc={'figure.figsize':(11, 4)})

Let's create a line plot of the full time series of Germany's daily electricity consumption, using the plot()
method of the Consumption column.

opsd_daily['Consumption'].plot(linewidth=0.5);

We can see that the plot() method has chosen pretty good tick locations (every two years) and labels (the
years) for the x-axis, which is helpful. However, with so many data points, the line plot is crowded and hard to
read. Let's plot the data as dots instead, and also look at the Solar and Wind time series.

cols_plot = ['Consumption', 'Solar', 'Wind']

axes = opsd_daily[cols_plot].plot(marker='.', alpha=0.5, linestyle='None', figsize=(11, 9), subplots=True)

for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')

Seasonality can also occur on other time scales. The plot above suggests there may be some weekly
seasonality in Germany's electricity consumption, corresponding with weekdays and weekends. Let's plot the
time series in a single year to investigate further.

ax = opsd_daily.loc['2017', 'Consumption'].plot()

ax.set_ylabel('Daily Consumption (GWh)');


Now we can clearly see the weekly oscillations. Another interesting feature that becomes apparent at this
level of granularity is the drastic decrease in electricity consumption in early January and late December,
during the holidays.

Let's zoom in further and look at just January and February.

ax = opsd_daily.loc['2017-01':'2017-02', 'Consumption'].plot(marker='o', linestyle='-')

ax.set_ylabel('Daily Consumption (GWh)');


Customizing time series plots

To better visualize the weekly seasonality in electricity consumption in the plot above, it would be nice to
have vertical gridlines on a weekly time scale (instead of on the first day of each month). We can customize
our plot with matplotlib.dates, so let's import that module.

import matplotlib.dates as mdates

Because date/time ticks are handled a bit differently in matplotlib.dates compared with the DataFrame's
plot() method, let's create the plot directly in matplotlib. Then we use mdates.WeekdayLocator() and
mdates.MONDAY to set the x-axis ticks to the Monday of each week. We also use
mdates.DateFormatter() to improve the formatting of the tick labels, using the format codes we saw earlier.

fig, ax = plt.subplots()

ax.plot(opsd_daily.loc['2017-01':'2017-02', 'Consumption'], marker='o', linestyle='-')

ax.set_ylabel('Daily Consumption (GWh)')

ax.set_title('Jan-Feb 2017 Electricity Consumption')

# Set x-axis major ticks to weekly interval, on Mondays

ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MONDAY))

# Format x-tick labels as 3-letter month name and day number

ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));

Now we have vertical gridlines and nicely formatted tick labels on each Monday, so we can easily tell which
days are weekdays and weekends.

Date Formatter

We can use the DateFormatter class to customize the formatting of the tick labels on the x-axis. Here are
a few examples:

%b displays the abbreviated month name (e.g., Jan, Feb)

%Y displays the full year (e.g., 2021)

%m displays the zero-padded month number (e.g., 01, 02, ..., 12)

%d displays the zero-padded day number (e.g., 01, 02, ..., 31)

%H displays the hour as a zero-padded decimal number (e.g., 00, 01, ..., 23)

%M displays the minute as a zero-padded decimal number (e.g., 00, 01, ..., 59)

%S displays the second as a zero-padded decimal number (e.g., 00, 01, ..., 59)
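DateFormatter takes a strftime-style format string, so the same codes can be tried out directly on a pandas Timestamp. The timestamp below is arbitrary, and the month abbreviation shown assumes an English (default) locale.

```python
import pandas as pd

ts = pd.Timestamp('2017-01-09 14:05:09')  # arbitrary example timestamp

print(ts.strftime('%b %d'))     # Jan 09
print(ts.strftime('%Y-%m-%d'))  # 2017-01-09
print(ts.strftime('%H:%M:%S'))  # 14:05:09
```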

Seasonality

Next, let's further explore the seasonality of our data with box plots, using seaborn's boxplot() function to
group the data by different time periods and display the distributions for each group. We'll first group the
data by month, to visualize yearly seasonality.

fig, axes = plt.subplots(3, 1, figsize=(11, 10))

for name, ax in zip(['Consumption', 'Solar', 'Wind'], axes):
    sns.boxplot(data=opsd_daily, x='Month', y=name, ax=ax)
    ax.set_ylabel('GWh')
    ax.set_title(name)
    # Remove the automatic x-axis label from all but the bottom subplot
    if ax != axes[-1]:
        ax.set_xlabel('')

Next, let's group the electricity consumption time series by day of the week, to explore weekly seasonality.

sns.boxplot(data=opsd_daily, x='Weekday Name', y='Consumption')

As expected, electricity consumption is significantly higher on weekdays than on weekends. The low outliers
on weekdays are presumably during holidays.

This section has provided a brief introduction to time series seasonality. As we will see later, applying a rolling
window to the data can also help to visualize seasonality on different time scales.
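The grouping behind the weekday box plot can also be inspected numerically. The sketch below is a rough companion, not part of the original analysis: it builds a synthetic series (not the OPSD data) in which weekdays are given a higher made-up value than weekends, then summarizes it per weekday with groupby().

```python
import numpy as np
import pandas as pd

# Synthetic data: 4 full weeks starting on a Monday; values are made up
idx = pd.date_range('2017-01-02', periods=28, freq='D')
is_weekend = idx.dayofweek >= 5  # Saturday=5, Sunday=6
demo = pd.DataFrame({'Consumption': np.where(is_weekend, 1200.0, 1500.0)}, index=idx)
demo['Weekday Name'] = demo.index.day_name()

# Median per weekday: the value drawn as the centre line of each box in a box plot
weekday_median = demo.groupby('Weekday Name')['Consumption'].median()
print(weekday_median)
```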

Frequencies

When the data points of a time series are uniformly spaced in time (e.g., hourly, daily, monthly, etc.), the time
series can be associated with a frequency in pandas. For example, let's use the date_range() function to create
a sequence of uniformly spaced dates from 1998-03-10 through 1998-03-15 at daily frequency.

pd.date_range('1998-03-10', '1998-03-15', freq='D')

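OUTPUT :

DatetimeIndex(['1998-03-10', '1998-03-11', '1998-03-12', '1998-03-13',
               '1998-03-14', '1998-03-15'],
              dtype='datetime64[ns]', freq='D')

The resulting DatetimeIndex has an attribute freq with a value of 'D', indicating daily frequency. Available
frequencies in pandas include hourly ('H'), calendar daily ('D'), business daily ('B'), weekly ('W'), monthly
('M'), quarterly ('Q'), annual ('A'), and many others. (pandas 2.2 and later deprecate some of these aliases
in favor of 'h', 'ME', 'QE', and 'YE'.) Frequencies can also be specified as multiples of any of
the base frequencies, for example '5D' for every five days.

As another example, let's create a date range at hourly frequency, specifying the start date and the number
of periods instead of the end date.

pd.date_range('2004-09-20', periods=8, freq='H')

Now let's take another look at the DatetimeIndex of our opsd_daily time series.

opsd_daily.index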

We can see that it has no frequency (freq=None). This makes sense, since the index was created from a
sequence of dates in our CSV file, without explicitly specifying any frequency for the time series.
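A short sketch of both points, using arbitrary dates: a range built with a multiple of a base frequency, and an index built from a list of dates, which carries no frequency even when the spacing is regular. pd.infer_freq() is shown as one way to check what frequency the spacing would imply.

```python
import pandas as pd

# A multiple of a base frequency: every five days
every_five_days = pd.date_range('1998-03-10', periods=3, freq='5D')
print(every_five_days)

# An index built from a list of dates carries no frequency...
dates = pd.to_datetime(['2013-02-03', '2013-02-04', '2013-02-05'])
print(dates.freq)            # None

# ...although pandas can infer one when the spacing is regular
print(pd.infer_freq(dates))  # D
```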

If we know that our data should be at a specific frequency, we can use the DataFrame's asfreq() method to
assign a frequency. If any date/times are missing in the data, new rows will be added for those date/times,
which are either empty (NaN), or filled according to a specified fill method such as forward or backward
filling.

To see how this works, let's create a new DataFrame which contains only the Consumption data for Feb 3, 6,
and 8, 2013.

times_sample = pd.to_datetime(['2013-02-03', '2013-02-06', '2013-02-08'])

# Select the specified dates and just the Consumption column

consum_sample = opsd_daily.loc[times_sample, ['Consumption']].copy()

consum_sample

Now we use the asfreq() method to convert the DataFrame to daily frequency, with a column for unfilled
data, and a column for forward filled data.

# Convert the data to daily frequency, without filling any missings

consum_freq = consum_sample.asfreq('D')

# Create a column with missings forward filled

consum_freq['Consumption - Forward Fill'] = consum_sample.asfreq('D', method='ffill')

consum_freq
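To see the effect without the OPSD file, the sketch below repeats the steps on a three-row frame with made-up consumption values for the same dates. Only the dates come from the text; the numbers are placeholders.

```python
import pandas as pd

# Made-up consumption values for three non-consecutive days
times_sample = pd.to_datetime(['2013-02-03', '2013-02-06', '2013-02-08'])
consum_sample = pd.DataFrame({'Consumption': [1100.0, 1450.0, 1430.0]}, index=times_sample)

# Daily frequency without filling: the missing days appear as NaN
consum_freq = consum_sample.asfreq('D')

# Forward fill: each missing day repeats the last observed value
consum_freq['Consumption - Forward Fill'] = consum_sample['Consumption'].asfreq('D', method='ffill')

print(consum_freq)
```

The result has six rows (Feb 3 through Feb 8): the unfilled column has NaN on Feb 4, 5, and 7, while the forward-filled column carries the previous day's value into those gaps.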
