3 Steps to Time Series Forecasting: LSTM With TensorFlow Keras
A Practical Example in Python With Useful Tips
In this tutorial, we present a deep learning time series analysis example with Python. You'll see how to preprocess the time series data, transform it for supervised learning, and fit and evaluate an LSTM model.
If you want to analyze a large time series dataset with machine learning techniques, you'll love this guide with practical tips.
The dataset we are using is the Household Electric Power Consumption from Kaggle. It provides
measurements of electric power consumption in one household with a one-minute sampling rate.
There are 2,075,259 measurements gathered within 4 years. Different electrical quantities and some sub-metering values are available, but we'll only focus on three features: Date, Time, and Global_active_power.
# import packages
import math
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import Sequence
from datetime import timedelta
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import time
import os
# read the dataset into python
df = pd.read_csv('household_power_consumption.txt', delimiter=';')
df.head()
%%time
# This code is copied from https://towardsdatascience.com/time-series-analysis-visualization-forecasting-with-lstm-77a905180eba
# with a few minor changes.
#
df['date_time'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df['Global_active_power'] = pd.to_numeric(df['Global_active_power'], errors='coerce')
df = df.dropna(subset=['Global_active_power'])
df['date_time'] = pd.to_datetime(df['date_time'])
df = df.loc[:, ['date_time', 'Global_active_power']]
df.sort_values('date_time', inplace=True, ascending=True)
df = df.reset_index(drop=True)
print('Number of rows and columns after removing missing values:', df.shape)
print('The time series starts from: ', df['date_time'].min())
print('The time series ends on: ', df['date_time'].max())
df.info()
df.head(10)
Next, we split the dataset into training, validation, and test datasets.
df_test holds the last 7 days of the original dataset, df_val holds the 14 days of data before the test set, and df_train holds the rest of the data.
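The code for the split itself isn't included in this extract. A minimal sketch consistent with the description above, cutting on the date_time column (the cutoff variable names here are assumptions), could look like this:

# Hypothetical reconstruction of the split: last 7 days for test,
# the 14 days before that for validation, everything earlier for training.
test_cutoff_date = df['date_time'].max() - timedelta(days=7)
val_cutoff_date = test_cutoff_date - timedelta(days=14)

df_test = df[df['date_time'] > test_cutoff_date]
df_val = df[(df['date_time'] > val_cutoff_date) & (df['date_time'] <= test_cutoff_date)]
df_train = df[df['date_time'] <= val_cutoff_date]

print('Test dates:', df_test['date_time'].min(), 'to', df_test['date_time'].max())
print('Validation dates:', df_val['date_time'].min(), 'to', df_val['date_time'].max())
print('Train dates:', df_train['date_time'].min(), 'to', df_train['date_time'].max())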
Related article: Time Series Analysis, Visualization & Forecasting with LSTM
This article forecasted the Global_active_power only 1 minute ahead of historical data.
But practically, we want to forecast over a more extended period, which we’ll do in this article.
Before we can fit the TensorFlow Keras LSTM, there is still more data preparation to do.
As mentioned earlier, we want to forecast the Global_active_power that’s 10 minutes in the future.
The problem can be visualized as follows: use the lagged data (from t-n to t-1) to predict the target (t+10). The unit is minutes.
It is not efficient to loop through the dataset while training the model. So we transform the dataset so that each row contains the historical (lagged) values plus the target, again with minutes as the unit. In this way, we only need to train the model on each row of this new matrix.
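To make the row layout concrete, here is a toy sketch with made-up numbers and simplified column names (not the actual dataset or the function used later):

import numpy as np
import pandas as pd

# A toy series 0..19: look back 6 steps, keep every 2nd value, predict 3 steps ahead.
series = np.arange(20)
history_length, step_size, target_step = 6, 2, 3

rows = []
for t in range(history_length, len(series) - target_step):
    lags = series[t - history_length:t:step_size]         # values at t-6, t-4, t-2
    rows.append(list(lags) + [series[t + target_step]])   # target at t+3

toy = pd.DataFrame(rows, columns=['x_lag6', 'x_lag4', 'x_lag2', 'y'])
print(toy.head())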
We build this new dataset with a function, create_ts_files (defined below). Its parameters:
start_index: the earliest time to be included in all the historical data for forecasting.
In this practice, we want to include history from the very beginning, so we set its default to 0.
end_index: the latest time to be included in all the historical data for forecasting.
In this practice, we want to include all the history, so we set its default to None.
history_length: this is the n mentioned earlier, the number of timesteps to look back for each forecast.
step_size: the stride of the history window.
Global_active_power doesn't change quickly over time, so to be more efficient we can set step_size = 10. This downsamples the history to one value every 10 minutes, so we only look at t-1, t-11, t-21, ..., down to t-n to predict t+10.
target_step: the number of periods in the future to predict.
As mentioned earlier, we are trying to predict the Global_active_power 10 minutes ahead, so we set this parameter to 10.
num_rows_per_file: the number of records to put in each file.
This is necessary to divide the large new dataset into smaller files.
data_folder: the single folder that will contain all the files.
In the end, just know that this function creates a folder of files, and each file contains a pandas DataFrame laid out like the lagged matrix described above.
Each of these dataframes has columns:
y, which is the target to predict. This will be the value at t + target_step (t + 10).
x_lag{i}, the value at time t + target_step - i (t + 10 - 11, t + 10 - 21, and so on), i.e., the values lagged relative to y.
At the same time, the function also returns the number of lags (len(col_names)-1) in the dataframes. This
number will be required when defining the shape for TensorFlow models later.
# Defaults match the values used in the call further below.
def create_ts_files(dataset,
                    start_index=0,
                    end_index=None,
                    history_length=7*24*60,
                    step_size=10,
                    target_step=10,
                    num_rows_per_file=128*100,
                    data_folder='ts_data'):
    # create the folder that will hold the pickled dataframes.
    os.makedirs(data_folder)

    time_lags = sorted(range(target_step+1, target_step+history_length+1, step_size), reverse=True)
    col_names = [f'x_lag{i}' for i in time_lags] + ['y']
    start_index = start_index + history_length
    if end_index is None:
        end_index = len(dataset) - target_step

    rng = range(start_index, end_index)
    num_rows = len(rng)
    num_files = math.ceil(num_rows/num_rows_per_file)

    # for each file.
    print(f'Creating {num_files} files.')
    for i in range(num_files):
        filename = f'{data_folder}/ts_file{i}.pkl'

        if i % 10 == 0:
            print(f'{filename}')

        # get the start and end indices.
        ind0 = i*num_rows_per_file + start_index
        ind1 = min(ind0 + num_rows_per_file, end_index)
        data_list = []

        # j is the current timestep. We need j-n to j-1 for the history, and j + target_step for the target.
        for j in range(ind0, ind1):
            indices = range(j-1, j-history_length-1, -step_size)
            data = dataset[sorted(indices) + [j+target_step]]

            # append data to the list.
            data_list.append(data)

        df_ts = pd.DataFrame(data=data_list, columns=col_names)
        df_ts.to_pickle(filename)

    return len(col_names)-1
Calling this function with the parameters below will create 158 files (each containing a pandas DataFrame) within the folder ts_data, and return the number of lags as num_timesteps.
%%time

global_active_power = df_train['Global_active_power'].values

# Scaled to work with Neural networks.
scaler = MinMaxScaler(feature_range=(0, 1))
global_active_power_scaled = scaler.fit_transform(global_active_power.reshape(-1, 1)).reshape(-1, )

history_length = 7*24*60  # The history length in minutes.
step_size = 10  # The sampling rate of the history. Eg. If step_size = 1, then values from every minute will be in the history.
                # If step_size = 10, then values every 10 minutes will be in the history.
target_step = 10  # The time step in the future to predict. Eg. If target_step = 0, then predict the next timestep after the end of the history period.
                  # If target_step = 10, then predict the 10th timestep after the next one (11 minutes after the end of the history).

# The create_ts_files call returns the number of lag features. We need this value below.
num_timesteps = create_ts_files(global_active_power_scaled,
                                start_index=0,
                                end_index=None,
                                history_length=history_length,
                                step_size=step_size,
                                target_step=target_step,
                                num_rows_per_file=128*100,
                                data_folder='ts_data')

# I found that the easiest way to do time series with tensorflow is by creating pandas files with the
# lagged time steps (eg. x{t-1}, x{t-2}...) and the value to predict y = x{t+n}. We tried doing it using
# TFRecords, but that API is not very intuitive and lacks working examples for time series.
# The resulting file using these parameters is over 17GB. If history_length is increased, or step_size
# is decreased, it could get much bigger.
# Hard to fit into laptop memory, so need to use other means to load the data from the hard drive.
The folder ts_data is around 16 GB, and we were only using the past 7 days of data to predict. Now you can
see why it’s necessary to divide the dataset into smaller dataframes!
In this procedure, we create a class TimeSeriesLoader to transform and feed the dataframes into the model.
There are built-in options in Keras, such as the Sequence class and the tf.data API, but they are not very efficient for this purpose.
The definitions might seem a little confusing, but keep reading; you'll see this object in action in the next step.
#
# So we can handle loading the data in chunks from the hard drive instead of having to load everything into memory.
#
# The reason we want to do this is so we can do custom processing on the data that we are feeding into the LSTM.
# LSTM requires a certain shape and it is tricky to get it right.
#
class TimeSeriesLoader:
    def __init__(self, ts_folder, filename_format):
        self.ts_folder = ts_folder
        # keep the filename pattern so get_chunk doesn't have to rely on a global variable.
        self.filename_format = filename_format

        # find the number of files.
        i = 0
        file_found = True
        while file_found:
            filename = self.ts_folder + '/' + self.filename_format.format(i)
            file_found = os.path.exists(filename)
            if file_found:
                i += 1

        self.num_files = i
        self.files_indices = np.arange(self.num_files)
        self.shuffle_chunks()

    def num_chunks(self):
        return self.num_files

    def get_chunk(self, idx):
        assert (idx >= 0) and (idx < self.num_files)

        ind = self.files_indices[idx]
        filename = self.ts_folder + '/' + self.filename_format.format(ind)
        df_ts = pd.read_pickle(filename)
        num_records = len(df_ts.index)

        features = df_ts.drop('y', axis=1).values
        target = df_ts['y'].values

        # reshape for input into LSTM. Batch major format.
        features_batchmajor = np.array(features).reshape(num_records, -1, 1)
        return features_batchmajor, target

    # this shuffles the order the chunks will be outputted from get_chunk.
    def shuffle_chunks(self):
        np.random.shuffle(self.files_indices)
ts_folder = 'ts_data'
filename_format = 'ts_file{}.pkl'
tss = TimeSeriesLoader(ts_folder, filename_format)
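As a quick sanity check (an optional step, not part of the original flow), you can pull the first chunk and confirm the batch-major shape the LSTM will expect:

# Each chunk comes back as (num_records, num_timesteps, 1) features and (num_records,) targets.
X_check, y_check = tss.get_chunk(0)
print(X_check.shape, y_check.shape)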
Now that the object tss points to our dataset, we are finally ready to build the LSTM!
LSTM networks are well-suited to classifying, processing and making predictions based on time
series data, since there can be lags of unknown duration between important events in a time
series.
Wikipedia
As mentioned before, we are going to build an LSTM model based on the TensorFlow Keras library.
We all know the importance of hyperparameter tuning from our related guide. But in this article, we are simply demonstrating model fitting without tuning.
Then we also define the optimization function and the loss function. Again, tuning these hyperparameters to
find the best option would be a better practice.
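The model definition itself is missing from this extract. Below is a minimal sketch consistent with the summary discussed afterwards: an LSTM layer on inputs of shape (num_timesteps, 1) followed by a single linear output. The 10 units, the dropout layer, and the output layer are assumptions, not necessarily the author's exact architecture.

# A hedged reconstruction of the model, not necessarily the original architecture.
ts_inputs = tf.keras.Input(shape=(num_timesteps, 1))

# 10 units is an assumption, chosen so the LSTM layer's trainable-parameter count
# matches the 480 quoted below (4 * units * (units + 2) with one input feature).
x = layers.LSTM(units=10)(ts_inputs)
x = layers.Dropout(0.20)(x)
outputs = layers.Dense(1, activation='linear')(x)

model = tf.keras.Model(inputs=ts_inputs, outputs=outputs)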
# Specify the training configuration.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss=tf.keras.losses.MeanSquaredError(),
              metrics=['mse'])
To take a look at the model we just defined before running, we can print out the summary.
model.summary()
You can see that the input time dimension looks right: history_length / step_size = 7*24*60 / 10 = 1008 timesteps. The number of parameters to be trained in the LSTM layer looks right as well: 4*units*(units+2) = 480.
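To make that arithmetic explicit (10 units and a single input feature are assumptions, chosen because they reproduce the 480 figure), a quick check:

# LSTM trainable parameters = 4 gates * (input weights + recurrent weights + biases).
units, input_dim = 10, 1  # assumed values, consistent with the 480 quoted above
lstm_params = 4 * (input_dim * units + units * units + units)
print(lstm_params)  # 480, i.e. 4 * units * (units + 2) when input_dim == 1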
We train each chunk in batches, and only run for one epoch. Ideally, you would train for multiple epochs for
neural networks.
%%time

# train in batch sizes of 128.
BATCH_SIZE = 128
NUM_EPOCHS = 1
NUM_CHUNKS = tss.num_chunks()

for epoch in range(NUM_EPOCHS):
    print('epoch #{}'.format(epoch))
    for i in range(NUM_CHUNKS):
        X, y = tss.get_chunk(i)

        # model.fit does train the model incrementally. ie. Can call multiple times in batches.
        # https://github.com/keras-team/keras/issues/4446
        model.fit(x=X, y=y, batch_size=BATCH_SIZE)

    # shuffle the chunks so they're not in the same order next time around.
    tss.shuffle_chunks()
After fitting the model, we can also evaluate its performance on the validation dataset.
As with the training dataset, we create a folder of validation data files, which prepares the validation dataset in the same lagged format.
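The scaling of the validation series isn't shown in this extract. Assuming it reuses the MinMaxScaler fitted on the training data (so no validation statistics leak into the transform), a sketch:

# Reuse the scaler fitted on df_train; transform (not fit_transform) the validation values.
global_active_power_val = df_val['Global_active_power'].values
global_active_power_val_scaled = scaler.transform(global_active_power_val.reshape(-1, 1)).reshape(-1, )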
num_timesteps = create_ts_files(global_active_power_val_scaled,
                                start_index=0,
                                end_index=None,
                                history_length=history_length,
                                step_size=step_size,
                                target_step=target_step,
                                num_rows_per_file=128*100,
                                data_folder='ts_val_data')
Besides testing on the validation dataset, we also compare against a baseline model that uses only the most recent history point (the value at t + 10 - 11 = t - 1, i.e., the x_lag11 column).
# If we assume that the validation dataset can fit into memory we can do this.
df_val_ts = pd.read_pickle('ts_val_data/ts_file0.pkl')

features = df_val_ts.drop('y', axis=1).values
features_arr = np.array(features)

# reshape for input into LSTM. Batch major format.
num_records = len(df_val_ts.index)
features_batchmajor = features_arr.reshape(num_records, -1, 1)

y_pred = model.predict(features_batchmajor).reshape(-1, )
y_pred = scaler.inverse_transform(y_pred.reshape(-1, 1)).reshape(-1, )

y_act = df_val_ts['y'].values
y_act = scaler.inverse_transform(y_act.reshape(-1, 1)).reshape(-1, )

print('validation mean squared error: {}'.format(mean_squared_error(y_act, y_pred)))

# baseline
y_pred_baseline = df_val_ts['x_lag11'].values
y_pred_baseline = scaler.inverse_transform(y_pred_baseline.reshape(-1, 1)).reshape(-1, )

print('validation baseline mean squared error: {}'.format(mean_squared_error(y_act, y_pred_baseline)))
On the validation dataset, the LSTM gives a Mean Squared Error (MSE) of 0.418, while the baseline model has an MSE of 0.428. The LSTM does slightly better than the baseline (about 2% lower MSE).
We could do better with hyperparameter tuning and more epochs. Plus, applying other essential time series analysis techniques, such as accounting for seasonality, would help too.
Related article: Hyperparameter Tuning with Python: Complete Step-by-Step Guide
Hope you found something useful in this guide. Leave a comment if you have any questions.
Before you leave, don't forget to sign up for the Just into Data newsletter! Or connect with us on Twitter or Facebook, so you won't miss any new data science articles from us!