
www.justintodata.com/forecast-time-series-lstm-with-tensorflow-keras/

3 Steps to Time Series Forecasting: LSTM with TensorFlow Keras - A Practical Example in Python with Useful Tips

22/3/2020

In this tutorial, we present a deep learning time series analysis example with Python. You’ll see:

How to preprocess/transform the dataset for time series forecasting.
How to handle large time series datasets when we have limited computer memory.
How to fit a Long Short-Term Memory (LSTM) neural network model with TensorFlow Keras.
And more.

If you want to analyze large time series datasets with machine learning techniques, you’ll love this guide with
practical tips.

Let’s begin now!

The dataset we are using is the Household Electric Power Consumption from Kaggle. It provides
measurements of electric power consumption in one household with a one-minute sampling rate.

There are 2,075,259 measurements gathered within 4 years. Different electrical quantities and some sub-metering values are available. But we’ll only focus on three features:

Date: date in format dd/mm/yyyy


Time: time in format hh:mm:ss
Global_active_power: household global minute-averaged active power (in kilowatt)

In this project, we will predict the amount of Global_active_power 10 minutes ahead.

# import packages

import math
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import Sequence
from datetime import timedelta
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import time
import os
# read the dataset into python
df = pd.read_csv('household_power_consumption.txt', delimiter=';')
df.head()

Step #1: Preprocessing the Dataset for Time Series Analysis


To begin, let’s process the dataset to get ready for time series analysis.

We transform the dataset df by:

creating feature date_time in DateTime format by combining Date and Time.


converting Global_active_power to numeric and removing missing values (1.25%).
ordering the records by time in the new dataset.

%%time
# This code is copied from https://towardsdatascience.com/time-series-analysis-visualization-forecasting-with-lstm-77a905180eba
# with a few minor changes.
#
df['date_time'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df['Global_active_power'] = pd.to_numeric(df['Global_active_power'], errors='coerce')
df = df.dropna(subset=['Global_active_power'])

df['date_time'] = pd.to_datetime(df['date_time'])
df = df.loc[:, ['date_time', 'Global_active_power']]
df.sort_values('date_time', inplace=True, ascending=True)
df = df.reset_index(drop=True)
print('Number of rows and columns after removing missing values:', df.shape)
print('The time series starts from: ', df['date_time'].min())
print('The time series ends on: ', df['date_time'].max())

Now we have a dataset df as below.

df.info()
df.head(10)

Next, we split the dataset into training, validation, and test datasets.

df_test holds the data within the last 7 days of the original dataset. df_val has the 14 days of data before the test
dataset. df_train has the rest of the data.

# Split into training, validation and test datasets.
# Since it's timeseries we should do it by date.
test_cutoff_date = df['date_time'].max() - timedelta(days=7)
val_cutoff_date = test_cutoff_date - timedelta(days=14)
df_test = df[df['date_time'] > test_cutoff_date]
df_val = df[(df['date_time'] > val_cutoff_date) & (df['date_time'] <= test_cutoff_date)]
df_train = df[df['date_time'] <= val_cutoff_date]
#check out the datasets
print('Test dates: {} to {}'.format(df_test['date_time'].min(), df_test['date_time'].max()))
print('Validation dates: {} to {}'.format(df_val['date_time'].min(), df_val['date_time'].max()))
print('Train dates: {} to {}'.format(df_train['date_time'].min(), df_train['date_time'].max()))

Related article: Time Series Analysis, Visualization & Forecasting with LSTM
That article forecasted the Global_active_power only 1 minute ahead of the historical data.
But practically, we want to forecast over a more extended period, which we’ll do in this article.

Step #2: Transforming the Dataset for TensorFlow Keras

Before we can fit the TensorFlow Keras LSTM model, there is still more preprocessing to be done.

Let’s deal with them little by little!

Dividing the Dataset into Smaller Dataframes

As mentioned earlier, we want to forecast the Global_active_power that’s 10 minutes in the future.

The graph below visualizes the problem: using the lagged data (from t-n to t-1) to predict the target (t+10).

Unit: minutes

It is not efficient to loop through the dataset while training the model. So we want to transform the dataset
so that each row contains the historical data and the target.

Unit: minutes

In this way, we only need to train the model using each row of the above matrix.
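
To make the idea concrete, here is a tiny toy sketch (not the article’s code; the real transformation is done by create_ts_files later) of turning a short series into rows of lagged inputs plus a future target, using the same lag-naming convention as the article (lags are counted relative to the target):

# Toy example: 3 historical values per row, and a target 2 steps ahead of the current time t.
series = np.arange(10)                     # pretend minute-level readings 0, 1, ..., 9
history, target_step = 3, 2
rows = []
for t in range(history, len(series) - target_step):
    # history = values at t-3, t-2, t-1; target y = value at t + target_step
    rows.append(list(series[t - history:t]) + [series[t + target_step]])
toy = pd.DataFrame(rows, columns=['x_lag5', 'x_lag4', 'x_lag3', 'y'])
print(toy)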

Now here come the challenges:

How do we convert the dataset to the new structure?


How do we handle this larger new data structure when our computer memory is limited?

As a result, the function create_ts_files is defined:

to convert the original dataset to the new dataset above.


at the same time, to divide the new dataset into smaller files, which are easier to process.

Within this function, we define the following parameters:

start_index: the earliest time to be included in all the historical data for forecasting.
In this example, we want to include history from the very beginning, so we set its default to 0.

end_index: the latest time to be included in all the historical data for forecasting.
In this example, we want to include all the history, so we set its default to None.
history_length: this is the n mentioned earlier, the number of timesteps to look back for each forecast.
step_size: the stride of the history window.
Global_active_power doesn’t change fast throughout time. So to be more efficient, we can let
step_size = 10. In this way, we downsample to use every 10 minutes of data in the past to predict the
future amount. We are only looking at t-1, t-11, t-21 until t-n to predict t+10.
target_step: the number of periods in the future to predict.
As mentioned earlier, we are trying to predict the global_active_power 10 minutes ahead. So this
parameter = 10.
num_rows_per_file: the number of records to put in each file.
This is necessary to divide the large new dataset into smaller files.
data_folder: the one single folder that will contain all the files.

That’s a lot of complicated parameters!

In the end, just know that this function creates a folder with files.
And each file contains a pandas dataframe that looks like the new dataset in the chart above.
Each of these dataframes has columns:

y, which is the target to predict. This will be the value at t + target_step (t + 10).
x_lag{i}, the value at time t + target_step – i (t + 10 – 11, t + 10 – 21, and so on), i.e., the lagged value
compared to y.

At the same time, the function also returns the number of lags (len(col_names)-1) in the dataframes. This
number will be required when defining the shape for TensorFlow models later.
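
To see what this means for the parameter values used later in this article (history_length = 7*24*60, step_size = 10, target_step = 10), here is a quick sketch reproducing the lag computation from create_ts_files:

history_length, step_size, target_step = 7*24*60, 10, 10
time_lags = sorted(range(target_step + 1, target_step + history_length + 1, step_size), reverse=True)
print(time_lags[:3], time_lags[-3:])   # [10081, 10071, 10061] [31, 21, 11]
print(len(time_lags))                  # 1008 lag columns, plus the single target column 'y'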

# Goal of the model:
# Predict Global_active_power at a specified time in the future.
# Eg. We want to predict how much Global_active_power will be ten minutes from now.
# We can use all the values from t-1, t-2, t-3, .... t-history_length to predict t+10
def create_ts_files(dataset,
                    start_index,
                    end_index,
                    history_length,
                    step_size,
                    target_step,
                    num_rows_per_file,
                    data_folder):
    assert step_size > 0
    assert start_index >= 0

    if not os.path.exists(data_folder):
        os.makedirs(data_folder)

    time_lags = sorted(range(target_step+1, target_step+history_length+1, step_size), reverse=True)
    col_names = [f'x_lag{i}' for i in time_lags] + ['y']
    start_index = start_index + history_length
    if end_index is None:
        end_index = len(dataset) - target_step

    rng = range(start_index, end_index)
    num_rows = len(rng)
    num_files = math.ceil(num_rows/num_rows_per_file)

    # for each file.
    print(f'Creating {num_files} files.')
    for i in range(num_files):
        filename = f'{data_folder}/ts_file{i}.pkl'
        if i % 10 == 0:
            print(f'{filename}')

        # get the start and end indices.
        ind0 = i*num_rows_per_file + start_index
        ind1 = min(ind0 + num_rows_per_file, end_index)
        data_list = []

        # j is the current timestep. Will need j-n to j-1 for the history. And j + target_step for the target.
        for j in range(ind0, ind1):
            indices = range(j-1, j-history_length-1, -step_size)
            data = dataset[sorted(indices) + [j+target_step]]

            # append data to the list.
            data_list.append(data)

        df_ts = pd.DataFrame(data=data_list, columns=col_names)
        df_ts.to_pickle(filename)

    return len(col_names)-1

Before applying the function create_ts_files, we also need to:

scale the global_active_power to work with Neural Networks.


define n, the history_length, as 7 days (7*24*60 minutes).
define step_size within historical data to be 10 minutes.
set the target_step to be 10, so that we are forecasting the global_active_power 10 minutes after the
historical data.

After these, we apply the create_ts_files to:

create 158 files (each including a pandas dataframe) within the folder ts_data.
return num_timesteps as the number of lags.

%%time
global_active_power = df_train['Global_active_power'].values

# Scaled to work with Neural networks.
scaler = MinMaxScaler(feature_range=(0, 1))
global_active_power_scaled = scaler.fit_transform(global_active_power.reshape(-1, 1)).reshape(-1, )

history_length = 7*24*60  # The history length in minutes.
step_size = 10  # The sampling rate of the history. Eg. If step_size = 1, then values from every minute will be in the history.
                # If step_size = 10, then values every 10 minutes will be in the history.
target_step = 10  # The time step in the future to predict. Eg. If target_step = 0, then predict the next timestep after the end of the history period.
                  # If target_step = 10, then predict the timestep 10 steps after the next one (11 minutes after the end of the history).

# The file creation returns the number of lag features. We need this value below.
num_timesteps = create_ts_files(global_active_power_scaled,
                                start_index=0,
                                end_index=None,
                                history_length=history_length,
                                step_size=step_size,
                                target_step=target_step,
                                num_rows_per_file=128*100,
                                data_folder='ts_data')

# I found that the easiest way to do time series with tensorflow is by creating pandas files with the lagged time steps (eg. x{t-1}, x{t-2}...) and
# the value to predict y = x{t+n}. We tried doing it using TFRecords, but that API is not very intuitive and lacks working examples for time series.
# The resulting file using these parameters is over 17GB. If history_length is increased, or step_size is decreased, it could get much bigger.
# Hard to fit into laptop memory, so need to use other means to load the data from the hard drive.

As the function runs, it prints the name of every 10th file.

The folder ts_data is around 16 GB, and we were only using the past 7 days of data to predict. Now you can
see why it’s necessary to divide the dataset into smaller dataframes!
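
A back-of-the-envelope check (a rough sketch on our part, assuming 8-byte floats) shows where that size comes from:

# Each row stores num_timesteps lag columns plus the target 'y'.
approx_rows = len(df_train)                 # roughly 2 million minute-level records
bytes_per_row = (num_timesteps + 1) * 8     # 1009 float64 values per row
print(approx_rows * bytes_per_row / 1e9)    # on the order of 16 GB, matching the folder size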

Defining the Time Series Object Class

In this procedure, we create a class TimeSeriesLoader to transform and feed the dataframes into the model.

There are built-in tools from Keras such as the Sequence class and the tf.data API, but they are not very efficient
for this purpose (a rough sketch of the Sequence alternative is shown below).
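
For reference, here is a minimal sketch of what wrapping the pickle files in a Keras Sequence could look like. The class name ChunkSequence is hypothetical and this is not the code used in this article; note that each file would become one oversized batch, which is part of why we use a custom loader instead.

# A minimal sketch, assuming the ts_file{i}.pkl files created above (not used later in this article).
class ChunkSequence(Sequence):
    def __init__(self, ts_folder, filename_format, num_files):
        self.ts_folder = ts_folder
        self.filename_format = filename_format
        self.num_files = num_files

    def __len__(self):
        # one item per chunk file
        return self.num_files

    def __getitem__(self, idx):
        # load one chunk and reshape to (num_records, num_timesteps, 1) for the LSTM
        df_ts = pd.read_pickle(self.ts_folder + '/' + self.filename_format.format(idx))
        features = df_ts.drop('y', axis=1).values.reshape(len(df_ts.index), -1, 1)
        return features, df_ts['y'].values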

Within this class, we define:

__init__: the initial settings of the object, including:


– ts_folder, which will be ts_data that we just created.
– filename_format, which is the string format of the file names in the ts_folder.
For example, when the files are ts_file0.pkl, ts_file1.pkl, …, ts_file100.pkl, the format would be
‘ts_file{}.pkl’.
num_chunks: this method returns the total number of files (chunks).
get_chunk: this method takes the dataframe from one of the files and processes it to be ready for training.
shuffle_chunks: this method shuffles the order of the chunks that are returned in get_chunk. This is a
good practice for modeling.

The definitions might seem a little confusing. But keep reading, you’ll see this object in action within the next
step.

#
# So we can handle loading the data in chunks from the hard drive instead of having to load everything into memory.
#
# The reason we want to do this is so we can do custom processing on the data that we are feeding into the LSTM.
# LSTM requires a certain shape and it is tricky to get it right.
#
class TimeSeriesLoader:
    def __init__(self, ts_folder, filename_format):
        self.ts_folder = ts_folder
        self.filename_format = filename_format

        # find the number of files.
        i = 0
        file_found = True
        while file_found:
            filename = self.ts_folder + '/' + self.filename_format.format(i)
            file_found = os.path.exists(filename)
            if file_found:
                i += 1

        self.num_files = i
        self.files_indices = np.arange(self.num_files)
        self.shuffle_chunks()

    def num_chunks(self):
        return self.num_files

    def get_chunk(self, idx):
        assert (idx >= 0) and (idx < self.num_files)

        ind = self.files_indices[idx]
        filename = self.ts_folder + '/' + self.filename_format.format(ind)
        df_ts = pd.read_pickle(filename)
        num_records = len(df_ts.index)

        features = df_ts.drop('y', axis=1).values
        target = df_ts['y'].values

        # reshape for input into LSTM. Batch major format.
        features_batchmajor = np.array(features).reshape(num_records, -1, 1)
        return features_batchmajor, target

    # this shuffles the order the chunks will be outputted from get_chunk.
    def shuffle_chunks(self):
        np.random.shuffle(self.files_indices)

After defining, we apply this TimeSeriesLoader to the ts_data folder.

ts_folder = 'ts_data'
filename_format = 'ts_file{}.pkl'
tss = TimeSeriesLoader(ts_folder, filename_format)


Now that the object tss points to our dataset, we are finally ready for the LSTM!

Step #3: Creating the LSTM Model


Long short-term memory (LSTM) is an artificial recurrent neural network (RNN)
architecture used in the field of deep learning.

LSTM networks are well-suited to classifying, processing and making predictions based on time
series data, since there can be lags of unknown duration between important events in a time
series.

Wikipedia

As mentioned before, we are going to build an LSTM model based on the TensorFlow Keras library.

We all know the importance of hyperparameter tuning from our guide. But in this article, we are simply
demonstrating the model fitting without tuning.

The procedures are below:

define the shape of the input dataset:


– num_timesteps, the number of lags in the dataframes we set in Step #2.
– the number of time series as 1, since we are only using the single feature Global_active_power.
define the number of units; with one input feature, 4*units*(units+2) is the number of parameters in the LSTM layer.
The higher the number of units, the more parameters in the model.
define the dropout rate, which is used to prevent overfitting.
specify the output layer to have a linear activation function.
define the model.

# Create the Keras model.
# Use hyperparameter optimization if you have the time.
ts_inputs = tf.keras.Input(shape=(num_timesteps, 1))
# units=10 -> The cell and hidden states will be of dimension 10.
# The number of parameters that need to be trained = 4*units*(units+2)
x = layers.LSTM(units=10)(ts_inputs)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(1, activation='linear')(x)
model = tf.keras.Model(inputs=ts_inputs, outputs=outputs)

Then we also define the optimization function and the loss function. Again, tuning these hyperparameters to
find the best option would be a better practice.

# Specify the training configuration.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss=tf.keras.losses.MeanSquaredError(),
              metrics=['mse'])

To take a look at the model we just defined before running, we can print out the summary.

model.summary()

You can see that the input shape looks good: the number of timesteps is history_length / step_size (7*24*60 / 10 = 1008). The number of parameters that need to be trained in the LSTM layer looks right as well (4*units*(units+2) = 4*10*12 = 480).
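
As a quick sanity check (a small sketch on our part, not from the original write-up), we can reproduce these numbers:

# LSTM parameters = 4 * units * (units + num_features + 1); with units=10 and one feature this is 480.
units, num_features = 10, 1
print(4 * units * (units + num_features + 1))   # 480
# The Dense(1) output layer adds 10 weights + 1 bias, so model.count_params() should report 491 in total.
print(model.count_params())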

Let’s start modeling!

We train each chunk in batches, and only run for one epoch. Ideally, you would train for multiple epochs for
neural networks.

%%time

# train in batch sizes of 128.
BATCH_SIZE = 128
NUM_EPOCHS = 1
NUM_CHUNKS = tss.num_chunks()

for epoch in range(NUM_EPOCHS):
    print('epoch #{}'.format(epoch))
    for i in range(NUM_CHUNKS):
        X, y = tss.get_chunk(i)

        # model.fit does train the model incrementally. ie. Can call multiple times in batches.
        # https://github.com/keras-team/keras/issues/4446
        model.fit(x=X, y=y, batch_size=BATCH_SIZE)

    # shuffle the chunks so they're not in the same order next time around.
    tss.shuffle_chunks()

After fitting the model, we may also evaluate the model performance using the validation dataset.

As with the training dataset, we also create a folder of files for the validation data, which prepares the validation
dataset for evaluation.

# evaluate the model on the validation set.
#
# Create the validation files like we did before with the training data.
global_active_power_val = df_val['Global_active_power'].values
global_active_power_val_scaled = scaler.transform(global_active_power_val.reshape(-1, 1)).reshape(-1, )

history_length = 7*24*60  # The history length in minutes.
step_size = 10  # The sampling rate of the history. Eg. If step_size = 1, then values from every minute will be in the history.
                # If step_size = 10, then values every 10 minutes will be in the history.
target_step = 10  # The time step in the future to predict. Eg. If target_step = 0, then predict the next timestep after the end of the history period.
                  # If target_step = 10, then predict the timestep 10 steps after the next one (11 minutes after the end of the history).

# The file creation returns the number of lag features. We need this value below.
num_timesteps = create_ts_files(global_active_power_val_scaled,
                                start_index=0,
                                end_index=None,
                                history_length=history_length,
                                step_size=step_size,
                                target_step=target_step,
                                num_rows_per_file=128*100,
                                data_folder='ts_val_data')

Besides testing using the validation dataset, we also test against a baseline model that uses only the most
recent history point (x_lag11, the value at t + 10 – 11 = t – 1).

The detailed Python code is below.

# If we assume that the validation dataset can fit into memory we can do this.
df_val_ts = pd.read_pickle('ts_val_data/ts_file0.pkl')

features = df_val_ts.drop('y', axis=1).values
features_arr = np.array(features)

# reshape for input into LSTM. Batch major format.
num_records = len(df_val_ts.index)
features_batchmajor = features_arr.reshape(num_records, -1, 1)

y_pred = model.predict(features_batchmajor).reshape(-1, )
y_pred = scaler.inverse_transform(y_pred.reshape(-1, 1)).reshape(-1, )

y_act = df_val_ts['y'].values
y_act = scaler.inverse_transform(y_act.reshape(-1, 1)).reshape(-1, )

print('validation mean squared error: {}'.format(mean_squared_error(y_act, y_pred)))

# baseline
y_pred_baseline = df_val_ts['x_lag11'].values
y_pred_baseline = scaler.inverse_transform(y_pred_baseline.reshape(-1, 1)).reshape(-1, )

print('validation baseline mean squared error: {}'.format(mean_squared_error(y_act, y_pred_baseline)))

On the validation dataset, the LSTM gives a Mean Squared Error (MSE) of 0.418, while the baseline model has an
MSE of 0.428. The LSTM does slightly better than the baseline.

We could do better with hyperparameter tuning and more epochs. Plus, accounting for other essential time series
properties, such as seasonality, would help too (see the rough sketch below).
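
For example, a minimal sketch (our assumption, not something covered in this article) of adding simple calendar features that could capture daily and weekly seasonality:

# Hypothetical extra features derived from date_time; using them would require
# feeding multiple features per timestep into the LSTM instead of just one.
df['hour'] = df['date_time'].dt.hour
df['day_of_week'] = df['date_time'].dt.dayofweek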

Related article: Hyperparameter Tuning with Python: Complete Step-by-Step Guide

Thank you for reading!

Hope you found something useful in this guide. Leave a comment if you have any questions.

Before you leave, don’t forget to sign up for the Just into Data newsletter! Or connect with us on Twitter or
Facebook, so you won’t miss any new data science articles from us!
