
Hands-On Tutorials

How to use Transformer Networks to build a Forecasting model

Train a Forecasting model using Transformers and PyTorch

Youness Mansar
Towards Data Science


I recently read a really interesting paper called Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case. I thought it might be an interesting project to implement something similar from scratch to learn more about time series forecasting.

The Forecasting Task:

In time series forecasting, the objective is to predict future values of a time series given its historical values. Energy demand forecasting is a classic example of such a task.

We can store a city's energy consumption measurements for a few months and then train a model that predicts the city's future consumption. Energy companies can use such a model to estimate demand and decide how much energy needs to be produced at any given time.
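In practice, such a dataset is framed as pairs of (history window, future window) that the model learns to map between. A minimal sketch of this windowing step, with illustrative window sizes that are not taken from the paper:

import numpy as np


def make_windows(series: np.ndarray, history_len: int = 120, horizon: int = 24):
    # history_len past steps are the model input; the next horizon steps are the target.
    inputs, targets = [], []
    for start in range(len(series) - history_len - horizon + 1):
        inputs.append(series[start : start + history_len])
        targets.append(series[start + history_len : start + history_len + horizon])
    return np.stack(inputs), np.stack(targets)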

Example of time series forecasting

The Model:

The model we will use is an encoder-decoder Transformer where the encoder part takes as input the history of the time series while the decoder part predicts the future values in an auto-regressive fashion.

The decoder is linked with the encoder using an attention mechanism. This way, the decoder can learn to “attend” to the most useful part of the time series historical values before making a prediction.

The decoder uses masked self-attention so that, during training, the network cannot cheat by looking ahead and using future values to predict earlier ones.
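In PyTorch this is done by passing a causal (upper-triangular) attention mask to the decoder. A minimal sketch of such a mask; the helper name is my own, and it is equivalent to nn.Transformer.generate_square_subsequent_mask:

import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    # Entries set to -inf cannot be attended to, so position t only sees positions <= t.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)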

The encoder Sub-Network:

Encoder / Image by author

The decoder Sub-Network:

Decoder / Image by author

The full model:

Auto-regressive Encoder-Decoder Transformer / Image by author

This architecture can be built in PyTorch as follows:

# Inside the model's __init__; `channels` is the embedding size (d_model)
# and self.dropout is the dropout probability.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=channels,
    nhead=8,
    dropout=self.dropout,
    dim_feedforward=4 * channels,
)
decoder_layer = nn.TransformerDecoderLayer(
    d_model=channels,
    nhead=8,
    dropout=self.dropout,
    dim_feedforward=4 * channels,
)

self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=8)
self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=8)
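The encoder output (the "memory") is then fed to the decoder together with the causal mask. A rough sketch of the forward pass, where the method signature, tensor shapes, and the projection of the inputs to `channels` dimensions are assumptions rather than the repository's exact code:

def forward(self, src, tgt):
    # src: projected history, shape (src_len, batch, channels)
    # tgt: decoder inputs (previous target values), shape (tgt_len, batch, channels)
    tgt_len = tgt.size(0)
    # Causal mask: -inf above the diagonal so position t cannot see later positions.
    tgt_mask = torch.triu(
        torch.full((tgt_len, tgt_len), float("-inf"), device=tgt.device), diagonal=1
    )
    memory = self.encoder(src)                           # summarize the history
    return self.decoder(tgt, memory, tgt_mask=tgt_mask)  # attend over the history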

The data:

Whenever I implement a new approach, I like to try it first on synthetic data, which is easier to understand and debug. This reduces the complexity coming from the data and keeps the focus on the implementation and the algorithm.

I wrote a small script that can generate non-trivial time series with different periodicity, offsets, and patterns.

import random

import numpy as np


def generate_time_series(dataframe):
    # Randomly clip the cosine wave to create flat-topped patterns.
    clip_val = random.uniform(0.3, 1)

    # `periods` is a list of candidate periods defined elsewhere in the script;
    # the random phase shifts the wave along the time axis.
    period = random.choice(periods)
    phase = random.randint(-1000, 1000)

    # The dataframe is expected to have "index", "amplitude" and "offset" columns;
    # a clipped, scaled and shifted cosine plus Gaussian noise is written to "views".
    dataframe["views"] = dataframe.apply(
        lambda x: np.clip(
            np.cos(x["index"] * 2 * np.pi / period + phase), -clip_val, clip_val
        )
        * x["amplitude"]
        + x["offset"],
        axis=1,
    ) + np.random.normal(
        0, dataframe["amplitude"].abs().max() / 10, size=(dataframe.shape[0],)
    )

    return dataframe

Examples of generated time series / Image by author
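As an illustrative call, reusing the imports from the snippet above (the specific periods, amplitude, and offset values here are guesses for demonstration, not taken from the repository):

import pandas as pd

periods = [7, 14, 28, 30]                    # candidate periodicities for the generator
df = pd.DataFrame({"index": np.arange(500)})
df["amplitude"] = random.uniform(0.5, 2.0)   # one amplitude for the whole series
df["offset"] = random.uniform(-1.0, 1.0)     # vertical offset of the series
df = generate_time_series(df)
df.plot(x="index", y="views")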

The model is then trained on all those time series at once:

Training loss / Image by author
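As a rough sketch of what one training step can look like with teacher forcing (the names model and train_loader, and the batch shapes, are assumptions rather than the repository's exact code):

import torch
import torch.nn as nn

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(10):
    for history, future in train_loader:
        # history, future: tensors of shape (seq_len, batch, channels).
        # Teacher forcing: the decoder input is the future shifted right by one step,
        # with the last known historical value prepended; the target is the future itself.
        decoder_input = torch.cat([history[-1:], future[:-1]], dim=0)
        prediction = model(history, decoder_input)
        loss = criterion(prediction, future)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()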

Results:

We now use the model to make predictions on future values of those time series. The results are somewhat mixed:

The Bad:

Examples of bad predictions / Image by author

The Good:

Examples of good prediction / Image by author

The results are not as good as I expected, especially since it is generally easy to make good predictions on synthetic data, but they are still encouraging.

On some of the bad examples, the model’s predictions are somewhat out of phase and slightly over-estimate the amplitude. On the good examples, the predictions fit the ground truth really well, excluding the noise.

I probably need to debug my code a little more and work on optimizing the hyper-parameters before I can expect to achieve better results.

Conclusion:

Transformers are currently very popular models in a multitude of Machine Learning applications, so it is only natural that they would be used for time series forecasting.

Transformers should probably not be your first go-to approach for time series, since they can be heavy and data-hungry, but they are nice to have in your Machine Learning toolkit given their versatility and wide range of applications, from their first introduction in NLP to audio processing, computer vision, and now time series.

Feel free to comment if you have any questions or suggestions.

https://github.com/CVxTz/time_series_forecasting
