
How to Apply Transformers to Time Series Models

Intel · Published in Intel Tech · 11 min read · Aug 2, 2023

Use artificial intelligence to improve data forecasting results.

A collection of office clocks of different sizes on a wall
Photo by Jon Tyson on Unsplash

Author: Ezequiel Lanza (AI open source evangelist)

In the sprawling landscape of machine learning, transformers stand tall as architectural marvels, reshaping the way we process and comprehend vast amounts of data with their intricate designs and ability to capture complex relationships.

Since the creation of the first transformer in 2017, there’s been an explosion of transformer types, including powerful generative AI models such as ChatGPT* and DALL-E*. While transformers are effective in text-to-text or text-to-image models, there are several challenges when applying transformers to time series. At the Open Source Summit North America* 2023, Ezequiel Lanza shared the problems with current transformer models and introduced new transformers that are beginning to show promising results for time series.

This post won't go deep into the technical details, but we'll include links to important papers throughout if you'd like to read more. Watch the full talk here.

Many new transformers have been created since the vanilla transformer, the first of its kind, was introduced.

Overview of Transformer Functionality

Let’s look at a transformer’s role in Stable Diffusion*, a deep learning model that can turn a phrase, such as “A dog wearing glasses,” into an image. The transformer receives the text entered by the user and generates text embeddings. Text embeddings are a representation of the text that can be read by a convolutional neural network (CNN), in this case a U-Net. While Stable Diffusion models use embeddings to generate images, embeddings can also be used to generate additional outputs that are useful for time series models.

How Transformers Work

To understand how to apply a transformer to a time series model, we need to focus on three key parts of the transformer architecture:

  • Embedding and positional encoding
  • Encoder: Calculating multi-head self-attention
  • Decoder: Calculating multi-head self-attention

As an example, we’ll explain how the vanilla transformer works: a transformer that translates simple phrases from one language into another.

Embedding and positional encoding: How you represent your input data

When you input the phrase “I love dogs” into a vanilla transformer, an algorithm called Word2Vec converts each word into a list of numbers, called a vector. Each vector contains information about what the word means and how it’s related to other words, such as synonyms and antonyms.

A model must also understand the position of each word in a phrase. For instance, “dogs I love” does not have the same meaning as “I love dogs.” A second algorithm, called positional encoding, uses a mathematical equation to help your model understand sentence order. Packaged together, the information provided by the Word2Vec and positional encoding algorithms is what’s known as a text embedding, or your original phrase represented in a way a machine can read.
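To make this concrete, here is a minimal Python sketch of how fixed sinusoidal positional encoding (the formula from the original transformer paper) can be added to word vectors. The random toy vectors stand in for Word2Vec output, and the function name and dimensions are illustrative choices, not the code behind any particular model.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed positional encoding from the original transformer paper."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                  # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])       # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])       # odd dimensions use cosine
    return encoding

# Toy word vectors standing in for Word2Vec output ("I", "love", "dogs").
word_vectors = np.random.rand(3, 8)
embedding = word_vectors + sinusoidal_positional_encoding(seq_len=3, d_model=8)
print(embedding.shape)                                # (3, 8)
```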

Multi-head Self-attention at the Encoder Level

Next, encoders receive the text embeddings and convert them into new vectors, adding information to help the model discern the relationship between words in a phrase. For example, in the phrase “Children playing in the park,” an encoder would assign the most weight to “children,” “playing,” and “park.” We call this process self-attention because it dictates which words the model should pay most attention to.

To calculate self-attention, encoders create three vectors for each word: a query vector, a key vector, and a value vector. The vectors are created by multiplying each word’s embedding by three learned weight matrices. It’s a complex algorithm, but the important part to understand is that each word in a phrase gets multiplied by every other word in the phrase, so it can take a lot of time to calculate the attention of long phrases.

To develop an even better understanding of the relationship between words, the self-attention layer can run multiple heads at once. This process is called multi-head attention, and it allows the model to focus on different parts of a phrase simultaneously, such as when there are short- and long-term dependencies. For example, in the phrase “The animal didn’t cross the road because it was too tired,” multi-head attention tells the model that “animal” and “it” refer to the same idea.
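Below is a minimal sketch of the query/key/value computation described above, written in plain PyTorch. The weight matrices are random stand-ins for learned parameters, and the two-head loop is only meant to illustrate the multi-head idea, not to be a production implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one head.
    x: (seq_len, d_model) phrase embeddings."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # query, key, and value vectors
    scores = q @ k.T / (k.shape[-1] ** 0.5)    # every word scored against every other word
    weights = F.softmax(scores, dim=-1)        # attention weights per word
    return weights @ v                         # weighted mix of value vectors

seq_len, d_model, d_head = 5, 16, 8
x = torch.randn(seq_len, d_model)              # e.g., "Children playing in the park"
heads = []
for _ in range(2):                             # two heads attend to different patterns
    w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
    heads.append(self_attention(x, w_q, w_k, w_v))
multi_head = torch.cat(heads, dim=-1)          # (seq_len, 2 * d_head)
print(multi_head.shape)
```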


Multi-head Self-attention at the Decoder Level

The decoder works the same way as the encoder, except it has been trained using a different data set. For example, in a vanilla transformer, if the encoder has been trained on English language data and the decoder on French data, the decoder will run the same multi-head self-attention algorithms to translate the original phrase into French.

Using Transformers for Time Series

Why doesn’t this transformer architecture work for time series? Time series acts like a language in some ways, but it’s different from traditional languages. In language, you can express the same idea using vastly different words or sentence orders. Once a language-based transformer such as the vanilla transformer has been trained on a language, it can understand the relationship between words, so when you represent an idea with two different inputs, the transformer will still arrive at roughly the same meaning. Time series, however, requires a strict sequence: the order of the data points matters much more. This presents a challenge for using transformers for time series.

Let’s look at how we currently solve this problem and why these models fall short.

Current Approaches

The autoregressive integrated moving average (ARIMA) model works for some time series but requires a deep understanding of associated trends, seasonal changes, and residual values — and even then it only works for linear dependencies. In many time series that feature multivariate problems, the relationship between dependencies is not linear and ARIMA will not work.
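For reference, here is roughly what an ARIMA baseline looks like with statsmodels. The synthetic series and the (p, d, q) order below are illustrative rather than tuned values.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic univariate series with a drifting trend plus noise.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(0.5, 1.0, size=200))

# order = (p, d, q): autoregressive lags, differencing, moving-average lags.
model = ARIMA(series, order=(2, 1, 1))
fitted = model.fit()
forecast = fitted.forecast(steps=12)   # predict the next 12 points
print(forecast)
```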

There are also several approaches that use neural networks.

  • Feedforward neural network (FNN) models use the previous six data points in a series to predict the next six. Though FNNs enable nonlinear dependencies, they require you to handcraft a model that focuses on very specific problems or subsets of data, making them too time-consuming to construct for large data sets.
  • In a recurrent neural network (RNN) model, you can feed the model a small subset of data points that are relevant to your time series, and cells in the RNN will memorize which data points are important and what their weight is. However, when you’re dealing with data sets that have long dependencies, the weight becomes less relevant, and the model’s accuracy diminishes over time.
  • Long short-term memory (LSTM) models are similar to RNNs, except that each cell has a memory that allows you to update the weight more frequently during long sequences. This makes LSTM a good solution for some use cases (see the sketch after this list).
  • Seq2seq is a way to improve LSTM performance. Instead of feeding your network directly, you can feed your data into an encoder, which generates features of your input that get fed into the decoder.
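As a rough illustration of the sliding-window approach these networks rely on, here is a minimal PyTorch LSTM forecaster that maps a fixed window of past points to a fixed horizon of future points. The layer sizes and window lengths are arbitrary choices for the sketch, not values from the talk.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Predict the next `horizon` values from a fixed window of past values."""
    def __init__(self, hidden_size=64, horizon=6):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, horizon)

    def forward(self, window):                 # window: (batch, window_len, 1)
        _, (hidden, _) = self.lstm(window)     # hidden: (1, batch, hidden_size)
        return self.head(hidden[-1])           # (batch, horizon)

# Toy usage: six past points in, six future points out.
model = LSTMForecaster(horizon=6)
past = torch.randn(32, 6, 1)                   # a batch of 32 windows
prediction = model(past)
print(prediction.shape)                        # torch.Size([32, 6])
```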

How Can Transformers Improve Time Series?

Using multi-head attention enabled by transformers could help improve the way time series models handle long-term dependencies, offering benefits over current approaches. To give you an idea of how well transformers work for long dependencies, think of the long and detailed responses that ChatGPT can generate in language-based models. Applying multi-head attention to time series could produce similar benefits by allowing one head to focus on long-term dependencies while another head focuses on short-term dependencies. We believe transformers could make it possible for time series models to predict as many as 1,000 data points into the future, if not more.

The Quadratic Complexity Issue

The way transformers calculate multi-head self-attention is problematic for time series. Because every data point in a series must be multiplied by every other data point in the series, each data point you add to your input increases the calculation time quadratically rather than linearly. This is called quadratic complexity, and it creates a computational bottleneck when dealing with long sequences.

The time it takes to calculate attention grows quadratically with each new data point you add to a series.

Improving the Transformer Model for Time Series

A survey published early this year identified two essential network modifications to address before applying transformers to time series:

  • Positional encoding: How we represent the input data
  • Attention module: Ways to reduce time complexity

The next section will cover the high-level takeaways, but you can read the survey for more details about the modifications and their results.

Network Modification №1: Positional Encoding

In 2019, we tried applying the vanilla transformer’s Word2Vec encoding process to time series, but the model couldn’t fully exploit the important features of a time series. Vanilla transformers are good at discerning the relationship between words but not at following a strict order in a data sequence. Read more.

In 2021, we created learnable text embeddings, enabling us to include additional positional encoding information in an input. Compared to fixed encoding in vanilla transformers, learnable positional encoding allows transformers to be more flexible and better exploit sequential ordering information. This helps transformers learn more important context about a time series, such as seasonal information.
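A minimal sketch of the idea, assuming a PyTorch-style model: positions are looked up in a trainable embedding table instead of being computed from a fixed sine/cosine formula, so the model can learn which positions (such as seasonal offsets) matter most. The class name and dimensions are illustrative, not the survey’s implementation.

```python
import torch
import torch.nn as nn

class LearnablePositionalEncoding(nn.Module):
    """Trainable position table added to the projected input values."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.position_embedding = nn.Embedding(max_len, d_model)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.position_embedding(positions)  # broadcast over the batch

encoder = LearnablePositionalEncoding(max_len=512, d_model=64)
values = torch.randn(8, 96, 64)                        # 96 time steps of projected sensor data
print(encoder(values).shape)                           # torch.Size([8, 96, 64])
```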

Network Modification №2: Attention Module

To reduce the quadratic complexity in the attention layer, new transformers introduce the concept of ProbSparse Attention. By enabling the attention layer to calculate weight and probability using the most important data points only, instead of all data points, ProbSparse helps greatly reduce the time it takes to calculate attention.

ProbSparse attention, used in new models like Informer*, reduces time by calculating probability based only on the most important data points in a series.
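To illustrate the general idea of sparse attention, here is a simplified top-k variant in PyTorch in which each query attends only to its highest-scoring keys. This is a stand-in for intuition only; the actual ProbSparse algorithm in the Informer paper selects a subset of queries using a probability-based sparsity measurement.

```python
import torch
import torch.nn.functional as F

def topk_attention(q, k, v, top_k=8):
    """Each query attends only to its top_k highest-scoring keys
    instead of the full sequence (a simplified sparse attention)."""
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)  # (seq_len, seq_len)
    top_scores, top_idx = scores.topk(top_k, dim=-1)         # keep the k best keys per query
    weights = F.softmax(top_scores, dim=-1)                  # (seq_len, top_k)
    return torch.einsum("qk,qkd->qd", weights, v[top_idx])   # gather and mix value vectors

seq_len, d = 96, 32
q, k, v = (torch.randn(seq_len, d) for _ in range(3))
out = topk_attention(q, k, v, top_k=8)
print(out.shape)                                             # torch.Size([96, 32])
```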

Putting New Transformers to the Test

While many new transformers such as LogTrans*, Pyraformer*, and FEDformer* incorporate these network modifications, here we’re focusing on Informer and Spacetimeformer* because they’re open source. The GitHub* repos offer reference documentation and examples to make it easy to fine-tune the models to your data without having to understand every detail of the attention layer.
Let’s look at how Informer and Spacetimeformer leverage these network modifications and see what kind of results they generate.

The Informer Architecture

Informer transformers enable you to feed them important information about seasonal, monthly, or holiday trends to help the model understand subtle differences in the way your data behaves over the course of a year. For example, your data set may behave differently in the summer than in the winter. Through positional encodings, you can tell the model to use different weights during different times of the year, allowing you to take greater control over the quality of your input.
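Here is a sketch of how you might build those calendar covariates with pandas before handing them to an Informer-style model. The column names, the synthetic latency values, and the holiday list are illustrative, not the exact format expected by the Informer repo.

```python
import pandas as pd

# Hourly measurements indexed by timestamp (illustrative data).
index = pd.date_range("2023-01-01", periods=24 * 7, freq="H")
df = pd.DataFrame({"latency_ms": 100.0}, index=index)

# Calendar covariates the model can attach to each time step.
df["month"] = df.index.month
df["day_of_week"] = df.index.dayofweek
df["hour"] = df.index.hour
df["is_holiday"] = df.index.normalize().isin(
    pd.to_datetime(["2023-01-01"])           # illustrative holiday list
).astype(int)
print(df.head())
```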

By combining a ProbSparse attention model and positional encoding, Informer offers performance advantages over traditional deep learning architectures like LSTM. When predicting 24 data points into the future, Informer produces a slightly better mean squared error (MSE) of 0.577 than LSTM’s MSE of 0.650. When predicting 720 data points, the difference in performance grows, with Informer earning an MSE of 1.215 compared to LSTM’s 1.960. What we can conclude here is that Informer delivers slightly better results in a long series, but LSTM may still produce good results for certain short-term use cases.

Informer produces slightly better results than LSTM models, especially for long data series.

The Spacetimeformer Architecture

Spacetimeformer proposes a new way to represent inputs. Temporal attention models like Informer represent the value of multiple variables per time step in a single input token, which fails to consider spatial relationships between features. Graph attention models allow you to manually represent relationships between features but rely on hardcoded graphs that cannot change over time. Spacetimeformer combines both temporal and spatial attention methods, creating an input token to represent the value of a single feature at a given time. This helps the model understand more about the relationship between space, time, and value information.

Spacetimeformer calculates weight using space and time features in parallel, represented by the blue lines in the bottom right.
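Here is a small, illustrative sketch of the difference in input representation: a temporal-attention model keeps one token per time step with all features packed together, while a Spacetimeformer-style model flattens the window into one token per (time step, feature) pair, tagged with which time step and which feature it came from. The shapes and tensor names are assumptions for the sketch, not the repo’s code.

```python
import torch

# A window of multivariate data: 4 time steps x 3 features (e.g., three services).
batch, time_steps, n_features = 2, 4, 3
x = torch.randn(batch, time_steps, n_features)

# Temporal-attention style (Informer): one token per time step, features packed together.
temporal_tokens = x                                                  # (batch, 4, 3)

# Spacetimeformer style: one token per (time step, feature) pair, plus ID tags
# so attention can relate tokens across both space and time.
values = x.reshape(batch, time_steps * n_features, 1)                # (batch, 12, 1)
time_ids = torch.arange(time_steps).repeat_interleave(n_features)    # 0,0,0,1,1,1,...
feature_ids = torch.arange(n_features).repeat(time_steps)            # 0,1,2,0,1,2,...
print(values.shape, time_ids.tolist(), feature_ids.tolist())
```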

Like Informer, Spacetimeformer offers marginally better results than LSTMs. When predicting 40 hours into the future, Spacetimeformer’s MSE of 12.49 is slightly better than LSTM’s MSE of 14.29. While this margin widens for longer sequences, Spacetimeformer does not yet deliver significantly better results than LSTMs for every use case.

Similar to Informer, Spacetimeformer produces slightly better results than an LSTM, especially for longer time series.

Use Case: Latency in a Microservice Architecture

Let’s apply a time series model to an online boutique. The store has 11 microservices, including a cart service that allows users to add and remove items and a catalog service that allows users to search for individual products.

Applying Informer time series forecasting to an online boutique store with 11 microservices.

To demonstrate the impact on end users, we’ll predict how long users must wait for each microservice to process a request. Basing our model on the previous 360 data points of each service, we ran a short prediction of 36 data points into the future and a long prediction of 120 data points into the future.
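As a sketch of that setup, assuming each microservice’s latency history is a simple NumPy array, the windowing might look like this. The helper name and the synthetic data are illustrative.

```python
import numpy as np

def make_windows(series, context=360, horizon=36):
    """Slice a latency series into (past context, future target) pairs."""
    inputs, targets = [], []
    for start in range(len(series) - context - horizon + 1):
        inputs.append(series[start : start + context])
        targets.append(series[start + context : start + context + horizon])
    return np.array(inputs), np.array(targets)

latency = np.random.rand(1000)                    # stand-in for one service's latency history
x_short, y_short = make_windows(latency, context=360, horizon=36)
x_long, y_long = make_windows(latency, context=360, horizon=120)
print(x_short.shape, y_short.shape)               # (605, 360) (605, 36)
```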

When predicting the next 36 data points, Informer produced an MSE of 0.6, which is slightly better than LSTM. However, Informer took more time to process. The same is true for the results of the long model: Informer’s predictions are more accurate but take longer to process.

Informer generated better results in both short and long data series, but took slightly more time to process.

Get involved and Start Testing

Time series vary in complexity, so it’s important to test models to find the best fit for your use case. While traditional models such as LSTM are strong options for certain short-term time series, Informer and Spacetimeformer may offer more accurate forecasting for long-term series. We expect performance to improve as we continue to make optimizations to the attention layer and how the input data is represented. Additionally, as open source frameworks, Informer and Spacetimeformer make it much easier to install a model and start testing it with your data.

Please contribute to the GitHub repos to help advance these projects. We also offer a library of deep learning tools and frameworks to make the most of our open source models.

For more open source content from Intel, check out open.intel

About the author

Ezequiel Lanza, Open Source Evangelist. Passionate about helping people discover the exciting world of artificial intelligence, Ezequiel is a frequent AI conference presenter and the creator of use cases, tutorials, and guides that help developers adopt open source AI tools like TensorFlow* and Hugging Face*. Find him on Twitter at @eze_lanza
