Attention Augmented Convolutional Transformer For Tabular Time-Series
Sharath M Shankaranarayana1 and Davor Runje1,2
1 Airt Research, Zagreb, Croatia
2 Algebra University College, Zagreb, Croatia

Abstract— Time-series classification is one of the most frequently performed tasks in industrial data science, and one of the most widely used data representations in the industrial setting is the tabular representation. In this work, we propose a novel scalable architecture for learning representations from tabular time-series data and subsequently performing downstream tasks such as time-series classification. The representation learning framework is end-to-end, akin to bidirectional encoder representations from transformers (BERT) in language modeling; however, we introduce a novel masking technique suitable for pretraining on time-series data. Additionally, we use one-dimensional convolutions augmented with transformers and explore their effectiveness, since time-series datasets lend themselves naturally to one-dimensional convolutions. We also propose a novel timestamp embedding technique, which helps in handling both periodic cycles at different time granularity levels and aperiodic trends present in the time-series data. Our proposed model is end-to-end, can handle both categorical and continuous-valued inputs, and does not require any quantization or encoding of continuous features.

I. INTRODUCTION

Industrial entities in domains such as finance, telecommunications, and healthcare usually log large amounts of data about their customers or patients. The data is typically in the form of event data, capturing the interactions their users have with different entities. The events are specific to the respective domains: for example, financial institutions log all financial transactions made by their customers, telecommunication companies log all interactions of their individual customers, and national healthcare systems keep logs of all visits and diagnoses patients had in different healthcare institutions. These kinds of data mostly employ a tabular representation. Multivariate time-series data represented in tabular form is often referred to as dynamic tabular data [1] or tabular time-series. These datasets are rich sources for knowledge discovery, analytics, and building predictive models using machine learning (ML) techniques.

Even though a large amount of data exists, it often remains unexplored due to various inherent difficulties (such as lack of structure, noise, and sparse and missing values). Because of these complexities, usually only a subset of the data is employed for building ML models; thus, a large amount of potentially rich and relevant data goes unused for ML modeling. Moreover, machine learning on such data is frequently performed by first extracting hand-crafted or engineered features and later building task-specific machine learning models. This feature engineering is known to be one of the biggest challenges in industrial machine learning, since it requires domain knowledge and time-consuming experimentation for every different use case. Additionally, it has limited scalability because models cannot be automatically created and maintained. Another important issue is that most machine learning models require data to be in the form of fixed-length vectors or sequences of vectors, which is not straightforward to obtain for such large time-series datasets. To this end, we propose a framework for learning vector representations from large time-series data. Our framework can later employ these learned representations on downstream tasks such as time-series classification.

Learning numerical vector representations (embeddings), or representation learning, is one of the major areas of research, specifically in the natural language processing (NLP) domain. In the seminal work word2vec [2], vector representations of words are learnt from huge quantities of text. In word2vec, each word is mapped to a d-dimensional vector such that semantically similar words have geometrically closer vectors. This is achieved by predicting either the context words appearing in a window around a given target word (the skip-gram model), or the target word given its context (the continuous bag-of-words, or CBOW, model). The model assumes that words appearing frequently in similar contexts share statistical properties, and it thus leverages word co-occurrence statistics.

NLP has since seen massive improvements in results by building upon embedding-related work. For example, the work [3] proposed a transfer learning scheme for text classification and thus heralded a new era by making transfer learning in NLP on par with transfer learning in computer vision. Another recent work [4] proposed contextualized word representations, as opposed to static word representations.

The current state-of-the-art techniques in NLP learn vector representations of words using transformers and the attention mechanism [5] on large datasets, with the task of reconstructing an input text in which some of the words have been randomly masked [6].

Following the immense gains obtained by employing models with the attention mechanism and transformers in NLP, a few works have employed them for tabular data. In [7], the authors propose a network called TabNet, a transformer-based model for self-supervised learning on tabular data. They report significant gains in prediction accuracy when the labelled data is sparse; however, the work does not learn any sort of embeddings inherently and translates the use of the transformer architecture
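As a rough illustration of the skip-gram objective described above, the sketch below (not from the paper; the function name and window size are illustrative assumptions) generates the (target, context) training pairs from a tokenized text using a symmetric context window:

```python
from typing import List, Tuple

def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Generate (target, context) pairs for skip-gram training.

    For each target word, every word within `window` positions on either
    side is treated as a context word to be predicted from the target.
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Words occurring in similar contexts produce overlapping pair sets,
# which is what drives their learned vectors closer together.
print(skipgram_pairs("banks log all financial transactions".split()))
```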
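Likewise, the masked-reconstruction pretraining of [6] can be sketched as randomly hiding a fraction of the input tokens and training the model to recover them. The snippet below is a simplified illustration; the mask id, masking probability, and the -100 ignore label are conventions borrowed from common implementations, not values taken from this paper:

```python
import random

MASK_ID = 0          # assumed id reserved for the [MASK] token
MASK_PROB = 0.15     # assumed fraction of positions hidden during pretraining

def mask_tokens(token_ids, mask_prob=MASK_PROB):
    """Randomly replace a fraction of tokens with MASK_ID.

    Returns the corrupted sequence and the labels: the original id at
    masked positions, and -100 (a sentinel the loss can be told to
    ignore) everywhere else.
    """
    corrupted, labels = [], []
    for tok in token_ids:
        if random.random() < mask_prob:
            corrupted.append(MASK_ID)
            labels.append(tok)        # the model must reconstruct this token
        else:
            corrupted.append(tok)
            labels.append(-100)       # position excluded from the loss
    return corrupted, labels
```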
Fig. 1. Attention-augmented convolution: for each temporal location, Nh attention maps over the input are computed from queries and keys. These attention maps are used to compute Nh weighted averages of the values V. The results are then concatenated, reshaped to match the original input's dimensions, and mixed with a pointwise convolution. Multi-head attention is applied in parallel to a standard convolution operation, and the outputs are concatenated.
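The following is a minimal sketch of the operation in Fig. 1, assuming PyTorch; the layer sizes, projection choices, and class name are illustrative rather than the paper's exact configuration. A standard 1D convolution and a multi-head self-attention branch are computed over the same input, and their outputs are concatenated along the channel dimension:

```python
import torch
import torch.nn as nn

class AttentionAugmentedConv1d(nn.Module):
    """1D convolution run in parallel with multi-head self-attention.

    The convolutional branch produces conv_channels feature maps and the
    attention branch produces attn_channels; the two are concatenated
    along the channel dimension, as in the figure above.
    """

    def __init__(self, in_channels, conv_channels, attn_channels,
                 num_heads=4, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, conv_channels, kernel_size,
                              padding=kernel_size // 2)
        # Project the input before computing self-attention over time.
        self.attn_proj = nn.Conv1d(in_channels, attn_channels, kernel_size=1)
        self.attn = nn.MultiheadAttention(attn_channels, num_heads, batch_first=True)
        # Pointwise convolution mixing the attention outputs.
        self.mix = nn.Conv1d(attn_channels, attn_channels, kernel_size=1)

    def forward(self, x):                      # x: (batch, in_channels, length)
        conv_out = self.conv(x)                # (batch, conv_channels, length)
        a = self.attn_proj(x).transpose(1, 2)  # (batch, length, attn_channels)
        attn_out, _ = self.attn(a, a, a)       # self-attention over temporal positions
        attn_out = self.mix(attn_out.transpose(1, 2))  # back to (batch, channels, length)
        return torch.cat([conv_out, attn_out], dim=1)  # concatenate branch outputs

# Example: a batch of 8 series, 16 input features, 128 time steps.
y = AttentionAugmentedConv1d(16, 32, 32)(torch.randn(8, 16, 128))
print(y.shape)  # torch.Size([8, 64, 128])
```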
B. Architecture
TABLE I
Comparison of various methods