Attention Augmented Convolutional Transformer For Tabular Time-Series

Attention Augmented Convolutional Transformer for Tabular
Time-series
Sharath M Shankaranarayana1 and Davor Runje1,2
Abstract— Time-series classification is one of the most fre- learning, since it requires domain knowledge and time-
quently performed tasks in industrial data science, and one consuming experimentation for every different use case.
of the most widely used data representation in the industrial Additionally, it has limited scalability because models cannot
setting is tabular representation. In this work, we propose a
novel scalable architecture for learning representations from be automatically created and maintained. Another important
tabular time-series data and subsequently performing down- issue is that most machine learning models require data
arXiv:2110.01825v1 [cs.LG] 5 Oct 2021
stream tasks such as time-series classification. The represen- to be in the form of fixed-length vectors or sequences of
tation learning framework is end-to-end, akin to bidirectional vectors. This is not straightforward and is difficult to obtain
encoder representations from transformers (BERT) in language for such large time-series datasets. To this end, we propose
modeling, however, we introduce novel masking technique
suitable for pretraining of time-series data. Additionally, we also a framework for learning vector representations from large
use one-dimensional convolutions augmented with transformers time-series data. Our framework can later employ these
and explore their effectiveness, since the time-series datasets learned representations on downstream tasks such as time-
lend themselves naturally for one-dimensional convolutions. series classification.
We also propose a novel timestamp embedding technique, Learning numerical vector representations (embeddings),
which helps in handling both periodic cycles at different time
granularity levels, and aperiodic trends present in the time- or representation learning, is one the major areas of re-
series data. Our proposed model is end-to-end and can handle search, specifically in the natural language processing (NLP)
both categorical and continuous valued inputs, and does not domain. In the seminal work called word2vec [2], vector
require any quantization or encoding of continuous features. representations of words are learnt from huge quantities of
I. I NTRODUCTION text. In word2vec, each word is mapped to a d-dimensional
vector such that semantically similar words have geometri-
Industrial entities in domains such as finance, telecom- cally closer vectors. This is achieved by predicting either the
munication, and healthcare usually log a large amount of context words appearing in a window around a given target
data of their customers or patients. The data is typically in word (skip-gram model), or the target word given the context
the form of events data, capturing interactions their users (called continuous bag of words or CBOW model). In this
have with different entities. The events could be specific model, it is assumed that the words appearing frequently
to the respective domains, for example, financial institu- in similar contexts share statistical properties and thus this
tions log all financial transactions made by their customers, model leverages word co-occurrence statistics.
telecommunication companies log all the interactions of the NLP has since then seen massive improvements in terms
individual customers, national healthcare systems keep logs of results by building upon embeddings-related works. For
of all visits and diagnoses patients had in different healthcare example, the work [3] proposed a transfer learning scheme
institutions, etc. These kinds of data mostly employ tabular for text classification and thus heralded a new era by making
representation. A multivariate time-series data represented in the transfer learning in NLP on par with the transfer learning
tabular form is often referred to as dynamic tabular data [1] tasks in computer vision. Another recent work [4] proposed
or tabular time-series. These datasets are rich for performing contextualized word representations, as opposed to static
knowledge discovery, doing analytics, and also building word representations.
predictive models using machine learning (ML) techniques. The current state-of-the-art techniques in NLP learn vector
Even though there exists a large amount of data, the data representations of words using transformers and attention
often remains unexplored due to various inherent difficulties mechanism [5] on large datasets with the task of reconstruct-
(such as unstructuredness, noise, sparse and missing values). ing an input text with some of the words in it randomly
Because of the complexities, usually only a subset of the masked [6].
data is employed for building ML models. Thus, a large Upon seeing the immense gains of employing models
amount of potentially rich and relevant data is unused for with the attention mechanism and transformers in NLP,
ML modeling. Moreover, machine learning using such data there have been a few works employing them for tabu-
is frequently performed by first extracting hand-crafted or lar data. In the work [7], the authors propose a network
engineered features and later building task-specific machine called TabNet, which is a transformer based model for self-
learning models. This engineering of the features is known supervised learning for tabular data. They report significant
to be one of the biggest challenges in industrial machine gains in prediction accuracy when the labelled data is sparse,
1 Airt Research, Zagreb, Croatia however, the work does not learn any sort of embeddings
2 Algebra University College, Zagreb, Croatia inherently and translates the use of transformer architecture
Fig. 1. Attention-augmented convolution: For each temporal location, Nh attention maps over the image are computed from queries and keys. These
attention maps are used to compute Nh weighted averages of the values V . The results are then concatenated, reshaped to match the original input’s
dimensions and mixed with a pointwise convolution. Multi-head attention is applied in parallel to a standard convolution operation and the outputs are
concatenated.
from NLP domain to tabular domain. Another similar very follows:

recent work [8] proposes similar use of transformers for the • We propose a novel BERT framework employing atten-
tabular data. In this work, although the authors do intend tion augmented convolutions for time-series tabular data
to learn embeddings, it is only for encoding categorical TabAConvBERT.
features present in the tabular data. Whereas in our work, • We propose a masking scheme that is better suitable for
we learn the embeddings for specific agents by contextual- time-series tabular data.
izing the interactions between the various agents. Thus, the • We propose a novel timestamp embedding block along
embeddings we obtain are very informative, unlike broad with positional encoding. This timestamp embedding
encoding of categorical values. Most recently, the authors block helps in handling of both periodic cycles at
in [9] proposed the use of BERT architecture modified for different time granularity levels and aperiodic trends.
time-series anomaly detection task. • The proposed framework can handle both discrete and
To the best of our knowledge, the closest work to ours is continuous input types.
TabBERT proposed in the work [1], in which the authors em-
ploy hierarchical BERT [10], [11] to first perform pretraining II. M ETHODOLOGY
on time-series data and then later employ the embeddings
on downstream tasks. Although similar, our proposed work A. Attention Augmented Convolution
differs in several aspects: One-dimensional (1-D) convolutions have long been em-
• We do not employ hierarchical BERT but single en-
ployed in time-series tasks and have shown reasonably
coder layer consisting of transformers, and make use of good results [14]. In computer vision, employing attention
embeddings to transform raw inputs in order to be fed mechanisms along with convolutional neural networks have
to BERT, making its computational cost significantly given significant improvements in various vision tasks [15],
smaller. [16], [17], [18]. Recently, the authors of the work [13]
• We propose a masking technique specifically suited for
proposed attention augmented convolutional networks that
time-series data resulting in higher performance of the can jointly attend to both spatial and feature subspaces,
model. in contrast to previous convolutional attention mechanisms,
• TabBERT requires input data to be already encoded as
which either perform only channel-wise reweighing [15],
categorical values, whereas our proposed framework can [16] or perform reweighing of both channels and spatial
handle both discrete and continuous inputs. positions independently[17], [18]. In this work, we propose a
1-D convolution version of attention augmented convolutions
Another relevant work to ours is [12], where the authors
[13]. Similar to [13], the proposed attention augmented
propose transformers for time-series representation learning,
convolution
but the pretraining loss function employed is mean squared
1) is equivariant to translation, and
error loss for both continuous and discrete features, whereas
2) can readily operate on inputs of different temporal
in our method we apply classification loss for discrete type
dimensions.
features and regression loss for floating features. To the
best of our knowledge, ours is the first work that employs As shown in [13], if we consider an original convolution
attention augmented convolutions [13] (as a part of BERT operator with kernel size k, Fin input filters, and Fout output
layer) for time-series data and also the first work that filters, the corresponding attention augmented convolution
proposes timestamp embedding block. can simply be written as:
In summary, the main contributions of our paper are as AAConv(X) = Concat[Conv(X), MHA(X)] (1)
Fig. 2. Our proposed architecture of TabAConvBERT
where Conv denotes convolution and MHA denotes multi-

head attention.
B. Architecture
The proposed architecture is shown in Fig. 2. The archi-

tecture consists of embedding network, which can encode
both categorical inputs (by using simple embedding neural
network) and continuous inputs (by using simple shallow
neural network). The architecture also consists of a special
timestamp embedding block, which is described in more
detail in Fig. 3. We first break the original raw timestamp
into multiple components such as year, month, day, weekday,
week, hour, minute and seconds. These broken up compo-
nents are discrete and have a finite set of values. For example,
month feature has 12 values, week feature has 52 values,
weekday feature has 7 values, and so on. Each of these
broken up discrete features are passed through an embedding
layer. The outputs of each of the embedding layers are then
summed-up. Additionally, we also create normalized times-
tamp features based on dates and time. These float values are
then passed through a shallow neural network with “activity
regularization” to obtain the vector representations having the
same output dimensions of embedding representation. The
resulting outputs from these shallow neural networks are also
added to the summed-up embedding vector to obtain the final Fig. 3. Our proposed Timestamp Embedding Block
time embedding. The obtained time embedding is then added
with input features’ embedding and also positional encoding.
The resulting summed value is then passed to attention aug- C. Masking
mented convolution layer and feed forward neural network For learning representations from the time-series tabular
layer. This combination of attention augmented convolution data, we propose a procedure called masked data modeling
layer and feed forward neural network layer can be stacked (MDM), akin to masked language modeling (MLM) em-
N times to obtain the encoder layer of BERT. Finally, the ployed in NLP. One of the foremost steps in pretraining,
dense layers in the output part of the network are added to as the name suggests, is masking. In MLM, masking is
aid pretraining or downstream tasks. straightforward, since the languages contain only sequence of
Method Fraud F1 Score
Baseline MLP [1] 0.74
Baseline LSTM [1] 0.73
TabBERT MLP [1] 0.83
TabBERT LSTM [1] 0.86
Proposed TabAConvBERT (Only Encoded Data) 0.888
Proposed TabAConvBERT (With custom masking) 0.892
Proposed TabAConvBERT 0.896
TABLE I
C OMPARISON OF VARIOUS METHODS
Fig. 4. Our proposed Masking methodology

For the downstream task of fraud classification, we again
perform similar procedures by creating samples by combin-
words. However, in multivariate time-series data, we propose ing 10 contiguous rows (with a stride of 10) in a time-
two kinds of masking. For the first kind of masking, we mask dependent manner for each user, and thus obtain 2.4M sam-
out certain percentage of the features at random from the ples, with 29,342 labeled as fraudulent. For evaluation, we
tabular time-series data. This masking is done independently use F1 binary score, on a test set consisting of 480K samples
for all the features and is similar to the one performed in for better comparison. For the downstream task, we re-
[7]. For the second kind of masking, we randomly mask out move the final dense layers employed during the pretraining
certain percentage of entire rows of features for the time- stage and a single dense layer for binary classification. For
series inputs. Almost all the previous works perform only downstream classification task, after pretraining, we freeze
first kind of masking for the tabular data. But only masking all embedding layers and finetune the layers from attention
out individual features may not be very effective since often augmented convolution layer. We perform three different
the features are slightly correlated and hence it becomes kinds of experiments. First, we employ the same quantization
easier for MDM to predict missing features. The masking and same masking employed in [1], with the main difference
of entire row is similar to masking a word in MLM since, being the BERT architecture having attention augmented
in tabular data, the analogue of a ”word” is an entire row of convolutional layers. Next, we employ our proposed masking
features. The figure Fig. 4 gives a clear picture of two types scheme but keep the same quantized data. This helps in
of masking employed in our work. gauging the improvements offered by our masking scheme.
Additionally, different from other previous works, our Lastly, we do not perform any encoding and keep all the
framework has the ability to handle continuous type of inputs continuous valued features as is, and also employ timestamp
as is, without resorting to binning. Although for masking the encoding as described in the previous section. Since we
categorical data we can simply have a specific integer token employ the same data partitions as described in [1], we
similar to MASK token employed in MLM, for continuous directly compare our results from their paper.
type of inputs, we mask the data by replacing the original As seen from Table I, the proposed methods give signif-
values with the mean value of the particular continuous icantly improved results on the downstream fraud classifi-
feature. cation task. The improvements can be seen by directly em-
ploying our TabAConvBERT architecture on the quantized or
III. E XPERIMENTS AND R ESULTS
encoded data. The masking procedure also helps in providing
For our experiments, we employ the dataset provided a slight improvement in the F1 score. The best results can be
in the work [1]. The dataset has 24 million transactions seen from our TabAConvBERT with no data preprocessing,
from 20,000 users. Each transaction (row) has 12 fields i.e., keeping the continuous and categorical values as is and
(columns) consisting of both continuous and discrete nom- employing timestamp embedding.
inal attributes, such as merchant name, merchant address,
transaction amount, etc. For easier comparison, we employ IV. C ONCLUSION
the similar procedures for sampling as performed in [1] In this work, we proposed a novel end-to-end BERT based
by creating samples as sliding windows of 10 transactions, architecture for time-series tasks. For the first time, we pro-
with a stride of 5. The strategies employed in masking are posed the use of attention augmented convolutions for tabular
different since we propose custom masking methodology, time-series data and also proposed major modifications to
however, we perform an experiment by quantizing and creat- the masking methodology for tabular time-series data. From
ing a vocabulary same as [1]. For masking, we perform 30% our experiments, we showed that each of the individual
masking of sample’s fields and 15% masking of sample’s modifications lead to improved results. Our method has a
entire rows. While performing pretraining, we omit the label major advantage that it can be directly used with raw data,
column to prevent biasing. In our proposed TabAConvBERT without resorting to techniques such as feature quantization
architecture, we only employ a single AAConvBERT layer or encoding. In future, we would like to rigorously evaluate
to keep the number of parameters low. on even larger scale industrial datasets.
R EFERENCES
[1] I. Padhi, Y. Schiff, I. Melnyk, M. Rigotti, Y. Mroueh, P. Dognin,
J. Ross, R. Nair, and E. Altman, “Tabular transformers for modeling
multivariate time series,” in ICASSP 2021-2021 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2021, pp. 3565–3569.
[2] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of
word representations in vector space,” arXiv preprint arXiv:1301.3781,
2013.
[3] J. Howard and S. Ruder, “Universal language model fine-tuning for
text classification,” arXiv preprint arXiv:1801.06146, 2018.
[4] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee,
and L. Zettlemoyer, “Deep contextualized word representations,” arXiv
preprint arXiv:1802.05365, 2018.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training
of deep bidirectional transformers for language understanding,” arXiv
[7] S. O. Arik and T. Pfister, “Tabnet: Attentive interpretable tabular
learning,” arXiv preprint arXiv:1908.07442, 2019.
[8] X. Huang, A. Khetan, M. Cvitkovic, and Z. Karnin, “Tabtransformer:
Tabular data modeling using contextual embeddings,” arXiv preprint
arXiv:2012.06678, 2020.
[9] W. Dang, B. Zhou, L. Wei, W. Zhang, Z. Yang, and S. Hu, “Ts-
bert: Time series anomaly detection via pre-training model bert,” in
International Conference on Computational Science. Springer, 2021,
pp. 209–223.
[10] X. Zhang, F. Wei, and M. Zhou, “Hibert: Document level pre-training
of hierarchical bidirectional transformers for document summariza-
tion,” arXiv preprint arXiv:1905.06566, 2019.
[11] R. Pappagari, P. Zelasko, J. Villalba, Y. Carmiel, and N. Dehak,
“Hierarchical transformers for long document classification,” in 2019
IEEE Automatic Speech Recognition and Understanding Workshop
(ASRU). IEEE, 2019, pp. 838–844.
[12] G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff,
“A transformer-based framework for multivariate time series represen-
tation learning,” in Proceedings of the 27th ACM SIGKDD Conference
on Knowledge Discovery & Data Mining, 2021, pp. 2114–2124.
[13] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention
augmented convolutional networks,” in Proceedings of the IEEE/CVF
international conference on computer vision, 2019, pp. 3286–3295.
[14] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic
convolutional and recurrent networks for sequence modeling,” arXiv
[15] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in
Proceedings of the IEEE conference on computer vision and pattern
recognition, 2018, pp. 7132–7141.
[16] J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi, “Gather-excite:
Exploiting feature context in convolutional neural networks,” arXiv
[17] J. Park, S. Woo, J.-Y. Lee, and I. S. Kweon, “Bam: Bottleneck attention
module,” arXiv preprint arXiv:1807.06514, 2018.
[18] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional
block attention module,” in Proceedings of the European conference
on computer vision (ECCV), 2018, pp. 3–19.

Attention Augmented Convolutional Transformer For Tabular Time-Series

Uploaded by

Copyright:

Available Formats

Attention Augmented Convolutional Transformer For Tabular Time-Series

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Attention Augmented Convolutional Transformer For Tabular Time-Series

Uploaded by

Copyright:

Available Formats

Attention Augmented Convolutional Transformer for Tabular

from NLP domain to tabular domain. Another similar very follows:

where Conv denotes convolution and MHA denotes multi-

The proposed architecture is shown in Fig. 2. The archi-

Fig. 4. Our proposed Masking methodology

You might also like