Deep Dict: Deep Learning-based Lossy Time Series
Compressor for IoT Data
Jinxin Liu1, Petar Djukic2, Michel Kulhandjian1, Burak Kantarci1
1School of Electrical Engineering and Computer Science
University of Ottawa, Ottawa, ON, Canada
2 Nokia Bell Labs
1{jliu367, mkulhand, burak.kantarci}@uottawa.ca, 2petar.djukic@nokia-bell-labs.com
Abstract—We propose Deep Dict, a deep learning-based lossy
time series compressor designed to achieve a high compression
ratio while maintaining decompression error within a predefined
range. Deep Dict incorporates two essential components: the
Bernoulli transformer autoencoder (BTAE) and a distortion
constraint. BTAE extracts Bernoulli representations from time
series data, reducing the size of the representations compared
to conventional autoencoders. The distortion constraint limits
the prediction error of BTAE to the desired range. More-
over, in order to address the limitations of common regression
losses such as L1/L2, we introduce a novel loss function called
quantized entropy loss (QEL). QEL takes into account the
specific characteristics of the problem, enhancing robustness to
outliers and alleviating optimization challenges. Our evaluation
of Deep Dict across ten diverse time series datasets from various
domains reveals that Deep Dict outperforms state-of-the-art lossy compressors in terms of compression ratio by a significant margin of up to 53.66%.
Index Terms—Internet of Things, Machine Learning, Deep
Learning, IoT Data Compression, Lossy Time Series Compressor
I. INTRODUCTION
Internet of Things (IoT) is a paradigm that connects real-
world objects and collects real-time information/data using
various sensors, such as accelerometers, gyroscopes, and tem-
perature sensors [1]. In recent decades, IoT has been widely
used in a variety of applications, including smart healthcare,
smart homes, connected vehicles, and wearable devices [2].
In these applications, massive amounts of time series data
are created, stored, and communicated as a result of the
widespread use of smart devices, industrial processes, IoT
networks, and scientific research [3]. However, transmitting
such a large amount of time series data can be costly in terms
of network bandwidth and storage space [4]; consequently,
many studies focus on compressing time series data with a
high compression ratio [5]. Data compression can be roughly
classified into two categories: lossless and lossy compression
[6]. As lossy compression introduces errors to decompressed
time series, it is important for lossy compressors to contain
error-bound or distortion-constraint mechanisms that strike a balance between compression ratio and decompression
errors [7].
The autoencoder (AE) [8] is one of the important lossy time series compression techniques. An AE encodes time series data into real-valued latent states and decodes them into a prediction.
Compressed latent states are one of the major overheads
of compressed data. This work addresses the potential of encoding time series into Bernoulli-distributed latent states rather than real-valued latent states in order to drastically reduce the size of the latent states and improve the compression ratio. In
addition to AE, prediction-based compressors typically employ
regression losses such as L1 and L2 [9]. Given that outliers
and noise can significantly impact traditional regression loss
functions like L1 and L2, which may not always align effec-
tively with the underlying objective, this research introduces
a novel loss function inspired by the principles of entropy
coding. This approach aims to provide a more accurate and
precise definition of the problem at hand.
This work proposes a new lossy time series data compres-
sion technique, namely Deep Dict, in order to improve the
compression ratio. The contribution of this paper is threefold:
• A new compression framework, called Deep Dict, is proposed for lossy compression of time series data generated from IoT devices such as gyroscopes, and the results demonstrate that Deep Dict achieves a higher compression ratio than state-of-the-art compressors.
• This work proposes a novel Bernoulli transformer-based autoencoder (BTAE) that can effectively reduce the size of latent states and reconstruct IoT time series from the Bernoulli latent states.
• We introduce a novel loss function called quantized entropy loss (QEL), tailored to the specific characteristics of the problem. QEL surpasses conventional regression loss functions like L1 and L2 in terms of compression ratio performance. This loss function is adaptable for use with any prediction-based compression method that employs uniform quantization and an entropy coder.
The rest of the paper is organized as follows. Section II
introduces the state-of-the-art lossy data compressors. Sec-
tion III presents the problem statement, the proposed Deep
Dict framework, and the details of network architecture. The
datasets and comprehensive experiments are described in Sec-
tion IV. Finally, Section V presents conclusions along with
future directions.
II. RELATED WORK
In the majority of IoT applications, due to resource con-
straints on the devices, substantial amounts of data are typically offloaded either to edge nodes or to the cloud for the purposes of analytics or decision support [10].
Fig. 1: Overview of Deep Dict.
Liang et al. [11] propose SZ2, a framework for adaptive prediction-based compression with Lorenzo or linear regression as the predictor. In addition to compressing data with a specified error constraint, SZ2 maximizes the peak signal-to-noise ratio (PSNR) to ensure the integrity of recovered data with a
high compression ratio.
LFZip is a lossy compressor, proposed by Chandak et al.
[9], that utilizes machine learning for prediction, quantiza-
tion, and entropy coding. The auto-regressive predictor of
LFZip has two types: normalized least mean square predictor
(NLMS) and bidirectional GRU for learning nonlinear patterns
in time series. In LFZip, L2 serves as the loss function, and
the maximum absolute error (MAE) is used for distortion
measurement.
Based on SZ2, Zhao et al. [12] offer SZ3, which replaces
linear regression with a dynamic spline interpolation approach.
SZ3 can identify nonlinear patterns, resulting in a high com-
pression ratio and superior data quality (e.g., PSNR, etc.).
Based on the related work review, the state of the art calls
for new lossy compression methods that can remarkably
reduce the size of latent states in comparison to conventional
autoencoder-based compressors. With this objective in mind,
the following section introduces a prediction-quantization-
entropy coder paradigm and presents Deep Dict as an in-
novative lossy compressor designed to improve prediction
capabilities.
III. METHODOLOGY
This section lays out the technical details of the proposed
Deep Dict framework for lossy time series compression.
A. Problem Definition
Time series are described as a collection of time-dependent
data that can be categorized broadly into univariate time
series (UTS) and multivariate time series (MTS). AE-based
compressors encode time series into a latent representation
consisting of floating-point numbers; thus, the size of the latent
representation has a direct effect on the compression ratio.
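To make the impact of the latent representation concrete, the following back-of-the-envelope sketch in Python compares the storage cost of a real-valued latent vector with that of a Bernoulli latent vector of the same width; the latent width used here is an illustrative assumption, not a value from the paper.
```python
# Illustrative comparison of latent-state sizes (the latent width is a hypothetical
# example, not a value reported in the paper).
latent_width = 64                    # number of latent entries (assumed)
float32_bits = latent_width * 32     # real-valued latent: 32 bits per entry
bernoulli_bits = latent_width * 1    # Bernoulli latent: 1 bit per entry

print(float32_bits, bernoulli_bits)  # 2048 vs. 64 bits, i.e., 32x smaller before entropy coding
```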
B. Deep Dict
1) Overview: The two major components of Deep Dict
are the BTAE and distortion constraint. Figure 1 illustrates
an overview of the proposed Deep Dict compressor. Initially,
Deep Dict chunks the original long time series into smaller
time series x by a time window. BTAE encodes x into Bernoulli latent states c and decodes c into the predicted time series x̂. In order to limit the error of the reconstructed time series to a desired range, the residual r = x − x̂ is quantized uniformly to r_q, and an entropy coder compresses r_q into r_encoded in a lossless manner. After compression, c, the decoder, and r_encoded are kept for transmission or storage. During decompression, c is fed into the decoder to recover x̂, and the entropy coder decodes r_encoded back to r_q. Since c contains limited information,
a feed-forward network (FFN) is used as the encoder in this
study. Fig. 2 demonstrates the architecture of the decoder
in detail. Due to the fact that each input time series x is
truncated from a long sequence, the relative position can be
more meaningful than the absolute position. Therefore, we
include relative positional encoding (RPE) in the proposed decoder.
Figure 3 illustrates the details of multihead attention (MHA)
with RPE [13].
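The pipeline above can be summarized with a minimal sketch, assuming hypothetical `encoder`/`decoder` callables and a black-box lossless coder; `entropy_encode`/`entropy_decode` stand in for the libbsc-based coder used in the paper, and all names are illustrative rather than the authors' code.
```python
import numpy as np

def compress_window(x, encoder, decoder, eps, entropy_encode):
    """Sketch of Deep Dict compression for one windowed chunk x (names are assumptions)."""
    c = encoder(x)                             # Bernoulli latent states (0/1 vector)
    x_hat = decoder(c)                         # predicted/reconstructed time series
    r = x - x_hat                              # residual
    r_q = 2 * eps * np.round(r / (2 * eps))    # uniform quantization; |r - r_q| <= eps
    r_encoded = entropy_encode(r_q)            # lossless entropy coding of the quantized residual
    return c, r_encoded                        # c, the decoder, and r_encoded are stored/transmitted

def decompress_window(c, r_encoded, decoder, entropy_decode):
    """Inverse of the sketch above: decode c, then add back the quantized residual."""
    x_hat = decoder(c)
    r_q = entropy_decode(r_encoded)
    return x_hat + r_q                         # reconstruction error is bounded by eps
```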
2) Distortion Constraint: The predicted time series x̂ typically exhibits more than 10% mean absolute percentage error (MAPE). Although BTAE alone could constrain the distortion to a small range, it would require more parameters. Thus, with a limited number of parameters in BTAE, the distortion constraint is used to reduce the distortion to the desired range. As depicted in Fig. 4, r is quantized uniformly to r_q as follows: r_q = 2ε × round(r/2ε), where ε is the desired maximum absolute error. To avoid floating-point overflow, r_q is stored in 64-bit format. In this work, we utilize adaptive quantized local frequency coding, powered by libbsc, a renowned lossless compression library [14], as the entropy coder to encode r_q into r_encoded.
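As a quick numerical check of the error bound (a NumPy sketch on assumed toy data, not the paper's code), the quantizer r_q = 2ε × round(r/2ε) guarantees that the reconstruction x̂ + r_q stays within ε of the original signal:
```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01                                      # desired maximum absolute error
x = rng.normal(size=1000)                       # toy signal (illustrative)
x_hat = x + rng.normal(scale=0.1, size=1000)    # imperfect prediction from the decoder

r = x - x_hat                                   # residual
r_q = (2 * eps * np.round(r / (2 * eps))).astype(np.float64)  # 64-bit storage, as in the paper
x_rec = x_hat + r_q                             # reconstruction after adding the residual back

assert np.max(np.abs(x - x_rec)) <= eps + 1e-12  # reconstruction error never exceeds the bound
```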
3) Quantized Entropy Loss (QEL): We introduce a new loss function, QEL, so as to minimize the size of r_encoded. Entropy coders with high compression ratios tend to approach the limit
|r_{encoded}| \geq -\sum_{j=0}^{|S|} n(s_j) \log p(s_j),    (1)
where s_j denotes the j-th unique value of r_q, n(s_j) counts the number of times s_j appears in r_q, and p(s_j) is the probability of s_j appearing in r_q. In
light of these, we formulate the minimization problem of the
objective function as
\min_{r} H(r) = -\sum_{j=0}^{|S|} p(s_j) \log p(s_j).    (2)
The minimization therefore proceeds in two passes: in the forward pass, QEL computes the entropy of the time series; in the backpropagation pass, the gradient of H can be written as
\frac{\partial H}{\partial r_i} = \lim_{b \to \infty} \sum_{j=0}^{|S|} [1 + \ln p(s_j)] \times R(r_i - s_j),    (3)
where
R(r_i - s_j) = \frac{b}{|r|\,\epsilon^b} \cdot \frac{(r_i - s_j)^{b-1}}{\left[ \frac{(r_i - s_j)^b}{\epsilon^b} + 1 \right]^2}.    (4)
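A hedged PyTorch sketch of how QEL can be computed in practice is given below: the hard counts n(s_j) are replaced by the soft membership kernel implied by Eqs. (3)-(4), namely 1/((r_i − s_j)^b/ε^b + 1), so the entropy estimate stays differentiable and autograd yields a gradient of the form above for finite b. The choice of bin centres and the normalization are our assumptions rather than the authors' exact implementation.
```python
import torch

def quantized_entropy_loss(r, eps, b=10):
    """Differentiable sketch of QEL. Assumptions: bin centres s_j come from a detached
    uniform quantization of r; soft counts use the kernel implied by Eq. (4)."""
    with torch.no_grad():
        s = torch.unique(2 * eps * torch.round(r / (2 * eps)))  # unique quantized symbols s_j
    diff = (r.reshape(-1, 1) - s.reshape(1, -1)) / eps          # (|r|, |S|) scaled differences
    soft = 1.0 / (diff.abs().pow(b) + 1.0)                      # soft membership of r_i in bin s_j
    p = soft.sum(dim=0) / r.numel()                             # soft estimate of p(s_j)
    return -(p * torch.log(p + 1e-12)).sum()                    # entropy H(r) to be minimized

# Usage sketch: r = x - x_hat must keep its autograd graph so the loss backpropagates
# into the BTAE parameters, e.g. loss = quantized_entropy_loss(r, eps=0.01).
```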

Fig. 2: Detailed Architecture of the Decoder of Deep Dict.
Fig. 3: Detailed Architecture of Multihead Attention with RPE.
Fig. 4: Intuitive Example of Uniform Quantization.
IV. EXPERIMENTAL RESULTS
BTAE’s encoder has 3 layers with 64 hidden states. The
FFN that is used for augmenting c has 1 layer with 64 hidden
states. The decoder of BTAE has two layers and all feed-
forward layers inside the decoder have 64 hidden states. The
hyperparameter d_model of the transformer encoder is set to
32, and the number of heads of MHA is 8. The FFN used
for projecting output time series has 1 layer with 64 hidden
states. We apply the Gaussian Error Linear Unit (GeLU) as
the activation function. The following three loss functions are
used: L1, L2, and QEL. The other hyperparameter b (for QEL)
is set to 10 as default. The batch size is set to 64. Adam
optimizer is used with a learning rate of 0.0001, weight decay
of 0.01, β1 of 0.9, and β2 of 0.999. The model is trained by
using PyTorch 1.11 on an NVIDIA GeForce RTX 3070 and an Intel
Xeon W-2295.
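For orientation, the reported hyperparameters could be wired together roughly as in the PyTorch sketch below; the exact layer layout, the straight-through Bernoulli sampling, and the window/latent sizes are our assumptions, and the paper's RPE-augmented attention is omitted, so this is an illustration rather than the authors' implementation.
```python
import torch
import torch.nn as nn

class BTAESketch(nn.Module):
    """Rough sketch of a BTAE-style model with the reported hyperparameters
    (d_model=32, 8 heads, GELU, 64 hidden units); the layer layout is assumed."""
    def __init__(self, window=256, latent_bits=64, d_model=32, n_heads=8):
        super().__init__()
        self.encoder = nn.Sequential(                 # 3-layer FFN encoder, 64 hidden states
            nn.Linear(window, 64), nn.GELU(),
            nn.Linear(64, 64), nn.GELU(),
            nn.Linear(64, latent_bits), nn.Sigmoid()  # Bernoulli probabilities for c
        )
        self.expand = nn.Linear(latent_bits, d_model)  # FFN augmenting c (shape assumed)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=64,
                                           activation="gelu", batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)   # 2 transformer layers
        self.project = nn.Linear(d_model, window)      # FFN projecting back to the time window

    def forward(self, x):
        probs = self.encoder(x)
        # Straight-through Bernoulli sampling (our assumption; the estimator is not
        # detailed here): forward pass uses hard samples, backward pass uses probs.
        c = probs + (torch.bernoulli(probs) - probs).detach()
        h = self.decoder(self.expand(c).unsqueeze(1))   # (batch, 1, d_model) token sequence
        return self.project(h.squeeze(1)), c

model = BTAESketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01, betas=(0.9, 0.999))
```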
Fig. 5: Comparison of L1, L2/MSE, and QEL under bar crawl
univariate dataset.
A. Numerical Results
This section discusses Deep Dict’s performance under a
variety of time series datasets.
1) Results on Univariate Datasets: To evaluate the performance of Deep Dict, the following lossy time series compressors are used as baselines: Critical Aperture (CA)¹, SZ2², LFZip³, and SZ3⁴. CA is an industrially well-received compressor that is computationally simple and efficient.
Table I compares the proposed method (RPE is not used by
default) to the baselines under the datasets that are ordered
with respect to the length of their time series. Under seven
out of ten datasets, the proposed method outperforms the
state-of-the-art algorithms. Due to the overhead of the BTAE and the codes, Deep Dict performs similarly to the baselines on small
datasets; however, under large datasets, Deep Dict outperforms
the baselines by at most 53.66%.
As depicted in Fig. 5, when the bar crawl dataset is considered as a representative example, because L1 and L2 are not specifically designed to reduce the size of r_encoded, the L1 and L2 losses can result in an increase of |r_encoded| during the training process; however, QEL can handle such situations and increase the compression ratio.
1https://github.com/shubhamchandak94/LFZip/blob/master/src/ca_compress.py
2https://github.com/szcompressor/SZ
3https://github.com/shubhamchandak94/LFZip
4https://github.com/szcompressor/SZ3

TABLE I: Compression ratio compared to the best baseline methods.
Dataset    | Length   | CA     | SZ     | LFZip | SZ3    | DeepDict(L1) | Imp.    | DeepDict(QEL) | Imp.    | DeepDict(RPE) | Imp.
dna        | 1167877  | 4.86   | 8.62   | 8.40  | 7.78   | 8.58         | -0.46%  | 8.09          | -6.15%  | 8.36          | -3.02%
pow        | 2049280  | 12.47  | 23.99  | 17.98 | 24.21  | 23.92        | -1.20%  | 23.98         | -0.95%  | 24.00         | -0.87%
watch gyr  | 3205431  | 10.75  | 24.79  | 28.77 | 26.85  | 27.10        | -5.80%  | 24.68         | -14.22% | 25.63         | -10.91%
watch acc  | 3540962  | 5.19   | 11.00  | 12.71 | 10.78  | 13.24        | 4.17%   | 12.74         | 0.24%   | 12.87         | 1.26%
phones acc | 13062475 | 7.13   | 12.63  | 14.12 | 12.64  | 15.95        | 12.96%  | 15.84         | 12.18%  | 15.89         | 12.54%
phones gyr | 13932632 | 27.32  | 52.00  | 46.28 | 55.11  | 56.44        | 2.41%   | 55.46         | 0.64%   | 52.66         | -4.45%
bar crawl  | 14057567 | 4.99   | 18.09  | 19.14 | 18.39  | 27.57        | 44.04%  | 29.41         | 53.66%  | 29.78         | 55.59%
soybeans   | 22824499 | 3.43   | 14.35  | 15.61 | 13.50  | 17.04        | 9.16%   | 18.32         | 17.36%  | 18.45         | 18.19%
synthetic  | 43000000 | 113.07 | 119.00 | 58.35 | 127.23 | 125.16       | -1.63%  | 154.65        | 21.55%  | 149.54        | 17.54%
ppg ecg    | 90642300 | 46.01  | 69.20  | 45.47 | 71.42  | 90.31        | 26.45%  | 95.31         | 33.45%  | 97.28         | 36.21%
Fig. 6: Comparison among different architectures of the decoder on univariate datasets.
TABLE II: Compression ratio comparison between univariate and multivariate modes.
Dataset    | Shape          | L1 Uni. | L1 Mul. | L1 Imp. | QEL Uni.    | QEL Mul.    | QEL Imp.     | RPE Uni.    | RPE Mul.    | RPE Imp.
watch gyr  | 3,205,431 × 3  | 31.14   | 28.83   | -7.42%  | 31.64       | 28.52       | -9.86%       | 32.93       | 28.72       | -12.78%
watch acc  | 3,540,962 × 3  | 13.06   | 11.28   | -13.63% | 13.53       | 11.48       | -15.15%      | 13.57       | 10.79       | -20.49%
phone acc  | 13,062,475 × 3 | 11.34   | 13.15   | 15.96%  | 11.77       | 13.55       | 15.12%       | 11.8        | 13.89       | 17.71%
phone gyr  | 13,932,632 × 3 | 37.28   | 47.09   | 26.31%  | 42.85       | 47.81       | 11.58%       | 42.1        | 46.9        | 11.40%
bar crawl  | 14,057,567 × 3 | 28.08   | 29.1    | 3.63%   | 22.56 (b=3) | 28.57 (b=3) | 26.64% (b=3) | 23.65 (b=3) | 28.85 (b=3) | 21.99% (b=3)
synthetic  | 43,000,000 × 5 | 24.42   | 42.68   | 74.77%  | 15.9        | 43.5        | 173.58%      | 28.17       | 44.31       | 57.29%
We further enable RPE (using the QEL loss) in Deep Dict. The results indicate that RPE can improve the performance of Deep Dict on eight out of ten datasets.
As demonstrated in Fig. 1, the decoder of BTAE can be replaced by other network architectures designed for time series data, such as an FFN, LSTM, or GRU. In order to illustrate the effectiveness of the proposed decoder, we compare the various decoder designs in Fig. 6. Similar to the typical decoder of an RNN-based autoencoder, the auto-regressive approach is employed for the LSTM and GRU, with c serving as the initial input time step. Under six out of ten datasets, the results indicate
that LSTM performs better than FFN and GRU. Our proposed
decoder outperforms the other network architectures under
nine out of ten datasets.
2) Results on Multivariate Datasets: Table II compares the
performance of the univariate mode (i.e., flattening the MTS
prior to feeding it into Deep Dict) and the multivariate mode. Comparing L1 and QEL, QEL outperforms L1 on all datasets except the bar crawl dataset. When RPE is leveraged (with QEL as the loss function), it is able to improve the performance of both the univariate and multivariate modes.
3) Transferability: As shown in Table IIIa, the compression
ratio of Deep Dict with transfer learning (Deep Dict + TL)
reduces by less than 5% under 7 out of 10 univariate datasets
when compared to training a model from scratch. Under five
out of seven univariate datasets (where Deep Dict outperforms
the best baseline), Deep Dict + TL continues to outperform
the best baseline. Table IIIb shows the comparative results
between NTL and TL for multivariate datasets. Under five out
of six multivariate datasets, Deep Dict + TL decreases the
compression ratio by up to 9.22%.

TABLE III: Transferability: comparison between TL and NTL in terms of compression ratio.
(a) Univariate datasets.
Dataset    | Length     | NTL    | TL     | Imp.    | TL+RPE | Imp.
dna        | 1,167,877  | 8.09   | 8.01   | -0.99%  | 8.02   | -0.87%
pow        | 2,049,280  | 23.98  | 21.84  | -8.92%  | 22.01  | -8.22%
watch gyr  | 3,205,431  | 24.68  | 24.01  | -2.71%  | 24.19  | -1.99%
watch acc  | 3,540,962  | 12.74  | 12.78  | 0.31%   | 12.73  | -0.08%
phones acc | 13,062,475 | 15.84  | 14.87  | -6.12%  | 14.83  | -6.38%
phones gyr | 13,932,632 | 55.46  | 53.43  | -3.66%  | 53.76  | -3.07%
bar crawl  | 14,057,567 | 29.41  | 28.54  | -2.96%  | 28.71  | -2.38%
soybeans   | 22,824,499 | 18.32  | 17.59  | -3.98%  | 17.85  | -2.57%
synthetic  | 43,000,000 | 154.65 | 119.97 | -22.42% | 120    | -22.41%
ppg ecg    | 90,642,300 | 95.31  | 94.15  | -1.22%  | 96.67  | 1.43%
(b) Multivariate datasets.
Dataset    | Shape          | NTL   | TL    | Imp.    | TL+RPE | Imp.
watch gyr  | 3,205,431 × 3  | 28.52 | 21.79 | -23.60% | 27.19  | -4.66%
watch acc  | 3,540,962 × 3  | 11.48 | 10.92 | -4.88%  | 10.71  | -6.71%
phone acc  | 13,062,475 × 3 | 13.55 | 12.31 | -9.15%  | 12.52  | -7.60%
phone gyr  | 13,932,632 × 3 | 47.81 | 43.4  | -9.22%  | 46.31  | -3.14%
bar crawl  | 14,057,567 × 3 | 28.57 | 28.36 | -0.74%  | 30.58  | 7.04%
synthetic  | 43,000,000 × 5 | 43.5  | 44.84 | 3.08%   | 44.13  | 1.45%
Fig. 7: The effect of b on compression ratio.
4) Empirical Study: As shown in Fig. 7, the compression
ratio increases with b. When b > 6, QEL performs better than
L1 loss.
Since QEL can only minimize the entropy of each batch
for each backpropagation, batch size is one of the critical
hyperparameters for QEL. Figure 8 indicates that Deep Dict
achieves the highest compression ratio when the batch size is
64, and that compression ratio is steady for batch sizes greater
than 64. As seen in Fig. 9, Deep Dict performs effectively with a small window; however, its compression ratio decreases rapidly under a large window.
Fig. 8: The effect of batch size on compression ratio.
Fig. 9: The effect of window size on compression ratio.
Fig. 10: The change of compression ratio with the number of dimensions of the data.
Previous results indicate that Deep Dict outperforms the
baselines under large time series datasets. Figure 10 illustrates
the effect of the dimensionality on compression ratio (with the
same hyperparameters). As the network size is kept fixed,
Deep Dict’s compression ratio is limited by the number of
parameters. There are two ways to increase the number of
parameters: stacking more layers and expanding the network.
As shown in Fig. 11, stacking more transformer encoders does
not result in a significant improvement; rather, as the number
of layers increases, the compression ratio decreases because
of the increase in the decoder size. On the other hand, Fig. 12 demonstrates that the compression ratio can be improved with a large d_model (one of the Transformer hyperparameters). It is worth noting that a large d_model is not suitable for small datasets, since it notably increases the number of parameters in BTAE.
Fig. 11: The effect of the number of layers on compression ratio.
Fig. 12: The effect of d_model on compression ratio.
Fig. 13: The effect of |c| on compression ratio.
Figure 13 depicts the variation of
compression ratio under varying Bernoulli latent states (|c|).
Increasing |c| can considerably enhance Deep Dict's performance on large datasets, although, similar to d_model, a large |c| also increases the number of parameters. In summary, increasing b, d_model, and |c| can further enhance performance on long time series.
V. CONCLUSION
We propose a Bernoulli transformer autoencoder-based lossy time series compressor, namely Deep Dict, to improve the compression ratio by learning the Bernoulli representation
of time series. We have substituted the conventional regression
loss with a novel loss function, quantized entropy loss (QEL),
which further improves the compression ratio and reduces the
difficulty of optimization. Deep Dict outperforms state-of-the-art time series compressors on 7 out of 10 datasets, particularly on the lengthy time series datasets. We have shown that Deep Dict can outperform the best baseline by up to 53.66%. The proposed loss function, QEL, boosts the compression ratio more than traditional regression losses such as L1 and L2. The transferability experiments demonstrate that Deep Dict training can be accelerated by transfer learning without sacrificing much compression ratio (less than 5%). Moreover, RPE can improve
Deep Dict’s transferability on multivariate datasets. When
multivariate and univariate modes are compared, the results
indicate that Deep Dict’s multivariate mode performs better
under larger multivariate time series. In our future work,
we are focusing on utilizing neural network quantization to
reduce the size of the model further. Furthermore, as various
data sizes and types have different hyperparameters, selecting
hyperparameters automatically based on the data is also on
our agenda.
ACKNOWLEDGMENT
This work was supported in part by the Natural Sciences
and Engineering Research Council of Canada (NSERC) un-
der Grant RGPIN/2017-04032. Petar Djukic was with Ciena
(Kanata, ON, Canada) when this work was done.
REFERENCES
[1] M. Stoyanova, Y. Nikoloudakis, S. Panagiotakis, E. Pallis, and E. K.
Markakis, “A Survey on the Internet of Things (IoT) Forensics: Chal-
lenges, Approaches, and Open Issues,” IEEE Communications Surveys
& Tutorials, vol. 22, no. 2, pp. 1191–1221, 2020.
[2] A. Nauman, Y. A. Qadri, M. Amjad, Y. B. Zikria, M. K. Afzal, and S. W.
Kim, “Multimedia Internet of Things: A Comprehensive Survey,” IEEE
Access, vol. 8, pp. 8202–8250, 2020.
[3] T. Wong and Z. Luo, “Recurrent Auto-Encoder Model for Large-
Scale Industrial Sensor Signal Analysis,” in Engineering Applications
of Neural Networks, ser. Communications in Computer and Information
Science, E. Pimenidis and C. Jayne, Eds. Cham: Springer International
Publishing, 2018, pp. 203–216.
[4] T. Buddhika, M. Malensek, S. Pallickara, and S. L. Pallickara, “Living
on the edge: Data transmission, storage, and analytics in continuous
sensing environments,” ACM Trans. Internet Things, vol. 2/3, jul 2021.
[5] S. K. Jensen, T. B. Pedersen, and C. Thomsen, “Time Series Manage-
ment Systems: A Survey,” IEEE Transactions on Knowledge and Data
Engineering, vol. 29, no. 11, pp. 2581–2600, Nov. 2017.
[6] G. Chiarot and C. Silvestri, “Time series compression: a survey,”
Jan. 2021, arXiv:2101.08784 [cs]. [Online].
Available: http://arxiv.org/abs/2101.08784
[7] S. Jin, S. Di, X. Liang, J. Tian, D. Tao, and F. Cappello, “DeepSZ:
A Novel Framework to Compress Deep Neural Networks by Using
Error-Bounded Lossy Compression,” in 28th International Symposium
on High-Performance Parallel and Distributed Computing, ser. HPDC
’19. New York, NY, USA: ACM, 2019, pp. 159–170.
[8] D. Bank, N. Koenigstein, and R. Giryes, “Autoencoders,” Apr. 2021,
arXiv:2003.05991 [cs, stat]. [Online]. Available: http://arxiv.org/abs/
2003.05991
[9] S. Chandak, K. Tatwawadi, C. Wen, L. Wang, J. Aparicio Ojea, and
T. Weissman, “LFZip: Lossy Compression of Multivariate Floating-Point
Time Series Data via Improved Prediction,” in 2020 Data Compression
Conference (DCC), Mar. 2020, pp. 342–351, ISSN: 2375-0359.
[10] A. Kumar, Z. Wang, and A. Srivastava, “A novel approach for clas-
sification in resource-constrained environments,” ACM Trans. Internet
Things, vol. 3/4, sep 2022.
[11] X. Liang, S. Di, D. Tao, S. Li, S. Li, H. Guo, Z. Chen, and F. Cappello,
“Error-Controlled Lossy Compression Optimized for High Compression
Ratios of Scientific Datasets,” in 2018 IEEE International Conference
on Big Data (Big Data), 2018, pp. 438–447.
[12] K. Zhao, S. Di, M. Dmitriev, T.-L. D. Tonellot, Z. Chen, and F. Cappello,
“Optimizing Error-Bounded Lossy Compression for Scientific Data
by Dynamic Spline Interpolation,” in 2021 IEEE 37th International
Conference on Data Engineering (ICDE), Apr. 2021, pp. 1643–1654,
ISSN: 2375-026X.
[13] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon,
C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck,
“Music Transformer,” Dec. 2018, arXiv:1809.04281 [cs, eess, stat].
[Online]. Available: http://arxiv.org/abs/1809.04281
[14] I. Grebnov, “IlyaGrebnov/libbsc,” May 2022. [Online]. Available: https://github.com/IlyaGrebnov/libbsc