2020_EUSIPCO - Adaptive Algorithms for Tracking Tensor-Train Decomposition of Streaming Tensors
Abstract—Tensor-train (TT) decomposition has been an efficient tool to find low-order approximations of large-scale, high-order tensors. Existing TT decomposition algorithms are either of high computational complexity or operate in batch mode, and hence are quite inefficient for (near) real-time processing. In this paper, we propose a novel adaptive algorithm for TT decomposition of streaming tensors whose slices are serially acquired over time. By leveraging the alternating minimization framework, our estimator minimizes an exponentially weighted least-squares cost function in an efficient way. The proposed method can yield an estimation accuracy very close to the error bound. Numerical experiments show that the proposed algorithm is capable of adaptive TT decomposition with a competitive performance on both synthetic and real data.

Index Terms—Tensor-train decomposition, adaptive algorithms, streaming tensors.

I. INTRODUCTION

Tensor decomposition has been emerging as a new data analytic tool to discover valuable information hidden in datasets [1]. A tensor is a multidimensional array, and it provides a natural representation for multivariate, high-dimensional data. Hence, tensor decomposition has rapidly found many applications in signal processing and machine learning problems [2].

The most well-known and widely used tensor decompositions are the CANDECOMP/PARAFAC (CP) decomposition [3] and the Tucker decomposition [4], which are two extensions of the singular value decomposition (SVD) to tensors. Under these two models, a tensor is factorized into a sequence of factor matrices acting on a reduced-size core tensor [1]. As a result, they offer several advantages for data refinement: e.g., CP decomposition only costs a linear storage complexity w.r.t. the order of the tensor, while Tucker decomposition provides a practical low multilinear-rank approximation for tensors. However, CP decomposition easily becomes ill-conditioned unless additional structural constraints are applied. Tucker decomposition is stable, but its storage complexity grows exponentially with the order of the tensor; this is the "curse of dimensionality". To alleviate their drawbacks, the tensor-train (TT) decomposition, which is a special case of the Hierarchical Tucker decomposition, has been introduced as a good alternative [5], [6].

TT decomposition is a factorization of a tensor into a multilinear product of 3-way tensors. Note that the TT decomposition of a 3-way tensor is a constrained Tucker model called Tucker-2 [7]. TT decomposition provides a memory-saving representation for high-order tensors which costs only O(nIr^2) memory to represent an n-way tensor of size I × I × · · · × I and rank r. Similar to Tucker decomposition, TT decomposition allows decomposing any tensor under a predefined error [5]. Moreover, in the TT model, basic mathematical operations can be performed efficiently to determine the TT decomposition in a stable way; e.g., the TT-SVD algorithm performs a sequence of SVD decompositions on unfolding matrices of the tensor [5], [8]. Although TT decomposition offers an efficient approach to represent high-order tensors, most existing TT algorithms are either of high computational complexity or operate in batch mode, and hence become inefficient for (near) real-time processing. In many applications, data acquisition is a time-varying process where data are serially observed or slowly changing with time. This leads to two essential issues: (i) the size of the tensor grows over time, and (ii) the low-order approximation becomes time-dependent under dynamic sensing of the tensor. Therefore, there is a great interest in developing adaptive (online) TT decomposition algorithms to deal with this scenario.

In the literature, there are several algorithms related to adaptive TT decomposition. Lubich et al. have proposed a dynamical tensor approximation in the TT format to account for time-dependent tensors acquired by dynamical systems [9], [10]. Inspired by the Dirac–Frenkel time-dependent variational principle, a projector-splitting integrator was applied to track the variation of such tensors over time. Note that the time-dependent tensors there are of fixed size, i.e., Lubich's method works in a batch setting. Liu et al. have introduced an incremental TT decomposition algorithm, namely iTTD, for decomposing tensors whose slices increase over time [11]. iTTD consists of two phases: (i) factorizing the newly arriving tensor derived from new observations into TT-cores, and (ii) appending the resulting TT-cores to the past estimated TT-cores to form new, bigger TT-cores. As a result, iTTD can avoid recomputation for past observations, but its storage complexity grows with time. Further, iTTD does not exploit past estimations and treats the newly arriving tensor as a separate individual tensor. Hence, it is essentially sensitive to time variation. iTTD is only suitable for static and small/moderate-size tensors. These disadvantages motivate us to look for a scalable TT algorithm for decomposing streaming time-varying tensors.

II. ADAPTIVE TENSOR-TRAIN DECOMPOSITION

Before introducing the problem statement, we first define some useful tensor operators used in this paper.

[Figure: TT decomposition of a 4-way tensor of size I1 × I2 × I3 × I4 with TT-ranks R1, R2 and R3.]

The mode-(n, 1) contracted product of two tensors A ∈ R^{I1×I2×···×In} and B ∈ R^{J1×J2×···×Jm} with In = J1, written as A ×^1_n B, yields a new tensor C ∈ R^{I1×···×I_{n−1}×J2×···×Jm} such that

C(i1, . . . , i_{n−1}, j2, . . . , jm) = Σ_{in=1}^{In} A(i1, . . . , in) B(in, j2, . . . , jm).

The concatenation of A ∈ R^{I1×I2×···×In} and B ∈ R^{I1×I2×···×I_{n−1}×1}, written as A ⊞ B, yields a new tensor C ∈ R^{I1×I2×···×(In+1)} such that

C(i1, . . . , i_{n−1}, in) = A(i1, . . . , i_{n−1}, in) if in ≤ In, and C(i1, . . . , i_{n−1}, in) = B(i1, . . . , i_{n−1}, 1) if in = In + 1.
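To make the two operators concrete, here is a minimal MATLAB sketch based on reshape-matricization. The function names tt_contract and tt_concat are ours (they do not come from the paper's code), and the sketch assumes MATLAB's native column-major ordering of tensor entries.

function C = tt_contract(A, B)
% Mode-(n,1) contracted product: contracts the last mode of A with the
% first mode of B, i.e. C = A x^1_n B.
szA  = size(A);  szB = size(B);
In   = szA(end);                        % must equal szB(1)
Amat = reshape(A, [], In);              % (I1*...*I_{n-1}) x In
Bmat = reshape(B, In, []);              % In x (J2*...*Jm)
C    = reshape(Amat * Bmat, [szA(1:end-1), szB(2:end)]);
end

function C = tt_concat(A, B)
% Concatenation A [+] B: appends B (whose last dimension equals 1)
% as a new slice of A along its last mode.
C = cat(ndims(A), A, B);
end

For instance, tt_contract(G1, G2) with G1 of size I1 × r1 and G2 of size r1 × I2 × r2 returns the I1 × I2 × r2 partial product used later to build H[t−1].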
Consider a streaming n-way tensor X[t] ∈ R^{I1×I2×···×In[t]} fixing all but the last "time" dimension In[t]. At time t, X[t] is obtained by appending a new slice Xt ∈ R^{I1×I2×···×I_{n−1}} to the previous observation X[t−1] along the time dimension.

The vector gn[t] can be updated by minimizing the t-th summand in (4):

gn[t] = argmin_{gn ∈ R^{r_{n−1}×1}} || Xt − H[t−1] ×^1_n gn ||_F^2.  (6)
Algorithm 1: TT-FOA: First-Order Adaptive Tensor-Train Decomposition

Input:
  + Observations {Xt}_{t=1}^{∞}, Xt ∈ R^{I1×I2×···×I_{n−1}},
  + TT-rank rTT = [r1, r2, . . . , r_{n−1}],
  + Forgetting factor λ.
Output: TT-cores {Gi[t]}_{i=1}^{n}.
Initialization: {Gi[0]}_{i=1}^{n−1} are initialized randomly, {Si[0]}_{i=1}^{n−1} = I.
for t = 1, 2, . . . do
  Step 1: Estimate gn[t]
    H[t−1] = G1[t−1] ×^1_2 · · · ×^1_{n−1} G_{n−1}[t−1]
    H[t−1] = unfolding(H[t−1], [I1 I2 . . . I_{n−1}, r_{n−1}])
    Ω[t] = randsample([1, I1 I2 . . . I_{n−1}])
    x[t] = vec(Xt)
    gn[t] = H_{Ω[t]}[t−1]^{#} x_{Ω[t]}[t]
    ∆[t] = Xt − H[t−1] ×^1_n gn[t]
  Step 2: Update the TT-cores Gk in parallel
    Ak[t−1] = G1[t−1] ×^1_2 · · · ×^1_{k−1} G_{k−1}[t−1]
    Ak[t−1] = unfolding(Ak[t−1], [r_{k−1}, I1 I2 . . . I_{k−1}])
    Bk[t] = G_{k+1}[t−1] ×^1_{k+2} · · · G_{n−1}[t−1] ×^1_n gn[t]
    Bk[t] = unfolding(Bk[t], [rk, I_{k+1} I_{k+2} . . . I_{n−1}])
    Wk[t] = Bk[t] ⊗ Ak[t−1]
    Sk[t] = λ Sk[t−1] + Wk[t] Wk[t]^T
    Vk[t] = Wk[t]^T Sk[t]^{−1}
    ∆k[t] = unfolding(∆[t], [Ik, r_{k−1} rk])
    Gk[t] = Gk[t−1] + ∆k[t] Vk[t]
    Gk[t] = reshape(Gk[t], [r_{k−1}, Ik, rk])
end for

Here, L(·) is a sketching map. Thanks to the Kronecker structure of H[t−1], uniform random sampling can provide a good sketch of H[t−1]. Accordingly, we can select rows of H[t−1], together with the corresponding entries of x[t], at random to form the sketch HΩ[t−1] ∈ R^{|Ω|×r_{n−1}} and the sampled vector xΩ[t] ∈ R^{|Ω|×1}, where Ω denotes the set of sampled rows. Therefore, gn[t] can be efficiently updated by applying the ridge regression method to (9), whose closed form is given by

gn[t] = (HΩ[t−1]^T HΩ[t−1] + ρ I_{r_{n−1}})^{−1} HΩ[t−1]^T xΩ[t].  (10)

As a result, the last TT-core Gn[t] is updated as follows:

Gn[t] = [Gn[t−1], gn[t]].  (11)
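As an illustration, one possible MATLAB realization of this sampled ridge-regression step is sketched below. It assumes the unfolded matrix H of H[t−1] has already been formed (first two lines of Algorithm 1) with rows following MATLAB's column-major vectorization; the sample size and the regularizer rho are illustrative choices of ours, not values prescribed by the paper.

% Sketch of the update (10)-(11) at time t (sample size and rho are illustrative).
% H  : unfolded H[t-1], size (I1*...*I_{n-1}) x r_{n-1}
% Xt : new slice of size I1 x ... x I_{n-1};  Gn : columns g_n[1], ..., g_n[t-1]
x     = Xt(:);                                    % vec(X_t)
m     = size(H, 1);   rn1 = size(H, 2);
Omega = randperm(m, min(m, 10 * rn1)).';          % uniform row sampling
HO    = H(Omega, :);  xO = x(Omega);
rho   = 1e-4;                                     % small ridge regularizer
gn    = (HO' * HO + rho * eye(rn1)) \ (HO' * xO); % eq. (10)
Gn    = [Gn, gn];                                 % eq. (11): append g_n[t]
Delta = Xt - reshape(H * gn, size(Xt));           % residual Delta[t] used in Step 2

In the noiseless case one could drop the ridge term and apply the pseudo-inverse HΩ[t−1]^{#} directly, which is the form used in Algorithm 1.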
B. Estimation of TT-cores

Given the new slice Xt and the past estimations, the k-th TT-core Gk[t] can be estimated by minimizing the matrix representation of the objective function (7), as follows:

Gk[t] = argmin_{Gk ∈ R^{Ik × rk r_{k−1}}} f(Gk) = Σ_{i=1}^{t} λ^{t−i} || Xi^(k) − Gk Wk[i] ||_F^2,  (12)

where Gk[t] is the mode-2 matricization of Gk[t], Xi^(k) is the mode-k matricization of Xi, and Wk[i] = Bk[i] ⊗ Ak[t−1], where ⊗ denotes the Kronecker product and Ak[t−1] and Bk[i] are the unfolding matrices of Ak[t−1] and Bk[i], respectively. The locally optimal Gk[t] can be obtained by setting the first derivative of f(Gk) to zero:

Gk Σ_{i=1}^{t} λ^{t−i} Wk[i] Wk[i]^T = Σ_{i=1}^{t} λ^{t−i} Xi^(k) Wk[i]^T.  (13)

From this, we can obtain Gk[t] in a recursive way as follows. Let us denote Sk[t] = Σ_{i=1}^{t} λ^{t−i} Wk[i] Wk[i]^T and Rk[t] = Σ_{i=1}^{t} λ^{t−i} Xi^(k) Wk[i]^T. The two matrices Rk[t] and Sk[t] can be updated recursively:

Sk[t] = λ Sk[t−1] + Wk[t] Wk[t]^T,  (14)
Rk[t] = λ Rk[t−1] + Xt^(k) Wk[t]^T.  (15)

Therefore, (13) can be rewritten as

Gk Sk[t] = λ Rk[t−1] + Xt^(k) Wk[t]^T
         = λ Gk[t−1] Sk[t−1] + Xt^(k) Wk[t]^T
         = Gk[t−1] Sk[t] + (Xt^(k) − Gk[t−1] Wk[t]) Wk[t]^T.

Let the residual matrix ∆k[t] and the coefficient matrix Vk[t] be

∆k[t] = Xt^(k) − Gk[t−1] Wk[t],  (16)
Vk[t] = Wk[t]^T Sk[t]^{−1}.  (17)

We then obtain a simple rule for updating Gk[t]:

Gk[t] = Gk[t−1] + ∆k[t] Vk[t].  (18)

After that, the TT-core Gk[t] is obtained by reshaping Gk[t] into a 3-way tensor of size r_{k−1} × Ik × rk.
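The recursion (14)-(18) reduces to a few matrix products per core. Below is a minimal MATLAB sketch of one such update, assuming the matricized core Gk, the coefficient matrix Wk = Bk[t] ⊗ Ak[t−1], the mode-k unfolding Xk of the new slice, and Sk from the previous instant are already available; the variable names are ours.

% Sketch of the recursive TT-core update (14)-(18) for one core at time t.
% Gk : I_k x (r_k r_{k-1}) matricized core from time t-1
% Wk : (r_k r_{k-1}) x (product of the remaining dimensions)
% Xk : mode-k unfolding of the new slice X_t;  lambda : forgetting factor
Sk     = lambda * Sk + Wk * Wk';        % eq. (14)
Deltak = Xk - Gk * Wk;                  % eq. (16): residual matrix
Vk     = Wk' / Sk;                      % eq. (17): W_k[t]^T * inv(S_k[t])
Gk     = Gk + Deltak * Vk;              % eq. (18)
Gk3    = reshape(Gk, [rk_1, Ik, rk]);   % fold back into the r_{k-1} x I_k x r_k core

Since each core only needs its own Wk and Sk, the n−1 cores can be updated independently, which is what Step 2 of Algorithm 1 exploits in parallel.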
We also note that, when dealing with large-scale and high-rank tensors (i.e., ri ≈ Ii), TT-FOA can be sped up by using a stochastic approximation. We refer to this method as the stochastic TT-FOA. In particular, the gradient ∇f(Gk) can be approximated by the instantaneous gradient of the last summand of f(Gk). Thus, Sk[t] can be computed by

Sk[t] ≃ Wk[t] Wk[t]^T.  (19)

Accordingly, the matrix Vk[t] in (17) can be derived directly from the right inverse of Wk[t]. As a result, the stochastic TT-FOA not only skips several operations, but also saves a memory storage of O(r_{k−1}^2 rk^2) for storing Sk[t] at time t. However, the stochastic approximation achieves a lower convergence rate than the original TT-FOA; see Fig. 6 for an illustration.
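Under the approximation (19), the same core update simplifies as in the sketch below (again with our variable names); the r^2 × r^2 matrix Sk[t] no longer needs to be formed or stored.

% Sketch of the stochastic TT-FOA core update: (19) plugged into (16)-(18).
Deltak = Xk - Gk * Wk;      % residual matrix, as in (16)
Vk     = pinv(Wk);          % right inverse W_k' * inv(W_k W_k'), cf. (17) with (19)
Gk     = Gk + Deltak * Vk;  % stochastic counterpart of (18)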
C. Computational Complexity and Memory Storage Analysis

For convenience of the analysis, we assume that the fixed dimensions of the tensor are all equal to I while its TT-rank is rTT = [r, r, . . . , r]. In terms of computational complexity, TT-FOA first requires O(|Ω|r^2) flops for computing gn[t] with the randomized LS method at time t. The cost of updating the k-th TT-core Gk[t] comes from matrix-matrix products (plus one inverse of Sk[t]); hence it costs O(I^{n−1} r^2) flops in general, because the matrix Sk[t] is only of size r^2 × r^2, so computing (Sk[t])^{−1} is inexpensive and independent of the tensor dimensions. Therefore, the overall computational complexity is O(I^{n−1} r^2). In terms of memory storage, TT-FOA does not need to save the observed data at each time instant; in total it costs O((n−1)(Ir^2 + r^4)) for storing the n−1 TT-cores and the n−1 matrices Sk[t]. When the stochastic TT-FOA is applied, the memory storage is only O((n−1)Ir^2).

IV. EVALUATIONS

In this section, we conduct several experiments on both synthetic and real data to evaluate the performance of TT-FOA for adaptive TT decomposition. Experiments are implemented on the MATLAB platform and are available online(1).

(1) https://sites.google.com/view/thanhtbt/.

Fig. 3: Effect of the forgetting factor λ on the performance of TT-FOA.
Fig. 4: Effect of the noise level ε on the performance of TT-FOA.
Fig. 5: Effect of the time-varying factor σ on the performance of TT-FOA in the noise-free case.
Fig. 6: Performance of three TT decomposition algorithms in a time-varying scenario: the noise level ε = 10^{−1} and the time-varying factor σ = 10^{−4}.
A. Synthetic Data

We generate streaming 4-way tensors X[t] ∈ R^{I1×I2×I3×I4[t]} with TT-rank vector rTT = [r1, r2, r3] as follows:

Xt = G1[t] ×^1_2 G2[t] ×^1_3 G3[t] ×^1_4 g4[t] + ε N[t],

where the 3-way tensor Xt ∈ R^{I1×I2×I3} is the t-th slice of X[t]; N[t] is a Gaussian noise tensor of the same size as Xt and ε controls the noise level; the last column g4[t] of the TT-core G4[t] is a random vector in R^{r3}; and the TT-cores G1[t], G2[t] and G3[t] are, respectively, of size I1 × r1, r1 × I2 × r2 and r2 × I3 × r3, given by

Gi[t] = (1 − σ) Gi[t−1] + σ Ni[t],

where σ controls the variation of the TT-cores between two consecutive instants, and N1[t] ∈ R^{I1×r1} and Ni[t] ∈ R^{r_{i−1}×Ii×ri} are noise tensors whose entries are i.i.d. Gaussian with zero mean and unit variance.

To measure the estimation accuracy, we use the relative error (RE) metric given by

RE(Xtr, Xes) = ||Xtr − Xes||_F / ||Xtr||_F,

where Xtr (resp. Xes) refers to the true (resp. estimated) tensor.

The choice of the forgetting factor λ plays a central role in how fast TT-FOA converges. Fig. 3 shows the experimental results of applying the algorithm to a static, noise-free tensor of size 10 × 12 × 15 × 500 with TT-rank rTT = [2, 3, 5]. We can see that the relative error is minimized when λ is around 0.7; TT-FOA fails when λ is close to its infimum or supremum. We therefore fix λ = 0.7 in the next experiments.

To study the effect of noise on the performance of our algorithm, we vary the value of the noise level ε and assess its estimation on the same tensor as above. The result is shown in Fig. 4. When we reduce the noise, the relative error (RE) between the ground truth and the estimate decreases gradually and converges towards a steady-state error bound. Note that the convergence rate of the algorithm is not affected by the noise level; only its estimation error is.

We next consider a scenario where the TT-cores change slowly with time and abruptly at the instant t = 300. Fig. 5 shows the performance of TT-FOA applied to the same noise-free tensor versus the time-varying factor σ. In the same manner as with the noise level, TT-FOA's estimation accuracy goes down when σ increases, but it still converges towards a steady-state error. Fig. 6 shows a performance comparison among three TT decomposition algorithms when the noise level ε and the time-varying factor σ are 10^{−1} and 10^{−4}, respectively. The batch algorithm TT-SVD fails in this time-varying scenario, while TT-FOA and its stochastic version successfully track the variation of the tensor over time, yielding an estimation accuracy very close to the error bound (i.e., the steady-state error). The result also indicates that the convergence rate of TT-FOA is faster than that of its stochastic version. This is probably because the convergence rate of the stochastic TT-FOA is limited by its noisy/stochastic approximation of the true gradient.
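For reference, the streaming model used in this subsection can be simulated in a few lines of MATLAB. The sketch below reuses the tt_contract helper sketched in Section II; the tensor size and TT-rank are those of the forgetting-factor experiment, and σ and ε follow the Fig. 6 setting, but any other choices would do.

% Sketch of the synthetic streaming model (size/rank from the lambda experiment,
% sigma and eps_n from the Fig. 6 setting); tt_contract is the sketch from Sec. II.
I = [10 12 15];   r = [2 3 5];   T = 500;
sigma = 1e-4;     eps_n = 1e-1;
G1 = randn(I(1), r(1));          % I1 x r1
G2 = randn(r(1), I(2), r(2));    % r1 x I2 x r2
G3 = randn(r(2), I(3), r(3));    % r2 x I3 x r3
for t = 1:T
    % slowly varying TT-cores: G_i[t] = (1 - sigma) G_i[t-1] + sigma N_i[t]
    G1 = (1 - sigma) * G1 + sigma * randn(size(G1));
    G2 = (1 - sigma) * G2 + sigma * randn(size(G2));
    G3 = (1 - sigma) * G3 + sigma * randn(size(G3));
    g4 = randn(r(3), 1);                                         % new column of G_4[t]
    Xt = tt_contract(tt_contract(tt_contract(G1, G2), G3), g4);  % I1 x I2 x I3 slice
    Xt = Xt + eps_n * randn(size(Xt));                           % additive Gaussian noise
    % feed Xt to the tracker here and record
    % RE = norm(Xtr(:) - Xes(:)) / norm(Xtr(:));
end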
B. Real Data

In order to provide empirical evidence of applying TT-FOA to real data, we use a surveillance video sequence(2) and a functional MRI dataset(3). The video data contains 1546 frames of size 128 × 160, while the fMRI data includes 20 abdominal scans of size 256 × 256 × 14.

(2) http://www.changedetection.net/
(3) https://github.com/colehawkins/

The first task is to track the surveillance video. We compare TT-FOA against two state-of-the-art adaptive CP tensor decompositions, namely PARAFAC-SDT [14] and OLCP [15]. In order to apply these algorithms effectively, the color video frames are converted into grayscale. The CP-rank and the TT-rank are set to 15 and [15, 15], respectively. Moreover, the first 100 video frames are used for training to obtain a good initialization for PARAFAC-SDT and OLCP. The results indicate that TT-FOA outperforms these adaptive CP decompositions, as shown in Fig. 7 and Fig. 8. In particular, PARAFAC-SDT fails to track the video frames, while OLCP achieves a worse estimation accuracy than our algorithm.

Fig. 8: Reconstructed 1345-th frame.

The second task is to demonstrate the effect of the TT-rank rTT on the low-rank approximation of the fMRI tensor. The abdominal scans are treated as tensor slices in the online setting. Results of tracking the low-rank component of the last scan are shown in Fig. 9. The estimated low-rank fMRI scan deviates from its ground truth when the TT-rank decreases, and hence the relative error increases.

V. CONCLUSIONS

In this paper, we proposed an efficient first-order method for tensor-train decomposition in the adaptive scheme. TT-FOA and its stochastic version are able to track the tensor-train representation of streaming tensors from noisy and high-dimensional data with high accuracy, even when the data come from time-dependent observations. The effectiveness of TT-FOA is numerically validated by several experiments on both synthetic and real data.

REFERENCES

[1] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Rev., vol. 51, no. 3, pp. 455–500, 2009.
[2] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, "Tensor decomposition for signal processing and machine learning," IEEE Trans. Signal Process., vol. 65, no. 13, pp. 3551–3582, 2017.
[3] R. A. Harshman, "Foundations of the PARAFAC procedure: Models and conditions for an explanatory multimodal factor analysis," UCLA Work. Papers Phonet., vol. 16, pp. 1–84, 1970.
[4] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM J. Matrix Anal. Appl., vol. 21, no. 4, pp. 1253–1278, 2000.
[5] I. V. Oseledets, "Tensor-train decomposition," SIAM J. Sci. Comput., vol. 33, no. 5, pp. 2295–2317, Sep. 2011.
[6] A. Cichocki, N. Lee, I. Oseledets, A. H. Phan, Q. Zhao, D. P. Mandic et al., "Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions," Found. Trends Mach. Learn., vol. 9, no. 4-5, pp. 249–429, 2016.
[7] L. R. Tucker, "Some mathematical notes on three-mode factor analysis," Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.
[8] Y. Zniyed, R. Boyer, A. de Almeida, and G. Favier, "A TT-based hierarchical framework for decomposing high-order tensors," SIAM J. Sci. Comput., 2020.
[9] C. Lubich, T. Rohwedder, R. Schneider, and B. Vandereycken, "Dynamical approximation by hierarchical Tucker and tensor-train tensors," SIAM J. Matrix Anal. Appl., vol. 34, no. 2, pp. 470–494, 2013.
[10] C. Lubich, B. Vandereycken, and H. Walach, "Time integration of rank-constrained Tucker tensors," SIAM J. Numer. Anal., vol. 56, no. 3, pp. 1273–1290, 2018.
[11] H. Liu, L. T. Yang, Y. Guo, X. Xie, and J. Ma, "An incremental tensor-train decomposition for cyber-physical-social big data," IEEE Trans. Big Data, pp. 1–1, 2018.
[12] T. Minh-Chinh, N. Viet-Dung, N. Linh-Trung, and K. Abed-Meraim, "Adaptive PARAFAC decomposition for third-order tensor completion," in Int. Conf. Com. Elect. IEEE, 2016, pp. 297–301.
[13] M. W. Mahoney et al., "Randomized algorithms for matrices and data," Found. Trends Mach. Learn., vol. 3, no. 2, pp. 123–224, 2011.
[14] D. Nion and N. D. Sidiropoulos, "Adaptive algorithms to track the PARAFAC decomposition of a third-order tensor," IEEE Trans. Signal Process., vol. 57, no. 6, pp. 2299–2310, 2009.
[15] S. Zhou, N. X. Vinh, J. Bailey, Y. Jia, and I. Davidson, "Accelerating online CP decompositions for higher order tensors," in Int. Conf. Know. Disc. Data Min., 2016, pp. 1375–1384.