Fast Nonnegative Tensor Factorizations with Tensor Train Model
© Pleiades Publishing, Ltd., 2022.
(Submitted by V. V. Voevodin)
1 Lomonosov Moscow State University, Moscow, 119991 Russia
2 Marchuk Institute of Numerical Mathematics of the Russian Academy of Sciences, Moscow, 119333 Russia
*E-mail: lena19592@mail.ru
**E-mail: eugene.tyrtyshnikov@gmail.com
Received February 24, 2022; revised March 13, 2022; accepted March 16, 2022
Abstract—The tensor train model is a low-rank approximation for multidimensional data. In this article
we demonstrate how it can be successfully used for fast computation of nonnegative tensor train,
nonnegative canonical and nonnegative Tucker factorizations. The proposed approaches can be
incorporated into a wide range of methods for solving big data problems.
DOI: 10.1134/S1995080222070228
Keywords and phrases: nonnegative tensor factorization, tensor train, Tucker decomposition,
canonical decomposition.
1. INTRODUCTION
In a time of exponential data growth, higher-order generalizations of matrices, referred to as tensors,
are successfully applied in many different fields. However, as the number of dimensions increases and
reaches several tens, it becomes close to impossible to handle the computations due to the “curse of
dimensionality”. Tensor decompositions are used to solve this problem.
The Canonical Polyadic Decomposition (CPD/CANDECOMP) is one of the most famous methods for
tensor factorization. This decomposition was first studied by Hitchcock in 1927 [1] and was later known
as parallel factor analysis (PARAFAC), a tool for chemometric analysis popularized by Harshman [2],
Carroll and Chang [3], and Kruskal [4]. The CP representation is not only interpretable; its dimensionality
reduction is also better than that of other factorizations. However, despite these attractive properties, it is still
susceptible to the “curse of dimensionality”.
Until recently, the cheapest algorithm for CPD using alternating least squares updates for a tensor of
size n1 × . . . × nd was considered to have O(dR ∏_{i=1}^{d} ni) complexity, where R is the rank of the
decomposition. Recently, however, a method based on a prior tensor train decomposition was introduced,
which allows the computation time to be reduced significantly [6].
However, in practice we often have to deal with nonnegative multidimensional data. It is then natural
to wish for a tensor decomposition with the same property, and thus the nonnegative CPD (NCPD) model
was introduced [5]. Just like in the unconstrained model, standard algorithms suffer from the “curse of
dimensionality”. A logical idea is to try using the tensor train decomposition, as was done in [6]. To the
best of our knowledge, there was only one such attempt, and it was unsuccessful because of the high
relative error [9].
In this article we suggest a highly useful approach for speeding up existing NCPD algorithms,
reducing the initial complexity of O(dRN^d), N = maxi(ni), to just O(dNR^3).
Moreover, the same ideas can be applied to constructing a Nonnegative Tucker Decomposition (NTD).
The Tucker decomposition model assumes that an input multiway array is decomposed into a core
tensor and a set of low-rank matrices, which allows capturing the multilinear features associated
with all the modes of the input tensor. The baseline model was proposed by L. R. Tucker in 1966
[22] as a multilinear extension of principal component analysis (PCA). For dealing with nonnegative
data, NTD was proposed, where nonnegativity constraints are imposed on the core tensor and all the
factor matrices. This model has been used in multiple applications, including image classification [23],
audio pattern extraction [24], clustering [25], etc. In Section 4 we discuss how to construct an NTD
approximation faster using the tensor train.
As most of the proposed methods are based on the nonnegative tensor train (TT) factorization, it is
important to develop effective methods for its construction. By definition, a nonnegative TT (NTT)
requires the elements of each core to be nonnegative. This is done for the interpretability of the results and
so that all elements of the approximation are guaranteed to be nonnegative. But in practice this constraint
is often excessive. In Section 2 we suggest a novel method for building an NTT.
In Section 5 some numerical experiments are discussed.
2. NONNEGATIVE TENSOR TRAIN FACTORIZATION
In the tensor train (TT) model the elements of a tensor are represented as
y(i1, . . . , id) = Σ_{α1,...,αd−1} G1(i1, α1) G2(α1, i2, α2) . . . Gd−1(αd−2, id−1, αd−1) Gd(αd−1, id)
with tensor carriages G1, . . . , Gd, where any two neighbors have a common summation index. The
summation indices αk run from 1 to rk and are called auxiliary indices. The quantities rk are referred to
as tensor train (TT) ranks. The tensor carriages G2, . . . , Gd−1 ∈ R+^{rk−1 × nk × rk}, while G1 and Gd are
matrices. It is sometimes easier to assume that G1, Gd are not two-dimensional but three-dimensional,
with additional auxiliary indices α0 = αd = 1 and tensor train ranks r0 = rd = 1. This assumption helps to
simplify some algorithms.
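For readers who prefer code, here is a minimal numpy sketch of this representation; the helper name tt_full and the core shapes (rk−1, nk, rk) with r0 = rd = 1 follow the convention above, while the toy sizes are arbitrary and the sketch is only an illustration, not part of the proposed algorithms.

```python
import numpy as np

def tt_full(cores):
    """Contract TT carriages of shape (r_{k-1}, n_k, r_k), r_0 = r_d = 1,
    into the full tensor y(i_1, ..., i_d)."""
    res = cores[0]
    for core in cores[1:]:
        res = np.tensordot(res, core, axes=(-1, 0))   # sum over the shared alpha_k
    return res[0, ..., 0]

# random nonnegative carriages with TT ranks (1, 2, 3, 1)
rng = np.random.default_rng(0)
cores = [rng.random((1, 4, 2)), rng.random((2, 5, 3)), rng.random((3, 6, 1))]
y = tt_full(cores)
print(y.shape)                                        # (4, 5, 6)

# spot-check one entry against the sum over alpha_1, alpha_2 in the formula
i = (1, 2, 3)
direct = sum(cores[0][0, i[0], a1] * cores[1][a1, i[1], a2] * cores[2][a2, i[2], 0]
             for a1 in range(2) for a2 in range(3))
print(np.isclose(y[i], direct))                       # True
```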
For nonnegative tensor decompositions it is standard to require the core matrices or tensors to have only
nonnegative elements. There are two main reasons. Intuitively, the parts of a dataset represented by the
factors are thought to combine additively in the general case, and in many applications the quantities
involved should be nonnegative; otherwise it is difficult to interpret the results of a decomposition.
Another reason is much more practical: after imposing such a constraint we are guaranteed to get a
nonnegative approximation.
However, there are cases when nonnegativity of the factors is not necessary, as we are only interested in
the elements of the resulting approximation. Tensor decomposition is an instrument that allows one to
operate with large amounts of data efficiently. Many algorithms can use such representations in order to
speed up calculations, and there are examples when nonnegativity of the tensor decomposition is crucial
for the convergence of the method. So when the main goal is to get all the advantages of tensor factorization,
such as lower memory cost and faster computations, and simultaneously keep the property of the original
tensor, we can consider the following problem.
Given a nonnegative tensor, build a TT decomposition with nonnegative entries,
y(i1, . . . , id) = Σ_{α1,...,αd−1} G1(i1, α1) G2(α1, i2, α2) . . . Gd−1(αd−2, id−1, αd−1) Gd(αd−1, id) ≥ 0,
where the tensor carriages G1, . . . , Gd ∈ R^{rk−1 × nk × rk}. Because we discard the constraints on factor
nonnegativity, we can expect there to be an algorithm with lower complexity or higher accuracy.
In this article we propose a new approach to nonnegative tensor train factorization. The idea follows
from a simple rationale: when we construct a tensor approximation with prescribed accuracy, we expect
its elements to be close to the original, and if the initial data is nonnegative it is natural to assume that the
majority of the elements of the tensor train model will also keep this property. Then, when the number of
elements that turn negative is small, we can find and correct them.
So first of all we need a method to find negative elements in the tensor train approximation. We will
start by looking for the element with maximum absolute value. Of course, item-by-item examination
is unacceptable, as it has O(N^d) complexity, where N = maxi(ni). In the case of a diagonal matrix A, its
maximum in modulus value is simultaneously its largest (in absolute value) eigenvalue. One of the most
well known and simple methods for this task is the power method. In the algorithm we start with some
initial tensor and repeatedly multiply it elementwise (Hadamard product) by the given tensor, normalizing
and rounding the TT ranks of the result at each step.
If the tensor has only one dominant eigenvalue, then in exact arithmetic the method should converge to a
tensor with ranks equal to 1. One can use the Rayleigh quotient in order to get the associated eigenvalue.
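Below is a dense numpy stand-in for this search, kept deliberately small; in the actual method the iterate is stored in TT format and each Hadamard product is followed by TT rounding, which the sketch omits. The function and variable names are ours, not the paper's.

```python
import numpy as np

def max_abs_entry_power_method(y, iters=3000, seed=0):
    """Locate the entry of largest magnitude of a (small, dense) tensor y.

    Treating y as the diagonal of a matrix, one power step is an elementwise
    (Hadamard) product with y followed by normalization; in TT arithmetic the
    same step is a Hadamard product of tensor trains plus TT rounding.
    Convergence is slow when the two largest magnitudes are close.
    """
    rng = np.random.default_rng(seed)
    x = rng.random(y.shape) + 0.1              # positive starting tensor
    for _ in range(iters):
        x = y * x                              # "matvec" with diag(y)
        x /= np.linalg.norm(x)
    idx = np.unravel_index(np.argmax(np.abs(x)), y.shape)
    value = np.sum(y * x * x) / np.sum(x * x)  # Rayleigh quotient, ~ y[idx]
    return idx, value

# toy 4-way tensor with a few slightly negative entries
y = np.random.default_rng(1).random((4, 4, 4, 4)) - 0.05
idx, val = max_abs_entry_power_method(y)
print(idx, val, y[idx], np.abs(y).max())       # val and y[idx] match the maximum
```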
It has to be noted that besides eigenvalue algorithms there are other approaches to search for the
minimum element of a tensor train. For example, in [12] there is a routine which computes statistics such
as the largest/smallest in magnitude entries, or the entries having the largest/smallest real parts, of a TT
tensor via maxvol cross. It proved to be very competitive with the power method in numerical experiments.
In the future we plan to test other eigenvalue algorithms (for example, the inverse power method, Rayleigh
quotient iteration, etc.) and compare their performance. Of course, each method has its own pros and
cons. For instance, inverse iteration demonstrates fast convergence, but only if optimal shifts are
chosen, and at each iteration we have to solve a linear system.
But finding the minimum entry y(k1, . . . , kd) of the TT is only part of the solution. Our main goal
is to get a tensor train approximation with only nonnegative values. Of course, if the minimum is bigger
than zero, then we do not have to act any further, as we already have our answer. But usually this is not
the case. When we know which element is negative, we can forcefully set it to the desired value
by adding a rank-one tensor. This tensor C will have all zero elements except one: c(k1, . . . , kd) =
y(k1, . . . , kd) − ŷ(k1, . . . , kd), the difference between the original (desired) value and the current value
ŷ(k1, . . . , kd) of the approximation. After the element correction we can repeat the process: find the minimum
value of the corrected TT and fix it if it is negative. In the end we will construct a TT with only nonnegative
values. However, this approach has a very serious flaw: when two TTs are summed, their ranks are also
summed, so at each iteration the ranks grow by one. And unlike for the power method, in this case we cannot
employ TT rounding, because it does not keep the nonnegativity of the tensor. To avoid the uncontrolled
growth of the ranks we suggest first adding some small constant to all elements of the TT approximation.
Its value should be chosen depending on the minimum value. It is logical to expect the negative elements of
an approximation of a nonnegative tensor to be close to zero. Then adding a constant tensor will not only
allow us to turn most elements positive but also will not strongly affect the approximation accuracy. After this
step we can resume the individual element correction.
Let us summarize the proposed steps for building a tensor train approximation with nonnegative entries; a schematic code sketch is given after the list.
• Build a tensor train approximation Ŷ of the original tensor Y. It can be found using, for example,
TT-SVD or TT-CROSS [13]. Fortunately, the approximation has to be computed only once.
• Find the minimum entry of Ŷ. There are different approaches for this step. For example, we can find
the maximum in magnitude value with the power method, subtract it from the TT and repeat the search
for this new tensor train, which now has only nonpositive elements.
• Add to Ŷ a small constant whose value depends on the found minimum. With this we
correct most of the negative values in the approximation. The ranks are increased by one.
• Continue to search for the remaining negative elements, determine their indices and build TTs
with ranks equal to 1 to add to the approximation.
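The following numpy sketch mimics these steps on a tiny hand-made example. The TT is stored as a list of cores, the library routines (TT-SVD/TT-CROSS, minimum search) are replaced by dense helpers, and the choice of the constant is purely illustrative; the sketch only demonstrates the rank bookkeeping: the constant shift costs one rank, and every individual correction costs one more.

```python
import numpy as np

def tt_full(cores):
    """Contract a list of TT cores (r_{k-1}, n_k, r_k) into the dense tensor."""
    res = cores[0]
    for core in cores[1:]:
        res = np.tensordot(res, core, axes=(-1, 0))
    return res[0, ..., 0]

def tt_rank_one(shape, index, value):
    """Rank-1 TT equal to `value` at `index` and zero elsewhere."""
    cores = []
    for k, (n, i) in enumerate(zip(shape, index)):
        core = np.zeros((1, n, 1))
        core[0, i, 0] = value if k == 0 else 1.0
        cores.append(core)
    return cores

def tt_add(a, b):
    """Sum of two TTs: cores are stacked block-diagonally, so ranks add up."""
    out = []
    for k, (ca, cb) in enumerate(zip(a, b)):
        if k == 0:
            out.append(np.concatenate([ca, cb], axis=2))
        elif k == len(a) - 1:
            out.append(np.concatenate([ca, cb], axis=0))
        else:
            core = np.zeros((ca.shape[0] + cb.shape[0], ca.shape[1],
                             ca.shape[2] + cb.shape[2]))
            core[:ca.shape[0], :, :ca.shape[2]] = ca
            core[ca.shape[0]:, :, ca.shape[2]:] = cb
            out.append(core)
    return out

# A hand-made rank-2 "approximation" 1 + 1.05*x_i*x_j*x_k of a nonnegative
# tensor; a few of its entries dip slightly below zero, as in the scenario above.
n = 8
x = np.linspace(-1.0, 1.0, n)
G1 = np.zeros((1, n, 2)); G1[0, :, 0] = 1.0; G1[0, :, 1] = x
G2 = np.zeros((2, n, 2)); G2[0, :, 0] = 1.0; G2[1, :, 1] = x
G3 = np.zeros((2, n, 1)); G3[0, :, 0] = 1.0; G3[1, :, 0] = 1.05 * x
cores = [G1, G2, G3]

approx = tt_full(cores)
print("negative entries:", np.sum(approx < 0), "min:", approx.min())

# Step 1: add a small constant (here half of |min| -- the rule is illustrative),
# i.e. a rank-1 all-ones TT scaled by the shift; all ranks grow by one.
shift = 0.5 * abs(min(approx.min(), 0.0))
ones_tt = [np.ones((1, n, 1)) for _ in range(3)]
ones_tt[0] *= shift
cores = tt_add(cores, ones_tt)

# Step 2: correct each remaining negative entry with a rank-1 TT.
approx = tt_full(cores)
for index in zip(*np.where(approx < 0)):
    cores = tt_add(cores, tt_rank_one((n, n, n), index, -approx[index]))

print("min after correction:", tt_full(cores).min())   # ~0 up to round-off
print("final TT ranks:", [c.shape[2] for c in cores[:-1]])
```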
Of course, there remains much scope for future research. We are working on analyzing the distribution
of the negative elements in order to define a stricter rule for determining the constant from the third step.
It is also interesting to compare different methods for the minimum value search in a TT. But the results
already look very promising, as we demonstrate in Section 5.
3. NONNEGATIVE CANONICAL POLYADIC DECOMPOSITION
The canonical polyadic decomposition represents a tensor as a sum of R rank-one terms,
y(i1, . . . , id) = Σ_{r=1}^{R} u1r(i1) u2r(i2) . . . udr(id),
and, as mentioned above, the costs of standard algorithms for CPD increase exponentially with the tensor
order. But there is a way to significantly decrease the complexity from O(dRN^d) to just O(dNR^3) in the
case n1 = . . . = nd = N.
In [6] the authors suggested compressing the initial tensor into the TT format prior to CPD. For noiseless
tensors, an exact mapping from the core tensors of a TT representation of a given data tensor to the factor
matrices of its CPD was proposed, and for the noisy case an iterative algorithm was developed for the
estimation of the factor matrices, with a cost of O(dNR^3).
Now let us consider CPD with nonnegativity constraints. A Nonnegative Canonical Polyadic De-
composition (NCPD) of a tensor Y ∈ R+^{n1×n2×···×nd} is composed of factor matrices U^{(i)} =
[u1^{(i)} . . . uR^{(i)}] ∈ R+^{ni×R}. By analogy with CPD, the well-known algorithms have the same
flaw: O(dRN^d) complexity. In this case it seems appropriate to use the same technique to speed up the
methods.
Indeed, in 2020 a group of researchers tried to implement this idea and presented the results in a
conference report [9]. Their algorithm included three steps: dimensionality reduction by a
nonnegative tensor train, factorization of the NTT core tensors by a low-rank nonnegative CPD, and
reconstruction of the final NCPD of the initial tensor. But this method was unsuccessful because of
the high approximation error, even for noiseless artificial data.
We suggest an approach, different from [9], for constructing an NCPD representation with a cost of only
O(dNR^3) and the desired accuracy. Most algorithms for CPD and NCPD use unfoldings of the initial
tensor, and operations with them (usually matrix multiplications) give the final complexity of O(dRN^d).
As an example we will consider the FAST-HALS NTF method from [10]. In the pseudocode for
Algorithm 2 the elementwise division, Kronecker, Khatri–Rao (columnwise Kronecker), Hadamard and
outer products are denoted, respectively, by ⊘, ⊗, ⊙, ⊛, ◦. The operation [·]+ changes the negative elements
of its argument to some very small nonnegative constant ρ.
For the faster version of the algorithm we will need the tensor-by-matrix multiplication referred to
as the mode-k multiplication. Given a tensor A = [A(i1, i2, . . . , id)] and a matrix U = [U(α, ik)], we
define the mode-k multiplication result as a tensor B = [B(i1, . . . , α, . . . , id)] (α is in the kth place)
obtained by the contraction over the kth axis:
B(i1, . . . , α, . . . , id) = Σ_{ik=1}^{nk} A(i1, i2, . . . , id) U(α, ik).
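In numpy the mode-k multiplication can be written in a couple of lines; this is a sketch with our own function name, used only to make the formula above concrete.

```python
import numpy as np

def mode_k_multiply(A, U, k):
    """Mode-k product of tensor A with matrix U of shape (p, A.shape[k]).

    Contracts the k-th axis of A with the rows of U and puts the new index
    of size p in the k-th place, exactly as in the formula above.
    """
    # tensordot contracts A's axis k with U's axis 1 and appends the new
    # axis at the end; moveaxis returns it to the k-th position.
    B = np.tensordot(A, U, axes=(k, 1))
    return np.moveaxis(B, -1, k)

# quick check on a random 3-way tensor
A = np.random.default_rng(0).random((4, 5, 6))
U = np.random.default_rng(1).random((3, 5))
B = mode_k_multiply(A, U, 1)
print(B.shape)                                   # (4, 3, 6)
# entry check against the definition
print(np.allclose(B[2, 1, 4], sum(A[2, i, 4] * U[1, i] for i in range(5))))
```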
Using the TT representation of the initial tensor, the most expensive quantity in Algorithm 2, the
matricized tensor times the Khatri–Rao product T2 = Y(k)(U^{⊙−k}), can be computed core by core:
T2(s, j) = Γj^{(1)} · · · Γj^{(k−1)} Gk(s) Γj^{(k+1)} · · · Γj^{(d)}, with s = 1, . . . , nk, j = 1, . . . , R.
Here we introduce the matrices Γj^{(m)} = Σ_{im} Gm(im) uj^{(m)}(im), so for each j we need to compute
d − 1 tensor-by-vector multiplications over a given mode. Then the calculation of Γj^{(1)} . . . Gk(s) . . . Γj^{(d)}
for all s and j requires O(dNr^2R) operations. So the final complexity, in the case when the TT ranks are
less than or equal to the NCPD rank, is O(dNR^3).
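The following numpy fragment illustrates the identity above for a third-order example. It is only a sketch under our own naming (Gamma, T2); the dense tensor Y is formed here solely to verify the TT-based result, whereas the point of the method is that it never has to be formed.

```python
import numpy as np

def tt_full(cores):
    res = cores[0]
    for core in cores[1:]:
        res = np.tensordot(res, core, axes=(-1, 0))
    return res[0, ..., 0]

rng = np.random.default_rng(0)
n, R = (5, 6, 7), 4
ranks = (1, 2, 3, 1)
cores = [rng.random((ranks[m], n[m], ranks[m + 1])) for m in range(3)]
U = [rng.random((n[m], R)) for m in range(3)]      # CP factor matrices
Y = tt_full(cores)                                 # dense tensor, only for checking

k = 1                                              # mode being updated

# Gamma^{(m)}_j = sum_{i_m} G_m(i_m) * u^{(m)}_j(i_m): one (r_{m-1} x r_m)
# matrix per column j, computed for every mode.
Gamma = [np.einsum('rns,nj->jrs', cores[m], U[m]) for m in range(3)]

# T2(s, j) = Gamma^{(1)}_j ... G_k(s) ... Gamma^{(d)}_j   (here d = 3, k = 2)
T2 = np.einsum('jab,bsc,jcd->sj', Gamma[0], cores[1], Gamma[2])

# direct (dense) computation of the same matricized-tensor-times-Khatri-Rao
T2_direct = np.einsum('asc,aj,cj->sj', Y, U[0], U[2])
print(np.allclose(T2, T2_direct))                  # True
```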
Then Algorithm 2 can be rewritten. Implementations of the necessary operations with TT are
available, for example, in [12].
A naive approach of calculating T2 from scratch at each iteration results in O(d^2NR^3) complexity.
A smarter way is to evaluate the common parts of the computations beforehand and then only update them.
Then we can keep the overall complexity of an iteration at O(dNR^3) with additional memory of only O(dR^3).
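One possible way to organize this reuse is sketched below with hypothetical helper names (Gamma, left, right): accumulate left and right partial products of the Γ matrices in two passes, after which the T2 matrix for every mode is assembled from cached pieces. This is only our illustration of the caching idea, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, R = 4, 3
n = (5, 6, 7, 8)
ranks = (1, 4, 4, 4, 1)
cores = [rng.random((ranks[m], n[m], ranks[m + 1])) for m in range(d)]
U = [rng.random((n[m], R)) for m in range(d)]

# Gamma[m][j] is an (r_{m-1} x r_m) matrix; O(dNr^2R) work, O(dr^2R) memory.
Gamma = [np.einsum('rns,nj->jrs', cores[m], U[m]) for m in range(d)]

# left[k][j]  = Gamma^{(1)}_j ... Gamma^{(k-1)}_j   (trivial for the first mode)
# right[k][j] = Gamma^{(k+1)}_j ... Gamma^{(d)}_j   (trivial for the last mode)
left = [np.ones((R, 1, 1))]
for m in range(d - 1):
    left.append(np.einsum('jab,jbc->jac', left[-1], Gamma[m]))
right = [np.ones((R, 1, 1))]
for m in reversed(range(1, d)):
    right.insert(0, np.einsum('jab,jbc->jac', Gamma[m], right[0]))

# T2 for every mode is now assembled from cached pieces.
for k in range(d):
    T2 = np.einsum('jab,bsc,jcd->sj', left[k], cores[k], right[k])
    print(k, T2.shape)
```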
So if we have a tensor train representation of the initial tensor, we are able to dramatically speed up the
NCPD algorithm. Let us note an additional useful property of Algorithm 3: even if the TT approximation
has negative elements, the method will construct a nonnegative CPD, and in this case we do not need a
nonnegative tensor train factorization. The TT approximation Ŷ of the original tensor can be found using,
for example, TT-SVD or TT-CROSS [13].
The Matlab package for tensor decompositions “TensorBox” [14] includes different methods for NCPD:
ANLS with the Active Set Method and Column Grouping, the Multiplicative Updating Method, and ANLS
with the Block Principal Pivoting Method. All of them share the same time-consuming operation Y(n)(U^{⊙−n}).
Using the approach described above, it is possible to significantly accelerate all these NCPD algorithms.
But there are methods for which we need to build a nonnegative tensor train first. To do so we
can use the technique described in Section 2, because in this case we only need nonnegative entries of
the approximation, not of all the factors. Methods for constructing an NTT with nonnegative cores can be
found in [11, 16, 18].
4. NONNEGATIVE TUCKER DECOMPOSITION
The Tucker model represents a tensor through a core tensor and a set of factor matrices, also called
loadings:
y(i1, . . . , id) = Σ_{j1,j2,...,jd} g(j1, j2, . . . , jd) a1j1(i1) a2j2(i2) . . . adjd(id).
The classical nonnegative Tucker factorization (NTD) is supposed to have G ∈ R+^{R1×R2×···×Rd} and
A^{(i)} = [a1^{(i)}, a2^{(i)}, . . . , aRi^{(i)}] ∈ R+^{ni×Ri} (i = 1, 2, . . . , d).
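To make the model concrete, here is a minimal numpy sketch (with our own helper name, not from the paper) that rebuilds the full tensor from a nonnegative core and nonnegative loadings by successive mode-k multiplications, the same operation defined in Section 3.

```python
import numpy as np

def tucker_full(G, factors):
    """Reconstruct y(i1,...,id) = sum_j g(j1,...,jd) * prod_m A_m[i_m, j_m]
    by multiplying the core G with each factor matrix along its mode."""
    Y = G
    for m, A in enumerate(factors):
        # contract mode m of the current core with the columns of A
        Y = np.moveaxis(np.tensordot(Y, A, axes=(m, 1)), -1, m)
    return Y

rng = np.random.default_rng(0)
R, n = (2, 3, 4), (5, 6, 7)
G = rng.random(R)                                   # nonnegative core
A = [rng.random((n[m], R[m])) for m in range(3)]    # nonnegative loadings
Y = tucker_full(G, A)
print(Y.shape, Y.min() >= 0)                        # (5, 6, 7) True
```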
To construct a nonnegative Tucker approximation Ŷ we need to minimize the following cost function:
DNTD = (1/2) ||Y − Ŷ||F^2.
To optimize A^{(n)} we consider the mode-n matricizations of Y and Ŷ and rewrite the functional:
DNTD = (1/2) ||Y(n) − A^{(n)} G(n) (A^{⊗−n})^T||F^2,  A^{(n)} ≥ 0.
In [15] an alternating proximal gradient (APG) method for solving NTD is proposed. Unlike the
alternating nonnegative least squares (ANLS) method that exactly solves each subproblem, APG
updates every factor matrix by solving a relaxed subproblem with a separable quadratic objective.
Each relaxed subproblem has a closed form solution, which makes the per-iteration cost low. Using an
extrapolation technique, APG also converges quite fast. But, as the authors commented, computing the
partial gradients is a very time consuming process for large tensors. Adding the tensor train decomposition,
as discussed earlier, helps to considerably accelerate this algorithm. Alternating proximal gradient
for NTD corrects the values of A^{(n)} and G that turn negative at each iteration; this means that we do not
need a nonnegative tensor train approximation and can use any TT factorization method, as it does not
affect the result.
5. NUMERICAL EXPERIMENTS
Relative accuracy is computed as (||A − B||F / ||A||F) × 100%, where B is a constructed nonnegative
approximation of the original nonnegative tensor A. Experiments were performed on a desktop computer with
an Intel CPU at 3.10 GHz using 8.0 GB of RAM running macOS Catalina 10.15.7 and Matlab R2018a.
We consider the Smoluchowski coagulation equation
∂n(v, t)/∂t = (1/2) ∫_0^v K(v − u, u) n(v − u, t) n(u, t) du − n(v, t) ∫_0^∞ K(v, u) n(u, t) du.
This equation describes the evolution in time of the concentration function n(v, t) of particles of size v per
unit volume of the system at the moment t. When particles are composed of l different components, their
sizes form a vector v = (v1, . . . , vl). Then the coagulation process is described by a multicomponent version
of the Smoluchowski equation.
We have considered the coagulation kernel for the Smoluchowski equation K(u, v) = (u1 + u2)^μ (v1 +
v2)^ν + (u1 + u2)^ν (v1 + v2)^μ, ui ≥ 0, vi ≥ 0, i = 1, 2, where the parameters μ, ν were chosen so that
μ + ν ≤ 1, |μ − ν| ≤ 2. The variables ui, vi, i = 1, 2, range from 0 to 10 with step 0.1, so we have a
fourth-order tensor with 10^8 elements. For each pair μ, ν the experiments were repeated 10 times.
We considered μ = 0.2 and ν = 0.1. The power method for this tensor converges in 300 iterations,
which takes 0.2 s, and gives the right answer. The problem here is that the two maximum values are very
close to each other. Despite this, it works quite well because the TT ranks of the approximation turn out to be
relatively small. But the TT-cross-based minimization procedure from [12] takes less time, and in our
experiments we decided to use this method for faster performance, although in the general case it may need
some tuning to find the absolute extrema. For the experiment, different TT decompositions from [12] were
tested. For comparison, the NTTF method for nonnegative TT factorization from [18] is considered.
This example admits good TT approximations: the negative values that appear are very small, and by
adding a constant all elements become positive. At the same time, the relative error changes insignificantly
and the ranks increase only by one.
The results in Table 2 show that the proposed approach is very promising for nonnegative tensor train
factorization in the considered application.
Method | Relative error (%) | Time (s, mean ± std) | Average ranks | Min ranks | Max ranks
NTTF | 0.12 | 63.2 ± 2.8 | 3, 25, 41 | 3, 1, 3 | 3, 85, 100
AMEN-CROSS + min correction | 0.0028 | 1.3 ± 0.4 | 7, 3, 7 | 7, 3, 7 | 7, 3, 7
DMRG-CROSS + min correction | 0.0446 | 9.2 ± 5.2 | 7, 5, 7 | 7, 5, 7 | 7, 5, 7
TT-SVD + min correction | 0.0338 | 4.3 ± 0.1 | 5, 3, 5 | 5, 3, 5 | 5, 3, 5
Table 4. One iteration of the algorithms applied to positive tensors with nonnegative NTD ranks equal to 5
Table 6. Comparison of the algorithms applied to positive tensors with nonnegative NCP ranks equal to 5
We can check the quality of a nonnegative TT approximation with ranks 4, 10, 6, which is built by
NTT-MU (300 sweeps) from [16] for this data. The values in Table 3 are the averaged results over 100
repetitions of the experiment.
For this example, in terms of time and relative error we also achieve better results when building a
nonnegative TT with the method proposed in Section 2.
Next we use the algorithms to construct an NCP decomposition of the coagulation kernel for the
Smoluchowski equation from Subsection 5.5.1. For the first example the variables ui, vi, i = 1, 2, range
from 0 to 5 with step 0.1, whereas for the second they range from 0 to 10 with step 0.1. The ranks for NCP
were chosen equal to 5. As the prior TT decomposition method we used TT-SVD with accuracy 1e−6.
The values in Table 7 are the averaged results over 10 repetitions of the experiment.
From Table 7 it follows that the high accuracy of the TT approximation allows us to keep the same relative
error as for the initial tensor. And we still obtain a great acceleration of the computations, and with TT-CROSS
methods it is possible to decrease the working time even more.
6. CONCLUSION
In this article we propose new approaches to nonnegative tensor factorization based on the tensor train
model. They allow decompositions to be constructed much faster.
First we consider the nonnegative tensor train factorization problem. For nonnegative tensor decompo-
sitions, by definition all factors are required to be nonnegative. This is partly done for the interpretability
of the results. The other reason is more practical: with this condition we do not have to worry about the
nonnegativity of the resulting approximation. But there are many cases when a TT factorization is
used simply to save memory and speed up computations. Of course, if the data is nonnegative, we
want the decomposition to have the same property. Moreover, there exist algorithms for which tensor
nonnegativity is crucial for convergence. In these cases it makes sense to consider the problem where
we have to build a TT approximation to nonnegative data so that all resulting entries are nonnegative. We
suggest a method for solving this task based on the correction of negative elements in the TT
decomposition. We study examples where this approach gives the best results in time and accuracy.
Secondly, we suggest a technique for using tensor train decompositions to construct nonnegative
canonical and Tucker models. It allows the complexity of some operations to be reduced from O(N^d) to
linear in N, thus bringing an impressive acceleration to the algorithms. What is more, some methods for
NCP and NTD, such as NCP-HALS, NTD with APG, etc., do not require the tensor train approximation
to be nonnegative, and it is well known that additional constraints usually slow down computations and
worsen the accuracy of a decomposition. And for TT factorization there are effective algorithms based on
cross methods.
The proposed techniques for constructing NTT, NCP and NTD decompositions based on TT are very
useful for big data problems, as they greatly reduce the complexity of the methods. In the future we
plan to further develop the proposed approaches and test them on more applications.
FUNDING
The work was supported by the Moscow Center of Fundamental and Applied Mathematics (agree-
ment 075-15-2019-1624 with the Ministry of Education and Science of the Russian Federation).
REFERENCES
1. F. L. Hitchcock, “Multiple invariants and generalized rank of a p-way matrix or tensor,” J. Math. Phys. 7,
39–79 (1927).
2. R. A. Harshman, “Determination and proof of minimum uniqueness conditions for PARAFAC1,” in UCLA
Working Papers in Phonetics (1972), Vol. 22.
3. J. D. Carroll and J. J. Chang, “Analysis of individual differences in multidimensional scaling via an n-way
generalization of Eckart–Young decomposition,” Psychometrika 35, 283–319 (1970).
4. J. B. Kruskal, “Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to
arithmetic complexity and statistics,” Linear Algebra Appl. 18, 95–138 (1977).
5. J. D. Carroll, G. de Soete, and S. Pruzansky, “Fitting of the latent class model via iteratively reweighted
least squares CANDECOMP with nonnegativity constraints,” in Multiway Data Analysis (Elsevier,
Amsterdam, The Netherlands, 1989), pp. 463–472.
6. A.-H. Phan, A. Cichocki, I. Oseledets, G. G. Calvi, S. Ahmadi-Asl, and D. P. Mandic, “Tensor networks
for latent variable analysis: Higher order canonical polyadic decomposition,” IEEE Trans. Neural Networks
Learn. Syst. 31, 2174–2188 (2020). https://doi.org/10.1109/TNNLS.2019.2929063
7. O. Lebedeva, “Tensor conjugate-gradient-type method for Rayleigh quotient minimization in block QTT
format,” Russ. J. Numer. Anal. Math. Model. 26, 465–489 (2011).