License: CC BY-NC-ND 4.0
arXiv:2303.17992v2 [math.OC] 19 Mar 2024

A fast Multiplicative Updates algorithm for Non-negative Matrix Factorization

Mai-Quyen Pham, IMT Atlantique, UMR CNRS 6285 Lab-STICC (email: firstname.lastname@imt-atlantique.fr) and CROSSING IRL CNRS (email: mai-quyen.pham@cnrs.fr); Jérémy Cohen, Univ Lyon, INSA-Lyon, UCBL, UJM-Saint Etienne, CNRS, Inserm, CREATIS UMR 5220, U1206, F-69100 Villeurbanne, France (email: jeremy.cohen@cnrs.fr); Thierry Chonavel, IMT Atlantique, UMR CNRS 6285 Lab-STICC.
Abstract

Nonnegative Matrix Factorization is an important tool in unsupervised machine learning to decompose a data matrix into a sum of parts that are often interpretable. Many dedicated algorithms have been proposed during the last three decades. A well-known method is the Multiplicative Updates algorithm proposed by Lee and Seung in 2002. Multiplicative updates have many interesting features: they are simple to implement, can be adapted to popular variants such as sparse Nonnegative Matrix Factorization, and, according to recent benchmarks, are state-of-the-art for many problems where the loss function is not the squared Frobenius norm. In this manuscript, we propose to improve the Multiplicative Updates algorithm, seen as an alternating majorization-minimization algorithm, by crafting a tighter upper bound of the Hessian matrix for each alternating subproblem. Convergence is still ensured, and we observe in practice on both synthetic and real-world datasets that the proposed fastMU algorithm is often significantly faster than the regular Multiplicative Updates algorithm, and can even be competitive with state-of-the-art methods for the Frobenius loss.

Keywords— Nonnegative Matrix Factorization, Quadratic majorization, Multiplicative Updates, Alternating Optimization

1 Introduction

Nonnegative Matrix Factorization (NMF) is the (approximate) decomposition of a matrix into a product of nonnegative matrix factors. In many applications, the factors of the decomposition give an interesting interpretation of its content as a sum of parts. While Paatero and Tapper could be credited with the modern formulation of NMF [23] as known in the data sciences, its roots are much older and stem from many fields, primarily chemometrics and the Beer-Lambert law; see for instance the literature survey in [11]. NMF can be formulated as follows: given a matrix $\mathbf{V}\in\mathbb{R}^{M\times N}$, find two (entry-wise) non-negative matrices $\mathbf{W}\in\mathbb{R}_{+}^{R\times M}$ and $\mathbf{H}\in\mathbb{R}_{+}^{R\times N}$, with fixed $R\leq\min(M,N)$ (typically one selects $R\ll\min(M,N)$), that satisfy

$$\mathbf{V}\approx\mathbf{W}^{\top}\mathbf{H}.\tag{1}$$

Formally, we solve a low-rank approximation problem by minimizing a loss function $\Psi(\mathbf{W},\mathbf{H})$:

$$\text{Find }\left(\widehat{\mathbf{W}},\widehat{\mathbf{H}}\right)\in\underset{(\mathbf{W},\mathbf{H})\in\mathbb{R}_{+}^{R\times M}\times\mathbb{R}_{+}^{R\times N}}{\operatorname{argmin}}\ \Psi(\mathbf{W},\mathbf{H})\tag{2}$$

Classical instances of $\Psi$ are the squared Frobenius norm and the Kullback-Leibler (KL) divergence, defined in Equations (5) and (6). The integer $R$ is often called the nonnegative rank of the approximation matrix $\mathbf{W}^{\top}\mathbf{H}$.

There is a large body of literature on how to compute solutions to the NMF problem. Existing algorithms can be classified according to two criteria: the loss they minimize, and whether they alternately solve for $\mathbf{W}$ and $\mathbf{H}$, that is, for each iteration $k\in\mathbb{N}$,

$$\mathbf{H}^{k+1}=\underset{\mathbf{H}\in\mathbb{R}_{+}^{R\times N}}{\operatorname{argmin}}\ \Psi(\mathbf{W}^{k},\mathbf{H})\tag{3}$$
$$\mathbf{W}^{k+1}=\underset{\mathbf{W}\in\mathbb{R}_{+}^{R\times M}}{\operatorname{argmin}}\ \Psi(\mathbf{W},\mathbf{H}^{k+1}),\tag{4}$$

or update $\mathbf{W}$ and $\mathbf{H}$ simultaneously. For many usual loss functions, the subproblems of the alternating algorithms are (strongly) convex. Therefore, convergence guarantees can be obtained using the theory of block-coordinate descent. Moreover, these methods generally perform well in practice. For the Frobenius loss, state-of-the-art alternating algorithms include Hierarchical Alternating Least Squares [7, 12]. For the more general beta-divergence loss, which includes the popular KL divergence, the Multiplicative Updates algorithm proposed by Lee and Seung [18, 19], which we denote by MU in the rest of this manuscript, is still state-of-the-art for many datasets [14].
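As an illustration of the alternating scheme (3)-(4), here is a minimal Python sketch; the inner solvers `update_H` and `update_W` are placeholders (our naming) for whichever method is used to address each subproblem, for instance the MU or fastMU updates discussed below.

```python
import numpy as np

def alternating_nmf(V, R, update_H, update_W, n_iter=100, seed=0):
    """Generic alternating scheme (3)-(4): V (M x N) is approximated by W.T @ H,
    with W (R x M) and H (R x N) nonnegative. `update_H` and `update_W` are
    placeholder inner solvers for the two subproblems."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((R, M)) + 1e-2  # positive initialization
    H = rng.random((R, N)) + 1e-2
    for _ in range(n_iter):
        H = update_H(V, W, H)   # solve (3), at least approximately
        W = update_W(V, W, H)   # solve (4), at least approximately
    return W, H
```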

The MU algorithm is in fact a very popular algorithm for computing NMF even when minimizing the Frobenius loss. Vanilla MU is simple to implement and easy to modify to account for regularizations, a popular choice being sparsity-inducing $\ell_{1}$ penalizations [15, 30]. There have been a number of works regarding the convergence and the implementation of MU [25, 20, 33, 29, 26], but these works do not modify the core of the algorithm: the design of an approximate diagonal Hessian matrix used to define matrix-like step sizes in gradient descent. A downside of MU, in particular for the Frobenius loss, is that its convergence can be significantly slower than that of other methods.

In this work, we aim at speeding up the convergence of the MU algorithm. To this end, we propose to compute different approximate diagonal Hessian matrices which are tighter approximations of the true Hessian matrices. The iterates of the proposed algorithm, coined fastMU, are guaranteed to converge to a stationary point of the objective function, although the convergence speed is unknown. The updates are not multiplicative anymore but instead follow the usual structure of forward-backward algorithms. We observe on both synthetic and realistic datasets that fastMU is competitive with state-of-the-art methods for the Frobenius loss, and can speed up MU by several orders of magnitude for both the Frobenius loss and the KL divergence. It can however struggle when the data is very sparse or when the residuals are large.

1.1 Structure of the manuscript

Section 1 introduces the motivation and positions our work within the topic. Section 2 covers the relevant background material on the quadratic majorization technique. Section 3 provides the technical details of the Majorize-Minimize framework which underpins the fastMU algorithm. Section 4 details the fastMU algorithm, the main contribution of our work, which improves empirical convergence speed over MU. Section 5 reports several experiments on both simulated and real data, with comparisons against well-known methods from the literature. Finally, Section 6 concludes and discusses possible perspectives for future work.

1.2 Notations

To end this section, let us introduce the following notations: $\odot$ and $\oslash$ denote the Hadamard (entry-wise) product and division, respectively. $\operatorname{Diag}(\mathbf{v})$ denotes the diagonal matrix with diagonal entries given by the vector $\mathbf{v}$. Uppercase bold letters denote matrices, and lowercase bold letters denote column vectors. A bold lowercase letter with index $n$ denotes the $n$-th column of the corresponding matrix; for example, $\mathbf{v}_{n}\in\mathbb{R}^{M}$ is the $n$-th column of $\mathbf{V}\in\mathbb{R}^{M\times N}$, with $1\leq n\leq N$. Similarly, $v_{a,b}$ denotes entry $(a,b)$ of matrix $\mathbf{V}$. For $\mathbf{A}\in\mathbb{R}^{M\times N}$, we write $\mathbf{A}\geq\epsilon$ (resp. $\mathbf{A}>\epsilon$) if $\mathbf{A}\in[\epsilon,+\infty)^{M\times N}$ (resp. $\mathbf{A}\in(\epsilon,+\infty)^{M\times N}$). $\mathbf{A}\succeq 0$ (resp. $\mathbf{A}\succ 0$) means that $\mathbf{A}\in\mathbb{R}^{N\times N}$ is positive semi-definite (resp. positive definite), that is, for all $\mathbf{x}\in\mathbb{R}^{N}$, $\mathbf{x}^{\top}\mathbf{A}\mathbf{x}\geq 0$ (resp. $\mathbf{x}^{\top}\mathbf{A}\mathbf{x}>0$). $\mathbbm{1}_{N}$ denotes the vector of ones of length $N$ and $\mathbbm{1}_{M,N}=\mathbbm{1}_{M}\mathbbm{1}_{N}^{\top}$.

2 Background

2.1 Loss function

In this paper, we consider two widely used loss functions:
1) the squared Frobenius norm:

$$\Psi(\mathbf{W},\mathbf{H})=\sum_{\substack{1\leq n\leq N\\ 1\leq m\leq M}}\frac{1}{2}\left(v_{m,n}-\mathbf{w}_{m}^{\top}\mathbf{h}_{n}\right)^{2}\tag{5}$$

2) the KL divergence:

$$\Psi(\mathbf{W},\mathbf{H})=\sum_{\substack{1\leq n\leq N\\ 1\leq m\leq M}}v_{m,n}\log\!\left(\frac{v_{m,n}}{\mathbf{w}_{m}^{\top}\mathbf{h}_{n}}\right)-v_{m,n}+\mathbf{w}_{m}^{\top}\mathbf{h}_{n}\tag{6}$$
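For concreteness, a minimal NumPy sketch of the two losses (5) and (6), with the convention $\mathbf{V}\approx\mathbf{W}^{\top}\mathbf{H}$ used throughout (the function names and the small `eps` guard inside the logarithm are ours, not part of the formal definitions):

```python
import numpy as np

def frobenius_loss(V, W, H):
    """Squared Frobenius loss (5): 0.5 * ||V - W.T @ H||_F^2."""
    return 0.5 * np.sum((V - W.T @ H) ** 2)

def kl_loss(V, W, H, eps=1e-12):
    """KL divergence (6); eps guards the log for (near-)zero entries."""
    WH = W.T @ H
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)
```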

These two loss functions can be split into a sum of functions that are separable with respect to the columns of the matrix $\mathbf{H}$. Therefore, as is commonly done in the literature, we propose to solve the subproblem of estimating $\mathbf{H}$ in parallel for each column $\mathbf{h}_{n}$. More precisely,

$$\Psi(\mathbf{W},\mathbf{H})=\sum_{n=1}^{N}\psi(\mathbf{W},\mathbf{h}_{n})\tag{7}$$

where
1) when $\Psi$ is the squared Frobenius norm,

$$\psi(\mathbf{W},\mathbf{h}_{n})=\frac{1}{2}\left\|\mathbf{v}_{n}-\mathbf{W}^{\top}\mathbf{h}_{n}\right\|_{2}^{2}.\tag{8}$$

2) when $\Psi$ is the KL divergence,

$$\psi(\mathbf{W},\mathbf{h}_{n})=\sum_{m=1}^{M}-v_{m,n}\log\!\left(\mathbf{w}_{m}^{\top}\mathbf{h}_{n}\right)+\mathbf{w}_{m}^{\top}\mathbf{h}_{n}+v_{m,n}\left(\log(v_{m,n})-1\right).\tag{9}$$

For simplicity, and because of (i) the symmetry between the $\mathbf{W}$ and $\mathbf{H}$ updates and (ii) the separability of the loss functions, only strategies for updating a column $\mathbf{h}_{n}$ of the matrix $\mathbf{H}$ are discussed. Thus, we only consider problems of the form

$$\mathbf{x}^{*}=\underset{\mathbf{x}\in\mathbb{R}_{+}^{R}}{\operatorname{argmin}}\;\left(\theta(\mathbf{x}):=\psi(\mathbf{W},\mathbf{x})\right).\tag{10}$$

The aim of this work is to develop an efficient algorithm to minimize $\theta(\mathbf{x})$. A well-known technique for this purpose is the Majorize-Minimize (MM) principle [16]. In the following, we present how to build quadratic majorization functions.

2.2 Quadratic majorization functions

MM algorithms are based on the idea of iteratively constructing convex quadratic majorizing approximations (also called auxiliary functions [19, 17]) of the cost function:

Definition 1

Let $\varphi:\mathbb{R}^{R}\to\mathbb{R}$ be a differentiable function and $\mathbf{x}\in\mathbb{R}^{R}$. Let us define, for every $\mathbf{x}^{\prime}\in\mathbb{R}^{R}$,

$$\xi(\mathbf{x},\mathbf{x}^{\prime})=\varphi(\mathbf{x})+(\mathbf{x}^{\prime}-\mathbf{x})^{\top}\nabla\varphi(\mathbf{x})+\frac{1}{2}\left\|\mathbf{x}^{\prime}-\mathbf{x}\right\|_{\mathbf{A}(\mathbf{x})}^{2}\tag{11}$$

where $\mathbf{A}(\mathbf{x})\in\mathbb{R}^{R\times R}$ is a positive semi-definite matrix and $\|\cdot\|_{\mathbf{A}(\mathbf{x})}^{2}$ denotes the weighted squared Euclidean norm induced by the matrix $\mathbf{A}(\mathbf{x})$, that is, $\forall\,\mathbf{z}\in\mathbb{R}^{R}$, $\|\mathbf{z}\|_{\mathbf{A}(\mathbf{x})}^{2}=\mathbf{z}^{\top}\mathbf{A}(\mathbf{x})\mathbf{z}$. Then, $\mathbf{A}(\mathbf{x})$ satisfies the majorization condition for $\varphi$ at $\mathbf{x}$ if $\xi(\mathbf{x},\cdot)$ is a quadratic majorization function of $\varphi$ at $\mathbf{x}$, that is, for every $\mathbf{x}^{\prime}\in\mathbb{R}^{R}$, $\varphi(\mathbf{x}^{\prime})\leq\xi(\mathbf{x},\mathbf{x}^{\prime})$.

In this work, with the purpose of achieving fast convergence, we propose a new approach to design the matrix $\mathbf{A}(\mathbf{x})$ in a way that permits moves as large as possible between successive approximations of the decomposition, while still allowing an easy inversion. To build a family of majorizing functions, we resort to the following result, inspired by the convergence proof of MU from Lee and Seung [19]:

Proposition 1

Let $\mathbf{B}\in\mathbb{R}_{+}^{R\times R}$ be a symmetric matrix, and let $\mathbf{u}\in\mathbb{R}_{++}^{R}$ be a vector with positive entries. Then, $\left(\operatorname{Diag}\left((\mathbf{B}\mathbf{u})\oslash\mathbf{u}\right)-\mathbf{B}\right)$ is positive semi-definite.

Proof 1

See Appendix Proof of Proposition 1.
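Proposition 1 is also easy to check numerically; the following small sketch (ours) draws a random symmetric nonnegative matrix $\mathbf{B}$ and a positive vector $\mathbf{u}$, and verifies that the smallest eigenvalue of $\operatorname{Diag}((\mathbf{B}\mathbf{u})\oslash\mathbf{u})-\mathbf{B}$ is nonnegative:

```python
import numpy as np

rng = np.random.default_rng(0)
R = 6
B = rng.random((R, R))
B = B + B.T                      # symmetric matrix with nonnegative entries
u = rng.random(R) + 0.1          # strictly positive vector

M = np.diag((B @ u) / u) - B     # Diag((B u) ./ u) - B
print(np.linalg.eigvalsh(M).min())  # expected: >= 0 up to numerical precision
```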

This proposition leads to the following practical corollary.

Corollary 1

Let $\varphi:\mathbb{R}^{R}\to\mathbb{R}$ be a convex, twice-differentiable function with continuous derivatives in an open ball $\mathcal{B}$ around the point $\mathbf{x}\in\mathbb{R}^{R}$. Denote by $\nabla^{2}\varphi(\mathbf{x})$ the Hessian matrix evaluated at $\mathbf{x}$. Then, for any $\mathbf{u}\in\mathbb{R}_{++}^{R}$, the function

$$\xi(\mathbf{x},\mathbf{x}^{\prime})=\varphi(\mathbf{x})+(\mathbf{x}^{\prime}-\mathbf{x})^{\top}\nabla\varphi(\mathbf{x})+\frac{1}{2}(\mathbf{x}^{\prime}-\mathbf{x})^{\top}\operatorname{Diag}\left(\left(|\nabla^{2}\varphi(\mathbf{x})|\,\mathbf{u}\right)\oslash\mathbf{u}\right)(\mathbf{x}^{\prime}-\mathbf{x})$$

is a quadratic majorization function of $\varphi$ at $\mathbf{x}$.

We now turn our attention to the construction of the MM function when the objective function $\theta(\mathbf{x})$ is given by (8) or (9), i.e., we show how to build a tight auxiliary function through a careful selection of the approximate Hessian matrix $\mathbf{A}(\mathbf{x})$.

3 Construction of quadratic majorant functions for NMF

The choice of the matrix $\mathbf{A}(\mathbf{x})$ is important since a good choice can improve the convergence speed of the MM algorithm. Indeed, in [6, 24] the authors show that the use of judicious preconditioning matrices can significantly accelerate convergence. In fact, we may see $\mathbf{A}(\mathbf{x})$ as an approximation of the Hessian of the cost function, and the choice of $\mathbf{A}(\mathbf{x})$ constitutes a trade-off between the number of iterations necessary to achieve convergence and the per-update complexity. If $\mathbf{A}(\mathbf{x})$ is exactly the Hessian matrix, minimizing the majorant amounts to performing a Newton step, which has quadratic convergence. On the other hand, if $\mathbf{A}(\mathbf{x})$ is diagonal, it can be inverted with negligible time complexity, but the convergence speed may be sublinear, as is the case for gradient descent. Here, we leverage Corollary 1 to build the quadratic majorant function $\xi$ of $\theta$. We choose

$$\xi(\mathbf{x},\mathbf{x}^{\prime})=\theta(\mathbf{x})+(\mathbf{x}^{\prime}-\mathbf{x})^{\top}\nabla\theta(\mathbf{x})+\frac{1}{2}(\mathbf{x}^{\prime}-\mathbf{x})^{\top}\mathbf{A}(\mathbf{x})(\mathbf{x}^{\prime}-\mathbf{x})\tag{12}$$

with $\mathbf{A}(\mathbf{x})=\operatorname{Diag}\left([\nabla^{2}\theta(\mathbf{x})\mathbf{u}]\oslash\mathbf{u}\right)$ for a well-chosen vector $\mathbf{u}$. Here, the Hessian matrix is easily derived:
1) when $\theta$ is the squared Frobenius norm,

$$\nabla^{2}\theta(\mathbf{x})=\mathbf{W}\mathbf{W}^{\top}\tag{13}$$

2) when $\theta$ is the KL divergence,

$$\nabla^{2}\theta(\mathbf{x})=\sum_{m=1}^{M}\frac{v_{m,n}\,\mathbf{w}_{m}\mathbf{w}_{m}^{\top}}{(\mathbf{w}_{m}^{\top}\mathbf{x})^{2}}.\tag{14}$$

An approximation of the Hessian is sometimes used instead for the KL divergence. In particular, within the MU framework, it is obtained by setting $v_{m,n}=\mathbf{w}_{m}^{\top}\mathbf{x}$, leading to

$$\nabla^{2}\theta(\mathbf{x})\approx\sum_{m=1}^{M}\frac{\mathbf{w}_{m}\mathbf{w}_{m}^{\top}}{v_{m,n}}.\tag{15}$$
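The Hessians (13)-(15) for a single column $\mathbf{x}=\mathbf{h}_{n}$ can be formed explicitly in a few lines; a NumPy sketch (ours), mainly useful for checking the diagonal majorants derived below:

```python
import numpy as np

def hessian_frobenius(W):
    """Hessian (13) of the Frobenius column subproblem: W @ W.T, of size R x R."""
    return W @ W.T

def hessian_kl(W, v_n, x):
    """Exact Hessian (14) of the KL column subproblem at x, for column v_n of V."""
    Wx = W.T @ x                        # length-M vector with entries w_m^T x
    return (W * (v_n / Wx ** 2)) @ W.T  # sum_m v_{m,n} w_m w_m^T / (w_m^T x)^2

def hessian_kl_mu_approx(W, v_n):
    """MU-style approximation (15), obtained by replacing w_m^T x with v_{m,n}."""
    return (W / v_n) @ W.T              # sum_m w_m w_m^T / v_{m,n}
```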

In their seminal work [18, 19], Lee and Seung proposed to choose $\mathbf{u}=\mathbf{x}=\mathbf{h}_{n}$ to build the majorant of the Hessian matrix. We shall denote it by $\mathbf{A}_{\operatorname{MU}}(\mathbf{h}_{n})$. This yields:
1) when $\theta$ is the squared Frobenius norm:

$$\mathbf{A}_{\operatorname{MU}}(\mathbf{h}_{n})=\operatorname{Diag}\left([\mathbf{W}\mathbf{W}^{\top}\mathbf{h}_{n}]\oslash\mathbf{h}_{n}\right)\tag{16}$$

2) when $\theta$ is the KL divergence, under the approximation $v_{m,n}=\mathbf{w}_{m}^{\top}\mathbf{h}_{n}$:

$$\mathbf{A}_{\operatorname{MU}}(\mathbf{h}_{n})=\operatorname{Diag}\left(\left[\sum_{m=1}^{M}\frac{\mathbf{w}_{m}\mathbf{w}_{m}^{\top}\mathbf{h}_{n}}{v_{m,n}}\right]\oslash\mathbf{h}_{n}\right)=\operatorname{Diag}\left(\left[\sum_{m=1}^{M}\mathbf{w}_{m}\right]\oslash\mathbf{h}_{n}\right)\tag{17}$$

It is shown in [19] that these choices not only yield majorizing functions $\xi$ for $\theta$, but also that the minimizer of the resulting majorant has positive entries.

Although MU is an efficient and widely used algorithm, its convergence still attracts the attention of many researchers. In particular, it is shown in [10] that MU produces a sequence of iterates $(\mathbf{x}^{(k)})_{k\in\mathbb{N}^{*}}$ with non-increasing cost function values, $\theta(\mathbf{x}^{(k)})\geq\theta(\mathbf{x}^{(k+1)})$. However, convergence of the iterates to a first-order stationary point is not guaranteed. Zhao and Tan proposed in [33] to add regularizers on $\mathbf{W}$ and $\mathbf{H}$ in the objective function to guarantee the convergence of the algorithm. Another family of approaches enforces the nonnegativity of the entries of $\mathbf{W}$ and $\mathbf{H}$ in (2) through the constraints $\mathbf{W}>\epsilon$ and $\mathbf{H}>\epsilon$ with $\epsilon>0$ [28, 27]. These constraints allow the MU algorithm to converge to stationary points by avoiding various problems occurring at zero. In the same spirit, in the following we assume that the factor matrices $\mathbf{W}$ and $\mathbf{H}$ have positive entries.

Minimization of the quadratic majorant obtained with the $\mathbf{A}_{\operatorname{MU}}$ approximate Hessian recovers the well-known MU updates:
1) for the squared Frobenius norm:

$$\mathbf{H}\leftarrow\mathbf{H}\odot\left(\mathbf{W}\mathbf{V}\oslash\mathbf{W}\mathbf{W}^{\top}\mathbf{H}\right)\tag{18}$$

2) for the KL divergence (note that for the KL divergence, the MU algorithm uses an approximation of the approximate Hessian; this double approximation is justified by the simplicity of the resulting updates):

$$\mathbf{H}\leftarrow\mathbf{H}\odot\left[\left(\mathbf{W}(\mathbf{V}\oslash\mathbf{W}^{\top}\mathbf{H})\right)\oslash\left(\mathbf{W}\mathbbm{1}_{M,N}\right)\right]\tag{19}$$

At this stage, it is important to discuss the complexity of these updates. A quick analysis shows that, in a low-rank context, the costly operations are products of the form $\mathbf{W}\mathbf{V}$ and $\mathbf{W}^{\top}\mathbf{H}$, both involving $\mathcal{O}(MNR)$ multiplications.
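The classical updates (18) and (19) translate directly into matrix operations; here is a minimal NumPy sketch (ours), with a small constant added to the denominators to avoid division by zero:

```python
import numpy as np

def mu_update_frobenius(V, W, H, eps=1e-12):
    """One MU step (18) on H for the squared Frobenius loss, V ~ W.T @ H."""
    return H * (W @ V) / (W @ (W.T @ H) + eps)

def mu_update_kl(V, W, H, eps=1e-12):
    """One MU step (19) on H for the KL divergence."""
    numerator = W @ (V / (W.T @ H + eps))
    denominator = W.sum(axis=1, keepdims=True)  # W @ 1_{M,N}: each column holds the row sums of W
    return H * numerator / (denominator + eps)
```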

The purpose of our work is to find a matrix $\mathbf{A}(\mathbf{x})$ (which boils down to finding a vector $\mathbf{u}$) that improves on the majorant proposed by Lee and Seung without increasing the computational cost of the updates. To this end, we notice from Proposition 1 that the approximate Hessians proposed by Lee and Seung are actually larger than the true Hessian matrices (or their approximation in the KL case). In order to reduce the curvature of the function $\xi(\mathbf{x},\cdot)$, and thus enable larger moves when minimizing it, we look for tighter upper bounds of the Hessian matrices by choosing another value for $\mathbf{u}$. To achieve this goal, we search for $\mathbf{u}$ such that the diagonal values of $\mathbf{A}(\mathbf{x})$ are as small as possible. Finding such a vector $\mathbf{u}$ can be addressed by minimizing the $\ell_{1}$ norm of these diagonal values:

$$\mathbf{u}^{*}\in\underset{\mathbf{0}_{R}\neq\mathbf{u}\geq 0}{\operatorname{argmin}}\ \sum_{j=1}^{R}\frac{\left(\nabla^{2}\theta(\mathbf{x})\mathbf{u}\right)_{j}}{u_{j}}\tag{20}$$

The following proposition shows that solutions to these problems are known in closed form.

Proposition 2

Let $\mathbf{B}$ be a symmetric matrix of size $R\times R$ with strictly positive entries. The optimization problem

$$\mathbf{u}^{*}=\underset{\mathbf{0}_{R}\neq\mathbf{u}\geq 0}{\operatorname{argmin}}\;\sum_{j=1}^{R}\frac{\left(\mathbf{B}\mathbf{u}\right)_{j}}{u_{j}}\tag{21}$$

admits solutions of the form:

$$\mathbf{u}^{*}=\alpha\mathbbm{1}_{R},\;\text{ with }\alpha>0.\tag{22}$$
Proof 2

We denote the cost function in (21) by $\varphi(\mathbf{u})$ and rewrite it as follows:

$$\varphi(\mathbf{u})=\sum_{j=1}^{R}\left(b_{j,j}+\sum_{i=1,i\neq j}^{R}\frac{b_{j,i}u_{i}}{u_{j}}\right)$$

First, note that for any $\mathbf{u}\in\mathbb{R}_{*}^{R}$ and $\alpha\neq 0$ we have $\varphi(\alpha\mathbf{u})=\varphi(\mathbf{u})$. Problem (21) can therefore be reformulated as follows:

$$\mathbf{u}^{*}=\underset{\mathbf{u}\geq 0}{\operatorname{argmin}}\;\sum_{j=1}^{R}\frac{\left(\mathbf{B}\mathbf{u}\right)_{j}}{u_{j}}\quad\text{ s.t. }\quad\sum_{j=1}^{R}u_{j}=1\tag{23}$$

The Lagrangian of this problem is given by

$$L(\mathbf{u},\boldsymbol{\mu},\lambda)=\varphi(\mathbf{u})+\lambda\left(\mathbbm{1}_{R}^{\top}\mathbf{u}-1\right)-\boldsymbol{\mu}^{\top}\mathbf{u}\tag{24}$$

and its partial derivatives with respect to the entries of $\mathbf{u}$ are

$$\begin{aligned}\frac{\partial L}{\partial u_{r}}(\mathbf{u},\boldsymbol{\mu},\lambda)&=\sum_{i=1,i\neq r}^{R}\frac{-b_{r,i}u_{i}}{u_{r}^{2}}+\sum_{j=1,j\neq r}^{R}\frac{b_{j,r}}{u_{j}}+\lambda-\mu_{r}\\&=\sum_{i=1}^{R}\left(\frac{-b_{r,i}u_{i}}{u_{r}^{2}}+\frac{b_{r,i}}{u_{i}}\right)+\lambda-\mu_{r}\\&=\mathbf{b}_{r}^{\top}\left(\mathbbm{1}_{R}\oslash\mathbf{u}-\frac{1}{u_{r}^{2}}\mathbf{u}\right)+\lambda-\mu_{r}\end{aligned}$$

We can see that for every $1\leq r\leq R$, $u_{r}\neq 0$, since otherwise $\varphi(\mathbf{u})=+\infty$; by complementary slackness, this implies that $\mu_{r}=0$ for all $1\leq r\leq R$. Now let $r_{\max}=\underset{1\leq r\leq R}{\operatorname{argmax}}\;u_{r}$ and $r_{\min}=\underset{1\leq r\leq R}{\operatorname{argmin}}\;u_{r}$. Using the KKT conditions, we have

$$\lambda=-\mathbf{b}_{r_{\min}}^{\top}\left(\mathbbm{1}_{R}\oslash\mathbf{u}-\frac{1}{u_{r_{\min}}^{2}}\mathbf{u}\right)\geq 0$$

and

$$\lambda=-\mathbf{b}_{r_{\max}}^{\top}\left(\mathbbm{1}_{R}\oslash\mathbf{u}-\frac{1}{u_{r_{\max}}^{2}}\mathbf{u}\right)\leq 0$$

Hence $\lambda=0$, $\boldsymbol{\mu}=\mathbf{0}_{R}$, and $\mathbf{u}=\alpha\mathbbm{1}_{R}$ is the only solution of the KKT conditions.

To show that these solutions are local minima, we compute the Hessian matrix:

$$\nabla^{2}L(\mathbf{u},\boldsymbol{\mu},\lambda)=\left(\frac{\partial^{2}\varphi(\mathbf{u})}{\partial u_{r}\partial u_{i}}\right)_{r=1,\ldots,R;\;i=1,\ldots,R}$$

where for each $r=1,\ldots,R$ and $i=1,\ldots,R$ we have

$$\frac{\partial^{2}\varphi(\mathbf{u})}{\partial u_{r}\partial u_{i}}=\begin{cases}\displaystyle\sum_{j=1,j\neq r}^{R}\frac{2b_{r,j}u_{j}}{u_{r}^{3}}&\text{ if }r=i,\\[2ex] -\left(\dfrac{1}{u_{i}^{2}}+\dfrac{1}{u_{r}^{2}}\right)b_{r,i}&\text{ otherwise}.\end{cases}$$

For any non-zero vector 𝐯R𝐯superscript𝑅{\bf v}\in{\mathbb{R}}^{R}bold_v ∈ blackboard_R start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT, we have

$$\mathbf{v}^{\top}\nabla^{2}\varphi(\alpha\mathbbm{1}_{R})\mathbf{v}=\frac{1}{\alpha^{2}}\sum_{r,i=1}^{R}\left(v_{r}-v_{i}\right)^{2}b_{r,i}\geq 0,\quad\forall\;\mathbf{v}\in\mathbb{R}^{R},$$

where equality holds if and only if $\mathbf{v}$ is a constant vector, which does not belong to the tangent space of the constraint. Therefore, the Hessian matrix $\nabla^{2}\varphi$ at any vector of the set $\{\alpha\mathbbm{1}_{R}\}_{\alpha>0}$ is positive definite over the tangent space, as long as $\mathbf{B}$ is a symmetric and entry-wise positive matrix. This completes the proof of Proposition 2.
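As a quick sanity check of Proposition 2 (this snippet is ours, not part of the proof), one can compare the objective of (21) at $\mathbf{u}=\mathbbm{1}_{R}$ with its value at random positive vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
R = 5
B = rng.random((R, R)) + 0.1
B = B + B.T                          # symmetric matrix with strictly positive entries

def objective(u):
    """Objective of (21): sum_j (B u)_j / u_j."""
    return np.sum((B @ u) / u)

value_at_ones = objective(np.ones(R))
random_values = [objective(rng.random(R) + 0.01) for _ in range(1000)]
print(value_at_ones <= min(random_values) + 1e-9)  # expected: True
```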

Finally, plugging the vector $\mathbf{u}$ crafted with Proposition 2 into an approximate Hessian $\operatorname{Diag}\left([\nabla^{2}\theta(\mathbf{x})\mathbf{u}]\oslash\mathbf{u}\right)$, we are able to provide new simple majorants of the true Hessians, denoted $\mathbf{A}_{\operatorname{fastMU}}(\mathbf{x})$, for each loss function:
1) When $\theta$ is the squared Frobenius norm:

$$\mathbf{A}_{\operatorname{fastMU}}(\mathbf{x})=\operatorname{Diag}\left[\mathbf{W}\mathbf{W}^{\top}\mathbbm{1}_{R}\right]\tag{25}$$

2) When $\theta$ is the KL divergence:

$$\mathbf{A}_{\operatorname{fastMU}}(\mathbf{x})=\operatorname{Diag}\left[\sum_{m=1}^{M}\frac{v_{m,n}\,\mathbf{w}_{m}\mathbf{w}_{m}^{\top}\mathbbm{1}_{R}}{(\mathbf{w}_{m}^{\top}\mathbf{x})^{2}}\right]\tag{26}$$

Importantly, computing either Hessian matrix $\mathbf{A}_{\operatorname{fastMU}}$ in (25) or (26) involves quantities similar to those appearing in the MU updates, which are also required to compute the gradients. Therefore, they can be computed at little additional cost compared to the classical MU Hessian matrices $\mathbf{A}_{\operatorname{MU}}$.
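A NumPy sketch (ours) of the diagonals of the fastMU majorants (25) and (26) for one column $\mathbf{x}=\mathbf{h}_{n}$; only the diagonal is stored, since that is all the update needs:

```python
import numpy as np

def fastmu_diag_frobenius(W):
    """Diagonal of A_fastMU in (25): (W W^T) 1_R, i.e. the row sums of W W^T."""
    return (W @ W.T).sum(axis=1)

def fastmu_diag_kl(W, v_n, x):
    """Diagonal of A_fastMU in (26): sum_m v_{m,n} (w_m^T 1_R) w_m / (w_m^T x)^2."""
    Wx = W.T @ x                     # entries w_m^T x, length M
    col_sums = W.sum(axis=0)         # entries w_m^T 1_R, length M
    return W @ (v_n * col_sums / Wx ** 2)
```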

Remark 1

The upper approximation of the Hessian is of the form $\operatorname{Diag}((\mathbf{B}\mathbf{u})\oslash\mathbf{u})$, with $\mathbf{B}=\nabla^{2}\theta(\mathbf{x})$. When we minimize $\|(\mathbf{B}\mathbf{u})\oslash\mathbf{u}\|_{1}$, we choose the element in that class of majorants with minimum mean value. Other choices can be considered. For instance, one may pick the $\mathbf{u}$ which leads to the largest possible step in all directions,

$$\underset{\mathbf{u}>0}{\operatorname{argmin}}\;\max_{1\leq r\leq R}\left(\frac{\mathbf{b}_{r}^{\top}\mathbf{u}}{u_{r}}\right)=\underset{\mathbf{u}>0}{\operatorname{argmin}}\;\|(\mathbf{B}\mathbf{u})\oslash\mathbf{u}\|_{\infty}\tag{27}$$

We can show that the solution to this minmax problem is any eigenvector associated with the largest singular value of $\mathbf{B}$ (see Appendix Minmax problem for more details). This means that under this criterion, fastMU boils down to gradient descent with step size $1/\mu$, where $\mu$ is the local Lipschitz constant.

4 Final proposed algorithm

In the previous section, we showed how to compute diagonal approximate Hessian matrices for updating the iterates of the NMF subproblems. Let us now detail how to make use of these approximate Hessian matrices. We propose to use an inexact Variable Metric Forward-Backward algorithm [6], which combines an MM strategy with a weighted proximity operator. This framework makes it easy to include nonnegativity constraints through projection, but other penalizations with tractable proximity operators could easily be considered instead. Within the scope of this work, we only consider nonnegativity constraints.

4.1 The fastMU Algorithm

We present in Algorithm 1 the forward-backward algorithm [6, 4, 21, 3, 32] used to minimize the function $\Psi$. At each iteration $k\in\mathbb{N}$, $\mathbf{H}^{k}$ (resp. $\mathbf{W}^{k}$) is updated with one step of gradient descent (forward step) on $\Psi(\mathbf{W}^{k},\cdot)$ (resp. $\Psi(\cdot,\mathbf{H}^{k})$), followed by a projection onto the $\epsilon$-nonnegative orthant, i.e. the set of matrices with entries larger than $\epsilon$, with $\epsilon>0$ (backward step). For $\left(\mathbf{W}^{k},\mathbf{H}^{k}\right)\in\mathbb{R}^{R\times M}\times\mathbb{R}^{R\times N}$, $\mathbf{Z}_{\operatorname{fastMU}}^{\mathbf{H}}(\mathbf{W}^{k},\mathbf{H}^{k})\in\mathbb{R}^{R\times N}$ denotes a "step-size"-like matrix whose $n$-th column is given by:

$${\bf z}^{\bf H}_{\operatorname{fastMU}}({\bf W}^k,{\bf h}_n^k)=\operatorname{diag}\left[{\bf A}_{\operatorname{fastMU}}({\bf h}_n)\right]. \qquad (28)$$

Following the same principle, ${\bf Z}_{\operatorname{fastMU}}^{\bf W}({\bf W}^k,{\bf H}^k)\in{\mathbb{R}}^{R\times M}$ denotes the "step-size"-like matrix for $\Psi({\bf W}^k,{\bf H}^k)$ at ${\bf W}^k$. Moreover, $\nabla_{\bf H}\Psi({\bf W}^k,{\bf H}^k)\in{\mathbb{R}}^{R\times N}$ and $\nabla_{\bf W}\Psi({\bf W}^k,{\bf H}^k)\in{\mathbb{R}}^{R\times M}$ denote the partial gradients of $\Psi$ at $({\bf W}^k,{\bf H}^k)$ with respect to ${\bf H}$ and ${\bf W}$, respectively. We observed in preliminary experiments that fastMU with the KL loss was sensitive to initialization; we therefore recommend starting from good estimates, e.g. refining a random initialization with one iteration of the classical MU algorithm.
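As an illustration of this initialization refinement, here is a minimal sketch of one classical MU iteration for the KL loss, written in the paper's convention ${\bf V}\approx{\bf W}^\top{\bf H}$ with ${\bf W}\in{\mathbb{R}}_+^{R\times M}$ and ${\bf H}\in{\mathbb{R}}_+^{R\times N}$; the clipping at a small `eps` mirrors the $\epsilon$-projection of Algorithm 1, and the function name and scaffolding are ours, not the authors' implementation.

```python
import numpy as np

def mu_kl_refine(V, W, H, eps=1e-16):
    # One iteration of the classical Lee-Seung multiplicative updates for the
    # KL loss, in the convention V ~ W^T H (W is R x M, H is R x N).
    WH = np.maximum(W.T @ H, eps)
    H = np.maximum(H * (W @ (V / WH))
                   / np.maximum(W.sum(axis=1, keepdims=True), eps), eps)
    WH = np.maximum(W.T @ H, eps)
    W = np.maximum(W * (H @ (V / WH).T)
                   / np.maximum(H.sum(axis=1, keepdims=True), eps), eps)
    return W, H
```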

Algorithm 1 fastMU
Require: ${\bf V}\in{\mathbb{R}}_+^{M\times N}$, $R\in{\mathbb{N}}^*$ with $R\leq\min(M,N)$, $\epsilon>0$.
Require: ${\bf W}^0\geq\epsilon$ and ${\bf H}^0\geq\epsilon$. For every $k\in{\mathbb{N}}$, let $J_k\in{\mathbb{N}}$ and $I_k\in{\mathbb{N}}$, and let $(\gamma_{\bf H}^{k,j})_{0\leq j\leq J_k-1}$ and $(\gamma_{\bf W}^{k,i})_{0\leq i\leq I_k-1}$ be positive sequences. Refine ${\bf W}^0,{\bf H}^0$ with one iteration of MU for the KL loss.
1:  for $k=0,1,\ldots$ do
2:    ${\bf H}^{k,0}={\bf H}^k$, ${\bf W}^{k,0}={\bf W}^k$
3:    for $j=0,\ldots,J_k-1$ do
4:      ${\bf H}^{k,j+1}=\max\left({\bf H}^{k,j}-\gamma_{\bf H}^{k,j}\,\nabla_{\bf H}\Psi({\bf W}^k,{\bf H}^{k,j})\oslash{\bf Z}_{\operatorname{fastMU}}^{\bf H}({\bf W}^k,{\bf H}^{k,j}),\ \epsilon\right)$
5:    end for
6:    ${\bf H}^{k+1}={\bf H}^{k,J_k}$
7:    for $i=0,\ldots,I_k-1$ do
8:      ${\bf W}^{k,i+1}=\max\left({\bf W}^{k,i}-\gamma_{\bf W}^{k,i}\,\nabla_{\bf W}\Psi({\bf W}^{k,i},{\bf H}^{k+1})\oslash{\bf Z}_{\operatorname{fastMU}}^{\bf W}({\bf W}^{k,i},{\bf H}^{k+1}),\ \epsilon\right)$
9:    end for
10:   ${\bf W}^{k+1}={\bf W}^{k,I_k}$
11: end for
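To make the structure of Algorithm 1 concrete, the sketch below implements one inner update of ${\bf H}$ for the Frobenius loss $\Psi({\bf W},{\bf H})=\frac12\|{\bf V}-{\bf W}^\top{\bf H}\|_F^2$ in NumPy. The diagonal majorant used here is a placeholder: we take the classical Lee-Seung choice $({\bf W}{\bf W}^\top{\bf H})\oslash{\bf H}$, whereas fastMU substitutes the tighter matrices ${\bf Z}_{\operatorname{fastMU}}$ derived in Section 3; the function names and the scaffolding are ours, not the paper's code.

```python
import numpy as np

def grad_H(V, W, H):
    # Partial gradient of 0.5 * ||V - W^T H||_F^2 with respect to H.
    return W @ (W.T @ H - V)

def step_size_matrix(W, H):
    # Placeholder diagonal majorant (Lee-Seung choice): column n holds the
    # diagonal of Diag((B h_n) ./ h_n) with B = W W^T. fastMU would return
    # its tighter Z_fastMU^H here instead.
    return (W @ W.T @ H) / H

def inner_update_H(V, W, H, gamma=1.9, eps=1e-16):
    # One forward-backward step: scaled gradient step, then clipping at eps
    # (the proximity operator of the epsilon-nonnegativity constraint).
    Z = step_size_matrix(W, H)
    return np.maximum(H - gamma * grad_H(V, W, H) / Z, eps)

# Minimal usage example on random positive data.
rng = np.random.default_rng(0)
M, N, R = 50, 40, 5
W, H = rng.random((R, M)) + 1e-16, rng.random((R, N)) + 1e-16
V = W.T @ H
H = inner_update_H(V, W, H)
```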

4.2 Convergence guarantee

It may be surprising that Algorithm 1 imposes ${\bf W}\geq\epsilon$ and ${\bf H}\geq\epsilon$ for a small given $\epsilon$, typically $10^{-16}$ in our implementation. In fact, this means that Algorithm 1 solves a slightly modified NMF problem in which the nonnegativity constraint is replaced by the constraint "greater than $\epsilon$". However, this is both a standard operation in recent versions of MU algorithms, where it ensures the convergence of the MU iterates [29, 11], and a necessary operation for the proposed approximate Hessians to always be invertible. The clipping operator is in fact the proximity operator of the modified constraint, so the proposed algorithm is still an instance of a forward-backward algorithm for plain NMF. Convergence guarantees with other regularizations are not provided in this work, but they should be easily obtained for a few classical regularizations such as the $\ell_1$ norm.

The convergence of Algorithm 1, when the cost is penalized with the characteristic function of the $\epsilon$-nonnegative orthant, can be derived from a general result established in [6].

Proposition 3 ([6]). Let $({\bf W}^k)_{k\in{\mathbb{N}}}$ and $({\bf H}^k)_{k\in{\mathbb{N}}}$ be sequences generated by Algorithm 1 with ${\bf Z}^{\bf H}_{\operatorname{fastMU}}({\bf W}^k,{\bf H}^{k,j})$ and ${\bf Z}^{\bf W}_{\operatorname{fastMU}}({\bf W}^{k,i},{\bf H}^{k+1})$ given by (28). Assume that:

1) There exists $(\underline{\nu},\overline{\nu})\in\,]0,+\infty[^2$ such that, for all $k\in{\mathbb{N}}$,

$$(\forall\, 0\leq j\leq J_k-1)\quad \underline{\nu}\leq{\bf Z}_{\operatorname{fastMU}}^{\bf H}({\bf W}^k,{\bf H}^{k,j})\leq\overline{\nu},$$
$$(\forall\, 0\leq i\leq I_k-1)\quad \underline{\nu}\leq{\bf Z}_{\operatorname{fastMU}}^{\bf W}({\bf W}^{k,i},{\bf H}^{k+1})\leq\overline{\nu}.$$

2) The step-sizes $(\gamma_{\bf H}^{k,j})_{k\in{\mathbb{N}},\,0\leq j\leq J_k-1}$ and $(\gamma_{\bf W}^{k,i})_{k\in{\mathbb{N}},\,0\leq i\leq I_k-1}$ are chosen in an interval $[\underline{\gamma},\overline{\gamma}]$, where $\underline{\gamma}$ and $\overline{\gamma}$ are given positive real constants with $0<\underline{\gamma}<\overline{\gamma}<2$.

Then, the sequence $({\bf W}^k,{\bf H}^k)_{k\in{\mathbb{N}}}$ converges to a critical point $(\widehat{{\bf W}},\widehat{{\bf H}})$ of problem (2). Moreover, $(\Psi({\bf W}^k,{\bf H}^k))_{k\in{\mathbb{N}}}$ is a nonincreasing sequence converging to $\Psi(\widehat{{\bf W}},\widehat{{\bf H}})$.

4.3 Dynamic stopping of inner iterations

Algorithm 1 is a barebone set of instructions for the proposed fastMU algorithm. However, to minimize the time spent running the algorithm, we suggest using a dynamic stopping criterion for the inner iterations, inspired by the literature on Hierarchical Alternating Least Squares (HALS) for computing NMF [12], instead of fixed numbers of inner iterations $I_k$ and $J_k$.

The proof of convergence for fastMU allows any number of inner iterations to be run. In practice, however, stopping the inner iterations early avoids needlessly refining the last digits of a factor. The time saved can then be used more efficiently to update the other factor, and so on until convergence is observed.

Early stopping strategies have been described in the literature, and we use the same strategy as the accelerated HALS algorithm [12], which we describe next. Suppose factor ${\bf H}$ is being updated. At each inner iteration $j$, the squared $\ell_2$ norm of the factor update $\eta_j:=\|{\bf H}^{k,j+1}-{\bf H}^{k,j}\|_F^2$ is computed. In principle, this norm should be large for the first inner iterations and should decrease toward zero as the inner loop converges. Therefore one may stop the inner iterations when the factor update does not change much relative to the first inner iteration, i.e. when the following condition holds:

$$\eta_j<\delta\,\eta_0 \qquad (29)$$

where $\delta<1$ is a tolerance value defined by the user. The optimal value of $\delta$ for our algorithm is studied experimentally in Section 5. Note that this acceleration adds $\mathcal{O}((N+M)R)$ operations to the complexity of the algorithm, which is an order of magnitude smaller than the inner loop complexity.

Moreover, unless specified otherwise, we use an aggressive constant stepsize $\gamma=1.9$.
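Building on the hypothetical `inner_update_H` from the sketch after Algorithm 1, the following snippet illustrates how the stopping rule (29) can wrap the inner loop, with the default $\delta=0.1$, the constant stepsize $\gamma=1.9$ mentioned above, and the cap of 100 inner iterations used in Section 5.1; it is an illustration of the criterion, not the authors' implementation.

```python
import numpy as np

def update_H_with_early_stopping(V, W, H, delta=0.1, gamma=1.9,
                                 max_inner=100, eps=1e-16):
    # Run inner iterations on H until the squared norm of the update falls
    # below delta times the squared norm of the first update, cf. (29).
    eta_0 = None
    for j in range(max_inner):
        H_new = inner_update_H(V, W, H, gamma=gamma, eps=eps)
        eta_j = np.linalg.norm(H_new - H) ** 2
        H = H_new
        if eta_0 is None:
            eta_0 = eta_j
        elif eta_j < delta * eta_0:
            break
    return H
```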

5 Experimental results

All the experiments and figures below can be reproduced by running the code shared in the repository attached to this project: https://github.com/cohenjer/MM-nmf.

Let us summarize the various hyperparameters of the proposed fastMU algorithm:

  • The loss function may be the KL divergence or the Frobenius norm, and we denote the proposed algorithm by fastMU-KL and fastMU-Fro, respectively. Standard algorithms to compute NMF are typically not the same for these two losses. In particular, for the Frobenius norm, it is reported in the literature [11] that accelerated Hierarchical Alternating Least Squares [12] is state-of-the-art, while for the KL divergence the standard MU algorithm proposed by Lee and Seung is still one of the best performing methods [14].

  • The number of inner iterations needs to be tuned. A small number of inner iterations leads to faster outer loops, but a larger number of inner iterations can help decrease the cost more effectively. A dynamic inner stopping criterion has been successfully used in the literature, parameterized by a scalar $\delta$ defined in equation (29) that needs to be tuned.

5.1 Tuning the fastMU inner stopping hyperparameter

Below, we conduct synthetic tests in order to decide, on a limited set of experiments, the best choice of the hyperparameter $\delta$ for both supported losses. We then fix this parameter for all subsequent comparisons.

5.1.1 Synthetic data setup

To generate synthetic data with approximately low nonnegative rank, the factor matrices ${\bf W}$ and ${\bf H}$ are sampled elementwise from i.i.d. uniform distributions on $[0,1]$. The chosen dimensions $M,N$ and the rank $R$ vary among the experiments and are reported directly on the figures. The synthetic data is formed as ${\bf V}={\bf W}^\top{\bf H}+\sigma{\bf E}$, where ${\bf E}$ is a noise matrix also sampled elementwise from a uniform distribution over $[0,1]$ (we use uniform noise with both the KL and Frobenius losses to avoid negative entries in ${\bf V}$, even though i) these losses do not correspond to maximum likelihood estimators for this noise model and ii) uniform noise induces bias). Given a realization of the data and the noise, the noise level $\sigma$ is chosen so that the Signal to Noise Ratio (SNR) is fixed to a user-defined value reported on the figures or in the captions. Note that no normalization is performed on the factors or the data. Initial values for all algorithms are identical and generated randomly in the same way as the true underlying factors.
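For reference, a minimal sketch of this data-generation procedure is given below, assuming the SNR is defined as $10\log_{10}\left(\|{\bf W}^\top{\bf H}\|_F^2/\|\sigma{\bf E}\|_F^2\right)$; the exact convention used in the paper's code may differ.

```python
import numpy as np

def make_synthetic_nmf_data(M, N, R, snr_db, rng=None):
    # Sample nonnegative factors elementwise from U[0, 1].
    rng = np.random.default_rng(rng)
    W = rng.random((R, M))
    H = rng.random((R, N))
    signal = W.T @ H
    # Uniform noise, scaled so that 10*log10(||signal||^2 / ||sigma*E||^2) = snr_db.
    E = rng.random((M, N))
    sigma = np.linalg.norm(signal) / (np.linalg.norm(E) * 10 ** (snr_db / 20))
    V = signal + sigma * E
    return V, W, H

# Example: the small configuration used in Section 5.1.
V, W_true, H_true = make_synthetic_nmf_data(M=200, N=100, R=5, snr_db=100)
```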

The results are collected over a number $P$ of realizations of the above setup. Unless specified otherwise, we chose $P=5$. The maximum number of outer iterations is in general set large enough to observe convergence, by default 20000. We report loss function values normalized by the product of the dimensions $M\times N$. Experiments are run using the shootout toolbox [8] to ensure that all intermediary results are stored and that the experiments are reproducible.

5.1.2 Early stopping tuning

In this experiment, the dynamic inner stopping tolerance $\delta$ is chosen in $\{0, 0.001, 0.01, 0.05, 0.1, 0.3, 0.6, 0.9\}$ with a maximum number of inner iterations set to 100. When $\delta=0$ the maximum number of inner iterations is reached (i.e. 100), while for $\delta=0.9$ we generally observed only one or two inner iterations; see the median number of inner iterations in Figure 1. The total number of outer iterations is set depending on $\delta$ so that the execution times of all tests are similar. Figure 2 shows the median normalized cost function against time for both the Frobenius norm and the KL divergence.

Figure 1: Median plot of the number of inner iterations vs the outer iteration index for the Frobenius loss (top) and the KL loss (bottom), computed with fastMU. Various values of the inner stopping criterion parameter $\delta$ are shown in color. The dimensions are $[M,N,R]=[200,100,5]$ and the SNR is 100. We can observe that increasing the parameter $\delta$ directly influences the number of inner iterations, as expected.

Figure 2: Median plot of normalized cost function values vs time for the Frobenius loss (top) and the KL loss (bottom), computed with fastMU. Various values of the inner stopping criterion parameter $\delta$ are shown in color. Dimensions and SNR are the same as in Figure 1. We can observe a significant speed-up when $\delta$ is above zero, with maximum speed-up around $[0.05,0.1]$.

We may observe that, first, the choice of $\delta$ directly impacts the number of inner iterations on a per-problem basis, which confirms that the dynamic inner stopping criterion stops inner iterations adaptively. Second, while there is no clear winner for $\delta$ in the interval $[0.01,0.3]$ and a lot of variability, we observe a significant decrease in performance outside that range. Therefore, according to these tests, we set $\delta=0.1$ by default for both losses. Alternatively, we could fix the maximal number of inner iterations to ten, since this would lead to similar performance in this experiment.

5.2 Comparisons with other algorithms

5.2.1 Baseline algorithms

We compare our implementation of the proposed fastMU algorithm with the following methods:

  • Multiplicative Updates (MU) as defined by Lee and Seung [18]. MU is still reported as the state-of-the-art for many problems, in particular when dealing with the KL loss [14]. Thus we use it as a baseline for the proposed fastMU.

  • Hierarchical Alternating Least Squares (HALS), which only applies to the Frobenius loss. We used a customized version of the implementation provided in nn-fac [22].

  • Nesterov NMF [13] (NeNMF), which is an extension of Nesterov's fast gradient method for computing NMF. We observed that NeNMF produced divergent iterates when used to minimize the KL loss, so it is only used with the Frobenius loss in our experiments.

  • Vanilla alternating proximal Gradient Descent (GD) with an aggressive stepsize $1.9/L$, where $L$ is the Lipschitz constant, also only for the Frobenius loss. GD was unreasonably slow for KL in our experiments, in particular because the Lipschitz constant is not cheap to compute.

To summarize, for the Frobenius loss we evaluate the performance of MU, HALS, NeNMF, and GD against the proposed algorithm fastMU_Fro. For KL, we compare the classical MU algorithm with the proposed fastMU_KL. Note that all methods are implemented with the same criterion for stopping inner iterations, with $\delta=0.1$.

Moreover, in order to further compare the proposed algorithm with the baselines, we also solve the (nonlinear) Nonnegative Least Squares (NLS) problem obtained when the factor ${\bf W}$ is known and fixed (to the ground truth if known). The cost function is then convex and we expect all methods to return the same minimal error. The NLS problem is formally defined as

$$\min_{{\bf H}\in{\mathbb{R}}_+^{R\times N}}\ \Psi({\bf W},{\bf H}) \qquad (30)$$

5.2.2 Comparisons on synthetic dataset

Experiments with the Frobenius loss

Synthetic data are generated with the same procedure as in Section 5.1. All methods start from the same random initialization for each run. Two sets of dimensions are used: $[M,N,R]=[1000,400,20]$ and $[M,N,R]=[200,100,5]$. We set the SNR to 100 dB for the NMF problem, and to 100 dB or 30 dB for the NLS problem.

Figure 3: Median plots for the synthetic NLS problem with the Frobenius loss, with $[M,N,R]=[1000,400,20]$ (left) and $[M,N,R]=[200,100,5]$ (right), for SNR $=30$ dB (top) and SNR $=100$ dB (bottom). The normalized cost function is plotted against time for various methods, and $P=10$ realizations are computed.

Figure 4: Median plots for the synthetic NMF problem with the Frobenius loss, with $[M,N,R]=[1000,400,20]$ (left) and $[M,N,R]=[200,100,5]$ (right). The normalized cost function is plotted against time for various methods. The SNR is 100.

Figures 3 and 4 report the results of the Frobenius loss experiments. We may draw several conclusions from these experiments. First, fastMU is competitive with the baselines and in particular significantly outperforms MU. In this experiment, fastMU is also faster than HALS for the larger dimensions and rank. Second, the extrapolated algorithm NeNMF is overall the best performing method for both the NMF and NLS problems.

The performance of fastMU is very close to that of vanilla Gradient Descent on the NLS problem. This may happen when the columns of ${\bf W}$ are close to orthogonal (in which case the Hessian is approximately the identity matrix), but note that the two methods perform differently on the NMF problem, where the ground truth ${\bf W}$ is not provided. Gradient descent performs less favorably on realistic datasets where the factor matrices are correlated, see Section 5.2.3.

Experiments with the KL loss

The data generation is more complex for the KL loss. Overall, we follow the same procedure as described in Section 5.1. However, because we observed that sparsity has an important impact on MU performance when minimizing the KL loss, we design four different dense/sparse setups:

  • setup dense: no sparsification is applied, and the factors and the data are dense (more precisely, we use $\epsilon$ for the factor entries and $R\epsilon^2$ for the data, with $\epsilon=10^{-8}$, which are the smallest values the algorithms can reconstruct).

  • setup fac sparse: the smallest half of the entries of both ${\bf W}$ and ${\bf H}$ are set to zero.

  • setup data sparse: after setting ${\bf V}={\bf W}^\top{\bf H}$, the fifty percent smallest entries of ${\bf V}$ are set to zero. The NMF model is therefore only approximate and the residuals are in fact much larger than in the fac sparse and dense setups.

  • setup fac data sparse: we sparsify both the factors and the data with the above procedure.

In all setups, noise is added to the data, after sparsification when it applies, so that the SNR is 100. The dimensions are $[M,N,R]=[200,100,5]$ for the NMF problem, and $[M,N,R]=[2000,1000,40]$ for the NLS problem, in order to test the impact of the dimensions on this simpler problem. A sketch of the sparsification step is given below.
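A minimal sketch of the sparsification used in the fac sparse and data sparse setups, assuming "smallest half" refers to a global threshold at the median of the entries (the paper's code may use a different convention):

```python
import numpy as np

def sparsify_smallest_half(X):
    # Set the 50% smallest entries of X to zero (global median threshold).
    threshold = np.median(X)
    X_sparse = X.copy()
    X_sparse[X_sparse < threshold] = 0.0
    return X_sparse
```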

Figure 5: Median plots for the synthetic experiments on the NLS problem with the KL loss, showing the normalized cost function against time.

Figure 6: Median plots for the synthetic experiments on the NMF problem with the KL loss, showing the normalized cost function against time.

Figures 5 and 6 report the KL loss of all compared methods against time for both the NLS and NMF problems in each setup. The fastMU_KL algorithm converges faster than MU in all the sparse and dense cases for both NLS and NMF problems, and significantly so when the data is not sparsified.

5.2.3 Comparisons on realistic datasets

We used the following datasets to further the comparisons:

  • An amplitude spectrogram of dimensions $M=1000$ and $N=1450$ computed from an audio excerpt of the MAPS dataset [9]. More precisely, we hand-picked the file MAPS_MUS_bach-847_AkPnBcht.wav, which is a recording of Bach's second Prelude played on a Yamaha Disklavier. Only the first thirty seconds of the recording are kept, and we also discard all frequencies above 5300 Hz. This is a moderately difficult piano piece, and NMF has been used in this context to perform blind automatic music transcription, see [1] for more details. Taking inspiration from a model called Attack-Decay [5], in which each rank-one component corresponds to either the attack or the decay of a different key on the piano, we suppose that the amplitude spectrogram is well approximated by an NMF of rank $R=176$, i.e. twice the number of keys on an acoustic piano keyboard.

  • A hyperspectral image of dimensions $N=307^2$ and $M=162$ called "Urban", which has been used extensively in the blind spectral unmixing literature to showcase the efficiency of various NMF-based methods, see for instance the survey [2]. Blind spectral unmixing consists in recovering the spectra of the materials present in the image as well as their relative spatial concentrations. These quantities are in principle estimated as rank-one components of the NMF. It is reported that Urban has between four and six components; we therefore set $R=6$.

In what follows we are not interested in the interpretability of the results, since the usefulness of NMF for automatic music transcription and spectral unmixing has already been established in previous works and we do not propose any novelty in the modeling. Instead, we investigate whether fastMU and its variants are faster than their competitors at computing NMF on these data.

For both datasets, a reasonable ground truth for the ${\bf W}$ factor is available, which allows us to also compare the algorithms on the NLS problem. Indeed, in the audio transcription problem, the MAPS database contains individual recordings of each key played on the Disklavier, which we can use to infer ${\bf W}$ as detailed in [31]. In the blind spectral unmixing problem, we use a "fake" ground truth obtained in [34], where six pixels containing the unmixed spectra have been hand-chosen.

Figure 7: Results of the audio experiment with the KL loss. Normalized loss values for $P=10$ experiments are plotted against time. Top: NLS problem; bottom: NMF problem.

Figure 8: Results of the audio experiment with the Frobenius loss. Normalized loss values for $P=10$ experiments are plotted against time. Top: NLS problem; bottom: NMF problem. Notice the good performance of the vanilla MU algorithm on this dataset.

Figure 9: Results of the HSI experiment with the Frobenius loss. Normalized loss values for $P=10$ experiments are plotted against time. Top: NLS problem; bottom: NMF problem.

Figure 10: Results of the HSI experiment with the KL loss. Normalized loss values for $P=10$ experiments are plotted against time. Top: NLS problem; bottom: NMF problem.

Figures 7, 8, 9 and 10 show the KL and Frobenius loss function values over time for the audio spectrogram and the hyperspectral image. All experiments use $P=5$ different initializations. Since a reasonable ground truth is available for these problems, we initialize the matrix ${\bf W}$ as this ground truth plus $0.1\eta$, where $\eta$ is an i.i.d. uniform noise matrix, to stabilize the runs. The NMF HSI (HyperSpectral Image) problem required more outer iterations; the maximum was therefore set to 200.

The results are not entirely consistent with the observations made in the synthetic experiments:

  • Frobenius loss: for the audio data, HALS is the best performing method, as is generally reported in the literature, followed by MU. When computing NMF, fastMU_Fro is a close competitor but does not outperform MU. Nevertheless, fastMU improves significantly over MU in the HSI experiment and is competitive with HALS. Moreover, and interestingly, fastMU_Fro performs significantly better than NeNMF on both realistic datasets, while the two algorithms performed similarly on the simulated data. On the NLS problem the results are more ambiguous, but overall fastMU remains a reasonable competitor to HALS.

  • KL loss: we observe that fastMU is significantly faster than MU in the HSI experiment, but not in the audio experiment. This may be explained in light of the synthetic experiments: the HSI data is dense with sparse underlying factors, while the audio data is sparse with sparse factors and large residuals. Further research on fastMU seems required to better understand how to speed up the method for sparse datasets when the residuals are large.

6 Conclusions

In this work, we propose a tighter upper bound of the Hessian matrix to improve the convergence of the MU algorithm for the Frobenius and KL losses. The proposed algorithm, coined fastMU, shows promising performance on both synthetic and real-world data. While, in the experiments we conducted, fastMU is not always better than the MU algorithm proposed by Lee and Seung, in many cases it is significantly faster. Moreover, it is also competitive with HALS for the Frobenius loss, which is one of the state-of-the-art methods for NMF in that setting.

There are many promising research avenues for fastMU. This paper shows how to build a family of majorant functions that contains the Lee and Seung majorant, and how to compute a good majorant in that family. While we chose the best majorant in that family according to some criteria, other criteria could be explored and might produce faster algorithms; it is anticipated that an efficient algorithmic procedure to compute a better majorant could be designed. One could also search for other majorant families that are even tighter to the original objective, or families of majorants from which sampling is faster. Finally, the algorithm could be improved to better handle sparse data matrices and large residuals, in particular for the KL divergence loss.

Proof of Proposition 1

It is straightforward to prove that a square matrix ${\bf K}$ is positive semi-definite if and only if ${\bf C}=\operatorname{Diag}({\bf u})\,{\bf K}\,\operatorname{Diag}({\bf u})$ is positive semi-definite for any vector ${\bf u}$ with nonzero entries. Therefore, proving Proposition 1 amounts to proving that ${\bf C}=\operatorname{Diag}({\bf u})\left(\operatorname{Diag}\left(({\bf B}{\bf u})\oslash{\bf u}\right)-{\bf B}\right)\operatorname{Diag}({\bf u})$ is positive semi-definite for any ${\bf u}>0$. Indeed, for any ${\bf x}\in{\mathbb{R}}^N$ with ${\bf x}\neq 0$ we have

$${\bf x}^\top{\bf C}{\bf x} = \sum_{1\leq n\leq N}({\bf B}{\bf u})_n\, u_n x_n^2 - \sum_{1\leq n,m\leq N} b_{n,m}\, u_n u_m x_n x_m$$
$$= \sum_{1\leq n,m\leq N} b_{n,m}\, u_n u_m x_n^2 - \sum_{1\leq n,m\leq N} b_{n,m}\, u_n u_m x_n x_m$$
$$= \sum_{1\leq n,m\leq N} b_{n,m}\, u_n u_m \frac{(x_n-x_m)^2}{2} \geq 0,$$
where the last equality uses the symmetry of ${\bf B}$, and the final sum is nonnegative since the entries of ${\bf B}$ and ${\bf u}$ are nonnegative.

This concludes the proof.
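As a quick numerical sanity check of this result (not part of the original argument), one can verify on random instances that $\operatorname{Diag}(({\bf B}{\bf u})\oslash{\bf u})-{\bf B}$ has no negative eigenvalue when ${\bf B}$ is symmetric with nonnegative entries and ${\bf u}>0$:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    R = rng.integers(2, 8)
    A = rng.random((R, R))
    B = A + A.T                      # symmetric, entrywise nonnegative
    u = rng.random(R) + 1e-3         # strictly positive vector
    M = np.diag((B @ u) / u) - B     # candidate PSD matrix from Proposition 1
    assert np.min(np.linalg.eigvalsh(M)) > -1e-10
```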

Minmax problem

This appendix shows that the solution of the minmax problem (27) is any eigenvector associated with the largest singular value of the matrix ${\bf B}$. Indeed, assume that ${\bf u}^*$ is a solution of (27); we will show that, necessarily, for all $1\leq r\leq R$,

$$\frac{{\bf b}_r^\top{\bf u}^*}{u_r^*} = t^* = \min_{{\bf u}>0}\;\|({\bf B}{\bf u})\oslash{\bf u}\|_\infty.$$

If this is not the case, we can show that $({\bf u}^*,t^*)$ is not optimal. Indeed, let $I=\{1\leq i\leq R:\ ({\bf b}_i^\top{\bf u}^*)/u_i^*=t^*\}$ and suppose that there exists an index $1\leq j\leq R$, $j\notin I$, such that

$$\frac{{\bf b}_i^\top{\bf u}^*}{u_i^*} > \frac{{\bf b}_j^\top{\bf u}^*}{u_j^*} \geq \frac{{\bf b}_r^\top{\bf u}^*}{u_r^*}\quad \forall i\in I \text{ and } \forall r\notin I.$$

Since $\frac{{\bf b}_i^\top{\bf u}}{u_i}$ is a strictly decreasing function of $u_i$ and a strictly increasing function of $u_j$ for $j\neq i$, we can find a vector $\bar{{\bf u}}=(\bar{u}_i)_{1\leq i\leq R}$ with

$$\text{for } 1\leq i\leq R,\quad \bar{u}_i=\begin{cases} u_i^*+\xi & \text{if } i\in I\\ u_i^* & \text{otherwise}\end{cases}$$

such that, for $\xi>0$ small enough, the following inequalities hold:

$$t^* > \frac{{\bf b}_i^\top\bar{{\bf u}}}{\bar{u}_i} > \frac{{\bf b}_j^\top\bar{{\bf u}}}{\bar{u}_j} \geq \frac{{\bf b}_r^\top\bar{{\bf u}}}{\bar{u}_r}\quad \forall i\in I \text{ and } \forall r\notin I.$$

This means that $({\bf u}^*,t^*)$ is not an optimal point of the minmax problem, a contradiction. Therefore, it must hold that

$${\bf B}{\bf u}^* = t^*{\bf u}^*,$$

which we can write as an eigenvalue problem

$$({\bf B}-t^*{\bf I})\,{\bf u}^*=0.$$

We look for the largest eigenvalue $t^*$ of ${\bf B}$ such that an associated eigenvector ${\bf u}^*$ is nonnegative. By the Perron-Frobenius theorem, the eigenvector ${\bf z}$ associated with the largest eigenvalue is nonnegative (strictly positive if ${\bf B}$ is entrywise strictly positive). Since all other eigenvectors are orthogonal to ${\bf z}$, they cannot satisfy the nonnegativity constraint. Therefore, the only admissible set of solutions is $\alpha{\bf z}$, $\alpha>0$.

This means that $({\bf B}{\bf u}^*)\oslash{\bf u}^*=\mu\mathbb{1}_R$, where $\mu$ is the maximum eigenvalue of ${\bf B}$. FastMU is then a simple scaled gradient descent algorithm.
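A small numerical illustration of this result, under the assumption that ${\bf B}$ is symmetric and entrywise positive (as for the Hessian-like matrices considered here): the Perron eigenvector makes all the ratios in $({\bf B}{\bf u})\oslash{\bf u}$ equal to the largest eigenvalue, while any other positive vector yields a larger worst-case ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
R = 6
A = rng.random((R, R)) + 0.1
B = A + A.T                               # symmetric, entrywise positive

eigvals, eigvecs = np.linalg.eigh(B)
mu = eigvals[-1]                          # largest eigenvalue
z = np.abs(eigvecs[:, -1])                # Perron eigenvector (nonnegative)

ratios = (B @ z) / z
print(np.allclose(ratios, mu))            # True: all ratios equal mu
u = rng.random(R) + 0.1                   # some other positive vector
print(np.max((B @ u) / u) >= mu)          # True: worst-case ratio is at least mu
```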

References

  • [1] Emmanouil Benetos, Simon Dixon, Zhiyao Duan, and Sebastian Ewert. Automatic music transcription: An overview. IEEE Signal Processing Magazine, 36(1):20–30, 2018.
  • [2] José M Bioucas-Dias, Antonio Plaza, Nicolas Dobigeon, Mario Parente, Qian Du, Paul Gader, and Jocelyn Chanussot. Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE journal of selected topics in applied earth observations and remote sensing, 5(2):354–379, 2012.
  • [3] J. Bolte, P. L. Combettes, and J.-C. Pesquet. Alternating proximal algorithm for blind image recovery. In 2010 IEEE International Conference on Image Processing. IEEE, sep 2010.
  • [4] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, jul 2013.
  • [5] Tian Cheng, Matthias Mauch, Emmanouil Benetos, Simon Dixon, et al. An attack/decay model for piano transcription. In ISMIR 2016 proceedings, 2016.
  • [6] Emilie Chouzenoux, Jean-Christophe Pesquet, and Audrey Repetti. A block coordinate variable metric forward-backward algorithm. Journal of Global Optimization, pages 1–29, February 2016.
  • [7] Andrzej Cichocki, Rafal Zdunek, and Shun-ichi Amari. Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In International Conference on Independent Component Analysis and Signal Separation, pages 169–176. Springer, 2007.
  • [8] Jeremy E. Cohen. Shootout. https://github.com/cohenjer/shootout, 2022.
  • [9] Valentin Emiya, Nancy Bertin, Bertrand David, and Roland Badeau. MAPS-A piano database for multipitch estimation and automatic transcription of music. Research Report, pp.11. inria-00544155, 2010.
  • [10] Cédric Févotte and Jérôme Idier. Algorithms for nonnegative matrix factorization with the $\beta$-divergence. Neural Computation, 23(9):2421–2456, sep 2011.
  • [11] Nicolas Gillis. Nonnegative Matrix Factorization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2020.
  • [12] Nicolas Gillis and François Glineur. Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural computation, 24(4):1085–1105, 2012.
  • [13] Naiyang Guan, Dacheng Tao, Zhigang Luo, and Bo Yuan. NeNMF: An optimal gradient method for nonnegative matrix factorization. IEEE Transactions on Signal Processing, 60(6):2882–2898, jun 2012.
  • [14] Le Thi Khanh Hien and Nicolas Gillis. Algorithms for nonnegative matrix factorization with the Kullback–Leibler divergence. Journal of Scientific Computing, 87(3):1–32, 2021.
  • [15] P.O. Hoyer. Non-negative sparse coding. In Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing. IEEE, 2002.
  • [16] David R. Hunter and Kenneth Lange. Optimization transfer using surrogate objective functions: Rejoinder. Journal of Computational and Graphical Statistics, 9(1):52, mar 2000.
  • [17] David R. Hunter and Kenneth Lange. A tutorial on MM algorithms. American Statistician, 58(1):30–37, February 2004.
  • [18] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 1999.
  • [19] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems. MIT Press, 2001.
  • [20] Valentin Leplat, Nicolas Gillis, and Jérôme Idier. Multiplicative updates for NMF with $\beta$-divergences under disjoint equality constraints. SIAM Journal on Matrix Analysis and Applications, 42(2):730–752, 2021.
  • [21] Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, jan 1992.
  • [22] Axel Marmoret and Jérémy Cohen. nn_fac: Nonnegative factorization techniques toolbox, 2020.
  • [23] Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, jun 1994.
  • [24] Audrey Repetti, Mai Quyen Pham, Laurent Duval, Emilie Chouzenoux, and Jean-Christophe Pesquet. Euclid in a taxicab: Sparse blind deconvolution with smoothed $\ell_1/\ell_2$ regularization. IEEE Signal Processing Letters, 22(5):539–543, may 2015.
  • [25] Romain Serizel, Slim Essid, and Gaël Richard. Mini-batch stochastic approaches for accelerated multiplicative updates in nonnegative matrix factorisation with beta-divergence. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2016.
  • [26] Yong Sheng Soh and Antonios Varvitsiotis. A non-commutative extension of Lee-Seung's algorithm for positive semidefinite factorizations. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 27491–27502. Curran Associates, Inc., 2021.
  • [27] N. Takahashi, J. Katayama, and J. Takeuchi. A generalized sufficient condition for global convergence of modified multiplicative updates for NMF. In Int. Symp. Nonlinear Theory Appl., 2014.
  • [28] Norikazu Takahashi and Ryota Hibi. Global convergence of modified multiplicative updates for nonnegative matrix factorization. Computational Optimization and Applications, 57(2):417–440, aug 2013.
  • [29] Norikazu Takahashi and Masato Seki. Multiplicative update for a class of constrained optimization problems related to NMF and its global convergence. In 2016 24th European Signal Processing Conference (Eusipco), pages 438–442. IEEE, 2016.
  • [30] Leo Taslaman and Björn Nilsson. A framework for regularized non-negative matrix factorization, with application to the analysis of gene expression data. PLoS ONE, 7(11):e46331, nov 2012.
  • [31] Haoran Wu, Axel Marmoret, and Jeremy E. Cohen. Semi-supervised convolutive NMF for automatic piano transcription. In Sound and Music Computing 2022. Zenodo, june 2022.
  • [32] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, jan 2013.
  • [33] Renbo Zhao and Vincent YF Tan. A unified convergence analysis of the multiplicative update algorithm for nonnegative matrix factorization. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2562–2566. IEEE, 2017.
  • [34] Feiyun Zhu. Hyperspectral unmixing: ground truth labeling, datasets, benchmark performances and survey. arXiv preprint arXiv:1708.05125, 2017.