-
Attention is a smoothed cubic spline
Authors:
Zehua Lai,
Lek-Heng Lim,
Yucong Liu
Abstract:
We highlight a perhaps important but hitherto unobserved insight: The attention module in a transformer is a smoothed cubic spline. Viewed in this manner, this mysterious but critical component of a transformer becomes a natural development of an old notion deeply entrenched in classical approximation theory. More precisely, we show that with ReLU-activation, attention, masked attention, encoder-d…
▽ More
We highlight a perhaps important but hitherto unobserved insight: The attention module in a transformer is a smoothed cubic spline. Viewed in this manner, this mysterious but critical component of a transformer becomes a natural development of an old notion deeply entrenched in classical approximation theory. More precisely, we show that with ReLU-activation, attention, masked attention, encoder-decoder attention are all cubic splines. As every component in a transformer is constructed out of compositions of various attention modules (= cubic splines) and feed forward neural networks (= linear splines), all its components -- encoder, decoder, and encoder-decoder blocks; multilayered encoders and decoders; the transformer itself -- are cubic or higher-order splines. If we assume the Pierce-Birkhoff conjecture, then the converse also holds, i.e., every spline is a ReLU-activated encoder. Since a spline is generally just $C^2$, one way to obtain a smoothed $C^\infty$-version is by replacing ReLU with a smooth activation; and if this activation is chosen to be SoftMax, we recover the original transformer as proposed by Vaswani et al. This insight sheds light on the nature of the transformer by casting it entirely in terms of splines, one of the best known and thoroughly understood objects in applied mathematics.
△ Less
Submitted 18 August, 2024;
originally announced August 2024.
-
Simple matrix models for the flag, Grassmann, and Stiefel manifolds
Authors:
Lek-Heng Lim,
Ke Ye
Abstract:
We derive three families of orthogonally-equivariant matrix submanifold models for the Grassmann, flag, and Stiefel manifolds respectively. These families are exhaustive -- every orthogonally-equivariant submanifold model of the lowest dimension for any of these manifolds is necessarily a member of the respective family, with a small number of exceptions. They have several computationally desirabl…
▽ More
We derive three families of orthogonally-equivariant matrix submanifold models for the Grassmann, flag, and Stiefel manifolds respectively. These families are exhaustive -- every orthogonally-equivariant submanifold model of the lowest dimension for any of these manifolds is necessarily a member of the respective family, with a small number of exceptions. They have several computationally desirable features. The orthogonal equivariance allows one to obtain, for various differential geometric objects and operations, closed-form analytic expressions that are readily computable with standard numerical linear algebra. The minimal dimension aspect translates directly to a speed advantage in computations. And having an exhaustive list of all possible matrix models permits one to identify the model with the lowest matrix condition number, which translates to an accuracy advantage in computations. As an interesting aside, we will see that the family of models for the Stiefel manifold is naturally parameterized by the Cartan manifold, i.e., the positive definite cone equipped with its natural Riemannian metric.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
Minimal equivariant embeddings of the Grassmannian and flag manifold
Authors:
Lek-Heng Lim,
Ke Ye
Abstract:
We show that the flag manifold $\operatorname{Flag}(k_1,\dots, k_p, \mathbb{R}^n)$, with Grassmannian the special case $p=1$, has an $\operatorname{SO}_n(\mathbb{R})$-equivariant embedding in an Euclidean space of dimension $(n-1)(n+2)/2$, two orders of magnitude below the current best known result. We will show that the value $(n-1)(n+2)/2$ is the smallest possible and that any…
▽ More
We show that the flag manifold $\operatorname{Flag}(k_1,\dots, k_p, \mathbb{R}^n)$, with Grassmannian the special case $p=1$, has an $\operatorname{SO}_n(\mathbb{R})$-equivariant embedding in an Euclidean space of dimension $(n-1)(n+2)/2$, two orders of magnitude below the current best known result. We will show that the value $(n-1)(n+2)/2$ is the smallest possible and that any $\operatorname{SO}_n(\mathbb{R})$-equivariant embedding of $\operatorname{Flag}(k_1,\dots, k_p, \mathbb{R}^n)$ in an ambient space of minimal dimension is equivariantly equivalent to the aforementioned one.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
Grassmannian optimization is NP-hard
Authors:
Zehua Lai,
Lek-Heng Lim,
Ke Ye
Abstract:
We show that unconstrained quadratic optimization over a Grassmannian $\operatorname{Gr}(k,n)$ is NP-hard. Our results cover all scenarios: (i) when $k$ and $n$ are both allowed to grow; (ii) when $k$ is arbitrary but fixed; (iii) when $k$ is fixed at its lowest possible value $1$. We then deduce the NP-hardness of unconstrained cubic optimization over the Stiefel manifold $\operatorname{V}(k,n)$…
▽ More
We show that unconstrained quadratic optimization over a Grassmannian $\operatorname{Gr}(k,n)$ is NP-hard. Our results cover all scenarios: (i) when $k$ and $n$ are both allowed to grow; (ii) when $k$ is arbitrary but fixed; (iii) when $k$ is fixed at its lowest possible value $1$. We then deduce the NP-hardness of unconstrained cubic optimization over the Stiefel manifold $\operatorname{V}(k,n)$ and the orthogonal group $\operatorname{O}(n)$. As an addendum we demonstrate the NP-hardness of unconstrained quadratic optimization over the Cartan manifold, i.e., the positive definite cone $\mathbb{S}^n_{\scriptscriptstyle++}$ regarded as a Riemannian manifold, another popular example in manifold optimization. We will also establish the nonexistence of $\mathrm{FPTAS}$ in all cases.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Simple matrix expressions for the curvatures of Grassmannian
Authors:
Zehua Lai,
Lek-Heng Lim,
Ke Ye
Abstract:
We show that modeling a Grassmannian as symmetric orthogonal matrices $\operatorname{Gr}(k,\mathbb{R}^n) \cong\{Q \in \mathbb{R}^{n \times n} : Q^{\scriptscriptstyle\mathsf{T}} Q = I, \; Q^{\scriptscriptstyle\mathsf{T}} = Q,\; \operatorname{tr}(Q)=2k - n\}$ yields exceedingly simple matrix formulas for various curvatures and curvature-related quantities, both intrinsic and extrinsic. These include…
▽ More
We show that modeling a Grassmannian as symmetric orthogonal matrices $\operatorname{Gr}(k,\mathbb{R}^n) \cong\{Q \in \mathbb{R}^{n \times n} : Q^{\scriptscriptstyle\mathsf{T}} Q = I, \; Q^{\scriptscriptstyle\mathsf{T}} = Q,\; \operatorname{tr}(Q)=2k - n\}$ yields exceedingly simple matrix formulas for various curvatures and curvature-related quantities, both intrinsic and extrinsic. These include Riemann, Ricci, Jacobi, sectional, scalar, mean, principal, and Gaussian curvatures; Schouten, Weyl, Cotton, Bach, Plebański, cocurvature, nonmetricity, and torsion tensors; first, second, and third fundamental forms; Gauss and Weingarten maps; and upper and lower delta invariants. We will derive explicit, simple expressions for the aforementioned quantities in terms of standard matrix operations that are stably computable with numerical linear algebra. Many of these aforementioned quantities have never before been presented for the Grassmannian.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Summing divergent matrix series
Authors:
Rongbiao Wang,
JungHo Lee,
Lek-Heng Lim
Abstract:
We extend several celebrated methods in classical analysis for summing series of complex numbers to series of complex matrices. These include the summation methods of Abel, Borel, Cesáro, Euler, Lambert, Nörlund, and Mittag-Leffler, which are frequently used to sum scalar series that are divergent in the conventional sense. One feature of our matrix extensions is that they are fully noncommutative…
▽ More
We extend several celebrated methods in classical analysis for summing series of complex numbers to series of complex matrices. These include the summation methods of Abel, Borel, Cesáro, Euler, Lambert, Nörlund, and Mittag-Leffler, which are frequently used to sum scalar series that are divergent in the conventional sense. One feature of our matrix extensions is that they are fully noncommutative generalizations of their scalar counterparts -- not only is the scalar series replaced by a matrix series, positive weights are replaced by positive definite matrix weights, order on $\mathbb{R}$ replaced by Loewner order, exponential function replaced by matrix exponential function, etc. We will establish the regularity of our matrix summation methods, i.e., when applied to a matrix series convergent in the conventional sense, we obtain the same value for the sum. Our second goal is to provide numerical algorithms that work in conjunction with these summation methods. We discuss how the block and mixed-block summation algorithms, the Kahan compensated summation algorithm, may be applied to matrix sums with similar roundoff error bounds. These summation methods and algorithms apply not only to power or Taylor series of matrices but to any general matrix series including matrix Fourier and Dirichlet series. We will demonstrate the utility of these summation methods: establishing a Fejér's theorem and alleviating the Gibbs phenomenon for matrix Fourier series; extending the domains of matrix functions and accurately evaluating them; enhancing the matrix Padé approximation and Schur--Parlett algorithms; and more.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Degree of the Grassmannian as an affine variety
Authors:
Lek-Heng Lim,
Ke Ye
Abstract:
The degree of the Grassmannian with respect to the Plücker embedding is well-known. However, the Plücker embedding, while ubiquitous in pure mathematics, is almost never used in applied mathematics. In applied mathematics, the Grassmannian is usually embedded as projection matrices…
▽ More
The degree of the Grassmannian with respect to the Plücker embedding is well-known. However, the Plücker embedding, while ubiquitous in pure mathematics, is almost never used in applied mathematics. In applied mathematics, the Grassmannian is usually embedded as projection matrices $\operatorname{Gr}(k,\mathbb{R}^n) \cong \{P \in \mathbb{R}^{n \times n} : P^{\scriptscriptstyle\mathsf{T}} = P = P^2,\; \operatorname{tr}(P) = k\}$ or as involution matrices $\operatorname{Gr}(k,\mathbb{R}^n) \cong \{X \in \mathbb{R}^{n \times n} : X^{\scriptscriptstyle\mathsf{T}} = X,\; X^2 = I,\; \operatorname{tr}(X)=2k - n\}$. We will determine an explicit expression for the degree of the Grassmannian with respect to these embeddings. In so doing, we resolved a conjecture of Devriendt--Friedman--Sturmfels about the degree $\operatorname{Gr}(2, \mathbb{R}^n)$ and in fact generalized it to $\operatorname{Gr}(k, \mathbb{R}^n)$. We also proved a set theoretic variant of another conjecture of Devriendt--Friedman--Sturmfels about the limit of $\operatorname{Gr}(k,\mathbb{R}^n)$ in the sense of Gröbner degneration.
△ Less
Submitted 19 July, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
A study guide for the $\ell^2$ decoupling theorem for the paraboloid
Authors:
Ataleshvara Bhargava,
Tiklung Chan,
Zi Li Lim,
Yixuan Pang
Abstract:
This article serves as a study guide for the $\ell^2$ decoupling theorem for the paraboloid originally proved by Bourgain and Demeter. Given its popularity and importance, many expositions about the $\ell^2$ decoupling theorem already exist. Our study guide is intended to complement and combine these existing resources in order to provide a more gentle introduction to the subject.
This article serves as a study guide for the $\ell^2$ decoupling theorem for the paraboloid originally proved by Bourgain and Demeter. Given its popularity and importance, many expositions about the $\ell^2$ decoupling theorem already exist. Our study guide is intended to complement and combine these existing resources in order to provide a more gentle introduction to the subject.
△ Less
Submitted 22 February, 2024;
originally announced February 2024.
-
Three term rational function progressions in finite fields
Authors:
Guo-Dong Hong,
Zi Li Lim
Abstract:
Let $F(t),G(t)\in \mathbb{Q}(t)$ be rational functions such that $F(t),G(t)$ and the constant function $1$ are linearly independent over $\mathbb{Q}$, we prove an asymptotic formula for the number of the three term rational function progressions of the form $x,x+F(y),x+G(y)$ in subsets of $\mathbb{F}_p$. The main new ingredient is an algebraic geometry version of PET induction that bypasses Weyl's…
▽ More
Let $F(t),G(t)\in \mathbb{Q}(t)$ be rational functions such that $F(t),G(t)$ and the constant function $1$ are linearly independent over $\mathbb{Q}$, we prove an asymptotic formula for the number of the three term rational function progressions of the form $x,x+F(y),x+G(y)$ in subsets of $\mathbb{F}_p$. The main new ingredient is an algebraic geometry version of PET induction that bypasses Weyl's differencing. This answers a question of Bourgain and Chang.
△ Less
Submitted 2 January, 2024;
originally announced January 2024.
-
Convex Relaxation for Fokker-Planck
Authors:
Yian Chen,
Yuehaw Khoo,
Lek-Heng Lim
Abstract:
We propose an approach to directly estimate the moments or marginals for a high-dimensional equilibrium distribution in statistical mechanics, via solving the high-dimensional Fokker-Planck equation in terms of low-order cluster moments or marginals. With this approach, we bypass the exponential complexity of estimating the full high-dimensional distribution and directly solve the simplified parti…
▽ More
We propose an approach to directly estimate the moments or marginals for a high-dimensional equilibrium distribution in statistical mechanics, via solving the high-dimensional Fokker-Planck equation in terms of low-order cluster moments or marginals. With this approach, we bypass the exponential complexity of estimating the full high-dimensional distribution and directly solve the simplified partial differential equations for low-order moments/marginals. Moreover, the proposed moment/marginal relaxation is fully convex and can be solved via off-the-shelf solvers. We further propose a time-dependent version of the convex programs to study non-equilibrium dynamics. We show the proposed method can recover the meanfield approximation of an equilibrium density. Numerical results are provided to demonstrate the performance of the proposed algorithm for high-dimensional systems.
△ Less
Submitted 3 December, 2023; v1 submitted 5 June, 2023;
originally announced June 2023.
-
Simpler flag optimization
Authors:
Zehua Lai,
Lek-Heng Lim,
Ke Ye
Abstract:
We study the geometry of flag manifolds under different embeddings into a product of Grassmannians. We show that differential geometric objects and operations -- tangent vector, metric, normal vector, exponential map, geodesic, parallel transport, gradient, Hessian, etc -- have closed-form analytic expressions that are computable with standard numerical linear algebra. Furthermore, we are able to…
▽ More
We study the geometry of flag manifolds under different embeddings into a product of Grassmannians. We show that differential geometric objects and operations -- tangent vector, metric, normal vector, exponential map, geodesic, parallel transport, gradient, Hessian, etc -- have closed-form analytic expressions that are computable with standard numerical linear algebra. Furthermore, we are able to derive a coordinate descent method in the flag manifold that performs well compared to other gradient descent methods.
△ Less
Submitted 30 November, 2022;
originally announced December 2022.
-
Haagerup bound for quaternionic Grothendieck inequality
Authors:
Shmuel Friedland,
Zehua Lai,
Lek-Heng Lim
Abstract:
We present here several versions of the Grothendieck inequality over the skew field of quaternions: The first one is the standard Grothendieck inequality for rectangular matrices, and two additional inequalities for self-adjoint matrices, as introduced by the first and the last authors in a recent paper. We give several results on ``conic Grothendieck inequality'': as Nesterov $π/2$-Theorem, which…
▽ More
We present here several versions of the Grothendieck inequality over the skew field of quaternions: The first one is the standard Grothendieck inequality for rectangular matrices, and two additional inequalities for self-adjoint matrices, as introduced by the first and the last authors in a recent paper. We give several results on ``conic Grothendieck inequality'': as Nesterov $π/2$-Theorem, which corresponds to the cones of positive semidefinite matrices; the Goemans--Williamson inequality, which corresponds to the cones of weighted Laplacians; the diagonally dominant matrices. The most challenging technical part of this paper is the proof of the analog of Haagerup result that the inverse of the hypergeometric function $x {}_2F_1(\frac{1}{2}, \frac{1}{2}; 3; x^2)$ has first positive Taylor coefficient and all other Taylor coefficients are nonpositive.
△ Less
Submitted 30 November, 2022;
originally announced December 2022.
-
Stochastic Steffensen method
Authors:
Minda Zhao,
Zehua Lai,
Lek-Heng Lim
Abstract:
Is it possible for a first-order method, i.e., only first derivatives allowed, to be quadratically convergent? For univariate loss functions, the answer is yes -- the Steffensen method avoids second derivatives and is still quadratically convergent like Newton method. By incorporating an optimal step size we can even push its convergence order beyond quadratic to $1+\sqrt{2} \approx 2.414$. While…
▽ More
Is it possible for a first-order method, i.e., only first derivatives allowed, to be quadratically convergent? For univariate loss functions, the answer is yes -- the Steffensen method avoids second derivatives and is still quadratically convergent like Newton method. By incorporating an optimal step size we can even push its convergence order beyond quadratic to $1+\sqrt{2} \approx 2.414$. While such high convergence orders are a pointless overkill for a deterministic algorithm, they become rewarding when the algorithm is randomized for problems of massive sizes, as randomization invariably compromises convergence speed. We will introduce two adaptive learning rates inspired by the Steffensen method, intended for use in a stochastic optimization setting and requires no hyperparameter tuning aside from batch size. Extensive experiments show that they compare favorably with several existing first-order methods. When restricted to a quadratic objective, our stochastic Steffensen methods reduce to randomized Kaczmarz method -- note that this is not true for SGD or SLBFGS -- and thus we may also view our methods as a generalization of randomized Kaczmarz to arbitrary objectives.
△ Less
Submitted 28 November, 2022;
originally announced November 2022.
-
LU decomposition and Toeplitz decomposition of a neural network
Authors:
Yucong Liu,
Simiao Jiao,
Lek-Heng Lim
Abstract:
It is well-known that any matrix $A$ has an LU decomposition. Less well-known is the fact that it has a 'Toeplitz decomposition' $A = T_1 T_2 \cdots T_r$ where $T_i$'s are Toeplitz matrices. We will prove that any continuous function $f : \mathbb{R}^n \to \mathbb{R}^m$ has an approximation to arbitrary accuracy by a neural network that takes the form…
▽ More
It is well-known that any matrix $A$ has an LU decomposition. Less well-known is the fact that it has a 'Toeplitz decomposition' $A = T_1 T_2 \cdots T_r$ where $T_i$'s are Toeplitz matrices. We will prove that any continuous function $f : \mathbb{R}^n \to \mathbb{R}^m$ has an approximation to arbitrary accuracy by a neural network that takes the form $L_1 σ_1 U_1 σ_2 L_2 σ_3 U_2 \cdots L_r σ_{2r-1} U_r$, i.e., where the weight matrices alternate between lower and upper triangular matrices, $σ_i(x) := σ(x - b_i)$ for some bias vector $b_i$, and the activation $σ$ may be chosen to be essentially any uniformly continuous nonpolynomial function. The same result also holds with Toeplitz matrices, i.e., $f \approx T_1 σ_1 T_2 σ_2 \cdots σ_{r-1} T_r$ to arbitrary accuracy, and likewise for Hankel matrices. A consequence of our Toeplitz result is a fixed-width universal approximation theorem for convolutional neural networks, which so far have only arbitrary width versions. Since our results apply in particular to the case when $f$ is a general neural network, we may regard them as LU and Toeplitz decompositions of a neural network. The practical implication of our results is that one may vastly reduce the number of weight parameters in a neural network without sacrificing its power of universal approximation. We will present several experiments on real data sets to show that imposing such structures on the weight matrices sharply reduces the number of training parameters with almost no noticeable effect on test accuracy.
△ Less
Submitted 25 November, 2022;
originally announced November 2022.
-
Generalized matrix nearness problems
Authors:
Zihao Li,
Lek-Heng Lim
Abstract:
We show that the global minimum solution of $\lVert A - BXC \rVert$ can be found in closed-form with singular value decompositions and generalized singular value decompositions for a variety of constraints on $X$ involving rank, norm, symmetry, two-sided product, and prescribed eigenvalue. This extends the solution of Friedland--Torokhti for the generalized rank-constrained approximation problem t…
▽ More
We show that the global minimum solution of $\lVert A - BXC \rVert$ can be found in closed-form with singular value decompositions and generalized singular value decompositions for a variety of constraints on $X$ involving rank, norm, symmetry, two-sided product, and prescribed eigenvalue. This extends the solution of Friedland--Torokhti for the generalized rank-constrained approximation problem to other constraints as well as provides an alternative solution for rank constraint in terms of singular value decompositions. For more complicated constraints on $X$ involving structures such as Toeplitz, Hankel, circulant, nonnegativity, stochasticity, positive semidefiniteness, prescribed eigenvector, etc, we prove that a simple iterative method is linearly and globally convergent to the global minimum solution.
△ Less
Submitted 29 September, 2022;
originally announced September 2022.
-
Complex matrix inversion via real matrix inversions
Authors:
Zhen Dai,
Lek-Heng Lim,
Ke Ye
Abstract:
We study the inversion analog of the well-known Gauss algorithm for multiplying complex matrices. A simple version is $(A + iB)^{-1} = (A + BA^{-1}B)^{-1} - i A^{-1}B(A+BA^{-1} B)^{-1}$ when $A$ is invertible, which may be traced back to Frobenius but has received scant attention. We prove that it is optimal, requiring fewest matrix multiplications and inversions over the base field, and we extend…
▽ More
We study the inversion analog of the well-known Gauss algorithm for multiplying complex matrices. A simple version is $(A + iB)^{-1} = (A + BA^{-1}B)^{-1} - i A^{-1}B(A+BA^{-1} B)^{-1}$ when $A$ is invertible, which may be traced back to Frobenius but has received scant attention. We prove that it is optimal, requiring fewest matrix multiplications and inversions over the base field, and we extend it in three ways: (i) to any invertible $A + iB$ without requiring $A$ or $B$ be invertible; (ii) to any iterated quadratic extension fields, with $\mathbb{C}$ over $\mathbb{R}$ a special case; (iii) to Hermitian positive definite matrices $A + iB$ by exploiting symmetric positive definiteness of $A$ and $A + BA^{-1}B$. We call all such algorithms Frobenius inversions, which we will see do not follow from Sherman--Morrison--Woodbury type identities and cannot be extended to Moore--Penrose pseudoinverse. We show that a complex matrix with well-conditioned real and imaginary parts can be arbitrarily ill-conditioned, a situation tailor-made for Frobenius inversion. We prove that Frobenius inversion for complex matrices is faster than standard inversion by LU decomposition and Frobenius inversion for Hermitian positive definite matrices is faster than standard inversion by Cholesky decomposition. We provide extensive numerical experiments, applying Frobenius inversion to solve linear systems, evaluate matrix sign function, solve Sylvester equation, and compute polar decomposition, showing that Frobenius inversion can be more efficient than LU/Cholesky decomposition with negligible loss in accuracy. A side result is a generalization of Gauss multiplication to iterated quadratic extensions, which we show is intimately related to the Karatsuba algorithm for fast integer multiplication and multidimensional fast Fourier transform.
△ Less
Submitted 9 October, 2023; v1 submitted 2 August, 2022;
originally announced August 2022.
-
Rank-constrained Hyperbolic Programming
Authors:
Zhen Dai,
Lek-Heng Lim
Abstract:
We extend rank-constrained optimization to general hyperbolic programs (HP) using the notion of matroid rank. For LP and SDP respectively, this reduces to sparsity-constrained LP and rank-constrained SDP that are already well-studied. But for QCQP and SOCP, we obtain new interesting optimization problems. For example, rank-constrained SOCP includes weighted Max-Cut and nonconvex QP as special case…
▽ More
We extend rank-constrained optimization to general hyperbolic programs (HP) using the notion of matroid rank. For LP and SDP respectively, this reduces to sparsity-constrained LP and rank-constrained SDP that are already well-studied. But for QCQP and SOCP, we obtain new interesting optimization problems. For example, rank-constrained SOCP includes weighted Max-Cut and nonconvex QP as special cases, and dropping the rank constraints yield the standard SOCP-relaxations of these problems. We will show (i) how to do rank reduction for SOCP and QCQP, (ii) that rank-constrained SOCP and rank-constrained QCQP are NP-hard, and (iii) an improved result for rank-constrained SDP showing that if the number of constraints is $m$ and the rank constraint is less than $2^{1/2-ε} \sqrt{m}$ for some $ε>0$, then the problem is NP-hard. We will also study sparsity-constrained HP and extend results on LP sparsification to SOCP and QCQP. In particular, we show that there always exist (a) a solution to SOCP of cardinality at most twice the number of constraints and (b) a solution to QCQP of cardinality at most the sum of the number of linear constraints and the sum of the rank of the matrices in the quadratic constraints; and both (a) and (b) can be found efficiently.
△ Less
Submitted 22 July, 2022;
originally announced July 2022.
-
Numerical stability and tensor nuclear norm
Authors:
Zhen Dai,
Lek-Heng Lim
Abstract:
We present a notion of bilinear stability, which is to numerical stability what bilinear complexity is to time complexity. In bilinear complexity, an algorithm for evaluating a bilinear operator $β: \mathbb{U} \times \mathbb{V} \to \mathbb{W}$ is a decomposition $β= \varphi_1 \otimes ψ_1 \otimes w_1 + \dots + \varphi_r \otimes ψ_r \otimes w_r $; the number of terms $r$ captures the speed of the al…
▽ More
We present a notion of bilinear stability, which is to numerical stability what bilinear complexity is to time complexity. In bilinear complexity, an algorithm for evaluating a bilinear operator $β: \mathbb{U} \times \mathbb{V} \to \mathbb{W}$ is a decomposition $β= \varphi_1 \otimes ψ_1 \otimes w_1 + \dots + \varphi_r \otimes ψ_r \otimes w_r $; the number of terms $r$ captures the speed of the algorithm; and its smallest possible value, i.e., the tensor rank of $β$, quantifies the speed of a fastest algorithm. Bilinear stability introduces norms to the mix: The growth factor of the algorithm $\lVert \varphi_1 \rVert_* \lVert ψ_1 \rVert_* \lVert w_1 \rVert + \dots + \lVert \varphi_r \rVert_* \lVert ψ_r \rVert_* \lVert w_r \rVert$ captures the accuracy of the algorithm; and its smallest possible value, i.e., the tensor nuclear norm of $β$, quantifies the accuracy of a stablest algorithm. To substantiate this notion, we establish a bound for the forward error in terms of the growth factor and present numerical evidence comparing various fast algorithms for matrix and complex multiplications, showing that larger growth factors correlate with less accurate results. Compared to similar studies of numerical stability, bilinear stability is more general, applying to any bilinear operators and not just matrix or complex multiplications; is more simplistic, bounding forward error in terms of a single (growth) factor; and is truly tensorial like bilinear complexity, invariant under any orthogonal change of coordinates. As an aside, we study a new algorithm for computing complex multiplication in terms of real, much like Gauss's, but is optimally fast and stable in that it attains both tensor rank and nuclear norm.
△ Less
Submitted 12 October, 2023; v1 submitted 18 July, 2022;
originally announced July 2022.
-
What is an equivariant neural network?
Authors:
Lek-Heng Lim,
Bradley J. Nelson
Abstract:
We explain equivariant neural networks, a notion underlying breakthroughs in machine learning from deep convolutional neural networks for computer vision to AlphaFold 2 for protein structure prediction, without assuming knowledge of equivariance or neural networks. The basic mathematical ideas are simple but are often obscured by engineering complications that come with practical realizations. We…
▽ More
We explain equivariant neural networks, a notion underlying breakthroughs in machine learning from deep convolutional neural networks for computer vision to AlphaFold 2 for protein structure prediction, without assuming knowledge of equivariance or neural networks. The basic mathematical ideas are simple but are often obscured by engineering complications that come with practical realizations. We extract and focus on the mathematical aspects, and limit ourselves to a cursory treatment of the engineering issues at the end.
△ Less
Submitted 16 November, 2022; v1 submitted 15 May, 2022;
originally announced May 2022.
-
Tensors in computations
Authors:
Lek-Heng Lim
Abstract:
The notion of a tensor captures three great ideas: equivariance, multilinearity, separability. But trying to be three things at once makes the notion difficult to understand. We will explain tensors in an accessible and elementary way through the lens of linear algebra and numerical linear algebra, elucidated with examples from computational and applied mathematics.
The notion of a tensor captures three great ideas: equivariance, multilinearity, separability. But trying to be three things at once makes the notion difficult to understand. We will explain tensors in an accessible and elementary way through the lens of linear algebra and numerical linear algebra, elucidated with examples from computational and applied mathematics.
△ Less
Submitted 15 June, 2021;
originally announced June 2021.
-
Distances between probability distributions of different dimensions
Authors:
Yuhang Cai,
Lek-Heng Lim
Abstract:
Comparing probability distributions is an indispensable and ubiquitous task in machine learning and statistics. The most common way to compare a pair of Borel probability measures is to compute a metric between them, and by far the most widely used notions of metric are the Wasserstein metric and the total variation metric. The next most common way is to compute a divergence between them, and in t…
▽ More
Comparing probability distributions is an indispensable and ubiquitous task in machine learning and statistics. The most common way to compare a pair of Borel probability measures is to compute a metric between them, and by far the most widely used notions of metric are the Wasserstein metric and the total variation metric. The next most common way is to compute a divergence between them, and in this case almost every known divergences such as those of Kullback--Leibler, Jensen--Shannon, Rényi, and many more, are special cases of the $f$-divergence. Nevertheless these metrics and divergences may only be computed, in fact, are only defined, when the pair of probability measures are on spaces of the same dimension. How would one quantify, say, a KL-divergence between the uniform distribution on the interval $[-1,1]$ and a Gaussian distribution on $\mathbb{R}^3$? We show that these common notions of metrics and divergences give rise to natural distances between Borel probability measures defined on spaces of different dimensions, e.g., one on $\mathbb{R}^m$ and another on $\mathbb{R}^n$ where $m, n$ are distinct, so as to give a meaningful answer to the previous question.
△ Less
Submitted 29 January, 2022; v1 submitted 1 November, 2020;
originally announced November 2020.
-
Simpler Grassmannian optimization
Authors:
Zehua Lai,
Lek-Heng Lim,
Ke Ye
Abstract:
There are two widely used models for the Grassmannian $\operatorname{Gr}(k,n)$, as the set of equivalence classes of orthogonal matrices $\operatorname{O}(n)/(\operatorname{O}(k) \times \operatorname{O}(n-k))$, and as the set of trace-$k$ projection matrices $\{P \in \mathbb{R}^{n \times n} : P^{\mathsf{T}} = P = P^2,\; \operatorname{tr}(P) = k\}$. The former, standard in manifold optimization, ha…
▽ More
There are two widely used models for the Grassmannian $\operatorname{Gr}(k,n)$, as the set of equivalence classes of orthogonal matrices $\operatorname{O}(n)/(\operatorname{O}(k) \times \operatorname{O}(n-k))$, and as the set of trace-$k$ projection matrices $\{P \in \mathbb{R}^{n \times n} : P^{\mathsf{T}} = P = P^2,\; \operatorname{tr}(P) = k\}$. The former, standard in manifold optimization, has the advantage of giving numerically stable algorithms but the disadvantage of having to work with equivalence classes of matrices. The latter, widely used in coding theory and probability, has the advantage of using actual matrices (as opposed to equivalence classes) but working with projection matrices is numerically unstable. We present an alternative that has both advantages and suffers from neither of the disadvantages; by representing $k$-dimensional subspaces as symmetric orthogonal matrices of trace $2k-n$, we obtain \[ \operatorname{Gr}(k,n) \cong \{Q \in \operatorname{O}(n) : Q^{\mathsf{T}} = Q, \; \operatorname{tr}(Q) = 2k -n\}. \] As with the other two models, we show that differential geometric objects and operations -- tangent vector, metric, normal vector, exponential map, geodesic, parallel transport, gradient, Hessian, etc -- have closed-form analytic expressions that are computable with standard numerical linear algebra. In the proposed model, these expressions are considerably simpler, a result of representing $\operatorname{Gr}(k,n)$ as a linear section of a compact matrix Lie group $\operatorname{O}(n)$, and can be computed with at most one QR decomposition and one exponential of a special skew-symmetric matrix that takes only $O(nk(n-k))$ time. In particular, we completely avoid eigen- and singular value decompositions in our steepest descent, conjugate gradient, quasi-Newton, and Newton methods for the Grassmannian.
△ Less
Submitted 28 September, 2020;
originally announced September 2020.
-
Recht-Ré Noncommutative Arithmetic-Geometric Mean Conjecture is False
Authors:
Zehua Lai,
Lek-Heng Lim
Abstract:
Stochastic optimization algorithms have become indispensable in modern machine learning. An unresolved foundational question in this area is the difference between with-replacement sampling and without-replacement sampling -- does the latter have superior convergence rate compared to the former? A groundbreaking result of Recht and Ré reduces the problem to a noncommutative analogue of the arithme…
▽ More
Stochastic optimization algorithms have become indispensable in modern machine learning. An unresolved foundational question in this area is the difference between with-replacement sampling and without-replacement sampling -- does the latter have superior convergence rate compared to the former? A groundbreaking result of Recht and Ré reduces the problem to a noncommutative analogue of the arithmetic-geometric mean inequality where $n$ positive numbers are replaced by $n$ positive definite matrices. If this inequality holds for all $n$, then without-replacement sampling indeed outperforms with-replacement sampling. The conjectured Recht-Ré inequality has so far only been established for $n = 2$ and a special case of $n = 3$. We will show that the Recht-Ré conjecture is false for general $n$. Our approach relies on the noncommutative Positivstellensatz, which allows us to reduce the conjectured inequality to a semidefinite program and the validity of the conjecture to certain bounds for the optimum values, which we show are false as soon as $n = 5$.
△ Less
Submitted 2 June, 2020;
originally announced June 2020.
-
Topology of deep neural networks
Authors:
Gregory Naitzat,
Andrey Zhitnikov,
Lek-Heng Lim
Abstract:
We study how the topology of a data set $M = M_a \cup M_b \subseteq \mathbb{R}^d$, representing two classes $a$ and $b$ in a binary classification problem, changes as it passes through the layers of a well-trained neural network, i.e., with perfect accuracy on training set and near-zero generalization error ($\approx 0.01\%$). The goal is to shed light on two mysteries in deep neural networks: (i)…
▽ More
We study how the topology of a data set $M = M_a \cup M_b \subseteq \mathbb{R}^d$, representing two classes $a$ and $b$ in a binary classification problem, changes as it passes through the layers of a well-trained neural network, i.e., with perfect accuracy on training set and near-zero generalization error ($\approx 0.01\%$). The goal is to shed light on two mysteries in deep neural networks: (i) a nonsmooth activation function like ReLU outperforms a smooth one like hyperbolic tangent; (ii) successful neural network architectures rely on having many layers, even though a shallow network can approximate any function arbitrary well. We performed extensive experiments on the persistent homology of a wide range of point cloud data sets, both real and simulated. The results consistently demonstrate the following: (1) Neural networks operate by changing topology, transforming a topologically complicated data set into a topologically simple one as it passes through the layers. No matter how complicated the topology of $M$ we begin with, when passed through a well-trained neural network $f : \mathbb{R}^d \to \mathbb{R}^p$, there is a vast reduction in the Betti numbers of both components $M_a$ and $M_b$; in fact they nearly always reduce to their lowest possible values: $β_k\bigl(f(M_i)\bigr) = 0$ for $k \ge 1$ and $β_0\bigl(f(M_i)\bigr) = 1$, $i =a, b$. Furthermore, (2) the reduction in Betti numbers is significantly faster for ReLU activation than hyperbolic tangent activation as the former defines nonhomeomorphic maps that change topology, whereas the latter defines homeomorphic maps that preserve topology. Lastly, (3) shallow and deep networks transform data sets differently -- a shallow network operates mainly through changing geometry and changes topology only in its final layers, a deep one spreads topological changes more evenly across all layers.
△ Less
Submitted 13 April, 2020;
originally announced April 2020.
-
Symmetric Grothendieck inequality
Authors:
Shmuel Friedland,
Lek-Heng Lim
Abstract:
We establish an analogue of the Grothendieck inequality where the rectangular matrix is replaced by a symmetric/Hermitian matrix and the bilinear form by a quadratic form. We call this the symmetric Grothendieck inequality; despite its name, it is a generalization -- the original Grothendieck inequality is a special case. While there are other proposals for such an inequality, ours differs in two…
▽ More
We establish an analogue of the Grothendieck inequality where the rectangular matrix is replaced by a symmetric/Hermitian matrix and the bilinear form by a quadratic form. We call this the symmetric Grothendieck inequality; despite its name, it is a generalization -- the original Grothendieck inequality is a special case. While there are other proposals for such an inequality, ours differs in two important ways: (i) we have no additional requirement like positive semidefiniteness; (ii) our symmetric Grothendieck constant is universal, i.e., independent of the matrix and its dimensions. A consequence of our symmetric Grothendieck inequality is a "conic Grothendieck inequality" for any family of cones of symmetric matrices: The original Grothendieck inequality is a special case; as is the Nesterov $π/2$-Theorem, which corresponds to the cones of positive semidefinite matrices; as well as the Goemans-Williamson inequality, which corresponds to the cones of Laplacians. For yet other cones, e.g., of diagonally dominant matrices, we get new Grothendieck-like inequalities. A slight extension leads to a unified framework that treats any Grothendieck-like inequality as an inequality between two norms within a family of "Grothendieck norms" restricted to a family of cones. This allows us to place on equal footing the Goemans-Williamson inequality, Nesterov $π/2$-Theorem, Ben-Tal-Nemirovski-Roos $4/π$-Theorem, generalized Grothendieck inequality, order-$p$ Grothendieck inequality, rank-constrained positive semidefinite Grothendieck inequality; and in turn allows us to simplify proofs, extend results from real to complex, obtain new bounds or establish sharpness of existing ones. The symmetric Grothendieck inequality may also be applied to obtain polynomial-time approximation bounds for NP-hard combinatorial, integer, and nonconvex optimization problems.
△ Less
Submitted 16 March, 2020;
originally announced March 2020.
-
Optimization on flag manifolds
Authors:
Ke Ye,
Ken Sze-Wai Wong,
Lek-Heng Lim
Abstract:
A flag is a sequence of nested subspaces. Flags are ubiquitous in numerical analysis, arising in finite elements, multigrid, spectral, and pseudospectral methods for numerical PDE; they arise in the form of Krylov subspaces in matrix computations, and as multiresolution analysis in wavelets constructions. They are common in statistics too --- principal component, canonical correlation, and corresp…
▽ More
A flag is a sequence of nested subspaces. Flags are ubiquitous in numerical analysis, arising in finite elements, multigrid, spectral, and pseudospectral methods for numerical PDE; they arise in the form of Krylov subspaces in matrix computations, and as multiresolution analysis in wavelets constructions. They are common in statistics too --- principal component, canonical correlation, and correspondence analyses may all be viewed as methods for extracting flags from a data set. The main goal of this article is to develop the tools needed for optimizing over a set of flags, which is a smooth manifold called the flag manifold, and it contains the Grassmannian as the simplest special case. We will derive closed-form analytic expressions for various differential geometric objects required for Riemannian optimization algorithms on the flag manifold; introducing various systems of extrinsic coordinates that allow us to parameterize points, metrics, tangent spaces, geodesics, distance, parallel transport, gradients, Hessians in terms of matrices and matrix operations; and thereby permitting us to formulate steepest descent, conjugate gradient, and Newton algorithms on the flag manifold using only standard numerical linear algebra.
△ Less
Submitted 7 August, 2019; v1 submitted 1 July, 2019;
originally announced July 2019.
-
Semi-Riemannian Manifold Optimization
Authors:
Tingran Gao,
Lek-Heng Lim,
Ke Ye
Abstract:
We introduce in this paper a manifold optimization framework that utilizes semi-Riemannian structures on the underlying smooth manifolds. Unlike in Riemannian geometry, where each tangent space is equipped with a positive definite inner product, a semi-Riemannian manifold allows the metric tensor to be indefinite on each tangent space, i.e., possessing both positive and negative definite subspaces…
▽ More
We introduce in this paper a manifold optimization framework that utilizes semi-Riemannian structures on the underlying smooth manifolds. Unlike in Riemannian geometry, where each tangent space is equipped with a positive definite inner product, a semi-Riemannian manifold allows the metric tensor to be indefinite on each tangent space, i.e., possessing both positive and negative definite subspaces; differential geometric objects such as geodesics and parallel-transport can be defined on non-degenerate semi-Riemannian manifolds as well, and can be carefully leveraged to adapt Riemannian optimization algorithms to the semi-Riemannian setting. In particular, we discuss the metric independence of manifold optimization algorithms, and illustrate that the weaker but more general semi-Riemannian geometry often suffices for the purpose of optimizing smooth functions on smooth manifolds in practice.
△ Less
Submitted 18 December, 2018;
originally announced December 2018.
-
Higher-Order Cone Programming
Authors:
Lijun Ding,
Lek-Heng Lim
Abstract:
We introduce a conic embedding condition that gives a hierarchy of cones and cone programs. This condition is satisfied by a large number of convex cones including the cone of copositive matrices, the cone of completely positive matrices, and all symmetric cones. We discuss properties of the intermediate cones and conic programs in the hierarchy. In particular, we demonstrate how this embedding co…
▽ More
We introduce a conic embedding condition that gives a hierarchy of cones and cone programs. This condition is satisfied by a large number of convex cones including the cone of copositive matrices, the cone of completely positive matrices, and all symmetric cones. We discuss properties of the intermediate cones and conic programs in the hierarchy. In particular, we demonstrate how this embedding condition gives rise to a family of cone programs that interpolates between LP, SOCP, and SDP. This family of $k$th order cones may be realized either as cones of $n$-by-$n$ symmetric matrices or as cones of $n$-variate even degree polynomials. The cases $k = 1, 2, n$ then correspond to LP, SOCP, SDP; or, in the language of polynomial optimization, to DSOS, SDSOS, SOS.
△ Less
Submitted 13 November, 2018;
originally announced November 2018.
-
The Grassmannian of affine subspaces
Authors:
Lek-Heng Lim,
Ken Sze-Wai Wong,
Ke Ye
Abstract:
The Grassmannian of affine subspaces is a natural generalization of both the Euclidean space, points being zero-dimensional affine subspaces, and the usual Grassmannian, linear subspaces being special cases of affine subspaces. We show that, like the Grassmannian, the affine Grassmannian has rich geometrical and topological properties: It has the structure of a homogeneous space, a differential ma…
▽ More
The Grassmannian of affine subspaces is a natural generalization of both the Euclidean space, points being zero-dimensional affine subspaces, and the usual Grassmannian, linear subspaces being special cases of affine subspaces. We show that, like the Grassmannian, the affine Grassmannian has rich geometrical and topological properties: It has the structure of a homogeneous space, a differential manifold, an algebraic variety, a vector bundle, a classifying space, among many more structures; furthermore; it affords an analogue of Schubert calculus and its (co)homology and homotopy groups may be readily determined. On the other hand, like the Euclidean space, the affine Grassmannian serves as a concrete computational platform on which various distances, metrics, probability densities may be explicitly defined and computed via numerical linear algebra. Moreover, many standard problems in machine learning and statistics --- linear regression, errors-in-variables regression, principal components analysis, support vector machines, or more generally any problem that seeks linear relations among variables that either best represent them or separate them into components --- may be naturally formulated as problems on the affine Grassmannian.
△ Less
Submitted 27 July, 2018;
originally announced July 2018.
-
Fiber product homotopy method for multiparameter eigenvalue problems
Authors:
Jose Israel Rodriguez,
Jin-Hong Du,
Yiling You,
Lek-Heng Lim
Abstract:
We develop a new homotopy method for solving multiparameter eigenvalue problems (MEPs) called the fiber product homotopy method. For a $k$-parameter eigenvalue problem with matrices of sizes $n_1,\dots ,n_k = O(n)$, fiber product homotopy method requires deformation of $O(1)$ linear equations, while existing homotopy methods for MEPs require $O(n)$ nonlinear equations. We show that the fiber produ…
▽ More
We develop a new homotopy method for solving multiparameter eigenvalue problems (MEPs) called the fiber product homotopy method. For a $k$-parameter eigenvalue problem with matrices of sizes $n_1,\dots ,n_k = O(n)$, fiber product homotopy method requires deformation of $O(1)$ linear equations, while existing homotopy methods for MEPs require $O(n)$ nonlinear equations. We show that the fiber product homotopy method theoretically finds all eigenpairs of an MEP with probability one. It is especially well-suited for dimension-deficient singular MEPs, a weakness of all other existing methods, as the fiber product homotopy method is provably convergent with probability one for such problems as well, a fact borne out by numerical experiments. More generally, our numerical experiments indicate that the fiber product homotopy method significantly outperforms the standard Delta method in terms of accuracy, with consistent backward errors on the order of $10^{-16}$, even for dimension-deficient singular problems, and without any use of extended precision. In terms of speed, it significantly outperforms previous homotopy-based methods on all problems and outperforms the Delta method on larger problems, and is also highly parallelizable. We show that the fiber product MEP that we solve in the fiber product homotopy method, although mathematically equivalent to a standard MEP, is typically a much better conditioned problem.
△ Less
Submitted 28 November, 2020; v1 submitted 27 June, 2018;
originally announced June 2018.
-
Geometric distance between positive definite matrices of different dimensions
Authors:
Lek-Heng Lim,
Rodolphe Sepulchre,
Ke Ye
Abstract:
We show how the Riemannian distance on $\mathbb{S}^n_{++}$, the cone of $n\times n$ real symmetric or complex Hermitian positive definite matrices, may be used to naturally define a distance between two such matrices of different dimensions. Given that $\mathbb{S}^n_{++}$ also parameterizes $n$-dimensional ellipsoids, and inner products on $\mathbb{R}^n$, $n \times n$ covariance matrices of nondeg…
▽ More
We show how the Riemannian distance on $\mathbb{S}^n_{++}$, the cone of $n\times n$ real symmetric or complex Hermitian positive definite matrices, may be used to naturally define a distance between two such matrices of different dimensions. Given that $\mathbb{S}^n_{++}$ also parameterizes $n$-dimensional ellipsoids, and inner products on $\mathbb{R}^n$, $n \times n$ covariance matrices of nondegenerate probability distributions, this gives us a natural way to define a geometric distance between a pair of such objects of different dimensions.
△ Less
Submitted 4 June, 2018;
originally announced June 2018.
-
Tropical Geometry of Deep Neural Networks
Authors:
Liwen Zhang,
Gregory Naitzat,
Lek-Heng Lim
Abstract:
We establish, for the first time, connections between feedforward neural networks with ReLU activation and tropical geometry --- we show that the family of such neural networks is equivalent to the family of tropical rational maps. Among other things, we deduce that feedforward ReLU neural networks with one hidden layer can be characterized by zonotopes, which serve as building blocks for deeper n…
▽ More
We establish, for the first time, connections between feedforward neural networks with ReLU activation and tropical geometry --- we show that the family of such neural networks is equivalent to the family of tropical rational maps. Among other things, we deduce that feedforward ReLU neural networks with one hidden layer can be characterized by zonotopes, which serve as building blocks for deeper networks; we relate decision boundaries of such neural networks to tropical hypersurfaces, a major object of study in tropical geometry; and we prove that linear regions of such neural networks correspond to vertices of polytopes associated with tropical rational functions. An insight from our tropical formulation is that a deeper network is exponentially more expressive than a shallow network.
△ Less
Submitted 18 May, 2018;
originally announced May 2018.
-
Topology of tensor ranks
Authors:
Pierre Comon,
Lek-Heng Lim,
Yang Qi,
Ke Ye
Abstract:
We study path-connectedness and homotopy groups of sets of tensors defined by tensor rank, border rank, multilinear rank, as well as their symmetric counterparts for symmetric tensors. We show that over $\mathbb{C}$, the set of rank-$r$ tensors and the set of symmetric rank-$r$ symmetric tensors are both path-connected if $r$ is not more than the complex generic rank; these results also extend to…
▽ More
We study path-connectedness and homotopy groups of sets of tensors defined by tensor rank, border rank, multilinear rank, as well as their symmetric counterparts for symmetric tensors. We show that over $\mathbb{C}$, the set of rank-$r$ tensors and the set of symmetric rank-$r$ symmetric tensors are both path-connected if $r$ is not more than the complex generic rank; these results also extend to border rank and symmetric border rank over $\mathbb{C}$. Over $\mathbb{R}$, the set of rank-$r$ tensors is path-connected if it has the expected dimension but the corresponding result for symmetric rank-$r$ symmetric $d$-tensors depends on the order $d$: connected when $d$ is odd but not when $d$ is even. Border rank and symmetric border rank over $\mathbb{R}$ have essentially the same path-connectedness properties as rank and symmetric rank over $\mathbb{R}$. When $r$ is greater than the complex generic rank, we are unable to discern any general pattern: For example, we show that border-rank-three tensors in $\mathbb{R}^2 \otimes \mathbb{R}^2 \otimes \mathbb{R}^2$ fall into four connected components. For multilinear rank, the manifold of $d$-tensors of multilinear rank $(r_1,\dots,r_d)$ in $\mathbb{C}^{n_1} \otimes \cdots \otimes \mathbb{C}^{n_d}$ is always path-connected, and the same is true in $\mathbb{R}^{n_1} \otimes \cdots \otimes \mathbb{R}^{n_d}$ unless $n_i = r_i = \prod_{j \ne i} r_j$ for some $i\in\{1, \dots, d\}$. Beyond path-connectedness, we determine, over both $\mathbb{R}$ and $\mathbb{C}$, the fundamental and higher homotopy groups of the set of tensors of a fixed small rank, and, taking advantage of Bott periodicity, those of the manifold of tensors of a fixed multilinear rank. We also obtain analogues of these results for symmetric tensors of a fixed symmetric rank or a fixed symmetric multilinear rank.
△ Less
Submitted 21 April, 2018;
originally announced April 2018.
-
Tensor network ranks
Authors:
Ke Ye,
Lek-Heng Lim
Abstract:
In problems involving approximation, completion, denoising, dimension reduction, estimation, interpolation, modeling, order reduction, regression, etc, we argue that the near-universal practice of assuming that a function, matrix, or tensor (which we will see are all the same object in this context) has \emph{low rank} may be ill-justified. There are many natural instances where the object in ques…
▽ More
In problems involving approximation, completion, denoising, dimension reduction, estimation, interpolation, modeling, order reduction, regression, etc, we argue that the near-universal practice of assuming that a function, matrix, or tensor (which we will see are all the same object in this context) has \emph{low rank} may be ill-justified. There are many natural instances where the object in question has high rank with respect to the classical notions of rank: matrix rank, tensor rank, multilinear rank --- the latter two being the most straightforward generalizations of the former. To remedy this, we show that one may vastly expand these classical notions of ranks: Given any undirected graph $G$, there is a notion of $G$-rank associated with $G$, which provides us with as many different kinds of ranks as there are undirected graphs. In particular, the popular tensor network states in physics (e.g., \textsc{mps}, \textsc{ttns}, \textsc{peps}) may be regarded as functions of a specific $G$-rank for various choices of $G$. Among other things, we will see that a function, matrix, or tensor may have very high matrix, tensor, or multilinear rank and yet very low $G$-rank for some $G$. In fact the difference is in the orders of magnitudes and the gaps between $G$-ranks and these classical ranks are arbitrarily large for some important objects in computer science, mathematics, and physics. Furthermore, we show that there is a $G$ such that almost every tensor has $G$-rank exponentially lower than its rank or the dimension of its ambient space.
△ Less
Submitted 9 February, 2019; v1 submitted 8 January, 2018;
originally announced January 2018.
-
Complex best $r$-term approximations almost always exist in finite dimensions
Authors:
Yang Qi,
Mateusz Michałek,
Lek-Heng Lim
Abstract:
We show that in finite-dimensional nonlinear approximations, the best $r$-term approximant of a function $f$ almost always exists over $\mathbb{C}$ but that the same is not true over $\mathbb{R}$, i.e., the infimum $\inf_{f_1,\dots,f_r \in Y} \lVert f - f_1 - \dots - f_r \rVert$ is almost always attainable by complex-valued functions $f_1,\dots, f_r$ in $Y$, a set of functions that have some desir…
▽ More
We show that in finite-dimensional nonlinear approximations, the best $r$-term approximant of a function $f$ almost always exists over $\mathbb{C}$ but that the same is not true over $\mathbb{R}$, i.e., the infimum $\inf_{f_1,\dots,f_r \in Y} \lVert f - f_1 - \dots - f_r \rVert$ is almost always attainable by complex-valued functions $f_1,\dots, f_r$ in $Y$, a set of functions that have some desired structures. Our result extends to functions that possess special properties like symmetry or skew-symmetry under permutations of arguments. For the case where $Y$ is the set of separable functions, the problem becomes that of best rank-$r$ tensor approximations. We show that over $\mathbb{C}$, any tensor almost always has a unique best rank-$r$ approximation. This extends to other notions of tensor ranks such as symmetric rank and alternating rank, to best $r$-block-terms approximations, and to best approximations by tensor networks. When applied to sparse-plus-low-rank approximations, we obtain that for any given $r$ and $k$, a general tensor has a unique best approximation by a sum of a rank-$r$ tensor and a $k$-sparse tensor with a fixed sparsity pattern; this arises in, for example, estimation of covariance matrices of a Gaussian hidden variable model with $k$ observed variables conditionally independent given $r$ hidden variables. The existential (but not the uniqueness) part of our result also applies to best approximations by a sum of a rank-$r$ tensor and a $k$-sparse tensor with no fixed sparsity pattern, as well as to tensor completion problems.
△ Less
Submitted 5 September, 2018; v1 submitted 30 November, 2017;
originally announced November 2017.
-
An elementary and unified proof of Grothendieck's inequality
Authors:
Shmuel Friedland,
Lek-Heng Lim,
Jinjie Zhang
Abstract:
We present an elementary, self-contained proof of Grothendieck's inequality that unifies the real and complex cases and yields both the Krivine and Haagerup bounds, the current best-known explicit bounds for the real and complex Grothendieck constants respectively. This article is intended to be pedagogical, combining and streamlining known ideas of Lindenstrauss--Pełczyński, Krivine, and Haagerup…
▽ More
We present an elementary, self-contained proof of Grothendieck's inequality that unifies the real and complex cases and yields both the Krivine and Haagerup bounds, the current best-known explicit bounds for the real and complex Grothendieck constants respectively. This article is intended to be pedagogical, combining and streamlining known ideas of Lindenstrauss--Pełczyński, Krivine, and Haagerup into a proof that need only univariate calculus, basic complex variables, and a modicum of linear algebra as prerequisites.
△ Less
Submitted 23 October, 2018; v1 submitted 28 November, 2017;
originally announced November 2017.
-
Accurate Solutions of Polynomial Eigenvalue Problems
Authors:
Yiling You,
Jose Israel Rodriguez,
Lek-Heng Lim
Abstract:
Quadratic eigenvalue problems (QEP) and more generally polynomial eigenvalue problems (PEP) are among the most common types of nonlinear eigenvalue problems. Both problems, especially the QEP, have extensive applications. A typical approach to solve QEP and PEP is to use a linearization method to reformulate the problem as a higher dimensional linear eigenvalue problem. In this article, we use hom…
▽ More
Quadratic eigenvalue problems (QEP) and more generally polynomial eigenvalue problems (PEP) are among the most common types of nonlinear eigenvalue problems. Both problems, especially the QEP, have extensive applications. A typical approach to solve QEP and PEP is to use a linearization method to reformulate the problem as a higher dimensional linear eigenvalue problem. In this article, we use homotopy continuation to solve these nonlinear eigenvalue problems without passing to higher dimensions. Our main contribution is to show that our method produces substantially more accurate results, and finds all eigenvalues with a certificate of correctness via Smale's $α$-theory. To explain the superior accuracy, we show that the nonlinear eigenvalue problem we solve is better conditioned than its reformulated linear eigenvalue problem, and our homotopy continuation algorithm is more stable than QZ algorithm - theoretical findings that are borne out by our numerical experiments. Our studies provide yet another illustration of the dictum in numerical analysis that, for reasons of conditioning and stability, it is sometimes better to solve a nonlinear problem directly even when it could be transformed into a linear problem with the same solution mathematically.
△ Less
Submitted 3 November, 2017;
originally announced November 2017.
-
Numerical algorithms on the affine Grassmannian
Authors:
Lek-Heng Lim,
Ken Sze-Wai Wong,
Ke Ye
Abstract:
The affine Grassmannian is a noncompact smooth manifold that parameterizes all affine subspaces of a fixed dimension. It is a natural generalization of Euclidean space, points being zero-dimensional affine subspaces. We will realize the affine Grassmannian as a matrix manifold and extend Riemannian optimization algorithms including steepest descent, Newton method, and conjugate gradient, to real-v…
▽ More
The affine Grassmannian is a noncompact smooth manifold that parameterizes all affine subspaces of a fixed dimension. It is a natural generalization of Euclidean space, points being zero-dimensional affine subspaces. We will realize the affine Grassmannian as a matrix manifold and extend Riemannian optimization algorithms including steepest descent, Newton method, and conjugate gradient, to real-valued functions on the affine Grassmannian. Like their counterparts for the Grassmannian, these algorithms are in the style of Edelman--Arias--Smith --- they rely only on standard numerical linear algebra and are readily computable.
△ Less
Submitted 25 June, 2018; v1 submitted 6 July, 2016;
originally announced July 2016.
-
Cohomology of Cryo-Electron Microscopy
Authors:
Ke Ye,
Lek-Heng Lim
Abstract:
The goal of cryo-electron microscopy (EM) is to reconstruct the 3-dimensional structure of a molecule from a collection of its 2-dimensional projected images. In this article, we show that the basic premise of cryo-EM --- patching together 2-dimensional projections to reconstruct a 3-dimensional object --- is naturally one of Cech cohomology with SO(2)-coefficients. We deduce that every cryo-EM re…
▽ More
The goal of cryo-electron microscopy (EM) is to reconstruct the 3-dimensional structure of a molecule from a collection of its 2-dimensional projected images. In this article, we show that the basic premise of cryo-EM --- patching together 2-dimensional projections to reconstruct a 3-dimensional object --- is naturally one of Cech cohomology with SO(2)-coefficients. We deduce that every cryo-EM reconstruction problem corresponds to an oriented circle bundle on a simplicial complex, allowing us to classify cryo-EM problems via principal bundles. In practice, the 2-dimensional images are noisy and a main task in cryo-EM is to denoise them. We will see how the aforementioned insights can be used towards this end.
△ Less
Submitted 22 April, 2017; v1 submitted 5 April, 2016;
originally announced April 2016.
-
Algorithms for structured matrix-vector product of optimal bilinear complexity
Authors:
Ke Ye,
Lek-Heng Lim
Abstract:
We present explicit algorithms for computing structured matrix-vector products that are optimal in the sense of Strassen, i.e., using a provably minimum number of multiplications. These structures include Toeplitz/Hankel/circulant, symmetric, Toeplitz-plus-Hankel, sparse, and multilevel structures. The last category include \textsc{bttb}, \textsc{bhhb}, \textsc{bccb} but also any arbitrarily compl…
▽ More
We present explicit algorithms for computing structured matrix-vector products that are optimal in the sense of Strassen, i.e., using a provably minimum number of multiplications. These structures include Toeplitz/Hankel/circulant, symmetric, Toeplitz-plus-Hankel, sparse, and multilevel structures. The last category include \textsc{bttb}, \textsc{bhhb}, \textsc{bccb} but also any arbitrarily complicated nested structures built out of other structures.
△ Less
Submitted 21 March, 2016;
originally announced March 2016.
-
The Spacey Random Walk: a Stochastic Process for Higher-order Data
Authors:
Austin R. Benson,
David F. Gleich,
Lek-Heng Lim
Abstract:
Random walks are a fundamental model in applied mathematics and are a common example of a Markov chain. The limiting stationary distribution of the Markov chain represents the fraction of the time spent in each state during the stochastic process. A standard way to compute this distribution for a random walk on a finite set of states is to compute the Perron vector of the associated transition mat…
▽ More
Random walks are a fundamental model in applied mathematics and are a common example of a Markov chain. The limiting stationary distribution of the Markov chain represents the fraction of the time spent in each state during the stochastic process. A standard way to compute this distribution for a random walk on a finite set of states is to compute the Perron vector of the associated transition matrix. There are algebraic analogues of this Perron vector in terms of transition probability tensors of higher-order Markov chains. These vectors are nonnegative, have dimension equal to the dimension of the state space, and sum to one and are derived by making an algebraic substitution in the equation for the joint-stationary distribution of a higher-order Markov chains. Here, we present the spacey random walk, a non-Markovian stochastic process whose stationary distribution is given by the tensor eigenvector. The process itself is a vertex-reinforced random walk, and its discrete dynamics are related to a continuous dynamical system. We analyze the convergence properties of these dynamics and discuss numerical methods for computing the stationary distribution. Finally, we provide several applications of the spacey random walk model in population genetics, ranking, and clustering data, and we use the process to analyze taxi trajectory data in New York. This example shows definite non-Markovian structure.
△ Less
Submitted 26 December, 2016; v1 submitted 5 February, 2016;
originally announced February 2016.
-
The Computational Complexity of Duality
Authors:
Shmuel Friedland,
Lek-Heng Lim
Abstract:
We show that for any given norm ball or proper cone, weak membership in its dual ball or dual cone is polynomial-time reducible to weak membership in the given ball or cone. A consequence is that the weak membership or membership problem for a ball or cone is NP-hard if and only if the corresponding problem for the dual ball or cone is NP-hard. In a similar vein, we show that computation of the du…
▽ More
We show that for any given norm ball or proper cone, weak membership in its dual ball or dual cone is polynomial-time reducible to weak membership in the given ball or cone. A consequence is that the weak membership or membership problem for a ball or cone is NP-hard if and only if the corresponding problem for the dual ball or cone is NP-hard. In a similar vein, we show that computation of the dual norm of a given norm is polynomial-time reducible to computation of the given norm. This extends to convex functions satisfying a polynomial growth condition: for such a given function, computation of its Fenchel dual/conjugate is polynomial-time reducible to computation of the given function. Hence the computation of a norm or a convex function of polynomial-growth is NP-hard if and only if the computation of its dual norm or Fenchel dual is NP-hard. We discuss implications of these results on the weak membership problem for a symmetric convex body and its polar dual, the polynomial approximability of Mahler volume, and the weak membership problem for the epigraph of a convex function with polynomial growth and that of its Fenchel dual.
△ Less
Submitted 23 July, 2016; v1 submitted 27 January, 2016;
originally announced January 2016.
-
Semialgebraic Geometry of Nonnegative Tensor Rank
Authors:
Yang Qi,
Pierre Comon,
Lek-Heng Lim
Abstract:
We study the semialgebraic structure of $D_r$, the set of nonnegative tensors of nonnegative rank not more than $r$, and use the results to infer various properties of nonnegative tensor rank. We determine all nonnegative typical ranks for cubical nonnegative tensors and show that the direct sum conjecture is true for nonnegative tensor rank. We show that nonnegative, real, and complex ranks are a…
▽ More
We study the semialgebraic structure of $D_r$, the set of nonnegative tensors of nonnegative rank not more than $r$, and use the results to infer various properties of nonnegative tensor rank. We determine all nonnegative typical ranks for cubical nonnegative tensors and show that the direct sum conjecture is true for nonnegative tensor rank. We show that nonnegative, real, and complex ranks are all equal for a general nonnegative tensor of nonnegative rank strictly less than the complex generic rank. In addition, such nonnegative tensors always have unique nonnegative rank-$r$ decompositions if the real tensor space is $r$-identifiable. We determine conditions under which a best nonnegative rank-$r$ approximation has a unique nonnegative rank-$r$ decomposition: for $r \le 3$, this is always the case; for general $r$, this is the case when the best nonnegative rank-$r$ approximation does not lie on the boundary of $D_r$. Many of our general identifiability results also apply to real tensors and real symmetric tensors.
△ Less
Submitted 22 August, 2016; v1 submitted 11 December, 2015;
originally announced January 2016.
-
Fast structured matrix computations: tensor rank and Cohn--Umans method
Authors:
Ke Ye,
Lek-Heng Lim
Abstract:
We discuss a generalization of the Cohn-Umans method, a potent technique developed for studying the bilinear complexity of matrix multiplication by embedding matrices into an appropriate group algebra. We investigate how the Cohn-Umans method may be used for bilinear operations other than matrix multiplication, with algebras other than group algebras, and we relate it to Strassen's tensor rank app…
▽ More
We discuss a generalization of the Cohn-Umans method, a potent technique developed for studying the bilinear complexity of matrix multiplication by embedding matrices into an appropriate group algebra. We investigate how the Cohn-Umans method may be used for bilinear operations other than matrix multiplication, with algebras other than group algebras, and we relate it to Strassen's tensor rank approach, the traditional framework for investigating bilinear complexity. To demonstrate the utility of the generalized method, we apply it to find the fastest algorithms for forming structured matrix-vector product, the basic operation underlying iterative algorithms for structured matrices. The structures we study include Toeplitz, Hankel, circulant, symmetric, skew-symmetric, f-circulant, block-Toeplitz-Toeplitz-block, triangular Toeplitz matrices, Toeplitz-plus-Hankel, sparse/banded/triangular. Except for the case of skew-symmetric matrices, for which we have only upper bounds, the algorithms derived using the generalized Cohn-Umans method in all other instances are the fastest possible in the sense of having minimum bilinear complexity. We also apply this framework to a few other bilinear operations including matrix-matrix, commutator, simultaneous matrix products, and briefly discuss the relation between tensor nuclear norm and numerical stability.
△ Less
Submitted 9 June, 2016; v1 submitted 3 January, 2016;
originally announced January 2016.
-
Fast randomized iteration: diffusion Monte Carlo through the lens of numerical linear algebra
Authors:
Lek-Heng Lim,
Jonathan Weare
Abstract:
We review the basic outline of the highly successful diffusion Monte Carlo technique commonly used in contexts ranging from electronic structure calculations to rare event simulation and data assimilation, and propose a new class of randomized iterative algorithms based on similar principles to address a variety of common tasks in numerical linear algebra. From the point of view of numerical linea…
▽ More
We review the basic outline of the highly successful diffusion Monte Carlo technique commonly used in contexts ranging from electronic structure calculations to rare event simulation and data assimilation, and propose a new class of randomized iterative algorithms based on similar principles to address a variety of common tasks in numerical linear algebra. From the point of view of numerical linear algebra, the main novelty of the Fast Randomized Iteration schemes described in this article is that they work in either linear or constant cost per iteration (and in total, under appropriate conditions) and are rather versatile: we will show how they apply to solution of linear systems, eigenvalue problems, and matrix exponentiation, in dimensions far beyond the present limits of numerical linear algebra. While traditional iterative methods in numerical linear algebra were created in part to deal with instances where a matrix (of size $\mathcal{O}(n^2)$) is too big to store, the algorithms that we propose are effective even in instances where the solution vector itself (of size $\mathcal{O}(n)$) may be too big to store or manipulate. In fact, our work is motivated by recent DMC based quantum Monte Carlo schemes that have been applied to matrices as large as $10^{108} \times 10^{108}$. We provide basic convergence results, discuss the dependence of these results on the dimension of the system, and demonstrate dramatic cost savings on a range of test problems.
△ Less
Submitted 9 October, 2017; v1 submitted 25 August, 2015;
originally announced August 2015.
-
Uniqueness of Nonnegative Tensor Approximations
Authors:
Yang Qi,
Pierre Comon,
Lek-Heng Lim
Abstract:
We show that for a nonnegative tensor, a best nonnegative rank-r approximation is almost always unique, its best rank-one approximation may always be chosen to be a best nonnegative rank-one approximation, and that the set of nonnegative tensors with non-unique best rank-one approximations form an algebraic hypersurface. We show that the last part holds true more generally for real tensors and the…
▽ More
We show that for a nonnegative tensor, a best nonnegative rank-r approximation is almost always unique, its best rank-one approximation may always be chosen to be a best nonnegative rank-one approximation, and that the set of nonnegative tensors with non-unique best rank-one approximations form an algebraic hypersurface. We show that the last part holds true more generally for real tensors and thereby determine a polynomial equation so that a real or nonnegative tensor which does not satisfy this equation is guaranteed to have a unique best rank-one approximation. We also establish an analogue for real or nonnegative symmetric tensors. In addition, we prove a singular vector variant of the Perron--Frobenius Theorem for positive tensors and apply it to show that a best nonnegative rank-r approximation of a positive tensor can never be obtained by deflation. As an aside, we verify that the Euclidean distance (ED) discriminants of the Segre variety and the Veronese variety are hypersurfaces and give defining equations of these ED discriminants.
△ Less
Submitted 15 February, 2016; v1 submitted 29 October, 2014;
originally announced October 2014.
-
Multilinear PageRank
Authors:
David F. Gleich,
Lek-Heng Lim,
Yongyang Yu
Abstract:
In this paper, we first extend the celebrated PageRank modification to a higher-order Markov chain. Although this system has attractive theoretical properties, it is computationally intractable for many interesting problems. We next study a computationally tractable approximation to the higher-order PageRank vector that involves a system of polynomial equations called multilinear PageRank, which i…
▽ More
In this paper, we first extend the celebrated PageRank modification to a higher-order Markov chain. Although this system has attractive theoretical properties, it is computationally intractable for many interesting problems. We next study a computationally tractable approximation to the higher-order PageRank vector that involves a system of polynomial equations called multilinear PageRank, which is a type of tensor PageRank vector. It is motivated by a novel "spacey random surfer" model, where the surfer remembers bits and pieces of history and is influenced by this information. The underlying stochastic process is an instance of a vertex-reinforced random walk. We develop convergence theory for a simple fixed-point method, a shifted fixed-point method, and a Newton iteration in a particular parameter regime. In marked contrast to the case of the PageRank vector of a Markov chain where the solution is always unique and easy to compute, there are parameter regimes of multilinear PageRank where solutions are not unique and simple algorithms do not converge. We provide a repository of these non-convergent cases that we encountered through exhaustive enumeration and randomly sampling that we believe is useful for future study of the problem.
△ Less
Submitted 4 September, 2014;
originally announced September 2014.
-
Schubert varieties and distances between subspaces of different dimensions
Authors:
Ke Ye,
Lek-Heng Lim
Abstract:
We resolve a basic problem on subspace distances that often arises in applications: How can the usual Grassmann distance between equidimensional subspaces be extended to subspaces of different dimensions? We show that a natural solution is given by the distance of a point to a Schubert variety within the Grassmannian. This distance reduces to the Grassmann distance when the subspaces are equidimen…
▽ More
We resolve a basic problem on subspace distances that often arises in applications: How can the usual Grassmann distance between equidimensional subspaces be extended to subspaces of different dimensions? We show that a natural solution is given by the distance of a point to a Schubert variety within the Grassmannian. This distance reduces to the Grassmann distance when the subspaces are equidimensional and does not depend on any embedding into a larger ambient space. Furthermore, it has a concrete expression involving principal angles, and is efficiently computable in numerically stable ways. Our results are largely independent of the Grassmann distance --- if desired, it may be substituted by any other common distances between subspaces. Our approach depends on a concrete algebraic geometric view of the Grassmannian that parallels the differential geometric perspective that is well-established in applied and computational mathematics.
△ Less
Submitted 16 June, 2016; v1 submitted 3 July, 2014;
originally announced July 2014.
-
Learning Subspaces of Different Dimension
Authors:
Brian St. Thomas,
Lizhen Lin,
Lek-Heng Lim,
Sayan Mukherjee
Abstract:
We introduce a Bayesian model for inferring mixtures of subspaces of different dimensions. The key challenge in such a mixture model is specification of prior distributions over subspaces of different dimensions. We address this challenge by embedding subspaces or Grassmann manifolds into a sphere of relatively low dimension and specifying priors on the sphere. We provide an efficient sampling alg…
▽ More
We introduce a Bayesian model for inferring mixtures of subspaces of different dimensions. The key challenge in such a mixture model is specification of prior distributions over subspaces of different dimensions. We address this challenge by embedding subspaces or Grassmann manifolds into a sphere of relatively low dimension and specifying priors on the sphere. We provide an efficient sampling algorithm for the posterior distribution of the model parameters. We illustrate that a simple extension of our mixture of subspaces model can be applied to topic modeling. We also prove posterior consistency for the mixture of subspaces model. The utility of our approach is demonstrated with applications to real and simulated data.
△ Less
Submitted 22 September, 2015; v1 submitted 27 April, 2014;
originally announced April 2014.
-
Counting Square Discriminants
Authors:
Thomas A. Hulse,
E. Mehmet Kıral,
Chan Ieong Kuan,
Li-Mei Lim
Abstract:
Counting integral binary quadratic forms with certain restrictions is a classical problem. In this paper, we count binary quadratic forms of fixed discriminant given restrictions on the size of their coefficients. We accomplish this by investigating the analytic properties of a certain double Dirichlet series, which is a shifted convolution sum of certain classical automorphic forms.
Counting integral binary quadratic forms with certain restrictions is a classical problem. In this paper, we count binary quadratic forms of fixed discriminant given restrictions on the size of their coefficients. We accomplish this by investigating the analytic properties of a certain double Dirichlet series, which is a shifted convolution sum of certain classical automorphic forms.
△ Less
Submitted 6 August, 2015; v1 submitted 24 July, 2013;
originally announced July 2013.