Analytical Solution of a Three-layer Network with a Matrix Exponential Activation Function

Kuo Gai, Shihua Zhang
Academy of Mathematics and Systems Science
Chinese Academy of Sciences
Beijing, 100190, China
School of Mathematical Sciences
University of Chinese Academy of Sciences
Beijing, 100049, China
{gaikuo, zsh}@amss.ac.cn

Abstract

In practice, deeper networks tend to be more powerful than shallow ones, but this has not been understood theoretically. In this paper, we find an analytical solution of a three-layer network with a matrix exponential activation function, i.e., we show that the network

\bm{f}(\bm{X})=\bm{W}_{3}\exp(\bm{W}_{2}\exp(\bm{W}_{1}\bm{X})),\quad\bm{X}\in\mathbb{C}^{d\times d}

admits an analytical solution of the equations

\left\{\begin{array}{c}\bm{Y}_{1}=\bm{f}(\bm{X}_{1})\\ \bm{Y}_{2}=\bm{f}(\bm{X}_{2})\end{array}\right.

for $\bm{X}_{1},\bm{X}_{2},\bm{Y}_{1},\bm{Y}_{2}$ under invertibility assumptions only. Our proof shows the power of depth and of a non-linear activation function, since a one-layer network can solve only one such equation, i.e., $\bm{Y}=\bm{W}\bm{X}$.

1 Introduction

Deep neural networks have become successful in many fields, including computer vision, natural language processing, bioinformatics, etc. However, the mathematical principle of deep learning is still not fully understood, especially why deeper networks with non-linear activation functions tend to be more powerful than shallower ones.

It is well known that sufficiently large depth-2 neural networks with reasonable activation functions can approximate any continuous function on a bounded domain (Cybenko, 1989; Funahashi, 1989; Hornik et al., 1989; Barron, 1994; Pinkus, 1999), but this requires the width of the network to be exponentially large. Recent work has shown that some functions can be approximated by deeper networks with fewer neurons than by shallower ones, such as radial functions (Eldan & Shamir, 2016), Boolean circuits (Rossman et al., 2015), or functions induced by neural networks (Telgarsky, 2016). However, these functions are far from the functions approximated by neural networks in practice.

There are also some studies on fitting a fixed number of data points instead of continuous functions, which is more general since data points can be sampled from arbitrary distributions. However, such works focus more on width than on depth. For instance, the notable neural tangent kernel (NTK) framework (Jacot et al., 2018) proved that neural networks can fit the data with zero error if the width is infinite. However, such wide neural networks have an extremely large number of parameters and extract random features of the data. Moreover, current state-of-the-art results are typically achieved by deep neural networks (He et al., 2016; Krizhevsky et al., 2012). Generally, when the width of the network is bounded, the function class of neural networks becomes more complex after the composition of layers, so the optimization process may not find the globally optimal solution. Some empirical explorations reveal non-trivial properties of the loss landscape (Goodfellow et al., 2014; Li et al., 2018). However, these properties still lack theoretical understanding since the optimization of the network is highly non-convex. Thus, to show the power of depth, a potential way is to pursue an analytical solution instead of optimization. A line of research focuses on memory capacity (Vershynin, 2020; Yamasaki, 1993; Huang, 2003; Zhang et al., 2021; Yun et al., 2019), which aims at proving the existence of solutions through construction rather than computation. The constructions are intricate and the labels are restricted to scalars.

Some studies use matrix-form activation functions in practice. Li et al. (2017) introduce a matrix operation (either matrix logarithm or matrix square root) on top of a convolutional layer with higher-order feature crosses. Fischbacher et al. (2020) propose a single matrix exponential layer to learn the periodic structure or geometric invariants of the input. Matrix-form activation functions make it possible to find the solution through matrix computation instead of construction and provide a better understanding of the power of depth and non-linear activation functions.

In this paper, we omit the optimization process and compute the analytical solution of a three-layer neural network with a matrix exponential activation function. We show the power of depth by proving that a three-layer network can map more matrix-form data points to their labels than a single-layer network. We also shed light on networks with element-wise activation functions experimentally using a similar methodology, indicating that the number of equations a network can solve increases linearly with the number of layers.

2 Preliminary

The matrix exponential is a matrix function on square matrices analogous to the ordinary exponential function. Let $\bm{X}$ be a $d\times d$ complex matrix. The exponential of $\bm{X}$, denoted by $\operatorname{exp}(\bm{X})$, is the $d\times d$ matrix given by the power series

\operatorname{exp}(\bm{X})=\sum_{k=0}^{\infty}\frac{1}{k!}\bm{{X}}^{k}

(1)

where $\bm{X}^{0}$ is defined to be the identity matrix $\bm{I}$ with the same dimensions as $\bm{X}$. The matrix exponential is well studied in the theory of Lie groups and has many good properties.
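As a rough illustration of the power series (1), the following sketch sums a fixed number of terms with NumPy. The function name `expm_series` and the truncation length are illustrative choices, not part of the paper, and plain truncation is only adequate for matrices of moderate norm. The check uses the standard fact that the exponential of a scaled planar rotation generator is a rotation matrix.

```python
import numpy as np

def expm_series(X, terms=40):
    """Approximate exp(X) by a partial sum of the power series (1)."""
    X = np.asarray(X, dtype=complex)
    total = np.eye(X.shape[0], dtype=complex)  # k = 0 term: X^0 = I
    term = np.eye(X.shape[0], dtype=complex)
    for k in range(1, terms):
        term = term @ X / k                    # now holds X^k / k!
        total = total + term
    return total

# exp(theta * G) for the generator G below is the rotation by angle theta.
theta = 0.7
G = np.array([[0.0, -1.0], [1.0, 0.0]])
R = expm_series(theta * G)
expected = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
print(np.allclose(R, expected))
```

For production use, a library routine based on scaling and squaring would normally be preferred over a raw truncated series.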

Proposition 1.

Let $\bm{X},\bm{Y}\in\mathbb{C}^{d\times d}$. If $\bm{X}\bm{Y}=\bm{Y}\bm{X}$, then $\operatorname{exp}(\bm{X})\operatorname{exp}(\bm{Y})=\operatorname{exp}(\bm{X}+\bm{Y})$.

Proposition 2.

The matrix exponential gives a surjective map

\operatorname{exp}:M_{d}(\mathbb{C})\to\operatorname{GL}(d,\mathbb{C})

(2)

where $M_{d}(\mathbb{C})$ is the space of all $d\times d$ complex matrices and $\operatorname{GL}(d,\mathbb{C})$ is the general linear group of degree $d$ , i.e. the group of all $d\times d$ invertible matrices.

In general, $\operatorname{exp}(\bm{X})\operatorname{exp}(\bm{Y})$ can be expressed by the Baker-Campbell-Hausdorff (BCH) formula, and when $\bm{X}$ and $\bm{Y}$ commute, the BCH formula simplifies to the identity in Proposition 1. Proposition 2 means that every invertible matrix $\bm{X}$ can be written as the exponential of some other matrix $\bm{Z}$ (for this, it is essential to consider the field $\mathbb{C}$ and not $\mathbb{R}$).
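Proposition 1 and the role of commutativity can be checked numerically. The sketch below is illustrative only: the helper `expm_series` (a truncated power series) and the specific matrices are assumptions made for the example. It compares $\exp(\bm{X})\exp(\bm{Y})$ with $\exp(\bm{X}+\bm{Y})$ for a commuting pair and for a non-commuting pair.

```python
import numpy as np

def expm_series(X, terms=60):
    """Partial sum of the power series (1); adequate for small matrices."""
    X = np.asarray(X, dtype=complex)
    total = np.eye(X.shape[0], dtype=complex)
    term = np.eye(X.shape[0], dtype=complex)
    for k in range(1, terms):
        term = term @ X / k
        total = total + term
    return total

# Commuting pair: Y is a polynomial in X, so XY = YX.
X = np.array([[0.2, 0.5], [-0.3, 0.1]])
Y = X @ X + 2.0 * np.eye(2)
commuting_ok = np.allclose(expm_series(X) @ expm_series(Y),
                           expm_series(X + Y))

# Non-commuting pair: two nilpotent matrices with AB != BA.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [1.0, 0.0]])
noncommuting_ok = np.allclose(expm_series(A) @ expm_series(B),
                              expm_series(A + B))

print(commuting_ok, noncommuting_ok)  # True False
```

In the non-commuting case the product equals $[[2,1],[1,1]]$ while $\exp(\bm{A}+\bm{B})$ has entries $\cosh 1$ and $\sinh 1$, so the identity fails.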

$\bm{Z}$ can be calculated through the matrix logarithm. First we need to find the Jordan decomposition of $\bm{X}$ and calculate the logarithm of the Jordan blocks. For instance, we can write a Jordan block as

\begin{aligned}\bm{B}&=\left[\begin{array}{cccccc}\lambda&1&0&0&\cdots&0\\ 0&\lambda&1&0&\cdots&0\\ 0&0&\lambda&1&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\ddots&\vdots\\ 0&0&0&0&\lambda&1\\ 0&0&0&0&0&\lambda\end{array}\right]\\ &=\lambda\left[\begin{array}{cccccc}1&\lambda^{-1}&0&0&\cdots&0\\ 0&1&\lambda^{-1}&0&\cdots&0\\ 0&0&1&\lambda^{-1}&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\ddots&\vdots\\ 0&0&0&0&1&\lambda^{-1}\\ 0&0&0&0&0&1\end{array}\right]\\ &=\lambda(\bm{I}+\bm{K})\end{aligned}

(3)

where $\bm{K}$ is a matrix with zeros on and under the main diagonal. The number $\lambda$ is nonzero by the assumption that $\bm{X}$ is invertible. Then, by the Mercator series

\log(1+x)=x-\frac{x^{2}}{2}+\frac{x^{3}}{3}-\frac{x^{4}}{4}+\cdots

(4)

we have

\log\bm{B}=\log(\lambda(\bm{I}+\bm{K}))=(\log\lambda)\bm{I}+\bm{K}-\frac{\bm{K}^{2}}{2}+\frac{\bm{K}^{3}}{3}-\frac{\bm{K}^{4}}{4}+\cdots

(5)

This series has a finite number of terms since $\bm{K}^{m}=\bm{0}$, where $m$ is the dimension of $\bm{K}$. Thus the sum is well-defined. Assume that $\bm{J}$ is the Jordan normal form of $\bm{X}$ and $\bm{X}=\bm{P}\bm{J}\bm{P}^{-1}$. Following the method above, we can calculate $\log\bm{J}$ block by block and obtain $\bm{Z}=\log\bm{X}=\bm{P}(\log\bm{J})\bm{P}^{-1}$.
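The finite series (5) is easy to evaluate directly. The sketch below, with hypothetical helper names `log_jordan_block` and `expm_series`, computes the logarithm of a single Jordan block and checks that exponentiating it recovers the block. The eigenvalue $\lambda=-3$ is chosen on purpose: its logarithm is complex, which illustrates why the field $\mathbb{C}$ is needed.

```python
import numpy as np

def expm_series(X, terms=60):
    """Partial sum of the power series (1); adequate for small matrices."""
    X = np.asarray(X, dtype=complex)
    total = np.eye(X.shape[0], dtype=complex)
    term = np.eye(X.shape[0], dtype=complex)
    for k in range(1, terms):
        term = term @ X / k
        total = total + term
    return total

def log_jordan_block(lam, m):
    """Logarithm of the m x m Jordan block with eigenvalue lam != 0, via (5)."""
    # K is strictly upper triangular with 1/lam on the first superdiagonal.
    K = np.diag(np.full(m - 1, 1.0 / lam, dtype=complex), k=1)
    log_B = np.log(complex(lam)) * np.eye(m, dtype=complex)
    K_power = np.eye(m, dtype=complex)
    for j in range(1, m):  # K^m = 0, so the series stops at j = m - 1
        K_power = K_power @ K
        log_B = log_B + (-1) ** (j + 1) * K_power / j
    return log_B

lam, m = -3.0, 4
B = lam * np.eye(m) + np.diag(np.ones(m - 1), k=1)  # the Jordan block itself
log_B = log_jordan_block(lam, m)
print(np.allclose(expm_series(log_B), B))
```

The principal branch of the scalar logarithm is used here; any other branch of $\log\lambda$ would give another valid $\bm{Z}$.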

3 Main result

The basic task of machine learning is to find a function which maps the data to its label, i.e., for given $\{(\bm{x}_{i},\bm{y}_{i})\}_{i=1}^{n}$ where $\bm{x}_{i}\in\mathbb{R}^{d_{x}},\bm{y}_{i}\in\mathbb{R}^{d_{y}}$ , solve the equations $f(\bm{x}_{i})=\bm{y}_{i}$ , $i=1,\cdots,n$ . Specifically, for neural networks, $f$ is composed of linear transformations and nonlinear activation functions, i.e., for $m$ -layer network,

\bm{f}(\cdot)=\bm{W}_{m}\sigma\left(\bm{W}_{m-1}\cdots\sigma\left(\bm{W}_{1}\cdot\right)\right)

(6)

where $\sigma$ is the nonlinear activation function and $\bm{W}_{1}\in\mathbb{R}^{d_{1}\times d_{x}}$, $\bm{W}_{k}\in\mathbb{R}^{d_{k}\times d_{k-1}}$, $k=2,\cdots,m-1$, $\bm{W}_{m}\in\mathbb{R}^{d_{y}\times d_{m-1}}$. Typically, $\sigma$ is an element-wise function such as ReLU, sigmoid, or tanh. Generally, proving the existence of a solution of a nonlinear system is hard, especially when the element-wise function $\sigma$ does not interact well with the linear transformation matrix $\bm{W}$. For instance, let $\sigma(x)=x^{2}$; then $\sigma(\bm{A})=\bm{A}\circ\bm{A}$ for $\bm{A}\in\mathbb{R}^{d\times d^{\prime}}$, where $\circ$ is the Hadamard product. In general, $\bm{A}\circ\bm{A}$ cannot be expressed as a polynomial of $\bm{A}$, i.e., $\bm{A}\circ\bm{A}\neq\operatorname{poly}(\bm{A})$. This causes difficulties in finding the analytical solution of neural networks, since we cannot transform the output of each layer into an operable form. To address this issue, we use the matrix exponential as the nonlinear activation function instead, which makes it possible to find the solution to the system when the number of layers is more than one.

To make the matrix exponential well-defined, we assume $\bm{X},\bm{Y},\bm{W}$ are square. To guarantee that a solution exists, we assume the entries of $\bm{X},\bm{Y},\bm{W}$ are in $\mathbb{C}$. Consider $\bm{X},\bm{Y}\in\mathbb{C}^{d\times d}$ with $\bm{X}$ invertible; then $\bm{W}=\bm{Y}\bm{X}^{-1}$ solves the equation $\bm{Y}=\bm{W}\bm{X}$. Except in degenerate cases, there is no solution of $\bm{Y}_{1}=\bm{W}\bm{X}_{1},\bm{Y}_{2}=\bm{W}\bm{X}_{2}$ for $\bm{X}_{1},\bm{X}_{2},\bm{Y}_{1},\bm{Y}_{2}\in\mathbb{C}^{d\times d}$, since the number of parameters $d^{2}$ is less than the number of equations $2d^{2}$. If we let the weight matrix be ‘wider’, i.e.,

\bm{W}=\left[\begin{array}[]{cc}\bm{W}_{1}&\bm{0}\\ \bm{0}&\bm{W}_{2}\end{array}\right]

(7)

then with the assumption that $\bm{X}_{1}$ and $\bm{X}_{2}$ are invertible, $\bm{W}_{1}=\bm{Y}_{1}\bm{X}_{1}^{-1}$ and $\bm{W}_{2}=\bm{Y}_{2}\bm{X}_{2}^{-1}$ can solve the equations

\left[\begin{array}[]{c}\bm{Y}_{1}\\ \bm{Y}_{2}\end{array}\right]=\left[\begin{array}[]{cc}\bm{W}_{1}&\bm{0}\\ \bm{0}&\bm{W}_{2}\end{array}\right]\cdot\left[\begin{array}[]{c}\bm{X}_{1}\\ \bm{X}_{2}\end{array}\right]

(8)

The above equation has a solution because we can separate it into two sub-problems and solve for $\bm{W}_{1}$ and $\bm{W}_{2}$ separately. However, this is not possible when we compose $\bm{W}_{1}$ and $\bm{W}_{2}$ (a two-layer network with identity activation function), i.e., when solving the equations

\bm{Y}_{1}=\bm{W}_{2}\bm{W}_{1}\bm{X}_{1};\quad\bm{Y}_{2}=\bm{W}_{2}\bm{W}_{1}\bm{X}_{2}

(9)

When $\bm{W}_{1}$ is fixed, $\bm{W}_{2}$ with $d^{2}$ parameters is involved in $2d^{2}$ equations, i.e., $\bm{Y}_{1}=\bm{W}_{2}(\bm{W}_{1}\bm{X}_{1})$ and $\bm{Y}_{2}=\bm{W}_{2}(\bm{W}_{1}\bm{X}_{2})$, which have no solution in general. The situation changes again when a non-linear activation function is added, i.e., when solving the equations

\begin{aligned}\bm{Y}_{1}&=\bm{W}_{2}\sigma(\bm{W}_{1}\bm{X}_{1})\\ \bm{Y}_{2}&=\bm{W}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})\end{aligned}

(10)

From the second equation, we obtain $\bm{W}_{2}=\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}$, provided that $\sigma(\bm{W}_{1}\bm{X}_{2})$ is invertible. Substituting it into the first equation, we have

\bm{Y}_{1}=\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}\sigma(\bm{W}_{1}\bm{X}_{1})

(11)

If this equation has a solution for $\bm{W}_{1}$, then the non-linear system (10) has a solution for $\bm{W}_{1}$ and $\bm{W}_{2}$. Following this intuition, we prove that a three-layer network with a matrix exponential activation function can solve two such equations, exhibiting the power of depth and of non-linear activation.
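The parameter-counting argument for the linear case can be illustrated numerically. In the sketch below the $2\times 2$ matrices are arbitrary illustrative values (not data from the paper). A single weight matrix fits one pair exactly, while the best least-squares fit to two pairs leaves a nonzero residual; since a product $\bm{W}_{2}\bm{W}_{1}$ is itself one $d\times d$ matrix, the same limitation applies to (9).

```python
import numpy as np

# Arbitrary illustrative data (assumed for this example only).
X1 = np.array([[2.0, 0.3], [0.1, 1.8]])
X2 = np.array([[1.0, -0.2], [0.2, 0.9]])
Y1 = np.array([[1.0, 0.4], [-0.3, 1.2]])
Y2 = np.array([[0.8, -0.1], [0.5, 1.1]])

# One equation: W = Y1 X1^{-1} fits (X1, Y1) exactly.
W = Y1 @ np.linalg.inv(X1)
one_eq_residual = np.linalg.norm(W @ X1 - Y1)

# Two equations, one d x d matrix: least-squares W for W [X1 X2] = [Y1 Y2].
X_cat = np.hstack([X1, X2])  # shape (d, 2d)
Y_cat = np.hstack([Y1, Y2])  # shape (d, 2d)
W_ls = Y_cat @ np.linalg.pinv(X_cat)
two_eq_residual = np.linalg.norm(W_ls @ X_cat - Y_cat)

print(one_eq_residual, two_eq_residual)
```

The first residual is at the level of rounding error, while the second stays bounded away from zero because $\bm{Y}_{1}\bm{X}_{1}^{-1}\bm{X}_{2}\neq\bm{Y}_{2}$ for these matrices.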

Theorem 1.

Let $\bm{X}_{1},\bm{X}_{2}$ be the data matrices and $\bm{Y}_{1},\bm{Y}_{2}$ be the corresponding label matrices, where $\bm{X}_{1},\bm{X}_{2},\bm{Y}_{1},\bm{Y}_{2}\in\mathbb{C}^{d\times d}$ are invertible matrices. Assume that $\bm{X}_{1}-\bm{X}_{2}$ is invertible. Let $\bm{f}(\cdot)=\bm{W}_{3}\sigma(\bm{W}_{2}\sigma(\bm{W}_{1}\cdot))$ be a three-layer network where $\sigma(\cdot)$ is the matrix exponential, i.e., $\sigma(\cdot)=\operatorname{exp}(\cdot):\mathbb{C}^{d\times d}\to\mathbb{C}^{d\times d}$, and $\bm{W}_{1},\bm{W}_{2},\bm{W}_{3}\in\mathbb{C}^{d\times d}$. If

\begin{aligned}\bm{W}_{1}&=\operatorname{ln}\alpha\cdot(\bm{X}_{1}-\bm{X}_{2})^{-1}\\ \bm{W}_{2}&=(\bm{Z}-\operatorname{ln}\alpha\cdot\bm{I})\cdot\operatorname{exp}(-\bm{W}_{1}\bm{X}_{2})\cdot\frac{1}{1-\alpha}\\ \bm{W}_{3}&=\bm{Y}_{1}\operatorname{exp}(-\bm{W}_{2}\operatorname{exp}(\bm{W}_{1}\bm{X}_{1}))\end{aligned}

(12)

where $\alpha\in\mathbb{R}^{+},\alpha\neq 1$ and $\operatorname{exp}(\bm{Z})=\alpha\bm{Y}_{1}^{-1}\bm{Y}_{2}$, then $\bm{f}$ maps the data points to their labels, i.e., $\bm{f}(\bm{X}_{1})=\bm{Y}_{1},\bm{f}(\bm{X}_{2})=\bm{Y}_{2}$.

Proof 1.

We assume

\displaystyle\bm{W}_{1}=\bm{W}_{1,1}\bm{W}_{1,2}

(13)

where $\bm{W}_{1,1},\bm{W}_{1,2}\in\mathbb{C}^{d\times d}$ and $\bm{W}_{1,2}$ is invertible. Since the exponential of a matrix is always an invertible matrix, we can define

\begin{aligned}\bm{M}_{1,X_{1}}&=\operatorname{exp}(\bm{W}_{1}\bm{X}_{1})\bm{X}_{1}^{-1}\bm{W}_{1,2}^{-1}\\ \bm{M}_{1,X_{2}}&=\operatorname{exp}(\bm{W}_{1}\bm{X}_{2})\bm{X}_{2}^{-1}\bm{W}_{1,2}^{-1}\\ \bm{M}_{2,X_{1}}&=\operatorname{exp}(\bm{W}_{2}\operatorname{exp}(\bm{W}_{1}\bm{X}_{1}))\operatorname{exp}(\bm{W}_{1}\bm{X}_{1})^{-1}\\ \bm{M}_{2,X_{2}}&=\operatorname{exp}(\bm{W}_{2}\operatorname{exp}(\bm{W}_{1}\bm{X}_{2}))\operatorname{exp}(\bm{W}_{1}\bm{X}_{2})^{-1}\end{aligned}

(14)

Use the trick

\left[\begin{array}[]{cc}\bm{A}&\bm{0}\\ \bm{0}&\bm{B}\end{array}\right]=\left[\begin{array}[]{cc}\bm{A}&\bm{0}\\ \bm{0}&\bm{A}\end{array}\right]\cdot\left[\begin{array}[]{cc}\bm{I}&\bm{0}\\ \bm{0}&\bm{A}^{-1}\bm{B}\end{array}\right]

(15)

twice, then we have

\begin{aligned}
&\left[\begin{array}{cc}\operatorname{exp}(\bm{W}_{2}\operatorname{exp}(\bm{W}_{1}\bm{X}_{1}))&0\\ 0&\operatorname{exp}(\bm{W}_{2}\operatorname{exp}(\bm{W}_{1}\bm{X}_{2}))\end{array}\right]\\
={}&\left[\begin{array}{cc}\bm{M}_{2,X_{1}}&0\\ 0&\bm{M}_{2,X_{2}}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{M}_{1,X_{1}}&0\\ 0&\bm{M}_{1,X_{2}}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{W}_{1,2}\bm{X}_{1}&0\\ 0&\bm{W}_{1,2}\bm{X}_{2}\end{array}\right]\\
={}&\left[\begin{array}{cc}\bm{M}_{2,X_{1}}&0\\ 0&\bm{M}_{2,X_{2}}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{M}_{1,X_{2}}&0\\ 0&\bm{M}_{1,X_{2}}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{M}_{1,X_{2}}^{-1}\bm{M}_{1,X_{1}}&0\\ 0&\bm{I}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{W}_{1,2}\bm{X}_{1}&0\\ 0&\bm{W}_{1,2}\bm{X}_{2}\end{array}\right]\\
={}&\left[\begin{array}{cc}\bm{M}_{2,X_{1}}\bm{M}_{1,X_{2}}&0\\ 0&\bm{M}_{2,X_{2}}\bm{M}_{1,X_{2}}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{M}_{1,X_{2}}^{-1}\bm{M}_{1,X_{1}}&0\\ 0&\bm{I}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{W}_{1,2}\bm{X}_{1}&0\\ 0&\bm{W}_{1,2}\bm{X}_{2}\end{array}\right]\\
={}&\left[\begin{array}{cc}\bm{M}_{2,X_{1}}\bm{M}_{1,X_{2}}&0\\ 0&\bm{M}_{2,X_{1}}\bm{M}_{1,X_{2}}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{I}&0\\ 0&\bm{M}_{1,X_{2}}^{-1}\bm{M}_{2,X_{1}}^{-1}\bm{M}_{2,X_{2}}\bm{M}_{1,X_{2}}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{M}_{1,X_{2}}^{-1}\bm{M}_{1,X_{1}}&0\\ 0&\bm{I}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{W}_{1,2}\bm{X}_{1}&0\\ 0&\bm{W}_{1,2}\bm{X}_{2}\end{array}\right]
\end{aligned}

(16)

Let

\bm{W}_{3}=\bm{M}_{1,X_{2}}^{-1}\bm{M}_{2,X_{1}}^{-1},

(17)

to eliminate the first matrix on the right-hand side of the last equality in (16); then we have

\begin{aligned}
&\left[\begin{array}{cc}\bm{f}(\bm{X}_{1})&0\\ 0&\bm{f}(\bm{X}_{2})\end{array}\right]\\
={}&\left[\begin{array}{cc}\bm{W}_{3}\operatorname{exp}(\bm{W}_{2}\operatorname{exp}(\bm{W}_{1}\bm{X}_{1}))&0\\ 0&\bm{W}_{3}\operatorname{exp}(\bm{W}_{2}\operatorname{exp}(\bm{W}_{1}\bm{X}_{2}))\end{array}\right]\\
={}&\left[\begin{array}{cc}\bm{I}&0\\ 0&\bm{M}_{1,X_{2}}^{-1}\bm{M}_{2,X_{1}}^{-1}\bm{M}_{2,X_{2}}\bm{M}_{1,X_{2}}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{M}_{1,X_{2}}^{-1}\bm{M}_{1,X_{1}}&0\\ 0&\bm{I}\end{array}\right]\cdot\left[\begin{array}{cc}\bm{W}_{1,2}\bm{X}_{1}&0\\ 0&\bm{W}_{1,2}\bm{X}_{2}\end{array}\right]
\end{aligned}

(18)

Let $\bm{\tilde{X}}_{1}=\bm{W}_{1,2}\bm{X}_{1},\bm{\tilde{X}}_{2}=\bm{W}_{1,2}\bm{X}_{2}$. Solving

\left[\begin{array}{cc}\bm{f}(\bm{X}_{1})&0\\ 0&\bm{f}(\bm{X}_{2})\end{array}\right]=\left[\begin{array}{cc}\bm{Y}_{1}&0\\ 0&\bm{Y}_{2}\end{array}\right]

(19)

is equivalent to solving

\left\{\begin{array}{l}\bm{M}_{1,X_{2}}^{-1}\bm{M}_{1,X_{1}}\bm{\tilde{X}}_{1}=\bm{Y}_{1}\\ \bm{M}_{1,X_{2}}^{-1}\bm{M}_{2,X_{1}}^{-1}\bm{M}_{2,X_{2}}\bm{M}_{1,X_{2}}\bm{\tilde{X}}_{2}=\bm{Y}_{2}\end{array}\right.

(20)

By the definition of $\bm{M}_{1,X_{1}},\bm{M}_{1,X_{2}},\bm{M}_{2,X_{1}},\bm{M}_{2,X_{2}}$ , we can rewrite equalities in (20) as:

\left\{\begin{array}{l}\bm{\tilde{X}}_{2}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2})^{-1}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1})=\bm{Y}_{1}\\ \bm{\tilde{X}}_{2}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2})^{-1}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1})\operatorname{exp}(\bm{W}_{2}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1}))^{-1}\operatorname{exp}(\bm{W}_{2}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2}))=\bm{Y}_{2}\end{array}\right.

(21)

To solve the first equality in (21), let

\bm{W}_{1,2}=\frac{1}{\alpha}\bm{Y}_{1}\bm{X}_{2}^{-1}

(22)

where $\alpha\in\mathbb{R}^{+},\alpha\neq 1$ , then

\bm{\tilde{X}}_{2}^{-1}\bm{Y}_{1}=\alpha\bm{I}=\operatorname{exp}(\operatorname{ln}\alpha\cdot\bm{I})

(23)

Then the first equality in (21) can be rewritten as

\begin{aligned}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1})&=\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2})\operatorname{exp}(\operatorname{ln}\alpha\cdot\bm{I})\\ &=\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2}+\operatorname{ln}\alpha\cdot\bm{I})\end{aligned}

(24)

The second equality follows from Proposition 1, since $\bm{W}_{1,1}\bm{\tilde{X}}_{2}$ commutes with $\operatorname{ln}\alpha\cdot\bm{I}$. Thus it is sufficient to solve the equality

\bm{W}_{1,1}\bm{\tilde{X}}_{1}=\bm{W}_{1,1}\bm{\tilde{X}}_{2}+\operatorname{ln}\alpha\cdot\bm{I}

(25)

Since $\bm{X}_{1}-\bm{X}_{2}$ is invertible by assumption and $\bm{W}_{1,2}$ is invertible, $\bm{\tilde{X}}_{1}-\bm{\tilde{X}}_{2}=\bm{W}_{1,2}(\bm{X}_{1}-\bm{X}_{2})$ is invertible, and hence

\bm{W}_{1,1}=\operatorname{ln}\alpha\cdot(\bm{\tilde{X}}_{1}-\bm{\tilde{X}}_{2})^{-1},\quad\bm{W}_{1}=\bm{W}_{1,1}\bm{W}_{1,2}=\operatorname{ln}\alpha\cdot(\bm{X}_{1}-\bm{X}_{2})^{-1}

(26)

Substituting the first equality in (21) into the second equality in (21), the second equality can be rewritten as

\begin{aligned}\operatorname{exp}(\bm{W}_{2}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1}))^{-1}\operatorname{exp}(\bm{W}_{2}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2}))&=\bm{Y}_{1}^{-1}\bm{Y}_{2}\\ &=\frac{1}{\alpha}\operatorname{exp}(\bm{Z})\end{aligned}

(27)

The second equality follows from the definition of $\bm{Z}$. Such a $\bm{Z}$ exists by Proposition 2. If $\bm{W}_{2}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1})$ commutes with $\bm{Z}$, then by Proposition 1 we only need to solve

\bm{W}_{2}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2})=\bm{W}_{2}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1})+(\bm{Z}-\operatorname{ln}\alpha\cdot\bm{I})

(28)

Note that according to (24)

\begin{aligned}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2})-\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1})&=\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2})(\bm{I}-\alpha\bm{I})\\ &=(1-\alpha)\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2})\end{aligned}

(29)

so $\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2})-\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1})$ is invertible since $\alpha\neq 1$. Thus the solution to (28) is

\begin{aligned}\bm{W}_{2}&=(\bm{Z}-\operatorname{ln}\alpha\cdot\bm{I})(\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2})-\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1}))^{-1}\\ &=\frac{1}{1-\alpha}(\bm{Z}-\operatorname{ln}\alpha\cdot\bm{I})\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2})^{-1}\end{aligned}

(30)

Finally, we need to verify that $\bm{W}_{2}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1})$ commutes with $\bm{Z}$. This follows from (24), since

\begin{aligned}\bm{W}_{2}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1})&=\frac{1}{1-\alpha}(\bm{Z}-\operatorname{ln}\alpha\cdot\bm{I})\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{2})^{-1}\operatorname{exp}(\bm{W}_{1,1}\bm{\tilde{X}}_{1})\\ &=\frac{\alpha}{1-\alpha}(\bm{Z}-\operatorname{ln}\alpha\cdot\bm{I})\end{aligned}

(31)

which is a polynomial in $\bm{Z}$ and therefore commutes with $\bm{Z}$.

When $\bm{W}_{1},\bm{W}_{2}$ are fixed as in (26) and (30), $\bm{W}_{3}$ is determined by (17):

\begin{aligned}\bm{W}_{3}&=\bm{M}_{1,X_{2}}^{-1}\bm{M}_{2,X_{1}}^{-1}\\ &=\bm{Y}_{1}\operatorname{exp}(-\bm{W}_{2}\operatorname{exp}(\bm{W}_{1}\bm{X}_{1}))\end{aligned}

(32)

which concludes the proof.

Note that $\bm{Z}$ can be calculated using the method in Section 2, thus the solution given in Theorem 1 can be calculated without gradient descent. Beyond the invertibility of the data and label matrices, the only assumption on the data is that $\bm{X}_{1}-\bm{X}_{2}$ is invertible, which is much more general than restricting to a certain class of functions.
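The closed form (12) can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: the $2\times 2$ data, the value $\alpha=2$, and the helpers `expm` and `logm_diag` are all chosen for this example. In particular, `logm_diag` computes $\bm{Z}$ by eigendecomposition and therefore assumes that $\alpha\bm{Y}_{1}^{-1}\bm{Y}_{2}$ is diagonalizable, which is the generic special case of the Jordan-form method of Section 2.

```python
import numpy as np

def expm(A, terms=30):
    """exp(A) via the power series (1), with scaling and squaring."""
    A = np.asarray(A, dtype=complex)
    s = 0
    while np.linalg.norm(A, 1) / 2**s > 0.5:  # scale so the series converges fast
        s += 1
    A_scaled = A / 2**s
    out = np.eye(A.shape[0], dtype=complex)
    term = np.eye(A.shape[0], dtype=complex)
    for k in range(1, terms):
        term = term @ A_scaled / k
        out = out + term
    for _ in range(s):                         # undo the scaling by squaring
        out = out @ out
    return out

def logm_diag(M):
    """A logarithm of M, ASSUMING M is diagonalizable and invertible."""
    w, P = np.linalg.eig(M)
    return P @ np.diag(np.log(w.astype(complex))) @ np.linalg.inv(P)

# Illustrative data (assumed): all four matrices and X1 - X2 are invertible.
X1 = np.array([[2.0, 0.3], [0.1, 1.8]])
X2 = np.array([[1.0, -0.2], [0.2, 0.9]])
Y1 = np.array([[1.0, 0.4], [-0.3, 1.2]])
Y2 = np.array([[0.8, -0.1], [0.5, 1.1]])
alpha = 2.0
I = np.eye(2)

# Weights from (12), with exp(Z) = alpha * Y1^{-1} Y2.
Z = logm_diag(alpha * np.linalg.inv(Y1) @ Y2)
W1 = np.log(alpha) * np.linalg.inv(X1 - X2)
W2 = (Z - np.log(alpha) * I) @ expm(-W1 @ X2) / (1.0 - alpha)
W3 = Y1 @ expm(-W2 @ expm(W1 @ X1))

def f(X):
    """Three-layer network with matrix exponential activation."""
    return W3 @ expm(W2 @ expm(W1 @ X))

print(np.allclose(f(X1), Y1), np.allclose(f(X2), Y2))
# Intermediate identity (24): exp(W1 X1) = alpha * exp(W1 X2).
print(np.allclose(expm(W1 @ X1), alpha * expm(W1 @ X2)))
```

The data were chosen to be well conditioned; for ill-conditioned $\bm{X}_{1}-\bm{X}_{2}$ or $\bm{Y}_{1}$ the nested exponentials can amplify rounding errors, so agreement to machine precision should not be expected in general.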

4 Experimental Results

Since we have already found the analytical solution of a three-layer network with a matrix exponential activation function, numerical experiments for that case are not necessary. In this section, we focus on experiments on element-wise activation functions such as ReLU and sigmoid using a similar method. As discussed in Section 3, consider the analogous equations for a two-layer network with element-wise activation $\sigma$, i.e.,

\left\{\begin{array}[]{c}\bm{Y}_{1}=\bm{W}_{2}\sigma(\bm{W}_{1}\bm{X}_{1})\\ \bm{Y}_{2}=\bm{W}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})\end{array}\right.

(33)

which is equivalent to solving for $\bm{W}_{1}$ and $\bm{W}_{2}$ sequentially through

\left\{\begin{array}{c}\bm{Y}_{1}=\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}\sigma(\bm{W}_{1}\bm{X}_{1})\\ \bm{W}_{2}=\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}\end{array}\right.

(34)

In our experiments, we minimize $\|\bm{Y}_{1}-\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}\sigma(\bm{W}_{1}\bm{X}_{1})\|_{F}^{2}$ over $\bm{W}_{1}$ with gradient descent. Each entry of $\bm{X}_{1}$, $\bm{X}_{2}$, $\bm{Y}_{1}$ and $\bm{Y}_{2}$ is sampled from the Gaussian distribution $\mathcal{N}(0,1)$. For comparison, we compute the same value when $\sigma$ is the identity function, i.e., $\|\bm{Y}_{1}-\bm{Y}_{2}(\bm{W}_{1}\bm{X}_{2})^{-1}\bm{W}_{1}\bm{X}_{1}\|_{F}^{2}=\|\bm{Y}_{1}-\bm{Y}_{2}\bm{X}_{2}^{-1}\bm{X}_{1}\|_{F}^{2}$, which does not depend on $\bm{W}_{1}$. Then we can construct a score to measure the benefit of using the sigmoid or ReLU function in the training process

s=\frac{\|\bm{Y}_{1}-\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}\sigma(\bm{W}_{1}\bm{X}_{1})\|_{F}^{2}}{\|\bm{Y}_{1}-\bm{Y}_{2}\bm{X}_{2}^{-1}\bm{X}_{1}\|_{F}^{2}}

(35)
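The score (35) can be computed with a few lines of NumPy. The sketch below is not the authors' experimental setup: the dimension, step size, iteration count, and the use of finite-difference gradients with a simple backtracking rule are all assumptions made to keep the example self-contained, and it is not intended to reproduce Figure 1.

```python
import numpy as np

def sigmoid(A):
    # Numerically stable form of 1 / (1 + exp(-A)).
    return 0.5 * (1.0 + np.tanh(0.5 * A))

def loss(W1, X1, X2, Y1, Y2, act=sigmoid):
    """Squared Frobenius residual of the first equation in (34)."""
    H1, H2 = act(W1 @ X1), act(W1 @ X2)
    try:
        ratio = np.linalg.solve(H2, H1)  # act(W1 X2)^{-1} act(W1 X1)
    except np.linalg.LinAlgError:
        return np.inf                    # treat a singular H2 as infeasible
    return np.linalg.norm(Y1 - Y2 @ ratio, "fro") ** 2

def numerical_grad(fun, W, eps=1e-6):
    """Central finite-difference gradient (a stand-in for autodiff)."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        E = np.zeros_like(W)
        E[idx] = eps
        G[idx] = (fun(W + E) - fun(W - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
d = 4  # assumed dimension
X1, X2, Y1, Y2 = (rng.standard_normal((d, d)) for _ in range(4))

# Denominator of (35): the identity-activation residual.
baseline = np.linalg.norm(Y1 - Y2 @ np.linalg.solve(X2, X1), "fro") ** 2

W1 = rng.standard_normal((d, d))
fun = lambda W: loss(W, X1, X2, Y1, Y2)
s_init = fun(W1) / baseline

step = 1e-2
for _ in range(200):
    candidate = W1 - step * numerical_grad(fun, W1)
    if fun(candidate) < fun(W1):  # accept only steps that decrease the loss
        W1 = candidate
    else:
        step *= 0.5               # otherwise shrink the step size
s_final = fun(W1) / baseline
print(s_init, s_final)
```

Because steps are accepted only when the loss decreases, `s_final` never exceeds `s_init`; how close it gets to 0 depends on the data, the initialization, and the optimizer settings.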

In the experiment (Fig. 1), we find that with both ReLU and sigmoid activations, gradient descent finds a $\bm{W}_{1}$ with $s$ close to 0. This indicates that a two-layer network with a ReLU or sigmoid activation function has a clear benefit compared with the identity function and has the potential to solve twice the number of equations. Also, the $s$ score decreases as the dimension increases, which suggests that the optimization problem becomes easier in high-dimensional spaces. However, it is hard to prove the existence of a solution of (34) and the existence of a path from the initial weights to the globally optimal weights under gradient descent.

Figure 1: The $s$ score of a two-layer network with sigmoid (left) and ReLU (right) activation functions during the training process.

5 Conclusion

In this paper, we design a problem for a three-layer network with the matrix exponential as an activation function and find its analytical solution. By doing this, we show the power of depth by comparing our three-layer networks to single-layer ones. Our result has merit compared with existing studies, both those that find special functions to show the power of depth and those that analyze the width of networks through optimization methods. We also shed light on two-layer networks with element-wise activation functions through experiments, indicating that neural networks have the potential to solve a number of equations equal to the number of parameters. As an activation function, the matrix exponential may provide less non-linearity than element-wise activation functions do, but it may be possible to analyze it based on results in Lie theory. In the future, we will try to extend our method to multi-layer cases.

Acknowledgments

This work has been supported by the CAS Project for Young Scientists in Basic Research [No. YSBR-034].

References

Barron (1994) Andrew R Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115–133, 1994.
Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
Eldan & Shamir (2016) Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pp. 907–940. PMLR, 2016.
Fischbacher et al. (2020) Thomas Fischbacher, Iulia M Comsa, Krzysztof Potempa, Moritz Firsching, Luca Versari, and Jyrki Alakuijala. Intelligent matrix exponentiation. arXiv preprint arXiv:2008.03936, 2020.
Funahashi (1989) Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.
Goodfellow et al. (2014) Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
Huang (2003) Guang-Bin Huang. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2):274–281, 2003.
Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
Li et al. (2018) Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems, 31, 2018.
Li et al. (2017) Peihua Li, Jiangtao Xie, Qilong Wang, and Wangmeng Zuo. Is second-order information helpful for large-scale visual recognition? In Proceedings of the IEEE International Conference on Computer Vision, pp. 2070–2078, 2017.
Pinkus (1999) Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta Numerica, 8:143–195, 1999.
Rossman et al. (2015) Benjamin Rossman, Rocco A Servedio, and Li-Yang Tan. An average-case depth hierarchy theorem for boolean circuits. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 1030–1048. IEEE, 2015.
Telgarsky (2016) Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pp. 1517–1539. PMLR, 2016.
Vershynin (2020) Roman Vershynin. Memory capacity of neural networks with threshold and rectified linear unit activations. SIAM Journal on Mathematics of Data Science, 2(4):1004–1033, 2020.
Yamasaki (1993) Masami Yamasaki. The lower bound of the capacity for a neural network with multiple hidden layers. In International Conference on Artificial Neural Networks, pp. 546–549. Springer, 1993.
Yun et al. (2019) Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small relu networks are powerful memorizers: a tight analysis of memorization capacity. Advances in Neural Information Processing Systems, 32, 2019.
Zhang et al. (2021) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.