
Analytical Solution of a Three-layer Network with a Matrix Exponential Activation Function

Kuo Gai, Shihua Zhang
Academy of Mathematics and Systems Science
Chinese Academy of Sciences
Beijing, 100190, China
School of Mathematical Sciences
University of Chinese Academy of Sciences
Beijing, 100049, China
{gaikuo, zsh}@amss.ac.cn
Abstract

In practice, deeper networks tend to be more powerful than shallow ones, but this has not been understood theoretically. In this paper, we find an analytical solution of a three-layer network with a matrix exponential activation function, i.e., we show that

$$\bm{f}(\bm{X})=\bm{W}_{3}\exp(\bm{W}_{2}\exp(\bm{W}_{1}\bm{X})),\quad \bm{X}\in\mathbb{C}^{d\times d}$$

has an analytical solution for the equations

$$\left\{\begin{array}{c}\bm{Y}_{1}=\bm{f}(\bm{X}_{1})\\ \bm{Y}_{2}=\bm{f}(\bm{X}_{2})\end{array}\right.$$

for $\bm{X}_{1},\bm{X}_{2},\bm{Y}_{1},\bm{Y}_{2}$ under invertibility assumptions only. Our proof shows the power of depth and of a non-linear activation function, since a one-layer network can only solve one equation, i.e., $\bm{Y}=\bm{W}\bm{X}$.

1 Introduction

Deep neural networks have become successful in many fields, including computer vision, natural language processing, bioinformatics, etc. However, the mathematical principle of deep learning is still not fully understood, especially why deeper networks with non-linear activation functions tend to be more powerful than shallower ones.

It is well known that sufficiently large depth-2 neural networks with reasonable activation functions can approximate any continuous function on a bounded domain (Cybenko, 1989; Funahashi, 1989; Hornik et al., 1989; Barron, 1994; Pinkus, 1999), but this requires the width of the network to be exponential. Recent work has shown that some functions can be approximated by deeper networks with fewer neurons than by shallower ones, such as radial functions (Eldan & Shamir, 2016), Boolean circuits (Rossman et al., 2015), or functions induced by neural networks (Telgarsky, 2016). However, these functions are far from the functions that neural networks approximate in practice.

There are also some studies on fitting a fixed number of data points instead of approximating continuous functions, which is more general since data points can be sampled from arbitrary distributions. However, such works focus more on width than on depth. For instance, the neural tangent kernel (NTK) framework (Jacot et al., 2018) shows that neural networks can fit the data with zero error if the width is infinite. However, such wide neural networks have an extremely large number of parameters and extract random features of the data. Moreover, current state-of-the-art results are typically achieved by deep neural networks (He et al., 2016; Krizhevsky et al., 2012). Generally, when the width of the network is bounded, the function class of neural networks becomes more complex after the composition of layers, so the optimization process may not find the globally optimal solution. Some empirical explorations reveal non-trivial properties of the loss landscape (Goodfellow et al., 2014; Li et al., 2018). However, these properties still lack theoretical understanding since the optimization of a network is highly non-convex. Thus, to show the power of depth, a potential approach is to pursue an analytical solution instead of optimization. A line of research focuses on memory capacity (Vershynin, 2020; Yamasaki, 1993; Huang, 2003; Zhang et al., 2021; Yun et al., 2019), which aims at proving the existence of solutions through construction rather than computation. The constructions are intricate and the labels are limited to scalars.

Some studies use matrix-form activation functions in practice. Li et al. (2017) introduce a matrix operation (either matrix logarithm or matrix square root) on top of a convolutional layer with higher-order feature crosses. Fischbacher et al. (2020) propose a single matrix exponential layer to learn the periodic structure or geometric invariants of the input. Matrix-form activation functions make it possible to find the solution through matrix computation instead of construction and provide a better understanding of the power of depth and non-linear activation functions.

In this paper, we omit the optimization process and compute the analytical solution of a three-layer neural network with a matrix exponential activation function. We show the power of depth by proving that a three-layer network can map more matrix-form data points to their labels than a single-layer network. We also shed light on networks with element-wise activation functions experimentally using a similar methodology, indicating that the number of equations a network can solve increases linearly with the number of layers.

2 Preliminary

The matrix exponential is a matrix function on square matrices analogous to the ordinary exponential function. Let $\bm{X}$ be a $d\times d$ complex matrix. The exponential of $\bm{X}$, denoted by $\exp(\bm{X})$, is the $d\times d$ matrix given by the power series

$$\exp(\bm{X})=\sum_{k=0}^{\infty}\frac{1}{k!}\bm{X}^{k} \tag{1}$$

where $\bm{X}^{0}$ is defined to be the identity matrix $\bm{I}$ with the same dimensions as $\bm{X}$. The matrix exponential is well studied in the theory of Lie groups and has many good properties.
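As a minimal sketch of definition (1), the series can be truncated and compared with a library routine. The helper name `expm_series`, its parameter `n_terms`, and the test matrix are assumptions introduced here for illustration; `scipy.linalg.expm` is used only as a reference value.

```python
import numpy as np
from scipy.linalg import expm


def expm_series(X, n_terms=30):
    """Approximate exp(X) by the first n_terms terms of the power series (1)."""
    d = X.shape[0]
    term = np.eye(d, dtype=complex)    # X^0 / 0! = I
    total = term.copy()
    for k in range(1, n_terms):
        term = term @ X / k            # X^k / k! built from X^(k-1) / (k-1)!
        total = total + term
    return total


# Small complex matrix, chosen only for illustration.
X = np.array([[0.5, 1.0j], [-0.3, 0.2 + 0.1j]])
series_error = np.max(np.abs(expm_series(X) - expm(X)))
```

For matrices of moderate norm a few dozen terms are typically enough; for large norms the plain truncated series is numerically poor, so a library routine is the safer choice in practice.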

Proposition 1.

Let $\bm{X},\bm{Y}\in\mathbb{C}^{d\times d}$. If $\bm{X}\bm{Y}=\bm{Y}\bm{X}$, then $\exp(\bm{X})\exp(\bm{Y})=\exp(\bm{X}+\bm{Y})$.
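Proposition 1 can be illustrated numerically. The sketch below, with matrices chosen purely for illustration, compares a commuting pair (a matrix and a polynomial in it) with a non-commuting pair; all variable names are assumptions of this example.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.2, 1.0], [0.0, 0.5]])

# Commuting pair: any polynomial in A commutes with A.
X = A
Y = A @ A - 2.0 * A
commuting_gap = np.max(np.abs(expm(X) @ expm(Y) - expm(X + Y)))

# Non-commuting pair: the two nilpotent shift matrices.
P = np.array([[0.0, 1.0], [0.0, 0.0]])
Q = np.array([[0.0, 0.0], [1.0, 0.0]])
noncommuting_gap = np.max(np.abs(expm(P) @ expm(Q) - expm(P + Q)))
```

The first gap is at the level of rounding error, while the second is of order one, which is why the commutativity assumption cannot be dropped.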

Proposition 2.

The matrix exponential gives a surjective map

$$\exp: M_{d}(\mathbb{C})\to\operatorname{GL}(d,\mathbb{C}) \tag{2}$$

where $M_{d}(\mathbb{C})$ is the space of all $d\times d$ complex matrices and $\operatorname{GL}(d,\mathbb{C})$ is the general linear group of degree $d$, i.e., the group of all $d\times d$ invertible matrices.

In general, $\exp(\bm{X})\exp(\bm{Y})$ can be expressed by the Baker-Campbell-Hausdorff (BCH) formula, and when $\bm{X}$ and $\bm{Y}$ commute, the BCH formula simplifies to Proposition 1. Proposition 2 means every invertible matrix $\bm{X}$ can be written as the exponential of some other matrix $\bm{Z}$ (for this, it is essential to consider the field $\mathbb{C}$ and not $\mathbb{R}$).
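The role of $\mathbb{C}$ can be seen on a small example. The sketch below takes a real invertible matrix with determinant $-1$, which has no real logarithm because $\det\exp(\bm{Z})=\exp(\operatorname{tr}\bm{Z})>0$ for real $\bm{Z}$, and builds a complex logarithm from its eigendecomposition. This shortcut assumes the matrix is diagonalizable, and the matrix and variable names are choices made for this illustration.

```python
import numpy as np
from scipy.linalg import expm

# Real invertible matrix with determinant -1: it has no real logarithm.
M = np.array([[0.0, 1.0], [1.0, 0.0]])

# M is diagonalizable, so one logarithm is V diag(log(lambda_i)) V^{-1}.
eigvals, V = np.linalg.eig(M)
log_eigvals = np.log(eigvals.astype(complex))   # principal branch of log
Z = V @ np.diag(log_eigvals) @ np.linalg.inv(V)

recovery_error = np.max(np.abs(expm(Z) - M))    # exp(Z) should give back M
max_imag_part = np.max(np.abs(Z.imag))          # Z is genuinely complex
```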

$\bm{Z}$ can be calculated through the matrix logarithm. First we find the Jordan decomposition of $\bm{X}$ and calculate the logarithm of each Jordan block. For instance, we can write a Jordan block as

$$\begin{aligned}
\bm{B} &=\left[\begin{array}{cccccc}\lambda&1&0&0&\cdots&0\\ 0&\lambda&1&0&\cdots&0\\ 0&0&\lambda&1&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\ddots&\vdots\\ 0&0&0&0&\lambda&1\\ 0&0&0&0&0&\lambda\end{array}\right]\\
&=\lambda\left[\begin{array}{cccccc}1&\lambda^{-1}&0&0&\cdots&0\\ 0&1&\lambda^{-1}&0&\cdots&0\\ 0&0&1&\lambda^{-1}&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\ddots&\vdots\\ 0&0&0&0&1&\lambda^{-1}\\ 0&0&0&0&0&1\end{array}\right]\\
&=\lambda(\bm{I}+\bm{K})
\end{aligned} \tag{3}$$

where $\bm{K}$ is a matrix with zeros on and under the main diagonal. The number $\lambda$ is nonzero by the assumption that $\bm{X}$ is invertible. Then, by the Mercator series

$$\log(1+x)=x-\frac{x^{2}}{2}+\frac{x^{3}}{3}-\frac{x^{4}}{4}+\cdots \tag{4}$$

we have

$$\log\bm{B}=\log(\lambda(\bm{I}+\bm{K}))=(\log\lambda)\bm{I}+\bm{K}-\frac{\bm{K}^{2}}{2}+\frac{\bm{K}^{3}}{3}-\frac{\bm{K}^{4}}{4}+\cdots \tag{5}$$

This series has a finite number of terms since $\bm{K}^{m}=\bm{0}$ when $m$ is the dimension of $\bm{K}$. Thus the sum is well-defined. Assume that $\bm{J}$ is the Jordan normal form of $\bm{X}$ and $\bm{X}=\bm{P}\bm{J}\bm{P}^{-1}$. Following the method above, we can calculate $\log\bm{J}$ and obtain $\bm{Z}=\log\bm{X}=\bm{P}(\log\bm{J})\bm{P}^{-1}$.
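The finite series (5) for a single Jordan block can be written out directly. The following is a sketch under the notation of (3); the function names `jordan_block` and `log_jordan_block` and the sample values of $\lambda$ and $m$ are assumptions made for this illustration.

```python
import numpy as np
from scipy.linalg import expm


def jordan_block(lam, m):
    """m x m Jordan block with eigenvalue lam and ones on the superdiagonal."""
    return lam * np.eye(m, dtype=complex) + np.diag(np.ones(m - 1), k=1)


def log_jordan_block(lam, m):
    """Logarithm of a Jordan block via the finite Mercator series (5)."""
    K = np.diag(np.ones(m - 1), k=1) / lam      # B = lam * (I + K), K^m = 0
    result = np.log(complex(lam)) * np.eye(m, dtype=complex)
    K_power = np.eye(m, dtype=complex)
    for k in range(1, m):                       # terms with k >= m vanish
        K_power = K_power @ K
        result = result + (-1) ** (k + 1) * K_power / k
    return result


lam, m = -2.0 + 1.0j, 4
B = jordan_block(lam, m)
log_error = np.max(np.abs(expm(log_jordan_block(lam, m)) - B))
```

Exponentiating the result recovers $\bm{B}$ up to rounding error, since $(\log\lambda)\bm{I}$ commutes with every power of $\bm{K}$ and Proposition 1 applies.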

3 Main result

The basic task of machine learning is to find a function which maps the data to its label, i.e., for given $\{(\bm{x}_{i},\bm{y}_{i})\}_{i=1}^{n}$ where $\bm{x}_{i}\in\mathbb{R}^{d_{x}},\bm{y}_{i}\in\mathbb{R}^{d_{y}}$, solve the equations $f(\bm{x}_{i})=\bm{y}_{i}$, $i=1,\cdots,n$. Specifically, for neural networks, $f$ is composed of linear transformations and nonlinear activation functions, i.e., for an $m$-layer network,

$$\bm{f}(\cdot)=\bm{W}_{m}\sigma\left(\bm{W}_{m-1}\cdots\sigma\left(\bm{W}_{1}\cdot\right)\right) \tag{6}$$

where $\sigma$ is the nonlinear activation function and $\bm{W}_{1}\in\mathbb{R}^{d_{x}\times d_{1}}$, $\bm{W}_{k}\in\mathbb{R}^{d_{k-1}\times d_{k}}$, $k=2,\cdots,m-1$, $\bm{W}_{m}\in\mathbb{R}^{d_{m-1}\times d_{y}}$. Typically $\sigma$ is an element-wise function such as ReLU, sigmoid, or tanh. Generally, proving the existence of a solution of a nonlinear system is hard, especially when the element-wise function $\sigma$ does not interact well with the linear transformation matrix $\bm{W}$. For instance, let $\sigma(x)=x^{2}$; then $\sigma(\bm{A})=\bm{A}\circ\bm{A}$ for $\bm{A}\in\mathbb{R}^{d\times d^{\prime}}$, where $\circ$ is the Hadamard product. In general, $\bm{A}\circ\bm{A}$ cannot be expressed as a polynomial of $\bm{A}$, i.e., $\bm{A}\circ\bm{A}\neq\operatorname{poly}(\bm{A})$. This causes difficulties in finding the analytical solution of neural networks, since we cannot transform the output of each layer into a form that is easy to manipulate. To address this issue, we use the matrix exponential as the nonlinear activation function instead, which makes it possible to find the solution to the system when the number of layers is more than one.
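One quick way to see that $\bm{A}\circ\bm{A}$ is not a polynomial of $\bm{A}$ is that every polynomial of $\bm{A}$ commutes with $\bm{A}$, whereas the Hadamard square generally does not. The sketch below checks this on a $2\times 2$ example chosen for illustration; the variable names are assumptions of this example.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
hadamard_square = A * A        # element-wise sigma(x) = x^2, the Hadamard square
matrix_square = A @ A          # the matrix polynomial A^2

# Every polynomial in A commutes with A; the Hadamard square does not.
commutator = A @ hadamard_square - hadamard_square @ A
```

A nonzero `commutator` rules out any representation of the Hadamard square as a polynomial of this particular $\bm{A}$.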

To make the matrix exponential well-defined, we assume $\bm{X},\bm{Y},\bm{W}$ are square. To ensure that a solution exists, we assume the entries of $\bm{X},\bm{Y},\bm{W}$ are in $\mathbb{C}$. Consider $\bm{X},\bm{Y}\in\mathbb{C}^{d\times d}$ with $\bm{X}$ invertible; then $\bm{W}=\bm{Y}\bm{X}^{-1}$ solves the equation $\bm{Y}=\bm{W}\bm{X}$. There is no solution of $\bm{Y}_{1}=\bm{W}\bm{X}_{1},\bm{Y}_{2}=\bm{W}\bm{X}_{2}$ for $\bm{X}_{1},\bm{X}_{2},\bm{Y}_{1},\bm{Y}_{2}\in\mathbb{C}^{d\times d}$ except in degenerate cases, since the number of parameters $d^{2}$ is less than the number of equations $2d^{2}$. If we let the weight matrix be ‘wider’, i.e.,

$$\bm{W}=\left[\begin{array}{cc}\bm{W}_{1}&\bm{0}\\ \bm{0}&\bm{W}_{2}\end{array}\right] \tag{7}$$

then with the assumption that $\bm{X}_{1}$ and $\bm{X}_{2}$ are invertible, $\bm{W}_{1}=\bm{Y}_{1}\bm{X}_{1}^{-1}$ and $\bm{W}_{2}=\bm{Y}_{2}\bm{X}_{2}^{-1}$ can solve the equations

$$\left[\begin{array}{c}\bm{Y}_{1}\\ \bm{Y}_{2}\end{array}\right]=\left[\begin{array}{cc}\bm{W}_{1}&\bm{0}\\ \bm{0}&\bm{W}_{2}\end{array}\right]\cdot\left[\begin{array}{c}\bm{X}_{1}\\ \bm{X}_{2}\end{array}\right] \tag{8}$$

The above equation has a solution because we can separate it into two sub-problems and solve for $\bm{W}_{1}$ and $\bm{W}_{2}$ one at a time. However, this is not possible when we compose $\bm{W}_{1}$ and $\bm{W}_{2}$ (a two-layer network with identity activation function), that is, when solving the equations

$$\bm{Y}_{1}=\bm{W}_{2}\bm{W}_{1}\bm{X}_{1};\quad\bm{Y}_{2}=\bm{W}_{2}\bm{W}_{1}\bm{X}_{2} \tag{9}$$
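The one-layer, block-diagonal, and linear two-layer cases discussed so far can be checked numerically. The sketch below uses well-conditioned random matrices as an assumption made for illustration; `perturbed_identity` and the other names are hypothetical helpers of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3


def perturbed_identity(scale):
    """Well-conditioned test matrix: scale * I plus small Gaussian noise."""
    return scale * np.eye(d) + 0.2 * rng.standard_normal((d, d))


X1, X2 = perturbed_identity(1.0), perturbed_identity(2.0)
Y1, Y2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# One layer: W is already determined by the first pair alone.
W = Y1 @ np.linalg.inv(X1)
fit_first = np.max(np.abs(W @ X1 - Y1))
fit_second = np.max(np.abs(W @ X2 - Y2))

# "Wider" block-diagonal weight as in (7)-(8): each block solves its own pair.
zeros = np.zeros((d, d))
W_wide = np.block([[Y1 @ np.linalg.inv(X1), zeros],
                   [zeros, Y2 @ np.linalg.inv(X2)]])
wide_error = np.max(np.abs(W_wide @ np.vstack([X1, X2]) - np.vstack([Y1, Y2])))

# Linear two-layer case (9): the product W2 W1 is a single matrix, so both
# equations hold only if Y1 X1^{-1} equals Y2 X2^{-1}.
consistency_gap = np.max(np.abs(Y1 @ np.linalg.inv(X1) - Y2 @ np.linalg.inv(X2)))
```

For generic data `fit_second` and `consistency_gap` are far from zero, while `fit_first` and `wide_error` are at rounding level.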

When $\bm{W}_{1}$ is fixed, $\bm{W}_{2}$ with $d^{2}$ parameters is involved in $2d^{2}$ equations, i.e., $\bm{Y}_{1}=\bm{W}_{2}(\bm{W}_{1}\bm{X}_{1})$ and $\bm{Y}_{2}=\bm{W}_{2}(\bm{W}_{1}\bm{X}_{2})$, and the system has no solution in general. The situation changes again when a non-linear activation function is added, i.e., when solving the equations

$$\begin{aligned}
\bm{Y}_{1}&=\bm{W}_{2}\sigma(\bm{W}_{1}\bm{X}_{1})\\
\bm{Y}_{2}&=\bm{W}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})
\end{aligned} \tag{10}$$

From the second equation, we obtain $\bm{W}_{2}=\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}$. Substituting it into the first equation, we have

$$\bm{Y}_{1}=\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}\sigma(\bm{W}_{1}\bm{X}_{1}) \tag{11}$$

If this equation has a solution for $\bm{W}_{1}$, then the non-linear system (10) has a solution for $\bm{W}_{1}$ and $\bm{W}_{2}$. Following this intuition, we prove that a three-layer network with a matrix exponential activation function can solve the equations, exhibiting the power of depth and of non-linear activation.

Theorem 1.

Let $\bm{X}_{1},\bm{X}_{2}$ be the data matrices and $\bm{Y}_{1},\bm{Y}_{2}$ be the corresponding label matrices, where $\bm{X}_{1},\bm{X}_{2},\bm{Y}_{1},\bm{Y}_{2}\in\mathbb{C}^{d\times d}$ are invertible matrices. Assume that $\bm{X}_{1}-\bm{X}_{2}$ is invertible. Let $\bm{f}(\cdot)=\bm{W}_{3}\sigma(\bm{W}_{2}\sigma(\bm{W}_{1}\cdot))$ be a three-layer network where $\sigma(\cdot)$ is the matrix exponential, i.e., $\sigma(\cdot)=\exp(\cdot):\mathbb{C}^{d\times d}\to\mathbb{C}^{d\times d}$, and $\bm{W}_{1},\bm{W}_{2},\bm{W}_{3}\in\mathbb{C}^{d\times d}$. If

$$\begin{aligned}
\bm{W}_{1}&=\ln\alpha\cdot(\bm{X}_{1}-\bm{X}_{2})^{-1}\\
\bm{W}_{2}&=(\bm{Z}-\ln\alpha\cdot\bm{I})\cdot\exp(-\bm{W}_{1}\bm{X}_{2})\cdot\frac{1}{1-\alpha}\\
\bm{W}_{3}&=\bm{Y}_{1}\exp(-\bm{W}_{2}\exp(\bm{W}_{1}\bm{X}_{1}))
\end{aligned} \tag{12}$$

where $\alpha\in\mathbb{R}^{+},\alpha\neq 1$ and $\exp(\bm{Z})=\alpha\bm{Y}_{1}^{-1}\bm{Y}_{2}$, then $\bm{f}$ maps the data points to their labels, i.e., $\bm{f}(\bm{X}_{1})=\bm{Y}_{1},\bm{f}(\bm{X}_{2})=\bm{Y}_{2}$.
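The closed-form weights (12) can be checked numerically before reading the proof. The following is a minimal sketch, not the authors' code: it assumes $\alpha=2$, takes $\bm{Z}$ to be the principal logarithm returned by `scipy.linalg.logm`, and uses well-conditioned random complex matrices; the function names `three_layer_weights`, `network`, and `perturbed` are hypothetical.

```python
import numpy as np
from scipy.linalg import expm, logm


def three_layer_weights(X1, X2, Y1, Y2, alpha=2.0):
    """Closed-form weights (12) for f(X) = W3 exp(W2 exp(W1 X))."""
    d = X1.shape[0]
    identity = np.eye(d)
    log_alpha = np.log(alpha)
    W1 = log_alpha * np.linalg.inv(X1 - X2)
    # Z is any matrix with exp(Z) = alpha * Y1^{-1} Y2; here the principal log.
    Z = logm(alpha * np.linalg.solve(Y1, Y2))
    W2 = (Z - log_alpha * identity) @ expm(-W1 @ X2) / (1.0 - alpha)
    W3 = Y1 @ expm(-W2 @ expm(W1 @ X1))
    return W1, W2, W3


def network(X, W1, W2, W3):
    """Three-layer network with matrix exponential activation."""
    return W3 @ expm(W2 @ expm(W1 @ X))


rng = np.random.default_rng(0)
d = 3


def perturbed(center):
    """center * I plus small complex Gaussian noise (keeps matrices invertible)."""
    noise = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return center * np.eye(d) + 0.05 * noise


X1, X2 = perturbed(1.0), perturbed(2.0)
Y1, Y2 = perturbed(1.0), perturbed(1.0)

W1, W2, W3 = three_layer_weights(X1, X2, Y1, Y2)
err1 = np.max(np.abs(network(X1, W1, W2, W3) - Y1))
err2 = np.max(np.abs(network(X2, W1, W2, W3) - Y2))
```

The choice of $\bm{W}_{1}$ gives $\bm{W}_{1}\bm{X}_{1}-\bm{W}_{1}\bm{X}_{2}=\ln\alpha\cdot\bm{I}$, so $\exp(\bm{W}_{1}\bm{X}_{1})=\alpha\exp(\bm{W}_{1}\bm{X}_{2})$, which is the relation the remaining two weights exploit. For badly conditioned inputs the nested exponentials can lose accuracy, so the errors above should be read as a sanity check rather than a guarantee.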

Proof 1.

We assume

$$\bm{W}_{1}=\bm{W}_{1,1}\bm{W}_{1,2} \tag{13}$$

where $\bm{W}_{1,1},\bm{W}_{1,2}\in\mathbb{C}^{d\times d}$ and $\bm{W}_{1,2}$ is invertible. It is known that the exponential of a matrix is always invertible. Let

$$\begin{aligned}
\bm{M}_{1,X_{1}} &= \exp(\bm{W}_{1}\bm{X}_{1})\bm{X}_{1}^{-1}\bm{W}_{1,2}^{-1}\\
\bm{M}_{1,X_{2}} &= \exp(\bm{W}_{1}\bm{X}_{2})\bm{X}_{2}^{-1}\bm{W}_{1,2}^{-1}\\
\bm{M}_{2,X_{1}} &= \exp(\bm{W}_{2}\exp(\bm{W}_{1}\bm{X}_{1}))\exp(\bm{W}_{1}\bm{X}_{1})^{-1}\\
\bm{M}_{2,X_{2}} &= \exp(\bm{W}_{2}\exp(\bm{W}_{1}\bm{X}_{2}))\exp(\bm{W}_{1}\bm{X}_{2})^{-1}
\end{aligned} \tag{14}$$

Use the trick

$$\begin{bmatrix}\bm{A}&\bm{0}\\ \bm{0}&\bm{B}\end{bmatrix}=\begin{bmatrix}\bm{A}&\bm{0}\\ \bm{0}&\bm{A}\end{bmatrix}\cdot\begin{bmatrix}\bm{I}&\bm{0}\\ \bm{0}&\bm{A}^{-1}\bm{B}\end{bmatrix} \tag{15}$$

twice, then we have

$$\begin{aligned}
&\begin{bmatrix}\exp(\bm{W}_{2}\exp(\bm{W}_{1}\bm{X}_{1}))&0\\ 0&\exp(\bm{W}_{2}\exp(\bm{W}_{1}\bm{X}_{2}))\end{bmatrix}\\
={}&\begin{bmatrix}\bm{M}_{2,X_{1}}&0\\ 0&\bm{M}_{2,X_{2}}\end{bmatrix}\cdot\begin{bmatrix}\bm{M}_{1,X_{1}}&0\\ 0&\bm{M}_{1,X_{2}}\end{bmatrix}\cdot\begin{bmatrix}\bm{W}_{1,2}\bm{X}_{1}&0\\ 0&\bm{W}_{1,2}\bm{X}_{2}\end{bmatrix}\\
={}&\begin{bmatrix}\bm{M}_{2,X_{1}}&0\\ 0&\bm{M}_{2,X_{2}}\end{bmatrix}\cdot\begin{bmatrix}\bm{M}_{1,X_{2}}&0\\ 0&\bm{M}_{1,X_{2}}\end{bmatrix}\cdot\begin{bmatrix}\bm{M}_{1,X_{2}}^{-1}\bm{M}_{1,X_{1}}&0\\ 0&\bm{I}\end{bmatrix}\cdot\begin{bmatrix}\bm{W}_{1,2}\bm{X}_{1}&0\\ 0&\bm{W}_{1,2}\bm{X}_{2}\end{bmatrix}\\
={}&\begin{bmatrix}\bm{M}_{2,X_{1}}\bm{M}_{1,X_{2}}&0\\ 0&\bm{M}_{2,X_{2}}\bm{M}_{1,X_{2}}\end{bmatrix}\cdot\begin{bmatrix}\bm{M}_{1,X_{2}}^{-1}\bm{M}_{1,X_{1}}&0\\ 0&\bm{I}\end{bmatrix}\cdot\begin{bmatrix}\bm{W}_{1,2}\bm{X}_{1}&0\\ 0&\bm{W}_{1,2}\bm{X}_{2}\end{bmatrix}\\
={}&\begin{bmatrix}\bm{M}_{2,X_{1}}\bm{M}_{1,X_{2}}&0\\ 0&\bm{M}_{2,X_{1}}\bm{M}_{1,X_{2}}\end{bmatrix}\cdot\begin{bmatrix}\bm{I}&0\\ 0&\bm{M}_{1,X_{2}}^{-1}\bm{M}_{2,X_{1}}^{-1}\bm{M}_{2,X_{2}}\bm{M}_{1,X_{2}}\end{bmatrix}\\
&\cdot\begin{bmatrix}\bm{M}_{1,X_{2}}^{-1}\bm{M}_{1,X_{1}}&0\\ 0&\bm{I}\end{bmatrix}\cdot\begin{bmatrix}\bm{W}_{1,2}\bm{X}_{1}&0\\ 0&\bm{W}_{1,2}\bm{X}_{2}\end{bmatrix}
\end{aligned} \tag{16}$$
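The block-diagonal factorization (15) that drives the chain of equalities in (16) is easy to confirm numerically. The sketch below is only an illustration: the helper `block_diag2` and the two $2\times 2$ matrices are assumptions made here, and $\bm{A}$ must be invertible.

```python
import numpy as np

def block_diag2(P, Q):
    """Build the block-diagonal matrix [[P, 0], [0, Q]] for equally sized square blocks."""
    zero = np.zeros_like(P)
    return np.block([[P, zero], [zero, Q]])

# Illustrative blocks (an assumption); A is invertible (det = 1).
A = np.array([[2.0, 1.0], [1.0, 1.0]])
B = np.array([[1.0, 0.0], [1.0, 3.0]])
I = np.eye(2)

lhs = block_diag2(A, B)
rhs = block_diag2(A, A) @ block_diag2(I, np.linalg.inv(A) @ B)
print(np.allclose(lhs, rhs))
```

The point of the factorization is that the first factor has identical diagonal blocks, so a single matrix (here $\bm{W}_{3}$) can cancel it on both data points at once.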

Let

$$\bm{W}_{3}=\bm{M}_{1,X_{2}}^{-1}\bm{M}_{2,X_{1}}^{-1}, \tag{17}$$

to eliminate the first matrix on the right-hand side of the last equality in (16); then we have

$$\begin{aligned}
&\begin{bmatrix}\bm{f}(\bm{X}_{1})&0\\ 0&\bm{f}(\bm{X}_{2})\end{bmatrix}\\
={}&\begin{bmatrix}\bm{W}_{3}\exp(\bm{W}_{2}\exp(\bm{W}_{1}\bm{X}_{1}))&0\\ 0&\bm{W}_{3}\exp(\bm{W}_{2}\exp(\bm{W}_{1}\bm{X}_{2}))\end{bmatrix}\\
={}&\begin{bmatrix}\bm{I}&0\\ 0&\bm{M}_{1,X_{2}}^{-1}\bm{M}_{2,X_{1}}^{-1}\bm{M}_{2,X_{2}}\bm{M}_{1,X_{2}}\end{bmatrix}\cdot\begin{bmatrix}\bm{M}_{1,X_{2}}^{-1}\bm{M}_{1,X_{1}}&0\\ 0&\bm{I}\end{bmatrix}\cdot\begin{bmatrix}\bm{W}_{1,2}\bm{X}_{1}&0\\ 0&\bm{W}_{1,2}\bm{X}_{2}\end{bmatrix}
\end{aligned} \tag{18}$$

Let $\bm{\tilde{X}}_{1}=\bm{W}_{1,2}\bm{X}_{1}$ and $\bm{\tilde{X}}_{2}=\bm{W}_{1,2}\bm{X}_{2}$. Solving

$$\begin{bmatrix}\bm{f}(\bm{X}_{1})&0\\ 0&\bm{f}(\bm{X}_{2})\end{bmatrix}=\begin{bmatrix}\bm{Y}_{1}&0\\ 0&\bm{Y}_{2}\end{bmatrix}, \tag{19}$$

is equivalent to solving

$$\left\{\begin{array}{l}\bm{M}_{1,X_{2}}^{-1}\bm{M}_{1,X_{1}}\bm{\tilde{X}}_{1}=\bm{Y}_{1}\\[6pt] \bm{M}_{1,X_{2}}^{-1}\bm{M}_{2,X_{1}}^{-1}\bm{M}_{2,X_{2}}\bm{M}_{1,X_{2}}\bm{\tilde{X}}_{2}=\bm{Y}_{2}\end{array}\right. \tag{20}$$

By the definition of $\bm{M}_{1,X_{1}},\bm{M}_{1,X_{2}},\bm{M}_{2,X_{1}},\bm{M}_{2,X_{2}}$, we can rewrite the equalities in (20) as:

$$\left\{\begin{array}{l}\bm{\tilde{X}}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})^{-1}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1})=\bm{Y}_{1}\\[6pt] \bm{\tilde{X}}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})^{-1}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1})\exp(\bm{W}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1}))^{-1}\exp(\bm{W}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2}))=\bm{Y}_{2}\end{array}\right. \tag{21}$$
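The substitution that turns (20) into (21) is not written out in the text. As a worked step, using only (13), (14) and $\bm{\tilde{X}}_{i}=\bm{W}_{1,2}\bm{X}_{i}$ (so that $\bm{W}_{1}\bm{X}_{i}=\bm{W}_{1,1}\bm{\tilde{X}}_{i}$), the individual factors can be sketched as:

```latex
\begin{aligned}
\bm{M}_{1,X_{2}}^{-1}
  &= \bm{W}_{1,2}\bm{X}_{2}\exp(\bm{W}_{1}\bm{X}_{2})^{-1}
   = \bm{\tilde{X}}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})^{-1},\\
\bm{M}_{1,X_{1}}\bm{\tilde{X}}_{1}
  &= \exp(\bm{W}_{1}\bm{X}_{1})\bm{X}_{1}^{-1}\bm{W}_{1,2}^{-1}\bm{W}_{1,2}\bm{X}_{1}
   = \exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1}),\\
\bm{M}_{2,X_{1}}^{-1}
  &= \exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1})\exp(\bm{W}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1}))^{-1},\\
\bm{M}_{2,X_{2}}\bm{M}_{1,X_{2}}\bm{\tilde{X}}_{2}
  &= \exp(\bm{W}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})).
\end{aligned}
```

Multiplying the first two lines gives the first equality in (21); multiplying the first, third and fourth lines gives the second.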

To solve the first equality in (21), let

$$\bm{W}_{1,2}=\frac{1}{\alpha}\bm{Y}_{1}\bm{X}_{2}^{-1} \tag{22}$$

where $\alpha\in\mathbb{R}^{+}$, $\alpha\neq 1$, then

$$\bm{\tilde{X}}_{2}^{-1}\bm{Y}_{1}=\alpha\bm{I}=\exp(\ln\alpha\cdot\bm{I}) \tag{23}$$

Then the first equality in (21) can be rewritten as

$$\begin{aligned}
\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1}) &= \exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})\exp(\ln\alpha\cdot\bm{I})\\
&= \exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2}+\ln\alpha\cdot\bm{I})
\end{aligned} \tag{24}$$

The second equality holds by Proposition 1, because $\bm{W}_{1,1}\bm{\tilde{X}}_{2}$ commutes with $\ln\alpha\cdot\bm{I}$. Thus it is sufficient to solve the equality

$$\bm{W}_{1,1}\bm{\tilde{X}}_{1}=\bm{W}_{1,1}\bm{\tilde{X}}_{2}+\ln\alpha\cdot\bm{I} \tag{25}$$

since $\bm{X}_{1}-\bm{X}_{2}$ is invertible as assumed, then

$$\bm{W}_{1,1}=\ln\alpha\cdot(\bm{\tilde{X}}_{1}-\bm{\tilde{X}}_{2})^{-1},\quad\bm{W}_{1}=\bm{W}_{1,1}\bm{W}_{1,2}=\ln\alpha\cdot(\bm{X}_{1}-\bm{X}_{2})^{-1} \tag{26}$$
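The first-layer construction in (22) to (26) can be sanity-checked numerically. This is a minimal sketch under assumptions: the $2\times 2$ matrices and $\alpha=2$ are illustrative choices made here, and `expm_eig` is a hypothetical helper that is valid only for diagonalizable inputs.

```python
import numpy as np

def expm_eig(A):
    """Matrix exponential via eigendecomposition (assumes A is diagonalizable)."""
    w, V = np.linalg.eig(A)
    return V @ np.diag(np.exp(w)) @ np.linalg.inv(V)

alpha = 2.0  # any positive value other than 1
X1 = np.array([[2.0, 0.5], [0.0, 3.0]])  # illustrative data (an assumption)
X2 = np.eye(2)
Y1 = np.array([[2.0, 1.0], [1.0, 1.0]])

# Equation (22) and the transformed inputs X~_i = W_{1,2} X_i.
W12 = Y1 @ np.linalg.inv(X2) / alpha
Xt1, Xt2 = W12 @ X1, W12 @ X2

# Equation (26): W_{1,1}, and the product W_1 = W_{1,1} W_{1,2}.
W11 = np.log(alpha) * np.linalg.inv(Xt1 - Xt2)
W1 = W11 @ W12
W1_closed = np.log(alpha) * np.linalg.inv(X1 - X2)

gap_W1 = np.linalg.norm(W1 - W1_closed)
# Equation (24) in terms of W_1: exp(W_1 X_1) = alpha * exp(W_1 X_2).
gap_exp = np.linalg.norm(expm_eig(W1 @ X1) - alpha * expm_eig(W1 @ X2))
print(gap_W1, gap_exp)  # both should be at rounding-error level
```

Note that $\bm{Y}_{1}$ enters $\bm{W}_{1,2}$ and $\bm{W}_{1,1}$ but cancels in their product, which is why $\bm{W}_{1}$ depends on the inputs only.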

Substituting the first equality in (21) into the second, the second equality in (21) can be rewritten as

$$\begin{aligned}
\exp(\bm{W}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1}))^{-1}\exp(\bm{W}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})) &= \bm{Y}_{1}^{-1}\bm{Y}_{2}\\
&= \frac{1}{\alpha}\exp(\bm{Z})
\end{aligned} \tag{27}$$

The second equality follows from the definition of $\bm{Z}$. Such a $\bm{Z}$ exists by Proposition 2. If $\bm{W}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1})$ commutes with $\bm{Z}$, then we only need to solve

$$\bm{W}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})=\bm{W}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1})+(\bm{Z}-\ln\alpha\cdot\bm{I}) \tag{28}$$

Note that according to (24)

$$\begin{aligned}
\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})-\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1}) &= \exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})(\bm{I}-\alpha\bm{I})\\
&= (1-\alpha)\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})
\end{aligned} \tag{29}$$

then $\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})-\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1})$ is invertible, since $\alpha\neq 1$ and a matrix exponential is always invertible. Thus the solution to (28) is

$$\begin{aligned}
\bm{W}_{2} &= (\bm{Z}-\ln\alpha\cdot\bm{I})\left(\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})-\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1})\right)^{-1} \\
&= \frac{1}{1-\alpha}(\bm{Z}-\ln\alpha\cdot\bm{I})\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})^{-1}
\end{aligned} \tag{30}$$

Finally, we need to verify that $\bm{W}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1})$ commutes with $\bm{Z}$. This follows directly from (24), since

$$\begin{aligned}
\bm{W}_{2}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1}) &= \frac{1}{1-\alpha}(\bm{Z}-\ln\alpha\cdot\bm{I})\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})^{-1}\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1}) \\
&= \frac{\alpha}{1-\alpha}(\bm{Z}-\ln\alpha\cdot\bm{I})
\end{aligned} \tag{31}$$

Once $\bm{W}_{1}$ and $\bm{W}_{2}$ are fixed as in (26) and (30), $\bm{W}_{3}$ is determined as well:

$$\begin{aligned}
\bm{W}_{3} &= \bm{M}_{1,X_{2}}^{-1}\bm{M}_{2,X_{1}}^{-1} \\
&= \bm{Y}_{1}\exp(-\bm{W}_{2}\exp(\bm{W}_{1}\bm{X}_{1}))
\end{aligned} \tag{32}$$

which concludes the proof.

Note that $\bm{Z}$ can be calculated using the method in Section 2, so the solution given in Theorem 1 can be computed without gradient descent. The only assumption on the data is that $\bm{X}_{1}-\bm{X}_{2}$ is invertible, which is a much more general setting than results that hold only for a certain class of functions.
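The algebra in (28)–(31) can be checked numerically. The following is a minimal sketch, not the construction of Theorem 1 itself. It assumes that (24) amounts to $\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{1})=\alpha\exp(\bm{W}_{1,1}\bm{\tilde{X}}_{2})$, as (29) and (31) suggest. The names `E1`, `E2`, the random stand-in for $\bm{Z}$, the value of `alpha`, and the small `expm` helper are all illustrative assumptions.

```python
import numpy as np

def expm(a, terms=20):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    norm = np.linalg.norm(a, 1)
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    a = a / (2 ** s)
    result = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, terms + 1):
        term = term @ a / k
        result = result + term
    for _ in range(s):
        result = result @ result
    return result

rng = np.random.default_rng(0)
d, alpha = 3, 0.5                              # hypothetical; needs alpha > 0, alpha != 1
I = np.eye(d)
E2 = expm(0.5 * rng.standard_normal((d, d)))   # stand-in for exp(W11 X~2), invertible
E1 = alpha * E2                                # assumed form of relation (24)
Z = 0.5 * rng.standard_normal((d, d))          # arbitrary stand-in for Z

# Closed form (30) for W2.
W2 = (Z - np.log(alpha) * I) @ np.linalg.inv(E2) / (1 - alpha)

# Equation (28): W2 E2 = W2 E1 + (Z - ln(alpha) I).
lhs = W2 @ E2
eq28_holds = np.allclose(lhs, W2 @ E1 + (Z - np.log(alpha) * I))

# Equation (31): W2 E1 is a polynomial in Z, hence commutes with Z.
A = W2 @ E1
commutes = np.allclose(A @ Z, Z @ A)

# Because A and Z commute, exp(A + Z - ln(alpha) I) = exp(A) exp(Z) / alpha.
exp_identity_holds = np.allclose(expm(lhs), expm(A) @ expm(Z) / alpha)
```

The last check shows why the commutation condition matters: it is what lets the exponential of the sum in (28) split into a product of exponentials.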

4 Experimental Results

Since we have already found the analytical solution of a three-layer network with a matrix exponential activation function, numerical experiments are not necessary for that case. In this section, we instead run experiments on element-wise activation functions such as ReLU and sigmoid, using a similar method. As discussed in Section 2, we consider the analogous equations for a two-layer network with an element-wise activation $\sigma$, i.e.,

$$\left\{\begin{array}{c}\bm{Y}_{1}=\bm{W}_{2}\sigma(\bm{W}_{1}\bm{X}_{1})\\ \bm{Y}_{2}=\bm{W}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})\end{array}\right. \tag{33}$$

which is equivalent to solving for $\bm{W}_{1}$ and $\bm{W}_{2}$ sequentially through

$$\left\{\begin{array}{c}\bm{Y}_{1}=\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}\sigma(\bm{W}_{1}\bm{X}_{1})\\ \bm{W}_{2}=\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}\end{array}\right. \tag{34}$$
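The step from (33) to (34) is a substitution. It is sketched here under the assumption, implicit in (34), that $\sigma(\bm{W}_{1}\bm{X}_{2})$ is invertible:

```latex
\begin{aligned}
\bm{Y}_{2} = \bm{W}_{2}\,\sigma(\bm{W}_{1}\bm{X}_{2})
  &\;\Longrightarrow\; \bm{W}_{2} = \bm{Y}_{2}\,\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}, \\
\bm{Y}_{1} = \bm{W}_{2}\,\sigma(\bm{W}_{1}\bm{X}_{1})
  &\;\Longrightarrow\; \bm{Y}_{1} = \bm{Y}_{2}\,\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}\sigma(\bm{W}_{1}\bm{X}_{1}).
\end{aligned}
```

The resulting equation for $\bm{Y}_{1}$ involves only $\bm{W}_{1}$, so $\bm{W}_{2}$ follows in closed form once $\bm{W}_{1}$ has been found.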

In our experiments, we minimize $\|\bm{Y}_{1}-\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}\sigma(\bm{W}_{1}\bm{X}_{1})\|_{F}^{2}$ over $\bm{W}_{1}$ with gradient descent. Each entry of $\bm{X}_{1}$, $\bm{X}_{2}$, $\bm{Y}_{1}$ and $\bm{Y}_{2}$ is sampled from the Gaussian distribution $\mathcal{N}(0,1)$.
For comparison, we compute the same quantity when $\sigma$ is the identity function, i.e., $\|\bm{Y}_{1}-\bm{Y}_{2}(\bm{W}_{1}\bm{X}_{2})^{-1}\bm{W}_{1}\bm{X}_{1}\|_{F}^{2}=\|\bm{Y}_{1}-\bm{Y}_{2}\bm{X}_{2}^{-1}\bm{X}_{1}\|_{F}^{2}$, which does not depend on $\bm{W}_{1}$. We can then construct a score that measures the benefit of using the sigmoid or ReLU function during training:

$$s=\frac{\|\bm{Y}_{1}-\bm{Y}_{2}\sigma(\bm{W}_{1}\bm{X}_{2})^{-1}\sigma(\bm{W}_{1}\bm{X}_{1})\|_{F}^{2}}{\|\bm{Y}_{1}-\bm{Y}_{2}\bm{X}_{2}^{-1}\bm{X}_{1}\|_{F}^{2}} \tag{35}$$
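As an illustration, the score in (35) can be evaluated for a single, non-optimized $\bm{W}_{1}$. This is a minimal NumPy sketch and not the authors' experimental code. The dimension `d`, the seed, the random `W1`, and the helper names `sigmoid` and `residual` are assumptions made for the example, and the gradient-descent loop over $\bm{W}_{1}$ is omitted.

```python
import numpy as np

def sigmoid(t):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def residual(h1, h2, y1, y2):
    """Return ||Y1 - Y2 H2^{-1} H1||_F^2 and W2 = Y2 H2^{-1}, as in (34)."""
    w2 = y2 @ np.linalg.inv(h2)
    return np.linalg.norm(y1 - w2 @ h1, "fro") ** 2, w2

rng = np.random.default_rng(0)
d = 4                                   # hypothetical dimension
# Entries sampled from N(0, 1), as in the text; W1 is a random starting point.
X1, X2, Y1, Y2, W1 = (rng.standard_normal((d, d)) for _ in range(5))

# Numerator of (35): sigmoid activation at the current W1.
num, W2 = residual(sigmoid(W1 @ X1), sigmoid(W1 @ X2), Y1, Y2)

# Denominator of (35): identity activation, where W1 cancels.
den, _ = residual(X1, X2, Y1, Y2)
den_with_w1, _ = residual(W1 @ X1, W1 @ X2, Y1, Y2)

s = num / den
```

By construction `W2` satisfies the second equation of (33) exactly, so `s` measures only how far the first equation is from holding. The agreement of `den` and `den_with_w1` reflects the cancellation of $\bm{W}_{1}$ in the identity case.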

In the experiment (Fig. 1), we find that with both the ReLU and the sigmoid activation function, gradient descent finds a $\bm{W}_{1}$ with $s$ close to 0. This indicates that a two-layer network with a ReLU or sigmoid activation function has a clear benefit over the identity function and has the potential to solve twice the number of equations. The $s$ score also decreases as the dimension increases, which suggests that the optimization problem becomes easier in high-dimensional spaces. However, it is hard to prove the existence of a solution of equation (34), or the existence of a gradient descent path from the initial weights to globally optimal weights.

Figure 1: The $s$ score of a two-layer network with sigmoid (left) and ReLU (right) activation functions during training.

5 Conclusion

In this paper, we design a problem for a three-layer network with the matrix exponential as an activation function and find its analytical solution. In doing so, we show the power of depth by comparing our three-layer networks to single-layer ones. Our result has merits compared with existing studies, both those that construct special functions to show the power of depth and those that analyze the width of networks through optimization methods. We also shed light on two-layer networks with element-wise activation functions through experiments, which indicate that neural networks have the potential to solve as many equations as they have parameters. As an activation function, the matrix exponential may provide less non-linearity than element-wise activation functions do, but it may be amenable to analysis based on results in Lie theory. In the future, we will try to extend our method to multi-layer cases.

Acknowledgments

This work has been supported by the CAS Project for Young Scientists in Basic Research [No. YSBR-034].

References

  • Barron (1994) Andrew R Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115–133, 1994.
  • Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
  • Eldan & Shamir (2016) Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pp.  907–940. PMLR, 2016.
  • Fischbacher et al. (2020) Thomas Fischbacher, Iulia M Comsa, Krzysztof Potempa, Moritz Firsching, Luca Versari, and Jyrki Alakuijala. Intelligent matrix exponentiation. arXiv preprint arXiv:2008.03936, 2020.
  • Funahashi (1989) Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.
  • Goodfellow et al. (2014) Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  770–778, 2016.
  • Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
  • Huang (2003) Guang-Bin Huang. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2):274–281, 2003.
  • Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
  • Li et al. (2018) Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems, 31, 2018.
  • Li et al. (2017) Peihua Li, Jiangtao Xie, Qilong Wang, and Wangmeng Zuo. Is second-order information helpful for large-scale visual recognition? In Proceedings of the IEEE International Conference on Computer Vision, pp.  2070–2078, 2017.
  • Pinkus (1999) Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta Numerica, 8:143–195, 1999.
  • Rossman et al. (2015) Benjamin Rossman, Rocco A Servedio, and Li-Yang Tan. An average-case depth hierarchy theorem for boolean circuits. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp.  1030–1048. IEEE, 2015.
  • Telgarsky (2016) Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pp.  1517–1539. PMLR, 2016.
  • Vershynin (2020) Roman Vershynin. Memory capacity of neural networks with threshold and rectified linear unit activations. SIAM Journal on Mathematics of Data Science, 2(4):1004–1033, 2020.
  • Yamasaki (1993) Masami Yamasaki. The lower bound of the capacity for a neural network with multiple hidden layers. In International Conference on Artificial Neural Networks, pp.  546–549. Springer, 1993.
  • Yun et al. (2019) Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small relu networks are powerful memorizers: a tight analysis of memorization capacity. Advances in Neural Information Processing Systems, 32, 2019.
  • Zhang et al. (2021) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.