A.1 Proof of Theorem 1
Recall that
|
|
|
Put
|
|
|
Let and , we can write
|
|
|
|
|
(A.35) |
|
|
|
|
|
Define
|
|
|
Recall that for , it holds that and
|
|
|
By the -smoothness of the individual function in (A3) and the independence of and , we have
|
|
|
Then from the iteration (A.35), we have
|
|
|
|
|
(A.36) |
|
|
|
|
|
In the remainder of the proof, we use induction to bound . Let and be constants to be specified later. We will prove that, for any , the inequalities
|
|
|
(A.37) |
with fixed , imply that
|
|
|
(A.38) |
By Lemma 20 below, we have for all .
Theorem 3 gives . Applying Lemma 27, we can bound the summation in (A.36).
Consequently, from (A.36) and (A.37), we deduce that
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Let the learning rate and the batch size satisfy
|
|
|
Then we can take
|
|
|
It is straightforward to verify that (A.38) is satisfied, and the induction is completed.
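The passage from a one-step recursion to a summed bound, as in the step from (A.35) to (A.36), follows a generic unrolling pattern that can be checked numerically. The following is a minimal sketch on a scalar stand-in, assuming a linear recursion `e_{t+1} = rho * e_t + b_t` with contraction factor `rho < 1`; the names `rho`, `e0`, `noise`, and all three functions are hypothetical and are not taken from the source, whose iteration is vector-valued.

```python
def iterate(rho, e0, noise):
    """Run the recursion e_{t+1} = rho * e_t + b_t step by step."""
    e = e0
    for b in noise:
        e = rho * e + b
    return e


def unrolled(rho, e0, noise):
    """Closed form: rho^t * e0 + sum_k rho^(t-1-k) * b_k."""
    t = len(noise)
    return rho ** t * e0 + sum(rho ** (t - 1 - k) * b for k, b in enumerate(noise))


def geometric_bound(rho, e0, noise_bound, t):
    """Upper bound on |e_t| when |b_k| <= noise_bound and 0 <= rho < 1."""
    return rho ** t * abs(e0) + noise_bound * (1.0 - rho ** t) / (1.0 - rho)


if __name__ == "__main__":
    noise = [0.3, -0.1, 0.2, 0.05]
    print(iterate(0.9, 1.0, noise), unrolled(0.9, 1.0, noise))
    print(geometric_bound(0.9, 1.0, 0.3, len(noise)))
```

The geometric factor `1 / (1 - rho)` is what keeps the summed perturbation term bounded uniformly in the number of steps; in the matrix setting a bound on powers of the iteration matrix plays the role of `rho ** t`.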
A.2 Proof of Theorem 3
Put
|
|
|
where , , is an orthogonal matrix. Now we can define
|
|
|
for , and the diagonal matrix
|
|
|
(A.42) |
Notice that if , is complex and the definition is
|
|
|
Before proving Theorem 3, we first diagonalize , with denoting the resulting diagonal matrix. We then bound the spectral radius of .
Lemma 20
For satisfying and for all , we have
|
|
|
for some invertible matrix that satisfies
|
|
|
where . Therefore, for any , where
|
|
|
and is the spectral radius of .
Proof.
The eigenvalues satisfy
|
|
|
|
|
|
Since and , it follows that for all .
From the product of the eigenvalues , a direct calculation shows that , which holds even when the eigenvalues are complex. Moreover, since for any , it follows that , so the matrix is diagonalizable.
Let be the unit vector with 1 in the -th coordinate and zeros elsewhere. By the definition of , we have
|
|
|
which is equivalent to
|
|
|
(A.44) |
Now define by
|
|
|
which can be written as .
Combining this equation with (A.44), we have
|
|
|
which yields that
|
|
|
Thus
|
|
|
(A.53) |
Define
|
|
|
and
|
|
|
|
|
|
|
|
|
|
Note that and are diagonal matrices. Then by (A.53),
|
|
|
where is defined by (A.42).
By the definition of , we have
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
For , we have
|
|
|
and
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
So it holds that
|
|
|
|
|
|
|
|
|
|
Then for and , it holds that
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
where .
Put
|
|
|
Then we have
|
|
|
After diagonalizing , we can obtain an upper bound for the spectral radius by evaluating the maximal absolute eigenvalue in the diagonal matrix , as specified in definition (A.42).
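The role of the eigenvalue product in the proof of Lemma 20 can be checked numerically. The following is a minimal sketch, assuming the standard heavy-ball form of momentum on a quadratic, in which each Hessian eigenvalue `lam` contributes a 2x2 block `[[1 + beta - alpha*lam, -beta], [1, 0]]`. This parameterization and the names `alpha`, `beta`, `lam`, `block_eigenvalues`, and `block_spectral_radius` are assumptions made for illustration; the matrix defined in the source may differ by a change of variables.

```python
import cmath


def block_eigenvalues(alpha, beta, lam):
    """Eigenvalues of the assumed block, i.e. the roots of
    z^2 - (1 + beta - alpha*lam) * z + beta = 0."""
    trace = 1.0 + beta - alpha * lam
    disc = cmath.sqrt(trace * trace - 4.0 * beta)  # complex-safe square root
    return (trace + disc) / 2.0, (trace - disc) / 2.0


def block_spectral_radius(alpha, beta, lam):
    """Largest modulus among the two eigenvalues of the block."""
    return max(abs(z) for z in block_eigenvalues(alpha, beta, lam))


if __name__ == "__main__":
    alpha, beta = 0.1, 0.81
    for lam in (0.05, 1.0, 5.0):
        z1, z2 = block_eigenvalues(alpha, beta, lam)
        # The product of the two roots equals the determinant, beta.
        print(lam, z1, z2, abs(z1 * z2))
```

Under this assumed form the product of the two eigenvalues is always `beta`, so a complex-conjugate pair has common modulus `sqrt(beta)`, while a real pair has one root above and one below `sqrt(beta)`.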
Proof. [Theorem 3] From Lemma 20, the eigenvalues satisfy
|
|
|
For real eigenvalues, there exists such that . By the fact that , we have
|
|
|
|
|
(A.59) |
|
|
|
|
|
|
|
|
|
|
where
|
|
|
Under this condition, it holds that . Define ; then
|
|
|
By the fact that and , where , we have that is concave and
|
|
|
For complex eigenvalues, which means that for all . Then and
|
|
|
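The split into real and complex eigenvalues can be illustrated over a whole spectrum at once. The sketch below again assumes the standard heavy-ball block `[[1 + beta - alpha*lam, -beta], [1, 0]]` for each curvature value `lam`; the function names and the sample spectrum `lams` are hypothetical and chosen only for illustration. Under this assumption the eigenvalues of a block are complex exactly when `(1 + beta - alpha*lam)**2 < 4*beta`, and if that holds for every `lam` the spectral radius equals `sqrt(beta)`.

```python
import cmath
import math


def spectral_radius(alpha, beta, lams):
    """Largest eigenvalue modulus over all assumed 2x2 heavy-ball blocks."""
    radius = 0.0
    for lam in lams:
        trace = 1.0 + beta - alpha * lam
        disc = cmath.sqrt(trace * trace - 4.0 * beta)
        radius = max(radius, abs((trace + disc) / 2.0), abs((trace - disc) / 2.0))
    return radius


def all_complex(alpha, beta, lams):
    """True when every block has a negative discriminant."""
    return all((1.0 + beta - alpha * lam) ** 2 < 4.0 * beta for lam in lams)


if __name__ == "__main__":
    alpha, lams = 0.04, [1.0, 4.0, 25.0]
    for beta in (0.0, 0.5, 0.7):
        print(beta, all_complex(alpha, beta, lams),
              spectral_radius(alpha, beta, lams), math.sqrt(beta))
```

In this example `beta = 0.7` puts every block in the complex regime, so the radius is `sqrt(0.7)`; with `beta = 0.5` the smallest curvature yields real eigenvalues and the radius exceeds `sqrt(0.5)`.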
A.3 Proof of Theorem 9
Recall from the iteration (A.35) that
|
|
|
(A.66) |
Define the events
|
|
|
For certain constants , which will be defined subsequently, the probability of event is .
In the remainder of the proof, we show by induction that, for any , on the events , , the event occurs with probability at least .
For , it holds that
|
|
|
|
|
(A.68) |
|
|
|
|
|
where .
By applying Lemmas 21, 22 and 23, we establish bounds for each of the three components in (A.68).
Then from the iteration (A.35), with probability at least , we have on the events
|
|
|
|
|
|
|
|
|
|
|
|
where is an absolute constant. To ensure that event occurs, the learning rate and the batch size must satisfy
|
|
|
where
|
|
|
Furthermore, we require that the initialization satisfy
|
|
|
where
|
|
|
It is straightforward to verify that
|
|
|
and the induction is completed. Consequently, we deduce that the probability of the intersection of events from through is at least .
A.4 Supporting Lemmas for Bounds in Theorem 9 Proof
Lemma 21
Under (A1)-(A2) and (A3’), we have
|
|
|
for , where is an absolute constant.
Proof.
Define
|
|
|
By the definition of a sub-exponential random vector, is a -sub-exponential random vector for some absolute constant , and are independent.
Then by Lemmas 27 and 20, we have
|
|
|
Applying Lemma 28 to , with probability , we obtain
|
|
|
Lemma 22
Under (A1)-(A2) and (A3’), suppose that the -th iteration satisfies
|
|
|
for , we have
|
|
|
where
|
|
|
and is an absolute constant.
Proof.
Define
|
|
|
By the SGDM iteration, and are independent, and are martingale differences. By the -smoothness of the individual gradient , it holds that
|
|
|
where
|
|
|
and is an absolute constant.
Then according to Lemmas 27 and 20, we have
|
|
|
Applying Lemma 28 to , with probability , we obtain
|
|
|
Lemma 23
Under (A1)-(A2) and (A3’), suppose that the -th iteration satisfies
|
|
|
for , for , we have
|
|
|
Proof.
By the fact that and Lemma 27, we have
|
|
|
|
|
|
|
|
|
|
|
|