Appendix

Theoretical Analysis
We first list the requirement for AGLD and the assumptions on $f$.

Requirement 1. For the gradient snapshot $A^{(k)}$, we have $\alpha_i^{(k)} \in \{\nabla f_i(x^{(j)})\}_{j=k-D+1}^{k}$, where $D$ is a fixed constant.

Assumption 1 (Smoothness). Each individual $f_i$ is $\tilde{M}$-smooth. That is, $f_i$ is twice differentiable and there exists a constant $\tilde{M} > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{\tilde{M}}{2}\|x - y\|_2^2. \qquad (1)$$
Accordingly, we can verify that the sum $f$ of the $f_i$'s is $M = \tilde{M}N$-smooth.

Assumption 2 (Strong Convexity). The sum $f$ is $\mu$-strongly convex. That is, there exists a constant $\mu > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|x - y\|_2^2. \qquad (2)$$

Lemma 1. Suppose one uniformly samples $n$ data points from a dataset of size $N$ in each trial. Let $T$ denote the number of trials needed to collect all the data points. Then we have
$$P\Big(T > \beta\frac{N \ln N}{n}\Big) < N^{1-\beta}. \qquad (3)$$

Proof. Let $Y_i^{(k)}$ denote the event that the $i$-th sample was not selected in the first $k$ trials. Then we have
$$P\big(Y_i^{(k)}\big) = \Big(1 - \frac{n}{N}\Big)^{k} \le e^{-nk/N}.$$
Thus we have $P[Y_i^{(k)}] \le N^{-\beta}$ for $k = \beta N \ln N / n$. Hence,
$$P\Big(T > \beta\frac{N \ln N}{n}\Big) = P\Big(\bigcup_{i=1}^{N} Y_i^{(\beta N \ln N / n)}\Big) \le N\, P\big[Y_i^{(\beta N \ln N / n)}\big] \le N^{1-\beta}.$$
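To make the bound in Lemma 1 concrete, the following minimal Python sketch (not part of the paper) estimates the tail probability $P(T > \beta N\ln N/n)$ by simulation and compares it with $N^{1-\beta}$; the values of $N$, $n$, $\beta$, and the number of repetitions are arbitrary illustrative choices.

```python
import math
import random

def trials_to_cover(N, n):
    """Draw n distinct indices uniformly from {0, ..., N-1} per trial; return how many
    trials are needed until every index has been seen at least once."""
    seen, trials = set(), 0
    while len(seen) < N:
        seen.update(random.sample(range(N), n))  # one trial
        trials += 1
    return trials

# Compare the empirical tail probability with the bound N^(1 - beta) from Lemma 1.
N, n, beta, reps = 200, 10, 2.0, 2000
threshold = beta * N * math.log(N) / n
tail = sum(trials_to_cover(N, n) > threshold for _ in range(reps)) / reps
print(f"empirical P(T > threshold) = {tail:.4f}, bound = {N ** (1 - beta):.4f}")
```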

From equation (14) in the main paper, we can derive
$$\Delta^{(k+1)} = y^{(k+1)} - x^{(k+1)}$$
$$= \Delta^{(k)} - \int_{k\eta}^{(k+1)\eta}\big(\nabla f(y(s)) - \nabla f(y^{(k)})\big)\,ds - \eta\big(\nabla f(y^{(k)}) - \nabla f(x^{(k)})\big) + \eta\big(g^{(k)} - \nabla f(x^{(k)})\big)$$
$$= \Delta^{(k)} - V^{(k)} - \eta U^{(k)} + \eta\Psi^{(k)} + \eta E^{(k)},$$
where
$$V^{(k)} = \int_{k\eta}^{(k+1)\eta}\big(\nabla f(y(s)) - \nabla f(y^{(k)})\big)\,ds,$$
$$U^{(k)} = \nabla f(y^{(k)}) - \nabla f(x^{(k)}),$$
$$\Psi^{(k)} = \frac{N}{n}\sum_{i \in I_k}\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) + \sum_{i=1}^{N}\alpha_i^{(k)} - \nabla f(x^{(k)}),$$
$$E^{(k)} = \frac{N}{n}\Big(\sum_{i \in S_k}\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) - \sum_{i \in I_k}\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big)\Big).$$
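As an illustration of this decomposition (a minimal NumPy sketch, not the authors' code), the identity $g^{(k)} - \nabla f(x^{(k)}) = \Psi^{(k)} + E^{(k)}$ can be checked numerically. Here `grads` and `alpha` are assumed arrays of shape $(N, d)$ holding $\nabla f_i(x^{(k)})$ and $\alpha_i^{(k)}$, `S_k` is the set of indices actually accessed (size $n$), and `I_k` is the idealized uniform sample (size $n$); all concrete values below are made up.

```python
import numpy as np

def decompose(grads, alpha, S_k, I_k):
    """Form the stochastic gradient g^(k) and the terms Psi^(k), E^(k), and check
    that g^(k) - grad f(x^(k)) = Psi^(k) + E^(k)."""
    N, n = grads.shape[0], len(I_k)
    full_grad = grads.sum(axis=0)                                   # grad f(x^(k))
    g = (N / n) * (grads[S_k] - alpha[S_k]).sum(axis=0) + alpha.sum(axis=0)
    psi = (N / n) * (grads[I_k] - alpha[I_k]).sum(axis=0) + alpha.sum(axis=0) - full_grad
    e = (N / n) * ((grads[S_k] - alpha[S_k]).sum(axis=0)
                   - (grads[I_k] - alpha[I_k]).sum(axis=0))
    assert np.allclose(g - full_grad, psi + e)                      # the identity above
    return g, psi, e

# Example with random data (under RA, S_k == I_k and e is identically zero).
rng = np.random.default_rng(0)
N, d, n = 8, 3, 4
grads, alpha = rng.normal(size=(N, d)), rng.normal(size=(N, d))
I_k = list(rng.integers(0, N, size=n))     # uniform, independent indices
S_k = list(range(n))                       # e.g. a cyclic-access block
decompose(grads, alpha, S_k, I_k)
```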
Lemma 2. Assuming the $M$-smoothness and $\mu$-strong convexity of $f$ and that $\eta \le \frac{2}{M+\mu}$, we have
$$\mathbb{E}\|V^{(k)}\|^2 \le \frac{1}{3}\eta^4 M^3 d + \eta^3 M^2 d, \qquad (4)$$
$$\mathbb{E}\|\Delta^{(k)} - \eta U^{(k)}\|^2 \le (1 - \eta\mu)^2\,\mathbb{E}\|\Delta^{(k)}\|^2. \qquad (5)$$
Lemma 3. Assuming the $M$-smoothness and $\mu$-strong convexity of $f$ and Requirement 1, we have the following upper bounds on $\mathbb{E}\|\Psi^{(k)}\|^2$ and $\mathbb{E}\|E^{(k)}\|^2$:
$$\mathbb{E}\|\Psi^{(k)}\|^2 \le \frac{N}{n}\sum_{i=1}^{N}\mathbb{E}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2,$$
$$\mathbb{E}\|E^{(k)}\|^2 \le \frac{2N(N+n)}{n}\sum_{i=1}^{N}\mathbb{E}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2,$$
and
$$\sum_{i=1}^{N}\mathbb{E}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2 \le 32\eta^2 D^2 M^2\,\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + 4\eta D d + 48\eta^3 M^2 D^3 d(\eta D M + 1) + 8\eta^2 M N D^2 d,$$
where $\mathbb{E}\|E^{(k)}\|^2 = 0$ if the data access strategy is RA and $\Delta^{(k:k-2D)} := [\Delta^{(k)}, \Delta^{(k-1)}, \cdots, \Delta^{([k-2D]^+)}]$.
Proof.
$$\mathbb{E}\|\Psi^{(k)}\|^2 = \mathbb{E}\Big\|\frac{N}{n}\sum_{i \in I_k}\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) + \sum_{i=1}^{N}\alpha_i^{(k)} - \nabla f(x^{(k)})\Big\|^2$$
$$= \mathbb{E}\Big\|\frac{1}{n}\sum_{i \in I_k}\Big(N\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) - \big(\nabla f(x^{(k)}) - \sum_{j=1}^{N}\alpha_j^{(k)}\big)\Big)\Big\|^2$$
$$= \frac{1}{n}\,\mathbb{E}\Big\|N\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) - \big(\nabla f(x^{(k)}) - \sum_{j=1}^{N}\alpha_j^{(k)}\big)\Big\|^2$$
$$\le \frac{N^2}{n}\,\mathbb{E}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2$$
$$= \frac{N}{n}\sum_{i=1}^{N}\mathbb{E}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2.$$
The third equality follows from the fact that the indices in $I_k$ are chosen uniformly and independently. The first inequality is due to the fact that $\mathbb{E}\|X - \mathbb{E}X\|^2 \le \mathbb{E}\|X\|^2$ for any random variable $X$. In the last equality we take the expectation over the uniformly chosen index $i \in \{1, \cdots, N\}$, so that $i$ is no longer a random variable.

$$\mathbb{E}\|E^{(k)}\|^2 = \mathbb{E}\Big\|\frac{N}{n}\Big(\sum_{i \in S_k}\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) - \sum_{i \in I_k}\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big)\Big)\Big\|^2$$
$$\le \frac{2N^2}{n^2}\,\mathbb{E}\Big(\Big\|\sum_{i \in S_k}\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big)\Big\|^2 + \Big\|\sum_{i \in I_k}\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big)\Big\|^2\Big)$$
$$\le \frac{2N^2}{n^2}\,\mathbb{E}\Big(n\sum_{i \in S_k}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2 + n\sum_{i \in I_k}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2\Big)$$
$$\le \frac{2N^2}{n^2}\,\mathbb{E}\Big(n\sum_{i=1}^{N}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2 + \frac{n^2}{N}\sum_{i=1}^{N}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2\Big)$$
$$= \frac{2N(N+n)}{n}\sum_{i=1}^{N}\mathbb{E}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2.$$
In the first two inequalities, we use $\|\sum_{i=1}^{n} a_i\|^2 \le n\sum_{i=1}^{n}\|a_i\|^2$. The third inequality follows from the fact that $S_k$ is a subset of $\{1, \cdots, N\}$ under CA and RS, and that the indices in $I_k$ are chosen uniformly and independently from $\{1, \cdots, N\}$. When we use RA, $S_k$ simply equals $I_k$ and $\mathbb{E}\|E^{(k)}\|^2 = 0$.
Suppose that in the $k$-th iteration, the snapshot $\alpha_i^{(k)}$ was taken at $x^{(k_i)}$, where $k_i \in \{(k-1)\vee 0, (k-2)\vee 0, \cdots, (k-D)\vee 0\}$. By the $\tilde{M}$-smoothness of $f_i$, we have
$$\sum_{i=1}^{N}\mathbb{E}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2 \le \tilde{M}^2\sum_{i=1}^{N}\mathbb{E}\|x^{(k)} - x^{(k_i)}\|^2 = \frac{M^2}{N^2}\sum_{i=1}^{N}\mathbb{E}\|x^{(k)} - x^{(k_i)}\|^2.$$
According to the update rule of $x^{(k)}$, we have
$$\mathbb{E}\|x^{(k)} - x^{(k_i)}\|^2 = \mathbb{E}\Big\|\eta\sum_{j=k_i}^{k-1} g^{(j)} + \sqrt{2}\sum_{j=k_i}^{k-1}\xi^{(j)}\Big\|^2 \le 2D\eta^2\sum_{j=k-D}^{k-1}\mathbb{E}\|g^{(j)}\|^2 + 4Dd\eta,$$
where the inequality follows from $\|a+b\|^2 \le 2(\|a\|^2 + \|b\|^2)$, the fact that the $\xi^{(j)}$ are independent Gaussian variables, and $k_i \ge k - D$.
By expanding $g^{(j)}$, we have
$$\mathbb{E}\|g^{(j)}\|^2 = \mathbb{E}\Big\|\frac{N}{n}\sum_{p \in S_j}\big(\nabla f_p(x^{(j)}) - \alpha_p^{(j)}\big) + \sum_{p=1}^{N}\alpha_p^{(j)}\Big\|^2$$
$$\le \underbrace{2\,\mathbb{E}\Big\|\frac{N}{n}\sum_{p \in S_j}\big(\nabla f_p(x^{(j)}) - \alpha_p^{(j)}\big)\Big\|^2}_{A} + \underbrace{2\,\mathbb{E}\Big\|\sum_{p=1}^{N}\alpha_p^{(j)}\Big\|^2}_{B}.$$
For $A$, we have
$$A \le 2n\,\frac{N^2}{n^2}\sum_{p \in S_j}\mathbb{E}\big\|\big(\nabla f_p(x^{(j)}) - \nabla f_p(y^{(j)})\big) + \big(\nabla f_p(y^{(j)}) - \nabla f_p(y^{(j_p)})\big) + \big(\nabla f_p(y^{(j_p)}) - \alpha_p^{(j)}\big)\big\|^2$$
$$\le \frac{6N^2}{n}\sum_{p \in S_j}\Big(\mathbb{E}\|\nabla f_p(x^{(j)}) - \nabla f_p(y^{(j)})\|^2 + \mathbb{E}\|\nabla f_p(y^{(j)}) - \nabla f_p(y^{(j_p)})\|^2 + \mathbb{E}\|\nabla f_p(y^{(j_p)}) - \alpha_p^{(j)}\|^2\Big)$$
$$\le \frac{6M^2}{n}\sum_{p \in S_j}\Big(\mathbb{E}\|x^{(j)} - y^{(j)}\|^2 + \mathbb{E}\|y^{(j)} - y^{(j_p)}\|^2 + \mathbb{E}\|y^{(j_p)} - x^{(j_p)}\|^2\Big),$$
where the last inequality follows from the smoothness of $f_p$.


By further expanding $y^{(j)}$ and $y^{(j_p)}$, we have
$$\mathbb{E}\|y^{(j)} - y^{(j_p)}\|^2 = \mathbb{E}\Big\|\int_{j_p\eta}^{j\eta}\nabla f(y(s))\,ds - \sqrt{2}\sum_{q=j_p}^{j}\xi^{(q)}\Big\|^2$$
$$\le 2(j - j_p)\eta\int_{j_p\eta}^{j\eta}\mathbb{E}\|\nabla f(y(s))\|^2\,ds + 4\eta Dd$$
$$\le 2D\eta\cdot D\eta M d + 4\eta Dd = 2D^2\eta^2 M d + 4\eta Dd.$$
Here, the first inequality is due to Jensen's inequality, and the second inequality follows from Lemma 3 in (Dalalyan and Karagulyan 2017), which gives $\mathbb{E}\|\nabla f(y(s))\|^2 \le M d$.
Then we can bound $A$ from above by
$$A \le \frac{6M^2}{n}\sum_{p \in S_j}\Big(\mathbb{E}\|\Delta^{(j)}\|^2 + 2D^2\eta^2 M d + 4\eta Dd + \mathbb{E}\|\Delta^{(j_p)}\|^2\Big)$$
$$\le 6M^2\Big(\mathbb{E}\|\Delta^{(j)}\|^2 + 2D^2\eta^2 M d + 4\eta Dd + \mathbb{E}\|\Delta^{(j:j-D)}\|_{2,\infty}^2\Big).$$
Now we can bound $B$ with a similar technique:
$$B = 2\,\mathbb{E}\Big\|\sum_{p=1}^{N}\big(\alpha_p^{(j)} - \nabla f_p(y^{(j_p)})\big) + \sum_{p=1}^{N}\nabla f_p(y^{(j_p)})\Big\|^2$$
$$\le 4N\sum_{p=1}^{N}\mathbb{E}\|\nabla f_p(x^{(j_p)}) - \nabla f_p(y^{(j_p)})\|^2 + 4N\sum_{p=1}^{N}\mathbb{E}\|\nabla f_p(y^{(j_p)})\|^2$$
$$\le \frac{4M^2}{N}\sum_{p=1}^{N}\mathbb{E}\|\Delta^{(j_p)}\|^2 + 4NMd$$
$$\le 4M^2\,\mathbb{E}\|\Delta^{(j:j-D)}\|_{2,\infty}^2 + 4NMd.$$
Substituting these bounds back, we have
$$\sum_{i=1}^{N}\mathbb{E}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2 \le \frac{M^2}{N^2}\sum_{i=1}^{N}\mathbb{E}\|x^{(k)} - x^{(k_i)}\|^2$$
$$\le \frac{M^2}{N^2}\sum_{i=1}^{N}\Big(2D\eta^2\sum_{j=k-D}^{k-1}\mathbb{E}\|g^{(j)}\|^2 + 4Dd\eta\Big)$$
$$\le \frac{M^2}{N^2}\sum_{i=1}^{N}\Big(2D\eta^2\sum_{j=k-D}^{k-1}\Big(6M^2\big(\mathbb{E}\|\Delta^{(j)}\|^2 + 2D^2\eta^2 M d + 4\eta Dd + \mathbb{E}\|\Delta^{(j:j-D)}\|_{2,\infty}^2\big) + 4M^2\,\mathbb{E}\|\Delta^{(j:j-D)}\|_{2,\infty}^2 + 4NMd\Big) + 4Dd\eta\Big)$$
$$\le \frac{M^2}{N}\Big(4Dd\eta + 2D^2\eta^2\big(16M^2\,\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + 24M^2 Dd\eta(\eta DM + 1) + 4NMd\big)\Big).$$
Then we can conclude the lemma.
Lemma 4. Given a positive sequence $\{a_i\}_{i=0}^{N}$ and $\rho \in (0, 1)$, if we have $\frac{C}{\rho} < a_i$ for all $i \in \{1, 2, \cdots, N\}$ and $a_k \le (1-\rho)\max(a_{[k-1]^+}, a_{[k-2]^+}, \cdots, a_{[k-D]^+}) + C$, then we can conclude
$$a_k \le (1-\rho)^{\lceil k/D\rceil} a_0 + \sum_{i=1}^{\lceil k/D\rceil}(1-\rho)^{i-1} C \le \exp(-\rho\lceil k/D\rceil)\, a_0 + \frac{C}{\rho}.$$
Proof. For all $i \in \{1, 2, \cdots, D\}$, we have $a_i \le (1-\rho)a_0 + C < a_0$.
Then $a_{D+1} \le (1-\rho)\max(a_D, a_{D-1}, \cdots, a_1) + C \le (1-\rho)^2 a_0 + \sum_{i=1}^{2}(1-\rho)^{i-1} C < (1-\rho)a_0 + C$.
And $a_{D+2} \le (1-\rho)\max(a_{D+1}, a_D, \cdots, a_2) + C \le (1-\rho)^2 a_0 + \sum_{i=1}^{2}(1-\rho)^{i-1} C < (1-\rho)a_0 + C$.
By repeating this argument, we can conclude $a_k \le (1-\rho)^{\lceil k/D\rceil} a_0 + \sum_{i=1}^{\lceil k/D\rceil}(1-\rho)^{i-1} C$ by induction.
Since $1 - x \le \exp(-x)$ and $\sum_{i=1}^{N}(1-\rho)^{i-1} C \le \frac{C}{\rho}$, we conclude the lemma.
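As a quick numerical illustration of Lemma 4 (an assumed toy example, not from the paper), one can iterate the worst case of the delayed recursion and compare it against the stated bound; the constants $a_0$, $\rho$, $C$, $D$, and the horizon are arbitrary.

```python
import math

def worst_case_sequence(a0, rho, C, D, K):
    """Iterate a_k = (1 - rho) * max(a_{k-1}, ..., a_{k-D}) + C, the worst case allowed
    by the recursion in Lemma 4, for K steps starting from a_0."""
    a = [a0]
    for k in range(1, K + 1):
        window = a[max(0, k - D):k]          # the D most recent terms (fewer at the start)
        a.append((1 - rho) * max(window) + C)
    return a

a0, rho, C, D, K = 10.0, 0.05, 0.01, 4, 200
seq = worst_case_sequence(a0, rho, C, D, K)
for k in (0, 50, 100, 200):
    bound = math.exp(-rho * math.ceil(k / D)) * a0 + C / rho
    print(k, round(seq[k], 4), "<=", round(bound, 4))
```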


Proposition 1. Assuming the $M$-smoothness and $\mu$-strong convexity of $f$ and Requirement 1, if $\eta \le \min\Big\{\frac{\mu\sqrt{n}}{8\sqrt{10}\,NDM}, \frac{2}{\mu + M}\Big\}$, then for all $k \ge 0$ we have
$$\mathbb{E}\|\Delta^{(k+1)}\|^2 \le \Big(1 - \frac{\eta\mu}{2}\Big)\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + C_1\eta^3 + C_2\eta^2,$$
where $C_1$ and $C_2$ are constants that depend only on $M$, $N$, $n$, $D$, $d$, and $\mu$.
Proof. We give the proof sketch here. For the full proof, please refer to the Supplementary. Since $\mathbb{E}[\Psi^{(k)} \mid x^{(k)}] = 0$, we have
$$\mathbb{E}\|\Delta^{(k+1)}\|^2 = \mathbb{E}\|\Delta^{(k)} - \eta U^{(k)} - V^{(k)} + \eta E^{(k)}\|^2 + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2$$
$$\le (1+\alpha)\,\mathbb{E}\|\Delta^{(k)} - \eta U^{(k)}\|^2 + \Big(1 + \frac{1}{\alpha}\Big)\mathbb{E}\|V^{(k)} + \eta E^{(k)}\|^2 + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2$$
$$\le (1+\alpha)\,\mathbb{E}\|\Delta^{(k)} - \eta U^{(k)}\|^2 + 2\Big(1 + \frac{1}{\alpha}\Big)\big(\mathbb{E}\|V^{(k)}\|^2 + \eta^2\,\mathbb{E}\|E^{(k)}\|^2\big) + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2,$$
where the first and second inequalities are due to Young's inequality.
By substituting the bounds from Lemma 2 and Lemma 3, we get a one-step result for $\mathbb{E}\|\Delta^{(k+1)}\|^2$:
$$\mathbb{E}\|\Delta^{(k+1)}\|^2 \le (1+\alpha)(1-\eta\mu)^2\,\mathbb{E}\|\Delta^{(k)}\|^2 + 2\Big(1+\frac{1}{\alpha}\Big)\Big(\frac{\eta^4 M^3 d}{3} + \eta^3 M^2 d\Big) + \Big(4\Big(1+\frac{1}{\alpha}\Big)(N+n) + 1\Big)\frac{N\eta^2}{n}\sum_{i=1}^{N}\mathbb{E}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2$$
$$\le (1+\alpha)(1-\eta\mu)^2\,\mathbb{E}\|\Delta^{(k)}\|^2 + 2\Big(1+\frac{1}{\alpha}\Big)\Big(\frac{\eta^4 M^3 d}{3} + \eta^3 M^2 d\Big) + \frac{5N(N+n)\eta^2}{n}\Big(1+\frac{1}{\alpha}\Big)\frac{M^2}{N}\Big(4\eta Dd + 32\eta^2 D^2 M^2\,\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + 48\eta^3 M^2 D^3 d(\eta DM + 1) + 8\eta^2 MND^2 d\Big)$$
$$\le \Big((1+\alpha)(1-\eta\mu)^2 + \frac{160 D^2 M^4 (N+n)\eta^4}{n}\Big(1+\frac{1}{\alpha}\Big)\Big)\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + C,$$
where
$$C = 2\Big(1+\frac{1}{\alpha}\Big)\Big(\frac{\eta^4 M^3 d}{3} + \eta^3 M^2 d\Big) + \frac{5M^2(N+n)\eta^2}{n}\Big(1+\frac{1}{\alpha}\Big)\Big(4\eta Dd + 48\eta^3 M^2 D^3 d(\eta DM + 1) + 8\eta^2 MND^2 d\Big).$$

By choosing $\alpha = \eta\mu < 1$ and $\eta \le \frac{\mu\sqrt{n}}{8\sqrt{10(N+n)}\,DM^2}$, we have $(1+\alpha)(1-\eta\mu)^2 + \frac{160 D^2 M^4 (N+n)\eta^4}{n}\big(1+\frac{1}{\alpha}\big) \le 1 - \frac{\eta\mu}{2}$ and

$$C \le 2\Big(1+\frac{1}{\alpha}\Big)\Big(\frac{\eta^4 M^3 d}{3} + \eta^3 M^2 d\Big) + \frac{5M^2(N+n)\eta^2}{n}\Big(1+\frac{1}{\alpha}\Big)\Big(4\eta Dd + 48\eta^3 M^2 D^3 d(\eta DM + 1) + 8\eta^2 MND^2 d\Big)$$
$$\le \eta^3\underbrace{\Big(\frac{4M^3 d}{\mu} + \frac{10M^2(N+n)}{n\mu}\Big(\frac{3\mu^2 D^2 n d}{40(N+n)M} + \frac{6D^2 d\mu^2 n}{\sqrt{10(M+n)}} + 8MND^2 d\Big)\Big)}_{C_1} + \eta^2\underbrace{\Big(\frac{4M^2 d}{\mu} + \frac{40M^2(N+n)Dd}{n\mu}\Big)}_{C_2}.$$

Then we can simplify the one-iteration relation into
$$\mathbb{E}\|\Delta^{(k+1)}\|^2 \le \Big(1 - \frac{\eta\mu}{2}\Big)\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + C_1\eta^3 + C_2\eta^2.$$

Theorem 1. Under the same assumptions as in Proposition 1, AGLD is guaranteed to reach $\epsilon$-accuracy in 2-Wasserstein distance in $k = O(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon})$ iterations by setting $\eta = O(\epsilon^2)$.
Proof. We now aim for an $\epsilon$-accurate approximation in 2-Wasserstein distance. In order to use Lemma 4, we can assume that $\mathbb{E}\|\Delta^{(k)}\|^2 > \frac{\epsilon^2}{4}$ (otherwise, we already have $\epsilon/2$-accuracy), $\frac{C_1\eta^3}{\eta\mu/2} \le \frac{\epsilon^2}{16}$, and $\frac{C_2\eta^2}{\eta\mu/2} \le \frac{\epsilon^2}{16}$. Then by using Lemma 4 and the fact that $|a|^2 + |b|^2 + |c|^2 \le (|a| + |b| + |c|)^2$, the Wasserstein distance between $p^{(k)}$ and $p^*$ is bounded by
$$W_2(p^{(k)}, p^*) \le \exp\Big(-\frac{\mu\eta\lceil k/(2D)\rceil}{4}\Big)W_0 + \sqrt{\frac{C_1\eta}{\mu}} + \sqrt{\frac{C_2\eta}{\mu}}.$$
Then by requiring that $\exp\big(-\frac{\mu\eta\lceil k/(2D)\rceil}{4}\big)W_0 \le \frac{\epsilon}{2}$, $\sqrt{\frac{C_1\eta}{\mu}} \le \frac{\epsilon}{4}$, and $\sqrt{\frac{C_2\eta}{\mu}} \le \frac{\epsilon}{4}$, we have $W_2(p^{(k)}, p^*) \le \epsilon$. That is, $\eta = O(\epsilon^2)$ and $k = O(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon})$.
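For intuition about the resulting schedule, the following sketch picks $\eta$ and $k$ from the three conditions at the end of the proof; the constants $C_1$, $C_2$, $\mu$, $D$ and the initial distance $W_0$ are hypothetical placeholders, not values from the paper.

```python
import math

def agld_schedule(eps, C1, C2, mu, D, W0):
    """Pick eta so that sqrt(C1*eta/mu) <= eps/4 and sqrt(C2*eta/mu) <= eps/4, then the
    smallest k (a multiple of 2D) with exp(-mu*eta*ceil(k/(2D))/4) * W0 <= eps/2."""
    eta = min(mu * eps ** 2 / (16 * C1), mu * eps ** 2 / (16 * C2))
    k = 2 * D * math.ceil(4 / (mu * eta) * math.log(2 * W0 / eps))
    return eta, k

eta, k = agld_schedule(eps=0.1, C1=5.0, C2=5.0, mu=1.0, D=4, W0=10.0)
print(eta, k)   # eta scales like eps^2 and k like eps^{-2} * log(1/eps)
```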

Improved results under additional smoothness assumptions


Under the Hessian Lipschitz continuity condition, we can improve the convergence rate of IAGLD with random access.
Hessian Lipschitz: There exists a constant $L > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L\|x - y\|_2. \qquad (6)$$
We first give a technical lemma.
Lemma 5 ((Dalalyan and Karagulyan 2017)). Assuming the $M$-smoothness, $\mu$-strong convexity and $L$-Hessian Lipschitz smoothness of $f$, we have
$$\mathbb{E}\|S^{(k)}\|^2 \le \frac{\eta^3 M^2 d}{3}, \qquad (7)$$
$$\mathbb{E}\|V^{(k)} - S^{(k)}\|^2 \le \frac{\eta^4(L^2 d^2 + M^3 d)}{2}, \qquad (8)$$
where $S^{(k)} = \sqrt{2}\int_{k\eta}^{(k+1)\eta}\int_{k\eta}^{s}\nabla^2 f(y(r))\,dW(r)\,ds$.

Theorem 2. Under the same assumptions as in Proposition 1 together with the Hessian Lipschitz condition, AGLD with the RA procedure can achieve $\epsilon$-accuracy after $k = O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ iterations by setting $\eta = O(\epsilon)$.
Proof. The proof is similar to that of Theorem 1, but there are some key differences. First, we again give the one-iteration result. Since $\mathbb{E}[\Psi^{(k)} \mid x^{(k)}] = 0$, we have
$$\mathbb{E}\|\Delta^{(k+1)}\|^2 = \mathbb{E}\|\Delta^{(k)} - \eta U^{(k)} - (V^{(k)} - S^{(k)}) - S^{(k)} + \eta E^{(k)}\|^2 + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2$$
$$\le (1+\alpha)\,\mathbb{E}\|\Delta^{(k)} - \eta U^{(k)} - S^{(k)}\|^2 + \Big(1+\frac{1}{\alpha}\Big)\mathbb{E}\|V^{(k)} - S^{(k)} + \eta E^{(k)}\|^2 + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2$$
$$\le (1+\alpha)\big(\mathbb{E}\|\Delta^{(k)} - \eta U^{(k)}\|^2 + \mathbb{E}\|S^{(k)}\|^2\big) + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2 + 2\Big(1+\frac{1}{\alpha}\Big)\big(\mathbb{E}\|V^{(k)} - S^{(k)}\|^2 + \eta^2\,\mathbb{E}\|E^{(k)}\|^2\big),$$
where in the second inequality we use the fact that $\mathbb{E}(S^{(k)} \mid \Delta^{(k)}, U^{(k)}) = 0$. By substituting the bounds from Lemma 2, Lemma 3 and Lemma 5, we get a one-step result for $\mathbb{E}\|\Delta^{(k+1)}\|^2$ in the same way as in Proposition 1:
$$\mathbb{E}\|\Delta^{(k+1)}\|^2 \le \Big(1 - \frac{\eta\mu}{2}\Big)\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + C_1\eta^3 + C_2\eta^2\big(1 - \mathbb{I}_{\{\mathrm{RA}\}}\big).$$
Here, we can see that for RA the $\eta^2$ term disappears, which is why we obtain a better result. Then, following a similar argument to the proof of Theorem 1, it can be verified that IAGLD with the RA procedure achieves $\epsilon$-accuracy after $k = O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ iterations by setting $\eta = O(\epsilon)$. However, for CA and RS, we still need $\eta = O(\epsilon^2)$ and $k = O(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon})$.
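For comparison with the schedule sketched after Theorem 1, the following hypothetical sketch shows the improved RA scaling: with the $\eta^2$ term gone, $\eta$ may scale like $\epsilon$ rather than $\epsilon^2$ (the proportionality constant `c_eta` is made up, not from the paper).

```python
import math

def agld_schedule_ra(eps, mu, D, W0, c_eta=0.1):
    """Under RA (Theorem 2) the eta^2 term vanishes, so a step size eta = O(eps) suffices;
    k is the smallest multiple of 2D with exp(-mu*eta*ceil(k/(2D))/4) * W0 <= eps/2."""
    eta = c_eta * eps                      # eta = O(eps); c_eta is an illustrative constant
    k = 2 * D * math.ceil(4 / (mu * eta) * math.log(2 * W0 / eps))
    return eta, k

print(agld_schedule_ra(eps=0.1, mu=1.0, D=4, W0=10.0))   # k = O(eps^{-1} * log(1/eps))
```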
References
Dalalyan, A. S., and Karagulyan, A. G. 2017. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095.
