
Theoretical Statistics. Lecture 4.

Peter Bartlett
1. Concentration inequalities.

Outline of today's lecture

We have been looking at deviation inequalities, i.e., bounds on tail
probabilities like $P(X_n \geq t)$ for some statistic $X_n$.
1. Using moment generating function bounds, for sums of independent r.v.s:
   Chernoff; Hoeffding; sub-Gaussian, sub-exponential random variables; Bernstein.
   Today: Johnson-Lindenstrauss.
2. Martingale methods:
   Hoeffding-Azuma, bounded differences.

Review. Chernoff technique

Theorem: For $t > 0$,
$$P(X - EX \geq t) \leq \inf_{\lambda > 0} e^{-\lambda t} M_{X - EX}(\lambda).$$

Theorem: [Hoeffding's Inequality] For a random variable $X \in [a, b]$ with
$EX = \mu$ and $\lambda \in \mathbb{R}$,
$$\ln M_{X - \mu}(\lambda) \leq \frac{\lambda^2 (b - a)^2}{8}.$$
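
Putting these two reviewed facts together gives the familiar Hoeffding tail bound; the short optimization over $\lambda$ below is added here as a worked example (it is not on the slide):
$$P(X - \mu \geq t) \leq \inf_{\lambda > 0} \exp\left( -\lambda t + \frac{\lambda^2 (b - a)^2}{8} \right) = \exp\left( -\frac{2 t^2}{(b - a)^2} \right),$$
with the infimum attained at $\lambda = 4t/(b - a)^2$.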

Review. Sub-Gaussian, Sub-Exponential Random Variables

Definition: $X$ is sub-Gaussian with parameter $\sigma^2$ if, for all $\lambda \in \mathbb{R}$,
$$\ln M_{X - \mu}(\lambda) \leq \frac{\lambda^2 \sigma^2}{2}.$$

Definition: $X$ is sub-exponential with parameters $(\sigma^2, b)$ if, for all $|\lambda| < 1/b$,
$$\ln M_{X - \mu}(\lambda) \leq \frac{\lambda^2 \sigma^2}{2}.$$

Review. Sub-Exponential Random Variables

Theorem: For $X$ sub-exponential with parameters $(\sigma^2, b)$,
$$P(X \geq \mu + t) \leq
\begin{cases}
\exp\left( -\frac{t^2}{2\sigma^2} \right) & \text{if } 0 \leq t \leq \sigma^2/b, \\
\exp\left( -\frac{t}{2b} \right) & \text{if } t > \sigma^2/b.
\end{cases}$$

For independent $X_i$, sub-exponential with parameters $(\sigma_i^2, b_i)$, the sum
$X = X_1 + \cdots + X_n$ is sub-exponential with parameters $\left( \sum_i \sigma_i^2, \max_i b_i \right)$.

Example: $X \sim \chi^2_1$ is sub-exponential with parameters $(4, 4)$.
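
To see where the $(4, 4)$ in the example comes from, here is a short derivation (added for completeness): for $Z \sim N(0, 1)$, $X = Z^2 \sim \chi^2_1$ has mean $\mu = 1$ and
$$M_{X - 1}(\lambda) = E\, e^{\lambda (Z^2 - 1)} = \frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}} \quad (\lambda < 1/2),
\qquad \text{so} \qquad
\ln M_{X - 1}(\lambda) = -\lambda - \tfrac{1}{2} \ln(1 - 2\lambda) \leq 2\lambda^2 \quad \text{for } |\lambda| < 1/4,$$
which is the sub-exponential condition with $\sigma^2 = 4$ and $b = 4$.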

Sub-Exponential Random Variables: Example

Theorem: [Johnson-Lindenstrauss] For $m$ points $x_1, \ldots, x_m$ from $\mathbb{R}^d$,
there is a projection $F : \mathbb{R}^d \to \mathbb{R}^n$ that preserves distances in the sense
that, for all $x_i, x_j$,
$$(1 - \epsilon) \|x_i - x_j\|_2^2 \leq \|F(x_i) - F(x_j)\|_2^2 \leq (1 + \epsilon) \|x_i - x_j\|_2^2,$$
provided that $n > (16/\epsilon^2) \log m$.
That is, we can embed these points in $\mathbb{R}^n$ and approximately maintain their
distance relationships, provided that $n$ is not too small. Notice that $n$ is
independent of the ambient dimension $d$, and depends only logarithmically
on the number of points $m$.

Johnson-Lindenstrauss
Applications: dimension reduction to simplify computation (nearest
neighbor, clustering, image processing, text processing).
Analysis of machine learning methods: separability by a large margin in high
dimensions implies it's really a low-dimensional problem after all.

Johnson-Lindenstrauss Embedding: Proof

We use a random projection:
$$F(x) = \frac{1}{\sqrt{n}} Y x,$$
where $Y \in \mathbb{R}^{n \times d}$ has independent $N(0, 1)$ entries.
Let $Y_i$ denote the $i$th row, for $1 \leq i \leq n$. It has a $N(0, I)$ distribution, so
$Y_i^T x / \|x\|_2 \sim N(0, 1)$. Thus,
$$Z = \frac{\|Y x\|_2^2}{\|x\|_2^2} = \sum_{i=1}^{n} \left( Y_i^T x / \|x\|_2 \right)^2 \sim \chi^2_n.$$

Johnson-Lindenstrauss Embedding: Proof

Since $Z \sim \chi^2_n$ is the sum of $n$ independent sub-exponential $(4, 4)$ random
variables, it is sub-exponential $(4n, 4)$. And we have that, for $0 < t < n$,
$$P(|Z - n| \geq t) \leq 2 \exp(-t^2 / (8n)).$$
Hence, for $0 < \epsilon < 1$,
$$P\left( \left| \frac{\|Y x\|_2^2}{n \|x\|_2^2} - 1 \right| \geq \epsilon \right) \leq 2 \exp(-n \epsilon^2 / 8),
\qquad \text{that is,} \qquad
P\left( \frac{\|F(x)\|_2^2}{\|x\|_2^2} \notin [1 - \epsilon, 1 + \epsilon] \right) \leq 2 \exp(-n \epsilon^2 / 8).$$

Johnson-Lindenstrauss Embedding: Proof

Applying this to the $\binom{m}{2}$ distinct pairs $x = x_i - x_j$, and using the union
bound, gives
$$P\left( \exists\, i \neq j \text{ s.t. } \frac{\|F(x_i - x_j)\|_2^2}{\|x_i - x_j\|_2^2} \notin [1 - \epsilon, 1 + \epsilon] \right) \leq 2 \binom{m}{2} \exp(-n \epsilon^2 / 8).$$
Thus, for $n > (16/\epsilon^2) \log(m)$, this probability is strictly less than 1, so there
exists a suitable mapping.
In fact, we can choose a random projection in this way and ensure that the
probability that it does not satisfy the approximate isometry property is no
more than $\delta$, for $n > (16/\epsilon^2) \log(m/\delta)$.
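
A small numerical sketch of this construction (added here, not from the slides; the point set, $\epsilon$, and random seed are arbitrary illustrative choices): draw the Gaussian matrix $Y$, form $F(x) = Y x / \sqrt{n}$, and check how much the squared pairwise distances are distorted.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, eps = 50, 1000, 0.5
n = int(np.ceil(16 / eps**2 * np.log(m)))   # n > (16/eps^2) log m

X = rng.normal(size=(m, d))                 # m arbitrary points in R^d
Y = rng.normal(size=(n, d))                 # projection matrix with i.i.d. N(0,1) entries
F = X @ Y.T / np.sqrt(n)                    # F(x) = Y x / sqrt(n), one row per point

# Ratio of squared distances after / before projection, over all pairs i < j.
i, j = np.triu_indices(m, k=1)
ratio = (((F[i] - F[j]) ** 2).sum(axis=1)
         / ((X[i] - X[j]) ** 2).sum(axis=1))
print(n, ratio.min(), ratio.max())          # typically lands inside [1 - eps, 1 + eps]
```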

Concentration Bounds for Martingale Difference Sequences

Next, we're going to consider concentration of martingale difference
sequences. The application is to understand how the tails of
$f(X_1, \ldots, X_n) - E f(X_1, \ldots, X_n)$ behave, for some function $f$.
[e.g., in the homework, we have that $f$ is some measure of the performance
of a kernel density estimator.] If we write
$$f(X_1, \ldots, X_n) - E f(X_1, \ldots, X_n)
= \sum_{i=1}^{n} \Big( E[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_i] - E[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_{i-1}] \Big),$$
then we have represented this deviation as a sum of a martingale difference sequence.

Martingales

Definition: A sequence $Y_n$ of random variables adapted to a filtration $\mathcal{F}_n$ is
a martingale if, for all $n$,
$$E|Y_n| < \infty, \qquad E[Y_{n+1} \mid \mathcal{F}_n] = Y_n.$$
"$\mathcal{F}_n$ is a filtration" means these $\sigma$-fields are nested: $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$.
"$Y_n$ is adapted to $\mathcal{F}_n$" means that each $Y_n$ is measurable with respect to $\mathcal{F}_n$.
e.g. $\mathcal{F}_n = \sigma(Y_1, \ldots, Y_n)$, the $\sigma$-field generated by the first $n$ variables.
Then we say $Y_n$ is a martingale sequence.
e.g. $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$. Then $Y_n$ is a martingale sequence w.r.t. $X_n$.

Martingale Difference Sequences

Definition: A sequence $D_n$ of random variables adapted to a filtration $\mathcal{F}_n$
is a martingale difference sequence if, for all $n$,
$$E|D_n| < \infty, \qquad E[D_{n+1} \mid \mathcal{F}_n] = 0.$$
e.g., $D_n = Y_n - Y_{n-1}$ for a martingale $Y_n$:
$$E[D_{n+1} \mid \mathcal{F}_n] = E[Y_{n+1} \mid \mathcal{F}_n] - E[Y_n \mid \mathcal{F}_n] = E[Y_{n+1} \mid \mathcal{F}_n] - Y_n = 0$$
(because $Y_n$ is measurable w.r.t. $\mathcal{F}_n$, and because of the martingale property).
Hence, $Y_n - Y_0 = \sum_{i=1}^{n} D_i$.

Martingale Difference Sequences: the Doob construction

Define
$$X = (X_1, \ldots, X_n), \qquad X_1^i = (X_1, \ldots, X_i), \qquad Y_0 = E f(X), \qquad Y_i = E[f(X) \mid X_1^i].$$
Then
$$f(X) - E f(X) = Y_n - Y_0 = \sum_{i=1}^{n} D_i,$$
where $D_i = Y_i - Y_{i-1}$. Also, $Y_i$ is a martingale w.r.t. $X_i$, and hence $D_i$ is a
martingale difference sequence. Indeed (because $E X = E\, E[X \mid Y]$),
$$E[Y_{i+1} \mid X_1^i] = E\big[ E[f(X) \mid X_1^{i+1}] \,\big|\, X_1^i \big] = E[f(X) \mid X_1^i] = Y_i.$$
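
As a concrete illustration of the Doob construction (added here; the choice $f = \max$ and the Uniform(0,1) distribution are arbitrary), the sketch below computes $Y_i = E[f(X) \mid X_1^i]$ in closed form and checks that the differences $D_i = Y_i - Y_{i-1}$ have mean zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

def doob_martingale(x):
    """Y_i = E[max(X_1,...,X_n) | X_1,...,X_i] for X_j i.i.d. Uniform(0,1).
    Closed form: E[max(m, M_k)] = m^(k+1) + k (1 - m^(k+1)) / (k+1), where m is the
    maximum of the observed prefix and M_k is the maximum of the k remaining draws."""
    y = np.empty(n + 1)
    y[0] = n / (n + 1)                      # Y_0 = E f(X) = E[max of n uniforms]
    for i in range(1, n + 1):
        m, k = x[:i].max(), n - i
        y[i] = m ** (k + 1) + k * (1 - m ** (k + 1)) / (k + 1)
    return y

# The martingale differences D_i = Y_i - Y_{i-1} each have mean zero,
# and Y_n - Y_0 = f(X) - E f(X) by construction.
diffs = np.array([np.diff(doob_martingale(rng.uniform(size=n)))
                  for _ in range(20000)])
print(diffs.mean(axis=0))                   # all entries close to 0
```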

Martingale Difference Sequences: another example

[An aside:] Consider two densities $f$ and $g$, with $g$ absolutely continuous
w.r.t. $f$. Suppose the $X_n$ are drawn i.i.d. from $f$, and $Y_n$ is the likelihood ratio,
$$Y_n = \prod_{i=1}^{n} \frac{g(X_i)}{f(X_i)}.$$
Then $Y_n$ is a martingale w.r.t. $X_n$. Indeed,
$$E[Y_{n+1} \mid X_1^n]
= E\left[ \prod_{i=1}^{n+1} \frac{g(X_i)}{f(X_i)} \,\middle|\, X_1^n \right]
= E\left[ \frac{g(X_{n+1})}{f(X_{n+1})} \right] \prod_{i=1}^{n} \frac{g(X_i)}{f(X_i)}
= \prod_{i=1}^{n} \frac{g(X_i)}{f(X_i)} = Y_n,$$
because $E[g(X_{n+1})/f(X_{n+1})] = 1$.
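
A quick numerical check of this aside (added here; the densities $f = N(0,1)$ and $g = N(1/2, 1)$ are an arbitrary choice): since $Y_0 = 1$ and $Y_n$ is a martingale, $E\, Y_k = 1$ for every $k$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, mu = 10, 100000, 0.5

# X_i drawn i.i.d. from f = N(0,1); for g = N(mu,1), g(x)/f(x) = exp(mu*x - mu^2/2).
x = rng.standard_normal((reps, n))
Y = np.cumprod(np.exp(mu * x - mu**2 / 2), axis=1)   # likelihood ratios Y_1, ..., Y_n

# The martingale property (with Y_0 = 1) gives E Y_k = 1 for every k.
print(Y.mean(axis=0))                                # each entry close to 1
```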

Concentration Bounds for Martingale Difference Sequences

Theorem: Consider a martingale difference sequence $D_n$ (adapted to a
filtration $\mathcal{F}_n$) that satisfies, for all $|\lambda| \leq 1/b_n$,
$$E[\exp(\lambda D_n) \mid \mathcal{F}_{n-1}] \leq \exp(\lambda^2 \sigma_n^2 / 2) \quad \text{a.s.}$$
Then $\sum_{i=1}^{n} D_i$ is sub-exponential, with $(\sigma^2, b) = \left( \sum_{i=1}^{n} \sigma_i^2, \max_i b_i \right)$, and
$$P\left( \Big| \sum_i D_i \Big| \geq t \right) \leq
\begin{cases}
2 \exp\left( -\frac{t^2}{2\sigma^2} \right) & \text{if } 0 \leq t \leq \sigma^2/b, \\
2 \exp\left( -\frac{t}{2b} \right) & \text{if } t > \sigma^2/b.
\end{cases}$$

Concentration Bounds for Martingale Difference Sequences

Proof:
$$E \exp\left( \lambda \sum_{i=1}^{n} D_i \right)
= E\left[ \exp\left( \lambda \sum_{i=1}^{n-1} D_i \right) E\left[ \exp(\lambda D_n) \mid \mathcal{F}_{n-1} \right] \right]
\leq E\left[ \exp\left( \lambda \sum_{i=1}^{n-1} D_i \right) \right] \exp(\lambda^2 \sigma_n^2 / 2),$$
by the tower property, provided $|\lambda| \leq 1/b = 1/\max_i b_i$. Iterating shows that $\sum_i D_i$ is sub-exponential.

Concentration Bounds for Martingale Difference Sequences

Theorem: Consider a martingale difference sequence $D_i$ with $|D_i| \leq B_i$ a.s. Then
$$P\left( \Big| \sum_i D_i \Big| \geq t \right) \leq 2 \exp\left( -\frac{t^2}{2 \sum_i B_i^2} \right).$$

Proof:
It suffices to show that
$$E[\exp(\lambda D_i) \mid \mathcal{F}_{i-1}] \leq \exp(\lambda^2 B_i^2 / 2).$$
But $|D_i| \leq B_i$ a.s., so the conditioned variable $(D_i \mid \mathcal{F}_{i-1})$ lies in $[-B_i, B_i]$ a.s., so it is
sub-Gaussian with parameter $\sigma_i^2 = B_i^2$ (Hoeffding's inequality applied to an interval of length $2 B_i$).

Bounded Differences Inequality

Theorem: Suppose $X_1, \ldots, X_n$ are independent and $f : \mathcal{X}^n \to \mathbb{R}$ satisfies the
following bounded differences condition: for all $x_1, \ldots, x_n, x_i' \in \mathcal{X}$,
$$|f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)| \leq B_i.$$
Then
$$P(|f(X) - E f(X)| \geq t) \leq 2 \exp\left( -\frac{2 t^2}{\sum_i B_i^2} \right).$$
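
As a sanity check (added here, not on the slide): for the empirical mean $f(x) = \frac{1}{n} \sum_{i=1}^{n} x_i$ with coordinates in $[a, b]$, changing one coordinate moves $f$ by at most $B_i = (b - a)/n$, and the bounded differences inequality recovers Hoeffding's bound:
$$P\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \right| \geq t \right) \leq 2 \exp\left( -\frac{2 t^2}{n (b - a)^2 / n^2} \right) = 2 \exp\left( -\frac{2 n t^2}{(b - a)^2} \right).$$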

Bounded Differences Inequality

Proof: Use the Doob construction:
$$Y_i = E[f(X) \mid X_1^i], \qquad D_i = Y_i - Y_{i-1}, \qquad f(X) - E f(X) = \sum_{i=1}^{n} D_i.$$
Then, writing $X_i'$ for an independent copy of $X_i$ (so that
$E[f(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n) \mid X_1^i] = E[f(X) \mid X_1^{i-1}]$),
$$|D_i| = \big| E[f(X) \mid X_1^i] - E[f(X) \mid X_1^{i-1}] \big|
= \big| E\big[ f(X) - f(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n) \,\big|\, X_1^i \big] \big| \leq B_i.$$
(Conditionally on $X_1^{i-1}$, $D_i$ in fact ranges over an interval of length at most $B_i$; applying
Hoeffding's inequality to this conditional distribution gives the constant stated in the theorem,
while the previous theorem applied with $|D_i| \leq B_i$ gives the same bound with a slightly worse constant.)

Examples: Rademacher Averages

For a set $A \subseteq \mathbb{R}^n$, consider
$$Z = \sup_{a \in A} \langle \epsilon, a \rangle,$$
where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ is a sequence of i.i.d. uniform $\{\pm 1\}$ random
variables. Define the Rademacher complexity of $A$ as $R(A) = E Z$. [This
is a measure of the size of $A$.] The bounded differences approach implies
that $Z$ is concentrated around $R(A)$:

Theorem: $Z$ is sub-Gaussian with parameter $4 \sum_i \sup_{a \in A} a_i^2$.

Proof:
Write $Z = f(\epsilon_1, \ldots, \epsilon_n)$, and notice that a change of $\epsilon_i$ can lead to a
change in $Z$ of no more than $B_i = 2 \sup_{a \in A} |a_i|$. The result follows.
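
A small simulation of this example (added here; the finite set $A$ is an arbitrary choice, and the parameter below is the one from the theorem as stated above): estimate $R(A) = E Z$ and compare the empirical variance of $Z$ with the sub-Gaussian parameter, which upper bounds it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 5000

# A: an arbitrary finite subset of R^n (one vector a per row).
A = rng.normal(size=(30, n)) / np.sqrt(n)

eps = rng.choice([-1.0, 1.0], size=(reps, n))   # i.i.d. uniform {+-1} signs
Z = (eps @ A.T).max(axis=1)                     # Z = sup_{a in A} <eps, a>

sigma2 = 4 * (A ** 2).max(axis=0).sum()         # 4 * sum_i sup_{a in A} a_i^2
print("R(A) estimate:", Z.mean())
print("Var(Z) estimate:", Z.var(), "<= sub-Gaussian parameter:", sigma2)
```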

Examples: Empirical Processes

For a class $\mathcal{F}$ of functions $f : \mathcal{X} \to [0, 1]$, suppose that $X_1, \ldots, X_n, X$ are
i.i.d. on $\mathcal{X}$, and consider
$$Z = \sup_{f \in \mathcal{F}} \left| \underbrace{E f(X)}_{P f} - \underbrace{\frac{1}{n} \sum_{i=1}^{n} f(X_i)}_{P_n f} \right| = \| P f - P_n f \|_{\mathcal{F}},$$
the supremum of the empirical process over $\mathcal{F}$.

If $Z$ converges to 0, this is called a uniform law of large numbers. Here, we
show that $Z$ is concentrated about $E Z$:

Theorem: $Z$ is sub-Gaussian with parameter $1/n$.

Proof:
Write $Z = g(X_1, \ldots, X_n)$, and notice that a change of $X_i$ can lead to a
change in $Z$ of no more than $B_i = 1/n$. The result follows.
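
A minimal simulation of the uniform deviation (added here; the function class, indicators of intervals $[0, s]$ on $[0, 1]$, is an arbitrary illustrative choice): repeated draws of $Z$ fluctuate around $E Z$ on a scale no larger than $\sqrt{1/n}$, consistent with the theorem.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 2000
thresholds = np.linspace(0.05, 0.95, 19)            # F = { x -> 1[x <= s] : s in thresholds }

# X_i ~ Uniform(0,1), so P f = E 1[X <= s] = s exactly.
X = rng.uniform(size=(reps, n))
Pn = (X[:, :, None] <= thresholds).mean(axis=1)     # empirical means P_n f, shape (reps, 19)
Z = np.abs(thresholds - Pn).max(axis=1)             # Z = sup_{f in F} |P f - P_n f|

print("E Z estimate:", Z.mean())
print("std of Z:", Z.std(), " (the theorem bounds Var(Z) by 1/n =", 1 / n, ")")
```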
