

Theoretical Statistics. Lecture 4.

Peter Bartlett
1. Concentration inequalities.

Outline of today's lecture


We have been looking at deviation inequalities, i.e., bounds on tail
probabilities like $P(X_n \ge t)$ for some statistic $X_n$.
1. Using moment generating function bounds, for sums of independent r.v.s:
Chernoff; Hoeffding; sub-Gaussian, sub-exponential random variables; Bernstein.
Today: Johnson-Lindenstrauss.
2. Martingale methods:
Hoeffding-Azuma, bounded differences.

Review. Chernoff technique

Theorem: For $t > 0$,
$$P(X - \mathbb{E}X \ge t) \le \inf_{\lambda > 0} e^{-\lambda t} M_{X - \mathbb{E}X}(\lambda).$$

Theorem: [Hoeffding's Inequality] For a random variable $X \in [a, b]$ with
$\mathbb{E}X = \mu$ and $\lambda \in \mathbb{R}$,
$$\ln M_{X - \mu}(\lambda) \le \frac{\lambda^2 (b - a)^2}{8}.$$
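As a quick numerical sanity check (my addition, not part of the slides), the following sketch compares the exact log-MGF of a centered Bernoulli($p$) variable on $\{0, 1\}$ with the Hoeffding bound $\lambda^2 (b - a)^2 / 8$; the choice $p = 0.3$ and the grid of $\lambda$ values are arbitrary.

```python
import numpy as np

# Centered Bernoulli(p) on {0, 1}: a = 0, b = 1, mu = p.
p = 0.3
lambdas = np.linspace(-5, 5, 101)

# Exact: ln M_{X - mu}(lambda) = ln((1-p) e^{-lambda p} + p e^{lambda (1-p)}).
log_mgf = np.log((1 - p) * np.exp(-lambdas * p) + p * np.exp(lambdas * (1 - p)))

# Hoeffding's bound: lambda^2 (b - a)^2 / 8, with (b - a) = 1 here.
bound = lambdas**2 / 8

assert np.all(log_mgf <= bound + 1e-12)
print("largest slack in the bound:", np.max(bound - log_mgf))
```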

Review. Sub-Gaussian, Sub-Exponential Random Variables

Definition: $X$ is sub-Gaussian with parameter $\sigma^2$ if, for all $\lambda \in \mathbb{R}$,
$$\ln M_{X - \mu}(\lambda) \le \frac{\lambda^2 \sigma^2}{2}.$$

Definition: $X$ is sub-exponential with parameters $(\sigma^2, b)$ if, for all $|\lambda| < 1/b$,
$$\ln M_{X - \mu}(\lambda) \le \frac{\lambda^2 \sigma^2}{2}.$$

Review. Sub-Exponential Random Variables

Theorem: For $X$ sub-exponential with parameters $(\sigma^2, b)$,
$$P(X \ge \mu + t) \le \begin{cases} \exp\left(-\dfrac{t^2}{2\sigma^2}\right) & \text{if } 0 \le t \le \sigma^2/b, \\[1ex] \exp\left(-\dfrac{t}{2b}\right) & \text{if } t > \sigma^2/b. \end{cases}$$

For independent $X_i$, sub-exponential with parameters $(\sigma_i^2, b_i)$, the sum
$X = X_1 + \cdots + X_n$ is sub-exponential with parameters
$\left( \sum_i \sigma_i^2, \max_i b_i \right)$.

Example: $X \sim \chi^2_1$ is sub-exponential with parameters $(4, 4)$.
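A quick Monte Carlo sketch (my addition): since $Z \sim \chi^2_n$ is a sum of $n$ independent sub-exponential $(4, 4)$ variables, it is sub-exponential $(4n, 4)$, giving the two-sided tail bound $P(|Z - n| \ge t) \le 2\exp(-t^2/(8n))$ for $0 \le t \le n$. The sample sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 200_000

# Z ~ chi^2_n, simulated as a sum of n squared standard normals.
Z = (rng.standard_normal((trials, n)) ** 2).sum(axis=1)

for t in [5, 10, 20, 40]:  # all satisfy 0 <= t <= n
    empirical = np.mean(np.abs(Z - n) >= t)
    bound = 2 * np.exp(-t**2 / (8 * n))
    print(f"t={t:3d}  empirical={empirical:.4f}  bound={bound:.4f}")
```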

Sub-Exponential Random Variables: Example

Theorem: [Johnson-Lindenstrauss] For $m$ points $x_1, \ldots, x_m$ from $\mathbb{R}^d$,
there is a projection $F : \mathbb{R}^d \to \mathbb{R}^n$ that preserves distances in the sense
that, for all $x_i, x_j$,
$$(1 - \epsilon)\|x_i - x_j\|_2^2 \le \|F(x_i) - F(x_j)\|_2^2 \le (1 + \epsilon)\|x_i - x_j\|_2^2,$$
provided that $n > (16/\epsilon^2) \log m$.
That is, we can embed these points in $\mathbb{R}^n$ and approximately maintain their
distance relationships, provided that $n$ is not too small. Notice that $n$ is
independent of the ambient dimension $d$, and depends only logarithmically
on the number of points $m$.

Johnson-Lindenstrauss
Applications: dimension reduction to simplify computation (nearest
neighbor, clustering, image processing, text processing).
Analysis of machine learning methods: separable by a large margin in high
dimensions implies it's really a low-dimensional problem after all.

Johnson-Lindenstrauss Embedding: Proof


We use a random projection:
$$F(x) = \frac{1}{\sqrt{n}} Y x,$$
where $Y \in \mathbb{R}^{n \times d}$ has independent $N(0, 1)$ entries.
Let $Y_i$ denote the $i$th row, for $1 \le i \le n$. It has a $N(0, I)$ distribution, so
$Y_i^T x / \|x\|_2 \sim N(0, 1)$. Thus,
$$Z = \frac{\|Y x\|_2^2}{\|x\|_2^2} = \sum_{i=1}^{n} \left( \frac{Y_i^T x}{\|x\|_2} \right)^2 \sim \chi^2_n.$$
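A minimal numpy sketch of this construction (my illustration; the point set and the dimensions $m$, $d$, $n$ are arbitrary choices): project random points with $F(x) = Yx/\sqrt{n}$ and inspect the worst-case distortion of pairwise squared distances.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
m, d, n = 50, 1_000, 500  # number of points, ambient dim, projected dim

X = rng.standard_normal((m, d))  # m points in R^d
Y = rng.standard_normal((n, d))  # projection matrix with N(0, 1) entries
FX = X @ Y.T / np.sqrt(n)        # F(x) = Yx / sqrt(n), applied to each point

orig = pdist(X, "sqeuclidean")   # squared distances over all pairs
proj = pdist(FX, "sqeuclidean")
ratios = proj / orig             # should lie in [1 - eps, 1 + eps]
print("distortion range:", ratios.min(), ratios.max())
```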

Johnson-Lindenstrauss Embedding: Proof


Since $Z \sim \chi^2_n$ is the sum of $n$ independent sub-exponential $(4, 4)$ random
variables, it is sub-exponential $(4n, 4)$. And we have that, for $0 \le t \le n$,
$$P(|Z - n| \ge t) \le 2 \exp\left(-\frac{t^2}{8n}\right).$$
Hence, for $0 < \epsilon < 1$ (taking $t = n\epsilon$),
$$P\left( \left| \frac{\|Y x\|_2^2}{n \|x\|_2^2} - 1 \right| \ge \epsilon \right) \le 2 \exp(-n\epsilon^2/8),$$
that is,
$$P\left( \frac{\|F(x)\|_2^2}{\|x\|_2^2} \notin [1 - \epsilon, 1 + \epsilon] \right) \le 2 \exp(-n\epsilon^2/8).$$

Johnson-Lindenstrauss Embedding: Proof




Applying this to the $\binom{m}{2}$ distinct pairs $x = x_i - x_j$, and using the union
bound, gives
$$P\left( \exists\, i \ne j \text{ s.t. } \frac{\|F(x_i - x_j)\|_2^2}{\|x_i - x_j\|_2^2} \notin [1 - \epsilon, 1 + \epsilon] \right) \le 2\binom{m}{2} \exp(-n\epsilon^2/8) \le m^2 \exp(-n\epsilon^2/8).$$
Thus, for $n > (16/\epsilon^2) \log m$, this probability is strictly less than 1, so there
exists a suitable mapping.
In fact, we can choose a random projection in this way and ensure that the
probability that it does not satisfy the approximate isometry property is no
more than $\delta$, for $n > (16/\epsilon^2) \log(m/\delta)$.
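To get a feel for the numbers (a worked example of my own, not from the slides): the dimension required for $m = 10{,}000$ points at distortion $\epsilon = 0.1$ and failure probability $\delta = 0.01$.

```python
import math

def jl_dim(m: int, eps: float, delta: float) -> int:
    """Smallest integer n with n > (16 / eps^2) * log(m / delta)."""
    return math.floor(16 / eps**2 * math.log(m / delta)) + 1

# Roughly 22,000 dimensions -- independent of the ambient dimension d.
print(jl_dim(m=10_000, eps=0.1, delta=0.01))
```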


Concentration Bounds for Martingale Difference Sequences

Next, we're going to consider concentration of martingale difference
sequences. The application is to understand how tails of
$f(X_1, \ldots, X_n) - \mathbb{E}f(X_1, \ldots, X_n)$ behave, for some function $f$.
[e.g., in the homework, we have that $f$ is some measure of the performance
of a kernel density estimator.] If we write
$$f(X_1, \ldots, X_n) - \mathbb{E}f(X_1, \ldots, X_n) = \sum_{i=1}^{n} \Big( \mathbb{E}[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_i] - \mathbb{E}[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_{i-1}] \Big),$$
then we have represented this deviation as a sum of martingale differences.



Martingales

Definition: A sequence $Y_n$ of random variables adapted to a filtration $\mathcal{F}_n$ is
a martingale if, for all $n$,
$$\mathbb{E}|Y_n| < \infty, \qquad \mathbb{E}[Y_{n+1} \mid \mathcal{F}_n] = Y_n.$$
"$\mathcal{F}_n$ is a filtration" means these $\sigma$-fields are nested: $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$.
"$Y_n$ is adapted to $\mathcal{F}_n$" means that each $Y_n$ is measurable with respect to $\mathcal{F}_n$.
e.g., $\mathcal{F}_n = \sigma(Y_1, \ldots, Y_n)$, the $\sigma$-field generated by the first $n$ variables.
Then we say $Y_n$ is a martingale sequence.
e.g., $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$. Then $Y_n$ is a martingale sequence w.r.t. $X_n$.

Martingale Difference Sequences


Definition: A sequence $D_n$ of random variables adapted to a filtration $\mathcal{F}_n$
is a martingale difference sequence if, for all $n$,
$$\mathbb{E}|D_n| < \infty, \qquad \mathbb{E}[D_{n+1} \mid \mathcal{F}_n] = 0.$$
e.g., for a martingale $Y_n$, take $D_n = Y_n - Y_{n-1}$:
$$\mathbb{E}[D_{n+1} \mid \mathcal{F}_n] = \mathbb{E}[Y_{n+1} \mid \mathcal{F}_n] - \mathbb{E}[Y_n \mid \mathcal{F}_n] = \mathbb{E}[Y_{n+1} \mid \mathcal{F}_n] - Y_n = 0$$
(because $Y_n$ is measurable w.r.t. $\mathcal{F}_n$, and because of the martingale property).
Hence, $Y_n - Y_0 = \sum_{i=1}^{n} D_i$.

Martingale Difference Sequences: the Doob construction

Define
$$X = (X_1, \ldots, X_n), \qquad X_1^i = (X_1, \ldots, X_i), \qquad Y_0 = \mathbb{E}f(X), \qquad Y_i = \mathbb{E}[f(X) \mid X_1^i].$$
Then
$$f(X) - \mathbb{E}f(X) = Y_n - Y_0 = \sum_{i=1}^{n} D_i,$$
where $D_i = Y_i - Y_{i-1}$. Also, $Y_i$ is a martingale w.r.t. $X_i$, and hence $D_i$ is a
martingale difference sequence. Indeed (because $\mathbb{E}X = \mathbb{E}\,\mathbb{E}[X \mid Y]$),
$$\mathbb{E}[Y_{i+1} \mid X_1^i] = \mathbb{E}\big[ \mathbb{E}[f(X) \mid X_1^{i+1}] \,\big|\, X_1^i \big] = \mathbb{E}[f(X) \mid X_1^i] = Y_i.$$
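A small simulation sketch (my addition) of the Doob construction for the simple choice $f(x) = \sum_i x_i$ with i.i.d. $X_i$, where $Y_i = \mathbb{E}[f(X) \mid X_1^i]$ has the closed form $\sum_{j \le i} X_j + (n - i)\mu$; it checks that each increment $D_i$ has mean zero and that the increments sum to $f(X) - \mathbb{E}f(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, mu = 20, 100_000, 0.5

X = rng.random((trials, n))  # i.i.d. Uniform(0, 1), so mu = 0.5

# Doob martingale for f(x) = sum(x): Y_i = sum_{j<=i} X_j + (n - i) * mu.
i = np.arange(1, n + 1)
Y = np.cumsum(X, axis=1) + (n - i) * mu
Y0 = n * mu

# Increments D_i = Y_i - Y_{i-1}; here each D_i works out to X_i - mu.
D = np.diff(np.column_stack([np.full(trials, Y0), Y]), axis=1)
print("max |E[D_i]|:", np.abs(D.mean(axis=0)).max())  # ~ 0
print("sum_i D_i == f(X) - Ef(X):",
      np.allclose(D.sum(axis=1), X.sum(axis=1) - Y0))
```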

Martingale Difference Sequences: another example


[An aside:] Consider two densities $f$ and $g$, with $g$ absolutely continuous
w.r.t. $f$. Suppose the $X_n$ are drawn i.i.d. from $f$, and $Y_n$ is the likelihood ratio,
$$Y_n = \prod_{i=1}^{n} \frac{g(X_i)}{f(X_i)}.$$
Then $Y_n$ is a martingale w.r.t. $X_n$. Indeed,
$$\mathbb{E}[Y_{n+1} \mid X_1^n] = \mathbb{E}\left[ \prod_{i=1}^{n+1} \frac{g(X_i)}{f(X_i)} \,\middle|\, X_1^n \right] = \mathbb{E}\left[ \frac{g(X_{n+1})}{f(X_{n+1})} \right] \prod_{i=1}^{n} \frac{g(X_i)}{f(X_i)} = Y_n,$$
because $\mathbb{E}[g(X_{n+1})/f(X_{n+1})] = 1$.
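A short simulation sketch (my addition), taking $f = N(0, 1)$ and $g = N(0.2, 1)$ as an arbitrary pair of densities: the mean of $Y_n$ stays at 1 for every $n$ (the martingale property), even though typical realizations drift toward 0.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, trials, shift = 50, 50_000, 0.2

X = rng.standard_normal((trials, n))  # i.i.d. draws from f = N(0, 1)

# Likelihood ratio Y_n = prod_{i<=n} g(X_i) / f(X_i), with g = N(shift, 1).
log_ratio = norm.logpdf(X, loc=shift) - norm.logpdf(X)
Y = np.exp(np.cumsum(log_ratio, axis=1))

for k in [1, 10, 50]:
    print(f"n={k:2d}  mean ~ {Y[:, k - 1].mean():.3f}"
          f"  median ~ {np.median(Y[:, k - 1]):.3f}")
# The mean stays near 1, while the median tends to 0: Y_n -> 0 a.s. under f.
```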



Concentration Bounds for Martingale Difference Sequences

Theorem: Consider a martingale difference sequence $D_n$ (adapted to a
filtration $\mathcal{F}_n$) that satisfies, for $|\lambda| \le 1/b_n$ a.s.,
$$\mathbb{E}\left[ \exp(\lambda D_n) \mid \mathcal{F}_{n-1} \right] \le \exp(\lambda^2 \sigma_n^2 / 2).$$
Then $\sum_{i=1}^{n} D_i$ is sub-exponential, with $(\sigma^2, b) = \left( \sum_{i=1}^{n} \sigma_i^2, \max_i b_i \right)$, and
$$P\left( \left| \sum_i D_i \right| \ge t \right) \le \begin{cases} 2\exp\left(-\dfrac{t^2}{2\sigma^2}\right) & \text{if } 0 \le t \le \sigma^2/b, \\[1ex] 2\exp\left(-\dfrac{t}{2b}\right) & \text{if } t > \sigma^2/b. \end{cases}$$


Concentration Bounds for Martingale Difference Sequences

Proof:
$$\mathbb{E}\exp\left( \lambda \sum_{i=1}^{n} D_i \right) = \mathbb{E}\left[ \exp\left( \lambda \sum_{i=1}^{n-1} D_i \right) \mathbb{E}\left[ \exp(\lambda D_n) \mid \mathcal{F}_{n-1} \right] \right] \le \exp(\lambda^2 \sigma_n^2 / 2)\, \mathbb{E}\exp\left( \lambda \sum_{i=1}^{n-1} D_i \right),$$
provided $|\lambda| < 1/b$. Iterating shows that $\sum_i D_i$ is sub-exponential.

Concentration Bounds for Martingale Difference Sequences

Theorem: Consider a martingale difference sequence $D_i$ with $|D_i| \le B_i$
a.s. Then
$$P\left( \left| \sum_i D_i \right| \ge t \right) \le 2\exp\left( -\frac{t^2}{2\sum_i B_i^2} \right).$$

Proof:
It suffices to show that
$$\mathbb{E}\left[ \exp(\lambda D_i) \mid \mathcal{F}_{i-1} \right] \le \exp(\lambda^2 B_i^2 / 2).$$
But $|D_i| \le B_i$ a.s., so conditioned on $\mathcal{F}_{i-1}$, $D_i$ is a mean-zero random
variable taking values in $[-B_i, B_i]$, and hence sub-Gaussian with parameter
$\sigma_i^2 = B_i^2$.
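A quick Monte Carlo sketch (my addition) of this bound for the simplest bounded martingale difference sequence, independent uniform random signs $D_i \in \{-1, +1\}$, so that $B_i = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 200_000

D = rng.choice([-1.0, 1.0], size=(trials, n))  # bounded MDS with B_i = 1
S = D.sum(axis=1)

for t in [10, 20, 30]:
    empirical = np.mean(np.abs(S) >= t)
    bound = 2 * np.exp(-t**2 / (2 * n))  # 2 exp(-t^2 / (2 sum_i B_i^2))
    print(f"t={t}  empirical={empirical:.4f}  bound={bound:.4f}")
```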

Bounded Differences Inequality

Theorem: Suppose $f : \mathcal{X}^n \to \mathbb{R}$ satisfies the following bounded differences
inequality: for all $x_1, \ldots, x_n, x_i' \in \mathcal{X}$,
$$|f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)| \le B_i.$$
Then
$$P\left( |f(X) - \mathbb{E}f(X)| \ge t \right) \le 2\exp\left( -\frac{t^2}{2\sum_i B_i^2} \right).$$


Bounded Differences Inequality


Proof: Use the Doob construction:
$$Y_i = \mathbb{E}[f(X) \mid X_1^i], \qquad D_i = Y_i - Y_{i-1}, \qquad f(X) - \mathbb{E}f(X) = \sum_{i=1}^{n} D_i.$$
Then, writing $X_i'$ for an independent copy of $X_i$,
$$|D_i| = \left| \mathbb{E}[f(X) \mid X_1^i] - \mathbb{E}[f(X) \mid X_1^{i-1}] \right| = \left| \mathbb{E}\left[ f(X) - f(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n) \,\middle|\, X_1^i \right] \right| \le B_i.$$
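A Monte Carlo sketch (my addition) of the bounded differences inequality for the empirical mean of Uniform(0, 1) variables: $f(x_1, \ldots, x_n) = \frac{1}{n}\sum_i x_i$ changes by at most $B_i = 1/n$ when one coordinate changes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 200_000

X = rng.random((trials, n))  # X_i ~ Uniform(0, 1)
f = X.mean(axis=1)           # bounded differences with B_i = 1/n
dev = np.abs(f - 0.5)        # Ef(X) = 1/2

for t in [0.05, 0.10, 0.15]:
    empirical = np.mean(dev >= t)
    bound = 2 * np.exp(-t**2 / (2 * n * (1 / n) ** 2))  # sum_i B_i^2 = 1/n
    print(f"t={t:.2f}  empirical={empirical:.4f}  bound={bound:.4f}")
```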

Examples: Rademacher Averages


For a set $A \subseteq \mathbb{R}^n$, consider
$$Z = \sup_{a \in A} \langle \epsilon, a \rangle,$$
where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ is a sequence of i.i.d. uniform $\{\pm 1\}$ random
variables. Define the Rademacher complexity of $A$ as $R(A) = \mathbb{E}Z$. [This
is a measure of the size of $A$.] The bounded differences approach implies
that $Z$ is concentrated around $R(A)$:

Theorem: $Z$ is sub-Gaussian with parameter $4 \sup_{a \in A} \sum_i a_i^2$.

Proof:
Write $Z = f(\epsilon_1, \ldots, \epsilon_n)$, and notice that a change of $\epsilon_i$ can lead to a
change in $Z$ of no more than $B_i = 2 \sup_{a \in A} |a_i|$. The result follows.
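A simulation sketch (my addition) with a concrete choice of $A$: a finite set of $k$ arbitrary vectors in $\mathbb{R}^n$. It compares the empirical tail of $Z - R(A)$ with the sub-Gaussian bound $2\exp(-t^2/(2\sigma^2))$, where $\sigma^2 = 4\sup_{a \in A}\sum_i a_i^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 50, 20, 100_000

A = rng.standard_normal((k, n))                  # a finite set of k vectors
eps = rng.choice([-1.0, 1.0], size=(trials, n))  # i.i.d. Rademacher signs

Z = (eps @ A.T).max(axis=1)  # Z = sup_{a in A} <eps, a>
R = Z.mean()                 # Monte Carlo estimate of R(A) = EZ

sigma2 = 4 * (A**2).sum(axis=1).max()  # 4 sup_{a in A} sum_i a_i^2
for t in [10, 20, 30]:
    empirical = np.mean(np.abs(Z - R) >= t)
    bound = 2 * np.exp(-t**2 / (2 * sigma2))
    print(f"t={t}  empirical={empirical:.4f}  bound={bound:.4f}")
```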

Examples: Empirical Processes


For a class $\mathcal{F}$ of functions $f : \mathcal{X} \to [0, 1]$, suppose that $X_1, \ldots, X_n, X$ are
i.i.d. on $\mathcal{X}$, and consider
$$Z = \sup_{f \in \mathcal{F}} \left| \mathbb{E}f(X) - \frac{1}{n}\sum_{i=1}^{n} f(X_i) \right| = \|P f - P_n f\|_{\mathcal{F}},$$
an empirical process. If $Z$ converges to 0, this is called a uniform law of large
numbers. Here, we show that $Z$ is concentrated about $\mathbb{E}Z$:

Theorem: $Z$ is sub-Gaussian with parameter $1/n$.

Proof:
Write $Z = g(X_1, \ldots, X_n)$, and notice that a change of $X_i$ can lead to a
change in $Z$ of no more than $B_i = 1/n$. The result follows.
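A closing sketch (my addition): for the class of half-line indicators $\mathcal{F} = \{x \mapsto 1[x \le \theta]\}$ on Uniform(0, 1) data, $Z$ is the Kolmogorov-Smirnov statistic, and we can check the concentration of $Z$ about $\mathbb{E}Z$ against the sub-Gaussian bound with parameter $1/n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 50_000

# For F = {x -> 1[x <= theta]}: Z = sup_theta |theta - F_n(theta)|, the KS statistic.
X = np.sort(rng.random((trials, n)), axis=1)
i = np.arange(1, n + 1)
Z = np.maximum(np.abs(i / n - X), np.abs((i - 1) / n - X)).max(axis=1)

EZ = Z.mean()
for t in [0.05, 0.10, 0.15]:
    empirical = np.mean(np.abs(Z - EZ) >= t)
    bound = 2 * np.exp(-(t**2) * n / 2)  # sub-Gaussian with parameter 1/n
    print(f"t={t:.2f}  empirical={empirical:.4f}  bound={bound:.4f}")
```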
