Locality-Sensitive Binary Codes From Shift-Invariant Kernels
$\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{F_i(x) \neq F_i(y)\}}$ between $F^n(x) = (F_1(x), \dots, F_n(x))$ and $F^n(y) = (F_1(y), \dots, F_n(y))$ behaves like
$$h_1(K(x - y)) \le \frac{1}{n}\, d_H\!\left(F^n(x), F^n(y)\right) \le h_2(K(x - y))$$
where $h_1, h_2 : [0, 1] \to \mathbb{R}^+$ are continuous decreasing functions, and $h_1(1) = h_2(1) = 0$ and $h_1(0) = h_2(0) = c > 0$. In other words, we would like to map $D$-dimensional real vectors into $n$-bit binary strings in a locality-sensitive manner, where the notion of locality is induced by the kernel $K$. We will achieve this goal by drawing $F^n$ appropriately at random.
Random Fourier features. Recently, Rahimi and Recht [8] gave a scheme that takes a Mercer kernel satisfying (K1) and (K2) and produces a random mapping $\Phi^n : \mathbb{R}^D \to \mathbb{R}^n$, such that, with high probability, the inner product of any two transformed points approximates the kernel: $\Phi^n(x) \cdot \Phi^n(y) \approx K(x - y)$ for all $x, y$. Their scheme exploits Bochner's theorem [9], a fundamental result in harmonic analysis which says that any such $K$ is the Fourier transform of a uniquely defined probability measure $P_K$ on $\mathbb{R}^D$. They define the random Fourier features (RFF) via $\Phi_{\omega,b}(x) \triangleq \sqrt{2}\cos(\omega \cdot x + b)$, where $\omega \sim P_K$ and $b \sim \mathrm{Unif}[0, 2\pi]$, and define a mapping $\Phi^n : \mathbb{R}^D \to \mathbb{R}^n$ via
$$\Phi^n(x) \triangleq \frac{1}{\sqrt{n}}\left(\Phi_{\omega_1,b_1}(x), \dots, \Phi_{\omega_n,b_n}(x)\right)$$
for $x \in \mathcal{X}$. Then $\mathbb{E}\left[\Phi^n(x) \cdot \Phi^n(y)\right] = K(x - y)$ for all $x, y$.
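As an illustration, the RFF construction for the Gaussian kernel, whose Fourier measure $P_K$ is itself a Gaussian, can be sketched in a few lines of Python; the variable names, data, and bandwidth choice below are ours, for illustration only:

```python
import numpy as np

# Sketch of random Fourier features for the Gaussian kernel
# K(x - y) = exp(-gamma * ||x - y||^2 / 2), whose Fourier measure P_K
# is N(0, gamma * I). Sizes and names are illustrative.

def rff_map(X, omegas, bs):
    """Map each row of X to (1/sqrt(n)) * (sqrt(2) * cos(omega_i . x + b_i))_i."""
    n = omegas.shape[0]
    return np.sqrt(2.0 / n) * np.cos(X @ omegas.T + bs)

rng = np.random.default_rng(0)
D, n, gamma = 5, 20000, 1.0
omegas = rng.normal(scale=np.sqrt(gamma), size=(n, D))  # omega ~ P_K
bs = rng.uniform(0.0, 2.0 * np.pi, size=n)              # b ~ Unif[0, 2*pi]

x = rng.normal(size=(1, D))
y = rng.normal(size=(1, D))
approx = (rff_map(x, omegas, bs) @ rff_map(y, omegas, bs).T).item()
exact = float(np.exp(-gamma * np.linalg.norm(x - y) ** 2 / 2.0))
# With n = 20000 features, the inner product is close to K(x - y).
```

With a seeded generator this is deterministic; the gap between `approx` and `exact` shrinks at the usual $O(1/\sqrt{n})$ Monte Carlo rate.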
From random Fourier features to random binary codes. We will compose the RFFs with random binary quantizers. Draw a random threshold $t \sim \mathrm{Unif}[-1, 1]$ and define the quantizer $Q_t : [-1, 1] \to \{-1, +1\}$ via $Q_t(u) \triangleq \mathrm{sgn}(u + t)$, where we let $\mathrm{sgn}(u) = -1$ if $u < 0$ and $\mathrm{sgn}(u) = +1$ if $u \ge 0$. We note the following basic fact (we omit the easy proof):

Lemma 2.1 For any $u, v \in [-1, 1]$, $\mathbb{P}_t\!\left[Q_t(u) \neq Q_t(v)\right] = |u - v|/2$.
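Lemma 2.1 is easy to confirm by simulation; the following sketch (names and the particular $u, v$ are illustrative) estimates the disagreement probability of the randomly thresholded quantizer:

```python
import numpy as np

# Monte Carlo check of Lemma 2.1: for a threshold t ~ Unif[-1, 1],
# Q_t(u) = sgn(u + t) separates u, v in [-1, 1] with probability |u - v| / 2.

def Q(u, t):
    return np.where(u + t >= 0, 1, -1)  # sgn with sgn(0) = +1, as in the text

rng = np.random.default_rng(1)
u, v = 0.3, -0.5
t = rng.uniform(-1.0, 1.0, size=200000)
rate = float(np.mean(Q(u, t) != Q(v, t)))
# rate should be close to |u - v| / 2 = 0.4
```

The quantizers disagree exactly when $t$ falls in the interval between $-u$ and $-v$, whose length is $|u - v|$ out of the total length 2, which is the content of the lemma.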
Now, given a kernel $K$, we define a random map $F_{t,\omega,b} : \mathbb{R}^D \to \{0, 1\}$ through
$$F_{t,\omega,b}(x) \triangleq \frac{1}{2}\left[1 + Q_t(\cos(\omega \cdot x + b))\right], \quad (2)$$
where $t \sim \mathrm{Unif}[-1, 1]$, $\omega \sim P_K$, and $b \sim \mathrm{Unif}[0, 2\pi]$ are independent of one another. From now on, we will often omit the subscripts $t, \omega, b$ and just write $F$ for the sake of brevity. We have:

Lemma 2.2
$$\mathbb{E}\,\mathbf{1}_{\{F(x) \neq F(y)\}} = h_K(x - y) \triangleq \frac{8}{\pi^2}\sum_{m=1}^{\infty}\frac{1 - K(m x - m y)}{4 m^2 - 1}, \quad \forall x, y \quad (3)$$
Proof: Using Lemma 2.1, we can show $\mathbb{E}\,\mathbf{1}_{\{F(x) \neq F(y)\}} = \frac{1}{2}\,\mathbb{E}_{\omega,b}\left|\cos(\omega \cdot x + b) - \cos(\omega \cdot y + b)\right|$. Using trigonometric identities and the independence of $\omega$ and $b$, we can express this expectation as
$$\mathbb{E}_{b,\omega}\left|\cos(\omega \cdot x + b) - \cos(\omega \cdot y + b)\right| = \frac{4}{\pi}\,\mathbb{E}_{\omega}\left|\sin\!\left(\frac{\omega \cdot (x - y)}{2}\right)\right|.$$
We now make use of the Fourier series representation of the full rectified sine wave $g(\theta) = |\sin\theta|$:
$$g(\theta) = \frac{2}{\pi} - \frac{4}{\pi}\sum_{m=1}^{\infty}\frac{\cos(2m\theta)}{4m^2 - 1} = \frac{4}{\pi}\sum_{m=1}^{\infty}\frac{1 - \cos(2m\theta)}{4m^2 - 1}.$$
Using this together with the fact that $\mathbb{E}_{\omega}\cos(m\,\omega \cdot (x - y)) = K(mx - my)$, a consequence of Bochner's theorem, completes the proof.

Lemma 2.3 Define the functions $h_1(u) \triangleq \frac{4}{\pi^2}(1 - u)$ and $h_2(u) \triangleq \min\left\{\frac{1}{2}\sqrt{1 - u},\ \frac{4}{\pi^2}\left(1 - 2u/3\right)\right\}$, where $u \in [0, 1]$. Note that $h_1(0) = h_2(0) = 4/\pi^2 \approx 0.405$ and that $h_1(1) = h_2(1) = 0$. Then
$$h_1(K(x - y)) \le h_K(x - y) \le h_2(K(x - y)) \quad \text{for all } x, y.$$
Proof: Let $\Delta \triangleq \cos(\omega \cdot x + b) - \cos(\omega \cdot y + b)$. Then $\mathbb{E}|\Delta| \le \sqrt{\mathbb{E}\,\Delta^2}$ (the last step uses concavity of the square root). Using the properties of the RFF, $\mathbb{E}\,\Delta^2 = \frac{1}{2}\,\mathbb{E}\!\left[\left(\Phi_{\omega,b}(x) - \Phi_{\omega,b}(y)\right)^2\right] = 1 - K(x - y)$. Therefore, $\mathbb{E}\,\mathbf{1}_{\{F(x) \neq F(y)\}} = \frac{1}{2}\,\mathbb{E}|\Delta| \le \frac{1}{2}\sqrt{1 - K(x - y)}$.
We also have
$$\mathbb{E}\,\mathbf{1}_{\{F(x) \neq F(y)\}} = \frac{4}{\pi^2} - \frac{8}{\pi^2}\sum_{m=1}^{\infty}\frac{K(mx - my)}{4m^2 - 1} \le \frac{4}{\pi^2} - \frac{8}{3\pi^2}\,K(x - y) = \frac{4}{\pi^2}\left(1 - 2K(x - y)/3\right).$$
This proves the upper bound in the lemma. On the other hand, since $K$ satisfies (K3),
$$h_K(x - y) \ge \frac{8}{\pi^2}\left(1 - K(x - y)\right)\sum_{m=1}^{\infty}\frac{1}{4m^2 - 1} = \frac{4}{\pi^2}\left(1 - K(x - y)\right),$$
because the $m$th term of the series in (3) is not smaller than $\left(1 - K(x - y)\right)/(4m^2 - 1)$.
Fig. 1 shows a comparison of the kernel approximation properties of the RFFs [8] with our scheme
for the Gaussian kernel.
Figure 1: (a) Approximating the Gaussian kernel by random features (green) and random signs (red). (b) Relationship of normalized Hamming distance between random signs to functions of the kernel. The scatter plots in (a) and (b) are obtained from a synthetic set of 500 uniformly distributed 2D points with n = 5000. (c) Bounds for normalized Hamming distance in Lemmas 2.2 and 2.3 vs. the Euclidean distance.
Now we concatenate several mappings of the form $F_{t,\omega,b}$ to construct an embedding of $\mathcal{X}$ into the binary cube $\{0, 1\}^n$. Specifically, we draw $n$ i.i.d. triples $(t_1, \omega_1, b_1), \dots, (t_n, \omega_n, b_n)$ and define
$$F^n(x) \triangleq \left(F_1(x), \dots, F_n(x)\right), \quad \text{where } F_i(x) \triangleq F_{t_i,\omega_i,b_i}(x),\ i = 1, \dots, n.$$
As we will show next, this construction ensures that, for any two points $x$ and $y$, the fraction of the bits where the binary strings $F^n(x)$ and $F^n(y)$ disagree sharply concentrates around $h_K(x - y)$, provided $n$ is large enough. Using the results proved above, we conclude that, for any two points $x$ and $y$ that are similar, i.e., $K(x - y) \approx 1$, most of the bits of $F^n(x)$ and $F^n(y)$ will agree, whereas for any two points $x$ and $y$ that are dissimilar, i.e., $K(x - y) \approx 0$, $F^n(x)$ and $F^n(y)$ will disagree in about 40% or more of their bits.
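A minimal sketch of the resulting encoder, again for the Gaussian kernel with bandwidth 1 (the dimension, code length, and test points are illustrative):

```python
import numpy as np

# Sketch of the full encoder F^n: each bit is
# F_i(x) = (1 + sgn(cos(omega_i . x + b_i) + t_i)) / 2,
# with (t_i, omega_i, b_i) drawn i.i.d. as in the text.

rng = np.random.default_rng(2)
D, n = 2, 4000
omegas = rng.normal(size=(n, D))            # omega ~ P_K = N(0, I), Gaussian kernel
bs = rng.uniform(0.0, 2.0 * np.pi, size=n)  # b ~ Unif[0, 2*pi]
ts = rng.uniform(-1.0, 1.0, size=n)         # t ~ Unif[-1, 1]

def encode(X):
    """n-bit binary codes, one row per point."""
    return (np.cos(X @ omegas.T + bs) + ts >= 0).astype(np.uint8)

def hamming(c1, c2):
    """Normalized Hamming distance between two codes."""
    return float(np.mean(c1 != c2))

x = np.array([0.0, 0.0])
near = np.array([0.1, 0.0])   # K(x - near) close to 1
far = np.array([5.0, 0.0])    # K(x - far) close to 0
d_near = hamming(encode(x[None]), encode(near[None]))
d_far = hamming(encode(x[None]), encode(far[None]))
# d_near is small; d_far concentrates near 4/pi^2 (about 0.4)
```

This reproduces the qualitative picture above: similar points collide on most bits, while dissimilar points disagree on roughly 40% of them.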
Analysis of performance. We first prove a Johnson-Lindenstrauss type result which says that, for any finite subset of $\mathbb{R}^D$, the normalized Hamming distance respects the similarities between points. It should be pointed out that the analogy with Johnson-Lindenstrauss is only qualitative: our embedding is highly nonlinear, in contrast to the random linear projections used there [4], and the resulting distortion of the neighborhood structure, although controllable, does not amount to a mere rescaling by constants.
Theorem 2.4 Fix $\epsilon, \delta \in (0, 1)$. For any finite data set $\mathcal{D} = \{x_1, \dots, x_N\} \subset \mathbb{R}^D$, $F^n$ is such that
$$h_K(x_j - x_k) - \epsilon \le \frac{1}{n}\, d_H\!\left(F^n(x_j), F^n(x_k)\right) \le h_K(x_j - x_k) + \epsilon \quad (4)$$
$$h_1(K(x_j - x_k)) - \epsilon \le \frac{1}{n}\, d_H\!\left(F^n(x_j), F^n(x_k)\right) \le h_2(K(x_j - x_k)) + \epsilon \quad (5)$$
for all $j, k$ with probability at least $1 - N^2 e^{-2n\epsilon^2}$. Moreover, the events (4) and (5) will hold with probability at least $1 - \delta$ if $n \ge (1/2\epsilon^2)\log(N^2/\delta)$. Thus, any $N$-point subset of $\mathbb{R}^D$ can be embedded, with high probability, into the binary cube of dimension $O(\log N)$ in a similarity-preserving way.
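This concentration is easy to see empirically. The following Monte Carlo sketch (the point set, code sizes, and Gaussian kernel with bandwidth 1 are illustrative choices of ours) compares the maximum deviation over all pairs for a short and a long code:

```python
import numpy as np

# Monte Carlo illustration of Theorem 2.4: for a fixed finite point set,
# the maximum deviation of the normalized Hamming distance from h_K over
# all pairs shrinks as the code length n grows (Gaussian kernel, bandwidth 1).

def h_K(dist, terms=2000):
    """Truncated series (3) for the Gaussian kernel, vectorized over distances."""
    m = np.arange(1, terms + 1)
    return (8.0 / np.pi ** 2) * np.sum(
        (1.0 - np.exp(-(m[None, :] * dist[:, None]) ** 2 / 2.0))
        / (4.0 * m ** 2 - 1.0), axis=1)

def max_deviation(X, n, rng):
    omegas = rng.normal(size=(n, X.shape[1]))
    bs = rng.uniform(0.0, 2.0 * np.pi, size=n)
    ts = rng.uniform(-1.0, 1.0, size=n)
    codes = (np.cos(X @ omegas.T + bs) + ts >= 0)
    j, k = np.triu_indices(len(X), k=1)
    ham = np.mean(codes[j] != codes[k], axis=1)
    dists = np.linalg.norm(X[j] - X[k], axis=1)
    return float(np.max(np.abs(ham - h_K(dists))))

rng = np.random.default_rng(3)
X = rng.uniform(-2.0, 2.0, size=(20, 2))  # N = 20 points in R^2
dev_small = max_deviation(X, 64, rng)
dev_large = max_deviation(X, 16384, rng)
# dev_large should be markedly smaller than dev_small
```

The deviation shrinks at the $O(1/\sqrt{n})$ rate implied by the Hoeffding bound behind the theorem.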
The proof (omitted) is by a standard argument using Hoeffding's inequality and the union bound, as well as the bounds of Lemma 2.3. We also prove a much stronger result: any compact subset $\mathcal{X} \subset \mathbb{R}^D$ can be embedded into a binary cube whose dimension depends only on the intrinsic dimension and the diameter of $\mathcal{X}$ and on the second moment of $P_K$, such that the normalized Hamming distance behaves in a similarity-preserving way for all pairs of points in $\mathcal{X}$ simultaneously. We make use of the following [5]:
Definition 2.5 The Assouad dimension of $\mathcal{X} \subset \mathbb{R}^D$, denoted by $d_{\mathcal{X}}$, is the smallest integer $k$, such that, for any ball $B \subset \mathbb{R}^D$, the set $B \cap \mathcal{X}$ can be covered by $2^k$ balls of half the radius of $B$.
The Assouad dimension is a widely used measure of the intrinsic dimension [2, 6, 3]. For example, if $\mathcal{X}$ is an $\ell_p$ ball in $\mathbb{R}^D$, then $d_{\mathcal{X}} = O(D)$; if $\mathcal{X}$ is a $d$-dimensional hyperplane in $\mathbb{R}^D$, then $d_{\mathcal{X}} = O(d)$ [2]. Moreover, if $\mathcal{X}$ is a $d$-dimensional Riemannian submanifold of $\mathbb{R}^D$ with a suitably bounded curvature, then $d_{\mathcal{X}} = O(d)$ [3]. We now have the following result:
Theorem 2.6 Suppose that the kernel $K$ is such that $L_K \triangleq \sqrt{\mathbb{E}_{P_K}\|\omega\|^2} < +\infty$. Then there exists a constant $C > 0$ independent of $D$ and $K$, such that the following holds. Fix any $\epsilon, \delta > 0$. If
$$n \ge \max\left\{\frac{C L_K\, d_{\mathcal{X}} \operatorname{diam}\mathcal{X}}{\epsilon^2},\ \frac{2}{\epsilon^2}\log\frac{2}{\delta}\right\},$$
then, with probability at least $1 - \delta$, the mapping $F^n$ is such that, for every pair $x, y \in \mathcal{X}$,
$$h_K(x - y) - \epsilon \le \frac{1}{n}\, d_H\!\left(F^n(x), F^n(y)\right) \le h_K(x - y) + \epsilon. \quad (6)$$
Proof: For every pair $x, y \in \mathcal{X}$, let $A_{x,y}$ be the set of all $(t, \omega, b)$, such that $F_{t,\omega,b}(x) \neq F_{t,\omega,b}(y)$, and let $\mathcal{A} \triangleq \{A_{x,y} : x, y \in \mathcal{X}\}$. Then, writing $\tau_i \triangleq (t_i, \omega_i, b_i)$, we have
$$\frac{1}{n}\, d_H\!\left(F^n(x), F^n(y)\right) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{\tau_i \in A_{x,y}\}}.$$
For any sequence $\tau^n = (\tau_1, \dots, \tau_n)$, define the uniform deviation
$$\Delta(\tau^n) \triangleq \sup_{x,y \in \mathcal{X}}\left|\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{\tau_i \in A_{x,y}\}} - \mathbb{E}\,\mathbf{1}_{\{F_{t,\omega,b}(x) \neq F_{t,\omega,b}(y)\}}\right|.$$
For every $1 \le i \le n$ and an arbitrary $\tau_i'$, let $\tau^{n(i)}$ denote $\tau^n$ with the $i$th component replaced by $\tau_i'$. Then $\left|\Delta(\tau^n) - \Delta(\tau^{n(i)})\right| \le 1/n$ for any $i$ and any $\tau_i'$. Hence, by McDiarmid's inequality,
$$\mathbb{P}\left[\left|\Delta(\tau^n) - \mathbb{E}_{\tau^n}\Delta(\tau^n)\right| > \epsilon\right] \le 2 e^{-2n\epsilon^2}, \quad \forall \epsilon > 0. \quad (7)$$
Now we need to bound $\mathbb{E}_{\tau^n}\Delta(\tau^n)$. Using a standard symmetrization technique [14], we can write
$$\mathbb{E}_{\tau^n}\Delta(\tau^n) \le 2 R(\mathcal{A}) \triangleq 2\,\mathbb{E}_{\tau^n,\sigma^n}\left[\sup_{x,y \in \mathcal{X}}\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\,\mathbf{1}_{\{\tau_i \in A_{x,y}\}}\right|\right], \quad (8)$$
where $\sigma^n = (\sigma_1, \dots, \sigma_n)$ is an i.i.d. Rademacher sequence, $\mathbb{P}(\sigma_i = -1) = \mathbb{P}(\sigma_i = +1) = 1/2$.
The quantity $R(\mathcal{A})$ can be bounded by the Dudley entropy integral [14]
$$R(\mathcal{A}) \le \frac{C_0}{\sqrt{n}}\int_0^{\infty}\sqrt{\log N\!\left(\epsilon, \mathcal{A}, \|\cdot\|_{L_2(\mu)}\right)}\, d\epsilon, \quad (9)$$
where $C_0 > 0$ is a universal constant, and $N(\epsilon, \mathcal{A}, \|\cdot\|_{L_2(\mu)})$ is the $\epsilon$-covering number of the function class $\left\{\mathbf{1}_{\{\cdot \in A\}} : A \in \mathcal{A}\right\}$ with respect to the $L_2(\mu)$ norm, where $\mu$ is the distribution of $(t, \omega, b)$. We will bound these covering numbers by the covering numbers of $\mathcal{X}$ with respect to the Euclidean norm on $\mathbb{R}^D$.
It can be shown that, for any four points $x, x', y, y' \in \mathcal{X}$,
$$\left\|\mathbf{1}_{\{\cdot \in A_{x,y}\}} - \mathbf{1}_{\{\cdot \in A_{x',y'}\}}\right\|^2_{L_2(\mu)} = \int\left(\mathbf{1}_{\{\cdot \in A_{x,y}\}} - \mathbf{1}_{\{\cdot \in A_{x',y'}\}}\right)^2 d\mu \le \mu(B_x \triangle B_{x'}) + \mu(B_y \triangle B_{y'}),$$
where $\triangle$ denotes the symmetric difference of sets, and $B_x \triangleq \{(t, \omega, b) : Q_t(\cos(\omega \cdot x + b)) = +1\}$ (details omitted for lack of space). Now,
$$2\mu(B_x \triangle B_{x'}) = 2\,\mathbb{E}_{\omega,b}\left[\mathbb{P}_t\!\left(Q_t(\cos(\omega \cdot x + b)) \neq Q_t(\cos(\omega \cdot x' + b))\right)\right]$$
$$= \mathbb{E}_{\omega,b}\left|\cos(\omega \cdot x + b) - \cos(\omega \cdot x' + b)\right| \le \mathbb{E}_{\omega}\left|\omega \cdot (x - x')\right| \le L_K\|x - x'\|.$$
Then $\mu(B_x \triangle B_{x'}) + \mu(B_y \triangle B_{y'}) \le \frac{L_K}{2}\left(\|x - x'\| + \|y - y'\|\right)$, which lets us control the covering numbers of $\mathcal{A}$ by the Euclidean covering numbers of $\mathcal{X}$: $N(\epsilon, \mathcal{A}, \|\cdot\|_{L_2(\mu)}) \le N\!\left(\epsilon^2/L_K, \mathcal{X}, \|\cdot\|\right)^2$, and, since $\mathcal{X}$ has Assouad dimension $d_{\mathcal{X}}$, the logarithm of the latter grows with exponent $2d_{\mathcal{X}}$. We can now estimate the integral in (9) by
$$R(\mathcal{A}) \le C_1\sqrt{\frac{L_K\, d_{\mathcal{X}}\operatorname{diam}\mathcal{X}}{n}}, \quad (10)$$
for some constant $C_1 > 0$. From (10) and (8), we obtain $\mathbb{E}_{\tau^n}\Delta(\tau^n) \le C_2\sqrt{\frac{L_K\, d_{\mathcal{X}}\operatorname{diam}\mathcal{X}}{n}}$, where $C_2 = 2C_1$. Using this and (7) with $\epsilon/2$, we obtain (6) with $C = 16 C_2^2$.
For example, with the Gaussian kernel $K(s) = e^{-\gamma\|s\|^2/2}$ on $\mathbb{R}^D$, we have $L_K = \sqrt{\gamma D}$. The kernel bandwidth is often chosen as $\gamma \propto 1/[D(\operatorname{diam}\mathcal{X})^2]$ (see, e.g., [12, Sec. 7.8]); with this setting, $L_K\operatorname{diam}\mathcal{X} = O(1)$, and the number of bits needed to guarantee the bound (6) is $n = O\!\left((d_{\mathcal{X}} + \log(1/\delta))/\epsilon^2\right)$. It is possible, in principle, to construct a dimension-reducing embedding of $\mathcal{X}$ into a binary cube, provided the number of bits in the embedding is larger than the intrinsic dimension of $\mathcal{X}$.
Figure 2: Synthetic results. First row: scatter plots of normalized Hamming distance vs. Euclidean distance
for our method (a) and spectral hashing (b) with code size 32 bits. Green indicates pairs of data points that
are considered true neighbors for the purpose of retrieval. Second row: scatter plots for our method (c) and
spectral hashing (d) with code size 512 bits. Third row: recall-precision plots for our method (e) and spectral
hashing (f) for code sizes from 8 to 512 bits (best viewed in color).
3 Empirical Evaluation

In this section, we present the results of our scheme with a Gaussian kernel, and compare our performance to spectral hashing [15].¹ Spectral hashing is a recently introduced, state-of-the-art approach that has been reported to obtain better results than several other well-known methods, including LSH [1] and restricted Boltzmann machines [11]. Unlike our method, spectral hashing chooses code parameters in a deterministic, data-dependent way, motivated by results on convergence of eigenvectors of graph Laplacians to Laplacian eigenfunctions on manifolds. Though spectral hashing is derived from completely different considerations than our method, its encoding scheme is similar to ours in terms of basic computation. Namely, each bit of a spectral hashing code is given by $\mathrm{sgn}(\cos(k\,\omega \cdot x))$, where $\omega$ is a principal direction of the data (instead of a randomly sampled direction, as in our method) and $k$ is a weight that is deterministically chosen according to the analytical form of certain kinds of Laplacian eigenfunctions. The structural similarity between spectral hashing and our method makes comparison between them appropriate.

¹We use the code made available by the authors of [15] at http://www.cs.huji.ac.il/yweiss/SpectralHashing/.

Figure 3: Recall-precision curves for the LabelMe database for our method (left) and for spectral hashing (right). Best viewed in color.
To demonstrate the basic behavior of our method, we first report results for two-dimensional synthetic data using a protocol similar to [15] (we have also conducted tests on higher-dimensional synthetic data, with very similar results). We sample 10,000 database and 1,000 query points from a uniform distribution defined on a 2D rectangle with aspect ratio 0.5. To distinguish true positives from false positives for evaluating retrieval performance, we select a nominal neighborhood radius so that each query point on average has 50 neighbors in the database. Next, we rescale the data so that this radius is 1, and set the bandwidth of the kernel to $\gamma = 1$. Fig. 2 (a,c) shows scatter plots of normalized Hamming distance vs. Euclidean distance for each query point paired with each database point for 32-bit and 512-bit codes. As more bits are added to our code, the variance of the scatter plots decreases, and the points cluster tighter around the theoretically expected curve (Eq. (3), Fig. 1). The scatter plots for spectral hashing are shown in Fig. 2 (b,d). As the number of bits in the spectral hashing code is increased, the normalized Hamming distance does not appear to converge to any clear function of the Euclidean distance. Because the derivation of spectral hashing in [15] includes several heuristic steps, the behavior of the resulting scheme appears to be difficult to analyze, and shows some undesirable effects as the code size increases. Figure 2 (e,f) compares recall-precision curves for both methods using a range of code sizes. Since the normalized Hamming distance for our method converges to a monotonic function of the Euclidean distance, its performance keeps improving as a function of code size. On the other hand, spectral hashing starts out with promising performance for very short codes (up to 32 bits), but then deteriorates for higher numbers of bits.
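The evaluation protocol above can be sketched roughly as follows; the data sizes, code length, nominal radius, and Hamming cutoff here are illustrative stand-ins chosen for a quick run, not the values used in our experiments:

```python
import numpy as np

# Rough sketch of the synthetic protocol: uniform 2D data on a rectangle
# with aspect ratio 0.5, a nominal Euclidean radius defining true neighbors,
# and precision of Hamming-ball retrieval with the random binary codes.

rng = np.random.default_rng(4)
db = rng.uniform(0.0, 1.0, size=(2000, 2)) * np.array([2.0, 1.0])
queries = rng.uniform(0.0, 1.0, size=(50, 2)) * np.array([2.0, 1.0])

n = 256                                       # code size in bits
omegas = rng.normal(size=(n, 2))              # Gaussian kernel, bandwidth 1
bs = rng.uniform(0.0, 2.0 * np.pi, size=n)
ts = rng.uniform(-1.0, 1.0, size=n)

def encode(X):
    return (np.cos(X @ omegas.T + bs) + ts >= 0)

db_codes, q_codes = encode(db), encode(queries)
radius, ham_thresh = 0.2, 0.1                 # nominal radius / Hamming cutoff

precisions = []
for q, qc in zip(queries, q_codes):
    true = np.linalg.norm(db - q, axis=1) <= radius       # true neighbors
    retrieved = np.mean(db_codes != qc, axis=1) <= ham_thresh
    if retrieved.any():
        precisions.append(true[retrieved].mean())
mean_precision = float(np.mean(precisions))
```

Sweeping the Hamming cutoff traces out a recall-precision curve of the kind plotted in Fig. 2 (e,f).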
Next, we present retrieval results for 14,871 images taken from the LabelMe database [10]. The
images are represented by 320-dimensional GIST descriptors [7], which have proven to be effective
at capturing perceptual similarity between scenes. For this experiment, we randomly select 1,000
images to serve as queries, and the rest make up the database. As with the synthetic experiments, a
nominal threshold of the average distance to the 50th nearest neighbor is used to determine whether
a database point returned for a given query is considered a true positive. Figure 3 shows precision-recall curves for code sizes ranging from 16 bits to 1024 bits. As in the synthetic experiments,
spectral hashing appears to have an advantage over our method for extremely small code sizes, up to
about 32 bits. However, this low bit regime may not be very useful in practice, since below 32 bits,
neither method achieves performance levels that would be satisfactory for real-world applications.
For larger code sizes, our method begins to dominate. For example, with a 128-bit code (which is equivalent to just two double-precision floating point numbers), our scheme achieves 0.8 precision
[Figure 4 column headings: Euclidean neighbors; 32-bit code (precision 0.81 and 0.38 for the two queries); 512-bit code (precision 1.00 and 0.96).]
Figure 4: Examples of retrieval for two query images on the LabelMe database. The left column shows top
48 neighbors for each query according to Euclidean distance (the query image is in the top left of the collage).
The middle (resp. right) column shows nearest neighbors according to normalized Hamming distance with a
32-bit (resp. 512-bit) code. The precision of retrieval is evaluated as the proportion of top Hamming neighbors
that are also Euclidean neighbors within the nominal radius. Incorrectly retrieved images in the middle and
right columns are shown with a red border. Best viewed in color.
at 0.2 recall, whereas spectral hashing only achieves about 0.5 precision at the same recall. Moreover, the performance of spectral hashing actually begins to decrease for code sizes above 256 bits.
Finally, Figure 4 shows retrieval results for our method on a couple of representative query images.
In addition to being completely distribution-free and exhibiting more desirable behavior as a function of code size, our scheme has one more practical advantage. Unlike spectral hashing, we retain the kernel bandwidth as a free parameter, which gives us flexibility in terms of adapting to target neighborhood size, or setting a target Hamming distance for neighbors at a given Euclidean distance. This can be especially useful for making sure that a significant fraction of neighbors for each query are mapped to strings whose Hamming distance from the query is no greater than 2. This is a necessary condition for being able to use binary codes for hashing as opposed to brute-force search (although, as demonstrated in [11, 13], even brute-force search with binary codes can already be quite fast). To ensure high recall within a low Hamming radius, we can progressively increase the kernel bandwidth as the code size increases, thus counteracting the increase in unnormalized Hamming distance that inevitably accompanies larger code sizes. Preliminary results (omitted for lack of space) show that this strategy can indeed increase recall for low Hamming radius while sacrificing some precision. In the future, we will evaluate this tradeoff more extensively, and test our method on datasets consisting of millions of data points. At present, our promising initial results, combined with our comprehensive theoretical analysis, convincingly demonstrate the potential usefulness of our scheme for large-scale indexing and search applications.
Acknowledgments
This work was supported by NSF CAREER Award No. IIS 0845629.
References
[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high
dimensions. Commun. ACM, 51(1):117–122, 2008.
[2] K. Clarkson. Nearest-neighbor searching and metric space dimensions. In Nearest-Neighbor Methods for
Learning and Vision: Theory and Practice, pages 15–59. MIT Press, 2006.
[3] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In STOC, 2008.
[4] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random
Struct. Alg., 22(1):60–65, 2003.
[5] J. Heinonen. Lectures on Analysis on Metric Spaces. Springer, New York, 2001.
[6] P. Indyk and A. Naor. Nearest-neighbor-preserving embeddings. ACM Trans. Algorithms, 3(3):Art. 31,
2007.
[7] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Computer Vision, 42(3):145–175, 2001.
[8] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[9] M. Reed and B. Simon. Methods of Modern Mathematical Physics II: Fourier Analysis, Self-Adjointness.
Academic Press, 1975.
[10] B. Russell, A. Torralba, K. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for
image annotation. Int. J. Computer Vision, 77:157–173, 2008.
[11] R. Salakhutdinov and G. Hinton. Semantic hashing. In SIGIR Workshop on Inf. Retrieval and App. of
Graphical Models, 2007.
[12] B. Schölkopf and A. J. Smola. Learning With Kernels. MIT Press, 2002.
[13] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large databases for recognition. In CVPR, 2008.
[14] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
[15] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.