Dense CRF
A conditional random field (I, X) is characterized by a Gibbs distribution P(X|I) = \frac{1}{Z(I)} \exp\big(-\sum_{c \in C_G} \phi_c(X_c|I)\big), where G = (V, E) is a graph on X and each clique c in a set of cliques C_G in G induces a potential \phi_c [15]. The Gibbs energy of a labeling x \in L^N is E(x|I) = \sum_{c \in C_G} \phi_c(x_c|I). The maximum a posteriori (MAP) labeling of the random field is x^* = \arg\max_{x \in L^N} P(x|I). For notational convenience we will omit the conditioning in the rest of the paper and use \psi_c(x_c) to denote \phi_c(x_c|I).
In the fully connected pairwise CRF model, G is the complete graph on X and C_G is the set of all unary and pairwise cliques. The corresponding Gibbs energy is

E(x) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j),   (1)

where i and j range from 1 to N. The unary potential \psi_u(x_i) is computed independently for each pixel by a classifier that produces a distribution over the label assignment x_i given image features. The unary potential used in our implementation incorporates shape, texture, location, and color descriptors and is described in Section 5. Since the output of the unary classifier for each pixel is produced independently from the outputs of the classifiers for other pixels, the MAP labeling produced by the unary classifiers alone is generally noisy and inconsistent, as shown in Figure 1(b).
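To make Equation 1 concrete, the following minimal NumPy sketch evaluates the Gibbs energy of a labeling for a small fully connected model. It is an illustration only, not the implementation used in this work; the array layout and the pairwise callback are hypothetical stand-ins for the potentials defined in the remainder of this section.

```python
import numpy as np

def gibbs_energy(x, unary, pairwise):
    """E(x) = sum_i psi_u(x_i) + sum_{i<j} psi_p(x_i, x_j)   (Eq. 1).

    x        : (N,) integer labels
    unary    : (N, L) array with unary[i, l] = psi_u(x_i = l)
    pairwise : callable psi_p(i, j, x_i, x_j) -> float
    """
    N = x.shape[0]
    energy = unary[np.arange(N), x].sum()      # sum of unary potentials
    for i in range(N):                         # all unordered pixel pairs i < j
        for j in range(i + 1, N):
            energy += pairwise(i, j, x[i], x[j])
    return energy
```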
The pairwise potentials in our model have the form
\psi_p(x_i, x_j) = \mu(x_i, x_j) \underbrace{\sum_{m=1}^{K} w^{(m)} k^{(m)}(f_i, f_j)}_{k(f_i, f_j)},   (2)
where each k^{(m)} is a Gaussian kernel k^{(m)}(f_i, f_j) = \exp\big(-\frac{1}{2}(f_i - f_j)^T \Lambda^{(m)} (f_i - f_j)\big), the vectors f_i and f_j are feature vectors for pixels i and j in an arbitrary feature space, w^{(m)} are linear combination weights, and \mu is a label compatibility function. Each kernel k^{(m)} is characterized by a symmetric, positive-definite precision matrix \Lambda^{(m)}, which defines its shape.
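Transcribed directly, the kernel reads as follows (an illustrative sketch; the argument Lam stands for the precision matrix \Lambda^{(m)}):

```python
import numpy as np

def gaussian_kernel(f_i, f_j, Lam):
    """k^(m)(f_i, f_j) = exp(-1/2 (f_i - f_j)^T Lam (f_i - f_j))."""
    Lam = np.asarray(Lam, dtype=float)
    d = np.asarray(f_i, dtype=float) - np.asarray(f_j, dtype=float)
    return float(np.exp(-0.5 * d @ Lam @ d))
```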
For multi-class image segmentation and labeling we use contrast-sensitive two-kernel potentials, defined in terms of the color vectors I_i and I_j and positions p_i and p_j:
k(f_i, f_j) = w^{(1)} \underbrace{\exp\Big(-\frac{|p_i - p_j|^2}{2\theta_\alpha^2} - \frac{|I_i - I_j|^2}{2\theta_\beta^2}\Big)}_{\text{appearance kernel}} + w^{(2)} \underbrace{\exp\Big(-\frac{|p_i - p_j|^2}{2\theta_\gamma^2}\Big)}_{\text{smoothness kernel}}.   (3)
The appearance kernel is inspired by the observation that nearby pixels with similar color are likely
to be in the same class. The degrees of nearness and similarity are controlled by parameters \theta_\alpha and \theta_\beta. The smoothness kernel removes small isolated regions [19]. The parameters are learned from
data, as described in Section 4.
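For a single pixel pair, Equation 3 can be evaluated as in the sketch below; the function and parameter names (w1, w2, theta_alpha, theta_beta, theta_gamma) are illustrative rather than part of the reference implementation.

```python
import numpy as np

def contrast_sensitive_kernel(p_i, p_j, I_i, I_j,
                              w1, w2, theta_alpha, theta_beta, theta_gamma):
    """Two-kernel potential k(f_i, f_j) of Eq. 3 for one pixel pair."""
    d_pos = float(np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2))  # |p_i - p_j|^2
    d_col = float(np.sum((np.asarray(I_i, float) - np.asarray(I_j, float)) ** 2))  # |I_i - I_j|^2
    appearance = np.exp(-d_pos / (2 * theta_alpha ** 2) - d_col / (2 * theta_beta ** 2))
    smoothness = np.exp(-d_pos / (2 * theta_gamma ** 2))
    return w1 * appearance + w2 * smoothness
```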
A simple label compatibility function \mu is given by the Potts model, \mu(x_i, x_j) = [x_i \neq x_j]. It introduces a penalty for nearby similar pixels that are assigned different labels. While this simple model works well in practice, it is insensitive to compatibility between labels. For example, it penalizes a pair of nearby pixels labeled "sky" and "bird" to the same extent as pixels labeled "sky" and "cat". We can instead learn a general symmetric compatibility function \mu(x_i, x_j) that takes
interactions between labels into account, as described in Section 4.
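In matrix form, the Potts model and a learned compatibility function differ only in the entries of a symmetric L x L matrix. The numerical values below are purely hypothetical and only illustrate the structure:

```python
import numpy as np

L = 3  # labels, e.g. 0 = sky, 1 = bird, 2 = cat (ordering is illustrative)

mu_potts = 1.0 - np.eye(L)       # mu(a, b) = [a != b]: every pair of distinct labels penalized equally

mu_learned = np.array([          # a general symmetric compatibility function;
    [0.0, 0.5, 1.5],             # hypothetical values penalizing sky/cat transitions
    [0.5, 0.0, 1.0],             # more strongly than sky/bird
    [1.5, 1.0, 0.0],
])
assert np.allclose(mu_learned, mu_learned.T)   # symmetry requirement
```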
3 Efficient Inference in Fully Connected CRFs
Our algorithm is based on a mean field approximation to the CRF distribution. This approximation yields an iterative message passing algorithm for approximate inference. Our key observation is that message passing in the presented model can be performed using Gaussian filtering in feature space. This enables us to utilize highly efficient approximations for high-dimensional filtering,
which reduce the complexity of message passing from quadratic to linear, resulting in an approxi-
mate inference algorithm for fully connected CRFs that is linear in the number of variables N and
sublinear in the number of edges in the model.
3.1 Mean Field Approximation
Instead of computing the exact distribution P(X), the mean field approximation computes a distribution Q(X) that minimizes the KL-divergence D(Q \| P) among all distributions Q that can be expressed as a product of independent marginals, Q(X) = \prod_i Q_i(X_i) [10].
Minimizing the KL-divergence, while constraining Q(X) and Q_i(X_i) to be valid distributions, yields the following iterative update equation:

Q_i(x_i = l) = \frac{1}{Z_i} \exp\Big\{-\psi_u(x_i) - \sum_{l' \in L} \mu(l, l') \sum_{m=1}^{K} w^{(m)} \sum_{j \neq i} k^{(m)}(f_i, f_j) Q_j(l')\Big\}.   (4)
A detailed derivation of Equation 4 is given in the supplementary material. This update equation
leads to the following inference algorithm:
Algorithm 1 Mean field in fully connected CRFs
    Initialize Q:   Q_i(x_i) ← (1/Z_i) \exp\{-\psi_u(x_i)\}
    while not converged do                                              ▷ See Section 6 for convergence analysis
        \tilde{Q}_i^{(m)}(l) ← \sum_{j \neq i} k^{(m)}(f_i, f_j) Q_j(l) for all m       ▷ Message passing from all X_j to all X_i
        \hat{Q}_i(x_i) ← \sum_{l \in L} \mu^{(m)}(x_i, l) \sum_m w^{(m)} \tilde{Q}_i^{(m)}(l)       ▷ Compatibility transform
        Q_i(x_i) ← \exp\{-\psi_u(x_i) - \hat{Q}_i(x_i)\}                ▷ Local update
        normalize Q_i(x_i)
    end while
Each iteration of Algorithm 1 performs a message passing step, a compatibility transform, and a
local update. Both the compatibility transform and the local update run in linear time and are highly
efficient. The computational bottleneck is message passing. For each variable, this step requires evaluating a sum over all other variables. A naive implementation thus has quadratic complexity in the number of variables N. Next, we show how approximate high-dimensional filtering can be used
to reduce the computational cost of message passing to linear.
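As a reference point, the sketch below implements Algorithm 1 naively with dense, precomputed kernel matrices. It is the quadratic-cost baseline that the filtering scheme of Section 3.2 replaces, not the implementation used in this work:

```python
import numpy as np

def mean_field_naive(unary, kernels, weights, mu, n_iters=10):
    """Naive O(N^2) mean field inference (Algorithm 1) for a small problem.

    unary   : (N, L) unary potentials psi_u(x_i = l)
    kernels : list of (N, N) precomputed kernel matrices k^(m)(f_i, f_j)
    weights : list of scalar weights w^(m)
    mu      : (L, L) label compatibility matrix
    """
    Q = np.exp(-unary)
    Q /= Q.sum(axis=1, keepdims=True)          # initialize Q_i proportional to exp{-psi_u}
    for _ in range(n_iters):
        Qtilde = np.zeros_like(Q)
        for K, w in zip(kernels, weights):
            K = K - np.diag(np.diag(K))        # exclude j = i from the sum
            Qtilde += w * (K @ Q)              # message passing, summed over kernels
        Qhat = Qtilde @ mu.T                   # compatibility transform
        Q = np.exp(-unary - Qhat)              # local update
        Q /= Q.sum(axis=1, keepdims=True)      # normalize
    return Q
```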
3.2 Efficient Message Passing Using High-Dimensional Filtering
From a signal processing standpoint, the message passing step can be expressed as a convolution with a Gaussian kernel G_{\Lambda^{(m)}} in feature space:

\tilde{Q}_i^{(m)}(l) = \underbrace{\sum_{j \in V} k^{(m)}(f_i, f_j) Q_j(l) - Q_i(l)}_{\text{message passing}} = \underbrace{\big[G_{\Lambda^{(m)}} \otimes Q(l)\big](f_i)}_{\bar{Q}_i^{(m)}(l)} - Q_i(l).   (5)
We subtract Q_i(l) from the convolved function \bar{Q}_i^{(m)}(l) because the convolution sums over all variables, while message passing does not sum over Q_i.
This convolution performs a low-pass filter, essentially band-limiting \tilde{Q}_i^{(m)}(l). By the sampling theorem, this function can be reconstructed from a set of samples whose spacing is proportional to the standard deviation of the filter [20]. We can thus perform the convolution by downsampling Q(l), convolving the samples with G_{\Lambda^{(m)}}, and upsampling the result at the feature points:

    Q_\downarrow(l) ← downsample(Q(l))                                                         ▷ Downsample
    \tilde{Q}_{\downarrow i}^{(m)}(l) ← \sum_{j \in V_\downarrow} k^{(m)}(f_{\downarrow i}, f_{\downarrow j}) Q_{\downarrow j}(l) for all i \in V_\downarrow       ▷ Convolution on samples f_\downarrow
    \tilde{Q}^{(m)}(l) ← upsample(\tilde{Q}_\downarrow^{(m)}(l))                               ▷ Upsample
A common approximation to the Gaussian kernel is a truncated Gaussian, where all values beyond
two standard deviations are set to zero. Since the spacing of the samples is proportional to the stan-
dard deviation, the support of the truncated kernel contains only a constant number of sample points.
Thus the convolution can be approximately computed at each sample by aggregating values from
only a constant number of neighboring samples. This implies that approximate message passing can
be performed in O(N) time [16].
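The downsample-convolve-upsample idea can be illustrated for the purely spatial (smoothness-kernel) case with off-the-shelf image filtering. The sketch below uses scipy as a stand-in for the permutohedral lattice and only handles two-dimensional position features, so it illustrates the principle rather than the procedure used for the full five-dimensional appearance features:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def approx_spatial_messages(Q, theta):
    """Approximate sum_j exp(-|p_i - p_j|^2 / (2 theta^2)) Q_j(l) - Q_i(l) on a pixel grid.

    Q : (H, W, L) array of current marginals Q_j(l).
    """
    s = max(int(round(theta)), 1)                     # sample spacing of about one standard deviation
    Q_down = Q[::s, ::s, :]                           # downsample
    Q_blur = gaussian_filter(Q_down, sigma=(1.0, 1.0, 0.0))  # unit-variance blur on the coarse grid
    # gaussian_filter uses a normalized kernel; rescale by the approximate mass of the
    # unnormalized 2D Gaussian (2 pi theta^2) to mimic the unnormalized sum.
    Q_blur = Q_blur * (2.0 * np.pi * theta ** 2)
    factors = (Q.shape[0] / Q_blur.shape[0],          # upsample back to full resolution
               Q.shape[1] / Q_blur.shape[1], 1.0)
    Q_up = zoom(Q_blur, factors, order=1)
    return Q_up - Q                                   # subtract Q_i(l), cf. Eq. 5
```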
High-dimensional filtering algorithms that follow this approach can still have computational complexity exponential in d. However, a clever filtering scheme can reduce the complexity of the convolution operation to O(Nd). We use the permutohedral lattice, a highly efficient convolution data structure that tiles the feature space with simplices arranged along d+1 axes [1]. The permutohedral lattice exploits the separability of unit variance Gaussian kernels. Thus we need to apply a whitening transform \tilde{f} = Uf to the feature space in order to use it. The whitening transformation is found using the Cholesky decomposition of \Lambda^{(m)} into UU^T. In the transformed space, the high-dimensional convolution can be separated into a sequence of one-dimensional convolutions along the axes of the lattice. The resulting approximate message passing procedure is highly efficient even with a fully sequential implementation that does not make use of parallelism or the streaming capabilities of graphics hardware, which can provide further acceleration if desired.
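The whitening step can be sketched as follows. This is an illustration only: it uses NumPy's Cholesky convention \Lambda = L L^T with L lower triangular, so the transform is applied as f^T L rather than in the U f form above, but the effect is the same, namely that squared distances between transformed features equal the Mahalanobis term of the kernel.

```python
import numpy as np

def whiten(features, Lam):
    """Transform features so that k^(m) becomes a unit-variance Gaussian.

    features : (N, d) feature vectors, one row per pixel
    Lam      : (d, d) symmetric positive-definite precision matrix Lambda^(m)
    """
    L = np.linalg.cholesky(np.asarray(Lam, dtype=float))   # Lam = L @ L.T
    return np.asarray(features, dtype=float) @ L           # row i becomes (L.T @ f_i)^T

# After whitening, (f_i - f_j)^T Lam (f_i - f_j) == ||f~_i - f~_j||^2, so the
# high-dimensional convolution separates into one-dimensional unit-variance blurs.
```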
4 Learning
We learn the parameters of the model by piecewise training. First, the boosted unary classifiers are trained using the JointBoost algorithm [21], using the features described in Section 5. Next we learn the appearance kernel parameters w^{(1)}, \theta_\alpha, and \theta_\beta. The smoothness kernel parameters w^{(2)} and \theta_\gamma are learned analogously. Finally, the label compatibility function \mu(a, b) is learned by gradient-based optimization; for a single training image, the gradient of the training objective \ell with respect to \mu(a, b) is

\frac{\partial}{\partial \mu(a, b)} \ell(\mu : I^{(n)}, T^{(n)}) = -\sum_i T_i^{(n)}(a) \sum_{j \neq i} k(f_i, f_j) T_j^{(n)}(b) + \sum_i Q_i(a) \sum_{j \neq i} k(f_i, f_j) Q_i(b),   (6)
where (I^{(n)}, T^{(n)}) is a single training image with its ground truth labeling and T^{(n)}(a) is a binary image in which the ith pixel T_i^{(n)}(a) has value 1 if the ground truth label at the ith pixel of T^{(n)} is a and 0 otherwise. A detailed derivation of Equation 6 is given in the supplementary material.
The sums \sum_{j \neq i} k(f_i, f_j) T_j(b) and \sum_{j \neq i} k(f_i, f_j) Q_i(b) are both computationally expensive to evaluate directly. As in Section 3.2, we use high-dimensional filtering to compute both sums efficiently. The runtime of the final learning algorithm is linear in the number of variables N.
5 Implementation
The unary potentials used in our implementation are derived from TextonBoost [19, 13]. We use the 17-dimensional filter bank suggested by Shotton et al. [19], and follow Ladický et al. [13] by adding color, histogram of oriented gradients (HOG), and pixel location features. Our evaluation on the MSRC-21 dataset uses this extended version of TextonBoost for the unary potentials. For the VOC 2010 dataset we include the response of bounding box object detectors [4] for each object class as 20 additional features. This increases the performance of the unary classifiers on the VOC 2010 from 13% to 22%. We gain an additional 5% by training a logistic regression classifier on the responses of the boosted classifier.
For efficient high-dimensional filtering, we use a publicly available implementation of the permutohedral lattice [1]. We found a downsampling rate of one standard deviation to work best for all our experiments. Sampling-based filtering algorithms underestimate the edge strength k(f_i, f_j) for very similar feature points. Proper normalization can cancel out most of this error. The permutohedral lattice allows for two types of normalization. A global normalization by the average kernel strength \bar{k} = \frac{1}{N} \sum_{i,j} k(f_i, f_j) can correct for constant error. A pixelwise normalization by k_i = \sum_j k(f_i, f_j) handles regional errors as well, but slightly violates the CRF symmetry assumption \psi_p(x_i, x_j) = \psi_p(x_j, x_i). We found the pixelwise normalization to work better in practice.
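The two normalization options can be written down with a toy dense-kernel stand-in, shown below; in the actual implementation the sums over j are of course evaluated with the same approximate filtering used for inference, since forming the full kernel matrix is exactly what the method avoids.

```python
import numpy as np

def normalized_messages(K, Q, mode="pixelwise"):
    """Kernel filtering with global or pixelwise normalization.

    K : (N, N) toy kernel matrix k(f_i, f_j)
    Q : (N, L) current marginals
    """
    msg = K @ Q                               # unnormalized filtered values
    if mode == "global":
        k_bar = K.sum() / K.shape[0]          # average kernel strength (1/N) sum_{i,j} k(f_i, f_j)
        return msg / k_bar
    k_i = K.sum(axis=1, keepdims=True)        # pixelwise normalizer k_i = sum_j k(f_i, f_j)
    return msg / k_i
```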
6 Evaluation
We evaluate the presented algorithm on two standard benchmarks for multi-class image segmentation and labeling. The first is the MSRC-21 dataset, which consists of 591 color images of size 320 × 213 with corresponding ground truth labelings of 21 object classes [19]. The second is the PASCAL VOC 2010 dataset, which contains 1928 color images of size approximately 500 × 400, with a total of 20 object classes and one background class [3]. The presented approach was evaluated alongside the adjacency (grid) CRF of Shotton et al. [19] and the Robust P^n CRF of Kohli et al. [9], using publicly available reference implementations. To ensure a fair comparison, all models used the unary potentials described in Section 5. All experiments were conducted on an Intel i7-930 processor clocked at 2.80 GHz. Eight CPU cores were used for training; all other experiments were performed on a single core. The inference algorithm was implemented in a single CPU thread.
Convergence. We first evaluate the convergence of the mean field approximation by analyzing the KL-divergence between Q and P over successive iterations of the inference algorithm, shown in Figure 2. The KL-divergence was estimated up to a constant, as described in the supplementary material. Results are shown for different standard deviations \theta_\alpha and \theta_\beta of the kernels. The graphs were aligned at 20 iterations for visual comparison. The number of iterations was set to 10 in all subsequent experiments.
MSRC-21 dataset. We use the standard split of the dataset into 45% training, 10% validation, and 45% test images [19]. The unary potentials were learned on the training set, while the parameters of all CRF models were learned using holdout validation. The total CRF training time was 40 minutes. The learned label compatibility function performed on par with the Potts model on this dataset. Figure 3 provides qualitative and quantitative results on the dataset. We report the standard measures of multi-class segmentation accuracy: "global" denotes the overall percentage of correctly classified image pixels and "average" is the unweighted average of per-category classification accuracy [19, 9]. The presented inference algorithm on the fully connected CRF significantly outperforms the other models, evaluated against the standard ground truth data provided with the dataset.

The ground truth labelings provided with the MSRC-21 dataset are quite imprecise. In particular, regions around object boundaries are often left unlabeled. This makes it difficult to quantitatively evaluate the performance of algorithms that strive for pixel-level accuracy. Following Kohli et al. [9], we manually produced accurate segmentations and labelings for a set of images from the MSRC-21 dataset. Each image was fully annotated at the pixel level, with careful labeling around complex boundaries. This labeling was performed by hand for 94 representative images from the MSRC-21 dataset. Labeling a single image took 30 minutes on average. A number of images from this accurate ground truth set are shown in Figure 3. Figure 3 reports segmentation accuracy against this ground truth data alongside the evaluation against the standard ground truth. The results were obtained using 5-fold cross validation, where 4/5 of the 94 images were used to train the CRF parameters.
[Figure 2 appears here: panel (a) plots KL-divergence against the number of iterations (0 to 20) for kernel standard deviations of 10, 30, 50, 70, and 90; panel (b) shows the distributions Q(X_i = sky) and Q(X_i = bird) for an example image after 0, 1, 2, and 10 iterations.]
Figure 2: Convergence analysis. (a) KL-divergence of the mean field approximation during successive iterations of the inference algorithm, averaged across 94 images from the MSRC-21 dataset. (b) Visualization of convergence on distributions for two class labels over an image from the dataset.
[Figure 3 qualitative comparison: columns show Image, Grid CRF, Robust P^n CRF, Our approach, and Accurate ground truth; the example scenes contain labels such as bird, water, road, car, sky, tree, building, grass, and cow.]

                         Runtime   Standard ground truth     Accurate ground truth
                                   Global     Average        Global        Average
Unary classifiers          --      84.0       76.6           83.2 ± 1.5    80.6 ± 2.3
Grid CRF                   1s      84.6       77.2           84.8 ± 1.5    82.4 ± 1.8
Robust P^n CRF             30s     84.9       77.5           86.5 ± 1.0    83.1 ± 1.5
Fully connected CRF        0.2s    86.0       78.3           88.2 ± 0.7    84.7 ± 0.7

Figure 3: Qualitative and quantitative results on the MSRC-21 dataset.
The unary potentials were learned on a separate training set that did not include the 94
accurately annotated images.
We also adopt the methodology proposed by Kohli et al. [9] for evaluating segmentation accuracy
around boundaries. Specifically, we count the relative number of misclassified pixels within a nar-
row band (trimap) surrounding actual object boundaries, obtained from the accurate ground truth
images. As shown in Figure 4, our algorithm outperforms previous work across all trimap widths.
PASCAL VOC 2010. Due to the lack of a publicly available ground truth labeling for the test
set in the PASCAL VOC 2010, we use the training and validation data for all our experiments. We
randomly partitioned the images into 3 groups: 40% training, 15% validation, and 45% test set. Seg-
mentation accuracy was measured using the standard VOC measure [3]. The unary potentials were
learned on the training set and yielded an average classification accuracy of 27.6%. The parameters
for the Potts potentials in the fully connected CRF model were learned on the validation set.
[Figure 4 appears here: panel (a) shows an image, its ground truth, and trimaps of width 4px and 8px; panel (b) plots pixelwise classification error [%] against trimap width [pixels] for the unary classifiers, the grid CRF, the Robust P^n CRF, and the fully connected CRF.]
Figure 4: Segmentation accuracy around object boundaries. (a) Visualization of the trimap measure. (b) Percent of misclassified pixels within trimaps of different widths.
[Figure 5 appears here: images, ground truth, and our approach for examples containing cat, boat, and sheep against background.]
Figure 5: Qualitative results on the PASCAL VOC 2010 dataset. Average segmentation accuracy was 30.2%.
The fully connected model with Potts potentials yielded an average classification accuracy of 29.1%.
The label compatibility function, learned on the validation set, further increased the classification accuracy to 30.2%. For comparison, the grid CRF achieved 28.3%. Training time was 2.5 hours and inference time was 0.5 seconds. Qualitative results are provided in Figure 5.
Long-range connections. We have examined the value of long-range connections in our model by varying the spatial and color ranges \theta_\alpha and \theta_\beta of the appearance kernel; the results are shown in Figure 6. The highest accuracy was obtained at \theta_\beta = 11. At this setting, more than 50% of the pairwise potential energy in the model was assigned to edges of length 35 pixels or higher. However, long-range connections can also propagate misleading information, as shown in Figure 7.
[Figure 6 appears here: panel (a) quantitative, with global accuracy between roughly 82% and 88% as the kernel ranges vary; panel (b) qualitative examples.]
Figure 6: Influence of long-range connections on classification accuracy. (a) Global classification accuracy on the 94 MSRC images with accurate ground truth, as a function of kernel parameters \theta_\alpha and \theta_\beta