
Visualization of Convolutional Neural Networks

for Monocular Depth Estimation

Junjie Hu1,2 Yan Zhang2 Takayuki Okatani1,2


1 Graduate School of Information Sciences, Tohoku University, Japan
2 Center for Advanced Intelligence Project, RIKEN, Japan
{junjie.hu, zhang, okatani}@vision.is.tohoku.ac.jp

Abstract

Recently, convolutional neural networks (CNNs) have shown great success on the task of monocular depth estimation. A fundamental yet unanswered question is how CNNs can infer depth from a single image. Toward answering this question, we consider visualization of the inference of a CNN by identifying the pixels of an input image that are relevant to depth estimation. We formulate it as an optimization problem of identifying the smallest number of image pixels from which the CNN can estimate a depth map with the minimum difference from the estimate obtained from the entire image. To cope with a difficulty of optimization through a deep CNN, we propose to use another network to predict those relevant image pixels in a forward computation. In our experiments, we first show the effectiveness of this approach and then apply it to different depth estimation networks on indoor and outdoor scene datasets. The results provide several findings that help exploration of the above question.

Figure 1. An example of the proposed visualization of single-view depth estimation. Upper: an input image. Lower: a mask generated by our method showing relevant pixels for depth estimation.

1. Introduction

Enabling computers to perceive depth from monocular images has attracted a lot of attention over the past decades. It was shown recently [6] that employment of deep convolutional neural networks (CNNs) achieves promising performance. Since then, a number of studies [25, 5, 2, 3, 16, 40, 26, 8, 19] have been published on this approach, leading to significant improvement of estimation accuracy.

On the other hand, it is largely unknown why and how CNNs can estimate the depth of a scene from its monocular image; they are basically black boxes, as in other tasks. This will be an obstacle for this method to be employed in real-world applications, such as vision for self-driving cars and service robots, although it could be a cheap alternative to existing 3D sensors. In these applications, interpretability is essential for safety reasons.

Long-term studies in psychophysics have revealed that human vision uses several cues for monocular depth estimation, such as linear perspective, relative size, interposition, texture gradient, light and shade, aerial perspective, etc. [24, 13, 23, 32, 30, 18]. A natural question arises: do CNNs utilize these cues? Exploring this question will help our understanding of why CNNs can (or cannot) estimate depth from a given scene image. To the best of our knowledge, the present study is the first attempt to analyze how CNNs work on the task of monocular depth estimation.

It is, however, hard to find direct answers to the above questions; after all, it is still difficult even for human vision. Thus, as a first step toward this end, we consider visualization of CNNs on this task. To be specific, as in previous studies of visualization of CNNs for object recognition, we attempt to identify the image pixels that are relevant to depth estimation; see Fig. 1 for an example. To do this, we hypothesize that CNNs can infer depths fairly accurately from only a selected set of image pixels. An underlying idea is the observation with human vision that most of the cues are considered to be associated with small regions in the visual field.

Figure 2. Diagram of the proposed approach. The target of visualization is the trained depth estimation network N. To identify the pixels of the input image I that N uses to estimate its depth map Y, we input I to the network G, which predicts the set of relevant pixels, or the mask M. The output M is element-wise multiplied with I and inputted to N, yielding an estimate Ŷ of the depth map. G is trained so that Ŷ will be as close as possible to the original estimate Y from the entire image I and M will be maximally sparse. Note that N is fixed in this process.
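The pipeline in Fig. 2 can be summarized by the following PyTorch-style sketch; it is an illustration only, not the implementation used in our experiments, and the class and argument names are ours.

import torch
import torch.nn as nn

# G predicts a mask M for the input I, the masked image I ⊗ M is fed to the
# fixed depth network N, and the resulting Ŷ is compared with Y = N(I).
class MaskedDepthPipeline(nn.Module):
    def __init__(self, depth_net: nn.Module, mask_net: nn.Module):
        super().__init__()
        self.N = depth_net                 # trained depth estimator, kept fixed
        self.G = mask_net                  # mask predictor, the only trainable part
        for p in self.N.parameters():
            p.requires_grad_(False)

    def forward(self, I):
        with torch.no_grad():
            Y = self.N(I)                  # estimate from the entire image
        M = self.G(I)                      # relevant-pixel mask in [0, 1]
        Y_hat = self.N(I * M)              # estimate from the masked image
        return Y, Y_hat, M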

We then formulate the problem of identifying relevant pixels as a problem of sparse optimization. Specifically, we estimate an image mask that selects the smallest number of pixels from which the target CNN can provide a depth map maximally similar to the one it estimates from the original input. This requires optimization of the output of the CNN with respect to its input. As is shown in previous studies of visualization, such optimization through a CNN in its backward direction sometimes yields unexpected results, such as noisy visualizations [35, 37] at best and even phenomena similar to adversarial examples [7] at worst. To avoid this issue, we use an additional CNN to estimate the mask from the input image in a forward computation; this CNN is independent of the target CNN of visualization. Our method is illustrated in Fig. 2.

We conduct a number of experiments to evaluate the effectiveness of our approach. We apply our method to CNNs trained on indoor scenes (the NYU-v2 dataset) and those trained on outdoor scenes (the KITTI dataset). We confirm through the experiments that

• CNNs can infer the depth map from only a sparse set of pixels in the input image with similar accuracy to that they achieve from the entire image;

• The mask selecting the relevant pixels can be predicted stably by a CNN. This CNN is trained to predict masks for a target CNN for depth estimation.

The visualization of CNNs on the indoor and outdoor scenes provides several findings, including the following, which we think contribute to understanding of how CNNs work on the monocular depth estimation task.

• CNNs frequently use some of the edges in input images but not all of them. Their importance depends not necessarily on their edge strengths but more on their usefulness for grasping the scene geometry.

• For outdoor scenes, large weights tend to be given to distant regions around the vanishing points in the scene.

2. Related work

There are many studies that attempt to interpret the inference of CNNs, most of which have focused on the task of image classification [1, 43, 37, 36, 44, 31, 33, 17, 7, 28, 38]. However, there are only a few methods that have been recognized to be practically useful in the community [11, 20, 21].

Gradient-based methods [36, 28, 38] compute a saliency map that visualizes the sensitivity of each pixel of the input image to the final prediction, which is obtained by calculating the derivatives of the output of the model with respect to each image pixel.

There are many methods that mask part of the input image to see its effects [42]. General-purpose methods developed for interpreting the inference of machine learning models, such as LIME [31] and Prediction Difference Analysis [44], may be categorized in this class when they are applied to CNNs classifying an input image.

The most dependable method as of now for visualization of CNNs for classification is arguably the class activation map (CAM) [43], which calculates a linear combination of the activations of the last convolutional layer along its channel dimension. Its extension, Grad-CAM [33], is also widely used; it integrates the gradient-based method with CAM to enable the use of general network architectures that cannot be dealt with by CAM.

However, the above methods, which are developed mainly for the explanation of classification, cannot directly be applied to CNNs performing depth estimation. In the case of depth estimation, the output of CNNs is a two-dimensional map, not a score for a category. This immediately excludes gradient-based methods as well as CAM and its variants. The masking methods that employ fixed-shape masks [44] or super-pixels obtained using low-level image features [31] are not fit for our purpose, either, since there is no guarantee that their shapes match well with the depth cues in input images that are utilized by the CNNs.

3. Method
3.1. Problem Formulation
Suppose a network N that predicts the depth map of a scene from its single RGB image as

Y = N(I),                                              (1)

where Y is an estimated depth map and I is the normalized version of the input RGB image. Following previous studies, we normalize each image by z-score normalization. This model N is the target of visualization.

Human vision is considered to use several cues to infer depth information, most of which are associated with regions with small areas in the visual field. Thus, we make an assumption here that CNNs can infer the depth map equally well from a selected set of sparse pixels of I, as long as they are relevant to depth estimation. To be specific, we denote a binary mask selecting pixels of I by M and a masked input by I ⊗ M, where ⊗ denotes element-wise multiplication. The depth estimate Ŷ provided by our network N for the masked input is

Ŷ = N(I ⊗ M).                                          (2)

Our assumption is that Ŷ can become very close to the original estimate Y = N(I) when the mask M is chosen properly.

Now, we wish to find such a mask M for a given input I that Ŷ = N(I ⊗ M) will be as close to Y = N(I) as possible. As our purpose is to understand depth estimation, we also want M to be as sparse as possible (i.e., to have the smallest number of non-zero pixels). To do so, we relax the condition that M is binary, i.e., that its elements are either 0 or 1. We instead assume each element of M to have a continuous value in the range [0, 1]. We will validate this relaxation in our experiments, where we also check the validity of the above assumption of depth estimation from sparse pixels.

Finally, we formulate our problem as the following optimization:

min_M l_dif(Y, Ŷ) + λ (1/n) ‖M‖₁,                      (3)

where l_dif is a measure of the difference between Y and Ŷ; λ is a control parameter for the sparseness of M; n is the number of pixels; and ‖M‖₁ is the ℓ1 norm (of a vectorized version) of M.

3.2. Learning to Predict Mask

Now we consider how to perform the optimization (3). The network N appears in the objective function through the variable Ŷ = N(I ⊗ M). We need to carefully consider such optimization associated with the output of a CNN with respect to its input, because it often provides unexpected results, as is shown in previous studies.

In [35], the optimal inputs to CNNs trained on object recognition are computed that maximize the score of a selected object class for the purpose of visualization. Although they provide some insights into what the CNNs have learned, the images thus computed are unstable (e.g., sensitive to initial values); they are distant from natural images and not so easy to interpret. To obtain more visually interpretable images, researchers have employed several constraints on the input images to be optimized, e.g., one making them appear to be natural images [10, 29]. In addition, optimization of (a function of) network outputs sometimes yields unpredictable results; typical examples are adversarial examples [7].

Thus, instead of minimizing (3) with respect to individual elements of M, we use an additional network G to predict M ≈ G(I) that minimizes (3). More specifically, we consider the following optimization:

min_G l_dif(Y, N(I ⊗ G(I))) + λ (1/n) ‖G(I)‖₁,         (4)

where ‖G(I)‖₁ indicates the ℓ1 norm of the vectorized G(I). We employ the sigmoid activation function for the output layer of G, which constrains its output to the range [0, 1]. The details of our method for training G are shown in Algorithm 1. Figure 3 shows a comparison of M computed by different methods. It is seen that the direct optimization of (3) (Fig. 3(b)) yields noisy, less interpretable maps than our approach (Fig. 3(c)).

Figure 3. From left to right: (a) RGB images, (b) M obtained by solving (3), (c) M obtained by solving (4), (d) M obtained by solving (5).

We have considered removing as many unimportant pixels of I as possible while maximally maintaining the original prediction Y = N(I). There is yet another approach to identifying important/unimportant pixels, which is to identify the most important pixels of I, without which the prediction will maximally deteriorate. This is formulated as the following optimization problem:

min_G −l_dif(Y, N(I ⊗ G(I))) + λ (1/n) ‖1 − G(I)‖₁.    (5)
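The relaxed objectives (3)-(5) can be written directly with automatic differentiation. The following sketch (ours, not the exact training code) assumes a frozen depth network N, a mask network G with a sigmoid output, a single-channel mask that broadcasts over the RGB channels of I, and a difference measure l_dif:

import torch

def objective_eq4(N, G, I, Y, l_dif, lam):
    # Eq. (4): keep the fewest pixels that preserve the original estimate Y.
    M = G(I)
    n = float(M[0].numel())
    return l_dif(Y, N(I * M)) + lam / n * M.sum(dim=(1, 2, 3)).mean()

def objective_eq5(N, G, I, Y, l_dif, lam):
    # Eq. (5): find the pixels whose removal maximally degrades the estimate.
    M = G(I)
    n = float(M[0].numel())
    return -l_dif(Y, N(I * M)) + lam / n * (1.0 - M).sum(dim=(1, 2, 3)).mean()

def optimize_mask_directly(N, I, l_dif, lam, steps=300, lr=0.1):
    # Eq. (3): per-image optimization of M itself; this is the variant that
    # tends to produce noisy masks (Fig. 3(b)).
    with torch.no_grad():
        Y = N(I)
    logit = torch.zeros_like(I[:, :1]).requires_grad_(True)
    opt = torch.optim.Adam([logit], lr=lr)
    n = float(logit[0].numel())
    for _ in range(steps):
        opt.zero_grad()
        M = torch.sigmoid(logit)           # relax the binary mask to [0, 1]
        loss = l_dif(Y, N(I * M)) + lam / n * M.sum(dim=(1, 2, 3)).mean()
        loss.backward()
        opt.step()
    return torch.sigmoid(logit).detach()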

The formulation in (5) is similar to that employed in [7], a study of visualization of CNNs for object recognition, in which the most important pixels in the input image are identified by masking the pixels that maximally lower the score of a selected object class. Unlike our method, the authors directly optimize M; to avoid artifacts that would emerge in the optimization, they employ additional constraints on M other than its sparseness¹. The results obtained by the optimization of (5) are shown in Fig. 3(d). It is seen that this approach cannot provide useful results.

¹ In our experiments, we confirmed that their method works well for VGG networks but behaves unstably for modern CNNs such as ResNets.

Algorithm 1 Algorithm for training the network G for prediction of M.

Input: N: a target, fully-trained network for depth estimation; ψ: a training set, i.e., pairs of the RGB image and depth map of a scene; λ: a parameter controlling the sparseness of M.
Hyperparameters: Adam optimizer, learning rate: 1e−4, weight decay: 1e−4, training epochs: K.
Output: G: a network for predicting M.
 1: Freeze N;
 2: for j = 1 to K do
 3:   for i = 1 to T do
 4:     Select RGB batch ψ_i from ψ;
 5:     Set gradients of G to 0;
 6:     Calculate depth maps for ψ_i:
 7:       Y_ψi = N(ψ_i);
 8:     Calculate the value L of the objective function:
 9:       L = l_dif(Y_ψi, N(ψ_i ⊗ G(ψ_i))) + λ (1/n) ‖G(ψ_i)‖₁;
10:     Backpropagate L;
11:     Update G;
12:   end for
13: end for
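In PyTorch-style pseudocode, Algorithm 1 amounts to the following sketch; the data loading and l_dif are assumed helpers, and this is an illustration rather than our exact training script:

import torch

def train_G(N, G, loader, l_dif, lam, num_epochs):
    for p in N.parameters():               # line 1: freeze N
        p.requires_grad_(False)
    N.eval()
    opt = torch.optim.Adam(G.parameters(), lr=1e-4, weight_decay=1e-4)
    for epoch in range(num_epochs):        # lines 2-13
        for I in loader:                   # select an RGB batch ψ_i
            opt.zero_grad()                # set gradients of G to 0
            with torch.no_grad():
                Y = N(I)                   # depth maps for the batch, Y_ψi = N(ψ_i)
            M = G(I)                       # predicted mask, sigmoid output in [0, 1]
            n = float(M[0].numel())
            L = l_dif(Y, N(I * M)) + lam / n * M.sum(dim=(1, 2, 3)).mean()
            L.backward()                   # backpropagate L (through the frozen N)
            opt.step()                     # update G
    return G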
4. Experiments

4.1. Experimental Setup

Datasets We use two datasets, NYU-v2 [34] and KITTI [39], for our analyses; they are the most widely used in previous studies of monocular depth estimation. The NYU-v2 dataset contains 464 indoor scenes, for which we use the official splits: 249 scenes for training and 215 scenes for testing. We obtain approximately 50K unique pairs of an image and its corresponding depth map. Following the previous studies, we use the same 654 samples for testing. The KITTI dataset contains outdoor scenes and is collected by car-mounted cameras and a LIDAR sensor. We use the official training/validation splits; there are 86K image pairs for training and 1K image pairs from the official cropped subsets for testing. As the dataset only provides sparse depth maps, we use the depth completion toolbox of the NYU-v2 dataset to interpolate pixels with missing depth.

Target CNN models There are many studies of monocular depth estimation, in which a variety of architectures are proposed. Considering the purpose here, we choose models that show strong performance in estimation accuracy with a simple architecture. One is an encoder-decoder network based on ResNet-50 proposed in [16], which outperformed previous ones by a large margin as of the time of publication. We also consider more recent ones proposed in [14], for which we choose three different backbone networks: ResNet-50 [12], DenseNet-161 [9], and SENet-154 [15]. For better comparison, all the models are implemented under the same experimental conditions. Following their original implementations, the first and the latter three models are trained using different losses. To be specific, the first model is trained using the ℓ1 norm of depth errors². For the latter three models, the sum of three losses is used, i.e., l_depth = (1/n) Σ_{i=1}^{n} F(e_i), l_grad = (1/n) Σ_{i=1}^{n} (F(∇x(e_i)) + F(∇y(e_i))), and l_normal = (1/n) Σ_{i=1}^{n} (1 − cos θ_i), where F(e_i) = ln(e_i + 0.5); e_i = ‖y_i − ŷ_i‖₁; y_i and ŷ_i are the true and estimated depths; and θ_i is the angle between the surface normals computed from the true and estimated depth maps.

² We have found that ℓ1 performs better than the berHu loss originally used in [16], which agrees with [27].

Network G for predicting M We employ an encoder-decoder structure for G. For the encoder, we use the dilated residual network (DRN) proposed in [41], which preserves local structures of the input image owing to fewer down-sampling steps. Specifically, we use a DRN with 22 layers (DRN-D-22) pre-trained on ImageNet [4], from which we remove the last fully connected layer. It yields a feature map with 512 channels and 1/8 the resolution of the input image. For the decoder, we use a network consisting of three up-projection blocks [16] yielding a feature map with 64 channels and the same size as the input image, followed by a 3 × 3 convolutional layer outputting M. The encoder and decoder are connected to form the network G, which has 25.3M parameters in total. For the loss used to train G, we use l_dif = l_depth + l_grad + l_normal.
VGG networks but behaves unstably for modern CNNs such as ResNets. used in [16], which agree with [27].

4.2. Estimating Depth from Sparse Pixels

As explained above, our approach is based on the assumption that the network N can accurately estimate depth from only a selected set of sparse pixels. We also relaxed the condition on the binary mask, allowing M to have continuous values in the range [0, 1]. To validate the assumption as well as this relaxation, we check how the accuracy of depth estimation changes when binarizing the continuous mask M predicted by G.

To be specific, computing M = G(I) for I, we binarize M into a binary map M′ using a threshold ε = 0.025. We then compare the accuracy of the predicted depth maps N(I ⊗ M′) and N(I ⊗ M). As the sparseness of M is controlled by the parameter λ as in Eq. (4), we evaluate the accuracy for different λ's. We use the NYU-v2 dataset and the ResNet-50 based model of [14]. We train it for 10 epochs on the training set and measure its accuracy by RMSE.

Table 1. Accuracy of depth estimation for different values of the sparseness parameter λ. Results on the NYU-v2 dataset by the ResNet-50 model of [14]. Sparseness in the table indicates the average ratio of non-zero pixels in M′.

  λ          RMSE (M)   RMSE (M′)   Sparseness
  original   0.555      0.555       1.0
  λ = 1      0.605      0.568       0.920
  λ = 2      0.668      0.617       0.746
  λ = 3      0.699      0.668       0.589
  λ = 4      0.731      0.733       0.425
  λ = 5      0.740      0.758       0.361
  λ = 6      0.772      0.882       0.215

Figure 5. Comparison of the accuracy of depth estimation (RMSE plotted against sparseness) when selecting input image pixels using M and using the edge map of the input images.

Table 1 shows the results. It is first observed that there is a trade-off between the accuracy of depth estimation and the sparseness of the mask M. Please note that the RMSE values are calculated against the ground truth depths. The error grows from 0.555 (λ = 0) to 0.740 (λ = 5, the value we used in the subsequent experiments), which is only a 33% increase. We believe this is acceptable considering the accuracy-interpretability trade-off that is also seen in many visualization studies.

Figure 4. Visual comparison of approximated depth maps and estimated masks (M's) for different values of the sparseness parameter λ. Columns: (a) RGB images, (b) ground truth, (c)-(e) Ŷ when λ = 1, 3, 5, (f)-(h) M when λ = 1, 3, 5.

Figure 4 shows examples of pairs of the mask M and the estimated depth map Ŷ for different λ's for four different input images. It is also observed from Table 1 that the estimated depth with the binarized mask M′ is mostly the same as that with the continuous M when λ is not too large; it is even more accurate for small λ's. This validates our relaxation allowing M to have continuous values. Considering the trade-off between estimation accuracy and λ as well as the difference between prediction with M and M′, we choose λ = 5 in the analyses shown in what follows.
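The binarization check itself is simple; a sketch (with a simplified RMSE stand-in for the full evaluation protocol) is:

import torch

def rmse(y_true, y_pred):
    return torch.sqrt(torch.mean((y_true - y_pred) ** 2))

def evaluate_binarization(N, G, I, y_true, eps=0.025):
    # Compare depth accuracy with the continuous mask M and its binarized
    # version M' (threshold ε = 0.025), and report the fraction of retained
    # pixels (the "Sparseness" column of Table 1).
    with torch.no_grad():
        M = G(I)
        M_bin = (M > eps).float()
        return (rmse(y_true, N(I * M)).item(),
                rmse(y_true, N(I * M_bin)).item(),
                M_bin.mean().item())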

4.3. Analyses of Predicted Mask

4.3.1 NYU-v2 dataset

Figure 6 shows predicted masks for different input images and different depth prediction networks. It is first observed that there are only small differences among the different networks. This is evidence that the proposed visualization method can stably identify the pixels relevant to depth estimation. For the sake of comparison, edge maps of I are also shown in Fig. 6. It is seen from comparison with them that M tends to have non-zero values on the image edges; some non-zero pixels indeed lie exactly on image edges (e.g., the vertical edge on the far side in (1)).

Figure 6. Predicted masks for different input images (rows (1)-(10)) for different depth estimation networks: the ResNet-50-based model of [16] and three models of [14] whose backbones are ResNet-50, DenseNet-161, and SENet-154, respectively. Columns: (a) RGB images, (b) edge maps, (c) M for [16] (ResNet-50), (d) M for [14] (ResNet-50), (e) M for [14] (DenseNet-161), (f) M for [14] (SENet-154). The edge map of the input I is also shown for comparison.

However, a closer observation reveals that there is also a difference between M and the edge map; M tends to have non-zero pixels over the filled regions of objects, not on their boundaries, as with the table in (5), the chairs in (7), etc. Moreover, very strong image edges sometimes disappear in M, as is the case with the bottom edge of the cabinet in (2); instead, M has non-zero pixels along a weaker image edge emerging on the border of the cabinet and the wall. This is also the case with the intersecting lines between the floor and the bed in (6); M has large values along them, whereas their edge strength is very weak.

To further investigate the (dis)similarity between M and the edge map, we compare them by setting the edge map as M and evaluating the accuracy of the predicted depth N(I ⊗ M). Figure 5 shows the results. It is seen that the use of edge maps yields less accurate depth estimation, which clearly indicates the difference between the edge maps and the masks predicted by G.
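This comparison can be sketched as follows; since the construction of the edge maps is not detailed here, the Sobel magnitude and the sparseness-matched thresholding below are our assumptions.

import torch
import torch.nn.functional as F

def sobel_magnitude(gray):
    # gray: (B, 1, H, W)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3).to(gray)
    ky = kx.transpose(2, 3)
    return torch.sqrt(F.conv2d(gray, kx, padding=1) ** 2 +
                      F.conv2d(gray, ky, padding=1) ** 2)

def edge_mask(I, sparseness):
    # Keep the strongest edge pixels so that the retained fraction matches a
    # target sparseness, mimicking the x-axis of Fig. 5.
    mag = sobel_magnitude(I.mean(dim=1, keepdim=True))
    k = max(1, int(sparseness * mag[0].numel()))
    thresh = mag.flatten(1).topk(k, dim=1).values[:, -1:, None, None]
    return (mag >= thresh).float()

def rmse_with_edge_mask(N, I, y_true, sparseness):
    with torch.no_grad():
        y_hat = N(I * edge_mask(I, sparseness))
        return torch.sqrt(torch.mean((y_true - y_hat) ** 2)).item()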

Not the boundary alone but the filled region is highlighted for small objects. We conjecture that the CNNs recognize the objects and somehow utilize this for depth estimation.

Figure 7. Predicted masks for different networks trained on the KITTI dataset for different input images from the test split. Rows: RGB images, edge maps, M for [16] (ResNet-50), M for [14] (ResNet-50), M for [14] (DenseNet-161), and M for [14] (SENet-154).

4.3.2 KITTI dataset

Figure 7 shows the predicted masks on the KITTI dataset for three randomly selected images along with their edge maps. More examples are given in the supplementary material. As with the NYU-v2 dataset, the predicted masks tend to consist of edges and filled regions, and they are clearly different from the edge maps. It is observed that some image edges are seen in the masks but some are not. For example, in the first image, the guard rail on the left has strong edges, which are also seen in the mask. On the other hand, the white line on the road surface provides strong edges in the edge map but is absent in the mask. This indicates that the CNNs utilize the guard rail but, for some reason, do not use the white line for depth estimation. The same is true of the white vertical narrow object on the roadside in the second image.

A notable characteristic of the predicted masks on this dataset is that the region around the vanishing point of the scene is strongly highlighted. This is the case with all the images in the dataset, not limited to the three shown here. Our interpretation of this phenomenon is given in the discussion below.

4.3.3 Summary and Discussion

In summary, there are three findings from the above visualization results.

Important/unimportant image edges Some of the image edges are highlighted in M and some are not. This implies that the depth prediction network N selects important edges that are necessary for depth estimation. The selection seems to be more or less independent of the strength of the edges. We conjecture that those selected are essential for inferring the 3D structure (e.g., orientation, perspective, etc.) of a room or a road.

Attending to the regions inside objects As for objects in a scene, not only the boundary but also the inside region of them tends to be highlighted.

This is the case more with smaller objects, although it may be partly attributable to the use of the sparseness constraint. Unlike the image edges providing the geometric structure of the scene, we conjecture that the depth estimation network N may 'recognize' the objects and use their sizes to infer the absolute or relative distance to them.

Vanishing points In the case of the outdoor scenes of KITTI, the regions around vanishing points (or simply far-away regions) are always highlighted, almost without exception. This shows that these regions are important for N to provide accurate depths. This may be attributable to the fact that distant scene points tend to yield large errors because of the loss evaluating the difference in absolute depths; such distant scene regions will then be given more weight than others. Another possible explanation is that this is due to the natural importance of vanishing points; they are naturally a strong cue for understanding the geometry of a scene. Although these two explanations appear to be orthogonal, they could be coupled with each other in practice. A possible hypothesis is that CNNs (and/or human vision) learn to look at the vanishing points as they are distant and given more weight. Further investigation will be a direction of future studies.

4.4. Evaluation of Training Losses

There are several discussions in recent studies on how we should measure the accuracy of estimated depth maps [22, 14] and what losses we should use for training CNNs [14]. We compare the impact of losses by visualizing a network N trained with different losses. Following [14], we consider three losses: l_depth (the most widely used one, measuring the difference in depth values); l_grad (the difference in gradients of scene surfaces); and l_normal (the difference in orientation of normals to scene surfaces). We train a ResNet-50 based model of [14] on NYU-v2 using different combinations of the three losses, i.e., l_depth, l_depth + l_grad, and l_depth + l_grad + l_normal. Figure 8 shows the generated masks for networks trained using the three loss combinations. It is observed that the inclusion of l_grad highlights more of the surfaces of objects. The further addition of l_normal highlights more of the small objects and makes edges straighter where they should be.

Figure 8. Comparison of the estimated mask M for the three combinations of loss functions. Columns: (a) RGB images, (b) l_depth, (c) l_depth + l_grad, (d) l_depth + l_grad + l_normal.
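For reference, the three configurations compared in Fig. 8 simply switch the extra terms on and off; the unweighted sum below follows the description in Sec. 4.1, and the three loss values are assumed to be computed as sketched earlier.

def combined_loss(l_depth, l_grad, l_normal, use_grad=False, use_normal=False):
    # (b) l_depth only; (c) l_depth + l_grad; (d) l_depth + l_grad + l_normal.
    loss = l_depth
    if use_grad:
        loss = loss + l_grad
    if use_normal:
        loss = loss + l_normal
    return loss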
5. Summary and Conclusion

Toward answering the question of how CNNs can infer the depth of a scene from its monocular image, we have considered their visualization. Assuming that CNNs can infer a depth map accurately from a small number of image pixels, we considered the problem of identifying these pixels, or equivalently a mask concealing the other pixels, in each input image. We formulated the problem as an optimization problem of selecting the smallest number of pixels from which the CNN can estimate a depth map with the minimum difference from the one it estimates from the entire image. Pointing out that there are difficulties with optimization through a deep CNN, we proposed to use an additional network to predict the mask for an input image in a forward computation.

We have confirmed through several experiments that the above assumption holds well and that the proposed approach can stably predict the mask for each input image with good accuracy. We then applied the proposed method to a number of monocular depth estimation CNNs on indoor and outdoor scene datasets. The results provided several findings, such as i) the behaviour of CNNs whereby they seem to select edges in input images depending not on their strengths but on their importance for the inference of scene geometry; ii) the tendency of attending not only to the boundary but also to the inside region of each individual object; and iii) the importance of image regions around the vanishing points for depth estimation in outdoor scenes. We also showed an application of the proposed method, which is to visualize the effect of using different losses for training a depth estimation CNN. We think these findings contribute to moving forward our understanding of CNNs on the depth estimation task, shedding some light on a problem that has not been explored so far in the community.

Acknowledgments: This work was partly supported by JSPS KAKENHI Grant Numbers JP15H05919 and JP19H01110 and by JST CREST Grant Number JPMJCR14D1.
References

[1] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, Deva Ramanan, and Thomas S. Huang. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. ICCV, pages 2956-2964, 2015.
[2] Ayan Chakrabarti, Jingyu Shao, and Gregory Shakhnarovich. Depth from a single image by harmonizing overcomplete local network predictions. In NIPS, pages 2658-2666, 2016.
[3] Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In NIPS, pages 730-738, 2016.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[5] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. ICCV, pages 2650-2658, 2015.
[6] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, pages 2366-2374, 2014.
[7] Ruth Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. ICCV, pages 3449-3457, 2017.
[8] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, pages 2002-2011, 2018.
[9] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. CVPR, 2017.
[10] Google. https://deepdreamgenerator.com.
[11] Riccardo Guidotti, Anna Monreale, Franco Turini, Dino Pedreschi, and Fosca Giannotti. A survey of methods for explaining black box models. CoRR, abs/1802.01933, 2018.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[13] Ian P. Howard. Seeing in Depth, Vol. 1: Basic Mechanisms. University of Toronto Press, 2002.
[14] Junjie Hu, Mete Ozay, Yan Zhang, and Takayuki Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In WACV, 2019.
[15] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[16] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, pages 239-248, 2016.
[17] Saumya Jetley, Nicholas A. Lord, Namhoon Lee, and Philip H. S. Torr. Learn to pay attention. CoRR, abs/1804.02391, 2018.
[18] D. H. Kelly. Visual contrast sensitivity. Optica Acta: International Journal of Optics, 24(2):107-129, 1977.
[19] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, 2017.
[20] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods. CoRR, abs/1711.00867, 2017.
[21] Pieter-Jan Kindermans, Kristof T. Schütt, Maximilian Alber, K. Müller, Dumitru Erhan, Been Kim, and Sven Dähne. Learning how to explain neural networks: PatternNet and PatternAttribution. CoRR, 2017.
[22] Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Körner. Evaluation of CNN-based single-image depth estimation methods. CoRR, abs/1805.01328, 2018.
[23] Michael S. Landy, Laurence T. Maloney, Elizabeth B. Johnston, and Mark Young. Measurement and modeling of depth cue combination: In defense of weak fusion. Vision Research, 35(3):389-412, 1995.
[24] Pierre R. Lebreton, Alexander Raake, Marcus Barkowsky, and Patrick Le Callet. Measuring perceived depth in natural images and study of its relation with monocular and binocular depth cues. In Stereoscopic Displays and Applications XXV, volume 9011, page 90110C. International Society for Optics and Photonics, 2014.
[25] Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. CVPR, pages 1119-1127, 2015.
[26] Jun Li, Reinhard Klein, and Angela Yao. A two-streamed network for estimating fine-scaled depth maps from single RGB images. In CVPR, pages 3372-3380, 2017.
[27] Fangchang Ma and Sertac Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. ICRA, 2018.
[28] Aravindh Mahendran and Andrea Vedaldi. Salient deconvolutional networks. In ECCV, 2016.
[29] Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. CoRR, abs/1602.03616, 2016.
[30] Stephan Reichelt, Ralf Häussler, Gerald Fütterer, and Norbert Leister. Depth cues in human visual perception and their realization in 3D displays. In Three-Dimensional Imaging, Visualization, and Display 2010 and Display Technologies and Applications for Defense, Security, and Avionics IV, volume 7690, page 76900B. International Society for Optics and Photonics, 2010.
[31] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135-1144. ACM, 2016.
[32] Ashutosh Saxena, Jamie Schulte, Andrew Y. Ng, et al. Depth estimation using monocular and stereo cues. In IJCAI, volume 7, 2007.
[33] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. ICCV, pages 618-626, 2017.
[34] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[35] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
[36] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B. Viégas, and Martin Wattenberg. SmoothGrad: Removing noise by adding noise. CoRR, abs/1706.03825, 2017.
[37] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.
[38] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, 2017.
[39] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In 3DV, 2017.
[40] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. CVPR, pages 161-169, 2017.
[41] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In CVPR, 2017.
[42] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[43] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. CVPR, pages 2921-2929, 2016.
[44] Luisa M. Zintgraf, Taco Cohen, Tameem Adel, and Max Welling. Visualizing deep neural network decisions: Prediction difference analysis. ICLR, 2017.
