Visualization of Convolutional Neural Networks for Monocular Depth Estimation (Hu et al., ICCV 2019)
Abstract
Figure 2. Diagram of the proposed approach. The target of visualization is the trained depth estimation network N. To identify the pixels of the input image I that N uses to estimate its depth map Y, we feed I to a network G that predicts the set of relevant pixels, i.e., the mask M. The output M is element-wise multiplied with I and input to N, yielding an estimate Ŷ of the depth map. G is trained so that Ŷ is as close as possible to the original estimate Y obtained from the entire image I while M is maximally sparse. Note that N is fixed throughout this process.
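The pipeline of Fig. 2 can be written down compactly. Below is a minimal PyTorch-style sketch of the objective used to train G, assuming `depth_net` (N) and `mask_net` (G) are modules mapping an image batch to a one-channel map; a plain L1 term stands in for the paper's depth-difference loss l_dif, so this is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mask_objective(depth_net, mask_net, image, lam=5.0):
    """Loss for training G with N kept fixed: fidelity to N(I) plus sparsity of M."""
    with torch.no_grad():
        y_full = depth_net(image)            # Y = N(I), depth from the full image
    mask = mask_net(image)                   # M = G(I), continuous values in [0, 1]
    y_masked = depth_net(image * mask)       # Y_hat = N(I ⊗ M)
    fidelity = F.l1_loss(y_masked, y_full)   # stand-in for l_dif(Y, Y_hat)
    sparsity = mask.abs().mean()             # (1/n) * ||M||_1
    return fidelity + lam * sparsity
```

Gradients reach the mask through N in the masked forward pass, while the reference prediction is computed without tracking gradients, reflecting that N itself is never updated.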
estimate an image mask that selects the smallest number of pixels from which the target CNN can produce a depth map maximally similar to the one it estimates from the original input. This requires optimizing the output of the CNN with respect to its input. As shown in previous studies of visualization, such optimization through a CNN in the backward direction sometimes yields unexpected results, such as noisy visualizations [35, 37] at best and even phenomena similar to adversarial examples [7] at worst. To avoid this issue, we use an additional CNN to estimate the mask from the input image in a forward computation; this CNN is independent of the target CNN of the visualization. Our method is illustrated in Fig. 2.

We conduct a number of experiments to evaluate the effectiveness of our approach. We apply our method to CNNs trained on indoor scenes (the NYU-v2 dataset) and those trained on outdoor scenes (the KITTI dataset). We confirm through the experiments that

• CNNs can infer the depth map from only a sparse set of pixels in the input image with accuracy similar to that of the map they infer from the entire image;

• The mask selecting the relevant pixels can be predicted stably by a CNN, which is trained to predict masks for a target CNN for depth estimation.

The visualization of CNNs on the indoor and outdoor scenes provides several findings, including the following, which we think contribute to understanding how CNNs work on the monocular depth estimation task.

• CNNs frequently use some of the edges in input images but not all of them. Their importance depends not necessarily on their edge strength but rather on their usefulness for grasping the scene geometry.

• For outdoor scenes, large weights tend to be given to distant regions around the vanishing points of the scene.

2. Related work

There are many studies that attempt to interpret the inference of CNNs, most of which have focused on the task of image classification [1, 43, 37, 36, 44, 31, 33, 17, 7, 28, 38]. However, only a few methods have been recognized as practically useful in the community [11, 20, 21].

Gradient-based methods [36, 28, 38] compute a saliency map that visualizes the sensitivity of each pixel of the input image to the final prediction, obtained by calculating the derivatives of the output of the model with respect to each image pixel.

There are many methods that mask part of the input image to see its effects [42]. General-purpose methods developed for interpreting the inference of machine learning models, such as LIME [31] and Prediction Difference Analysis [44], may be categorized in this class when they are applied to CNNs classifying an input image.

The most dependable method as of now for the visualization of CNNs for classification is arguably the class activation map (CAM) [43], which calculates a linear combination, over the channel dimension, of the activations of the last convolutional layer. Its extension, Grad-CAM [33], is also widely used; it integrates the gradient-based approach with CAM, enabling the use of general network architectures that cannot be dealt with by CAM.

However, the above methods, which were developed mainly for explaining classification, cannot directly be applied to CNNs performing depth estimation. In the case of depth estimation, the output of a CNN is a two-dimensional map, not a score for a category. This immediately excludes gradient-based methods as well as CAM and its variants. The masking methods that employ fixed-shape masks [44] or super-pixels obtained using low-level image features [31] do not fit our purpose either, since there is no guarantee that their shapes match well with the depth cues in input images that are utilized by the CNNs.
3. Method
3.1. Problem Formulation
Suppose a network N that predicts the depth map of a
scene from its single RGB image as
Y = N(I),    (1)
This formulation is similar to that employed in [7], a study on the visualization of CNNs for object recognition, in which the most important pixels in the input image are identified by masking the pixels that maximally lower the score of a selected object class. Unlike our method, the authors directly optimize M; to avoid artifacts that would emerge in the optimization, they employ additional constraints on M other than its sparseness¹. The results obtained by the optimization of (5) are shown in Fig. 3(d). It is seen that this approach cannot provide useful results.

¹ In our experiments, we confirmed that their method works well for VGG networks but behaves unstably for modern CNNs such as ResNets.

Algorithm 1 Algorithm for training the network G for prediction of M.
Input: N: a target, fully trained network for depth estimation; ψ: a training set, i.e., pairs of the RGB image and depth map of a scene; λ: a parameter controlling the sparseness of M.
Hyperparameters: Adam optimizer, learning rate: 1e-4, weight decay: 1e-4, training epochs: K.
Output: G: a network for predicting M.
1: Freeze N;
2: for j = 1 to K do
3:   for i = 1 to T do
4:     Select RGB batch ψ_i from ψ;
5:     Set gradients of G to 0;
6:     Calculate depth maps for ψ_i:
7:       Y_ψi = N(ψ_i);
8:     Calculate the value L of the objective function:
9:       L = l_dif(Y_ψi, N(ψ_i ⊗ G(ψ_i))) + λ (1/n) ||G(ψ_i)||_1;
10:    Backpropagate L;
11:    Update G;
12:  end for
13: end for
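For concreteness, Algorithm 1 can be sketched in PyTorch roughly as follows; `depth_net` (N), `mask_net` (G), a loader yielding RGB batches, and the loss `l_dif` are assumed to be defined elsewhere, and the epoch count stands in for K. Treat this as an illustration of the procedure rather than the authors' code.

```python
import torch

def train_mask_net(depth_net, mask_net, loader, l_dif, lam=5.0, num_epochs=20):
    # 1: Freeze N — it is only used for inference while G is trained.
    depth_net.eval()
    for p in depth_net.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(mask_net.parameters(), lr=1e-4, weight_decay=1e-4)
    for _ in range(num_epochs):                        # 2: for j = 1 to K
        for images in loader:                          # 3-4: RGB batches psi_i
            opt.zero_grad()                            # 5: set gradients of G to 0
            with torch.no_grad():
                y_ref = depth_net(images)              # 6-7: Y_psi_i = N(psi_i)
            mask = mask_net(images)                    # M = G(psi_i)
            y_hat = depth_net(images * mask)           # N(psi_i ⊗ G(psi_i))
            loss = l_dif(y_ref, y_hat) + lam * mask.abs().mean()   # 8-9
            loss.backward()                            # 10: backpropagate L
            opt.step()                                 # 11: update G
    return mask_net
```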
ers (DRN-D-22) pre-trained on ImageNet [4], from which
we remove the last fully connected layer. It yields a feature
4. Experiments map with 512 channels and 1/8 resolution of the input im-
age. For the decoder, we use a network consisting of three
4.1. Experimental Setup up-projection blocks [16] yielding a feature map with 64
channels and the same size as the input image, followed by
Datasets We use two datasets NYU-v2 [34] and KITTI
a 3 × 3 convolutional layer outputting M . The encoder and
datasets [39] for our analyses, which are the most widely
decoder are connected to form the network G, which has
used in the previous studies of monocular depth estimation.
25.3M parameters in total. For the loss used to train G, we
The NYU-v2 dataset contains 464 indoor scenes, for which
use ldif = ldepth + lgrad + lnormal .
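The three loss terms above can be sketched as follows. The finite-difference gradients and the construction of normals from depth gradients are simplifications and are not an exact reproduction of [14].

```python
import torch
import torch.nn.functional as F

def log_penalty(e):
    return torch.log(e + 0.5)                        # F(e) = ln(e + 0.5)

def finite_diff(d):
    # Forward differences along x and y as a stand-in for the image gradients.
    dx = d[..., :, 1:] - d[..., :, :-1]
    dy = d[..., 1:, :] - d[..., :-1, :]
    return dx, dy

def depth_losses(pred, gt):
    e = (pred - gt).abs()                            # e_i = |y_i - y_hat_i|
    l_depth = log_penalty(e).mean()

    dx_e, dy_e = finite_diff(pred - gt)              # gradients of the error map
    l_grad = log_penalty(dx_e.abs()).mean() + log_penalty(dy_e.abs()).mean()

    # Normals from depth gradients, n ∝ (-dz/dx, -dz/dy, 1), compared by cosine.
    dxp, dyp = finite_diff(pred)
    dxg, dyg = finite_diff(gt)
    n_pred = torch.stack([-dxp[..., :-1, :], -dyp[..., :, :-1],
                          torch.ones_like(dxp[..., :-1, :])], dim=-1)
    n_gt = torch.stack([-dxg[..., :-1, :], -dyg[..., :, :-1],
                        torch.ones_like(dxg[..., :-1, :])], dim=-1)
    l_normal = (1 - F.cosine_similarity(n_pred, n_gt, dim=-1)).mean()
    return l_depth, l_grad, l_normal
```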
Network G for predicting M We employ an encoder-decoder structure for G. For the encoder, we use the dilated residual network (DRN) proposed in [41], which preserves the local structures of the input image owing to its smaller number of down-sampling operations. Specifically, we use a DRN with 22 layers (DRN-D-22) pre-trained on ImageNet [4], from which we remove the last fully connected layer. It yields a feature map with 512 channels at 1/8 of the input image resolution. For the decoder, we use a network consisting of three up-projection blocks [16], yielding a feature map with 64 channels and the same size as the input image, followed by a 3 × 3 convolutional layer that outputs M. The encoder and decoder are connected to form the network G, which has 25.3M parameters in total. For the loss used to train G, we use l_dif = l_depth + l_grad + l_normal.
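A rough sketch of such a mask network is given below. The encoder is passed in as a pretrained DRN-D-22 trunk (a placeholder here, since it is not a torchvision model), the up-sampling block is a simplified stand-in for the up-projection blocks of [16], and the final sigmoid is one way to keep M in [0, 1]; these details are assumptions rather than specifics stated above.

```python
import torch.nn as nn

def up_block(in_ch, out_ch):
    # Simplified substitute for an up-projection block: 2x upsample + 3x3 conv.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class MaskNet(nn.Module):
    def __init__(self, drn_d_22_trunk):
        super().__init__()
        self.encoder = drn_d_22_trunk              # 512-channel map at 1/8 resolution
        self.decoder = nn.Sequential(              # 1/8 -> 1/4 -> 1/2 -> full resolution
            up_block(512, 256), up_block(256, 128), up_block(128, 64),
        )
        self.head = nn.Conv2d(64, 1, kernel_size=3, padding=1)

    def forward(self, x):
        m = self.head(self.decoder(self.encoder(x)))
        return m.sigmoid()                         # mask M with values in [0, 1]
```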
4.2. Estimating Depth from Sparse Pixels

As explained above, our approach is based on the assumption that the network N can accurately estimate depth from only a selected set of sparse pixels. We also relax the condition of a binary mask, allowing M to take continuous values in the range [0, 1]. To validate the assumption as well as this relaxation, we check how the accuracy of depth estimation changes when the continuous mask M predicted by G is binarized.
Figure 4. Visual comparison of approximated depth maps and estimated masks (M's) for different values of the sparseness parameter λ. Columns: (a) RGB images, (b) ground truth, (c)-(e) Ŷ for λ = 1, 3, 5, (f)-(h) M for λ = 1, 3, 5.
To be specific, computing M = G(I) for each I, we binarize M into a binary map M′ using a threshold ε = 0.025. We then compare the accuracy of the predicted depth maps N(I ⊗ M′) and N(I ⊗ M). As the sparseness of M is controlled by the parameter λ as in Eq. (4), we evaluate the accuracy for different λ's. We use the NYU-v2 dataset and the ResNet-50-based model of [14]. We train it for 10 epochs on the training set and measure its accuracy by RMSE.
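The comparison just described amounts to a few lines of evaluation code; a minimal sketch, assuming `depth_net` (N), `mask_net` (G), and a test loader yielding image/ground-truth pairs:

```python
import torch

def rmse(pred, gt):
    return torch.sqrt(torch.mean((pred - gt) ** 2))

@torch.no_grad()
def compare_continuous_vs_binary(depth_net, mask_net, loader, eps=0.025):
    err_cont, err_bin = [], []
    for image, gt_depth in loader:
        m = mask_net(image)                 # continuous mask M in [0, 1]
        m_bin = (m > eps).float()           # binarized mask M' with threshold eps
        err_cont.append(rmse(depth_net(image * m), gt_depth))
        err_bin.append(rmse(depth_net(image * m_bin), gt_depth))
    return torch.stack(err_cont).mean(), torch.stack(err_bin).mean()
```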
Table 1 shows the results. It is first observed that there is a trade-off between the accuracy of depth estimation and the sparseness of the mask M. Note that the RMSE values are calculated against the ground truth depths. The error grows from 0.555 (λ = 0) to 0.740 (λ = 5, the value used in the subsequent experiments), which is only a 33% increase. We believe this is acceptable considering the accuracy-interpretability trade-off that is also seen in many visualization studies.

Figure 4 shows examples of pairs of the mask M and the estimated depth map Ŷ for different λ's for four different input images. It is also observed from Table 1 that the estimated depth with the binarized mask M′ is mostly the same as that with the continuous M when λ is not too large; it is even more accurate for small λ's. This validates our relaxation allowing M to have continuous values. Considering the trade-off between estimation accuracy and λ, as well as the difference between predictions with M and M′, we choose λ = 5 in the analyses shown in what follows.

4.3. Analyses of Predicted Mask

4.3.1 NYU-v2 dataset

Figure 6 shows predicted masks for different input images and different depth prediction networks. It is first observed that there are only small differences among the different networks. This provides evidence that the proposed visualization method can stably identify the pixels relevant to depth estimation.
Figure 6. Predicted masks for different input images (rows (1)-(10)) for different depth estimation networks: the ResNet-50-based model of [16] and the three models of [14] whose backbones are ResNet-50, DenseNet-161, and SENet-154, respectively. Columns: (a) RGB images, (b) edge maps of the input I (shown for comparison), (c) M for [16] (ResNet-50), (d)-(f) M for [14] (ResNet-50, DenseNet-161, SENet-154).
For the sake of comparison, edge maps of I are also shown in Fig. 6. Comparing them with M, it is seen that M tends to have non-zero values on the image edges; some non-zero pixels indeed lie exactly on image edges (e.g., the vertical edge on the far side in (1)).

However, a closer observation reveals that there is also a difference between M and the edge map: M tends to have non-zero pixels over the filled regions of objects, not on their boundaries, as with the table in (5), the chairs in (7), etc.
Figure 7. Predicted masks for different networks trained on the KITTI dataset for different input images from the test split. Rows: RGB images, edge maps, M for [16] (ResNet-50), and M for [14] (ResNet-50, DenseNet-161, SENet-154).
Moreover, very strong image edges sometimes disappear in M, as is the case with the bottom edge of the cabinet in (2); instead, M has non-zero pixels along a weaker image edge emerging on the border between the cabinet and the wall. This is also the case with the intersecting lines between the floor and the bed in (6); M has large values along them, whereas their edge strength is very weak.

To further investigate the (dis)similarity between M and the edge map, we compare them by setting the edge map as M and evaluating the accuracy of the predicted depth N(I ⊗ M). Figure 5 shows the results. It is seen that the use of edge maps yields less accurate depth estimation, which clearly indicates the difference between the edge maps and the masks predicted by G.
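This comparison can be reproduced by substituting an edge map for M in the evaluation; a minimal sketch is given below. The Sobel-based detector and the normalization to [0, 1] are assumptions, since the text does not say how the edge maps in Fig. 6 were produced.

```python
import torch
import torch.nn.functional as F

def sobel_edge_map(image):
    gray = image.mean(dim=1, keepdim=True)                       # Bx1xHxW
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    kx = kx.view(1, 1, 3, 3).to(image)
    ky = kx.transpose(-1, -2)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    mag = torch.sqrt(gx ** 2 + gy ** 2)
    return mag / (mag.amax(dim=(-2, -1), keepdim=True) + 1e-8)   # scale to [0, 1]

@torch.no_grad()
def depth_from_edge_mask(depth_net, image):
    # Evaluate N(I ⊗ E) with the edge map E playing the role of the mask M.
    return depth_net(image * sobel_edge_map(image))
```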
Not the boundary alone but the filled region is highlighted for small objects. We conjecture that the CNNs recognize the objects and somehow utilize this for depth estimation.

4.3.2 KITTI dataset

Figure 7 shows the predicted masks on the KITTI dataset for three randomly selected images along with their edge maps. More examples are given in the supplementary material. As with the NYU-v2 dataset, the predicted masks tend to consist of edges and filled regions, and they are clearly different from the edge maps. It is observed that some image edges appear in the masks and some do not. For example, in the first image, the guard rail on the left has strong edges, which are also seen in the mask. On the other hand, the white line on the road surface provides strong edges in the edge map but is absent in the mask. This indicates that the CNNs utilize the guard rail but, for some reason, do not use the white line for depth estimation. The same holds for the narrow white vertical object on the roadside in the second image.

A notable characteristic of the predicted masks on this dataset is that the region around the vanishing point of the scene is strongly highlighted. This is the case for all the images in the dataset, not only the three shown here. Our interpretation of this phenomenon is given in the discussion below.

4.3.3 Summary and Discussion

In summary, there are three findings from the above visualization results.

Important/unimportant image edges Some of the image edges are highlighted in M and some are not. This implies that the depth prediction network N selects important edges that are necessary for depth estimation. The selection seems to be more or less independent of the strength of the edges. We conjecture that the selected edges are essential for inferring the 3D structure (e.g., orientation, perspective, etc.) of a room or a road.

Attending on the regions inside objects As for objects in a scene, not only the boundary but also the inside region of them tends to be highlighted. This is the case more with
dients of scene surfaces); and l_normal (difference in the orientation of normals to scene surfaces). We train the ResNet-50-based model of [14] on NYU-v2 using different combinations of the three losses, i.e., l_depth, l_depth + l_grad, and l_depth + l_grad + l_normal. Figure 8 shows the masks generated for networks trained using the three loss combinations. It is observed that the inclusion of l_grad highlights the surfaces of objects more. The further addition of l_normal highlights small objects more and makes edges straighter where they should be straight.
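The ablation over loss combinations can be expressed as a small switch over the terms. This sketch reuses the depth_losses helper outlined earlier; the flag names are hypothetical and only illustrate the setup.

```python
def ablation_loss(pred, gt, use_grad=True, use_normal=True):
    # Combine l_depth with optional l_grad / l_normal terms, matching the three
    # configurations l_depth, l_depth + l_grad, and l_depth + l_grad + l_normal.
    l_depth, l_grad, l_normal = depth_losses(pred, gt)
    loss = l_depth
    if use_grad:
        loss = loss + l_grad
    if use_normal:
        loss = loss + l_normal
    return loss
```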
References

[1] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, Deva Ramanan, and Thomas S. Huang. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. ICCV, pages 2956–2964, 2015.
[2] Ayan Chakrabarti, Jingyu Shao, and Gregory Shakhnarovich. Depth from a single image by harmonizing overcomplete local network predictions. In NIPS, pages 2658–2666, 2016.
[3] Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In NIPS, pages 730–738, 2016.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[5] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. ICCV, pages 2650–2658, 2015.
[6] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, pages 2366–2374, 2014.
[7] Ruth Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. ICCV, pages 3449–3457, 2017.
[8] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, pages 2002–2011, 2018.
[9] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. CVPR, 2017.
[10] Google. https://deepdreamgenerator.com.
[11] Riccardo Guidotti, Anna Monreale, Franco Turini, Dino Pedreschi, and Fosca Giannotti. A survey of methods for explaining black box models. CoRR, abs/1802.01933, 2018.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[13] Ian P. Howard. Seeing in Depth, Vol. 1: Basic Mechanisms. University of Toronto Press, 2002.
[14] Junjie Hu, Mete Ozay, Yan Zhang, and Takayuki Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In WACV, 2019.
[15] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[16] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, pages 239–248, 2016.
[17] Saumya Jetley, Nicholas A. Lord, Namhoon Lee, and Philip H. S. Torr. Learn to pay attention. CoRR, abs/1804.02391, 2018.
[18] D. H. Kelly. Visual contrast sensitivity. Optica Acta: International Journal of Optics, 24(2):107–129, 1977.
[19] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NIPS, 2017.
[20] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods. CoRR, abs/1711.00867, 2017.
[21] Pieter-Jan Kindermans, Kristof T. Schütt, Maximilian Alber, K. Müller, Dumitru Erhan, Been Kim, and Sven Dähne. Learning how to explain neural networks: PatternNet and PatternAttribution. CoRR, 2017.
[22] Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Körner. Evaluation of cnn-based single-image depth estimation methods. CoRR, abs/1805.01328, 2018.
[23] Michael S. Landy, Laurence T. Maloney, Elizabeth B. Johnston, and Mark Young. Measurement and modeling of depth cue combination: In defense of weak fusion. Vision Research, 35(3):389–412, 1995.
[24] Pierre R. Lebreton, Alexander Raake, Marcus Barkowsky, and Patrick Le Callet. Measuring perceived depth in natural images and study of its relation with monocular and binocular depth cues. In Stereoscopic Displays and Applications XXV, volume 9011, page 90110C. International Society for Optics and Photonics, 2014.
[25] Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. CVPR, pages 1119–1127, 2015.
[26] Jun Li, Reinhard Klein, and Angela Yao. A two-streamed network for estimating fine-scaled depth maps from single rgb images. In CVPR, pages 3372–3380, 2017.
[27] Fangchang Ma and Sertac Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. ICRA, 2018.
[28] Aravindh Mahendran and Andrea Vedaldi. Salient deconvolutional networks. In ECCV, 2016.
[29] Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. CoRR, abs/1602.03616, 2016.
[30] Stephan Reichelt, Ralf Häussler, Gerald Fütterer, and Norbert Leister. Depth cues in human visual perception and their realization in 3d displays. In Three-Dimensional Imaging, Visualization, and Display 2010 and Display Technologies and Applications for Defense, Security, and Avionics IV, volume 7690, page 76900B. International Society for Optics and Photonics, 2010.
[31] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[32] Ashutosh Saxena, Jamie Schulte, Andrew Y. Ng, et al. Depth estimation using monocular and stereo cues. In IJCAI, volume 7, 2007.
[33] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. ICCV, pages 618–626, 2017.
[34] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob
Fergus. Indoor segmentation and support inference from
rgbd images. In ECCV, 2012.
[35] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman.
Deep inside convolutional networks: Visualising image clas-
sification models and saliency maps. CoRR, abs/1312.6034,
2013.
[36] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B.
Viégas, and Martin Wattenberg. Smoothgrad: removing
noise by adding noise. CoRR, abs/1706.03825, 2017.
[37] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas
Brox, and Martin A. Riedmiller. Striving for simplicity: The
all convolutional net. CoRR, abs/1412.6806, 2014.
[38] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic
attribution for deep networks. In ICML, 2017.
[39] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke,
Thomas Brox, and Andreas Geiger. Sparsity invariant cnns.
In 3DV, 2017.
[40] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and
Nicu Sebe. Multi-scale continuous crfs as sequential deep
networks for monocular depth estimation. CVPR, pages
161–169, 2017.
[41] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated
residual networks. In CVPR, 2017.
[42] Matthew D. Zeiler and Rob Fergus. Visualizing and under-
standing convolutional networks. In ECCV, 2014.
[43] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva,
and Antonio Torralba. Learning deep features for discrimi-
native localization. CVPR, pages 2921–2929, 2016.
[44] Luisa M. Zintgraf, Taco Cohen, Tameem Adel, and Max
Welling. Visualizing deep neural network decisions: Pre-
diction difference analysis. ICLR, 2017.