Abstract
In this paper, we propose a zoom-out-and-in network for generating object proposals. A key observation is that anchors of different sizes are difficult to classify with the same set of features. Instead, anchors of different sizes should be placed at depths in the network that match their scale: smaller boxes on high-resolution layers with a smaller stride, and larger boxes on low-resolution layers with a larger stride. Inspired by the conv/deconv structure, we fully leverage the low-level local details and high-level regional semantics from two feature map streams, which are complementary to each other, to identify objectness in an image. A map attention decision (MAD) unit is further proposed to aggressively search for neuron activations across the two streams and attend to the ones that contribute most to the feature learning of the final loss. The unit serves as a decision-maker that adaptively activates maps along certain channels with the sole purpose of minimizing the overall training loss. One advantage of MAD is that the weights applied to each feature channel are predicted on the fly from the input context, which is more flexible than the fixed weights of a convolutional kernel. Experimental results on three datasets demonstrate the effectiveness of the proposed algorithm over other state-of-the-art methods, in terms of average recall for region proposal and average precision for object detection.
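To make the channel-attention idea in the abstract concrete, the following is a minimal NumPy sketch of a MAD-style gate: per-channel weights are predicted on the fly from the input context (here via global average pooling, a learned linear map `W`, `b`, and a sigmoid, all of which are illustrative assumptions, not the paper's exact formulation) and are then used to attend to the zoom-out and zoom-in feature streams before merging them.

```python
import numpy as np

def mad_gate(zoom_out_feat, zoom_in_feat, W, b):
    """Illustrative channel-attention gate in the spirit of the MAD unit.

    Per-channel weights are predicted from the input itself (global
    average pooling + a learned linear map + sigmoid), then used to
    scale the two feature streams before merging.
    Shapes: features are (C, H, W); W is (C, 2C); b is (C,).
    """
    # Summarize both streams by global average pooling -> (2C,) context
    context = np.concatenate([zoom_out_feat.mean(axis=(1, 2)),
                              zoom_in_feat.mean(axis=(1, 2))])
    # Predict per-channel gates on the fly from the input context
    gates = 1.0 / (1.0 + np.exp(-(W @ context + b)))  # (C,) in (0, 1)
    # Attend: gate one stream, its complement gates the other, then merge
    g = gates[:, None, None]
    return g * zoom_out_feat + (1.0 - g) * zoom_in_feat
```

Because the gates depend on the input context rather than being fixed parameters, two different images can weight the same channel differently, which is the flexibility the abstract contrasts with a fixed convolutional kernel.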







Notes
The first row is the inner product of two vectors, resulting in a scalar gradient; the second is the usual multiplication of a vector by a scalar, resulting in a vector.
Direction ‘matches’ means the included angle between the two vectors in multi-dimensional space is within \(90^{\circ }\); ‘departs’ means the angle falls in \([90^{\circ }, 180^{\circ }]\).
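The angle test in the note reduces to the sign of the inner product: a positive dot product means the included angle is below \(90^{\circ }\) (directions match), a non-positive one means it falls in \([90^{\circ }, 180^{\circ }]\) (directions depart). A small sketch (the helper name is illustrative):

```python
import numpy as np

def direction_matches(u, v):
    """True when the included angle between u and v is below 90 degrees,
    i.e. the inner product (a scalar) is strictly positive."""
    return float(np.dot(u, v)) > 0.0
```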
Acknowledgements
We would like to thank the reviewers for their helpful comments; S. Gidaris, X. Tong, and K. Kang for fruitful discussions along the way; and W. Yang for proofreading the manuscript. H. Li is funded by the Hong Kong Ph.D. Fellowship Scheme. We are also grateful to SenseTime Group Ltd. for donating the GPU resources used in this project.
Communicated by Antonio Torralba.
Li, H., Liu, Y., Ouyang, W. et al. Zoom Out-and-In Network with Map Attention Decision for Region Proposal and Object Detection. Int J Comput Vis 127, 225–238 (2019). https://doi.org/10.1007/s11263-018-1101-7