
Learning Offset Probability Distribution for Accurate Object Detection

Published: 22 January 2024

    Abstract

    Object detection combines object classification and object localization. Current object detection methods depend heavily on regression networks to locate objects; these networks are optimized with various regression loss functions to predict offsets between candidate boxes and objects. However, it is difficult for these regression losses to assign appropriate penalties to samples with large offset errors, resulting in suboptimal regression networks and inaccurate object offsets. In this article, we treat object localization as an offset bin classification problem and propose a distance-aware offset bin classification network, optimized with multiple binary cross-entropy losses, to learn various offset probability distributions, including a single-label distribution and a distance-aware label distribution. On the one hand, it provides gradient contributions for different samples based on bounded probabilities rather than the incalculable offset errors used previously. On the other hand, it exploits the distance correlations between discrete offset bins to facilitate network learning. Specifically, we discretize the continuous offset into a number of bins and predict the probability of each bin, where the probability should be higher for bins closer to the target offset and lower for bins farther from it. Furthermore, we propose an expectation-based offset prediction and a hierarchical focusing method to improve prediction precision. We conduct extensive experiments to evaluate the effectiveness of our method. In addition, our method can be conveniently and flexibly inserted into existing object detection methods, and it consistently achieves a large gain over popular anchor-based and anchor-free baselines on the PASCAL VOC, MS-COCO, KITTI, and CrowdHuman datasets. Code will be released at: https://github.com/QiuHeqian/DBC.
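The bin-based formulation in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the offset range, the number of bins, the exponential decay used for the distance-aware label, the parameter `tau`, and the normalization in the expectation step are all assumptions chosen for illustration, and every function name below is hypothetical. The hierarchical focusing method is not covered, since the abstract does not describe it in enough detail.

```python
import math


def bin_centers(lo, hi, num_bins):
    """Centers of num_bins equal-width bins covering [lo, hi] (assumed layout)."""
    width = (hi - lo) / num_bins
    return [lo + (k + 0.5) * width for k in range(num_bins)]


def single_label_target(offset, centers):
    """Single-label distribution: 1 for the bin nearest the offset, 0 elsewhere."""
    nearest = min(range(len(centers)), key=lambda k: abs(centers[k] - offset))
    return [1.0 if k == nearest else 0.0 for k in range(len(centers))]


def distance_aware_target(offset, centers, tau=0.5):
    """Soft label that is higher for bins closer to the target offset.

    The exponential form and tau are assumptions, not taken from the article.
    """
    return [math.exp(-abs(c - offset) / tau) for c in centers]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits, targets, eps=1e-12):
    """Sum of independent binary cross-entropy losses, one per offset bin."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total


def expected_offset(logits, centers):
    """Expectation-style decoding: probability-weighted mean of bin centers.

    Per-bin sigmoid probabilities are normalized to sum to 1 (an assumption).
    """
    probs = [sigmoid(z) for z in logits]
    norm = sum(probs)
    return sum(p * c for p, c in zip(probs, centers)) / norm


# Example usage with an assumed offset range of [-1, 1] and 8 bins.
centers = bin_centers(-1.0, 1.0, 8)
hard = single_label_target(0.3, centers)
soft = distance_aware_target(0.3, centers)
logits = [-4.0, -3.0, -2.0, 0.0, 1.5, 3.0, 0.5, -2.0]  # made-up network outputs
print(bce_loss(logits, soft), expected_offset(logits, centers))
```

Because each bin is scored with its own sigmoid, every per-bin probability lies in (0, 1), in contrast to a regression target whose error can grow arbitrarily large. The expectation step then recovers a continuous offset from the discrete bins.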


      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 20, Issue 5
      May 2024
      650 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3613634
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 January 2024
      Online AM: 13 December 2023
      Accepted: 29 November 2023
      Revised: 25 October 2023
      Received: 13 December 2022
      Published in TOMM Volume 20, Issue 5

      Author Tags

      1. Object detection
      2. distance-aware offset bin classification
      3. offset probability distribution
      4. expectation-based offset prediction
      5. hierarchical focusing method

      Qualifiers

      • Research-article

      Funding Sources

      • Sichuan Province Innovative Talent Funding Project for Postdoctoral Fellows
      • National Natural Science Foundation of China
      • China Postdoctoral Science Foundation
