DOI: 10.1145/3474085.3475500
research-article

FTAFace: Context-enhanced Face Detector with Fine-grained Task Attention

Published: 17 October 2021

Abstract

In face detection, a common strategy is to treat samples differently according to their difficulty in order to balance the training data distribution. However, we observe that widely used sampling strategies, such as OHEM and Focal loss, can lead to a performance imbalance between different tasks (e.g., classification and localization). Our analysis indicates that, because these sample-based strategies are driven by classification information, they struggle to coordinate the attention paid to different tasks during training, which leads to this imbalance. We first confirm this by shifting the attention from the sample level to the task level. We then propose a fine-grained task attention method (FTA), comprising inter-task importance and intra-task importance, which adaptively adjusts the attention of each item in the task from both global and local perspectives to achieve finer optimization. In addition, we introduce a transformer as a feature enhancer to assist our convolutional network, and propose a context enhancement transformer (CET) to mine the spatial relationships in the features for a more robust feature representation. Extensive experiments on the WiderFace and FDDB benchmarks demonstrate that our method significantly boosts baseline performance by 2.7%, 2.3%, and 4.9% on the easy, medium, and hard validation sets, respectively. Furthermore, the proposed FTAFace-light achieves higher accuracy than the state of the art while reducing the amount of computation by 28.9%.
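To make the phrase "driven by classification information" concrete, the following is a minimal sketch of the standard focal loss modulating factor, not of the FTA method itself (the abstract gives no formulas for FTA). The function names, the default `gamma`, and the example localization errors are illustrative assumptions. It shows that the per-sample weight depends only on the classification probability `p_t`, so two samples with very different localization errors receive the same weight.

```python
import math


def focal_weight(p_t: float, gamma: float = 2.0) -> float:
    """Modulating factor (1 - p_t)^gamma of the standard focal loss.

    p_t is the predicted probability of the ground-truth class.
    """
    return (1.0 - p_t) ** gamma


def focal_loss(p_t: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for a single sample: -alpha * (1 - p_t)^gamma * log(p_t)."""
    return -alpha * focal_weight(p_t, gamma) * math.log(p_t)


# Two hypothetical samples: same classification confidence,
# very different (made-up) localization errors.
samples = [
    {"p_t": 0.9, "loc_err": 0.05},
    {"p_t": 0.9, "loc_err": 0.80},
]

# The sample-level weight ignores loc_err entirely.
weights = [focal_weight(s["p_t"]) for s in samples]
print(weights[0] == weights[1])
```

Because the weight is computed from `p_t` alone, a well-classified but poorly localized sample is down-weighted just as strongly as a well-classified, well-localized one; this is the kind of sample-level behavior the abstract contrasts with task-level attention.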

Supplementary Material

ZIP File (mfp1803aux.zip)
FTAFace: Context-enhanced Face Detector with Fine-grained Task Attention - Supplementary


Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. context feature enhancement
  2. face detector
  3. task attention

Qualifiers

  • Research-article

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

