DOI: 10.1145/3664647.3681655
research-article
Open access

RDLNet: A Novel and Accurate Real-world Document Localization Method

Published: 28 October 2024

Abstract

The increasing use of smartphones to capture documents under varied real-world conditions has underscored the need for robust document localization. Current challenges in this domain include diverse document types, complex backgrounds, and difficult photographic conditions such as low contrast and occlusion. However, no publicly available dataset currently covers these complex scenarios, and few methods have demonstrated their capabilities on such scenes. To address these issues, we create a new comprehensive real-world document localization benchmark dataset that contains the complex scenarios mentioned above, and we propose a novel Real-world Document Localization Network (RDLNet) for locating targeted documents in the wild. RDLNet consists of an innovative light-SAM encoder and a masked attention decoder. Through the light-SAM encoder, RDLNet transfers the strong generalization capability of SAM to the document localization task. In the decoding stage, RDLNet uses masked attention and object queries to efficiently output triple-branch predictions, consisting of corner point coordinates, instance-level segmentation areas, and document categories, without extra post-processing. We compare RDLNet with other state-of-the-art approaches for real-world document localization on multiple benchmarks; the results show that RDLNet remarkably outperforms contemporary methods in both accuracy and practicality.
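The abstract names masked attention, object queries, and a triple-branch output, but does not give the decoder's internals. As a rough illustration only, the sketch below shows the commonly used masked cross-attention formulation (each query attends only to positions its previous mask prediction marks as foreground) followed by three hypothetical linear heads. The function names, tensor shapes, the zero threshold on mask logits, and the head weights `w_cls` and `w_corner` are all assumptions made for this example, not details taken from RDLNet.

```python
import numpy as np


def masked_cross_attention(queries, features, prev_mask_logits):
    """Generic masked cross-attention (illustrative, not the RDLNet code).

    queries:          (Q, D) object query embeddings
    features:         (N, D) flattened image features
    prev_mask_logits: (Q, N) mask logits predicted by the previous layer
    """
    dim = queries.shape[1]
    scores = queries @ features.T / np.sqrt(dim)        # (Q, N)
    keep = prev_mask_logits > 0.0                       # assumed foreground threshold
    # A query whose mask is empty falls back to attending everywhere.
    empty = ~keep.any(axis=1, keepdims=True)
    keep = keep | empty
    scores = np.where(keep, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ features, weights


def triple_branch_heads(query_out, features, w_cls, w_corner):
    """Hypothetical heads: categories, four corner points, instance masks."""
    class_logits = query_out @ w_cls                     # (Q, C)
    corners = 1.0 / (1.0 + np.exp(-(query_out @ w_corner)))
    corners = corners.reshape(-1, 4, 2)                  # normalised (x, y) per corner
    mask_logits = query_out @ features.T                 # (Q, N)
    return class_logits, corners, mask_logits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(3, 8))          # 3 queries, 8-dim embeddings
    f = rng.normal(size=(16, 8))         # 16 feature positions
    prev = rng.normal(size=(3, 16))      # stand-in for previous mask logits
    out, _ = masked_cross_attention(q, f, prev)
    cls, corners, masks = triple_branch_heads(
        out, f, rng.normal(size=(8, 2)), rng.normal(size=(8, 8))
    )
    print(cls.shape, corners.shape, masks.shape)
```

In this sketch each query yields one document hypothesis directly (a category, four corners, and a mask), which is one way to read the abstract's claim that no extra post-processing is needed; the actual network may differ.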

Cited By

  • (2024) MRCI: Multi-range Context Interaction for Boundary Refinement in Image Segmentation. Pattern Recognition. 10.1007/978-3-031-80136-5_15 (211-226). Online publication date: 1-Dec-2024

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. distillation
    2. document localization
    3. encoder and decoder based network
    4. novel benchmark dataset
    5. triple branch prediction

    Qualifiers

    • Research-article

    Conference

    MM '24
    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (Last 12 months): 93
    • Downloads (Last 6 weeks): 26
    Reflects downloads up to 08 Feb 2025
