research-article

Open access

RDLNet: A Novel and Accurate Real-world Document Localization Method

Authors:

Lianwen Jin

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

Pages 9847 - 9855

https://doi.org/10.1145/3664647.3681655

Published: 28 October 2024

Abstract

The increasing use of smartphones for capturing documents in various real-world conditions has underscored the need for robust document localization technologies. Current challenges in this domain include handling diverse document types, complex backgrounds, and varying photographic conditions such as low contrast and occlusion. However, no publicly available dataset currently covers these complex scenarios, and few methods demonstrate their capabilities on such scenes. To address these issues, we create a new comprehensive real-world document localization benchmark dataset that contains the complex scenarios mentioned above, and we propose a novel Real-world Document Localization Network (RDLNet) for locating targeted documents in the wild. RDLNet consists of an innovative light-SAM encoder and a masked attention decoder. Through the light-SAM encoder, RDLNet transfers the strong generalization capability of SAM to the document localization task. In the decoding stage, RDLNet uses masked attention and object queries to efficiently output triple-branch predictions, consisting of corner point coordinates, instance-level segmentation areas, and document categories, without extra post-processing. We compare RDLNet with other state-of-the-art approaches for real-world document localization on multiple benchmarks; the results show that RDLNet remarkably outperforms contemporary methods, demonstrating its superiority in both accuracy and practicability.
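The abstract names masked attention with object queries but gives no equations. As a rough illustration only, the sketch below shows generic masked cross-attention: each object query attends only to the feature locations its mask permits. The function name, the list-of-lists data layout, and the fallback to full attention when a mask is empty are all assumptions made for this example; none of it is taken from RDLNet's implementation.

```python
import math


def masked_cross_attention(queries, keys, values, mask):
    """Generic masked cross-attention (illustrative; not RDLNet's actual code).

    queries: Q x D list of lists, one row per object query
    keys, values: N x D lists of lists, one row per image feature location
    mask: Q x N booleans; True means the query may attend to that location
    """
    dim = len(queries[0])
    out = []
    for q, allowed in zip(queries, mask):
        # Assumption: a query whose mask is empty falls back to full attention.
        if not any(allowed):
            allowed = [True] * len(keys)
        # Scaled dot-product scores; masked-out locations are marked None.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if ok else None
            for k, ok in zip(keys, allowed)
        ]
        # Numerically stable softmax over the permitted locations only.
        top = max(s for s in scores if s is not None)
        weights = [0.0 if s is None else math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


# Toy inputs: two queries, three feature locations, two channels.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
# The first query may see only location 0; the second has an empty mask.
mask = [[True, False, False], [False, False, False]]
result = masked_cross_attention(queries, keys, values, mask)
```

In this toy run the first query can only copy the value at location 0, while the second query, whose mask is empty, mixes all three locations. In a decoder of this general kind, each query's mask would typically come from a segmentation prediction made at the previous layer, which keeps attention focused on the candidate document region; whether RDLNet does exactly this is not stated in the abstract.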

Cited By

Wu Y., Lyu W., Liang X., Zheng Q., Wei J., Jin L. (2024). MRCI: Multi-range Context Interaction for Boundary Refinement in Image Segmentation. Pattern Recognition, pp. 211-226. Online publication date: 1-Dec-2024.
https://doi.org/10.1007/978-3-031-80136-5_15

Index Terms

RDLNet: A Novel and Accurate Real-world Document Localization Method
1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Document scanning

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

October 2024

11719 pages

ISBN:9798400706868

DOI:10.1145/3664647

General Chairs:
Jianfei Cai (Monash University, Australia)
Mohan Kankanhalli (NUS, Singapore)
Balakrishnan Prabhakaran (UT Dallas, USA)
Susanne Boll (University of Oldenburg, Germany)

Program Chairs:
Ramanathan Subramanian (University of Canberra & IIT Ropar, Australia)
Liang Zheng (Australian National University, Australia)
Vivek K. Singh (Rutgers University, USA)
Pablo Cesar (Centrum Wiskunde & Informatica, Netherlands)
Lexing Xie (Australian National University, Australia)
Dong Xu (University of Hong Kong, Hong Kong)

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives International 4.0 License.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MM '24

Sponsor:

SIGMM

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
