research-article

Document Registration: Towards Automated Labeling of Pixel-Level Alignment Between Warped-Flat Documents

Authors:

Weiguang Zhang,

Xiaomeng Gu

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

Pages 9933 - 9942

https://doi.org/10.1145/3664647.3681548

Published: 28 October 2024

Abstract

Photographed documents are prevalent but often suffer from deformations such as curves or folds, which hinder readability. Document dewarping has therefore been widely studied, but its performance remains unsatisfactory because real training samples with pixel-level annotation are lacking. To obtain pixel-level labels, we leverage a document registration pipeline that automatically aligns warped-flat document pairs. Unlike general image registration, document registration poses unique challenges because of severe deformations and fine-grained textures. In this paper, we introduce a coarse-to-fine framework comprising a coarse registration network (CRN), which aims to eliminate severe deformations, followed by a fine registration network (FRN), which focuses on fine-grained features. In addition, we use self-supervised learning to initialize our document registration model, proposing a cross-reconstruction pre-training task on pairs of warped-flat documents. Extensive experiments show that we achieve satisfactory document registration performance and consequently obtain a high-quality registered document dataset with pixel-level annotation. Without bells and whistles, we re-train two popular document dewarping models on our registered document dataset, WarpDoc-R, and obtain superior performance compared with those trained on almost 100× as much synthetic training data, verifying the label quality of our document registration method.
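As a rough illustration of the coarse-to-fine idea, the sketch below shows how two successive sampling maps can be merged into a single pixel-level map. It is not the authors' implementation. In the paper the maps would be predicted by the CRN and FRN, whereas here they are hand-written toy shifts. The function names (`identity_map`, `sample`, `compose`), the backward-map convention and the nearest-neighbour sampling are all assumptions made for the example.

```python
import numpy as np

def identity_map(h, w):
    """Backward map of shape (h, w, 2) in which each pixel samples itself (row, col)."""
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([rows, cols], axis=-1).astype(np.float64)

def sample(src, backward_map):
    """Nearest-neighbour lookup: out[r, c] = src[backward_map[r, c]], clamped to the borders."""
    h, w = src.shape[:2]
    rr = np.clip(np.rint(backward_map[..., 0]).astype(int), 0, h - 1)
    cc = np.clip(np.rint(backward_map[..., 1]).astype(int), 0, w - 1)
    return src[rr, cc]

def compose(coarse_map, fine_map):
    """Merge two backward maps: look up the coarse map at the positions the fine map points to."""
    return sample(coarse_map, fine_map)

# Toy data (hypothetical): a 6x8 "warped" image whose pixel values encode their position.
h, w = 6, 8
warped = np.arange(h * w).reshape(h, w)

# Stand-ins for network outputs: a coarse shift of 2 columns and a fine shift of 1 row.
coarse = identity_map(h, w)
coarse[..., 1] += 2
fine = identity_map(h, w)
fine[..., 0] += 1

# Applying the stages one after another...
two_step = sample(sample(warped, coarse), fine)
# ...gives the same result as applying the single composed map to the warped image.
one_step = sample(warped, compose(coarse, fine))
```

Under these assumptions, the composed map plays the role of a dense pixel-level label. For every pixel of the registered output it records which pixel of the warped photograph to read.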

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Document scanning
    2. Document preparation
      1. Annotation

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

October 2024

11719 pages

ISBN:9798400706868

DOI:10.1145/3664647

General Chairs: Jianfei Cai (Monash University, Australia), Mohan Kankanhalli (NUS, Singapore), Balakrishnan Prabhakaran (UT Dallas, USA), Susanne Boll (University of Oldenburg, Germany)

Program Chairs: Ramanathan Subramanian (University of Canberra & IIT Ropar, Australia), Liang Zheng (Australian National University, Australia), Vivek K. Singh (Rutgers University, USA), Pablo Cesar (Centrum Wiskunde & Informatica, Netherlands), Lexing Xie (Australian National University, Australia), Dong Xu (University of Hong Kong, Hong Kong)

Copyright © 2024 ACM.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
