DOI: 10.1145/3664647.3681548
research-article

Document Registration: Towards Automated Labeling of Pixel-Level Alignment Between Warped-Flat Documents

Published: 28 October 2024

Abstract

Photographed documents are prevalent but often suffer from deformations such as curves or folds, which hinder readability. Document dewarping has therefore been widely studied, but its performance remains unsatisfactory because real training samples with pixel-level annotation are lacking. To obtain pixel-level labels, we use a document registration pipeline that automatically aligns warped-flat document pairs. Unlike general image registration, document registration poses unique challenges because of severe deformations and fine-grained textures. In this paper, we introduce a coarse-to-fine framework consisting of a coarse registration network (CRN), which aims to eliminate severe deformations, followed by a fine registration network (FRN), which focuses on fine-grained features. In addition, we use self-supervised learning to initialize our document registration model, proposing a cross-reconstruction pre-training task on pairs of warped-flat documents. Extensive experiments show that our method achieves satisfactory document registration performance, yielding a high-quality registered document dataset with pixel-level annotation. Without bells and whistles, we re-train two popular document dewarping models on our registered document dataset, WarpDoc-R, and obtain performance superior to that of models trained on almost 100× as much synthetic data, which verifies the label quality of our document registration method.
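The coarse-to-fine idea in the abstract can be illustrated with a small sketch. This is not the paper's CRN or FRN. It only shows how two dense coordinate maps, one coarse and one fine, could be composed into a single pixel-level label. The function names (`identity_map`, `bilinear_sample`, `compose_maps`), the backward-map convention (each flat pixel stores the (y, x) position it samples from), and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np


def identity_map(h, w):
    """Return an (h, w, 2) map where pixel (y, x) stores its own (y, x)."""
    ys, xs = np.meshgrid(
        np.arange(h, dtype=float), np.arange(w, dtype=float), indexing="ij"
    )
    return np.stack([ys, xs], axis=-1)


def bilinear_sample(field, ys, xs):
    """Sample an (H, W, C) array at float coordinates, clamping to the border."""
    h, w = field.shape[:2]
    ys = np.clip(ys, 0, h - 1)
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[..., None]  # fractional offsets, broadcast over channels
    wx = (xs - x0)[..., None]
    top = field[y0, x0] * (1 - wx) + field[y0, x1] * wx
    bottom = field[y1, x0] * (1 - wx) + field[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def compose_maps(coarse_map, fine_map):
    """Compose two backward maps: final(p) = coarse_map(fine_map(p)).

    coarse_map: flat grid -> warped photo (removes the large deformation).
    fine_map:   flat grid -> coarsely aligned image (small residual correction).
    """
    return bilinear_sample(coarse_map, fine_map[..., 0], fine_map[..., 1])


# Hypothetical stand-ins for the two stages' outputs on an 8x8 grid.
coarse = identity_map(8, 8)
coarse[..., 1] += 2.0   # coarse stage: shift 2 px along x
fine = identity_map(8, 8)
fine[..., 1] += 0.5     # fine stage: residual 0.5 px along x
final = compose_maps(coarse, fine)

# Resample a warped image with the composed map to get the aligned image.
warped = np.random.default_rng(0).random((8, 8))
aligned = bilinear_sample(warped[..., None], final[..., 0], final[..., 1])[..., 0]
```

In this sketch, `final` plays the role of the pixel-level annotation: for every pixel of the flat document it gives the corresponding location in the warped photograph.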

      Published In

      MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
      October 2024
      11719 pages
      ISBN: 9798400706868
      DOI: 10.1145/3664647
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. document dewarping
      2. document registration
      3. image matching
      4. photographed documents
      5. pixel-level alignment

      Qualifiers

      • Research-article

      Conference

      MM '24: The 32nd ACM International Conference on Multimedia
      October 28 - November 1, 2024
      Melbourne VIC, Australia

      Acceptance Rates

      MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

      Article Metrics

      • Total Citations: 0
      • Total Downloads: 41
      • Downloads (Last 12 months): 41
      • Downloads (Last 6 weeks): 31
      Reflects downloads up to 25 Dec 2024
