Object Recognition from Scientific Document Based on Compartment and Text Blocks Refinement Framework

Li, Jinghong; Gu, Wen; Ota, Koichi; Hasegawa, Shinobu

doi:10.1007/s42979-024-03130-7

Original Research
Published: 23 August 2024

Volume 5, article number 816 (2024)
SN Computer Science

Jinghong Li¹ (ORCID: 0009-0000-6203-8512),
Wen Gu²,
Koichi Ota² &
Shinobu Hasegawa²

Abstract

With the rapid growth of the internet over the past decade, efficiently extracting valuable information from vast resources has become increasingly important. This capability is crucial for establishing a comprehensive digital ecosystem, particularly for research surveys and comprehension. These tasks rest on the accurate extraction and deep mining of data from scientific documents, which is essential for building a robust data infrastructure. However, parsing raw data and extracting data from complex scientific documents remain ongoing challenges. Current data extraction methods for scientific documents typically use rule-based (RB) or machine learning (ML) approaches. Rule-based methods can incur high coding costs for articles with intricate typesetting. Conversely, relying solely on machine learning requires annotating the complex content types within scientific documents, which can be costly. Additionally, few studies have thoroughly defined and explored the hierarchical layout of scientific documents. The lack of a comprehensive definition of a document's internal structure and elements indirectly affects the accuracy of text classification and object recognition tasks. Starting from an analysis of the standard layout and typesetting used in a specified publication, we propose a new document layout analysis framework called Compartment and Text Blocks Refinement (CTBR). First, we divide scientific documents into a hierarchy of base domains, compartments, and text blocks. Next, we explore and classify the meanings of text blocks in depth. Finally, we use the text block classification results to implement object recognition within scientific documents, based on rule-based compartment segmentation. For validation, we used articles from proceedings in the well-known ACL format as experimental data. The experiments show that our approach achieved over 95% text block classification accuracy and 90% object recognition accuracy for tables and figures.
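
The three-level hierarchy named in the abstract can be pictured as nested containers. The following is only a minimal sketch: the abstract does not define the levels in detail, so it assumes nothing more than that a base domain holds compartments and a compartment holds text blocks. All class names, field names, labels, and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

# Assumed bounding-box convention: (x0, y0, x1, y1) in page coordinates.
BBox = Tuple[float, float, float, float]


@dataclass
class TextBlock:
    bbox: BBox
    text: str
    label: str = "unlabeled"  # hypothetical class label, e.g. "body" or "caption"


@dataclass
class Compartment:
    name: str
    blocks: List[TextBlock] = field(default_factory=list)


@dataclass
class BaseDomain:
    name: str
    compartments: List[Compartment] = field(default_factory=list)

    def iter_blocks(self) -> Iterator[TextBlock]:
        # Walk the hierarchy top-down and yield every text block.
        for compartment in self.compartments:
            yield from compartment.blocks


def count_labels(domain: BaseDomain) -> Dict[str, int]:
    # Tally how many text blocks carry each label.
    counts: Dict[str, int] = {}
    for block in domain.iter_blocks():
        counts[block.label] = counts.get(block.label, 0) + 1
    return counts


# Made-up sample data for a two-column page.
page = BaseDomain(
    name="page-1",
    compartments=[
        Compartment("left-column", [
            TextBlock((50, 80, 290, 120), "1 Introduction", "heading"),
            TextBlock((50, 130, 290, 400), "Body text", "body"),
        ]),
        Compartment("right-column", [
            TextBlock((310, 300, 550, 320), "Figure 1: Overview", "caption"),
        ]),
    ],
)
label_counts = count_labels(page)
```

Keeping the labels on the lowest level means a later rule-based step could operate per compartment, for example by looking for caption blocks inside one compartment, without re-reading the whole page.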

Data Availability

We will release the dataset and all experimental data at an appropriate time.

Funding

This work was supported by JSPS KAKENHI Grant No. JP20H04295.

Author information

Jinghong Li and Shinobu Hasegawa have contributed equally to this work.

Authors and Affiliations

Division of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Asahidai, Nomi, Ishikawa, 9231292, Japan
Jinghong Li
Center for Innovative Distance Education and Research, Japan Advanced Institute of Science and Technology, Asahidai, Nomi, Ishikawa, 9231292, Japan
Wen Gu, Koichi Ota & Shinobu Hasegawa

Contributions

(1) We propose a novel framework for understanding the layout of scientific documents as a hierarchy of base domains, compartments, and text blocks, a structure that clearly represents the functionality of single-modal and multi-modal elements.

(2) To process text blocks, the fundamental elements of scientific document layout analysis in this work, we developed an integrated encoding template that highlights their characteristics. The template covers the dimensions, coordinates, font type, font size, and text density of each text block.

(3) To differentiate the types of information conveyed by each text block, we manually annotated linguistic and non-linguistic information over a short period. This produced a small-scale dataset for implementing a machine-learning-based text block classification module. Our approach has a relatively low time cost for training on a specific set of scientific documents, which enables accurate multi-modal text block classification and information extraction for large volumes of similarly formatted scientific documents.

(4) Based on the classification results, we implemented a compartment segmentation module that improves the identification of figures and tables and achieves more accurate object recognition in complex cases. To evaluate the effectiveness of the proposed method for object recognition, we conducted comparison experiments with existing multi-modal document processing models.
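
The encoding template in contribution (2) lists dimensions, coordinates, font type, font size, and text density. A minimal sketch of such an encoding is given below. It is not the authors' template: the input shape mirrors the block dictionaries returned by PyMuPDF's `page.get_text("dict")` (a `bbox` plus `lines` of `spans` with `font`, `size`, and `text`), and whether the authors used that call is an assumption. The definition of text density as characters per unit of block area, the function name `encode_block`, and the sample block are likewise assumptions made for illustration.

```python
from collections import Counter


def encode_block(block):
    """Encode one text block as a flat feature dictionary (hypothetical template)."""
    x0, y0, x1, y1 = block["bbox"]
    width, height = x1 - x0, y1 - y0

    # Flatten all spans of all lines in the block.
    spans = [span for line in block["lines"] for span in line["spans"]]
    n_chars = sum(len(span["text"]) for span in spans)

    # Weight font name and font size by the number of characters they cover.
    font_chars = Counter()
    size_total = 0.0
    for span in spans:
        font_chars[span["font"]] += len(span["text"])
        size_total += span["size"] * len(span["text"])

    dominant_font = font_chars.most_common(1)[0][0] if font_chars else ""
    mean_size = size_total / n_chars if n_chars else 0.0

    # Assumed definition: characters per unit of bounding-box area.
    area = width * height
    density = n_chars / area if area > 0 else 0.0

    return {
        "x0": x0, "y0": y0, "x1": x1, "y1": y1,
        "width": width, "height": height,
        "font": dominant_font,
        "font_size": mean_size,
        "text_density": density,
    }


# Made-up block resembling a table caption.
sample_block = {
    "bbox": (50.0, 100.0, 250.0, 120.0),
    "lines": [
        {"spans": [
            {"font": "Bold", "size": 10.0, "text": "Table 1: "},
            {"font": "Regular", "size": 10.0, "text": "Results."},
        ]},
    ],
}
features = encode_block(sample_block)
```

A flat dictionary like this converts directly into a numeric feature vector once the font name is mapped to a category code, which is the form a conventional classifier for contribution (3) would expect.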

Corresponding authors

Correspondence to Jinghong Li, Wen Gu, Koichi Ota or Shinobu Hasegawa.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Research Involving Human and/or Animals

Not Applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, J., Gu, W., Ota, K. et al. Object Recognition from Scientific Document Based on Compartment and Text Blocks Refinement Framework. SN COMPUT. SCI. 5, 816 (2024). https://doi.org/10.1007/s42979-024-03130-7

Received: 21 December 2023
Accepted: 12 July 2024
Published: 23 August 2024
DOI: https://doi.org/10.1007/s42979-024-03130-7
