Object Recognition from Scientific Document Based on Compartment and Text Blocks Refinement Framework

  • Original Research
SN Computer Science

Abstract

With the rapid development of the internet over the past decade, efficiently extracting valuable information from vast resources has become increasingly important. It is crucial for establishing a comprehensive digital ecosystem, particularly in the context of research surveys and comprehension. These tasks rest on the accurate extraction and deep mining of data from scientific documents, which are essential for building a robust data infrastructure. However, parsing raw data and extracting data from complex scientific documents remain ongoing challenges. Current data extraction methods for scientific documents typically use rule-based (RB) or machine learning (ML) approaches. Rule-based methods can incur high coding costs for articles with intricate typesetting. Conversely, relying solely on machine learning methods requires costly annotation of the complex content types within scientific documents. Additionally, few studies have thoroughly defined and explored the hierarchical layout of scientific documents. The lack of a comprehensive definition of a document's internal structure and elements indirectly affects the accuracy of text classification and object recognition tasks. By analyzing the standard layout and typesetting used in a specified publication, we propose a new document layout analysis framework called Compartment and Text Blocks Refinement (CTBR). First, we divide scientific documents into three hierarchical levels: base domain, compartment, and text block. Next, we explore and classify the meanings of text blocks in depth. Finally, we use the text block classification results to implement object recognition within scientific documents, based on rule-based compartment segmentation. For validation, we used proceedings articles in the well-known ACL format as experimental data. The experiments show that our approach achieves over 95% text block classification accuracy and 90% object recognition accuracy for tables and figures.
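The three-level hierarchy named in the abstract (base domain, compartment, text block) can be pictured as nested containers. The following is only a minimal sketch under stated assumptions: the abstract does not define these levels in detail, so the class names, the `label` field, the rectangular bounding boxes, and the containment rule in `assign_blocks` are all hypothetical illustrations rather than the CTBR implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed convention: (x0, y0, x1, y1) in page coordinates.
BBox = Tuple[float, float, float, float]


@dataclass
class TextBlock:
    bbox: BBox
    text: str
    label: str = "unknown"  # hypothetical class label, e.g. "body" or "caption"


@dataclass
class Compartment:
    bbox: BBox
    blocks: List[TextBlock] = field(default_factory=list)


@dataclass
class BaseDomain:
    name: str  # hypothetical identifier, e.g. "page-1"
    compartments: List[Compartment] = field(default_factory=list)


def contains(outer: BBox, inner: BBox) -> bool:
    """Return True if `inner` lies entirely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def assign_blocks(domain: BaseDomain, blocks: List[TextBlock]) -> List[TextBlock]:
    """Place each block in the first compartment that fully contains it.

    Returns the blocks that fit no compartment (assumed rule, for illustration).
    """
    unassigned = []
    for block in blocks:
        for comp in domain.compartments:
            if contains(comp.bbox, block.bbox):
                comp.blocks.append(block)
                break
        else:
            unassigned.append(block)
    return unassigned


# Usage: a two-column page with one block per column and one spanning block.
page = BaseDomain("page-1", [Compartment((0, 0, 300, 800)),
                             Compartment((300, 0, 600, 800))])
leftover = assign_blocks(page, [
    TextBlock((10, 10, 290, 50), "Left column paragraph"),
    TextBlock((310, 10, 590, 50), "Right column paragraph"),
    TextBlock((10, 60, 590, 100), "Block spanning both columns"),
])
```

In this sketch, a block that spans both columns is returned as unassigned, which is the kind of case a refinement step would need to handle separately.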

Data Availability

We will release the dataset and all experimental data at an appropriate time.

Funding

This work was supported by JSPS KAKENHI Grant No. JP20H04295.

Author information

Contributions

(1) We propose a novel framework for understanding the layout of scientific documents as a hierarchical structure. The framework comprises base domains, compartments, and text blocks, and its hierarchy clearly represents the function of single-modal and multi-modal elements.

(2) To process text blocks, which are the fundamental elements of scientific document layout analysis in this work, we developed an integrated encoding template that highlights their characteristics. The encoded features cover the dimensions, coordinates, font type, font size, and text density of each text block.

(3) To differentiate the types of information conveyed by each text block, we manually annotated linguistic and non-linguistic information within a short period. This allowed us to create a small-scale dataset for implementing a machine-learning-based text block classification module. Our approach is characterized by a relatively low time cost for training on a specific set of scientific documents, which enables accurate multi-modal text block classification and information extraction for large volumes of similarly formatted scientific documents.

(4) Based on the classification results, we implemented a compartment segmentation module that improves the identification of figures and tables and achieves more accurate object recognition in complex cases. To evaluate the effectiveness of the proposed method for object recognition, we conducted comparison experiments with existing multi-modal document processing models.
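The encoding template in contribution (2) can be illustrated with a small feature-vector sketch. This is an assumption-laden example, not the authors' template: the `RawBlock` type, the normalization by page size, the integer font vocabulary, and the definition of text density as characters per unit of block area are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RawBlock:
    # Hypothetical container for one extracted text block.
    x0: float
    y0: float
    x1: float
    y1: float
    font_name: str
    font_size: float
    text: str


def encode_block(block: RawBlock, page_width: float, page_height: float,
                 font_vocab: List[str]) -> List[float]:
    """Encode a text block as a flat numeric feature vector.

    Features: normalized coordinates, normalized dimensions, font id,
    font size, and text density (assumed here: characters per unit area).
    """
    width = block.x1 - block.x0
    height = block.y1 - block.y0
    area = width * height
    density = len(block.text) / area if area > 0 else 0.0
    # Fonts missing from the vocabulary map to -1 (illustrative convention).
    font_id = font_vocab.index(block.font_name) if block.font_name in font_vocab else -1
    return [
        block.x0 / page_width,
        block.y0 / page_height,
        width / page_width,
        height / page_height,
        float(font_id),
        block.font_size,
        density,
    ]


# Usage: a 100 x 50 block holding 50 characters on a 200 x 100 page.
features = encode_block(
    RawBlock(0.0, 0.0, 100.0, 50.0, "Times", 10.0, "a" * 50),
    page_width=200.0, page_height=100.0, font_vocab=["Times", "Courier"],
)
```

A vector of this shape could then be passed to any standard classifier; the choice of classifier and of feature scaling is left open here.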

Corresponding authors

Correspondence to Jinghong Li, Wen Gu, Koichi Ota or Shinobu Hasegawa.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Research Involving Human and/or Animals

Not Applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, J., Gu, W., Ota, K. et al. Object Recognition from Scientific Document Based on Compartment and Text Blocks Refinement Framework. SN COMPUT. SCI. 5, 816 (2024). https://doi.org/10.1007/s42979-024-03130-7

  • DOI: https://doi.org/10.1007/s42979-024-03130-7
