DOI: 10.1145/3219819.3219834
research-article
Open access

Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale

Published: 19 July 2018

Abstract

Over the past few decades, the number of scientific articles and the amount of technical literature have increased exponentially. Consequently, there is a great need for systems that can ingest these documents at scale and make the contained knowledge discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) and the presentation of the data (e.g. complex tables) make the extraction of qualitative and quantitative data extremely challenging. In this paper, we present a modular, cloud-based platform to ingest documents at scale. This platform, called the Corpus Conversion Service (CCS), implements a pipeline that allows users to parse and annotate documents (i.e. collect ground-truth), train machine-learning classification algorithms, and ultimately convert any type of PDF or bitmap document to a structured content representation format. We will show that each of the modules is scalable due to an asynchronous microservice architecture and can therefore handle massive amounts of documents. Furthermore, we will show that machine-learning algorithms accelerate our gathering of ground-truth by at least one order of magnitude. This allows us both to gather large amounts of ground-truth in very little time and to obtain very good precision/recall metrics, in the range of 99%, for content conversion to structured output. The CCS platform is currently deployed on IBM internal infrastructure and serves more than 250 active users for knowledge-engineering project engagements.
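
As a rough illustration of the precision/recall metrics mentioned in the abstract, the following minimal Python sketch computes both values for a single label over aligned sequences of ground-truth and predicted labels. The function name `precision_recall`, the label names, and the toy data are assumptions made for illustration only; they are not taken from the CCS platform, whose evaluation procedure is not described on this page.

```python
from collections import Counter


def precision_recall(truth, predicted, label):
    """Return (precision, recall) for one label over aligned label sequences."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    counts = Counter()
    for t, p in zip(truth, predicted):
        if p == label and t == label:
            counts["tp"] += 1  # predicted the label and it was correct
        elif p == label:
            counts["fp"] += 1  # predicted the label but truth differs
        elif t == label:
            counts["fn"] += 1  # missed an item that carries the label
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    # Guard against division by zero when the label never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: one label per page element.
truth = ["title", "text", "text", "table", "text"]
predicted = ["title", "text", "table", "table", "text"]

for name in ("text", "table"):
    p, r = precision_recall(truth, predicted, name)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy example one "text" element is mislabelled as "table", which lowers recall for "text" and precision for "table" while leaving the other two values at 1.0.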

Supplementary Material

MP4 File (staar_corpus_conversion.mp4)


Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ai
  2. artificial intelligence
  3. asynchronous architecture
  4. cloud architecture
  5. cloud computing
  6. deep learning
  7. document conversion
  8. ibm
  9. ibm research
  10. knowledge ingestion
  11. machine learning
  12. pdf
  13. table processing

Qualifiers

  • Research-article

Funding Sources

  • Swiss National Science Foundation
  • Horizon 2020 NMBP-23-2016

Conference

KDD '18

Acceptance Rates

KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 226
  • Downloads (Last 6 weeks): 30
Reflects downloads up to 10 Oct 2024

Citations

Cited By

  • (2024) OMNIPARSER: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15641-15653. DOI: 10.1109/CVPR52733.2024.01481. Online publication date: 16-Jun-2024
  • (2024) An ontology-based text mining dataset for extraction of process-structure-property entities. Scientific Data 11:1. DOI: 10.1038/s41597-024-03926-5. Online publication date: 10-Oct-2024
  • (2024) Datasets and annotations for layout analysis of scientific articles. International Journal on Document Analysis and Recognition (IJDAR). DOI: 10.1007/s10032-024-00461-2. Online publication date: 18-Mar-2024
  • (2023) Semantically enabling clinical decision support recommendations. Journal of Biomedical Semantics 14:1. DOI: 10.1186/s13326-023-00285-9. Online publication date: 18-Jul-2023
  • (2023) An Arabic Manuscript Regions Detection, Recognition and Its Applications for OCRing. ACM Transactions on Asian and Low-Resource Language Information Processing 22:1, pp. 1-28. DOI: 10.1145/3532609. Online publication date: 13-Feb-2023
  • (2023) Skin Tone Analysis for Representation in Educational Materials (STAR-ED) using machine learning. npj Digital Medicine 6:1. DOI: 10.1038/s41746-023-00881-0. Online publication date: 18-Aug-2023
  • (2023) Optimized Table Tokenization for Table Structure Recognition. Document Analysis and Recognition - ICDAR 2023, pp. 37-50. DOI: 10.1007/978-3-031-41679-8_3. Online publication date: 19-Aug-2023
  • (2023) ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents. Document Analysis and Recognition - ICDAR 2023, pp. 471-482. DOI: 10.1007/978-3-031-41679-8_27. Online publication date: 19-Aug-2023
  • (2023) Text Mining-Innovationen für Unternehmen. Neue Trends in Wirtschaftsinformatik und eingesetzte Technologien, pp. 51-65. DOI: 10.1007/978-3-031-32538-0_4. Online publication date: 28-Jul-2023
  • (2022) FETA. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 29873-29888. DOI: 10.5555/3600270.3602436. Online publication date: 28-Nov-2022
