Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale

Authors: Peter W J Staar, Michele Dolfi, Christoph Auer, Costas Bekas

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 774 - 782

https://doi.org/10.1145/3219819.3219834

Published: 19 July 2018

Abstract

Over the past few decades, the volume of scientific articles and technical literature has grown exponentially. Consequently, there is a great need for systems that can ingest these documents at scale and make the knowledge they contain discoverable. Unfortunately, both the format of these documents (e.g. PDF or bitmap images) and the presentation of the data (e.g. complex tables) make the extraction of qualitative and quantitative data extremely challenging. In this paper, we present a modular, cloud-based platform to ingest documents at scale. This platform, called the Corpus Conversion Service (CCS), implements a pipeline that allows users to parse and annotate documents (i.e. collect ground-truth), train machine-learning classification algorithms, and ultimately convert any type of PDF or bitmap document to a structured content representation format. We show that each module is scalable thanks to an asynchronous microservice architecture and can therefore handle massive numbers of documents. Furthermore, we show that machine-learning algorithms accelerate ground-truth gathering by at least one order of magnitude. This allows us both to gather large amounts of ground-truth in very little time and to obtain precision/recall metrics in the range of 99% for content conversion to structured output. The CCS platform is currently deployed on IBM internal infrastructure and serves more than 250 active users in knowledge-engineering project engagements.
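To make the precision/recall figures quoted in the abstract concrete, the following is a minimal sketch of how such metrics could be computed per label. It assumes, hypothetically, that every parsed cell carries one predicted label and one ground-truth label; the function name `precision_recall` and the label names are illustrative and are not taken from the paper.

```python
from collections import Counter


def precision_recall(truth, predicted):
    """Return {label: (precision, recall)} for two equal-length label lists."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    true_pos = Counter()    # cells where the prediction matches the ground truth
    pred_count = Counter()  # how often each label was predicted
    true_count = Counter()  # how often each label occurs in the ground truth

    for t, p in zip(truth, predicted):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            true_pos[t] += 1

    scores = {}
    for label in set(true_count) | set(pred_count):
        # Guard against division by zero for labels never predicted or never present.
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / true_count[label] if true_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Hypothetical labels for six parsed cells on one page (not data from the paper).
truth = ["title", "text", "text", "table", "text", "table"]
predicted = ["title", "text", "table", "table", "text", "table"]
scores = precision_recall(truth, predicted)
```

In this toy input one `text` cell is mislabelled as `table`, which lowers the recall of `text` and the precision of `table` while leaving `title` unaffected.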

Supplementary Material

MP4 File (staar_corpus_conversion.mp4), 346.45 MB

Index Terms

Computer systems organization
  Architectures
    Distributed architectures
      Cloud computing
Computing methodologies
  Artificial intelligence
    Computer vision
      Computer vision problems
        Object detection
    Natural language processing
  Machine learning
    Machine learning approaches
      Neural networks

Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

July 2018

2925 pages

ISBN: 9781450355520

DOI: 10.1145/3219819

General Chairs: Yike Guo (Imperial College London), Faisal Farooq (IBM)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Swiss National Science Foundation
Horizon 2020 NMBP-23-2016

Conference

KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 19 - 23, 2018

London, United Kingdom

Acceptance Rates

KDD '18 Paper Acceptance Rate: 107 of 983 submissions, 11%

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%


