-
Docling Technical Report
Authors:
Christoph Auer,
Maksym Lysak,
Ahmed Nassar,
Michele Dolfi,
Nikolaos Livathinos,
Panos Vagenas,
Cesar Berrospi Ramis,
Matteo Omenetti,
Fabian Lindlbauer,
Kasper Dinkla,
Lokesh Mishra,
Yusik Kim,
Shubham Gupta,
Rafael Teixeira de Lima,
Valery Weber,
Lucas Morin,
Ingmar Meijer,
Viktor Kuropiatnyk,
Peter W. J. Staar
Abstract:
This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addit…
▽ More
This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.
△ Less
Submitted 30 August, 2024; v1 submitted 19 August, 2024;
originally announced August 2024.
-
Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs
Authors:
Lokesh Mishra,
Sohayl Dhibi,
Yusik Kim,
Cesar Berrospi Ramis,
Shubham Gupta,
Michele Dolfi,
Peter Staar
Abstract:
Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as…
▽ More
Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
INDUS: Effective and Efficient Language Models for Scientific Applications
Authors:
Bishwaranjan Bhattacharjee,
Aashka Trivedi,
Masayasu Muraoka,
Muthukumaran Ramasubramanian,
Takuma Udagawa,
Iksha Gurung,
Rong Zhang,
Bharath Dandala,
Rahul Ramachandran,
Manil Maskey,
Kaylin Bugbee,
Mike Little,
Elizabeth Fancher,
Lauren Sanders,
Sylvain Costes,
Sergi Blanco-Cuaresma,
Kelly Lockhart,
Thomas Allen,
Felix Grezes,
Megan Ansdell,
Alberto Accomazzi,
Yousef El-Kurdi,
Davis Wertheimer,
Birgit Pfitzmann,
Cesar Berrospi Ramis
, et al. (9 additional authors not shown)
Abstract:
Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this pivotal insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics,…
▽ More
Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this pivotal insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics, planetary sciences and astrophysics domains and trained using curated scientific corpora drawn from diverse data sources. The suite of models include: (1) an encoder model trained using domain-specific vocabulary and corpora to address natural language understanding tasks, (2) a contrastive-learning-based general text embedding model trained using a diverse set of datasets drawn from multiple sources to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation techniques to address applications which have latency or resource constraints. We also created three new scientific benchmark datasets namely, CLIMATE-CHANGE-NER (entity-recognition), NASA-QA (extractive QA) and NASA-IR (IR) to accelerate research in these multi-disciplinary fields. Finally, we show that our models outperform both general-purpose encoders (RoBERTa) and existing domain-specific encoders (SciBERT) on these new tasks as well as existing benchmark tasks in the domains of interest.
△ Less
Submitted 20 May, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
Otter-Knowledge: benchmarks of multimodal knowledge graph representation learning from different sources for drug discovery
Authors:
Hoang Thanh Lam,
Marco Luca Sbodio,
Marcos Martínez Galindo,
Mykhaylo Zayats,
Raúl Fernández-Díaz,
Víctor Valls,
Gabriele Picco,
Cesar Berrospi Ramis,
Vanessa López
Abstract:
Recent research on predicting the binding affinity between drug molecules and proteins use representations learned, through unsupervised learning techniques, from large databases of molecule SMILES and protein sequences. While these representations have significantly enhanced the predictions, they are usually based on a limited set of modalities, and they do not exploit available knowledge about e…
▽ More
Recent research on predicting the binding affinity between drug molecules and proteins use representations learned, through unsupervised learning techniques, from large databases of molecule SMILES and protein sequences. While these representations have significantly enhanced the predictions, they are usually based on a limited set of modalities, and they do not exploit available knowledge about existing relations among molecules and proteins. In this study, we demonstrate that by incorporating knowledge graphs from diverse sources and modalities into the sequences or SMILES representation, we can further enrich the representation and achieve state-of-the-art results for drug-target binding affinity prediction in the established Therapeutic Data Commons (TDC) benchmarks. We release a set of multimodal knowledge graphs, integrating data from seven public data sources, and containing over 30 million triples. Our intention is to foster additional research to explore how multimodal knowledge enhanced protein/molecule embeddings can improve prediction tasks, including prediction of binding affinity. We also release some pretrained models learned from our multimodal knowledge graphs, along with source code for running standard benchmark tasks for prediction of biding affinity.
△ Less
Submitted 19 October, 2023; v1 submitted 22 June, 2023;
originally announced June 2023.
-
Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness
Authors:
Christoph Auer,
Michele Dolfi,
André Carvalho,
Cesar Berrospi Ramis,
Peter W. J. Staar
Abstract:
Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as…
▽ More
Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as Optical Character Recognition (OCR), layout analysis, table-structure recovery, figure understanding, etc. We observe the adoption of such methods in document understanding solutions offered by all major cloud providers. Yet, publications outlining how such services are designed and optimized to scale in the cloud are scarce. In this paper, we focus on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with a strong reliance on machine-learning methods on cloud infrastructure. Our key objective is to achieve high scalability and responsiveness for different workload profiles in a well-defined resource budget. We outline the requirements, design, and implementation choices of our document conversion service and reflect on the challenges we faced. Evidence for the scaling behavior and resource efficiency is provided for two alternative workload distribution strategies and deployment configurations. Our best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.
△ Less
Submitted 1 June, 2022;
originally announced June 2022.