DocILE Benchmark for Document Information Localization and Extraction

Šimsa, Štěpán; Šulc, Milan; Uřičář, Michal; Patel, Yash; Hamdi, Ahmed; Kocián, Matěj; Skalický, Matyáš; Matas, Jiří; Doucet, Antoine; Coustaty, Mickaël; Karatzas, Dimosthenis

arXiv:2302.05658 (cs)

[Submitted on 11 Feb 2023 (v1), last revised 3 May 2023 (this version, v2)]

Abstract: This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer; applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset, baselines and supplementary material are available at this https URL.

Comments:	Accepted to ICDAR 2023
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2302.05658 [cs.CL]
	(or arXiv:2302.05658v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2302.05658

Submission history

From: Milan Šulc
[v1] Sat, 11 Feb 2023 11:32:10 UTC (842 KB)
[v2] Wed, 3 May 2023 16:24:58 UTC (812 KB)
