Linked-DocRED - Enhancing DocRED with Entity-Linking to Evaluate End-To-End Document-Level Information Extraction Pipelines

Authors:

Pierre-Yves Genest,

Pierre-Edouard Portier,

Elöd Egyed-Zsigmond,

Martino Lovisetto

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 3064 - 3074

https://doi.org/10.1145/3539618.3591912

Published: 18 July 2023

Abstract

Information Extraction (IE) pipelines aim to extract meaningful entities and relations from documents and structure them into a knowledge graph that can then be used in downstream applications. Training and evaluating such pipelines requires a dataset annotated with entities, coreferences, relations, and entity-linking. However, existing datasets lack entity-linking labels, are too small, are not diverse enough, or are automatically annotated (that is, without a strong guarantee that the annotations are correct). Therefore, we propose Linked-DocRED, to the best of our knowledge the first manually-annotated, large-scale, document-level IE dataset. We enhance the existing and widely-used DocRED dataset with entity-linking labels generated by a semi-automatic process that guarantees high-quality annotations. In particular, we use hyperlinks in Wikipedia articles to provide disambiguation candidates. We also propose a complete framework of metrics to benchmark end-to-end IE pipelines, and we define an entity-centric metric to evaluate entity-linking. The evaluation of a baseline shows promising results while highlighting the challenges of an end-to-end IE pipeline. Linked-DocRED, the source code for the entity-linking, the baseline, and the metrics are distributed under an open-source license and can be downloaded from a public repository.

Supplemental Material

MP4 File
