DOI: 10.1145/3539618.3591912
research-article

Linked-DocRED - Enhancing DocRED with Entity-Linking to Evaluate End-To-End Document-Level Information Extraction Pipelines

Published: 18 July 2023 Publication History

Abstract

Information Extraction (IE) pipelines aim to extract meaningful entities and relations from documents and structure them into a knowledge graph that can then be used in downstream applications. Training and evaluating such pipelines requires a dataset annotated with entities, coreferences, relations, and entity-linking. However, existing datasets lack entity-linking labels, are too small, are not diverse enough, or are automatically annotated (that is, without a strong guarantee that the annotations are correct). Therefore, we propose Linked-DocRED, which is, to the best of our knowledge, the first manually annotated, large-scale, document-level IE dataset. We enhance the existing and widely used DocRED dataset with entity-linking labels generated through a semi-automatic process that guarantees high-quality annotations. In particular, we use hyperlinks in Wikipedia articles to provide disambiguation candidates. We also propose a complete framework of metrics to benchmark end-to-end IE pipelines, and we define an entity-centric metric to evaluate entity-linking. The evaluation of a baseline shows promising results while highlighting the challenges of an end-to-end IE pipeline. Linked-DocRED, the source code for the entity-linking, the baseline, and the metrics are distributed under an open-source license and can be downloaded from a public repository.

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
