Datasets for Out-of-KB Mention Discovery with Entity Linking

Dong, Hang; Chen, Jiaoyan; He, Yuan; Liu, Yinan; Horrocks, Ian

doi:10.5281/zenodo.8228371

Published August 9, 2023 | Version v1

Dataset Open

1. University of Oxford
2. University of Manchester & University of Oxford
3. Nankai University

This repository contains datasets for out-of-KB mention discovery from text, as documented in the paper "Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking" (CIKM 2023), available on arXiv: https://arxiv.org/abs/2302.07189

Each data setting (a sub-folder) contains train, valid, and test files, along with 100 random sample files for each data split, intended for debugging.

Data folders whose names end in "syn_full" contain the synonym-augmented data (each synonym treated as an entity) for that setting.
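The folder naming convention above can be checked programmatically once the archive is extracted. The following is a minimal sketch, assuming only that each data setting is a sub-folder of some extraction directory; the function name `list_settings` and the example path are hypothetical and not part of the released scripts.

```python
from pathlib import Path


def list_settings(root):
    """Map each sub-folder name under `root` to True if it is a "syn_full" variant.

    `root` is the directory the archive was extracted into (an assumption;
    the actual layout inside the zip may include extra nesting).
    """
    root = Path(root)
    return {
        p.name: p.name.endswith("syn_full")
        for p in sorted(root.iterdir())
        if p.is_dir()  # plain files at the top level are ignored
    }


# Example usage (hypothetical path):
# for name, is_syn_full in list_settings("out_of_kb_datasets").items():
#     print(name, "synonym-augmented" if is_syn_full else "original")
```

Only the "syn_full" suffix is taken from the description; any other structure should be confirmed by inspecting the extracted folders.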

Each ontology .jsonl file comes in two versions: the "syn_attr" setting treats synonyms as attributes, while the "syn_full" setting treats synonyms as entities.
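Because the ontology files are in .jsonl format (one JSON object per line), they can be read line by line without loading a whole file into memory. The sketch below assumes standard JSON Lines encoding in UTF-8; the field names of the records are not documented here, so the hypothetical helper `peek_keys` simply reports which keys appear in the first few records. The file name in the usage comment is also hypothetical.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def peek_keys(path, n=5):
    """Return the sorted union of keys found in the first `n` records.

    Assumes each record is a JSON object; non-object records are skipped.
    """
    keys = set()
    for i, record in enumerate(read_jsonl(path)):
        if i >= n:
            break
        if isinstance(record, dict):
            keys.update(record.keys())
    return sorted(keys)


# Example usage (hypothetical file name):
# print(peek_keys("ontology_syn_attr.jsonl"))
```

Comparing the keys of a "syn_attr" file with those of its "syn_full" counterpart is one way to see how synonyms are represented in each version.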

Data scripts are available at https://github.com/KRR-Oxford/BLINKout#data-scripts

The data sources are acknowledged below:

ShARe/CLEF 2013 dataset is from https://physionet.org/content/shareclefehealth2013/1.0/

MedMentions dataset is from https://github.com/chanzuckerberg/MedMentions

UMLS (versions 2012AB, 2014AB, 2017AA) is from https://www.nlm.nih.gov/research/umls/index.html

SNOMED CT (corresponding versions) is from https://www.nlm.nih.gov/healthit/snomedct/index.html

NILK dataset is from https://zenodo.org/record/6607514

WikiData 2017 dump is from https://archive.org/download/enwiki-20170220/enwiki-20170220-pages-articles.xml.bz2

Files

Files (113.4 MB)

Name	Size	MD5
Out-of-KB datasets 9 Aug 2023 version.zip	113.4 MB	ceaaca5f07ff36c651b547cc55c0c1e1
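The MD5 checksum listed for the archive can be used to confirm that a download completed without corruption. A minimal sketch using Python's standard `hashlib` follows; the helper name `md5_of_file` and the assumption that the zip sits in the current directory under its listed name are illustrative only.

```python
import hashlib

# Checksum as listed for the archive on this record.
EXPECTED_MD5 = "ceaaca5f07ff36c651b547cc55c0c1e1"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Example usage (assumes the archive is in the current directory):
# ok = md5_of_file("Out-of-KB datasets 9 Aug 2023 version.zip") == EXPECTED_MD5
# print("checksum matches" if ok else "checksum mismatch, re-download")
```

Reading in chunks keeps memory use constant regardless of archive size.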

