Given a piece of text in any language, a cross-lingual wikifier identifies mentions of named entities and grounds them to the corresponding entries in the English Wikipedia. This project implements the approaches proposed in the following two papers:
- Cross-Lingual Wikification Using Multilingual Embeddings (Tsai and Roth, NAACL 2016)
- Cross-Lingual Named Entity Recognition via Wikification (Tsai et al., CoNLL 2016)
This demo gives some intuition about the project. It was presented at COLING 2016 (see the paper and poster).
For CogComp members, resources for more than 40 languages are on our servers. The paths are specified in config/xlwikifier-demo.config. You only need to create the following symbolic link in the root of this project:
ln -s /shared/preprocessed/ctsai12/multilingual/xlwikifier-data xlwikifier-data
If you cannot access the CogComp servers, note that we currently release resources for only three languages: English, Spanish, and Chinese. Download this file, which contains MapDB indices of the Freebase dump and of the English, Spanish, and Chinese Wikipedias. Follow the README inside to extract the files, then set the corresponding paths in the config file.
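Before building, it can help to confirm that the resource directory actually resolves. The following is a minimal sketch, not part of this project: the helper name `check_data_dir` is hypothetical, and it assumes the resources are reachable through a single directory path (for example the `xlwikifier-data` symbolic link above, or whatever path you set in the config file).

```shell
# Hypothetical helper (not shipped with this project): succeed only if
# the given path resolves to a readable directory. A symbolic link that
# points to a directory also passes the -d test.
check_data_dir() {
    dir="$1"
    if [ -d "$dir" ] && [ -r "$dir" ]; then
        echo "ok: $dir"
        return 0
    fi
    echo "missing or unreadable: $dir" >&2
    return 1
}

# Example use, assuming the symbolic link name from above:
#   check_data_dir xlwikifier-data && mvn compile
```

A dangling symbolic link fails the `-d` test, so this catches a mistyped link target as well as a missing download.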
mvn dependency:copy-dependencies
mvn compile
./scripts/run-benchmark.sh es config/xlwikifier-tac.config
This script runs and evaluates the system on the TAC-KBP 2016 EDL shared task for the given language (en: English, es: Spanish, zh: Chinese). You need to specify the paths to the evaluation documents and the gold annotations in the config file; see config/xlwikifier-tac.config for an example. These documents are in the original format provided by LDC. Using the official evaluation script, this package achieves the following performance on named entities:
English
strong mention match: Precision:93.4 Recall:83.7 F1:88.3
strong typed mention match: Precision:90.3 Recall:80.9 F1:85.4
strong typed all match: Precision:80.9 Recall:72.6 F1:76.5
Spanish
strong mention match: Precision:88.4 Recall:81.8 F1:85.0
strong typed mention match: Precision:85.7 Recall:79.3 F1:82.3
strong typed all match: Precision:78.1 Recall:72.3 F1:75.1
Chinese
strong mention match: Precision:87.0 Recall:72.8 F1:79.3
strong typed mention match: Precision:83.2 Recall:69.6 F1:75.8
strong typed all match: Precision:77.5 Recall:64.9 F1:70.6
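The F1 values above can be sanity-checked from the reported precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean, 2PR / (P + R); the function name `f1` is ours and is not part of the official evaluation script.

```shell
# f1 P R: print the harmonic mean of precision P and recall R,
# rounded to one decimal place. Hypothetical helper for checking
# the tables above; it is not the official scorer.
f1() {
    awk -v p="$1" -v r="$2" 'BEGIN {
        if (p + r == 0) { print "0.0"; exit }
        printf "%.1f\n", 2 * p * r / (p + r)
    }'
}

f1 93.4 83.7   # English, strong mention match: prints 88.3
f1 87.0 72.8   # Chinese, strong mention match: prints 79.3
```

Because the reported precision and recall are themselves rounded, a recomputed F1 can differ from the reported one by 0.1. For instance, `f1 90.3 80.9` prints 85.3, while the table lists 85.4, presumably because the scorer works from unrounded values.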
Chen-Tse Tsai (ctsai12@illinois.edu)