2020
Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings
Bikash Gyawali | Lucas Anastasiou | Petr Knoth
Proceedings of the Twelfth Language Resources and Evaluation Conference
Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for deduplication of scholarly documents. We develop a hybrid model which uses structural similarity (locality sensitive hashing) and meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection constitutes a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues causing data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We productionise our method as a publicly accessible web API service serving deduplication of scholarly documents in real time.
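As a rough illustration of the two signals named in the abstract, the sketch below pairs a structural score with a meaning score. It is not the authors' implementation. It uses MinHash signatures (the signature scheme underlying one common family of locality sensitive hashing, with the bucketing step omitted) and averages toy word vectors as a stand-in for real word embeddings. The function names, the thresholds, the toy vectors, and the rule that both scores must pass are all assumptions made for illustration.

```python
import hashlib
import math

def shingles(text, k=3):
    """Set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(shingle_set, num_perm=64):
    """MinHash signature: per seeded hash function, the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big")
            for s in shingle_set))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def mean_vector(text, vectors, dim):
    """Average the vectors of known words: a crude document embedding."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two vectors, 0.0 if either has zero length."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0

def is_near_duplicate(a, b, vectors, dim, struct_min=0.5, meaning_min=0.9):
    """Hypothetical rule: both similarity scores must clear their thresholds."""
    structural = estimated_jaccard(minhash(shingles(a)), minhash(shingles(b)))
    meaning = cosine(mean_vector(a, vectors, dim), mean_vector(b, vectors, dim))
    return structural >= struct_min and meaning >= meaning_min

# Toy inputs; the 2-dimensional vectors are made up for the example.
TOY_VECTORS = {"deduplication": [1.0, 0.0], "documents": [0.8, 0.2], "surface": [0.0, 1.0]}
doc_a = "deduplication of scholarly documents using hashing"
doc_b = "deduplication of scholarly documents using embeddings"
doc_c = "surface realisation from knowledge bases"

print(estimated_jaccard(minhash(shingles(doc_a)), minhash(shingles(doc_b))))
print(is_near_duplicate(doc_a, doc_c, TOY_VECTORS, dim=2))
```

A production system would typically band the signatures into hash buckets so that only documents sharing a bucket are compared, rather than scoring every pair.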
Proceedings of the 8th International Workshop on Mining Scientific Publications
Petr Knoth | Christopher Stahl | Bikash Gyawali | David Pride | Suchetha N. Kunnath | Drahomira Herrmannova
Proceedings of the 8th International Workshop on Mining Scientific Publications
Overview of the 2020 WOSP 3C Citation Context Classification Task
Suchetha Nambanoor Kunnath | David Pride | Bikash Gyawali | Petr Knoth
Proceedings of the 8th International Workshop on Mining Scientific Publications
The 3C Citation Context Classification task is the first shared task addressing citation context classification. The two subtasks, A and B, associated with this shared task involve the classification of citations based on their purpose and influence, respectively. Both tasks use a portion of the new ACT dataset, developed by researchers at The Open University, UK. The tasks were hosted on Kaggle, and the participating systems were evaluated using the macro F-score. Three teams participated in subtask A and four teams participated in subtask B. The best performing systems obtained an overall score of 0.2056 for subtask A and 0.5556 for subtask B, outperforming the simple majority class baseline models, which scored 0.11489 and 0.32249, respectively. In this paper we provide a report specifying the shared task, the dataset used, a short description of the participating systems and the final results obtained by the teams based on the evaluation criteria. The shared task has been organised as part of the 8th International Workshop on Mining Scientific Publications (WOSP 2020).
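To make the evaluation measure concrete, the following minimal sketch computes a macro-averaged F1 score and applies it to a majority class baseline. The labels and the class distribution are invented for the example and are not taken from the ACT dataset; `macro_f1` is an illustrative helper, not the task's scoring script.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical gold labels: 6 of class "a", 3 of class "b", 1 of class "c".
gold = ["a"] * 6 + ["b"] * 3 + ["c"] * 1

# Majority class baseline: always predict the most frequent gold label.
majority = max(set(gold), key=gold.count)
baseline = [majority] * len(gold)

print(macro_f1(gold, baseline))  # 0.25: F1 is 0.75 for "a" and 0 for "b" and "c"
```

Because every class counts equally in the average, a baseline that never predicts the minority classes scores far below its plain accuracy (0.6 in this toy case), which is why a majority class baseline can be a low bar under this measure.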
2015
A Domain Agnostic Approach to Verbalizing n-ary Events without Parallel Corpora
Bikash Gyawali | Claire Gardent | Christophe Cerisara
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)
2014
Surface Realisation from Knowledge-Bases
Bikash Gyawali | Claire Gardent
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
2013
LOR-KBGEN, A Hybrid Approach To Generating from the KBGen Knowledge-Base
Bikash Gyawali | Claire Gardent
Proceedings of the 14th European Workshop on Natural Language Generation