MC4WEPS: a multilingual corpus for Web people search disambiguation

Authors:

Raquel Martínez,

Leonardo Campillos,

Agustín D. Delgado,

Víctor Fresno,

Felisa Verdejo

Language Resources and Evaluation, Volume 51, Issue 3

Pages 805 - 832

https://doi.org/10.1007/s10579-016-9365-4

Published: 01 September 2017

Abstract

This article introduces the MC4WEPS corpus, a new resource for evaluating Web people search disambiguation tasks. It describes the design of the corpus, its collection and annotation process, and the agreement between annotators, and it presents a baseline evaluation. The corpus is built by compiling multilingual search engine results for queries that are person names. Proper noun disambiguation is an open problem in natural language ambiguity resolution, and resolving the ambiguity of person names in Web search results remains particularly challenging. However, state-of-the-art approaches have been evaluated only on monolingual web page collections. The MC4WEPS corpus aims to provide the research community with a reference corpus for the task of disambiguating search engine results where the query is a person name shared by homonymous individuals. Two features distinguish this corpus from existing corpora for the same task: multilingualism and the inclusion of social networking websites. These characteristics make it more representative of a real search scenario, especially for evaluating person name disambiguation in a multilingual context. The article also includes detailed information about the format and availability of the corpus.
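
To make the annotation setting more concrete, the following is a minimal sketch of how agreement between two annotators could be measured when each groups the search results for one person-name query by individual. It uses the Rand index as an illustrative measure; the abstract does not say which agreement measure the article applies, so the choice of measure, the function name `rand_index`, and the sample labels are all assumptions made for this example.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of page pairs on which two annotators agree.

    A pair counts as an agreement when both annotators place the two
    pages under the same individual, or both place them under
    different individuals.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same pages")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one page: nothing to disagree on
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical annotations (not taken from the corpus): five search
# results for one query, each labeled with an individual identifier.
annotator_1 = ["p1", "p1", "p2", "p2", "p3"]
annotator_2 = ["x", "x", "y", "z", "z"]

score = rand_index(annotator_1, annotator_2)
print(f"Rand index: {score:.2f}")
```

The label names themselves do not need to match across annotators, because the comparison is made on pairs of pages rather than on the identifiers each annotator chose.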

Published In

Language Resources and Evaluation Volume 51, Issue 3

September 2017

303 pages

ISSN:1574-020X

Copyright © 2017 Springer Science+Business Media B.V.

Publisher

Springer-Verlag

Berlin, Heidelberg
