MC4WEPS: a multilingual corpus for Web people search disambiguation

Published: 01 September 2017


This article introduces the MC4WEPS corpus, a new resource for evaluating Web people search disambiguation tasks. It describes the design of the corpus, its collection and annotation process, and the agreement between the different annotators, and it presents a baseline evaluation. The corpus is built by compiling multilingual search engine results for queries that are person names. Proper noun disambiguation is an open problem in natural language ambiguity resolution; in particular, resolving the ambiguity of person names in Web search results remains challenging. However, state-of-the-art approaches have been evaluated only on monolingual web page collections. The MC4WEPS corpus aims to provide the research community with a reference corpus for the task of disambiguating search engine results where the query is a person name shared by different individuals. Two features distinguish this corpus from existing corpora for the same task: multilingualism and the inclusion of social networking websites. These characteristics make it more representative of a real search scenario, especially for evaluating person name disambiguation in a multilingual context. The article also provides detailed information about the format and availability of the corpus.



      Published In

      Language Resources and Evaluation, Volume 51, Issue 3
      September 2017
      303 pages



      Berlin, Heidelberg

      Author Tags

      1. Annotation
      2. Corpus linguistics
      3. Multilingual
      4. People name disambiguation

