ABSTRACT Named entities (NEs) can facilitate access to multilingual knowledge sources--which have exploded in recent years--but the identification, classification, and retrieval of NEs remain challenging tasks.
Journal of the Association for Information Science and Technology, 2014
ABSTRACT In this article, we present a new algorithm for clustering a bilingual collection of comparable news items into groups of specific topics. Our hypothesis is that named entities (NEs) are more informative than other features of the news when clustering fine-grained topics. The algorithm does not need any input about the number of clusters, and carries out the clustering based only on the named entities that the news items share. The proposal is evaluated on different data sets and outperforms other state-of-the-art algorithms, which supports the plausibility of the approach. In addition, because the applicability of our approach depends on being able to identify equivalent named entities across the news items, we propose a heuristic system that identifies equivalent named entities in the same and in different languages, and that obtains good performance.
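The idea of clustering without a preset number of clusters, using only shared named entities, can be illustrated with a minimal sketch. This is not the algorithm from the article, whose details the abstract does not give. It assumes that each news item has already been reduced to a set of normalized entity strings (so that equivalent entities in both languages map to the same string), and it simply links two items when they share at least `min_shared` entities, then reads off the connected components. The function name and the `min_shared` threshold are hypothetical.

```python
from itertools import combinations


def cluster_by_shared_entities(docs, min_shared=2):
    """Group documents that are linked by shared named entities.

    docs: dict mapping a document id to a set of normalized entity strings.
    min_shared: hypothetical threshold; two documents are linked when they
    share at least this many entities.
    The number of clusters is not an input; it falls out of the links.
    """
    # Union-find structure: every document starts in its own cluster.
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every pair that shares enough entities.
    for a, b in combinations(docs, 2):
        if len(docs[a] & docs[b]) >= min_shared:
            parent[find(a)] = find(b)

    # Collect documents by their root to obtain the final clusters.
    clusters = {}
    for doc_id in docs:
        clusters.setdefault(find(doc_id), set()).add(doc_id)
    return sorted(sorted(members) for members in clusters.values())


# Toy bilingual collection (invented for illustration only).
news = {
    "en1": {"obama", "merkel", "berlin"},
    "es1": {"obama", "merkel", "madrid"},
    "en2": {"nadal", "wimbledon"},
    "es2": {"nadal", "wimbledon", "federer"},
}
print(cluster_by_shared_entities(news))
```

Connected components are the simplest way to avoid fixing the number of clusters in advance; a real system would likely need a stricter linking criterion to keep frequent entities from merging unrelated topics.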
Meeting of the Association for Computational Linguistics, 2006
This paper presents an approach for Multilingual Document Clustering in comparable corpora. The algorithm is of heuristic nature and it uses as unique evidence for clustering the identification of cognate named entities between both sides of the comparable corpora. One of the main advantages of this approach is that it does not depend on bilingual or
ABSTRACT Cognates are words in different languages that have similar spelling and meaning. The identification of cognates is very useful for many different Natural Language Processing tasks, and also in the process of learning a second language. This paper presents a new approach to classifying pairs of words as cognates, false friends, or unrelated words. The proposed approach uses a fuzzy system to combine complementary string similarity measures in order to improve the cognate identification task. The underlying hypothesis is that combining different string measures by applying heuristic knowledge can outperform those measures working separately. The results obtained by the proposed system confirm this hypothesis, and the system also outperforms other systems that combine string measures by using a supervised approach. As an additional contribution, we have created a bilingual test data set that includes pairs of cognates, false friends, and unrelated words in Spanish and English, and that is freely available for research purposes.
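As a rough illustration of what combining complementary string similarity measures can look like, the sketch below averages two common measures: a similarity derived from Levenshtein edit distance and the Dice coefficient over character bigrams. This is an assumption made for illustration. The paper combines its measures with a fuzzy system, which is not reproduced here, and the abstract does not say which measures it uses. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def bigrams(word):
    """Set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over the two words' character bigram sets."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        # Words too short to have bigrams: fall back to exact match.
        return 1.0 if a == b else 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def combined_similarity(a, b):
    """Plain average of the two measures (a stand-in for the fuzzy combination)."""
    a, b = a.lower(), b.lower()
    return (edit_similarity(a, b) + dice_similarity(a, b)) / 2


# English/Spanish pairs chosen only as examples.
for pair in [("family", "familia"), ("hotel", "hotel"), ("family", "mesa")]:
    print(pair, round(combined_similarity(*pair), 3))
```

Spelling similarity alone cannot separate cognates from false friends, since both look alike; that distinction depends on meaning, which this sketch does not model.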