Lemmatization
Recent papers in Lemmatization
In corpus studies when analysing a word, it is often convenient to see all or several grammatical forms of this word as one group. Such a group is referred to as a lemma. Traditionally the lemma is seen as a group of words that share the...
In this paper, we describe an approach to lemmatisation for Russian nouns, which makes use of a large-scale inheritance lexicon implemented in the lexical representation language DATR (Evans and Gazdar 1996). The lexicon was compiled...
This journal provides an overview of Natural Language Processing (NLP). This will provide an introduction to NLP and is also intended to focus on the discussion of the current challenges of Natural Language Processing, NLP libraries, and...
This paper deals with the impact of complex morphological structures on essential aspects of lexicology. On the basis of data from the Kartvelian (South-Caucasian) language family consisting of Georgian and its sister-languages, it...
This paper discusses the theoretical bases as well as the pragmatic implementation of the lemmatization of the Late Latin Charter Treebanks (LLCT). LLCT is a set of three dependency treebanks (LLCT1, LLCT2, LLCT3) of Early Medieval Latin...
This short article presents the result of accuracy tests for currently available Ancient Greek lemmatizers and recently published lemmatized corpora. We ran a blinded experiment in which three highly proficient readers of Ancient Greek...
The paper presents a spoken corpus of the endangered Torlak dialect from the Timok area of Southeast Serbia. This dialect expresses a great deal of variation in the use of non-standard features under the influence of standard Serbian...
An algorithm and software were developed for preprocessing film reviews written in Polish. The algorithm contains the following steps: Text Adaptation Procedure; Procedure of Tokenization;...
Zulu uses a conjunctive writing system, that is, a system whereby relatively short linguistic words are joined together to form long orthographic words with complex morphological structures. This has led to the so-called 'stem tradition'...
This research focuses on the implementation of Gramatika, a grammar checker designed for the Filipino language given its available resources and linguistic tools. The checker uses hybrid n-grams generated from n-grams of words,...
In this paper we consider a set of natural language processing techniques that can be used to analyze large amounts of texts, focusing on the advanced tokenizer which accounts for a number of complex linguistic phenomena, as well as for...
The focus of this study is Hellenistic Greek, a variation of Greek that continues to be of particular interest within the humanities. The Hellenistic variant of Greek, we argue, requires tools that are specifically tuned to its...
In this research article an in-depth investigation is presented of the lexicographic treatment of the demonstrative copulative (DC) in Sesotho sa Leboa. This one case study serves as an example to illustrate the so-called 'paradigmatic...
This article proposes using productive derivational morphology mechanisms to group all words derived from the same grammatical root into a single morphological family. In particular, ...
One of the most important prior tasks for robust part-of-speech tagging is the correct tokenization or segmentation of the texts. This task can involve processes which are much more complex than the simple identification of the different...
We describe an on-going project to develop a lexical database of American Sign Language (ASL) as a tool for annotating ASL corpora collected in the United States. Labs within our team complete locally chosen fields using their notation...
Lemmatization, finding the basic morphological form of a word in a corpus, is an important step in many natural language processing tasks when working with morphologically rich languages. We describe and evaluate Nefnir, a new open source...
In this article, we consider the problem of supervised morphological analysis using an approach that differs from the analogs widespread in industry. The article describes a new method of lemmatization based on machine learning algorithms, in...
We consider a set of natural language processing techniques based on finite-state technology that can be used to analyze huge amounts of texts. These techniques include an advanced tokenizer, a part-of-speech tagger that can manage...