Lemmatization Research Papers

As the number of non-English-language web searchers grows, the efficient handling of non-English Web documents and user queries is becoming a major issue for search engines. The main aim of this review paper is to... more

The production of Medieval and Premodern Occitan corpora: problems and prospects for work. This repository contains the files and models described in: Jean-Baptiste Camps and Gilles Guilhem Couffignal, « La production de corpus... more

In this article we present a series of Natural Language Processing techniques applied to term normalization in Textual Information Retrieval. The aim of these techniques is the treatment of the phenomena of... more

Automatic methods of morphological analysis and lemmatization designed for standard literary Russian may yield poor results when applied to so-called social media (microblogs, social networks and... more

- by Tatiana Shavrina and Alexey Sorokin
- Social Media, Corrections, Spelling Errors, Levenshtein distance
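
The tags above mention spelling errors and Levenshtein distance. As a rough illustration only (not the authors' method), a noisy social-media token can be mapped to its nearest lexicon entry by edit distance before it is lemmatized. The function names, the `max_distance` threshold and the sample vocabulary below are assumptions made for this sketch.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest(token: str, vocabulary: Iterable[str],
            max_distance: int = 2) -> Optional[str]:
    """Return the vocabulary word nearest to token, or None if none is close enough."""
    words = list(vocabulary)
    if not words:
        return None
    best = min(words, key=lambda w: levenshtein(token, w))
    return best if levenshtein(token, best) <= max_distance else None


# Hypothetical vocabulary, for illustration only.
print(closest("speling", ["spelling", "spoiling", "selling"]))
```

A real pipeline would restrict the candidate set (for example by first letter or length) rather than scan the whole lexicon for every token.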

This research focuses on the implementation of Gramatika, a grammar checker designed for the Filipino language given its available resources and linguistic tools. The checker uses hybrid n-grams generated from n-grams of words,... more

This project sets out to discover and develop techniques for the lemmatisation of a historical corpus of the Cornish language in order that a lemmatised dictionary macrostructure can be generated from the corpus. The system should be... more

- by Jon Mills
- Lexicology, Cornish Language, Computational Linguistics, Lexicography

A spelling checker application is a tool that can detect spelling errors in a word or text. Spelling checkers for Indonesian generally work by comparing... more

- by Rossella Mosti
- Vocabulary, Lexicography, Dictionary, Lemma

The paper presents the system used in the EvaLatin shared task to POS tag and lemmatize Latin. It consists of two components. A gradient boosting machine (LightGBM) is used for POS tagging, mainly fed with pre-computed word embeddings of... more

- by Giuseppe G. A. Celano
- Machine Learning, Latin Language, POS tagging, Lemmatization

This paper presents a newly funded international project for machine translation and automated analysis of ancient cuneiform languages where NLP specialists and Assyriologists collaborate to create an information retrieval system for... more

- by Émilie Pagé-Perron and Maria Sukhareva
- Information Retrieval, Languages and Linguistics, Natural Language Processing, Assyriology

Processing of the Arabic language is very important and topical these days. Arabic is the sixth most used language in the world. The problem of stemming is very important in information retrieval, knowledge mining, and language processing. The... more

- by Hussein Soori and Jan Platoš
- Arabic morphology, Information Retrieval, Arabic Natural Language Processing, Lemmatization

Traditionally, Zulu adjectives have been lemmatized under their stems only. In this research article, an in-depth analysis is undertaken to make a case for the lemmatization of all frequent adjectival forms with their adjective concords... more

- by Gilles-Maurice de Schryver
- Lexicography, Zulu, Adjectives, Lemmatization

The authors of this article firmly believe in the advantages of utilising a corpus for lemma-sign list creation. However, one should not overreact and assume that alternative methods for the creation of a dictionary’s macrostructure have... more

SUMMARY

This study examines L2 French learner corpora for adjective complexity through the data in The Newcastle Corpus (Myles & Mitchell, 2016). Data suggest that over the span of one year student usage patterns become more native-like by all metrics analyzed, providing pedagogical implications for corpus-informed pedagogy and data-driven learning.

Keywords: Corpus Linguistics, French, Adjectives, Syntax, Pedagogy, Genre, Learner Corpora, Oral Corpora

ABSTRACT
This study examines L2 French learner corpora for usage patterns in adjective complexity through the data in The Newcastle Corpus, French Language Learner Oral Corpora (FLLOC) (Myles & Mitchell, 2016).

French adjective variation in prenominal and postnominal placement has been notoriously difficult to explain through prescriptive rules (Sleeman et al., 2014), and variation is often attributed to semantic shifts (Alexiadou et al., 2007; Laenzlinger, 2005), usage contexts (Delbecque, 1990) and user preference (Thuilier, 2013). Recently, large-scale corpus research on native speaker production has suggested that variation is more closely linked to genre and speaker preference than was previously understood (Thuilier, 2013). Empirical data indicate that the nominal placement of French adjectives in the noun phrase (NP) is significantly affected by adjective lemma frequency and syllable length (Delbecque, 1990; Thuilier, 2013; Wilmet, 1980).

These native speaker metrics provide a framework to consider the progression of L2 French learning through corpus data (Filipovic & Hawkins, 2013; Shirato & Stapleton, 2007; Wulff & Gries, 2015). The comparison of data to native speaker norms is further facilitated by data of the Newcastle Corpus, which provides native and learner speaker productions for elicited language tasks (Myles & Mitchell, 2016). Thus, learner progress is analyzed through the complexity of adjectives in the NP as explored through lemma frequency, lemma syllable length, and nominal placement variation.

Data suggest that over the span of a year student usage patterns become more native-like by all metrics. More advanced students use greater placement variation and lemmas of longer syllable length than their counterparts. Additionally, lemma frequency and nominal placement are also more closely aligned with native speaker usage patterns. These data display the lexical and syntactic understanding acquired by L2 French students and have pedagogical implications for corpus-informed pedagogy and data-driven learning.

(1) Introduction (2) Corpora and the compilation of the lemma-sign list (3) Corpora and the battle against inconsistencies (4) Corpora as an aid for conjunctively written languages (5) Corpora as the key to writing better dictionary... more

This paper deals with the impact of complex morphological structures on essential aspects of lexicology. On the basis of data from the Kartvelian (South-Caucasian) language family consisting of Georgian and its... more

- by Jost Gippert
- Georgian Language, Lexicography, Kartvelian Languages, Svan language

The aim of this article is (a) to reflect on the contributions made by P.S. Groenewald to the field of lexicography in South Africa, focusing on the importance of determining the relative frequency of individual words in Sesotho sa Leboa,... more

In Zulu, there are three kinds of quantitatives: inclusive, exclusive and numeral. For the lemmatization of these, even existing traditional dictionaries felt the need to move away from a pure 'stem' approach towards a 'word' approach. In... more

- by Gilles-Maurice de Schryver
- Lexicography, Zulu, Lemmatization, Quantitative Pronouns

This short article presents the result of accuracy tests for currently available Ancient Greek lemmatizers and recently published lemmatized corpora. We ran a blinded experiment in which three highly proficient readers of Ancient Greek... more

This paper explores the problem of developing NLP tools for morphologically rich and orthographically inconsistent classical languages. It is a case study of building a lemmatizer for Old Irish using only a dictionary and an unlabeled... more

This cheat sheet summarizes the essentials of the work carried out during the CeLiSo (EA7332) workshop day « Créer soi-même un corpus étiqueté d'un million de mots en anglais, français ou allemand » (building your own tagged corpus of one million words in English, French or German) at Université Paris IV-Sorbonne on 25/03/2016. It... more

At a time when the quantity of more or less freely available data is increasing significantly, thanks to digital corpora, editions and libraries, the development of data mining tools and deep learning methods allows researchers to build a study corpus tailored to their research, to enrich their data and to exploit them. Open optical character recognition (OCR) tools can be adapted to old prints, incunabula or even manuscripts, with usable results, allowing the rapid creation of textual corpora. Alternating training and correction phases improves the quality of the results while rapidly accumulating raw text data. These data can then be structured, for example in XML/TEI, and enriched. The enrichment of the texts with graphic or linguistic annotations can also be automated. These processes, known to linguists and functional for modern languages, present difficulties for languages such as Medieval Occitan, due in part to the absence of sufficiently large lemmatized corpora. Suggestions for the creation of tools adapted to the considerable spelling variation of older language states will be presented, as well as experiments in the lemmatization of Medieval and Premodern Occitan. These techniques open the way to many uses. The much desired increase in the amount of available quality texts and data makes it possible to improve digital philology methods, provided that everyone takes the trouble to make their data freely available online and reusable. By presenting different technical solutions and some micro-analyses as examples, this paper aims to show part of what digital philology can offer to researchers in the Occitan domain, while recalling the ethical issues on which such practices are based.

Lemmatization is a process of finding the base morphological form (lemma) of a word. It is an important step in many natural language processing, information retrieval, and information extraction tasks, among others. We present an... more
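
To make the definition above concrete, here is a minimal sketch of a lemmatizer that first consults an exception lexicon and then falls back to suffix rules. It is not the method of the paper summarized above; the lexicon, the English suffix rules and the minimum stem length of three characters are all assumptions chosen for illustration.

```python
# Hypothetical toy resources; a real system would use a large dictionary
# or a trained model instead.
LEXICON = {"mice": "mouse", "went": "go", "better": "good"}
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]
MIN_STEM_LENGTH = 3  # assumed guard against stripping very short words


def lemmatize(word: str) -> str:
    """Return a lemma for word: lexicon lookup first, then suffix rules."""
    token = word.lower()
    if token in LEXICON:  # irregular forms are listed explicitly
        return LEXICON[token]
    for suffix, replacement in SUFFIX_RULES:
        stem_length = len(token) - len(suffix)
        if token.endswith(suffix) and stem_length >= MIN_STEM_LENGTH:
            return token[:stem_length] + replacement
    return token  # unknown and unsuffixed words are returned unchanged


for form in ["Mice", "studies", "walked", "cats", "is"]:
    print(form, "->", lemmatize(form))
```

Rules of this kind over-strip in many cases (for example, "running" would become "runn"), which is one reason practical lemmatizers rely on part-of-speech information and larger lexicons.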

Sentiment classification of texts written in Serbian is still an under-researched topic. One of the open issues is how the different forms of morphological normalization affect the performances of different sentiment classifiers and which... more

Stemming is the main step used for handling morphologically rich languages such as Arabic. It is usually used in several types of applications such as natural language processing, information retrieval, and text mining. The goal of... more

(1) Introduction (2) Lemma-sign lists in dictionaries for the elementary level (3) Compiling the lemma-sign list of the Junior Dictionary (4) Comparison of the compiled lemma-sign list with the manually excerpted vocabulary (5) The... more

REVIEW. Fragments (Studies in Iconology, 14), Leuven-Walpole, 2018 (Fragments is co-edited by Stephanie Heremans and is a limited edition and celebratory publication signed by the author). This 400-page book consists of 110 lemmata,... more

- by Barbara Baert
- Aby Warburg, Encyclopedism, Fragments and Aphorisms, Lemmatization

- by Roberto Bombacigno and Scotti Muth Nicoletta
- Digital Humanities, Aristotle, Ancient Greek Philosophy, Aristotelianism

The aim of this article is to analyze traditional approaches to the lemmatization of nouns on the macrostructural level in African languages against the background of the user perspective, the physical limitations on volume, the... more

We present AcTo, a network of integrated projects for the development of language resources and tools for Medieval Occitan. This abstract illustrates the resources in the network, as well as the first steps towards their integration,... more

This paper introduces the main components of the downloadable package of the 3.0 version of the morphological analyser for Latin Lemlat. The processes of word form analysis and treatment of spelling variation performed by the tool are... more

We present a new mixed method lemmatizer for Icelandic, Lemmald, which achieves good performance by relying on IceTagger [1] for tagging and The Icelandic Frequency Dictionary [2] corpus for training. We combine the advantages of... more

We describe and compare two tools for processing Middle Russian texts. Both tools provide lemmatization, part-of-speech and morphological annotation. One (“RNC”) was developed for annotating texts in the Russian National Corpus and is... more

- by Hanne Eckhoff
- Corpus Linguistics, Treebanks, POS tagging, Lemmatization

(1) Defining a dictionary’s macrostructure (2) On the need for rulers, part 1 (3) Part-of-Speech rulers (‘POS Rulers’) (4) On the need for rulers, part 2 (5) Multidimensional lexicographic rulers (6) Characterising POS Rulers and... more

In this article a four-step methodology is proposed for the creation of the lemma-sign list of a Nguni-language reference work. The theoretical principles are illustrated throughout with a full-scale case study revolving around... more

- by Gilles-Maurice de Schryver
- Lexicography, IsiNdebele, Lemmatization

An open issue in the sentiment classification of texts written in Serbian is the effect of different forms of morphological normalization and the usefulness of leveraging large amounts of unlabeled texts. In this paper, we assess the... more

- by Dan Tufis
- Question Answering System, Search Engine, Tokenization, Question Answering

We describe an on-going project to develop a lexical database of American Sign Language (ASL) as a tool for annotating ASL corpora collected in the United States. Labs within our team complete locally chosen fields using their notation... more

- by Leah Geer and Jonathan Henner
- Lemmatization, ID Gloss, ASL corpora

Abstract. This article describes the ERIAL system for Information Retrieval, developed within the project of the same name. After an initial external description of the project (Section 1), we present the environment... more

In this article, we consider the problem of supervised morphological analysis using an approach that differs from the comparable tools widespread in industry. The article describes a new method of lemmatization based on machine learning algorithms, in... more

Presentation of the Pandora lemmatizer/annotator at the workshop of the Lemmes group of the Consortium Sources Médiévales (COSME), organized by Eliana Magnani; Paris, Institut de recherche et d'histoire des textes, 6 November 2017.

- by Jean-Baptiste Camps and Thibault Clérice
- Natural Language Processing, Old French, Medieval Latin, Artificial Neural Networks

- by Antonio Moreno Sandoval and Doroteo Toledano
- Phonology, Syllable, Natural Language Processing, Lemmatization

- by Sardar Jaf
- Stemming, POS tagging, Natural Language Parsing, Lemmatization

In 1961, the research group L.A.S.L.A. (Laboratoire d’Analyse Statistique des Langues Anciennes, University of Liege, Belgium) began a project of lemmatization and morphosyntactic tagging of Latin texts. This project continues with new... more

- by Margherita Fantoli
- Computer Science, Latin Language, Lemmatization

© Springer-Verlag. Abstract. In this, our first participation in CLEF, we have applied Natural Language Processing techniques for single word and multiword term conflation. We have tested several approaches at different levels of text... more

- by Jesus Vilares
- Information Retrieval, Spanish, Natural Language Processing, Morphology
