3.2 Database Development
The KannadaLex database contains two main tables, Base and RootAndFamily, and three supporting tables that not only help derive a few features of the former two but also provide independent lexical statistics useful for psycholinguistic research. A detailed description of the individual fields of the Base and RootAndFamily tables is provided in Table 1 and Table 2.
The tables are obtained as follows:
• Base table: contains all words or wordforms taken directly from the corpus.
• RootAndFamily table: derived from the Base table; contains only lemmas and their additional features.
• Tables to store syllable frequencies (syllable, frequency), bigram syllable frequencies (bigram, frequency), and phonological neighbors (OrthRepWord1, ListOfPhonologicalNeighbors).
The lemma, phonological neighborhood, word length, and frequencies of the word are the most important linguistic features used in psycholinguistic studies. Word length and frequency of occurrence in a corpus are two fundamental psycholinguistic variables that have been found to affect word recognition tasks in many languages [4, 6, 7], particularly in English [3]. For the Kannada language, the word length is simply the number of segments in the word's Unicode representation. Other measures, such as the number of syllables and phonemes, the number of complex syllables, and so on, describe the complexity of Kannada words; they are calculated from the phonological and syllabic representations. The frequencies of syllables and bigrams are maintained in the database. The summed syllable frequency, the summed bigram syllable frequency, and their averages characterize the frequency of word parts; they are calculated by extracting syllable and bigram syllable frequencies from the database. We extract 13 psycholinguistic features in total, described in the following sections.
Orthographic representation: The orthographic representation of the word is its Unicode representation, adhering to the Unicode block (128 code points, ranging from U+0C80 to U+0CFF) that contains characters for the Kannada, Tulu, and Kodava languages. Words sourced from the archive are initially in this representation. Figure 3 shows a word, its corresponding orthographic representation, and its Unicode code points.
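As a small illustration of this block constraint, the sketch below checks whether a string lies entirely within the Kannada Unicode block; it assumes words arrive as plain Python strings and is not part of the KannadaLex tooling.

# A minimal sketch: verify that every code point of a word lies in the
# Kannada block (U+0C80 to U+0CFF).
KANNADA_BLOCK = range(0x0C80, 0x0D00)

def is_kannada_orthographic(word: str) -> bool:
    """Return True if every character of `word` belongs to the Kannada block."""
    return all(ord(ch) in KANNADA_BLOCK for ch in word)

print(is_kannada_orthographic("ಕನ್ನಡ"))    # True
print(is_kannada_orthographic("kannada"))  # False: Latin letters fall outside the block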
Phonological representation: The phonological representation describes the combination of sounds that make up a word. In Kannada, the vowels (svara), consonants (vyanjana), and yogavahas (part-vowel, part-consonant) are the phonemic units. The phonemes in each word are segmented and separated by a “-”: consonants are written with a diacritic (virama), while vowels are stripped and represented separately. Given the orthographic representation of a word as input, the phonological representation module outputs its phonological representation. Phonological neighbors are defined as the set of words that differ from a given word by a single sound, which is more relevant than the orthographic neighborhood for the Kannada language. For every word, the Phonological Neighbors table in the database is looked up first; if the word is not present in the table, all words with a one-phoneme difference are found, non-words are filtered out, and the phonological neighbors are recorded. The number of neighbors gives the phonological neighborhood density, and the mean of the frequencies of all the neighbors gives the phonological neighborhood mean frequency. The phonological representation of the word in Figure 3 is shown in Figure 4.
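The sketch below approximates such a segmentation in Python. It assumes NFC-normalized input and models only the core consonants, independent vowels, dependent vowel signs, and the virama; yogavahas and rarer signs are ignored, so it should be read as an approximation of the module described above rather than its implementation.

VIRAMA = "\u0CCD"
CONSONANTS = {chr(c) for c in range(0x0C95, 0x0CBA)}      # ಕ .. ಹ
INDEP_VOWELS = {chr(c) for c in range(0x0C85, 0x0C95)}    # ಅ .. ಔ
VOWEL_SIGNS = {chr(c) for c in range(0x0CBE, 0x0CCD)}     # dependent vowel signs (matras)

def phonological_representation(word: str) -> str:
    """Split a Kannada word into '-'-separated phonemes: consonants carry a
    virama, and vowels (inherent or explicit) appear as independent vowels."""
    phonemes, chars, i = [], list(word), 0
    while i < len(chars):
        ch = chars[i]
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        if ch in CONSONANTS:
            phonemes.append(ch + VIRAMA)                   # pure consonant phoneme
            if nxt == VIRAMA:
                i += 2                                     # conjunct: vowel comes later
            elif nxt in VOWEL_SIGNS:
                # dependent sign -> independent vowel; the fixed offset holds for
                # the common matras, though a lookup table would be safer
                phonemes.append(chr(ord(nxt) - 0x38))
                i += 2
            else:
                phonemes.append("\u0C85")                  # inherent vowel 'ಅ'
                i += 1
        elif ch in INDEP_VOWELS:
            phonemes.append(ch)
            i += 1
        else:
            i += 1                                         # signs not modeled here are skipped
    return "-".join(phonemes)

print(phonological_representation("ಕ್ಷೇತ್ರದ"))  # ಕ್-ಷ್-ಏ-ತ್-ರ್-ಅ-ದ್-ಅ (eight phonemes)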
Syllabic representation and CV pattern: Syllables, the units of pronunciation that form words, are considered to play a major role in psycholinguistic tasks. In Kannada, each word is composed of aksharas or kagunita, which directly correspond to syllables. Like many Indian languages, Kannada follows a CVCVCV pattern for words, with syllables in the formats V (a vowel in its primary orthographic form), CV (consonant-vowel: the secondary form of a vowel added to the primary orthographic form of a consonant), CCV (consonant-consonant-vowel, or ottakshara: the first consonant in its primary form, the second below the first in its secondary form, and the vowel indicated on the first consonant), and CCCV (consonant-consonant-consonant-vowel, usually in loan words from Sanskrit); the last two are classified as complex syllables. Figure 5 shows the four basic types of syllables.
The phonological representation of the word is taken as input, and the output is a CV string, with each syllable separated by a “-” and represented as one of the above CV patterns. The phonological representation is also separated at syllable endings by a “;”, which gives the syllabic representation of the word. The syllabic representation of the word in Figure 3 and its CV pattern are shown in Figure 6.
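A hedged sketch of this step, operating on the “-”-separated phoneme string produced by the previous sketch: each syllable is closed at a vowel, yielding both the “;”-separated syllabic representation and the CV pattern.

VIRAMA = "\u0CCD"

def syllabify(phon_rep: str):
    """Group '-'-separated phonemes into syllables (each closed by a vowel) and
    return the ';'-separated syllabic representation and the CV pattern."""
    syllables, pattern, current, cv = [], [], [], ""
    for p in phon_rep.split("-"):
        current.append(p)
        if p.endswith(VIRAMA):          # consonant phoneme
            cv += "C"
        else:                           # vowel phoneme closes the syllable
            cv += "V"
            syllables.append("-".join(current))
            pattern.append(cv)
            current, cv = [], ""
    if current:                         # trailing consonants, e.g. a word-final virama
        syllables.append("-".join(current))
        pattern.append(cv)
    return ";".join(syllables), "-".join(pattern)

syll_rep, cv_pattern = syllabify("ಕ್-ಷ್-ಏ-ತ್-ರ್-ಅ-ದ್-ಅ")
print(cv_pattern)   # CCV-CCV-CV: three syllables, two of them complex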
Word length: Word length is the number of segments (code points) in the Unicode representation of the word. It is obtained from the orthographic representation of the word. The word length for the word in Figure 3 is thus eight.
Number of phonemes: Obtained from the phonological representation, this is the number of phonemes that the word consists of. In Figure 4, the example word, "kSetrada" in WX notation, which means ‘of the field’, contains eight phonemes. This count describes the length of a Kannada word more accurately than a mere count of Unicode segments, as it reflects the actual number of sounds in the word.
Number of syllables: The number of units of pronunciation roughly describes the complexity of a Kannada word. From the syllabic representation or the CV pattern, a count of the syllables constituting the word is obtained. Considering Figure 6, the number of syllables in the example word is three.
Number of complex syllables: In various languages, it has been shown that the complexity of syllables directly affects the difficulty of recognizing a word. Thus, a count of the complex syllables in a Kannada word is worth recording as a feature. In Kannada, syllables taking the form CCV or CCCV are considered complex. Considering Figure 6, the number of complex syllables in the example word is two.
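Taken together, these four counts fall directly out of the representations sketched above. The helper below is illustrative; the values it prints match the running example.

def length_features(word: str, phon_rep: str, cv_pattern: str) -> dict:
    """Derive the length and complexity features from the three representations."""
    syllable_patterns = cv_pattern.split("-")
    return {
        "word_length": len(word),                        # Unicode segments
        "num_phonemes": len(phon_rep.split("-")),
        "num_syllables": len(syllable_patterns),
        "num_complex_syllables": sum(1 for s in syllable_patterns if s in ("CCV", "CCCV")),
    }

print(length_features("ಕ್ಷೇತ್ರದ", "ಕ್-ಷ್-ಏ-ತ್-ರ್-ಅ-ದ್-ಅ", "CCV-CCV-CV"))
# {'word_length': 8, 'num_phonemes': 8, 'num_syllables': 3, 'num_complex_syllables': 2}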
Word frequency: Frequency is an important factor in psycholinguistic studies. Word frequency is the number of times the word occurs in the sourced data, expressed relative to the total number of words sourced. The words in each file (one file per date) are processed to find the unique words and their frequencies within that file. These words are then compared with the words in the other files, and the frequencies are updated accordingly. The frequencies are finally normalized by the total number of words sourced.
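A hedged sketch of this counting pass is shown below; the directory layout (one UTF-8 text file per date under a hypothetical corpus_dir) and the whitespace tokenization are illustrative assumptions.

from collections import Counter
from pathlib import Path

def word_frequencies(corpus_dir: str) -> dict:
    """Count words across all per-date files and normalize by the corpus size."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):   # one file per date (assumed layout)
        counts.update(path.read_text(encoding="utf-8").split())
    total = sum(counts.values())
    # relative frequency: occurrences as a fraction of all words sourced
    return {word: c / total for word, c in counts.items()}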
Summed syllable bigram frequency and mean: In Kannada, bisyllable frequencies carry more value than orthographic bigram frequencies (bisyllable and bigram frequencies are used interchangeably in the text that follows). The frequency of every successive sequence of two syllables is extracted and summed to obtain the summed bigram frequency, which is divided by the total number of bigrams in the word to obtain the mean. To do so, when each word is processed to obtain its syllabic representation, all bigrams (bisyllables) in the word are inserted into a separate table: a new entry is made if the bigram is absent, or the frequency of the existing entry is updated if the bigram is found. After one pass over all words in the database, a second pass calculates the summed bigram frequency and its mean.
Summed syllable frequency and mean: Syllables in Kannada have a major role to play in word recognition and other psycholinguistic tasks, since they are represented as one unit or akshara. A summed syllable frequency provides an estimate of how familiar the word's syllables are and thus corresponds to the ease of recognizing the word. The approach to finding this is similar to that used to find Summed Bigram Frequency and its mean.
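Both features can be computed with the same two-pass idea. The sketch below uses in-memory dictionaries in place of the supporting database tables and assumes the “;”-separated syllabic representation introduced earlier.

from collections import Counter

def build_tables(syllabified_words):
    """Pass 1: populate syllable and bigram (bisyllable) frequency tables."""
    syll_freq, bigram_freq = Counter(), Counter()
    for rep in syllabified_words:
        sylls = rep.split(";")
        syll_freq.update(sylls)
        bigram_freq.update(zip(sylls, sylls[1:]))
    return syll_freq, bigram_freq

def summed_frequencies(rep, syll_freq, bigram_freq):
    """Pass 2: summed syllable/bigram frequencies and their means for one word."""
    sylls = rep.split(";")
    bigrams = list(zip(sylls, sylls[1:]))
    summed_syll = sum(syll_freq[s] for s in sylls)
    summed_bigram = sum(bigram_freq[b] for b in bigrams)
    return {
        "summed_syllable_freq": summed_syll,
        "mean_syllable_freq": summed_syll / len(sylls),
        "summed_bigram_freq": summed_bigram,
        "mean_bigram_freq": summed_bigram / len(bigrams) if bigrams else 0.0,
    }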
Phonological neighborhood density: A phonological neighbor of a word is another word that differs from it by at most one phoneme (through deletion, addition, or substitution). To find a word's neighbors, every other word in the corpus is checked for similarity to the query word (a phonological edit distance of one unit) and, if similar, added to the list of neighbors. The total count of these neighbors gives the word's phonological neighborhood density.
Phonological neighborhood mean frequency: The neighborhood mean frequency is calculated by adding up the frequencies of all the neighbors and dividing this sum by the density. While computing the neighborhood density, whenever a valid neighbor is found, its frequency is extracted and added to this sum.
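A minimal sketch of both neighborhood measures, assuming each word is stored as a tuple of phonemes and that freq maps those tuples to corpus frequencies; the single-edit check below is a simple stand-in for the comparison described above.

def is_neighbor(a, b):
    """True if phoneme tuples a and b differ by exactly one substitution,
    insertion, or deletion."""
    if a == b:
        return False
    if len(a) == len(b):                                   # one substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = (a, b) if len(a) < len(b) else (b, a)   # one insertion/deletion
    i = j = diffs = 0
    while i < len(short) and j < len(long_):
        if short[i] == long_[j]:
            i += 1
        else:
            diffs += 1
            if diffs > 1:
                return False
        j += 1
    return True

def neighborhood(word, lexicon, freq):
    """Return (density, mean frequency) of the word's phonological neighbors."""
    neighbors = [w for w in lexicon if is_neighbor(word, w)]
    density = len(neighbors)
    mean_freq = sum(freq[w] for w in neighbors) / density if density else 0.0
    return density, mean_freq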
Lemma: Stemmers generally derive the word stem from the inflected form of a word, and the stem may or may not be the same as the actual morphological root. Kannada has a complex morphology, characterized by its agglutinative nature: there are more than 10K root words and more than a million inflected variants, and stemming is particularly challenging because of the several morphophonemic changes that take place during suffix attachment. However, with a sufficiently large dataset covering almost all roots, morphemes, inflections, and suffixes, one can attempt to build a stemmer. A supervised stemmer employing an SVM classifier is built to extract the lemma, and this additional lemma information is stored in the database. The overall design of the stemmer is depicted in Figure 7. The stemmer is internally built in two parts: the first predicts the split point, and the second predicts whether any additional changes are to be made to the root. The algorithm used to build the stemmer is explained below:
— Extract word and root data from the training corpus: From the manually annotated training corpus, the word, root, and morphologically analyzed suffix information is filtered out and stored separately.
— Transform data to form training sets: The data is transformed in three steps.
(a) First, the point at which the word is split into root and suffix is identified and the split is performed. One of the following cases may occur:
— The word contains the root word without structural changes.
— The root word has undergone additional changes within the word.
An intermediate dataset is formed with Word, Root, Suffix, and Actual Root (as annotated by experts).
(b) The training dataset is built with the features Current Letter and Suffix, each row classified into one of three classes: 0 (no split), 1 (split with no changes), and 2 (split with an additional change required to the root). To do so, for each letter, the part of the word that comes after it is stored as the suffix, and the part of the word up to and including the letter is compared with the root of the word. If a match is found, the split label is marked as 1; if it matches partially, the additional change is recorded separately and the split label is marked as 2; the split label is 0 when there is no match.
(c) All rows that contain additional changes to be performed on the root are extracted and stored separately to form the training set for the second part of the stemmer.
— Feature engineering: The Unicode features are converted into features that machine learning models can work with, by mapping the two features, Current Letter and Suffix, to long integer IDs. The additional changes are likewise converted into integer IDs for the second part of the stemmer.
— Train SVM classifiers: Two SVM classifiers are trained separately with the two training sets obtained above. The first takes the Current Letter and Suffix features and classifies each row into class label 0, 1, or 2, as described in the previous step. The second part of the stemmer takes the Current Letter and Suffix information and classifies it into class labels covering all the additional changes observed in the dataset. The binary objects obtained after training the classifiers are stored for prediction.
— Predict the root for an input word: For an input word, starting from the third letter (root words cannot be fewer than three letters long), a feature set is formed for each letter with long integers representing the current letter and the suffix (the part of the word that comes after the current letter), and it is fed to the classifier to predict the class label. The split is performed at the first occurrence of class label 1 or 2. If the class label is 2, the second part of the stemmer is called to predict the changes that must be applied to the identified root. The predicted root word is then returned.
For each word in the lexical database, stemming is performed, and the lemma is stored as a feature in the Base table. Further, all lemmas are identified, and their cumulative frequencies (the summed frequency of the inflectional family) are calculated and stored in the database.
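The sketch below illustrates the two-stage classification idea with scikit-learn's SVC. The integer-ID encoding and the toy training rows are illustrative assumptions rather than the annotated corpus, and only the first-stage (split) classifier is trained; the second-stage classifier for additional root changes would be built the same way.

from sklearn.svm import SVC

def make_encoder():
    table = {}
    def encode(token: str) -> int:
        # map each distinct letter/suffix string to a long integer ID
        return table.setdefault(token, len(table) + 1)
    return encode

encode = make_encoder()

# toy training rows: (current letter, suffix after it, split label)
# 0 = no split, 1 = split with no changes, 2 = split with an additional change
split_rows = [("ದ", "ಲ್ಲಿ", 1), ("ತ", "್ರದ", 0), ("ರ", "ದ", 2)]
X = [[encode(letter), encode(suffix)] for letter, suffix, _ in split_rows]
y = [label for _, _, label in split_rows]
split_clf = SVC(kernel="linear").fit(X, y)

def predict_split(letter: str, suffix: str) -> int:
    return int(split_clf.predict([[encode(letter), encode(suffix)]])[0])

def predict_root(word: str) -> str:
    """Scan from the third letter and split at the first position labelled 1 or 2."""
    for i in range(2, len(word)):              # roots are at least three letters long
        label = predict_split(word[i], word[i + 1:])
        if label in (1, 2):
            # for label 2, a second classifier would also predict the change
            # to apply to the root; that stage is omitted in this sketch
            return word[:i + 1]
    return word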