3.2 Database Development
The KannadaLex database contains two main tables, Base and RootAndFamily, and three supporting tables that not only help derive a few features of the former two but also provide independent lexical statistics useful for psycholinguistic research. A detailed description of the individual fields of the Base and RootAndFamily tables is provided in Table 1 and Table 2.
The tables are obtained as follows:
• Base table: contains all words or wordforms taken directly from the corpus.
• RootAndFamily table: derived from the Base table; contains only lemmas and their additional features.
• Tables to store syllable frequencies (syllable, frequency), bigram syllable frequencies (bigram, frequency), and phonological neighbors (OrthRepWord1, ListOfPhonologicalNeighbors).
The lemma, phonological neighborhood, word length, and frequencies of the word are the most important linguistic features used in psycholinguistic studies. Word length and frequency of occurrence in a corpus are two fundamental psycholinguistic variables that have been found to affect word recognition tasks in many languages [4, 6, 7], particularly in English [3]. For the Kannada language, the word length is simply the number of segments in the word's Unicode representation. Other measures, such as the number of syllables and phonemes, the number of complex syllables, and so on, describe the complexity of Kannada words; they are calculated from the phonological and syllabic representations. The frequencies of syllables and bigrams are maintained in the database. The summed syllable frequency, the summed bigram syllable frequency, and their averages characterize the frequency of word parts; they are calculated by extracting syllable and bigram syllable frequencies from the database. We extract 13 psycholinguistic features in total, described in the following sections.
Orthographic representation: The orthographic representation of the word is its Unicode representation, adhering to the Unicode block (128 code points, ranging from U+0C80 to U+0CFF) that contains characters for the Kannada, Tulu, and Kodava languages. Words sourced from the archive are initially in this representation. Figure 3 shows a word, its corresponding orthographic representation, and its Unicode code points.
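As a small illustration of this block constraint, the sketch below checks whether a string lies entirely within the Kannada Unicode block; it assumes words arrive as plain Python strings and is not part of the KannadaLex tooling.

# A minimal sketch: verify that every code point of a word lies in the
# Kannada block (U+0C80 to U+0CFF).
KANNADA_BLOCK = range(0x0C80, 0x0D00)

def is_kannada_orthographic(word: str) -> bool:
    """Return True if every character of `word` belongs to the Kannada block."""
    return all(ord(ch) in KANNADA_BLOCK for ch in word)

print(is_kannada_orthographic("ಕನ್ನಡ"))    # True
print(is_kannada_orthographic("kannada"))  # False: Latin letters fall outside the block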
Phonological representation: The phonological representation describes the combination of sounds that make up a word. In Kannada, the vowels (svara), consonants (vyanjana), and yogavahas (part-vowel, part-consonant) are the phonemic units. The phonemes in each word are segmented and separated by a “-”: consonants are written with a diacritic (virama), while vowels are stripped and represented separately. Given the orthographic representation of a word as input, the phonological representation module outputs its phonological representation. Phonological neighbors are defined as the set of words that differ from a given word by a single sound, which is more relevant than the orthographic neighborhood for the Kannada language. For every word, the Phonological Neighbors table in the database is looked up first; if the word is not present in the table, all words with a one-phoneme difference are found, non-words are filtered out, and the phonological neighbors are recorded. The number of neighbors gives the phonological neighborhood density, and the mean of the frequencies of all the neighbors gives the phonological neighborhood mean frequency. The phonological representation of the word in Figure 3 is shown in Figure 4.
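The sketch below approximates such a segmentation in Python. It assumes NFC-normalized input and models only the core consonants, independent vowels, dependent vowel signs, and the virama; yogavahas and rarer signs are ignored, so it should be read as an approximation of the module described above rather than its implementation.

VIRAMA = "\u0CCD"
CONSONANTS = {chr(c) for c in range(0x0C95, 0x0CBA)}      # ಕ .. ಹ
INDEP_VOWELS = {chr(c) for c in range(0x0C85, 0x0C95)}    # ಅ .. ಔ
VOWEL_SIGNS = {chr(c) for c in range(0x0CBE, 0x0CCD)}     # dependent vowel signs (matras)

def phonological_representation(word: str) -> str:
    """Split a Kannada word into '-'-separated phonemes: consonants carry a
    virama, and vowels (inherent or explicit) appear as independent vowels."""
    phonemes, chars, i = [], list(word), 0
    while i < len(chars):
        ch = chars[i]
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        if ch in CONSONANTS:
            phonemes.append(ch + VIRAMA)                   # pure consonant phoneme
            if nxt == VIRAMA:
                i += 2                                     # conjunct: vowel comes later
            elif nxt in VOWEL_SIGNS:
                # dependent sign -> independent vowel; the fixed offset holds for
                # the common matras, though a lookup table would be safer
                phonemes.append(chr(ord(nxt) - 0x38))
                i += 2
            else:
                phonemes.append("\u0C85")                  # inherent vowel 'ಅ'
                i += 1
        elif ch in INDEP_VOWELS:
            phonemes.append(ch)
            i += 1
        else:
            i += 1                                         # signs not modeled here are skipped
    return "-".join(phonemes)

print(phonological_representation("ಕ್ಷೇತ್ರದ"))  # ಕ್-ಷ್-ಏ-ತ್-ರ್-ಅ-ದ್-ಅ (eight phonemes)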
Syllabic representation and CV pattern: Syllables, the units of pronunciation that form words, are considered to play a major role in psycholinguistic tasks. In Kannada, each word is composed of aksharas or kagunita, which directly correspond to syllables. Like many Indian languages, Kannada follows a CVCVCV pattern for words, with syllables in the formats V (a vowel in its primary orthographic form), CV (consonant-vowel: the secondary form of a vowel added to the primary orthographic form of a consonant), CCV (consonant-consonant-vowel, or ottakshara: the first consonant in its primary form, the second below the first in its secondary form, and the vowel indicated on the first consonant), and CCCV (consonant-consonant-consonant-vowel, usually in loan words from Sanskrit); the last two are classified as complex syllables. Figure 5 shows the four basic types of syllables.
The phonological representation of the word is taken as input, and the output is a CV string, with each syllable separated by a “-” and represented as one of the above CV patterns. The phonological representation is also separated at syllable endings by a “;”, which gives the syllabic representation of the word. The syllabic representation of the word in Figure 3 and its CV pattern are shown in Figure 6.
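A hedged sketch of this step, operating on the “-”-separated phoneme string produced by the previous sketch: each syllable is closed at a vowel, yielding both the “;”-separated syllabic representation and the CV pattern.

VIRAMA = "\u0CCD"

def syllabify(phon_rep: str):
    """Group '-'-separated phonemes into syllables (each closed by a vowel) and
    return the ';'-separated syllabic representation and the CV pattern."""
    syllables, pattern, current, cv = [], [], [], ""
    for p in phon_rep.split("-"):
        current.append(p)
        if p.endswith(VIRAMA):          # consonant phoneme
            cv += "C"
        else:                           # vowel phoneme closes the syllable
            cv += "V"
            syllables.append("-".join(current))
            pattern.append(cv)
            current, cv = [], ""
    if current:                         # trailing consonants, e.g. a word-final virama
        syllables.append("-".join(current))
        pattern.append(cv)
    return ";".join(syllables), "-".join(pattern)

syll_rep, cv_pattern = syllabify("ಕ್-ಷ್-ಏ-ತ್-ರ್-ಅ-ದ್-ಅ")
print(cv_pattern)   # CCV-CCV-CV: three syllables, two of them complex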
Word length: Word length is the number of segments (code points) in the Unicode representation of the word. It is obtained from the orthographic representation of the word. The word length for the word in Figure 3 is thus eight.
Number of phonemes: Obtained from the phonological representation, this is the number of phonemes that the word consists of. In Figure 4, the example word, "kSetrada" in WX notation, which means ‘of the field’, contains eight phonemes. This count describes the length of a Kannada word more accurately than a mere count of Unicode segments, as it reflects the actual number of sounds in the word.
Number of syllables: The number of units of pronunciation roughly describes the complexity of a Kannada word. From the syllabic representation or the CV pattern, a count of the syllables constituting the word is obtained. Considering Figure 6, the number of syllables in the example word is three.
Number of complex syllables: In various languages, it has been shown that the complexity of syllables directly affects the difficulty of recognizing a word. Thus, a count of the complex syllables in a Kannada word is worth recording as a feature. In Kannada, syllables taking the form CCV or CCCV are considered complex. Considering Figure 6, the number of complex syllables in the example word is two.
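Taken together, these four counts fall directly out of the representations sketched above. The helper below is illustrative; the values it prints match the running example.

def length_features(word: str, phon_rep: str, cv_pattern: str) -> dict:
    """Derive the length and complexity features from the three representations."""
    syllable_patterns = cv_pattern.split("-")
    return {
        "word_length": len(word),                        # Unicode segments
        "num_phonemes": len(phon_rep.split("-")),
        "num_syllables": len(syllable_patterns),
        "num_complex_syllables": sum(1 for s in syllable_patterns if s in ("CCV", "CCCV")),
    }

print(length_features("ಕ್ಷೇತ್ರದ", "ಕ್-ಷ್-ಏ-ತ್-ರ್-ಅ-ದ್-ಅ", "CCV-CCV-CV"))
# {'word_length': 8, 'num_phonemes': 8, 'num_syllables': 3, 'num_complex_syllables': 2}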
Word frequency: Frequency is an important factor in psycholinguistic studies. Word frequency is the number of times the word occurs in the sourced data, expressed relative to the total number of words sourced. The words in each file (one file per date) are processed to find the unique words and their frequencies within that file. These words are then compared with the words in the other files, and the frequencies are updated accordingly. The frequencies are finally normalized by the total number of words sourced.
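A hedged sketch of this counting pass is shown below; the directory layout (one UTF-8 text file per date under a hypothetical corpus_dir) and the whitespace tokenization are illustrative assumptions.

from collections import Counter
from pathlib import Path

def word_frequencies(corpus_dir: str) -> dict:
    """Count words across all per-date files and normalize by the corpus size."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):   # one file per date (assumed layout)
        counts.update(path.read_text(encoding="utf-8").split())
    total = sum(counts.values())
    # relative frequency: occurrences as a fraction of all words sourced
    return {word: c / total for word, c in counts.items()}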
Summed syllable bigram frequency and mean: In Kannada, bisyllable frequencies carry more value than orthographic bigram frequencies (bisyllable and bigram frequencies are used interchangeably in the text that follows). The frequency of every successive sequence of two syllables is extracted and summed to obtain the summed bigram frequency, which is divided by the total number of bigrams in the word to obtain the mean. To do so, when each word is processed to obtain its syllabic representation, all bigrams (bisyllables) in the word are inserted into a separate table: a new entry is made if the bigram is absent, or the frequency of the existing entry is updated if the bigram is found. After one pass over all words in the database, a second pass calculates the summed bigram frequency and its mean.
Summed syllable frequency and mean: Syllables in Kannada have a major role to play in word recognition and other psycholinguistic tasks, since they are represented as one unit or akshara. A summed syllable frequency provides an estimate of how familiar the word's syllables are and thus corresponds to the ease of recognizing the word. The approach to finding this is similar to that used to find Summed Bigram Frequency and its mean.
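Both features can be computed with the same two-pass idea. The sketch below uses in-memory dictionaries in place of the supporting database tables and assumes the “;”-separated syllabic representation introduced earlier.

from collections import Counter

def build_tables(syllabified_words):
    """Pass 1: populate syllable and bigram (bisyllable) frequency tables."""
    syll_freq, bigram_freq = Counter(), Counter()
    for rep in syllabified_words:
        sylls = rep.split(";")
        syll_freq.update(sylls)
        bigram_freq.update(zip(sylls, sylls[1:]))
    return syll_freq, bigram_freq

def summed_frequencies(rep, syll_freq, bigram_freq):
    """Pass 2: summed syllable/bigram frequencies and their means for one word."""
    sylls = rep.split(";")
    bigrams = list(zip(sylls, sylls[1:]))
    summed_syll = sum(syll_freq[s] for s in sylls)
    summed_bigram = sum(bigram_freq[b] for b in bigrams)
    return {
        "summed_syllable_freq": summed_syll,
        "mean_syllable_freq": summed_syll / len(sylls),
        "summed_bigram_freq": summed_bigram,
        "mean_bigram_freq": summed_bigram / len(bigrams) if bigrams else 0.0,
    }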
Phonological neighborhood density: A phonological neighbor of a word is another word that differs from it by at most one phoneme (through deletion, addition, or substitution). To find a word's neighbors, every other word in the corpus is checked for similarity to the query word (a phonological edit distance of one unit) and, if similar, added to the list of neighbors. The total count of these neighbors gives the word's phonological neighborhood density.
Phonological neighborhood mean frequency: The neighborhood mean frequency is calculated by adding up the frequencies of all the neighbors and dividing this sum by the density. While computing the neighborhood density, whenever a valid neighbor is found, its frequency is extracted and added to this sum.
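A minimal sketch of both neighborhood measures, assuming each word is stored as a tuple of phonemes and that freq maps those tuples to corpus frequencies; the single-edit check below is a simple stand-in for the comparison described above.

def is_neighbor(a, b):
    """True if phoneme tuples a and b differ by exactly one substitution,
    insertion, or deletion."""
    if a == b:
        return False
    if len(a) == len(b):                                   # one substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = (a, b) if len(a) < len(b) else (b, a)   # one insertion/deletion
    i = j = diffs = 0
    while i < len(short) and j < len(long_):
        if short[i] == long_[j]:
            i += 1
        else:
            diffs += 1
            if diffs > 1:
                return False
        j += 1
    return True

def neighborhood(word, lexicon, freq):
    """Return (density, mean frequency) of the word's phonological neighbors."""
    neighbors = [w for w in lexicon if is_neighbor(word, w)]
    density = len(neighbors)
    mean_freq = sum(freq[w] for w in neighbors) / density if density else 0.0
    return density, mean_freq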
Lemma: Stemmers generally derive the word stem from the inflected form of a word, and the stem may or may not be the same as the actual morphological root. Kannada has a complex morphology, characterized by its agglutinative nature: there are more than 10K root words and more than a million inflected variants, and stemming is particularly challenging because of the several morphophonemic changes that take place during suffix attachment. However, with a sufficiently large dataset covering almost all roots, morphemes, inflections, and suffixes, one can attempt to build a stemmer. A supervised stemmer employing an SVM classifier is built to extract the lemma, and this additional lemma information is stored in the database. The overall design of the stemmer is depicted in Figure 7. The stemmer is internally built in two parts: the first predicts the split point, and the second predicts whether any additional changes are to be made to the root. The algorithm used to build the stemmer is explained below:
— Extract word and root data from the training corpus: From the manually annotated training corpus, the word, root, and morphologically analyzed suffix information is filtered out and stored separately.
— Transform data to form training sets: The data is transformed in three steps.
(a) First, the point at which the word is split into root and suffix is identified and the split is performed. One of the following cases may occur:
— The word contains the root word without structural changes.
— The root word has undergone additional changes within the word.
An intermediate dataset is formed with Word, Root, Suffix, and Actual Root (as annotated by experts).
(b) The training dataset is built with the features Current Letter and Suffix, each row classified into one of three classes: 0 (no split), 1 (split with no changes), and 2 (split with an additional change required to the root). To do so, for each letter, the part of the word that comes after it is stored as the suffix, and the part of the word up to and including the letter is compared with the root of the word. If a match is found, the split label is marked as 1; if it matches partially, the additional change is recorded separately and the split label is marked as 2; the split label is 0 when there is no match.
(c) All rows that contain additional changes to be performed on the root are extracted and stored separately to form the training set for the second part of the stemmer.
— Feature engineering: The Unicode features are converted into features that machine learning models can work with, by mapping the two features, Current Letter and Suffix, to long integer IDs. The additional changes are likewise converted into integer IDs for the second part of the stemmer.
— Train SVM classifiers: Two SVM classifiers are trained separately with the two training sets obtained above. The first takes the Current Letter and Suffix features and classifies each row into class label 0, 1, or 2, as described in the previous step. The second part of the stemmer takes the Current Letter and Suffix information and classifies it into class labels covering all the additional changes observed in the dataset. The binary objects obtained after training the classifiers are stored for prediction.
— Predict the root for an input word: For an input word, starting from the third letter (root words cannot be fewer than three letters long), a feature set is formed for each letter with long integers representing the current letter and the suffix (the part of the word that comes after the current letter), and it is fed to the classifier to predict the class label. The split is performed at the first occurrence of class label 1 or 2. If the class label is 2, the second part of the stemmer is called to predict the changes that must be applied to the identified root. The predicted root word is then returned.
For each word in the lexical database, stemming is performed, and the lemma is stored as a feature in the Base table. Further, all lemmas are identified, and their cumulative frequencies (the summed frequency of the inflectional family) are calculated and stored in the database.
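The sketch below illustrates the two-stage classification idea with scikit-learn's SVC. The integer-ID encoding and the toy training rows are illustrative assumptions rather than the annotated corpus, and only the first-stage (split) classifier is trained; the second-stage classifier for additional root changes would be built the same way.

from sklearn.svm import SVC

def make_encoder():
    table = {}
    def encode(token: str) -> int:
        # map each distinct letter/suffix string to a long integer ID
        return table.setdefault(token, len(table) + 1)
    return encode

encode = make_encoder()

# toy training rows: (current letter, suffix after it, split label)
# 0 = no split, 1 = split with no changes, 2 = split with an additional change
split_rows = [("ದ", "ಲ್ಲಿ", 1), ("ತ", "್ರದ", 0), ("ರ", "ದ", 2)]
X = [[encode(letter), encode(suffix)] for letter, suffix, _ in split_rows]
y = [label for _, _, label in split_rows]
split_clf = SVC(kernel="linear").fit(X, y)

def predict_split(letter: str, suffix: str) -> int:
    return int(split_clf.predict([[encode(letter), encode(suffix)]])[0])

def predict_root(word: str) -> str:
    """Scan from the third letter and split at the first position labelled 1 or 2."""
    for i in range(2, len(word)):              # roots are at least three letters long
        label = predict_split(word[i], word[i + 1:])
        if label in (1, 2):
            # for label 2, a second classifier would also predict the change
            # to apply to the root; that stage is omitted in this sketch
            return word[:i + 1]
    return word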