A stemmer is a language processing tool widely used in artificial intelligence applications to remove affixes (prefixes, infixes, and suffixes) from a word and produce its root word. This study designs an algorithm and develops a Malay language stemmer. Most existing Malay stemmers have problems in stemming because they tend to depend on online dictionaries, which return false results during stemming. In addition, the affixes of Malay words are more complex than those of English words. Therefore, an offline dictionary of 9,512 words is introduced in this study to handle ambiguity when stemming Malay words. At each step, the algorithm first checks whether the word is a root word in the local dictionary; if it is not, the step processes the word. The five steps are stem-extra-suffix, stem-plural, stem-infix, stem-prefix, and stem-suffix. The affix rules are extracted from Kamus Tatabahasa, and Kamus Dewan (4th Ed) is used to confirm the accuracy of the stemmed words. The results show that the proposed stemmer can stem prefixes, suffixes, and infixes with high accuracy. The study illustrates that the proposed stemmer can handle the complexity of Malay words. The stemmer can be further enhanced with a look-up table or dictionary of overlapping words to address the limitation of overlapping prefixes and suffixes.
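The control flow described in the abstract (check the local dictionary before each of the five steps, and otherwise process the word) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tiny ROOTS set stands in for the 9,512-word offline dictionary, the affix lists are a small hand-picked sample rather than the rules extracted from Kamus Tatabahasa, and the helper names and the minimum stem length are hypothetical. Only the step order and the dictionary-first check come from the abstract.

```python
# Hypothetical stand-in for the study's 9,512-word offline dictionary.
ROOTS = {"buku", "makan", "baca", "main", "tunjuk"}

# Illustrative affix samples only; NOT the paper's full rule set.
EXTRA_SUFFIXES = ("lah", "kah", "nya", "ku", "mu")   # particles, possessives
PREFIXES = ("meng", "mem", "men", "ber", "ter", "me", "di")
SUFFIXES = ("kan", "an", "i")
INFIXES = ("el", "em", "er")
MIN_STEM = 3  # assumed lower bound on what may remain after stripping


def _strip_end(word, endings):
    """Remove the first matching ending that leaves a long enough stem."""
    for ending in endings:
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM:
            return word[: -len(ending)]
    return word


def stem_extra_suffix(word):
    left, sep, right = word.partition("-")
    if sep and left == right:
        return word  # plain reduplication: leave it for stem_plural
    return _strip_end(word, EXTRA_SUFFIXES)


def stem_plural(word):
    # Treat full reduplication ("buku-buku") as a plural of its first half.
    left, sep, right = word.partition("-")
    return left if sep and left == right else word


def stem_infix(word):
    # Assumption: an infix follows the first letter and is removed only
    # when the dictionary confirms the result, since the pattern is ambiguous.
    for infix in INFIXES:
        if word[1 : 1 + len(infix)] == infix:
            candidate = word[0] + word[1 + len(infix) :]
            if candidate in ROOTS:
                return candidate
    return word


def stem_prefix(word):
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            return word[len(prefix) :]
    return word


def stem_suffix(word):
    return _strip_end(word, SUFFIXES)


# Step order as listed in the abstract.
STEPS = (stem_extra_suffix, stem_plural, stem_infix, stem_prefix, stem_suffix)


def stem(word):
    word = word.lower()
    for step in STEPS:
        if word in ROOTS:  # dictionary check comes first at every step
            return word
        word = step(word)
    return word
```

With this toy dictionary, stem("buku-bukunya") returns "buku", stem("membaca") returns "baca", and stem("dimakan") returns "makan". The last case shows why the dictionary check precedes each step: once "makan" is recognised as a root, the suffix step never gets the chance to strip its final "an".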
Papers by Abdulrazak Yahya