Malay Language Stemmer

Abdulrazak Yahya; Rehman Ullah Khan

A stemmer is a language processing tool widely used in artificial intelligence applications to remove affixes from a word, such as prefixes, infixes, and suffixes, in order to generate the root word. This study designs an algorithm and develops a Malay language stemmer. Most existing Malay language stemmers have problems in stemming because they tend to depend on online dictionaries, which return false results during stemming. In addition, the affixes of Malay words are more complex than those of English words. Therefore, an offline dictionary of 9,512 words is introduced in this study to handle ambiguity when stemming Malay words. At each step, the algorithm first checks whether the word is a root word in the local dictionary; otherwise, it processes the word. The five steps are stem-extra-suffix, stem-plural, stem-infix, stem-prefix, and stem-suffix. The affix rules are extracted from Kamus Tatabahasa, and Kamus Dewan (4th Ed) is used to confirm the accuracy of stemmed words. The results show that the proposed stemmer can stem prefixes, suffixes, and infixes with high accuracy. The study illustrates that the proposed stemmer can handle the complexity of Malay words. The stemmer can be further enhanced by a look-up table or dictionary of overlapping words to address the prefix and suffix overlapping limitation.

INTERNATIONAL JOURNAL FOR RESEARCH IN EMERGING SCIENCE AND TECHNOLOGY, VOLUME-4, ISSUE-12, DEC-2017, E-ISSN: 2349-7610
COPYRIGHT © 2017 IJREST, ALL RIGHT RESERVED

Rehman Ullah Khan*1, Fitri Suraya Mohamad2, Muh. Inam UlHaq3, Shahren Ahmad Zadi Adruce4, Philip Nuli Anding5, Sajjad Nawaz Khan6, Abdulrazak Yahya Saleh Al-Hababi7

1,2,4,5,6 Faculty of Cognitive Sciences and Human Development, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
E-mail: krullah@unimas.my, mfitri@unimas.my, azshahren@unimas.my, aphilip@unimas.my, sajjadnawazkhan@gmail.com, ysahabdulrazak@unimas.my
3 Department of Computer Sciences, Khushal Khan Khattak University, Karak, Pakistan
E-mail: inamix@gmail.com

Index Terms: Stemming, Stemmer, Natural language processing, Algorithm and Morphology.

1. INTRODUCTION

The Malay language is a well-known member of the Austronesian family and is spoken across Southeast Asia, in countries such as Malaysia, Indonesia, Singapore, and Brunei [1]. Bahasa Malaysia in Malaysia, Bahasa Indonesia in Indonesia, and Bahasa Melayu in Singapore and Brunei all originated from the Malay language [1]. The Malay language has two key features to take note of. First, it does not have grammatical functions such as those commonly used in the English language. Second, affixes in the Malay language determine the part-of-speech of words, representing nouns, verbs, and adjectives, while affixes in the English language represent plurals, tenses, and possession. The lack of published morphological analyses of the Malay language creates a need to address the gap in the literature about the linguistic characteristics of a language widely used in Southeast Asia.

Stemming algorithms are widely used today to serve various purposes. Stemming is known as the process of eliminating derivational and inflectional suffixes from words until the root word is obtained [2]. For example, the words "assign", "assigns", "assigned", "assigning", "assignation", and "assignment" are reduced to the root word "assign".

Experiments were conducted to examine how efficient stemming and n-grams are in identifying suffixes, multi-word concepts, and spelling errors. The experiment was divided into bigram and trigram string matching using a document in the Malay language. Sembok and Zainab [3] carried out separate experiments using bigram and stemmed bigram as well as trigram and stemmed trigram. Their experiment revealed that stemming both keywords and documents has an obvious advantage over stemming keywords only and over not stemming any keywords. On the other hand, the experiment also revealed that bigram and trigram search worked better with no stemming on keywords. This is because, when the keywords in the text are stemmed, the bigram and trigram search is affected by the reduced keywords [3]. However, the experiment eventually showed that applying stemming to keywords and documents improved the average precision value. Conclusively, Sembok and Zainab [3] have shown that retrieval effectiveness is improved by using combined search, n-gram matching, and stemming.

Studies of stemming algorithms for the Malay language are relatively left behind in comparison to other languages such as English and other European languages [4]. The availability of Malay information retrieval systems is also very limited. The usage of affixes in English and other European languages is less complex than in the Malay language, as it has been found that the stemmers for those languages are only concerned with the removal of suffixes. In Malay morphology, however, a stemmed word is produced by removing affixes in the text document or query. An affix is a verbal element attached to a word, either at the beginning of the word (prefix) or at the end of the word (suffix). Moreover, more than one affix may be attached to a word at the same time. A word can also contain both affixes, which is known as a prefix-suffix pair, as seen in the word 'pemakanan'. The root word of this word is 'makan'; the prefix 'pe' is added at the beginning of the word and the suffix 'an' at the end of the word to complete the word 'pemakanan'. English and Malay differ in terms of their root words, which are based on their respective morphological structures [5]. For instance, the English words 'related', 'relates', and 'relation' are derived from the root word 'relate', and a stemmer can work as a suffix remover for the English language. Yet the Malay language has a different stemming process compared to English, due to the complexity of its morphological rules. For example, the Malay words 'pengajaran', 'pembelajaran', and 'pelajar' are derived from the root word 'ajar', and suffix removal alone is insufficient to decide on the correct root word [6].

A Malay language stemmer used in text categorization was developed by Yasukawa et al. [7]. This stemmer checks an input word against the dictionary before removing the affix, to overcome the over-stemming problem. In its methodology, the affixes are arranged from the longest match list to the shortest match list. The stemmer removes the affix in the shortest match if no root word remains after the affix in the longest match is removed. Therefore, the algorithm of this stemmer would not return a root word. However, this leads to two limitations in the stemmer: an ambiguity problem, and the algorithm being found more suitable than the arrangement of the longest or shortest match list. This phenomenon occurs because, once the stemmed word is found to be similar to a root word, there is no further checking for the next possible affix.

According to Kassim and colleagues [6], affixed words are derived from the combination of affixes, clitics, and particles. They added that affixes can be classified into prefixes, suffixes, confixes, and infixes. The most universal prefixes of Malay are di+, ke+, se+, ber+, men+, pen+, ter+, and per+. Prefixes are normally attached at the beginning of root words. The part-of-speech of the root word does not change when it is combined with inflectional prefixes such as di+, ke+, and se+ [5].
For example, "diambil" (taken) is derived from the prefix di+ and the root word "ambil" (synonymous with "take" in English). Both words are verbs. In contrast, the part-of-speech of the root word does change when it is combined with derivational prefixes [6]. For example, the word "pelayan" is a noun derived from the combination of the verb layan (serving) and the prefix pe+. For suffixes, the universal ones are +an and +i. Suffixes are usually attached at the end of root words. Like prefixes, suffixes can be inflectional or derivational. For an inflectional suffix, an example is kuasai (powered), a verb derived from the root word kuasa (power), verb, and the suffix +i. For a derivational suffix, an example is minuman (drink), a noun formed from the verb minum (drinking) and the suffix +an. For confixes, there are two types, inflectional and derivational; these do not and do change the part-of-speech, respectively. For instance, "hendak" (want) → "dihendaki" (wanted) does not change the part-of-speech, whereas "pakai" (use) → "pemakaian" (usage) changes the part-of-speech. The Malay language also has infixes such as +el+, +em+, and +er+, which are attached in the middle of root words. For example, "telunjuk" (fingers) comes from the root word "tunjuk" (point). Thus, there are still many available sequences of affixes that can be attached to base words.

In 2012, a number of researchers proposed different methods of Malay language stemming. One of the proposals is termed the UniSZA stemmer; it proposed 7 simple rules which lead to a reduction in dictionary dependencies and a lower processing cost [8]. Fadzli et al. (2012) defined the rules as: check dictionary, check length, double words, prefix, suffix, change spelling, and suffix-i. In the prefix step, they developed and enhanced a Malay prefixes library based on the RAO (Rules Application Order) stemmer proposed by Fatimah in 1995. The suffix step used a similar approach to the prefix step. The constructed rules are arranged in five different ways: Arrangement A, B, C, D, and E. Fadzli et al. (2012) conducted an experiment on all five arrangements, and the results showed that Arrangement C (Double Words → Check Dictionary → Check Length → Prefix → Suffix → Change Spelling → Suffix-i) scored well in terms of accuracy. In comparison to RAO and RFO (Rules Frequency Order), UniSZA performs better, with a compression rate of 67.26%. They also found that the UniSZA method improves the compression rate of other languages' stemmers.

An innovative approach to stemming called the Malay stemmer with background knowledge was proposed by Leong et al. (2012). It was designed to boost processing speed by avoiding the excessively broad-spectrum dictionary scanning of traditional algorithms, where words are scanned regardless of whether they have affixes. A dictionary of affixed root words is additionally used as a reference before stemming, to solve the stemming error mentioned by Leong et al. (2012). Leong et al. (2012) proposed that this stemmer use RFO as the basis algorithm. The only difference is the implementation of a second dictionary, which checks affixed root words. The first word goes through the first, basic dictionary of Kamus Dewan (4th Ed) to get the root word; if it exists, the stemmer proceeds to the next word; if it does not exist, the second dictionary is accessed (Leong et al., 2012). When the second dictionary is accessed, there are three rules: (a) recode for prefix spelling exceptions and check the second dictionary, (b) check the stem for spelling variations and check the second dictionary, and (c) recode for suffix spelling exceptions and check the second dictionary. If step (c) fails to stem a root word, the process is looped. This approach successfully decreases the stemming error from 0.21% to 0.09%.

In addition, a stemming method of exhaustive affix stripping and a Malay Word Register are used to solve over-stemming and under-stemming errors and to address the ambiguity problem of determining the correct root word [9]. By considering all possible word morphologies, the over-stemming and under-stemming error remover helps in looking for possible affixes to be removed in all order classes, for example [prefix + root], [root + suffix], or [prefix + root + suffix]. Additionally, the ambiguity reducer addresses the ambiguity problem of derivative words by referring to the Malay Word Register to determine the correct root word. Darwis et al. (2012) mentioned in the results of their test that, although the Malay Word Register may not contain all possible derivative words to solve all ambiguity cases, it is still practically useful and contributes to the 99.8% accuracy of their stemmer.

Among the existing Malay stemmers, Lee et al. [10] developed a syllable-based Malay word stemmer. Unlike the traditional stemming process, the proposed concept is to split the word into a syllable set before stemming it. After the syllabification is done, stemming rules are used to identify the morphological structure of the words. In this research, there are three sets of rules, namely Prefix rules, Suffix rules, and Morphographemic rules. The prototype works by removing identified prefixes and suffixes, and then considering spelling variations and exceptions. However, limitations still exist in the research, as different words are under-stemmed or over-stemmed when the three rules are applied. For instance, the stemming result of peralatan was ralat while the root word is alat, whereas the stemming result of kediaman was dia while the root word is diam [10]. These examples indicate that the syllabification process might affect the accuracy of the syllable-based Malay word stemmer. Regardless of these weaknesses, the system recorded an accuracy of 97.4% in stemming Malay words.

Singh and Gupta [11] presented a comprehensive review of the literature on text stemming, classifying it according to certain key parameters, and then gave a deep analysis of some well-known stemming algorithms on standard data sets. Kassim et al. [6] presented a detailed review of Malay word stemmers. They explained the research trends of the existing Malay word stemmers based on the morphological structures of the Malay language, general word stemming methods, and adopted word stemming. A cross-lingual sentiment lexicon acquisition method for the Malay and English languages is reported by Nasharuddin et al. [12]. They further tested their algorithm on a set of news test collections. Knowles Gerry [11] proposed new standards of data collection, organisation, and analysis associated with the methodology of corpus linguistics. A rule-based algorithm by which a stem for the Arabian Gulf dialect can be defined has also been described [11]. Special rules are created to remove the suffixes and prefixes of the dialect words. The algorithm also applies rules related to the word size and the relation between adjacent letters.

In conclusion, many approaches to Malay stemmers have been proposed and implemented by researchers in the past few years. Each reviewed approach has its own advantages and limitations, but the main contention lies in ensuring that the Malay prefixes and suffixes are clearly represented in the system, as they might affect the outcome. It is clear that the stemming algorithm is still open to various methods of modifying and improving the results of stemming. The contention of this paper is to provide another methodological perspective on designing a stemming algorithm and developing a stemmer for Malay words, by using an offline dictionary to handle ambiguity during the stemming process.

2. MATERIALS AND METHODS

2.1. System Design

Stemming the Malay language is not as easy as just removing the suffix, because Malay affixes consist of four diverse types:

Prefix: attached at the beginning of a word
Suffix: attached at the end of a word
Infix: located in the middle of a word
Prefix-suffix pair (confix): attached at the beginning and the end of the word

It has already been established that the Malay language is more complex than the English language. For the purpose of this study, an offline dictionary of 9,512 words was created to handle ambiguity during the stemming process. The affix rules included within the algorithm are extracted from Kamus Tatabahasa, and the dictionary used to check the validity of a stemmed word is Kamus Dewan (4th Ed). The system used to produce the stemmer is an HP Core i7 machine running the Windows 10 operating system. The tools used to develop the stemmer are the Python language, Anaconda, NLTK, and PyCharm (community version). Fig. 1 shows the dataflow of the proposed algorithm.

Fig-1: The dataflow of our stemming process

2.2. Algorithm

The stemmer is based on the following algorithm, which has five steps. Each step checks the input word or stemmed word in the local dictionary. If the word is a root word, it is printed as a root word; otherwise, the word is stemmed according to the defined rules. The pseudo code of the algorithm is given in Fig. 2 below.

While not stop equal to yes do:
    Get input word or paragraph
    Check in Dictionary:
        Check input/stemmed word in offline dictionary as a root word
        if the word is root word
            Print the word as root word.
            Do you want to stop, Yes/No?
            If stop equal to yes
                STOP
            Endif
        else
            Go to next Step
        endif
    Step 1: (Stem_extraSuffix)
        If the word contains extra suffix which is -nya, then remove it
            Check stemmed word in dictionary.
        endif
    Step 2: (Stem_Plural)
        If the word is in plural form, then remove plural.
            Check stemmed word in dictionary.
        endif
    Step 3: (Stem_infix)
        If the word contains infix, then remove infix.
            Check stemmed word in dictionary.
        endif
    Step 4: (Stem_prefix)
        If the word contains prefix, then remove prefix.
            Check stemmed word in dictionary.
        endif
    Step 5: (Stem_suffix)
        If the word contains suffix, then remove suffix.
            Check stemmed word in dictionary.
        endif
end While

Fig-2: The pseudo code of the algorithm.

First, the input word or stemmed word in each step is checked in the local dictionary. If the word is found to be a root word, it is displayed as a root word. Otherwise, the process proceeds to the next step.

Step1: (Stem_extraSuffix). This step stems the extra suffix "-nya". If "-nya" is not stemmed in the first step, the resulting root word is meaningless. Take, for example, the word 'mendekatinya'.

Execution without step1 (stem_extraSuffix):
After (check root_word) and (stem_infix), then
Stem prefix: "dekatinya"
Stem suffix: "dekati"

Execution with step1 (stem_extraSuffix):
Stem extraSuffix: "mendekati" ("-nya" is stemmed first)
Then proceed to (check root_word) and (stem_infix), after that
Stem prefix: "dekati"
Stem suffix: "dekat"

Accordingly, it is necessary to include stem_extraSuffix at the beginning in order to prevent a meaningless root word.
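The effect of running Stem_extraSuffix first can be reproduced with a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the one-word root dictionary, the reduced affix lists, and the names ROOT_WORDS, strip_extra_suffix, strip_prefix, strip_suffix, and stem_trace are all hypothetical, and the plural and infix steps are omitted because they leave this word unchanged.

```python
# Hypothetical stand-ins (assumptions for illustration only): a one-word
# root dictionary and a small subset of the affixes named in the paper.
ROOT_WORDS = {"dekat"}
PREFIXES = ["meng", "mem", "men", "me", "ber", "di"]
SUFFIXES = ["kannya", "nya", "kan", "an", "i"]


def strip_extra_suffix(word):
    """Step 1: remove the extra suffix -nya if present."""
    return word[:-3] if word.endswith("nya") else word


def strip_prefix(word):
    """Step 4 (simplified): remove the longest matching prefix."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix):
            return word[len(prefix):]
    return word


def strip_suffix(word):
    """Step 5 (simplified): remove the longest matching suffix."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def stem_trace(word, with_step1=True):
    """Return every intermediate form, stopping once a root word is found."""
    steps = [strip_prefix, strip_suffix]
    if with_step1:
        steps.insert(0, strip_extra_suffix)
    trace = [word]
    for step in steps:
        if trace[-1] in ROOT_WORDS:  # dictionary check before each step
            break
        stemmed = step(trace[-1])
        if stemmed != trace[-1]:
            trace.append(stemmed)
    return trace


print(stem_trace("mendekatinya", with_step1=False))
print(stem_trace("mendekatinya", with_step1=True))
```

Without Step 1 the trace ends at "dekati"; with Step 1 it reaches the root word "dekat", which matches the walk-through for 'mendekatinya'.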
Step2: (Stem_Plural). The Malay language has a different mechanism for making plurals. The word is doubled to make the plural, for example buku-buku (books). In this step the word is examined; if it is plural, it is stemmed to the root word.

Step3: (Stem_infix). If the word contains an infix, the infix is removed in this step and the process proceeds to the next step.

Step4: (Stem_prefix). The prefixes are removed in this step, for example "diper", "ber", "per", "ter", and so forth. However, there is a special rule in Malay grammar whereby a letter of the word is replaced with a different letter for different prefixes. Therefore, the prefix "mem" is replaced with either the letter "f" or "p" after checking with the dictionary. For example, "memakai" becomes "pakai" and "memikir" becomes "fikir" after stemming. The step checks whether the word has fewer than four letters before stemming, to prevent loss of meaning. It does not remove the prefix if the stem is too short.

Step5: (stem_suffix). The suffixes are removed in this step, for example "kannya", "nya", "kan", "an", etc. Like step4, it does not remove the suffix if the stem is too short. It checks whether the word has fewer than five letters before stemming, to prevent loss of meaning.

3. RESULTS AND DISCUSSION

The stemmer can remove prefixes, suffixes, and infixes from words in order to obtain the root word. To test the performance of the stemmer, a Malay language essay was randomly chosen from an online source [13]. A sample result is shown in Fig. 3.

Fig-3: Sample test of stemming

Table 1 shows a sample list of prefixes and suffixes that can be stemmed using this stemmer.

Table-1: Sample list of removable prefixes and suffixes

Prefixes: "diper", "ber", "bel", "mem", "penye", "per", "peny", "ter", "menye", "meny", "menge", "penge", "meng", "peng", "men", "pen", "me", "pem", "pe", "be", "ke", "se", "ter", "te", "di"
Suffixes: "kannya", "nya", "kan", "an", "i", "kah", "lah", "pun", "ita", "man", "wan", "wati", "ku", "mu"

Other than prefixes and suffixes, the proposed stemmer can also stem infixes. In the Malay language, there is a unique rule called dual words or "kata ganda". Several forms of "kata ganda" exist in the Malay language, and the proposed stemmer can successfully stem them. Table 2 shows several examples of dual word stemming.

Table-2: Sample list of dual words stemming

Type of Dual Words                    Words              Stemmed Word
Words without prefix                  jalan-jalan        jalan
Non-identical words without prefix    saudara-mara       saudara
Identical words with prefix           tergesa-gesa       gesa
Non-identical words with prefix       membeli-belah      beli
Words with suffix                     barang-barangan    barang
Words with prefix and suffix          sebaik-baiknya     baik

The stemmer is also able to stem passages instead of just words. Words make up sentences and sentences make up a passage. Therefore, any text passage can be stemmed using the stemmer and output as a text passage with stemmed words. Fig. 4 and Fig. 5 show the output for stemmed text passages without and with the use of the local dictionary.

Text passage without dictionary
Fig-4: Output for a stemmed text passage without the use of word dictionary

Text passage with dictionary
Fig-5: Output for a stemmed text passage with use of word dictionary.

The accuracy of stemming increased tremendously with the use of the word dictionary. Figure 4 shows that, without the help of a word dictionary, a handful of words were incorrectly stemmed because they contained letters that resembled prefixes or suffixes. Those letters are actually part of the root word, so the stemmer incorrectly stems the words and turns them into meaningless words. Figure 5 shows that the same words can be accurately stemmed with the help of the local dictionary. It can be concluded that the use of a word dictionary is essential in improving the accuracy of the stemmer, provided the dictionary contains many different words for the stemmer's reference.

The stemmer has its limitations. It does not achieve a hundred percent accuracy. Similar to the Porter stemming algorithm, there are instances where words were not properly stemmed because the root words contain letters which are also found in the prefixes, causing an overlap. The stemmer always chooses the longer prefix or suffix, for example "pem", the longer of the two, instead of "pe". As an example, the word "pemain" contains the prefix "pe" while its root word is "main". However, the stemmer recognizes the prefix "pem" instead of "pe", and the word is improperly stemmed with an output of "ain".

The same problem arises when the ending letter of a root word overlaps with letters found in suffixes. An example is the word "pendidikan", whose root word "didik" ends with the letter "k", which overlaps with the suffix "kan", whereas only the suffix "an" should be stemmed. Hence, the output for the word becomes "didi", as the stemmer always stems the longer suffix from words.

4. CONCLUSION

The study shows that the proposed stemmer is able to accurately stem the extra suffix, Malay plurals, prefixes, infixes, and suffixes. It is recommended that the stemmer be further enhanced by a look-up table or dictionary of overlapping words to cover the prefix and suffix overlapping limitation.

REFERENCES

[1] Abdullah, M. T., Ahmad, F., Mahmod, R., & Sembok, T. M. T. (2009). Rules frequency order stemmer for Malay language. IJCSNS International Journal of Computer Science and Network Security, 9(2), 433-438.
[2] Anelyza. [13]. Efforts to enhance the patriotic spirit among the communities in our country. Retrieved from, Ranaivo-Malancon, B. Computational analysis of affixed words in Malay language. in Proceedings of the 8th International Symposium on Malay/Indonesian Linguistics (ISMIL). 2004. Penang, Malaysia.
[3] Lovins, J.B., Development of a stemming algorithm. Mech. Translat. & Comp. Linguistics, 1968. 11(1-2): p. 22-31.
[4] Sembok, T.M.T. and Z.A. Bakar, Effectiveness of stemming and n-grams string similarity matching on Malay documents. International Journal of Applied Mathematics and Informatics, 2011. 5(3): p. 208-215.
[5] Abdullah, M.T., et al., Rules frequency order stemmer for Malay language. IJCSNS International Journal of Computer Science and Network Security, 2009. 9(2): p. 433-438.
[6] Kassim, M.N., et al. Word stemming challenges in Malay texts: A literature review. in 4th International Conference on Information and Communication Technology (ICoICT). 2016. Bandung, Indonesia: IEEE.
[7] Kassim, M.N., et al. Malay Word Stemmer to Stem Standard and Slang Word Patterns on Social Media. in Tan Y., Shi Y. (eds) Data Mining and Big Data. DMBD 2016. Lecture Notes in Computer Science. 2016. Cham: Springer.
[8] Yasukawa, M., H.T. Lim, and H. Yokoo, Stemming Malay text and its application in automatic text categorization. IEICE Transactions on Information and Systems, 2009. 92(12): p. 2351-2359.
[9] Fadzli, S.A., et al. Simple rules Malay stemmer. in The International Conference on Informatics and Applications (ICIA2012). 2012. Malaysia: The Society of Digital Information and Wireless Communication.
[10] Darwis, S.A., R. Abdullah, and N. Idris, Exhaustive affix stripping and a Malay word register to solve stemming errors and ambiguity problem in Malay stemmers. Malaysian Journal of Computer Science, 2012. 25(4): p. 196-209.
[11] Lee, C.J., M.O. Rosita, and N.Z. Mohamad. Syllable-based Malay Word Stemmer. in 2013 IEEE Symposium on Computers & Informatics (ISCI) 2014. Langkawi, Malaysia: IEEE.
[12] Knowles Gerry, Languages and linguistics in 2003: The potential contribution of corpus linguistics. Journal of Modern Languages, 2017. V. 15(1): p. 37-50.
[13] Nasharuddin, N.A., et al. English and Malay Crosslingual Sentiment Lexicon Acquisition and Analysis. in Kim K., Joukov N. (eds) Information Science and Applications 2017. ICISA 2017. Lecture Notes in Electrical Engineering, vol 424. 2017. Singapore: Springer.
[14] Anelyza, Efforts to enhance the patriotic spirit among the communities in our country, in Teacher Anelyza SPM 2009.
[15] Rehman Ullah Khan1*, M.I., YahyaKhan3, Oon Yin Bee4, Shahren Ahmad ZadiAdruce5, Mai S. Ishak6, Tan KockWah7, A NOVEL ALGORITHM FOR TEXT STEGANOGRAPHY. International Journal of Soft Computing: p. 13.
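The prefix and suffix overlap limitation discussed in Section 3, and the look-up table recommended in the conclusion, can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the authors' code: the toy dictionary, the reduced affix lists, the OVERLAP_TABLE contents, and all function names are hypothetical.

```python
# Assumed, illustrative data: a few affixes from Table 1 and a toy dictionary.
ROOT_WORDS = {"main", "didik"}
PREFIXES = ["pem", "pen", "pe"]
SUFFIXES = ["kan", "an"]
# Hypothetical look-up table of overlapping words mapped to their roots.
OVERLAP_TABLE = {"pemain": "main", "pendidikan": "didik"}


def strip_longest(word, affixes, at_start):
    """Remove the longest matching affix from the start or end of the word."""
    for affix in sorted(affixes, key=len, reverse=True):
        if at_start and word.startswith(affix):
            return word[len(affix):]
        if not at_start and word.endswith(affix):
            return word[: -len(affix)]
    return word


def longest_match_stem(word):
    """Longest-match stemming with a dictionary check before each step."""
    if word in ROOT_WORDS:
        return word
    word = strip_longest(word, PREFIXES, at_start=True)
    if word in ROOT_WORDS:
        return word
    return strip_longest(word, SUFFIXES, at_start=False)


def stem_with_overlap_table(word):
    """Consult the overlap table first, then fall back to longest match."""
    if word in OVERLAP_TABLE:
        return OVERLAP_TABLE[word]
    return longest_match_stem(word)


for w in ("pemain", "pendidikan"):
    print(w, longest_match_stem(w), stem_with_overlap_table(w))
```

Longest-match stripping alone yields "ain" and "didi", the erroneous outputs reported in Section 3, while the table lookup returns "main" and "didik". How such a table would be populated in practice is left open here.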
