INTERNATIONAL JOURNAL FOR RESEARCH IN EMERGING SCIENCE AND TECHNOLOGY, VOLUME-4, ISSUE-12, DEC-2017
E-ISSN: 2349-7610
Malay Language Stemmer
Rehman Ullah Khan*1, Fitri Suraya Mohamad2, Muh. Inam UlHaq3, Shahren
Ahmad Zadi Adruce4, Philip Nuli Anding5, Sajjad Nawaz Khan6, Abdulrazak
Yahya Saleh Al-Hababi7
1,2,4,5,6 Faculty of Cognitive Sciences and Human Development, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
E-mail: krullah@unimas.my, mfitri@unimas.my, azshahren@unimas.my, aphilip@unimas.my, sajjadnawazkhan@gmail.com, ysahabdulrazak@unimas.my
3 Department of Computer Sciences, Khushal Khan Khattak University, Karak, Pakistan
E-mail: inamix@gmail.com
ABSTRACT
A stemmer is a language processing tool, widely used in artificial intelligence applications, that removes affixes such as prefixes, infixes, and suffixes from a word to recover its root word. This study designs a stemming algorithm and develops a Malay language stemmer. Most existing Malay stemmers have problems because they tend to depend on online dictionaries, which return false results during stemming. Moreover, affixation in Malay words is more complex than in English words. Therefore, an offline dictionary of 9,512 words is introduced in this study to handle ambiguity when stemming Malay words. At each step, the algorithm first checks whether the word is a root word in the local dictionary; otherwise it processes the word further. The five steps are stem-extra-suffix, stem-plural, stem-infix, stem-prefix, and stem-suffix. The affix rules are extracted from Kamus Tatabahasa, and Kamus Dewan (4th Ed) is used to confirm the accuracy of the stemmed words. The results show that the proposed stemmer can stem prefixes, suffixes and infixes with high accuracy and can handle the complexity of Malay words. The stemmer can be further enhanced with a lookup table or dictionary of overlapping words to address the prefix and suffix overlapping limitation.
Index Terms: Stemming, Stemmer, Natural language processing, Algorithm, Morphology.
1. INTRODUCTION
The Malay language is a well-known member of the Austronesian family and is spoken across Southeast Asia, including Malaysia, Indonesia, Singapore, and Brunei [1]. Bahasa Malaysia in Malaysia, Bahasa Indonesia in Indonesia, and Bahasa Melayu in Singapore and Brunei all originated from the Malay language [1]. The Malay language has two key features to note. First, it does not have the grammatical functions commonly used in the English language. Second, affixes in Malay determine the part of speech of a word (noun, verb or adjective), whereas affixes in English mark plurality, tense and possession. The lack of published morphological analyses of the Malay language creates a need to address this gap in the literature on the linguistic characteristics of a language widely used in Southeast Asia.
VOLUME-4, ISSUE-12, DEC-2017
COPYRIGHT © 2017 IJREST, ALL RIGHT RESERVED
Stemming algorithms are widely used today for various purposes. Stemming is the process of eliminating derivational and inflectional suffixes from words until the root word is obtained [2]. For example, the words “assign”, “assigns”, “assigned”, “assigning”, “assignation” or “assignment” are reduced to the root word “assign”. Experiments have been conducted to examine how effective stemming and n-grams are in identifying suffixes, multi-word concepts and spelling errors. The experiments were divided into bigram and trigram string matching using documents in the Malay language. Sembok and Zainab [3] carried out separate experiments using bigrams and stemmed bigrams as well as trigrams and stemmed trigrams. Their experiments revealed that stemming both keywords and documents has a clear advantage over stemming keywords only and over not stemming at all. On the other hand, the experiments also revealed that bigram and trigram search worked better with no stemming of keywords, because stemming the keywords in the text shortens them and thereby affects the bigram and trigram search [3]. Nevertheless, applying stemming to keywords and documents improved the average precision. Sembok and Zainab [3] concluded that retrieval effectiveness is improved by combining search, n-gram matching and stemming.
Studies of stemming algorithms for the Malay language lag behind those for other languages such as English and other European languages [4]. Malay information retrieval systems are also scarce. The use of affixes in English and other European languages is less complex than in Malay, and stemmers for those languages are mostly concerned only with the removal of suffixes. In Malay morphology, however, a stemmed word is produced by removing affixes from the words in a text document or query. An affix is a verbal element attached to a word, either at the beginning (prefix) or at the end (suffix). More than one affix may be attached to a word at the same time. A word can contain both a prefix and a suffix, known as a prefix-suffix pair, as seen in the word ‘pemakanan’. The root word is ‘makan’; the prefix ‘pe’ is added at the beginning of the word and the suffix ‘an’ at the end to complete the word ‘pemakanan’. English and Malay differ in their root words, which are based on their respective morphological structures [5]. For instance, the English words ‘related’, ‘relates’, and ‘relation’ are derived from the root word ‘relate’, so a stemmer for English can work by suffix removal. Malay, by contrast, requires a different stemming process because of the complexity of its morphological rules. For example, the Malay words ‘pengajaran’, ‘pembelajaran’, and ‘pelajar’ are derived from the root word ‘ajar’, and suffix removal alone is insufficient to determine the correct root word [6].
A Malay language stemmer for text categorization was developed by Yasukawa et al. [7]. To overcome the over-stemming problem, this stemmer checks an input word against the dictionary before removing an affix. In its methodology, the affixes are arranged from the longest match list to the shortest match list: if no root word remains after the longest matching affix is removed, the stemmer removes the shortest matching affix instead. Therefore, the algorithm of this stemmer would not return a root word. However, this leads to two limitations in the stemmer: an ambiguity problem, and the finding that the algorithm is more suitable than the arrangement of the longest or shortest match list. This occurs because, once the stemmed word is found to match a root word, there is no further checking for the next possible affix.
According to Kassim and colleagues [6], affixed words are derived from combinations of affixes, clitics, and particles. They add that affixes can be classified into prefixes, suffixes, confixes and infixes. The most common prefixes in Malay are di+, ke+, se+, ber+, men+, pen+, ter+, and per+. Prefixes are normally attached at the beginning of root words. The part of speech of a root word does not change when it is combined with inflectional prefixes such as di+, ke+, and se+ [5]. For example, “diambil” (taken) is derived from the prefix di+ and the root word “ambil” (“take” in English); both words are verbs. In contrast, the part of speech of a root word does change when it is combined with derivational prefixes [6]. For example, the word “pelayan” is a noun derived from the combination of the verb layan (serving) and the prefix
pe+. The most common suffixes are +an and +i, which are usually attached at the end of root words. Like prefixes, suffixes can be inflectional or derivational. An example of an inflectional suffix is kuasai (powered), a verb derived from the root word kuasa (power), a verb, and the suffix +i. An example of a derivational suffix is minuman (drink), a noun formed by combining the verb minum (drinking) with the suffix +an. Confixes are likewise of two types, inflectional and derivational, which respectively do not and do change the part of speech. For instance, “hendak” (want) → “dihendaki” (wanted) does not change the part of speech, whereas “pakai” (use) → “pemakaian” (usage) does. Malay also has infixes such as +el+, +em+, and +er+, which are inserted in the middle of root words, for example “telunjuk” (fingers) from the root word “tunjuk” (point). Thus, there are many possible sequences of affixes that can be attached to base words.
In 2012, several researchers proposed different methods for Malay language stemming. One proposal, the UniSZA stemmer, uses seven simple rules that reduce dictionary dependency and lower the processing cost [8]. Fadzli et al. (2012) defined the rules as: check dictionary, check length, double words, prefix, suffix, change spelling and suffix-i. For the prefix step, they developed an enhanced library of Malay prefixes based on the RAO (Rules Application Order) stemmer proposed by Fatimah in 1995; the suffix step used a similar approach. The rules were arranged in five different orders, Arrangements A, B, C, D and E. Fadzli et al. (2012) experimented with all five arrangements, and the results showed that Arrangement C (Double Words → Check Dictionary → Check Length → Prefix → Suffix → Change Spelling → Suffix-i) scored well in terms of accuracy. Compared with RAO and RFO (Rules Frequency Order), UniSZA performed better, with a compression rate of 67.26%. They also found that the UniSZA method improves the compression rate of stemmers for other languages.
An innovative approach called the Malay stemmer with background knowledge was proposed by Leong et al. (2012). It was designed to boost processing speed by avoiding the excessively broad dictionary scanning of traditional algorithms, in which words are scanned regardless of whether they have affixes. A dictionary of affixed root words is added as a reference before stemming to resolve the stemming errors noted by Leong et al. (2012). The stemmer uses RFO as its base algorithm; the only difference is the second dictionary, which checks for affixed root words. Each word first goes through the basic dictionary, Kamus Dewan (4th Ed), to look up the root word; if it exists, the stemmer proceeds to the next word; if it does not, the second dictionary is accessed (Leong et al., 2012). When the second dictionary is accessed, three rules apply: (a) recode for prefix spelling exceptions and check the second dictionary, (b) check the stem for spelling variations and check the second dictionary, and (c) recode for suffix spelling exceptions and check the second dictionary. If step (c) fails to produce a root word, the process is looped. This approach reduced the stemming error from 0.21% to 0.09%.
In addition, exhaustive affix stripping and a Malay Word Register have been used to resolve over-stemming and under-stemming errors and to address the ambiguity in determining the correct root word [9]. By considering all possible word morphologies, the over-stemming and under-stemming error remover looks for affixes to remove in all order classes, for example [prefix + root], [root + suffix] or [prefix + root + suffix]. The ambiguity reducer, in turn, addresses the ambiguity of derivative words by referring to the Malay Word Register to determine the correct root word. Darwis et al. (2012) reported that, although the Malay Word Register may not contain all possible derivative words needed to resolve every ambiguity case, it is still practically useful and contributes to the 99.8% accuracy of their stemmer.
Among the existing Malay stemmers, Lee et al. [10] developed a syllable-based Malay word stemmer. Unlike the traditional stemming process, the proposed concept splits a word into a set of syllables before stemming it. After syllabification, stemming rules are used to identify the morphological structure of the word. There are three sets of rules: prefix rules, suffix rules and morphographemic rules. The prototype removes the identified prefixes and suffixes and then considers spelling variations and exceptions. However, limitations still exist in
the research, as some words are under-stemmed or over-stemmed when the three rules are applied. For instance, the stemming result of peralatan was ralat while the root word is alat, and the stemming result of kediaman was dia while the root word is diam [10]. These examples indicate that the syllabification process might affect the accuracy of the syllable-based Malay word stemmer. Despite these weaknesses, the system achieved an accuracy of 97.4% in stemming Malay words.
Singh and Gupta [11] presented a comprehensive survey of the literature on text stemming, classifying it according to certain key parameters, and gave an in-depth analysis of some well-known stemming algorithms on standard data sets. Kassim et al. [6] presented a detailed review of Malay word stemmers. They explained the research trends of existing Malay word stemmers based on the morphological structure of the Malay language, general word stemming methods and adopted word stemming. A cross-lingual sentiment lexicon acquisition method for the Malay and English languages was reported by Nasharuddin et al. [12], who further tested their algorithm on a set of news test collections. Knowles [11] proposed new standards of data collection, organisation and analysis associated with the methodology of corpus linguistics. A rule-based algorithm by which a stem for the Arabian Gulf dialect can be defined has also been proposed [11]. Special rules are created to remove the suffixes and prefixes of the dialect words, and the algorithm also applies rules related to word size and the relation between adjacent letters.
In conclusion, researchers have proposed and implemented many approaches to Malay stemming in the past few years. Each reviewed approach has its own advantages and limitations, but the main concern lies in ensuring that Malay prefixes and suffixes are clearly represented in the system, as they might affect the outcome. Clearly, stemming algorithms remain open to various methods of modifying and improving stemming results. This paper provides another methodological perspective on designing a stemming algorithm and developing a stemmer for Malay words, using an offline dictionary to handle ambiguity during the stemming process.
2. MATERIALS AND METHODS
2.1. System Design
Stemming the Malay language is not as easy as just removing suffixes, because Malay affixes are of four different types:
Prefix: attached at the beginning of a word
Suffix: attached at the end of a word
Infix: located in the middle of a word
Prefix-suffix pair (confix): attached at the beginning and the end of a word
It has already been established that affixation in the Malay language is more complex than in the English language. For the purpose of this study, an offline dictionary of 9,512 words was created to handle ambiguity during the stemming process. The affix rules included in the algorithm are extracted from Kamus Tatabahasa, and the dictionary used to check the validity of a stemmed word is Kamus Dewan (4th Ed). The stemmer was developed on an HP Core i7 machine running the Windows 10 operating system, using the Python language, Anaconda, NLTK and PyCharm (community version). Fig. 1 shows the dataflow of the proposed algorithm.
2.2. Algorithm
The stemmer is based on the following algorithm, which has five steps. Each step checks the input word or the stemmed word against the local dictionary. If the word is a root word, it is printed as the root word; otherwise it is processed further according to the defined rules. The pseudo code of the algorithm is given in Fig. 2 below.
Fig-1: The dataflow of our stemming process
While not stop equal to yes do:
    Get input word or paragraph
    Check in Dictionary:
        Check input/stemmed word in offline dictionary as a root word
        If the word is root word
            Print the word as root word.
            Do you want to stop, Yes/No?
            If stop equal to yes
                STOP
            endif
        else Go to next Step
        endif
    Step 1: (Stem_extraSuffix)
        If the word contains extra suffix which is -nya,
            then remove it.
            Check stemmed word in dictionary.
        endif
    Step 2: (Stem_Plural)
        If the word is in plural form,
            then remove plural.
            Check stemmed word in dictionary.
        endif
    Step 3: (Stem_infix)
        If the word contains infix,
            then remove infix.
            Check stemmed word in dictionary.
        endif
    Step 4: (Stem_prefix)
        If the word contains prefix,
            then remove prefix.
            Check stemmed word in dictionary.
        endif
    Step 5: (Stem_suffix)
        If the word contains suffix,
            then remove suffix.
            Check stemmed word in dictionary.
        endif
end While
Fig-2: The pseudo code of the algorithm.
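The five-step flow of Fig. 2 can be sketched in Python roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the six-word sample dictionary, and the reduced affix lists are assumptions made for the example (the paper's offline dictionary has 9,512 words), and the infix step here removes an infix only when the dictionary confirms the result, which is a simplification.

```python
# Hypothetical sample of the offline dictionary (the paper's has 9,512 words).
ROOT_WORDS = {"dekat", "buku", "tunjuk", "pakai", "fikir", "makan"}

# Reduced affix lists for illustration, ordered longest first.
PREFIXES = ["diper", "meng", "mem", "men", "ber", "per", "ter", "me", "di", "ke", "se"]
SUFFIXES = ["kannya", "nya", "kan", "an", "i"]
INFIXES = ["el", "em", "er"]


def is_root(word):
    return word in ROOT_WORDS


def stem_extra_suffix(word):
    # Step 1: remove the extra suffix "-nya".
    return word[:-3] if word.endswith("nya") and len(word) > 3 else word


def stem_plural(word):
    # Step 2: a doubled word such as "buku-buku" is reduced to one copy.
    first, sep, second = word.partition("-")
    return first if sep and first == second else word


def stem_infix(word):
    # Step 3 (simplified): drop an infix after the first letter,
    # but only when the result is a known root word.
    for infix in INFIXES:
        if word[1:1 + len(infix)] == infix:
            candidate = word[0] + word[1 + len(infix):]
            if is_root(candidate):
                return candidate
    return word


def stem_prefix(word):
    # Step 4: skip words with fewer than four letters; otherwise remove the
    # longest matching prefix. For "mem", try restoring "p" or "f" and keep
    # whichever form is found in the dictionary.
    if len(word) < 4:
        return word
    for prefix in PREFIXES:
        if word.startswith(prefix):
            rest = word[len(prefix):]
            if prefix == "mem":
                for letter in ("p", "f"):
                    if is_root(letter + rest):
                        return letter + rest
            return rest
    return word


def stem_suffix(word):
    # Step 5: skip words with fewer than five letters; otherwise remove
    # the longest matching suffix.
    if len(word) < 5:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


STEPS = [stem_extra_suffix, stem_plural, stem_infix, stem_prefix, stem_suffix]


def stem(word, steps=STEPS):
    word = word.lower()
    for step in steps:
        if is_root(word):  # dictionary check before every step
            return word
        word = step(word)
    return word


print(stem("mendekatinya"))                   # dekat
print(stem("mendekatinya", steps=STEPS[1:]))  # dekati (Step 1 skipped)
print(stem("memakai"))                        # pakai
```

Passing `steps=STEPS[1:]` skips the extra-suffix step, which shows why the order of the steps matters for a word such as “mendekatinya”.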
First, the input word, or the stemmed word produced at each step, is checked against the local dictionary. If the word is found to be a root word, it is displayed as the root word. Otherwise, the process proceeds to the next step.
Step 1: (Stem_extraSuffix)
This step stems the extra suffix “-nya”. If “-nya” is not stemmed in the first step, the resulting root word is meaningless. Consider, for example, the word ‘mendekatinya’.
Executing the steps without Step 1 (stem_extraSuffix):
After (check root_word) and (stem_infix), then
Stem prefix: “dekatinya”
Stem suffix: “dekati”
Executing the steps with Step 1 (stem_extraSuffix):
Stem extraSuffix: “mendekati” (“-nya” is stemmed first)
Then proceed to (check root_word) and (stem_infix), after that
Stem prefix: “dekati”
Stem suffix: “dekat”
Accordingly, stem_extraSuffix must be included at the beginning in order to prevent a meaningless root word.
Step 2: (Stem_Plural)
The Malay language has a different mechanism for forming plurals: the word is doubled, for example buku-buku (books). In this step the word is examined and, if it is plural, it is stemmed to the root word.
Step 3: (Stem_infix)
If the word contains an infix, the infix is removed in this step and the process proceeds to the next step.
Step 4: (Stem_prefix)
Prefixes such as “diper”, “ber”, “per” and “ter” are removed in this step. However, Malay has a special grammar rule in which a letter of the word is replaced with a different letter for different prefixes. Therefore, the prefix “mem” is replaced with either the letter “f” or “p” after checking with the dictionary. For example, “memakai” becomes “pakai” and “memikir” becomes “fikir” after stemming. Before stemming, the step checks whether the word has fewer than four letters, to prevent loss of meaning; it does not remove the prefix if the stem would be too short.
Step 5: (stem_suffix)
Suffixes such as “kannya”, “nya”, “kan” and “an” are removed in this step. As in Step 4, the suffix is not removed if the stem would be too short: the step checks whether the word has fewer than five letters before stemming, to prevent loss of meaning.
3. RESULTS AND DISCUSSION
The stemmer can remove prefixes, suffixes and infixes from words in order to obtain the root word. To test the performance of the stemmer, a Malay language essay was randomly chosen from an online source [13]. A sample result is shown in Fig. 3 below.
Fig-3: Sample test of stemming
Table 1 shows a sample list of the prefixes and suffixes that can be stemmed using this stemmer.
Table-1: Sample list of removable prefixes and suffixes
Prefixes | "diper", "ber", "bel", "mem", "penye", "per", "peny", "ter", "menye", "meny", "menge", "penge", "meng", "peng", "men", "pen", "me", "pem", "pe", "be", "ke", "se", "ter", "te", "di"
Suffixes | "kannya", "nya", "kan", "an", "i", "kah", "lah", "pun", "ita", "man", "wan", "wati", "ku", "mu"
Other than prefixes and suffixes, the proposed stemmer can also stem infixes. The Malay language also has a distinctive rule called dual words or “kata ganda”. There are several forms of “kata ganda” in the Malay language, and the proposed stemmer can successfully stem them. Table 2 shows several examples of dual word stemming.
Table-2: Sample list of dual words stemming
Type of Dual Words                 | Words           | Stemmed Word
Words without prefix               | jalan-jalan     | jalan
Non-identical words without prefix | saudara-mara    | saudara
Identical words with prefix        | tergesa-gesa    | gesa
Non-identical words with prefix    | membeli-belah   | beli
Words with suffix                  | barang-barangan | barang
Words with prefix and suffix       | sebaik-baiknya  | baik
The stemmer is also able to stem passages instead of just words. Words make up sentences and sentences make up a passage. Therefore, any text passage can be stemmed using the stemmer and output as a text passage with stemmed words. Fig. 4 and Fig. 5 below show the output for stemmed text passages with and without the use of the local dictionary.
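As a rough illustration of how the dual words in Table 2 and whole passages might be handled, the sketch below keeps the first half of a hyphenated word and strips a prefix from it only when the dictionary confirms the result. The function names, the six-word dictionary, and the short prefix list are assumptions made for this example; the paper does not give the rule actually used.

```python
# Hypothetical mini-dictionary; the paper's offline dictionary has 9,512 words.
ROOT_WORDS = {"jalan", "saudara", "gesa", "beli", "barang", "baik"}

# A few prefixes from Table 1, tried longest first.
PREFIXES = sorted(["ter", "mem", "ber", "se", "me", "di"], key=len, reverse=True)


def stem_dual_word(word, roots=ROOT_WORDS):
    """Reduce a hyphenated dual word ("kata ganda") to a root word."""
    if "-" not in word:
        return word
    first_half = word.split("-", 1)[0]
    if first_half in roots:
        return first_half
    for prefix in PREFIXES:
        candidate = first_half[len(prefix):]
        if first_half.startswith(prefix) and candidate in roots:
            return candidate
    return first_half  # fall back to the first half unchanged


def stem_passage(passage):
    """Apply the dual-word rule to every whitespace-separated token."""
    return " ".join(stem_dual_word(token) for token in passage.lower().split())


for example in ["jalan-jalan", "saudara-mara", "tergesa-gesa",
                "membeli-belah", "barang-barangan", "sebaik-baiknya"]:
    print(example, "->", stem_dual_word(example))
```

With this sample dictionary the loop prints the stemmed forms listed in Table 2 (jalan, saudara, gesa, beli, barang, baik).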
Text passage without dictionary
Fig- 4: Output for a stemmed text passage without the use of word dictionary
Text passage with dictionary
Fig- 5: Output for a stemmed text passage with use of word dictionary.
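The contrast between Fig. 4 and Fig. 5 can be imitated with a small sketch in which the same passage is stemmed with and without a dictionary guard. The suffix subset, the two-word dictionary, and the function names are assumptions for illustration only; the example words ‘makan’ and ‘minuman’ are taken from the introduction.

```python
SUFFIXES = ["kan", "an", "i"]    # subset of the suffixes in Table 1
ROOT_WORDS = {"makan", "minum"}  # hypothetical dictionary sample


def strip_suffix(word):
    """Remove the longest matching suffix, with no dictionary guard."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def stem_passage(passage, roots=None):
    """Stem every token; if `roots` is given, known root words are kept."""
    stemmed = []
    for token in passage.lower().split():
        if roots is not None and token in roots:
            stemmed.append(token)
        else:
            stemmed.append(strip_suffix(token))
    return " ".join(stemmed)


passage = "makan minuman"
print(stem_passage(passage))              # ma minum
print(stem_passage(passage, ROOT_WORDS))  # makan minum
```

Without the dictionary, the root word “makan” loses its final letters because they resemble the suffix “kan”; with the dictionary it is recognised as a root and left intact.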
The accuracy of stemming increased considerably with the use of the word dictionary. Figure 4 shows that, without the help of a word dictionary, a handful of words were incorrectly stemmed because they contained letters that resembled prefixes or suffixes. Those letters are actually part of the root word, so the stemmer incorrectly stems the words and turns them into meaningless words. Figure 5 shows that the same words can be accurately stemmed with the help of the local dictionary. It can be concluded that a word dictionary is essential for improving the accuracy of the stemmer, provided the dictionary contains many different words for the stemmer's reference.
The stemmer has its limitations. It does not achieve one hundred percent accuracy. As with the Porter stemming algorithm, there are instances where words were not properly stemmed because the root word contains letters that are also found in a prefix, causing an overlap. The stemmer always chooses the longer prefix or suffix, for example “pem” rather than “pe”. As an example, the word “pemain” contains the prefix “pe” and its root word is “main”. However, the stemmer recognizes the prefix “pem” instead of “pe”, so the word is improperly stemmed and the output is “ain”.
The same problem arises when the ending letter of a root word overlaps with letters found in a suffix. An example is the word “pendidikan”, whose root word “didik” ends with the letter “k”, which overlaps with the suffix “kan”, whereas only the suffix “an” should be stemmed. Hence, the output becomes “didi”, as the stemmer always stems the longer suffix.
4. CONCLUSION
The study shows that the proposed stemmer is able to accurately stem the extra suffix, Malay plurals, prefixes, infixes and suffixes. It is recommended that the stemmer be further enhanced with a lookup table or dictionary of overlapping words to address the prefix and suffix overlapping limitation.
REFERENCES
[1] Abdullah, M. T., Ahmad, F., Mahmod, R., & Sembok, T. M. T. (2009). Rules frequency order stemmer for Malay language. IJCSNS International Journal of Computer Science and Network Security, 9(2), 433-438.
[2] Anelyza. [13]. Efforts to enhance the patriotic spirit among the communities in our country. Retrieved from, Ranaivo-Malancon, B. Computational analysis of affixed words in malay language. in Proceedings of the 8th International Symposium on Malay/Indonesian Linguistics (ISMIL). 2004. Penang, Malaysia.
[3] Lovins, J.B., Development of a stemming algorithm. Mech. Translat. & Comp. Linguistics, 1968. 11(1-2): p. 22-31.
[4] Sembok, T.M.T. and Z.A. Bakar, Effectiveness of stemming and n-grams string similarity matching on Malay documents. International Journal of Applied Mathematics and Informatics, 2011. 5(3): p. 208-215.
[5] Abdullah, M.T., et al., Rules frequency order stemmer for Malay language. IJCSNS International Journal of Computer Science and Network Security, 2009. 9(2): p. 433-438.
[6] Kassim, M.N., et al. Word stemming challenges in Malay texts: A literature review. in 4th International Conference on Information and Communication Technology (ICoICT). 2016. Bandung, Indonesia: IEEE.
[7] Kassim, M.N., et al. Malay Word Stemmer to Stem Standard and Slang Word Patterns on Social Media. in Tan Y., Shi Y. (eds) Data Mining and Big Data. DMBD 2016. Lecture Notes in Computer Science. 2016. Cham: Springer.
[8] Yasukawa, M., H.T. Lim, and H. Yokoo, Stemming Malay text and its application in automatic text categorization. IEICE Transactions on Information and Systems, 2009. 92(12): p. 2351-2359.
[9] Fadzli, S.A., et al. Simple rules Malay stemmer. in The International Conference on Informatics and Applications (ICIA2012). 2012. Malaysia: The Society of Digital Information and Wireless Communication.
[10] Darwis, S.A., R. Abdullah, and N. Idris, Exhaustive affix stripping and a Malay word register to solve stemming errors and ambiguity problem in Malay stemmers. Malaysian Journal of Computer Science, 2012. 25(4): p. 196-209.
[11] Lee, C.J., M.O. Rosita, and N.Z. Mohamad. Syllable-based Malay Word Stemmer. in 2013 IEEE Symposium on
Computers & Informatics (ISCI). 2014. Langkawi, Malaysia: IEEE.
[12] Knowles Gerry, Languages and linguistics in 2003: The potential contribution of corpus linguistics. Journal of Modern Languages, 2017. 15(1): p. 37-50.
[13] Nasharuddin, N.A., et al. English and Malay Cross-lingual Sentiment Lexicon Acquisition and Analysis. in Kim K., Joukov N. (eds) Information Science and Applications 2017. ICISA 2017. Lecture Notes in Electrical Engineering, vol 424. 2017. Singapore: Springer.
[14] Anelyza, Efforts to enhance the patriotic spirit among the communities in our country, in Teacher Anelyza SPM 2009.
[15] Rehman Ullah Khan, M.I., Yahya Khan, Oon Yin Bee, Shahren Ahmad Zadi Adruce, Mai S. Ishak, Tan Kock Wah, A novel algorithm for text steganography. International Journal of Soft Computing: p. 13.