Potter Stemmer
Potter Stemmer
Potter Stemmer
M.F.Porter
1980
1. Introduction
Removing suffixes by automatic means is an operation which is especially useful in the
field of information retrieval. In a typical IR environment, one has a collection of
documents, each described by the words in the document title and possibly by words in
the document abstract. Ignoring the issue of precisely where the words originate, we
can say that a document is represented by a vetor of words, or terms. Terms with a
common stem will usually have similar meanings, for example:
CONNECT
CONNECTED
CONNECTING
CONNECTION
CONNECTIONS
Frequently, the performance of an IR system will be improved if term groups such as
this are conflated into a single term. This may be done by removal of the various
suffixes -ED, -ING, -ION, IONS to leave the single term CONNECT. In addition, the
suffix stripping process will reduce the total number of terms in the IR system, and
hence reduce the size and complexity of the data in the system, which is always
advantageous.
Many strategies for suffix stripping have been reported in the literature. (e.g. 1-6) The
nature of the task will vary considerably depending on whether a stem dictionary is
being used, whether a suffix list is being used, and of course on the purpose for which
the suffix stripping is being done. Assuming that one is not making use of a stem
dictionary, and that the purpose of the task is to improve IR performance, the suffix
stripping program will usually be given an explicit list of siffixes, and, with each suffix,
the criterion under which it may be removed from a word to leave a valid stem. This is
the approach adopted here. The main merits of the present program are that it is small
(less than 400 lines of BCPL), fast (it will process a vocabulary of 10,000 different
words in about 8.1 seconds on the IBM 370/165 at Cambridge University), and
reasonably simple. At any rate, it is simple enough to be described in full as an algorithm
in this paper. (The present version in BCPL is freely available from the author. BCPL
is itself available on a wide range of different computers, but anyone wishing to use the
program should have little difficulty in coding it up in other programming languages.)
Given the speed of the program, it would be quite realistic to apply it to every word in
a large file of continuous text, although for historical reasons we have found it
convenient to apply it only to relatively small vocabulary lists derived from continuous
text files.
In any suffix stripping program for IR work, two points must be borne in mind. Firstly,
the suffixes are being removed simply to improve IR performance, and not as a
linguistic exercise. This means that it would not be at all obvious under what
circumstances a suffix should be removed, even if we could exactly determine the
suffixes of a word by automatic means.
Perhaps the best criterion for removing suffixes from two words W1 and W2 to produce
a single stem S, is to say that we do so if there appears to be no difference between the
two statements `a document is about W1' and `a document is about W2'. So if
W1=`CONNECTION' and W2=`CONNECTIONS' it seems very reasonable to conflate
them to a single stem. But if W1=`RELATE' and W2=`RELATIVITY' it seems perhaps
unreasonable, especially if the document collection is concerned with theoretical
physics. (It should perhaps be added that RELATE and RELATIVITY are conflated
together in the algorithm described here.) Between these two extremes there is a
continuum of different cases, and given two terms W1 and W2, there will be some
variation in opinion as to whether they should be conflated, just as there is with deciding
the relevance of some document to a query. The evaluation of the worth of a suffix
stripping system is correspondingly difficult.
The second point is that with the approach adopted here, i.e. the use of a suffix list with
various rules, the success rate for the suffix stripping will be significantly less than
100% irrespective of how the process is evaluated. For example, if SAND and
SANDER get conflated, so most probably will WAND and WANDER. The error here
is that the -ER of WANDER has been treated as a suffix when in fact it is part of the
stem. Equally, a suffix may completely alter the meaning of a word, in which case its
removal is unhelpful. PROBE and PROBATE for example, have quite distinct
meanings in modern English. (In fact these would not be conflated in our present
algorithm.) There comes a stage in the development of a suffix stripping program where
the addition of more rules to increase the performance in one area of the vocabulary
causes an equal degradation of performance elsewhere. Unless this phenomenon is
noticed in time, it is very easy for the program to become much more complex than is
really necessary. It is also easy to give undue emphasis to cases which appear to be
important, but which turn ut to be rather rare. For example, cases in which the root of a
word changes with the addition of a suffix, as in DECEIVE/DECEPTION,
RESUME/RESUMPTION, INDEX/INDICES occur much more rarely in real
vocabularies than one might at first suppose. In view of the error rate that must in any
case be expected, it did not seem worthwhile to try and cope with these cases.
It is not obvious that the simplicity of the present program is any demerit. In a test on
the well-known Cranfield 200 collection7 it gave an improvement in retrieval
performance when compared with a very much more elaborate program which has been
in use in IR research in Cambridge since 1971 (2,6). The test was done as follows: the
words of the titles and abstracts in the documents were passed through the earlier suffix
stripping system, and the resultis stems were used to index the documents. The words
of the queries were reduced to stems in the same way, and the documents were ranked
for each query using term coordination matching of query against document. From these
rankings, recall and precision values were obtained using the standard recall cutoff
method. The entire process was then repeated using the suffix stripping system
described in this paper, and the results were as follows:
earlier system present system
-------------- --------------
precision recall precision recall
0 57.24 0 58.60
10 56.85 10 58.13
20 52.85 20 53.92
30 42.61 30 43.51
40 42.20 40 39.39
50 39.06 50 38.85
60 32.86 60 33.18
70 31.64 70 31.19
80 27.15 80 27.52
90 24.59 90 25.85
100 24.59 100 25.85
Cleary, the performance is not very different. The important point is that the earlier,
more elaborate system certainly performs no better than the present, simple system.
2. The Algorithm
To present the suffix stripping algorithm in its entirety we will need a few difinitions.
*S - the stem ends with S (and similarly for the other letters).
*o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
And the condition part may also contain expressions with and, or and not, so that
(m>1 and (*S or *T))
tests for a stem with m>1 ending in S or T, while
(*d and not (*L or *S or *Z))
tests for a stem ending witha double consonant other than L, S or Z. Elaborate
conditions like this are required only rarely.
In a set of rules written beneath each other, only one is obeyed, and this will be the one
with the longest matching S1 for the given word. For example, with
SSES -> SS
IES -> I
SS -> SS
S ->
(here the conditions are all null) CARESSES maps to CARESS since SSES is the
longest match for S1. Equally CARESS maps to CARESS (S1=`SS') and CARES to
CARE (S1=`S').
In the rules below, examples of their application, successful or otherwise, are given on
the right in lower case. The algorithm now follows:
Step 1a
SSES -> SS caresses -> caress
IES -> I ponies -> poni
ties -> ti
SS -> SS caress -> caress
S -> cats -> cat
Step 1b
(m>0) EED -> EE feed -> feed
agreed -> agree
(*v*) ED -> plastered -> plaster
bled -> bled
(*v*) ING -> motoring -> motor
sing -> sing
If the second or third of the rules in Step 1b is successful, the following is done:
AT -> ATE conflat(ed) -> conflate
BL -> BLE troubl(ed) -> trouble
IZ -> IZE siz(ed) -> size
(*d and not (*L or *S or *Z))
-> single letter
hopp(ing) -> hop
tann(ed) -> tan
fall(ing) -> fall
hiss(ing) -> hiss
fizz(ed) -> fizz
(m=1 and *o) -> E fail(ing) -> fail
fil(ing) -> file
The rule to map to a single letter causes the removal of one of the double letter pair.
The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can
be recognised later. This E may be removed in step 4.
Step 1c
(*v*) Y -> I happy -> happi
sky -> sky
Step 1 deals with plurals and past participles. The subsequent steps are much more
straightforward.
Step 2
(m>0) ATIONAL -> ATE relational -> relate
(m>0) TIONAL -> TION conditional -> condition
rational -> rational
(m>0) ENCI -> ENCE valenci -> valence
(m>0) ANCI -> ANCE hesitanci -> hesitance
(m>0) IZER -> IZE digitizer -> digitize
(m>0) ABLI -> ABLE conformabli -> conformable
(m>0) ALLI -> AL radicalli -> radical
(m>0) ENTLI -> ENT differentli -> different
(m>0) ELI -> E vileli - > vile
(m>0) OUSLI -> OUS analogousli -> analogous
(m>0) IZATION -> IZE vietnamization -> vietnamize
(m>0) ATION -> ATE predication -> predicate
(m>0) ATOR -> ATE operator -> operate
(m>0) ALISM -> AL feudalism -> feudal
(m>0) IVENESS -> IVE decisiveness -> decisive
(m>0) FULNESS -> FUL hopefulness -> hopeful
(m>0) OUSNESS -> OUS callousness -> callous
(m>0) ALITI -> AL formaliti -> formal
(m>0) IVITI -> IVE sensitiviti -> sensitive
(m>0) BILITI -> BLE sensibiliti -> sensible
The test for the string S1 can be made fast by doing a program switch on the penultimate
letter of the word being tested. This gives a fairly even breakdown of the possible values
of the string S1. It will be seen in fact that the S1-strings in step 2 are presented here in
the alphabetical order of their penultimate letter. Similar techniques may be applied in
the other steps.
Step 3
(m>0) ICATE -> IC triplicate -> triplic
(m>0) ATIVE -> formative -> form
(m>0) ALIZE -> AL formalize -> formal
(m>0) ICITI -> IC electriciti -> electric
(m>0) ICAL -> IC electrical -> electric
(m>0) FUL -> hopeful -> hope
(m>0) NESS -> goodness -> good
Step 4
(m>1) AL -> revival -> reviv
(m>1) ANCE -> allowance -> allow
(m>1) ENCE -> inference -> infer
(m>1) ER -> airliner -> airlin
(m>1) IC -> gyroscopic -> gyroscop
(m>1) ABLE -> adjustable -> adjust
(m>1) IBLE -> defensible -> defens
(m>1) ANT -> irritant -> irrit
(m>1) EMENT -> replacement -> replac
(m>1) MENT -> adjustment -> adjust
(m>1) ENT -> dependent -> depend
(m>1 and (*S or *T)) ION -> adoption -> adopt
(m>1) OU -> homologou -> homolog
(m>1) ISM -> communism -> commun
(m>1) ATE -> activate -> activ
(m>1) ITI -> angulariti -> angular
(m>1) OUS -> homologous -> homolog
(m>1) IVE -> effective -> effect
(m>1) IZE -> bowdlerize -> bowdler
The suffixes are now removed. All that remains is a little tidying up.
Step 5a
(m>1) E -> probate -> probat
rate -> rate
(m=1 and not *o) E -> cease -> ceas
Step 5b
(m > 1 and *d and *L) -> single letter
controll -> control
roll -> roll
The algorithm is careful not to remove a suffix when the stem is too short, the length of
the stem being given by its measure, m. There is no linguistic basis for this approach. It
was merely observed that m could be used quite effectively to help decide whether or
not it was wise to take off a suffix. For example, in the following two lists:
list A list B
------ ------
RELATE DERIVATE
PROBATE ACTIVATE
CONFLATE DEMONSTRATE
PIRATE NECESSITATE
PRELATE RENOVATE
-ATE is removed from the list B words, but not from the list A words. This means that
the pairs DERIVATE/DERIVE, ACTIVATE/ACTIVE, DEMONSTRATE/DEMONS-
TRABLE, NECESSITATE/NECESSITOUS, will conflate together. The fact that no
attempt is made to identify prefixes can make the results look rather inconsistent. Thus
PRELATE does not lose the -ATE, but ARCHPRELATE becomes ARCHPREL. In
practice this does not matter too much, because the presence of the prefix decreases the
probability of an erroneous conflation.
Complex suffixes are removed bit by bit in the different steps. Thus
GENERALIZATIONS is stripped to GENERALIZATION (Step 1), then to
GENERALIZE (Step 2), then to GENERAL (Step 3), and then to GENER (Step 4).
OSCILLATORS is stripped to OSCILLATOR (Step 1), then to OSCILLATE (Step 2),
then to OSCILL (Step 4), and then to OSCIL (Step 5). In a vocabulary of 10,000 words,
the reduction in size of the stem was distributed among the steps as follows:
Suffix stripping of a vocabulary of 10,000 words
------------------------------------------------
Number of words reduced in step 1: 3597
" 2: 766
" 3: 327
" 4: 2424
" 5: 1373
Number of words not reduced: 3650
The resulting vocabulary of stems contained 6370 distinct entries. Thus the suffix
stripping process reduced the size of the vocabulary by about one third.
References
1. LOVINS, J.B. Development of a Stemming Algorithm. Mechanical Translation and
computation Linguistics. 11 (1) March 1968 pp 23-31.
3. PETRARCA, A.E. and LAY W.M. Use of an automatically generated authority list
to eliminate scattering caused by some singular and plural main index
terms. Proceedings of the American Society for Information Science, 6 1969 pp 277-
282.
5. COLOMBO, D.S. and NIEHOFF R.T. Final report on improved access to scientific
and technical information through automated vocabulary switching. NSF Grant No.
SIS75-12924 to the National Science Foundation.
6. DAWSON, J.L. Suffix Removal and Word Conflation. ALLC Bulletin, Michaelmas
1974 p.33-46.