Exploiting linguistic knowledge to infer properties of
neologisms
by
C. Paul Cook
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto
© Copyright by C. Paul Cook 2010
Library and Archives Canada
Published Heritage Branch
395 Wellington Street
Ottawa ON K1A 0N4
Canada

ISBN: 978-0-494-72118-6

NOTICE: The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats. The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.
Abstract
Exploiting linguistic knowledge to infer properties of neologisms
C. Paul Cook
Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto
2010
Neologisms, or newly-coined words, pose problems for natural language processing
(NLP) systems. Due to the recency of their coinage, neologisms are typically not listed
in computational lexicons—dictionary-like resources that many NLP applications depend
on. Therefore when a neologism is encountered in a text being processed, the performance
of an NLP system will likely suffer due to the missing word-level information. Identifying
and documenting the usage of neologisms is also a challenge in lexicography, the making
of dictionaries. The traditional approach to these tasks has been to manually read large amounts of text. However, due to the vast quantities of text now being produced, particularly in electronic media such as blogs, it is no longer possible to manually analyze it all
in search of neologisms. Methods for automatically identifying and inferring syntactic
and semantic properties of neologisms would therefore address problems encountered in
both natural language processing and lexicography. Because neologisms are typically
infrequent due to their recent addition to the language, approaches to automatically
learning word-level information relying on statistical distributional information are in
many cases inappropriate. Moreover, neologisms occur in many domains and genres, and
therefore approaches relying on domain-specific resources are also inappropriate. The
hypothesis of this thesis is that knowledge about etymology—including word formation
processes and types of semantic change—can be exploited for the acquisition of aspects
of the syntax and semantics of neologisms. Evidence supporting this hypothesis is found
in three case studies: lexical blends (e.g., webisode, a blend of web and episode), text
messaging forms (e.g., any1 for anyone), and ameliorations and pejorations (e.g., the
use of sick to mean ‘excellent’, an amelioration). Moreover, this thesis presents the first
computational work on lexical blends and ameliorations and pejorations, and the first
unsupervised approach to text message normalization.
Dedication
To my beautiful wife Hannah for loving and supporting me through the ups and downs
of life.
Acknowledgements
I’d like to begin by thanking my supervisor, Suzanne Stevenson, for her support throughout my graduate studies. Suzanne, thanks for believing in me when I told you that I
wanted to work on, of all things, lexical blends. Thank you for having the confidence
in me to let me work on understudied phenomena and do research that is truly novel.
Without your support this thesis wouldn’t even have been started.
I’d also like to thank the other members of my thesis committee, Graeme Hirst and
Gerald Penn. Their feedback on my research and their careful reading of my thesis
greatly improved the quality of my work.
I’m also grateful to my collaborator, Afsaneh Fazly. Although the research that we
did together did not form a part of this thesis, working with Afsaneh, and learning from
her experience and insight, was an important part of my development as a researcher.
I’d further like to thank the other members, past and present, of the computational
linguistics group at the University of Toronto. Their feedback and support has been a
tremendous help throughout my Ph.D.
Finally I’d like to thank the Natural Sciences and Engineering Research Council of
Canada, the Ontario Graduate Scholarship Program, the University of Toronto, and the
Dictionary Society of North America for financially supporting this research.
Contents

1 Neologisms
  1.1 Problems posed by neologisms
    1.1.1 Challenges for natural language processing
    1.1.2 Problems in lexicography
  1.2 New word typology
  1.3 Overview of thesis

2 Related work
  2.1 Computational work on specific word formations
    2.1.1 Composites
      2.1.1.1 POS tagging
      2.1.1.2 Compounds
    2.1.2 Shifts
    2.1.3 Shortenings
      2.1.3.1 Acronyms and initialisms
      2.1.3.2 Clippings
    2.1.4 Loanwords
    2.1.5 Blends
  2.2 Computational work exploiting context
  2.3 Lexicographical work on identifying new words
    2.3.1 Corpora in lexicography
      2.3.1.1 Drudgery
      2.3.1.2 Corpora
      2.3.1.3 Statistics
    2.3.2 Successful new words
    2.3.3 Finding new words

3 Lexical blends
  3.1 A statistical model of lexical blends
    3.1.1 Candidate sets
    3.1.2 Statistical features
      3.1.2.1 Frequency
      3.1.2.2 Length, contribution, and phonology
      3.1.2.3 Semantics
      3.1.2.4 Syllable structure
  3.2 Creating a dataset of recent blends
  3.3 Materials and methods
    3.3.1 Experimental expressions
    3.3.2 Experimental resources
    3.3.3 Experimental methods
    3.3.4 Evaluation metrics
  3.4 Experimental results
    3.4.1 Candidate sets
      3.4.1.1 CELEX
      3.4.1.2 Web 1T 5-gram Corpus
    3.4.2 Source word identification
      3.4.2.1 Feature ranking
      3.4.2.2 Error analysis
      3.4.2.3 Modified perceptron
      3.4.2.4 Discussion
  3.5 Blend identification
  3.6 Related Work
  3.7 Summary of contributions

4 Text message forms
  4.1 Analysis of texting forms
  4.2 An unsupervised noisy channel model for text message normalization
    4.2.1 Word models
      4.2.1.1 Stylistic variations
      4.2.1.2 Subsequence abbreviations
      4.2.1.3 Suffix clippings
    4.2.2 Word formation prior
    4.2.3 Language model
  4.3 Materials and methods
    4.3.1 Datasets
    4.3.2 Lexicon
    4.3.3 Model parameter estimation
    4.3.4 Evaluation metrics
  4.4 Results and discussion
    4.4.1 Results by formation type
    4.4.2 Results by Model
    4.4.3 All unseen data
  4.5 Related Work
  4.6 Summary of contributions

5 Ameliorations and pejorations
  5.1 Determining semantic orientation
  5.2 Corpora
  5.3 Results
    5.3.1 Identifying historical ameliorations and pejorations
    5.3.2 Artificial ameliorations and pejorations
    5.3.3 Hunting for ameliorations and pejorations
  5.4 Amelioration or pejoration of the seeds
  5.5 More on determining semantic orientation
    5.5.1 Combining information from the seed words
    5.5.2 Number of seed words
    5.5.3 Latent Semantic Analysis
  5.6 Summary of contributions

6 Conclusions
  6.1 Summary of contributions
  6.2 Future directions
    6.2.1 Lexical blends
    6.2.2 Text messaging forms
    6.2.3 Ameliorations and pejorations
    6.2.4 Corpus-based studies of semantic change

Bibliography
List of Tables

1.1 Word formation types and their proportion in the data analyzed by Algeo (1980).
2.1 Metcalf's (2002) FUDGE factors for determining whether a word will remain in usage.
3.1 A candidate set for architourist, a blend of architecture and tourist.
3.2 The Wordspy definition, and first citation given, for the blend staycation.
3.3 Types of blends and their frequency in Wordspy data.
3.4 % of expressions (% exps) with their source words in each lexical resource and candidate set (CS), and after applying the syllable heuristic filter on the CELEX CS, as well as median CS size, for both the Wordsplend and Mac-Conf datasets.
3.5 % accuracy on blends in Wordsplend and Mac-Conf using the feature ranking approach. The size of each dataset is given in parentheses. The lexicon employed (CELEX or WEB 1T) is indicated. The best accuracy obtained using this approach for each dataset and lexicon is shown in boldface. Results that are significantly better than the informed baseline are indicated with ∗.
3.6 % accuracy on blends in Wordsplend and Mac-Conf using the modified perceptron algorithm. The size of each dataset is given in parentheses. The lexicon employed (CELEX or WEB 1T) is indicated. Results that are significantly better than the informed baseline are indicated with ∗.
4.1 Frequency of texting forms in the development set by formation type.
4.2 Grapheme–phoneme alignment for without.
4.3 % in-top-1, in-top-10, and in-top-20 accuracy on test data using both estimates for P(wf). The results reported by Choudhury et al. (2007) are also shown.
4.4 Frequency and % in-top-1 accuracy using the formation-specific model where applicable (Specific) and all models (All) with a uniform estimate for P(wf), presented by formation type.
4.5 % in-top-1 accuracy on the 303 test expressions using each model individually.
5.1 Time period and approximate size of each corpus.
5.2 % accuracy for inferring the polarity of expressions in GI using each corpus. The accuracy for classifying the items with absolute calculated polarity in the top 25% and 50% (top panel) and 75% and 100% (bottom panel) which co-occur at least five times with seed words in the corresponding corpus is shown. In each case, the baseline of always choosing negative polarity and the number of items classified (N) are also shown.
5.3 % accuracy and baseline using Lampeter and approximately one-million-word samples from CLMETEV and the BNC. The results using CLMETEV and the BNC are averaged over five random one-million-word samples.
5.4 The polarity in each corpus and change in polarity for each historical example of amelioration and pejoration. Note that succeed does not exhibit the expected change in polarity.
5.5 Average polarity of positive and negative words from GI in each corpus with frequency greater than five and which co-occur at least once with both positive and negative seed words in the indicated corpus.
5.6 Expressions with top 10 increase in polarity from CLMETEV to the BNC (candidate ameliorations). For each expression, the proportion of human judgements for each category is shown: CLMETEV usage is more positive/less negative (CLMETEV), BNC usage is more positive/less negative (BNC), neither usage is more positive or negative (Neither). Majority judgements are shown in boldface, as are correct candidate ameliorations according to the majority responses of the judges.
5.7 Expressions with top 10 decrease in polarity from CLMETEV to the BNC (candidate pejorations). For each expression, the proportion of human judgements for each category is shown: CLMETEV usage is more positive/less negative (CLMETEV), the BNC usage is more positive/less negative (BNC), neither usage is more positive or negative (Neither). Majority judgements are shown in boldface, as are correct candidate pejorations according to the majority responses of the judges.
5.8 A sample of the Amazon Mechanical Turk polarity judgement task.
5.9 Confusion matrix representing the results of our classification task used to define our adapted versions of precision and recall (given in equations 5.6 and 5.7).
5.10 Results for classifying items in GI in terms of our adapted versions of precision (P), recall (R), and F-measure (F) using the TL seeds and GI seeds. The number of seed words for each of TL and GI is given, along with the number of items that are classified using these seed words.
5.11 Precision (P), recall (R), F-measure (F), and number of items classified for the top 25% most-polar items in GI. Polarity is calculated using the items from Brooke et al.'s (2009) polarity lexicon with polarity greater than or equal to the indicated level as seed words; the total number of seed words is also given.
List of Figures

3.1 ROC curves for blend identification.
5.1 Average % accuracy for inferring the polarity of the items in GI for each corpus as the percentage of noisy seed words is varied.
Chapter 1
Neologisms
Neologisms—newly-coined words or new senses of an existing word—are constantly being
introduced into a language (Algeo, 1980; Lehrer, 2003), often for the purpose of naming
a new concept. Domains that are culturally prominent or that are rapidly advancing—
current examples being electronic communication and the Internet—often contain many
neologisms, although novel words do arise throughout a language (Ayto, 1990, 2006;
Knowles and Elliott, 1997).
Fischer (1998) gives the following definition of neologism:
A neologism is a word which has lost its status of a nonce-formation but is
still one which is considered new by the majority of members of a speech
community.
A nonce-formation is a word which is created and used by a speaker who believes it to be
new (Bauer, 1983); once a speaker is aware of having used or heard a word before, it ceases
to be a nonce-formation. Other definitions of neologism take a more practical stance. For
example, Algeo (1991) considers a neologism to be a word which meets the requirements
for inclusion in general dictionaries, but has not yet been recorded in such dictionaries.
The neologisms considered in this thesis will—for the most part—satisfy both of these
definitions: they are sufficiently established to no longer be nonce-formations, generally
considered to be new, and typically not recorded in general-purpose dictionaries.
We further distinguish between two types of neologism: new words that are unique
strings of characters, for example, webisode (a blend of web and episode), and neologisms
that correspond to new meanings for an existing word form, for example, Wikipedia used
as a verb—meaning to conduct a search on the website Wikipedia—instead of as a proper
noun.
Before going any further, we must clarify the meaning of word. It is difficult to give
a definition of word which is satisfactory for all languages and all items which seem to
be words (Cruse, 2001). Therefore, in the same spirit as Cruse, we will characterize the
notion word in terms of properties of prototypical words, accepting that our definition
is inadequate for some cases. Words are typically morphological objects, that is to say
that words are formed by combining morphemes according to the rules of morphology.
Turning to syntax, specifically X-bar theory, words typically occupy the X0 position in
a parse tree; that is, words are usually syntactic atoms (Di Sciullo and Williams, 1987).
Phonological factors may also play a role in determining what is a word. For example,
a speaker typically cannot naturally pause during pronunciation of a word (Anderson,
1992). We do not appeal to the notion of listedness in the lexicon in our characterization
of word. Since the rules of morphology are recursive, there are potentially an infinite
number of words. Therefore, if the lexicon is viewed as a simple list of words, not all words
can be stored in the lexicon. Furthermore, many non-compositional phrases, such as
idioms, must be stored in the lexicon, as their meaning cannot be derived compositionally
from the meaning of their parts. Neither do we rely on whitespace in writing to determine what is a word. Many words are written as two whitespace-delimited strings (e.g.,
many English compounds); some languages do not use whitespace to delimit words; and
moreover, some languages do not have writing systems.
It is difficult to know the frequency of new word formation. Barnhart (1978, section
2.3.4) notes that approximately 500 new words are recorded each year in various English
dictionaries. This figure can be taken as a lower bound of the yearly number of new
English words, but the true number of such words is likely much higher. Dictionaries
only record words that meet their criteria for inclusion, which may be based on frequency,
range of use, timespan of use, and judgements about a word’s cruciality, that is, the need
for it to be in the language (Sheidlower, 1995). These criteria will not necessarily capture
all new words, even those that have become established in a language. Furthermore, at
the time of Barnhart’s (1978) estimate, lexicography was largely a manual undertaking.
Lexicographers identified neologisms by reading vast quantities of material and recording
what they found.1 It is entirely possible that dictionaries fail to document some of the
new words from a given time period which satisfy their criteria for inclusion.

1 Johnson (1755) describes the work of dictionary making in his oft-quoted definition of lexicographer: "A writer of dictionaries; a harmless drudge, that busies himself in tracing the original, and detailing the signification of words."
Barnhart (1985) observes that in a large sample of magazines spanning one month,
1,000 new words were found; from this he extrapolates that the rate of new word formation may be roughly 12,000 words per year. However, it is likely that many
of these terms would not be recorded in dictionaries, due to their policies for inclusion.
This figure may also be an overestimate of the yearly number of new words; sampling any
particular month will also find words which were new in a previous month, and sampling
subsequent months may reveal fewer neologisms. On the other hand, this estimate may
be quite conservative as it only considers magazines; sampling more materials may reveal
many more new words.
Metcalf (2002) claims that at least 10,000 new words are coined each day in English;
however, he also notes that most of these words never become established forms. The rate
at which new words are coined can also be estimated from corpus data. The number of
hapax legomena (or hapaxes—words which only occur once) and total number of tokens in a corpus can be used to estimate the rate of vocabulary growth (Baayen and Renouf, 1996). As corpus size increases, the proportion of new words amongst the hapaxes increases, and so the rate of vocabulary growth gives an estimate of the rate of new word coinage. However, new words that are also hapaxes may be nonce-formations. Nevertheless, despite the difficulty of estimating the frequency of new word coinage, and the differing estimates thereof, it is clear that many new words enter the English language each year.
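As a concrete illustration of the growth-rate estimate just mentioned, the following sketch (in Python, with hypothetical function and variable names not taken from any of the cited work) computes the proportion of hapax legomena among the tokens of a corpus; this ratio, V(1, N)/N, is the quantity that Baayen-style approaches use to approximate how quickly new types are being added.

from collections import Counter

def growth_rate(tokens):
    # Baayen-style growth rate: V(1, N) / N, where V(1, N) is the number of
    # hapax legomena (types occurring exactly once) among the N tokens seen.
    counts = Counter(tokens)
    hapaxes = sum(1 for count in counts.values() if count == 1)
    return hapaxes / len(tokens)

# Example: 5 of the 7 types below are hapaxes, so the estimate is 5/9.
tokens = "the cat sat on the mat with another cat".split()
print(growth_rate(tokens))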
1.1 Problems posed by neologisms
In the following two subsections we consider challenges related to neologisms in the fields
of natural language processing and lexicography.
1.1.1 Challenges for natural language processing
Systems for natural language processing (NLP) tasks often depend on lexicons for a variety of information, such as a word’s parts-of-speech or meaning representation. Therefore,
when an unknown word—a word that is not in a system’s lexicon—is encountered in a
text being processed, the performance of the entire system will likely suffer due to missing
lexical information.
Unknown words may be of various types. A word may be unknown, for instance, because it is an infrequent or domain-specific word that happens not to have been included in a system's lexicon. For example, syntagmatic is not listed in the CELEX database
(Baayen et al., 1995), but is included in the Macquarie Dictionary (Delbridge, 1981).
Non-word spelling errors—errors that result in a form that is typically not considered to
be a word, such as teh for the—and proper nouns are two other types of unknown word
that have received a fair amount of attention in computational linguistics, under the
headings of non-word spelling error detection and correction, and named-entity recognition, respectively. Since new words are constantly being coined, neologisms are a further
source of unknown words; however, neologisms have not been studied as extensively in
computational linguistics.
Ideally, an NLP system could identify neologisms as such, and then infer various aspects of their syntactic or semantic properties necessary for the computational task at
hand. For example, a parser for a combinatory categorial grammar may benefit from
knowing a neologism’s syntactic category, while semantic information such as a neologism’s hypernyms may be important for tasks such as question answering. Context of
usage is clearly a key piece of information for inferring a word’s syntactic and semantic
properties, and indeed many studies into lexical acquisition have used the context in
which words occur to learn a variety of such properties (e.g., Hindle, 1990; Lapata and
Brew, 2004; Joanis et al., 2008). However, these methods are generally not applicable
for learning about neologisms. Such techniques depend on distributional information
about a word which is obtained by observing a large number of usages of that word; in
general, the more frequent a target word, the more accurate the automatically-inferred
information will be. Since neologisms are expected to be rather infrequent due to the
recency of their coinage, such methods cannot be expected to work well on these words.
On the other hand, some studies into lexical acquisition have inferred lexical information based on just a single usage, or a small number of usages, of a word (e.g.,
Granger, 1977; Cardie, 1993; Hastings and Lytinen, 1994). These methods exploit rich
representations of the lexical and syntactic context in which a given target word occurs,
as well as domain-specific knowledge resources, to infer lexical information. However,
these methods are of limited use for inferring properties of neologisms, as the domain-specific knowledge resources they require are only available for a very small number of
narrowly-defined domains.
1.1.2 Problems in lexicography
Dictionaries covering current language must be updated to reflect new words, and new
senses of existing word forms, that have come into usage. Vast quantities of text are produced each day in a variety of media including traditional publications such as newspapers
and magazines, as well as newer types of communication such as blogs and micro-blogs
(e.g., Twitter). New-word lexicographers must search this text for neologisms; however,
given the amount of text that must be analyzed, it is simply not feasible to manually
process it all (Barnhart, 1985). Therefore, automatic (or semi-automatic) methods for
the identification of new words are required.
Identifying unique string neologisms is facilitated by their distinguishing orthographic
form. One proposed method of searching for unique string neologisms that should be
included in a dictionary is to identify words that are substantially more frequent in
a corpus of recently-produced texts than in a corpus of older texts, and that are not
listed in the dictionary under consideration; the identified words can then be manually
examined, and if found to be appropriate, included in that dictionary (O’Donovan and
O’Neil, 2008). This semi-automatic method for finding new words is limited in that it
can only find unique string neologisms and not new senses of word forms. Indeed, this
remains an important open problem in computational lexicography. The precision of
such a method is also limited as it will identify new-word candidates that have unique
orthographic forms, such as jargon terms and proper nouns, that—depending on the
dictionary’s inclusion policies—should not be included in the dictionary.
Even greater challenges are posed by neologisms that correspond to new senses of
existing word forms, that is, neologisms that are homographous with words already
recorded in a given dictionary. Such neologisms result in so-called covert lexical gaps
(Zernik, 1991), which are difficult to automatically identify as they cannot be searched
for in any straightforward way. Lexicographers have also stressed the importance of
not solely focusing on new words when updating a dictionary, but also considering how
established words have changed (Simpson, 2007).
1.2 New word typology
As discussed above in Section 1.1.1, current methods for lexical acquisition are generally
not applicable for learning properties of neologisms; this is largely due to the reliance of
these methods on statistical distributional information, and the tendency for neologisms
to be low-frequency items. However, knowledge about the processes through which neologisms are formed can be exploited in systems for lexical acquisition of neologisms; to
date this knowledge source has not been widely considered in computational work.
Language users create new words through a variety of word formation processes, e.g.,
derivational morphology, compounding, and borrowing from another language (Bauer,
1983; Plag, 2003). To estimate the relative frequency of the various word formation processes, Algeo (1980) determines the etymology of 1,000 words selected from Barnhart
et al. (1973); a summary of his findings is presented in Table 1.1. I hypothesize that by
identifying the word formation process of a given new word, and then exploiting properties of this formation process, computational systems can infer some lexical information
from a single usage of that word without relying on expensive domain-specific knowledge
resources.
Of the items Algeo classifies as shifts, roughly half—7.7% of the total words analyzed—
do not correspond to a change in part-of-speech. These items are the neologisms that
are most difficult to identify. Although these words represent a rather small percentage
of the total number of neologisms, the rate of emergence of new senses of word forms is
not necessarily low; Barnhart et al. (1973) were manually searching for neologisms, and
they may have missed many new senses of word forms. Nevertheless, new senses of word
forms also emerge through regular processes which can be exploited in computational
systems.
Formation Type   %    Examples
Composites       64   Affixed forms, such as dehire, and compound forms, such as think tank
Shifts           14   The noun edit from the verb edit, hardhat meaning construction worker
Shortenings      10   Clippings, such as Jag from Jaguar, and acronyms, such as Sam from Surface to Air Missile
Loanwords        7    Al dente from Italian, macho from Spanish
Blends           5    Chunnel from channel tunnel
Unknown          1    Cowabunga

Table 1.1: Word formation types and their proportion in the data analyzed by Algeo (1980).
New senses of word forms arise through semantic change, two broad types of which
are widening and narrowing (Campbell, 2004, Chapter 9). Widening is the extension
of a word’s meaning to more contexts than it previously applied to. For example, in
addition to indicating that a literal sense is intended, literally has also come to be used as
an intensifier, even in figurative contexts, as in The world is literally her oyster.2 On the
other hand, narrowing restricts a word’s meaning to fewer contexts. For example, the
meaning of meat in Old English was food in general, but this has since narrowed to the
flesh of animals. (The food sense of meat may still be in use nowadays, but it appears
to be much less frequent than the animal flesh sense.) Many other types of semantic
change, such as metaphorical sense extensions, can be viewed as types of widening. For
example, using the metaphor arguments are buildings, the domain of arguments
can be discussed using terms for the domain of buildings (Lakoff and Johnson, 1980), as in My argument was demolished. Two further types of widening are amelioration and
pejoration; in these processes a word takes on a more positive or negative evaluation,
respectively, in the mind of the speaker. A recent amelioration is the extension of banging
from its meaning of music “having a loud, prominent, danceable beat” to “excellent” (not
specifically referring to music).3 An example pejoration is retarded acquiring the sense
of being inferior or of poor quality. I hypothesize that knowledge about specific types of
semantic change, such as amelioration and pejoration, can be exploited by computational
systems for the automatic acquisition of new senses of word forms.

2 http://www.nytimes.com/2009/07/05/opinion/05dowd.html
3 "banging, ppl. a." OED Online, March 2007, Oxford University Press, 13 August 2009, http://dictionary.oed.com/cgi/entry/50017259.
1.3 Overview of thesis
The hypothesis of this thesis is that knowledge about etymology—including word formation processes and types of semantic change—can be exploited for the acquisition of
aspects of the syntax and semantics of neologisms; to date, this knowledge source has not
been widely considered in computational linguistics. Moreover, in some cases, exploiting
etymological information may allow the development of lexical acquisition methods which
rely on neither statistical distributional information nor domain-specific lexical resources,
both of which are desirable to avoid in the case of neologisms.
Chapter 2 discusses related computational and lexicographical work on neologisms.
In particular, we examine computational work that has exploited knowledge of word
formation processes for lexical acquisition, as well as studies that infer aspects of the
syntax and semantics of a given lexical item from just a small number of its usages.
This chapter also examines lexicographical approaches to identifying neologisms and
determining which are likely to remain in usage, and thus deserve entry in a dictionary.
The next three chapters present novel research on three topics related to the research
discussed in Chapter 2 that, to date, have not received the attention they deserve in computational linguistics. Lexical blends—or blends, also sometimes referred to as portmanteaux by lay people—are words such as staycation which are formed by combining
parts of existing words, in this case stay-at-home and vacation. Although accounting for
roughly 5% of new words (see Table 1.1), blends have largely been ignored in computational work. Chapter 3 presents a method for inferring the source words of a given
blend—for example, stay-at-home and vacation for staycation—based on linguistic observations about blends and their source words and cognitive factors that likely play a
role in their interpretation. On a dataset of 324 blends, the proposed method achieves an
accuracy of 40% on the task of identifying both source words of each expression, which
has an informed baseline of 27%. Chapter 3 also presents preliminary results for the task
of distinguishing blends from other types of neologisms. This research is the first computational study of lexical blends, and was previously published by Cook and Stevenson
(2007) and Cook and Stevenson (2010b).
Cell phone text messaging—also known as SMS—contains many abbreviations and
non-standard forms. Before NLP tasks such as machine translation can be applied to
text messages, the text must first be normalized by converting non-standard forms to
their standard forms. This is particularly important for text messaging given the abundance of non-standard forms in this medium. Although text message normalization has
been considered in several computational studies, the issue of out-of-vocabulary texting
forms—items that are encountered in text on which the system is operating, but not
found in the system’s training data—has received little attention. Chapter 4 presents an
unsupervised type-level model for normalization of non-standard texting forms. The proposed method draws on observations about the typical word formation processes in text
messaging, and—like the work on lexical blends described in Chapter 3—incorporates
cognitive factors in human interpretation of text messaging forms. The performance of
the proposed unsupervised method is on par with that of the best reported results of a
supervised system on the same dataset. This work was previously published by Cook
and Stevenson (2009).
The research in Chapters 3 and 4 focuses on unique string neologisms. Chapter 5,
on the other hand, presents work on identifying new word senses. Amelioration and
pejoration are common types of semantic change through which a word’s meaning takes
on a more positive or negative evaluation in the mind of the speaker. Given the recent interest in natural language processing tasks such as sentiment analysis, and that
many current approaches to such tasks rely on lexicons of word-level polarity, automatic
methods for keeping polarity lexicons up-to-date are needed. Furthermore, knowledge
of word-level polarity is important for speakers—particularly non-native speakers of a
language—to use words appropriately. Tools to track changes in polarity could therefore
also be useful to lexicographers in keeping dictionaries current. Chapter 5 presents an
unsupervised statistical method for identifying ameliorations and pejorations drawing on
recent corpus-based methods for inferring semantic orientation lexicons. We show that
our proposed method is able to successfully identify historical ameliorations and pejorations, as well as artificial examples of amelioration and pejoration. We also apply our
method to find words which have undergone amelioration and pejoration in recent text,
and show that this method may be used as a semi-automatic tool for finding new word
senses. This research was previously published by Cook and Stevenson (2010a) and is
the first published computational work focusing on amelioration and pejoration.
Finally, Chapter 6 gives a summary of the contributions of this thesis and identifies
potential directions for future work.
Chapter 2
Related work
As discussed in Chapter 1, neologisms pose problems for NLP applications, such as
question answering, due to the absence of lexical information for these items. Moreover,
since neologisms are expected to be rather infrequent due to the recency of their coinage,
methods for lexical acquisition that rely solely on statistical distributional information are
not well-suited for learning syntactic or semantic properties of neologisms, particularly
those which have very low frequency.
Linguistic observations regarding neologisms—namely aspects of their etymology such
as the word formation process through which they were created—can be exploited in
systems for inferring syntactic or semantic properties of infrequent new words. In Section 2.1 we examine computational work related to each of the word formation processes
that Algeo (1980) identifies. (See Table 1.1 for Algeo's word formation process
classification scheme.)
The context in which a neologism is used also provides information about its syntax
and semantics. This is the intuition behind corpus-based statistical methods for lexical
acquisition which we have already discussed as not being applicable for neologisms; however, a number of methods have been proposed for inferring the syntax or semantics of
an unknown word—potentially a neologism—using domain-specific lexical resources and
the context in which it occurs, based on just a single usage, or a small number of usages.
Section 2.2 examines some of this work.
Identifying and documenting new words is also a challenge for lexicography. From
the massive amounts of text produced each day, neologisms must be found; subsequently,
those neologisms that are expected to remain in the language need to be added to dictionaries of current usage. Section 2.3 discusses lexicographical approaches—both manual
and semi-automatic—to these tasks.
2.1 Computational work on specific word formations
In this section we examine a number of computational methods that have exploited
knowledge about the way in which new words are typically formed in order to learn
aspects of their syntax or semantics. We consider each type of word formation that
Algeo (1980) identifies, in decreasing order of frequency in the data he analyzes (see
Table 1.1).
2.1.1 Composites
In Algeo’s (1980) taxonomy of new words, the category of composites consists of words
created through derivational morphology by combining affixes with existing words, and
compounds formed by combining two words. In Section 2.1.1.1 we discuss a number of
approaches that have exploited knowledge of prefixes and suffixes for the task of part-of-speech (POS) tagging. In Section 2.1.1.2 we look at some computational work that has
addressed compounds.
2.1.1.1 POS tagging
POS tagging of unknown words, including neologisms, can benefit greatly from exploiting
word structure. A simple method for tagging English unknown words would be to tag
a word as a common noun if it begins with a lowercase letter, and as a proper noun
otherwise. However, there are a number of heuristics based on word endings which can
easily be incorporated to improve performance. For example, tagging of English words
can benefit from the knowledge that regular English verbs often end in -ed when used
in the past tense. Indeed, commonly used POS taggers have made use of this kind of
information.
Brill’s (1994) transformation-based tagger handles unknown words by learning weights
for lexicalized transformations specific to these words. These transformations incorporate information about suffixes and the presence of particular characters. Although the
transformations capture properties specific to English, such as to change the tag for a
word from common noun to past participle verb if the word ends in the suffix -ed, the
specific lexicalizations and corresponding weights for these transformations are learned
automatically. Ratnaparkhi (1996) assumes that unknown words in test data behave
similarly to infrequent words in training data. He also introduces some features specifically to cope with unknown words, which are based on prefixes and suffixes of a word as
well as the presence of uppercase characters.
Toutanova et al. (2003) improve on the results of Ratnaparkhi by using features
which are lexicalized on the specific words in the context of the target word (as opposed
to just their part-of-speech tags). Like Ratnaparkhi, Toutanova et al. also introduce a
small number of features specifically aimed at improving tagging of unknown words.
For example they use a crude named entity recognizer which identifies capitalized words
followed by a typical suffix for a company name (e.g., Co. and Inc.).
Mikheev (1997) examines the issue of determining the set of POSs which a given unknown word can occur as. Since most POS taggers require access to a lexicon containing
this information, this is an essential sub-task of POS tagging. Mikheev describes guessing
rules which are based on the parts of speech of morphologically related words, and the
aforementioned observation that certain suffixes of words, such as -ed, often correspond
to particular POSs (a past tense verb in the case of -ed). An example of a guessing rule
is that if an unknown word ends in the suffix -ied, and if the result of replacing this suffix
with y is a word whose POS class is {VB, VBP},1 then the POS class of the unknown word should be {JJ, VBD, VBN}.2

1 VB: verb, base form; VBP: verb, present tense, not 3rd person singular.
2 JJ: adjective or numeral, ordinal; VBD: verb, past tense; VBN: verb, past participle.
Guessing rules are not hand-coded, rather they are automatically learned from a
corpus. To do this, Mikheev (automatically) examines all pairs of words in a training
dataset of approximately 60K words. If some guessing rule can be used to derive one
word from the other, the frequency count for that rule is increased. After processing
all word pairs in the training data, infrequent rules are eliminated, and the remaining
rules are scored according to how well they predict the correct POS class of words in
the training data. The performance of the guessing rules on both the training data and
a test set formed from approximately 18K hapax legomena (words that only occur once
in a corpus) from the Brown corpus (Francis and Kucera, 1979) is used to determine a
threshold for selecting only the best guessing rules. To evaluate the performance of the
guessing rules, Mikheev uses his guesser in conjunction with Brill’s tagger, and achieves
an error rate of 11% for tagging unknown words in the Brown corpus. Mikheev compares
his system against the standard Brill tagger which gives an error rate of 15% on the same
task, indicating that morphological information about unknown words can be effectively
exploited in POS tagging.
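To make the form of such guessing rules concrete, the sketch below applies a few hand-written suffix-replacement rules of the kind Mikheev learns automatically; the particular rules and the lexicon here are illustrative assumptions only, not Mikheev's learned rule set.

def guess_pos_class(unknown_word, lexicon):
    # Each rule: (suffix to strip, string to add back, POS class required of
    # the resulting base word, POS class guessed for the unknown word).
    rules = [
        ("ied", "y", frozenset({"VB", "VBP"}), frozenset({"JJ", "VBD", "VBN"})),
        ("ing", "",  frozenset({"VB", "VBP"}), frozenset({"JJ", "NN", "VBG"})),
        ("s",   "",  frozenset({"NN"}),        frozenset({"NNS"})),
    ]
    for suffix, add_back, base_class, guessed_class in rules:
        if unknown_word.endswith(suffix):
            base = unknown_word[: -len(suffix)] + add_back
            if lexicon.get(base) == base_class:
                return guessed_class
    return None

# e.g., with lexicon = {"try": frozenset({"VB", "VBP"})},
# guess_pos_class("tried", lexicon) returns {"JJ", "VBD", "VBN"}.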
2.1.1.2 Compounds
Compounds include expressions such as think tank, low-rise, and database in which two
existing words are combined to form a new word. The combined items may be separated by a space or hyphen, or written as a single word. Moreover, a single item, such
as database, may be expressed in all three of these forms (Manning and Schütze, 1999).
Although these items pose challenges for the task of tokenization (Manning and Schütze,
1999), little work appears to have addressed single-word English compounds. In particular, recognizing that a single word is a compound, and knowing its etyma, could be useful
in tasks such as translation. However, the similar problem of word segmentation in languages that do not delimit words with whitespace, such as Chinese, has been considered
(e.g., Jurafsky and Martin, 2000, Section 5.9).
One aspect of compounds that has received a great deal of attention recently is automatically determining the semantic relation between the component words in a compound, particularly in the case of noun–noun compounds. Lauer (1995) automatically
classifies noun–noun compounds according to which of eight prepositions best paraphrases
them. For example, he argues that a baby chair is a chair for a baby while Sunday television is television on Sunday. Lauer draws on corpus statistics of the component head
and modifier noun in a given noun–noun compound co-occurring with his eight selected
prepositions to determine the most likely interpretation. Girju et al. (2005) propose supervised methods for determining the semantics of noun–noun compounds based on the
WordNet (Fellbaum, 1998) synsets of their head and modifier nouns. In this study they
evaluate their methods using Lauer’s eight prepositional paraphrases, as well as a set of
35 semantic relations they develop themselves which includes relations such as possession, temporal, and cause. Interestingly, their method achieves higher accuracy on
the 35 more fine-grained semantic relations, which they attribute to the fact that Lauer’s
prepositional paraphrases are rather abstract and therefore more ambiguous.
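The selection step in Lauer's approach can be sketched as follows, assuming access to corpus co-occurrence counts between a noun and a preposition; the count function and the scoring here are simplified placeholders rather than Lauer's actual probabilistic model.

PREPOSITIONS = ["of", "for", "in", "at", "on", "from", "with", "about"]

def best_prepositional_paraphrase(modifier, head, count):
    # Paraphrase 'modifier head' as 'head PREP modifier', e.g. 'baby chair'
    # as 'chair for baby', choosing the preposition with the strongest
    # (smoothed) corpus association with the head and modifier nouns.
    def score(prep):
        return (count(head, prep) + 1) * (count(modifier, prep) + 1)
    return max(PREPOSITIONS, key=score)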
2.1.2 Shifts
Shifts are a change in the meaning of a word, with a possible change in syntactic category.
In one of the few diachronic computational studies of shifts, Sagi et al. (2009) propose
a method for automatically identifying the semantic change processes of widening and
narrowing. They form a word co-occurrence vector for each usage of a target expression
in two corpora using latent semantic analysis (LSA, Deerwester et al., 1990); the two
corpora consist of texts from Middle English and Early Modern English, respectively.
For each corpus, they then compute the average pairwise cosine similarity of the co-occurrence vectors for all usages of the target word in that corpus. They then compare
the two similarity scores for the target word. Their hypothesis is that if the target word
has undergone widening, the usages in the newer corpus will be less similar to each other
because the target now occurs in a greater variety of contexts; similarly, in the case of
narrowing, the usages will be more similar. They test this hypothesis on three target
expressions and find it to hold in each case. A more thorough evaluation will be required
in the future to properly determine the performance of this method. Moreover, the co-occurrence vectors formed through LSA may not be the most appropriate representation
for a target word. A more linguistically informed representation that takes into account
the syntactic relationship between the target and co-occurring words may be more informative. Furthermore, by focusing on more specific types of semantic change, such as
amelioration and pejoration, and exploiting properties specific to these processes, it may
be possible to develop methods which more accurately identify these types of semantic
change.
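The comparison step of Sagi et al.'s method reduces to contrasting two average pairwise cosine similarities, one per corpus; a minimal numpy sketch is given below, assuming the per-usage context vectors (e.g., from LSA) have already been computed.

import numpy as np

def avg_pairwise_cosine(usage_vectors):
    # Average cosine similarity over all pairs of usage vectors (requires
    # at least two usages).
    X = np.asarray(usage_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    upper = np.triu_indices(len(X), k=1)
    return sims[upper].mean()

def similarity_shift(old_vectors, new_vectors):
    # A positive shift (newer usages more alike) is consistent with
    # narrowing; a negative shift is consistent with widening.
    return avg_pairwise_cosine(new_vectors) - avg_pairwise_cosine(old_vectors)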
Other computational work on shifts has considered identifying expressions or usages
that are metaphorical. Lakoff and Johnson (1980) present the idea of “metaphors we
live by” which views metaphor as pervasive throughout not just language but also our
conceptual system. However, if a metaphorical usage of a word is sufficiently frequent, it
will (or should) be included in a lexicon. Novel metaphors, on the other hand, would not
be recorded in lexicons. Krishnakumaran and Zhu (2007) present a method for extracting
novel metaphors from a corpus based on violations of selectional preferences that are
determined using WordNet (Fellbaum, 1998) and corpus statistics. Beigman Klebanov
et al. (2009) consider the identification of metaphorical usages from the perspective that
they will be off-topic with respect to the topics of the document in which they occur.
Using latent Dirichlet allocation (Blei et al., 2003) to determine topics, they show that this
hypothesis often holds. In Section 2.2 we return briefly to metaphor when we consider
computational approaches to neologisms that exploit rich semantic representations of
context.
2.1.3 Shortenings
In Algeo’s new-word classification scheme, shortenings consist of acronyms and initialisms, clippings, and backformations. Backformations—for example, the verb choreograph formed from the noun choreography—are rather infrequent in Algeo’s data, and
therefore will not be further discussed here. Computational work relating to acronyms
and initialisms, and clippings, is discussed in the following two subsections, respectively.
2.1.3.1 Acronyms and initialisms
Acronyms are typically formed by combining the first letter of two or more words, and are
pronounced as a word, for example, NAFTA (North American Free Trade Agreement) and
laser (light amplification by stimulated emission of radiation). Initialisms, on the other
hand, are similarly formed, but are pronounced letter-by-letter, as in CBC (Canadian
Broadcasting Corporation) and P.E.I. (Prince Edward Island). For the remainder of
this section we will refer to both acronyms and initialisms simply as acronyms. Some
acronyms also include letters that are not the first letter of one of their source words,
as in XML (Extensible Markup Language) and COBOL (Common Business-Oriented
Language).
Automatically inferring the longform of an acronym (i.e., Canadian Broadcasting Corporation for CBC) has received a fair bit of attention in computational linguistics, particularly in the bio-medical domain, where such expressions are very frequent. Schwartz
and Hearst (2003) take a two-step approach to this problem. First, they extract from a
corpus pairs consisting of an acronym and a candidate longform. They take the candidate
longform for a given acronym to be a contiguous sequence of words in the same sentence
as that acronym of length less than or equal to min(|A| + 5, |A| ∗ 2) words, where A is the
acronym. (This heuristic was observed to capture the relationship between the length
of most acronyms and their corresponding longforms). They then select the appropriate
longform, which is a subset of words from the candidate longform, for each pair. The
acronym–candidate longform pairs are identified using some simple heuristics based on
the typical ways in which acronyms are defined, for example, patterns such as a longform
followed by its acronym in parentheses, as in Greater Toronto Area (GTA). For each
pair, the correct longform is selected using a simple algorithm which matches characters
in the acronym and candidate longform. They evaluate their algorithm on a corpus which
contains 168 acronym–candidate longform pairs, and report precision and recall of 96%
and 82%, respectively.
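The candidate window and the character-matching step can be sketched roughly as follows; this is a simplified rendering of the idea behind Schwartz and Hearst's heuristics, not their published algorithm.

def candidate_window(acronym, sentence_words, acronym_index):
    # Words immediately preceding the acronym, bounded by
    # min(|A| + 5, |A| * 2), are taken as the candidate longform.
    max_len = min(len(acronym) + 5, len(acronym) * 2)
    start = max(0, acronym_index - max_len)
    return sentence_words[start:acronym_index]

def matches(acronym, candidate_words):
    # Match the acronym's characters right-to-left against the candidate
    # text; the leftmost matched character must begin a candidate word.
    text = " ".join(candidate_words).lower()
    pos = len(text)
    for ch in reversed(acronym.lower()):
        pos = text.rfind(ch, 0, pos)
        if pos < 0:
            return False
    return pos == 0 or text[pos - 1] == " "

# e.g., matches("GTA", ["Greater", "Toronto", "Area"]) is True.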
Okazaki and Ananiadou (2006) use heuristics similar to those of Schwartz and Hearst
to identify acronyms. However, in this study, frequency information is used to choose
the best longform for a given acronym. Okazaki and Ananiadou order the longforms
according to a score which is based on the frequency of a longform l discounted by
the frequency of longer candidate longforms of which l is a subsequence. They then
eliminate all longforms which do not score above a certain threshold, and use a number
of heuristics—such as that a longform must contain all the letters in an acronym—to
select the most likely longform. Okazaki and Ananiadou evaluate their method on 50
acronyms, and report a precision and recall of 82% and 14%, respectively. They compare
their method against Schwartz and Hearst’s algorithm which achieves precision and recall
of 56% and 93%, respectively, on the same data. Interestingly, augmenting Okazaki and
Ananiadou’s method to treat longforms proposed by Schwartz and Hearst’s system as
scoring above the threshold gives precision and recall of 78% and 84%, respectively.
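The discounted-frequency scoring can be illustrated with a short sketch; it assumes candidate longform frequencies have already been collected, and uses contiguous containment as a stand-in for the subsequence relation in the original method.

def score_longforms(candidate_freqs):
    # candidate_freqs maps each candidate longform (a tuple of words) to how
    # often it co-occurs with the acronym. A candidate's frequency is
    # discounted by the frequency of longer candidates that contain it.
    def contains(longer, shorter):
        n, m = len(longer), len(shorter)
        return n > m and any(longer[i:i + m] == shorter for i in range(n - m + 1))

    return {
        lf: freq - sum(f for other, f in candidate_freqs.items() if contains(other, lf))
        for lf, freq in candidate_freqs.items()
    }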
Nadeau and Turney (2005) propose a supervised approach to learning the longforms
of acronyms. Like Schwartz and Hearst, and Okazaki and Ananiadou, Nadeau and Turney rely on heuristics to identify acronyms and potential longforms. However, Nadeau
and Turney train a support vector machine to classify the candidates proposed by the
heuristics as correct acronym–longform pairs or incorrect pairs. Examples of the seventeen features used by their classifier are the number of letters in the acronym that
match the first letter of a longform word, and the number of words in the longform that
do not participate in the acronym. They train their classifier on 126 acronym–potential
longform pairs, and evaluate on 168 unseen pairs. They achieve precision and recall of
93% and 84%, respectively, while Schwartz and Hearst’s method gives 89% and 88% in
terms of the same metrics on this data.
All three of the above-mentioned methods have been evaluated within the biomedical
domain. Further evaluation is required to verify the appropriateness of such methods
in other domains, or in non–domain-specific settings. One issue that may arise in such
an evaluation is the ambiguity of acronyms in context. For example, ACL may refer
to either the Association of Christian Librarians or the Association for Computational
Linguistics. Sumita and Sugaya (2006) address the problem of determining the correct
longform of an acronym given an instance of its usage and a set of its possible longforms.
For each of an acronym’s longforms Sumita and Sugaya form word co-occurrence vectors
for the acronym corresponding to that longform based on the results of web queries for
the acronym co-occurring with that longform in the same document. They then use these
vectors to train a decision tree classifier for each acronym. Sumita and Sugaya evaluate
their method on instances of 20 acronyms that have at least 5 meanings, but restrict
their evaluation to either the 2 or 5 most frequent meanings. On the 5-way and 2-way
tasks they achieve accuracies of 86% and 92%, respectively. The baselines on these tasks
are 77% and 82%, respectively.
2.1.3.2 Clippings
Clippings are typically formed by removing either a prefix or suffix from an existing word.
(Note that here we use prefix and suffix in the sense of strings, not affixes in morphology.)
Example clippings are lab from laboratory and phone from telephone. Clippings corresponding to an infix of a word, for example, flu from influenza, are much less common.
In some cases the clipped form may contain additional graphemes or phonemes that are
not part of the original word, as in ammo, a shortened form of ammunition. Kreidler
(1979) identifies a number of orthographic and phonological properties of clippings, such
as that they tend to be mono-syllabic and end in a consonant. He further notes that
in cases where clippings do not fit these patterns, they tend to fall into a small number
of other regular forms. Such insights could be used in a computational method for automatically inferring the full form of a word that is known to be a clipping, a key step
towards inferring the meaning of a clipping.
Some preliminary work has been done in this direction by Means (1988), who attempts
to automatically recognize, and then correct or expand, misspellings and abbreviations,
which include clippings. The data for her study is text produced by automotive technicians to describe work done fixing vehicles, and is known to contain many words of these
types. Means first creates a set of candidate words which could be the corrected version,
or full form, of a given unknown word. This candidate set includes words which are
orthographically similar to the unknown word in terms of edit-distance, words which the
unknown word is a prefix or suffix of, and words the unknown word is orthographically a
subsequence of. Means then orders the words in the candidate set according to a variety
of heuristics, some of which make use of observations similar to those of Kreidler, such
as that an abbreviation is unlikely to end in a vowel. Means claims that the performance of her system is “fairly good”, but does not report any quantitative results. A
quantitative evaluation of such methods is clearly required to verify the extent to which
they are able to automatically infer the full form of clippings. Furthermore, Means’s
approach makes only limited use of linguistic observations about clippings; her methods
could potentially be extended by incorporating observations about the role of phonology
and syllable structure in clipping formation.
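A rough sketch of the candidate-generation step Means describes is given below; the lexicon, the edit-distance threshold, and the function names are all illustrative, and her ordering heuristics are omitted.

    def edit_distance(a, b):
        """Standard Levenshtein edit distance, computed row by row."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def is_subsequence(short, long_word):
        chars = iter(long_word)
        return all(c in chars for c in short)

    def candidate_full_forms(unknown, lexicon, max_dist=2):
        """Candidate corrections/expansions for an unknown word: lexicon words
        within a small edit distance of it, words it is a prefix or suffix of,
        and words it is a subsequence of."""
        return {w for w in lexicon
                if edit_distance(unknown, w) <= max_dist
                or w.startswith(unknown) or w.endswith(unknown)
                or is_subsequence(unknown, w)}

    print(candidate_full_forms('transmsn', {'transmission', 'transit', 'mission'}))
    # -> {'transmission'}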
A similar class of words not included in Algeo’s (1980) classification scheme are the
non-standard forms found in computer-mediated communication such as cell phone text
messaging and Internet instant messaging. These items are typically shortened forms of
a standard word. Examples include clippings, as well as other abbreviated forms, such
as betta for better and dng for doing. Letters, digits, and other characters may also be
used, as in ne1 for anyone and d@ for that. A small amount of computational work has
addressed normalization of these forms—that is, converting non-standard forms to their
standard form (e.g., Aw et al., 2006; Choudhury et al., 2007; Kobus et al., 2008). Text
messaging forms will be further discussed in Chapter 4.
2.1.4 Loanwords
Automatically identifying the language in which a text is written is a well-studied problem. However, most approaches to this problem have focused on categorizing documents,
and have not considered classification at finer levels of granularity, such as at the word
level (Hughes et al., 2006).
A small number of approaches to identifying loanwords, particularly English words in
Korean and German, have been proposed. Kang and Choi (2002) build a hidden Markov
model over syllables in Korean eojeols—Korean orthographic units that consist of one
or more lexemes—to identify foreign syllables. They then apply a series of heuristics to
the extracted eojeols to identify function words, and segment the remaining portions into
nouns. Any noun for which more than half of its syllables have been identified as foreign
is then classified as a foreign word. They evaluate their method on a corpus containing
approximately 102K nouns of which 15.5K are foreign. Their method achieves a precision
and recall of 84% and 92%, respectively. While the performance of this method is quite
good, it relies on the availability of a large collection of known loanwords which may
not be readily available. Baker and Brew (2008) develop a method for distinguishing
English loanwords in Korean from Korean words (not of foreign origin) which does not
rely on the availability of such data. They develop a number of phonological re-write
rules that describe how English words are expressed in Korean. They then apply these
rules to English words, allowing them to generate a potentially unlimited number of
noisy examples of English loanwords. They then represent each word as a vector of the
trigram character sequences which occur in it, and train a logistic regression classifier on
such representations of both frequent words in Korean text (likely Korean words) and
automatically-generated English loanwords. They evaluate their classifier on a collection
of 10K English loanwords and 10K Korean words, and report an accuracy of 92.4% on
this task which has a chance baseline of 50%. They also conduct a similar experiment
using known Korean words and known English loanwords as training data to see how
well their method performs if it is trained on known items from each class (as opposed to
noisy, automatically-generated examples). Baker and Brew report an accuracy of 96.2%
for this method; however, it does require knowledge of the etymology of Korean words.
It is worth noting that this approach to loanword identification in Korean, as well as that
of Kang and Choi, may not be applicable to identifying loanwords in other languages. In
particular, Korean orthography makes syllable structure explicit, and the correspondence
between orthography and phonemes in Korean is, for the most part, transparent. If such
syllabic and phonemic assumptions could not be made—or approximations to syllable
structure and phonemic transcription were used—the performance of these methods may
suffer.
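The classifier itself is straightforward to reproduce in outline. The sketch below uses scikit-learn and assumes that native_words and pseudo_loanwords are lists of word strings (frequent native words, and English words passed through the phonological rewrite rules, respectively); it illustrates the character-trigram representation and logistic regression, not Baker and Brew's exact implementation.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_loanword_classifier(native_words, pseudo_loanwords):
        """Character-trigram logistic regression in the spirit of Baker and
        Brew (2008); a sketch under the assumptions stated above."""
        X = native_words + pseudo_loanwords
        y = [0] * len(native_words) + [1] * len(pseudo_loanwords)
        model = make_pipeline(
            CountVectorizer(analyzer='char', ngram_range=(3, 3)),
            LogisticRegression(max_iter=1000),
        )
        model.fit(X, y)
        return model

    # model = train_loanword_classifier(native_words, pseudo_loanwords)
    # model.predict(['some-new-word'])  # 1 = predicted loanword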
Alex (2008) builds a classifier to identify English inclusions in German text. Words
that occur in an English or German lexicon are classified as English or German accordingly. Terms found in both the English and German lexicon are classified according to
a number of rule-based heuristics, such as to classify currencies as non-English (Alex,
2006). More interestingly, unknown words are classified by comparing their estimated
frequency in English and German webpages through web search queries. Alex applies
her method to a corpus of German text from the Internet and telecom domain, in which
roughly 6% of tokens are English. Her method achieves a precision and recall of
approximately 92% and 76%, respectively, on the task of English word detection. Alex
also replaces the estimates of term frequency from web queries with frequency information from corpora of varying sizes, and notes that this results in a substantial decrease
in performance.
2.1.5 Blends
Lexical blends, words formed by combining parts of existing words, such as gayby (gay
and baby) and eatertainment (eat and entertainment), have received little computational
treatment. Cook and Stevenson (2007) and Cook and Stevenson (2010b) describe a
method for automatically inferring the source words of lexical blends, for example gay
and baby for gayby (a child whose parents are a gay couple). They also present preliminary
results for the task of determining whether a given neologism is a blend. This is the only
computational work to date on lexical blends, and it is described in detail in Chapter 3.
2.2 Computational work exploiting context
One source of information from which to infer knowledge about any unknown word—
including any neologism—is the context in which it is used. As discussed in Section 1.1.1,
methods that rely on statistical distributional evidence of an unknown word are not
appropriate for acquiring information about neologisms, since these methods require
that their target expressions be somewhat frequent, whereas neologisms are expected
to be infrequent. In this section, we discuss a number of somewhat older studies that
infer information about an unknown word from a rich representation of its lexical and
syntactic context, and domain-specific knowledge resources.
Granger (1977) develops the FOUL-UP system which infers the meaning of unknown
words based on their expected meaning in the context of a script—an ordered description
of an event and its participants. Granger gives the following example: Friday a car
swerved off Route 69. The car struck an UNKNOWN. From the first sentence FOUL-UP is able to determine that this text describes a vehicular accident, and therefore a
corresponding script is invoked. Then, using the knowledge from this script and the
information available in the second sentence, Granger’s system constructs an incomplete
representation of the sentence in which the actor, the car, is propelled into the referent
of the unknown word. However, the knowledge in the script states that in a vehicular
accident a car is propelled into a physical object. Therefore, FOUL-UP infers that the
unknown word is a physical object and that it plays the role of an obstruction in the
vehicular accident script.
The knowledge that FOUL-UP infers is very specific to the context in which the
unknown word is used. If the unknown word in the above example were elm, FOUL-UP would not be able to determine that an elm is a tree or living organism, only that
it is an obstruction in a vehicular accident. Granger conducts no empirical evaluation,
but argues that FOUL-UP is best suited to nouns, and that verbs are somewhat more
challenging. He claims this is because much of the information in the representation of
sentences, including expectations about the participants in some event, is provided by a
verb. Moreover, Granger remarks that FOUL-UP cannot infer the meaning of unknown
adjectives, since they are not included in the representation of sentences used.
Hastings and Lytinen (1994) consider learning the meaning of unknown verbs and
nouns using knowledge of known words which occur in certain syntactic slots relative to
the target unknown word. This approach is similar to that taken by Granger (1977),
except that it is limited to the terrorism domain and exploits knowledge specific to this
domain, whereas Granger’s system relies on scripts to provide detailed domain-specific
lexical knowledge.
In Hastings and Lytinen’s system, Camille, objects and actions are each represented
in a separate domain-specific ontology. Inferring the meaning of an unknown word (noun
or verb) then boils down to choosing the most appropriate node in the corresponding
ontology. Given the difficulties of inferring the meaning of an unknown verb previously
noted by Granger (1977), a key insight of Hastings and Lytinen is that since it is verbs
that impose restrictions on the nouns which may occur in their slots, the meaning of
unknown verbs and nouns should be inferred differently. Therefore, for an unknown
noun, Camille chooses the most general concept of the specific concepts indicated by the
constraints placed on the slots in which the unknown noun occurs (e.g., if an unknown
noun occurs in slots corresponding to ‘car’ and ‘vehicle’, Camille would choose ‘vehicle’
as its interpretation). On the other hand, for an unknown verb, Camille selects the most
specific interpretation possible given the observed slot-filling nouns.
Hastings and Lytinen evaluate their system on 9 ambiguous nouns and achieve precision and recall of 67% and 44%, respectively, in terms of the concepts selected. Camille
did not perform as well on verbs, achieving precision and recall of 19% and 41%, respectively, on a test set of 17 verbs, which is in line with Granger’s (1977) observations that
it is more difficult to infer the meaning of an unknown verb than noun.
Cardie (1993) develops a system that learns aspects of the syntax and semantics of
words that goes beyond the work of Granger (1977) and Hastings and Lytinen (1994)
in that it is able to infer knowledge of any open class word, not just nouns and verbs.
Cardie’s study is limited to a corpus of news articles describing corporate joint ventures.
In Cardie’s system, each word is represented as a vector of 39 features describing aspects
of the word itself, such as its POS and sense in an ontology, and the context in which
it occurs. Given a usage of an unknown word, Cardie constructs an incomplete feature
vector representing it, which is missing four features: its POS, its general and specific
senses (in a small domain-specific ontology), and its concept in the taxonomy of joint
venture types. Inferring the correct values for these features corresponds to learning that
word. Assuming the availability of full feature vectors for a number of words (known
words that are in a system’s lexicon) Cardie infers the missing features for the unknown
words using a nearest-neighbour approach. However, prior to doing so, she uses a decision
tree algorithm to select the most informative features to use when determining the nearest
neighbours.
To evaluate her method, Cardie performs an experiment which simulates inferring
the unknown features for a number of unknown words. Cardie builds feature vectors for
120 sentences from her corpus, and then conducts a 10-fold cross-validation experiment
in which the appropriate features are deleted from the vectors of the representation of
the test sentences in each sub-experiment, and the feature inference method is then run
on these vectors. In this experiment Cardie achieves results that are significantly better
than both a uniform random baseline and a most frequent sense baseline.
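In outline, the inference step is a nearest-neighbour lookup over feature vectors with some values missing. The sketch below uses toy, hypothetical features and values, and assumes that the feature-selection step has already produced the list passed as selected; it illustrates the idea rather than Cardie's system.

    def infer_missing_features(unknown, known_cases, selected, missing):
        """Fill an unknown word's missing features from its nearest neighbour
        among fully specified cases, in the spirit of Cardie's approach."""
        def similarity(case):
            return sum(1 for f in selected if case.get(f) == unknown.get(f))
        nearest = max(known_cases, key=similarity)
        return {f: nearest[f] for f in missing}

    # Hypothetical usage: infer POS and sense from shared context features.
    known = [
        {'prev_pos': 'DET', 'next_pos': 'VERB', 'pos': 'NOUN', 'sense': 'entity'},
        {'prev_pos': 'ADV', 'next_pos': 'ADJ',  'pos': 'ADV',  'sense': 'degree'},
    ]
    unknown = {'prev_pos': 'DET', 'next_pos': 'VERB'}
    print(infer_missing_features(unknown, known,
                                 selected=['prev_pos', 'next_pos'],
                                 missing=['pos', 'sense']))
    # -> {'pos': 'NOUN', 'sense': 'entity'}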
The studies examined in this section have a number of commonalities. First, they
present methods for learning information about an unknown word given just one usage,
or a small number of usages, of that word, making them well-suited to learning about
neologisms. However, all of these methods are limited to a particular domain, or as in
the case of Granger (1977), rely on selecting the correct script to access domain specific
knowledge. Each method also requires lexical resources, such as ontologies; although
such resources are available for the limited domains considered, the reliance on them
also prevents these methods from being easily applied to other domains, or used in non-domain-specific settings. This limits their widespread applicability to learning properties
of neologisms.
Wilks and Catizone (2002) describe the problem of lexical tuning, updating a lexicon
to reflect word senses encountered in a corpus that are not listed in that lexicon. Early
work related to lexical tuning, such as Wilks (1978), manipulates rich semantic representations of words and sentences to understand usages of extended senses of words. Like
the other methods discussed in this section, Wilks’s approach is limited as it relies on
lexical knowledge sources which are not generally available. Fass (1991) similarly relies
on rich semantic representations, particularly with respect to selectional restrictions, to
identify and distinguish metaphorical, metonymous, and literal usages.
2.3 Lexicographical work on identifying new words
In this section we consider work in lexicography on identifying new words for inclusion
in a dictionary. Before looking specifically at new words, in Section 2.3.1 we consider
the role of corpora in lexicography. Then in Section 2.3.2 we examine some properties of
new words that indicate whether they are likely to remain in usage, and therefore should
be included in a dictionary. Finally, in Section 2.3.3 we examine some approaches to
identifying new words.
2.3.1 Corpora in lexicography
Kilgarriff et al. (2004) note that the use of corpora in lexicography has gone through three
stages. In the following subsections we briefly discuss these stages, paying particular
attention to the problems posed by neologisms.
2.3.1.1 Drudgery
Before roughly 1980, the process of lexicography was largely manual. In order to collect
the immense number of citations necessary to form the basis for writing the entries of a
dictionary, lexicographers had to read a very large amount of text. While reading, when
a lexicographer encountered a usage of a word that struck them as being particularly
important—perhaps being very illustrative of that word’s meaning—they would create a
citation for it—roughly a slip of paper indicating the headword, the sentence or context
in which it was used, and the source of the usage; after collecting sufficient citations,
the dictionary could be written. With respect to new words, this process left much to
be desired. In particular, it made it very difficult to search for citations for new words.
In order to find usages of a previously undocumented word suspected of being new, one
would have to wait until it was encountered during reading (Barnhart, 1985).
2.3.1.2 Corpora
The Collins COBUILD English Language Dictionary broke new ground in lexicography
by being the first dictionary to be based entirely on corpus evidence (Sinclair, 1987).
A corpus of approximately 40 million words was compiled and used in place of the
traditional collection of citations. (Storing and accessing what was, at the time, a very
large corpus, presented a substantial technical challenge.) This allowed the automation
of many of the manual tasks of lexicography, including collecting, storing, and searching
for appropriate citations. The ability to quickly search a corpus for usages of a given
word made the task of documenting new words much easier; now a lexicographer could
look for instances of a word that they thought might be worthy of documenting, without
having to wait for it to be encountered in reading (Barnhart, 1985). Moreover, the use of
corpora in this way fundamentally changed lexicography, as the evidence was no longer
biased by the examples that were chosen by a reader to be recorded as citations; readers
tend to focus on the exceptional, rather than the ordinary, in some cases resulting in a
paucity of evidence for very common words (Atkins and Rundell, 2008).
Newly-coined words are expected to be rather infrequent. Therefore, for a corpus to
be used for finding new words, it must be very large. Fortunately, the size of corpora
has grown immensely since the COBUILD project; nowadays, corpora of one billion
words are not uncommon (e.g., Graff et al., 2005). Furthermore, a corpus for new-word
lexicography must contain recently-produced text. In this vein the World Wide Web is
very attractive, and indeed there has been considerable interest in using the Web as a
corpus (Kilgarriff and Grefenstette, 2003), and as a source of citations for new words
(Hargraves, 2007).
2.3.1.3 Statistics
The rise in popularity of statistical methods in computational linguistics influenced computational work on lexicography, which looked toward statistics for analyzing corpus
data. Church and Hanks (1990) propose the association ratio as a measure for determining the association between two words in a corpus. The association ratio is based
on mutual information I, a statistic that measures how often two words, w1 and w2,
co-occur, taking into account their expected (chance) co-occurrence:

    I(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}

Mutual information is symmetric, i.e., I(w1, w2) = I(w2, w1). The association ratio is
asymmetric, and takes into account the order of w1 and w2 in estimating p(w1, w2) to
reflect the importance of word order in language.
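A minimal sketch of computing this quantity from raw corpus counts follows; the counts, the base of the logarithm, and the windowing used to obtain freq_pair are illustrative assumptions.

    import math

    def association(freq_pair, freq1, freq2, total_words):
        """Log of observed over expected co-occurrence, from raw counts:
        freq_pair is the count of w1 and w2 co-occurring (in order, within
        the chosen window); freq1 and freq2 are unigram counts."""
        p_pair = freq_pair / total_words
        expected = (freq1 / total_words) * (freq2 / total_words)
        return math.log2(p_pair / expected)

    print(round(association(30, 1000, 500, 1_000_000), 2))  # 5.91

A positive score indicates that the pair co-occurs more often than chance; scores near zero or below indicate no association.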
Church and Hanks note applications of the association ratio to lexicography. First,
this measure consistently identifies phrasal verbs and relationships between verbs and
prepositions. This is useful to lexicographers for adding information about collocations
to dictionaries—in particular which prepositions are typically used with a given verb—
as is often done in learner’s dictionaries.3 Church and Hanks also discuss the use of
the association ratio to identify word senses. In particular, the words that are strongly
associated with a target word indicate the possible word senses of the target, while the
usage of a strongly-associated word with the target indicates the sense of that usage of
the target.4

3 Although more sophisticated statistical methods for extracting phrasal verbs from corpora (e.g.,
Baldwin and Villavicencio, 2002) and rating the compositionality of phrasal verbs (e.g., McCarthy et al.,
2003) have been proposed, they have received less attention in lexicography. However, methods for
assessing the compositionality of a multiword expression (MWE) seem particularly useful for this field,
as non-compositional MWEs should be listed in dictionaries as their meaning cannot be inferred from
the meaning of their component words.
One drawback of the association ratio noted by Church and Hanks is that it gives
“unstable” scores for low-frequency items. This is problematic for the study of neologisms as they are expected to be relatively infrequent due to the recency of their coinage.
However, as we will discuss in Section 2.3.2, frequency is an important factor in determining which words to include in a dictionary; therefore, it may be the case that the
neologisms we are interested in listing in a dictionary are in fact frequent enough to be
used with measures such as mutual information. Nevertheless, a computational system
is still expected to encounter infrequent neologisms, and methods for lexical acquisition
that do not rely solely on distributional information are more suitable in this case.
A further problem with mutual information, the association ratio, and indeed many
other co-occurrence statistics, is that they require a window size to determine co-occurrence.
In Church and Hanks’s study two words are said to co-occur if they are both present
within a five-word window. This somewhat arbitrary decision must be made when using
the association ratio and other similar co-occurrence statistics. Recent work on windowless
association measures, which are based on the distance between occurrences of two words
(e.g., Washtell, 2009), could lead to better association measures since they do not require
this arbitrary choice of window size.
Church and Hanks discuss preprocessing the corpus with a part-of-speech tagger or
parser to extract association measures according to part-of-speech, or to incorporate
syntactic relations into the notion of window. Kilgarriff and Tugwell (2002) build on this
idea in their word sketches. A word sketch is a representation of the collocates of a word
that is designed to be of practical use to lexicographers. Collocations are found for specific
syntactic relations including modifiers, prepositions and their complements, and, in the
case of verbs, subject and direct object. Mutual information is used to identify strength
of collocation; however, like Church and Hanks (1990), Kilgarriff and Tugwell note that
such scores tend to give too much weight to low-frequency items. Therefore, mutual
information is weighted by log frequency, giving a score that Kilgarriff and Tugwell refer
to as salience. Word sketches are presented to lexicographers in a format that allows them
to easily view corpus usages of a target word occurring with selected collocates. This
system was used by a team of lexicographers writing a dictionary, and in a subjective
evaluation was found to be very useful. We will return to mutual information and word
sketches in Section 2.3.3 when we consider automatic approaches to finding new words.

4 The notion that a word may have several distinct senses has been challenged. Kilgarriff (1997)
proposes an account of word meaning based on clusters—or groupings—of similar corpus usages of a
word. Kilgarriff’s view is not incompatible with our definition of a new word (see Chapter 1, page 1),
if we consider a new “sense” of an existing word form as being one or several recently-produced usages
of that word that are similar to each other and different from the usages of that word form that have
previously been observed.
2.3.2 Successful new words
In this section we discuss some properties of new words that lexicographers have used in
determining whether they should be included in a dictionary. Dictionaries typically strive
to only include words that are expected to remain in usage. However, as Algeo (1993)
points out, the majority of new words in fact fail to become established in language, and
even those words that do make it into dictionaries often fall out of usage. Of course, the
focus of a dictionary—e.g., on a particular variety of English or regional dialect—is also
an important consideration in determining which words to include (Atkins and Rundell,
2008).
Frequency has been widely noted as an important factor for determining whether a
word should be listed in a dictionary (Sheidlower, 1995; Barnhart, 2007; Hargraves, 2007;
Metcalf, 2007). If a word is sufficiently frequent a reader can reasonably be expected to
encounter it, wonder as to its meaning and usage, and then look for it in a dictionary.
Simple word frequency alone, however, may be misleading as a measure of a word’s
importance; the frequencies of variant spellings, and forms derived from the word under
consideration, may also need to be taken into account.
Frequency may also be misleading as it ignores how widely a word is used (Sheidlower,
1995). For example, a word may be rather frequent, but its use may be restricted to a
particular language community, which may affect a lexicographer’s decision whether to include
that word. Therefore, in determining a word’s importance for inclusion, lexicographers
also consider factors such as the number of sources (e.g., titles of newspapers or magazines) and genres (e.g., news, science fiction) in which that word occurs, as well as the
diversity of its uses (e.g., formal versus informal language, media type — Sheidlower,
1995; Barnhart, 2007; Hargraves, 2007; Metcalf, 2007).
The time span over which a word has been used—the date between the first citation
and most recently-recorded usage—is another indication of its importance (Sheidlower,
1995; Barnhart, 2007); a word that has been used over a relatively long period of time
may be expected to remain in use. Two additional related factors that may affect whether
a word is included in a dictionary are whether the concept to which that word refers is
likely to remain relevant (Metcalf, 2007), and the cruciality of the word—whether there
is a need for it in the language (Sheidlower, 1995).
A final property that may affect whether a word is likely to experience widespread
adoption is its unobtrusiveness (Metcalf, 2007). Many new words are clever, witty
coinages. Such words are said to be obtrusive, and tend to be noticed by speakers,
but not widely used.
The multitude of factors that affect the importance of a new word for inclusion in a
dictionary led Barnhart (2007) to propose the following formula which combines these
various pieces of information:
V ×F ×R×G×T
where for a given word w, V is the number of forms of w, F is the frequency of w, R is the
number of sources in which w occurs, G is the number of genres in which w occurs, and
T is the time span over which w has been observed. Metcalf (2002) similarly proposes
F   Frequency
U   Unobtrusiveness
D   Diversity of users and situations
G   Generation of other forms and meanings
E   Endurance of the concept to which the word refers

Table 2.1: Metcalf’s (2002) FUDGE factors for determining whether a word will remain
in usage.
his FUDGE factors, shown in Table 2.1, for determining whether a word will remain in
usage. (Metcalf (2007) also offers a summary of these factors.) Metcalf proposes scoring
a word from 0 to 2 on each of these factors, and then summing the scores for a word to
determine whether it is likely to remain in usage.
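Metcalf's tallying scheme is simple enough to state directly; the factor scores below are hypothetical values for an illustrative word, and the dictionary keys are our own labels for the five factors.

    # Hypothetical 0-2 scores for an illustrative word.
    fudge = {
        'frequency': 2,
        'unobtrusiveness': 1,
        'diversity_of_users_and_situations': 2,
        'generation_of_other_forms': 1,
        'endurance_of_concept': 2,
    }

    def fudge_total(scores):
        """Sum the 0-2 FUDGE factor scores; a higher total suggests the word
        is more likely to remain in usage."""
        return sum(scores.values())

    print(fudge_total(fudge))  # 8 out of a possible 10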
There appear to be a number of ways that metrics such as Barnhart’s and Metcalf’s
could be improved. First, some of these properties may play a larger role in determining
a word’s importance for inclusion in a dictionary than others. A supervised machine
learning algorithm could be used to learn an optimal weighting for the various properties.
Furthermore, it may also be the case that applying a non-linear transformation to the
values for the properties—such as the natural logarithm—could make the values more
informative; taking the natural logarithm has the effect of emphasizing the differences
between smaller values, which may be particularly important in the case of frequency,
since neologisms are expected to have relatively low frequency.
The above discussion of factors that play a role in determining whether a word is
included in a dictionary reflects lexicographers’ knowledge of their task. Boulanger (1997)
takes a more formal approach to determining the factors relating to the success of a new
word. She collects a number of words from an English new-words dictionary published
in 1990, and then checks for the presence of these words in five more recently-published
general-use English dictionaries. The items from the new-words dictionary occurring in
the general-use dictionaries are deemed successful new words that have been adopted into
general use; those that do not occur in the general-use dictionaries are assumed to be no
longer commonly used. Boulanger then compares a variety of properties of these words
across the two groups—successful and unsuccessful—to determine the factors affecting
the success of a new word.
A number of Boulanger’s findings are unsurprising given the above remarks by lexicographers. Boulanger finds that frequent words are more likely to be successful, as are
words that are used in a non-specialized register. Furthermore, words for referents that
remained popular until the time of Boulanger’s study were also found to be more likely
to succeed than words whose referents were no longer popular at that time.
Interestingly, Boulanger finds that words associated with particular notional fields
(e.g., disease, economics) are more likely to succeed than others. This may seem somewhat contradictory to the observation by lexicographers that occurring across a variety of
genres and domains is an indication that a word is a good candidate for inclusion in a dictionary. However, it has also been observed that the new words from a particular period
of time tend to reflect what was culturally prominent then (e.g., Ayto, 2006). Therefore,
association with a culturally prominent notional field appears to have a positive effect
on a word’s success.
Two of Boulanger’s findings have not been mentioned by any of the studies examined
so far in this thesis. She finds that new words which are in competition with an already
established word (i.e., the new word and established word are roughly synonymous) are
more likely to succeed than new words which are not in competition with an established
form. Boulanger hypothesizes that in the case of competition, only the new word itself
(i.e., the word form) must be accepted by speakers. In the no-competition case, both
the new word and new referent must be accepted. Boulanger also finds that taboo
association is related to the success of a word. She suggests that since taboo association
encourages lexical loss, this then encourages the formation of a new word to take its place.
This is also discussed by Allan and Burridge (1991) in their treatment of euphemism
and dysphemism. A euphemistic term for a taboo subject may become contaminated
by its association with that subject, eventually losing its euphemistic status, thereby
encouraging the emergence of a new euphemistic term for that taboo subject.
One drawback to Boulanger’s study is that it does not directly consider whether a
new word has become successful. Instead, it examines whether a lexicographer considers
a new word to be worthy of inclusion in a dictionary. To the extent that lexicographers do
their jobs perfectly, and there is enough space in a dictionary to document all successful
neologisms, success and inclusion in a dictionary are the same. However, if lexicographers are making systematic errors, then the conclusions reached in this study relate to
properties of words that determine whether they will be listed in a dictionary, and not
whether they will become established in language. Nevertheless, it is not clear how such
a study could be conducted without making such an assumption.
A computational system for automatically identifying new words could exploit some of
the properties discussed in this section to rate whether a new word is worthy of inclusion in
a dictionary. Frequency information can be easily extracted from corpora, including range
(i.e., the number of documents in which a word occurs). Corpora which provide additional
information about the documents they include enable the extraction of properties such as,
for example, the number of genres in which a word occurs. Using automatic approaches
to stemming and lemmatization, it may also be possible to estimate the number of forms
of a new word. Corpora, such as the English Gigaword Corpus, consisting of newswire
stories over several years, could be used to estimate the time span over which a word
has been used. Unfortunately, it is currently unclear how properties of a word such as
the relevance of its referent, its cruciality in a language, and its obtrusiveness could be
automatically estimated.
2.3.3 Finding new words
As discussed in Section 2.3.1.1, one way to find new words is by reading and looking
for them. As unsophisticated as this method may seem, the Oxford English Dictionary
still has a reading programme (http://www.oed.com/readers/research.html); as part of this process, volunteers read text and write
citations for interesting usages of new words that they find.
We also considered the use of electronic corpora for searching for citations for new
words in Section 2.3.1.2. If a lexicographer has a small amount of evidence for a word, or
a hunch that a word might be worth documenting, large corpora—in particular the World
Wide Web—are a potential source of more usages of that word. However, lexicographers
have noted a number of challenges to this, largely related to text normalization issues,
such as variant forms and spellings (e.g., Barnhart, 1985; Brookes, 2007). Nevertheless,
such problems are fairly straightforward to resolve through approaches to text normalization (e.g., Sproat et al., 2001) which can be used, for example, to convert all instances
of a word in its various forms to a single canonical form. Barnhart (1985) also mentions
problems related to not knowing a word’s part-of-speech. For example, searching for
instances of the verb Google would be difficult, as this word is likely predominantly used
as a proper noun. However, this problem can be partially overcome by searching for
inflected forms of words (e.g., Googled ). Moreover, approaches to part-of-speech tagging
(e.g., Brill, 1994; Mikheev, 1997) can automatically determine the syntactic category of
words.
Syntactic information may also be useful when searching for new words, and can be
automatically inferred—although noisily—using chunkers and parsers (e.g., Abney, 1991;
Collins, 2003). However, normalization problems are more difficult to resolve when using
the Web as a corpus due to the technical challenges of storing and processing very large
amounts of text. The Linguist’s Search Engine (Resnik et al., 2005) addressed many
of these issues and enabled linguistically-informed web queries including, for example,
searches for particular syntactic structures.
Lexicographers have noted that speakers and writers often use new words in a way
that indicates that they are new. Consider the following citation taken from The Double-Tongued Dictionary (http://www.doubletongued.org):
As material from the disk falls onto the surface of the pulsar, it imparts
enough angular momentum to spin back up into what scientists call a “recycled pulsar.”
In this example, the newness of the word recycled pulsar is indicated by scare quotes
and the sequence of words what scientists call. McKean (2007) discusses conducting web
searches using particular phrases that are expected to be used with new words—such as
what scientists call —to identify new words. This is somewhat similar to computational
work that has made use of patterns to automatically extract hypernyms from corpora
(for example, using patterns such as NP1, NP2 and other NP3 to infer that NP3 is likely
a hypernym of NP1; Hearst, 1992). More recent computational work has considered
automatically learning the patterns that often express hypernymy (Snow et al., 2005).
So far, a similar study—learning the patterns that are likely to express new words—does
not appear to have been undertaken. A statistical analysis of the lexico-syntactic context
in which new words are used may reveal patterns in which new words are introduced;
these patterns could then be searched for on the Web, or in large corpora of recently-produced text, to find new words.
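As a rough illustration, a handful of cue phrases can be searched for with regular expressions. The patterns below are hand-written examples for illustration only, whereas the proposal here is to learn such patterns from data.

    import re

    # Illustrative cue phrases; a real system would learn such patterns
    # rather than hand-code them.
    CUE_PATTERNS = [
        r'what (?:scientists|experts|some) call (?:a |an )?[“"]?([\w-]+(?: [\w-]+)?)[”"]?',
        r'(?:so-called|dubbed|known as) [“"]([\w-]+(?: [\w-]+)?)[”"]',
    ]

    def find_new_word_candidates(text):
        """Return terms introduced by cue phrases that often signal new words."""
        candidates = []
        for pattern in CUE_PATTERNS:
            candidates.extend(m.group(1) for m in re.finditer(pattern, text))
        return candidates

    sentence = ('...it imparts enough angular momentum to spin back up into '
                'what scientists call a “recycled pulsar.”')
    print(find_new_word_candidates(sentence))  # ['recycled pulsar']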
This approach to finding new words is attractive in that it would be able to identify
both new word forms and new senses of existing words, as writers often use both in ways
that indicate that they are new. However, many of the words in The Double-Tongued
Dictionary appear to be technical terms from a specialized domain, such as recycled
pulsar above, that may in fact have been in use, although only by a small subset of
speakers, for quite some time. Nevertheless, considering our definition of neologism (see
Chapter 1, page 1), a word which has been used extensively but only in a particular
domain, and then becomes established in general usage, would indeed be a new word.
Moreover, after (automatically) finding a word that matches a pattern for new word
usage, we must also consider whether its other distributional properties as discussed in
Section 2.3.2—many of which can be automatically extracted—warrant its inclusion in
a dictionary.
O’Donovan and O’Neil (2008) describe the efforts of the lexicographers working on The
Chambers Dictionary to automatically identify neologisms. They maintain a reference
corpus of recent English which represents normal English usage. They then periodically
gather recent documents (from various sources on the Web and electronic editions of publications) which are then compared against the reference corpus. They identify all words
that do not already occur in their dictionary, and that have a frequency substantially
higher in the recent texts than in the reference corpus. Lexicographers then examine
these terms and consider them for inclusion as new dictionary entries.
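In outline, the comparison can be sketched as follows; the thresholds, the smoothing, and the token lists are illustrative assumptions, not the settings used by O’Donovan and O’Neil.

    from collections import Counter

    def candidate_neologisms(recent_tokens, reference_tokens, dictionary,
                             min_count=5, ratio=10.0):
        """Words absent from the dictionary whose relative frequency in recent
        text is substantially higher than in the reference corpus."""
        recent = Counter(recent_tokens)
        reference = Counter(reference_tokens)
        n_recent, n_ref = len(recent_tokens), len(reference_tokens)
        out = []
        for word, count in recent.items():
            if word in dictionary or count < min_count:
                continue
            rel_recent = count / n_recent
            rel_ref = (reference[word] + 1) / (n_ref + 1)  # add-one smoothing
            if rel_recent / rel_ref >= ratio:
                out.append(word)
        return out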
This approach is somewhat limited in that it cannot identify new words that correspond to multiword expressions (MWEs). This is especially problematic since many new
words are compounds (Algeo, 1991), which are often written as MWEs. Furthermore,
this method is generally not able to recognize neologisms that correspond to existing word
forms; however, O’Donovan and O’Neil are able to identify some shifts that correspond
to a change in syntactic category by finding usages of inflected forms.
The method of O’Donovan and O’Neil could potentially be improved to better identify
new meanings for existing word forms using more statistical distributional information
about the words under consideration. Lexico-syntactic information for each lemma in
the register corpus could be extracted; this could take the form of a word sketch, for
example. The same information could also be extracted for the lemmas in the new
texts. Rather than compare the lemmas in the reference corpus to those in the new
texts using simple frequency (as in O’Donovan and O’Neil, 2008), their word sketches
could instead be compared. The context in which a word is used—often as little as fifty
characters to the left and right—is usually sufficient to manually determine that word’s
sense (Moon, 1987). Indeed the assumption that context disambiguates has been widely
used in computational work on word sense disambiguation. Therefore, if the association
between a target word, and some other word in a particular syntactic relation, is found to
substantially differ between the register corpus and the new texts, this may be evidence of
a new sense of the target word. Novel MWEs that are sufficiently frequent in a collection
of new texts could be identified using statistics of association such as mutual information
(see Section 2.3.1.3). However, the frequency of many established MWEs is very low,
even in large corpora; therefore, even if we focus on documenting the more frequent new
MWEs, many of them can still be expected to have a low enough frequency that such
statistics will be unreliable. Nevertheless, approaches to identifying neologisms based
on their use in particular patterns that indicate the newness of a word—as discussed
above—appear to be applicable to MWEs and other low-frequency neologisms.
Chapter 3
Lexical blends
Lexical blends, also known as blends, are a common type of new word typically formed by
combining a prefix of one source word with a suffix of another source word, as in brunch
(breakfast and lunch). There may be overlap in the contribution of the source words,
as in fantabulous (fantastic and fabulous). It is also possible that one or both source
words are included in their entirety, for example, gaydar (gay radar ) and jetiquette (jet
etiquette). We refer to blends such as these as simple two-word sequential blends, and
focus on this common type of blend in this chapter. Blends in which (part of) a word
is inserted within another (e.g., entertoyment, a blend of entertainment and toy) and
blends formed from more than two source words (e.g., nofriendo from no, friends, and
Nintendo) are rare. In Algeo’s (1991) study of new words, approximately 5% were blends
(see Table 1.1, page 8). However, in our analysis of 1,186 words taken from a popular
neologisms website, approximately 43% were blends. Clearly, computational techniques
are needed that can augment lexicons with knowledge of novel blends.
The precise nature and intended use of a computational lexicon will determine the
degree of processing required of a novel blend. In some cases it may suffice for the lexical
entry for a blend to simply consist of its source words. For example, a system that
employs a measure of distributional similarity may benefit from replacing occurrences
of a blend—likely a recently-coined and hence low frequency item—by its source words,
for which distributional information is likely available. In other cases, further semantic
reasoning about the blend and its source words may be required (e.g., determining the
semantic relationship between the source words as an approximation to the meaning of
the blend). However, any approach to handling blends will need to recognize that a
novel word is a blend and identify its source words. These two tasks are the focus of this
chapter. Specifically, we draw on linguistic knowledge of how blends are formed as the
basis for automatically determining the source words of a blend.
Language users create blends that tend to be interpretable by others. Tapping into
properties of blends believed to contribute to the recognizability of their source words—
and hence the interpretability of the resulting blend—we develop statistical measures
that indicate whether a candidate word pair is likely the source words for a given blend.
Moreover, the fact that a novel word is determined to have a “good” source word pair
may be evidence that it is in fact a blend, since we are unlikely to find two words that
are a “good” source word pair for a non-blend. Thus, the statistical measures we develop
for source word identification may also be useful in recognizing a novel word as a blend.
This chapter presents the first computational study of lexical blends. It was previously
published as Cook and Stevenson (2010b), which is itself an extended and improved
version of the preliminary work done in this direction by Cook and Stevenson (2007).
Section 3.1 presents our statistical model for identifying a blend’s source words. We
describe our dataset of blends in Section 3.2, and the experimental setup in Section 3.3.
Results for the task of identifying a blend’s source words are given in Section 3.4. Section 3.5 then gives preliminary results for distinguishing blends from other word types.
We discuss related work in Section 3.6, and summarize the contributions of this chapter
in Section 3.7.
3.1 A statistical model of lexical blends
We present statistical features that are used to automatically infer the source words of
a word known to be a lexical blend, and show that the same features can be used to
distinguish blends from other types of neologisms. First, given a blend, we generate
all word pairs that could have formed the blend. This set is termed the candidate set,
and the word pairs it contains are referred to as candidate pairs (Section 3.1.1). Next,
we extract a number of linguistically-motivated statistical features for each candidate
pair, as well as filter from the candidate sets those pairs that are unlikely to be source
words due to their linguistic properties (Section 3.1.2). Later, we explain how we use the
features to rank the candidate pairs according to how likely they are to be the source words
for that blend. Interestingly, the “goodness” of a candidate pair is also related to how
likely it is that the word is actually a blend.
3.1.1 Candidate sets
To create the candidate set for a blend, we first consider each partitioning of the graphemes
of the blend into a prefix and suffix, referred to as a prefix–suffix pair. (In this work,
prefix and suffix refer to the beginning or ending of a string, regardless of whether those
portions are morphological affixes.) We restrict the prefixes and suffixes to be of length
two or more. This heuristic reduces the size of the candidate sets, yet generally does not
exclude a blend’s source words from its candidate set since it is uncommon for a source
word to contribute fewer than two letters.1 For example, for brunch (breakfast+lunch) we
consider the following prefix–suffix pairs: br, unch; bru, nch; brun, ch. For each prefix–
suffix pair, we then find in a lexicon all words beginning with the prefix and all words
ending in the suffix, ignoring hyphens and whitespace, and take the Cartesian product of
the prefix words and suffix words to form a list of candidate word pairs. The candidate
set for the blend is the union of the candidate word pairs for all its prefix–suffix pairs.
Note that in this example, the candidate pair brute crunch would be included twice:
once for the prefix–suffix pair br, unch; and once again for bru, nch. We remove all such
duplicate pairs from the final candidate set. A candidate set for architourist, a blend of
architecture and tourist, is given in Table 3.1.

1 Some examples of blends where a source word contributes just one letter are zorse (zebra and horse)
and vortal (vertical portal).

    archimandrite     tourist
    archipelago       tourist
    architect         behaviourist
    architect         tourist
    architectural     behaviourist
    architectural     tourist
    architecturally   behaviourist
    architecturally   tourist
    architecture      behaviourist
    architecture      tourist
    archives          tourist
    archivist         tourist

Table 3.1: A candidate set for architourist, a blend of architecture and tourist.
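The construction just described amounts to a few lines of code. The sketch below omits the hyphen and whitespace normalization mentioned above, and the toy lexicon is purely illustrative.

    def candidate_set(blend, lexicon):
        """All candidate source-word pairs for a blend: for every split of the
        blend into a prefix and a suffix of length two or more, pair every
        lexicon word beginning with the prefix with every lexicon word ending
        in the suffix."""
        pairs = set()
        for i in range(2, len(blend) - 1):
            prefix, suffix = blend[:i], blend[i:]
            starts = [w for w in lexicon if w.startswith(prefix)]
            ends = [w for w in lexicon if w.endswith(suffix)]
            pairs.update((w1, w2) for w1 in starts for w2 in ends)
        return pairs

    lexicon = {'breakfast', 'brunt', 'bruise', 'lunch', 'crunch', 'church'}
    print(sorted(candidate_set('brunch', lexicon)))
    # includes ('breakfast', 'lunch') among other candidate pairs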
3.1.2 Statistical features
Our statistical features are motivated by properties of blends observed in corpus-based
studies, and by cognitive factors in human interpretation of blends, particularly relating
to how easily humans can recognize a blend’s source words. All the features are formulated to give higher values for more likely candidate pairs. We organize the features into
four groups—frequency; length, contribution, and phonology; semantics; and syllable
structure—and describe each feature group in the following subsections.
3.1.2.1 Frequency
Various frequency properties of the source words influence how easily a language user
recognizes the words that form a blend. Because blends are most usefully coined when
the source words can be readily deduced, we hypothesize that frequency-based features
will be useful in identifying blends and their source words. We propose ten features that
draw on the frequency of candidate source words.
Lehrer (2003) presents a study in which humans are asked to give the source words
for blends. She found that frequent source words are more easily recognizable. Our first
two features—the frequency of each candidate word, freq(w1 ) and freq(w2 )—reflect this
finding. Lehrer also finds that the recognizability of a source word is further affected by
both the number of words in its neighbourhood—the set of words which begin/end with
the prefix/suffix which that source word contributes—and the frequencies of those words.
(Gries (2006) reports a similar finding.) Our next two features, given below, capture this
insight:
    \frac{freq(w_1)}{freq(prefix)}  and  \frac{freq(w_2)}{freq(suffix)}        (3.1)
where freq(prefix) is the sum of the frequency of all words beginning with prefix, and
similarly for freq(suffix).
Because we observe that blends are often formed from two words that co-occur in
language use, we propose six features that capture this tendency. A blend’s source
words often correspond to a common sequence of words, for example, camouflanguage is
camouflaged language. We therefore include two features based on Dice’s coefficient to
capture the frequency with which the source words occur consecutively:
    \frac{2 * freq(w_1 w_2)}{freq(w_1) + freq(w_2)}  and  \frac{2 * freq(w_2 w_1)}{freq(w_1) + freq(w_2)}        (3.2)
Since many blends can be paraphrased by a conjunctive phrase—for example, broccoflower is broccoli and cauliflower —we also use a feature that reflects how often the
candidate words are used in this way:
    \frac{2 * (freq(w_1 and w_2) + freq(w_2 and w_1))}{freq(w_1 and) + freq(and w_1) + freq(w_2 and) + freq(and w_2)}        (3.3)
Furthermore, some blends can be paraphrased by a noun modified by a prepositional
phrase, for example, a nicotini is a martini with nicotine. Lauer (1995) suggests eight
prepositional paraphrases for identifying the semantic relationship between the modifier
and head in a noun compound. Using the same paraphrases, the following feature measures how often two candidate source words occur with any of the following prepositions
P between them: about, at, for, from, in, of, on, with:
    \frac{2 * (freq(w_1 P w_2) + freq(w_2 P w_1))}{freq(w_1 P) + freq(P w_1) + freq(w_2 P) + freq(P w_2)}        (3.4)
where freq(w P v) is the sum of the frequency of w and v occurring with each of the eight
prepositions between w and v, and freq(w P ) is the sum of the frequency of w occurring
with each of the eight prepositions immediately following w.
Since the previous three features target the source words occurring in very specific
patterns, we also count the candidate source words occurring in any of the above patterns
in an effort to avoid data sparseness problems.
    \frac{2 * (freq(w_1 w_2) + freq(w_2 w_1) + freq(w_1 and w_2) + freq(w_2 and w_1) + freq(w_1 P w_2) + freq(w_2 P w_1))}{freq(w_1) + freq(w_2)}        (3.5)
Finally, since the above patterns are very specific, and do not capture general co-occurrence information which may also be useful in identifying a blend’s source words,
we include the following feature which counts the candidate source words co-occurring
within a five-word window.
    \frac{2 * freq(w_1, w_2 in a 5-word window)}{freq(w_1) + freq(w_2)}        (3.6)
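To make the computation of these features concrete, a couple of them might be computed from corpus counts as sketched below; the unigram and bigram count dictionaries and the lexicon are assumed inputs, and the function names are ours.

    def dice_consecutive(w1, w2, unigrams, bigrams):
        """Dice-style score for w1 immediately followed by w2, as in the
        first feature of Equation (3.2)."""
        denom = unigrams.get(w1, 0) + unigrams.get(w2, 0)
        return 2 * bigrams.get((w1, w2), 0) / denom if denom else 0.0

    def neighbourhood_weight(word, contributed_prefix, unigrams, lexicon):
        """freq(w1)/freq(prefix) from Equation (3.1): the word's frequency
        relative to all lexicon words sharing the prefix it contributes."""
        prefix_total = sum(unigrams.get(w, 0) for w in lexicon
                           if w.startswith(contributed_prefix))
        return unigrams.get(word, 0) / prefix_total if prefix_total else 0.0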
3.1.2.2 Length, contribution, and phonology
Ten features tap into properties of the orthographic or phonetic composition of the source
words and blend. Note that although we use information about the phonological and/or
syllabic structure of the source words, we do not assume such knowledge for the blend
itself, since it is a neologism for which such lexical information is typically unavailable.
The first word in a conjunct tends to be shorter than the second, and this also seems
to be the case for the source words in blends (Kelly, 1998; Gries, 2004). The first three
features therefore capture this tendency based on the graphemic, phonemic, and syllabic
length of w2 relative to w1 , respectively:
    \frac{len_{graphemes}(w_2)}{len_{graphemes}(w_1) + len_{graphemes}(w_2)}        (3.7)

    \frac{len_{phonemes}(w_2)}{len_{phonemes}(w_1) + len_{phonemes}(w_2)}        (3.8)

    \frac{len_{syllables}(w_2)}{len_{syllables}(w_1) + len_{syllables}(w_2)}        (3.9)
A blend and its second source word also tend to be similar in length, possibly because,
similar to compounds, the second source word of a blend is often the head; therefore it is
this word that determines the overall phonological structure of the resulting blend (Kubozono, 1990). The following feature captures this property using graphemic length as an
approximation to phonemic length, since as stated above, we assume no phonological
information about the blend b.
    1 - \frac{|len_{graphemes}(b) - len_{graphemes}(w_2)|}{\max(len_{graphemes}(b), len_{graphemes}(w_2))}        (3.10)
We hypothesize that a candidate source word is more likely if it contributes more
graphemes to a blend. We use two ways to measure contribution in terms of graphemes:
cont_seq(w, b) is the length of the longest prefix/suffix of word w which blend b begins/ends
with, and cont_lcs(w, b) is the length of the longest common subsequence (LCS) of w and b.
This yields four features, two using cont_seq and two using cont_lcs:
    \frac{cont_{seq}(w_1, b)}{len_{graphemes}(w_1)}  and  \frac{cont_{seq}(w_2, b)}{len_{graphemes}(w_2)}        (3.11)

    \frac{cont_{lcs}(w_1, b)}{len_{graphemes}(w_1)}  and  \frac{cont_{lcs}(w_2, b)}{len_{graphemes}(w_2)}        (3.12)
Note that for some blends, such as spamdex (spam index), cont_seq and cont_lcs will be
equal; however, this is not the case in general, as in the blend tomacco (tomato and
tobacco) in which tomato overlaps with the blend not only in its prefix toma, but also in
the final o.
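The two contribution measures are easy to state in code; the sketch below assumes lower-cased graphemic strings, and the function names mirror the notation above.

    def cont_seq(word, blend, suffix=False):
        """Length of the longest prefix (or, with suffix=True, suffix) of
        word that the blend begins (or ends) with."""
        if suffix:
            word, blend = word[::-1], blend[::-1]
        n = 0
        for cw, cb in zip(word, blend):
            if cw != cb:
                break
            n += 1
        return n

    def cont_lcs(word, blend):
        """Length of the longest common subsequence of word and blend."""
        prev = [0] * (len(blend) + 1)
        for cw in word:
            cur = [0]
            for j, cb in enumerate(blend, 1):
                cur.append(prev[j - 1] + 1 if cw == cb else max(prev[j], cur[j - 1]))
            prev = cur
        return prev[-1]

    print(cont_seq('tomato', 'tomacco'), cont_lcs('tomato', 'tomacco'))  # 4 5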
In order to be recognizable in a blend, the shorter source word will tend to contribute
more graphemes or phonemes, relative to its length, than the longer source word (Gries,
2004). We formulate the following feature which is positive only when this is the case:
    \left( \frac{cont_{seq}(w_1, b)}{len_{graphemes}(w_1)} - \frac{cont_{seq}(w_2, b)}{len_{graphemes}(w_2)} \right) * \left( \frac{len_{graphemes}(w_2) - len_{graphemes}(w_1)}{len_{graphemes}(w_1) + len_{graphemes}(w_2)} \right)        (3.13)
For this feature we don’t have strong motivation to choose one measure of contribution
over the other, and therefore use cont_seq, the simpler version of contribution.
Finally, the source words in a blend are often phonologically similar, as in sheeple
(sheep people); the following feature captures this (Gries, 2006):
\frac{LCS_{phonemes}(w_1, w_2)}{\max(len_{phonemes}(w_1), len_{phonemes}(w_2))}    (3.14)

3.1.2.3  Semantics
We include two semantic features that are based on Lehrer’s (2003) observation that
people can more easily identify the source words of a blend when there is a semantic
relation between them.
As noted, blends are often composed of two semantically similar words, reflecting
a conjunction of their concepts. For example, a pug and a beagle are both a kind of
dog, and can be combined to form the blend puggle. Similarly an exergame is a blend
of exercise and game, both of which are types of activity. Our first semantic feature
captures similarity using an ontological similarity measure, which is calculated over an
ontology populated with word frequencies from a corpus.
The source words of some blends are not semantically similar (in the sense of their
relative positions within an ontology), but are semantically related. For example, the
source words of slanguist—slang and linguist—are related in that slang is a type of
language and a linguist studies language. Our second semantic feature is a measure of
semantic relatedness using distributional similarity between word co-occurrence vectors.
The semantic features are described in more detail in Section 3.3.2.
3.1.2.4  Syllable structure
Kubozono (1990) notes that the split of a source word—into the prefix/suffix it contributes to the blend and the remainder of the word—occurs at a syllable boundary or
immediately after the onset of the syllable. Because this syllable structure property holds
sufficiently often, we use it as a filter over candidate pairs—rather than as an additional
statistical feature—in an effort to reduce the size of the candidate sets. Candidate sets
can be very large, and we expect that our features will be more successful at selecting the
correct source word pair from a smaller candidate set. In our results below, we analyze
the reduction in candidate set size using this syllable structure heuristic, and its impact
on performance.
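As a rough illustration of this filter, the following sketch computes the admissible split points of a source word; it assumes that the word's syllabification and the length of each syllable's onset are available (e.g., from CELEX), and the names are illustrative.

def valid_split_points(syllables, onset_lengths):
    # Grapheme offsets at which a source word may be split: at a syllable boundary,
    # or immediately after a syllable's onset.  `syllables` is a list of grapheme
    # strings; `onset_lengths` gives the number of graphemes in each syllable's onset.
    points = set()
    boundary = 0
    for syllable, onset in zip(syllables, onset_lengths):
        points.add(boundary)           # at the syllable boundary
        points.add(boundary + onset)   # immediately after the onset
        boundary += len(syllable)
    points.add(boundary)               # the end of the word
    return points

# A candidate pair (w1, w2) passes the filter only if the number of graphemes w1
# contributes to the blend is a valid split point for w1, and similarly for w2.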
3.2  Creating a dataset of recent blends
One potential source of a dataset of blends is the set of entries in a dictionary whose etymology indicates that they were formed as a blend of two words. Using a dictionary
in this way provides an objective method for selecting experimental expressions and indicating their gold standard source words. However, it results in a dataset of blends
that are sufficiently established in the language to appear in a dictionary. Truly novel
blends—neologisms which have been recently added to the language—may have differing properties from fully established forms in a dictionary. In particular, many of our
features are based on properties of the source words, both individually and in relation
to each other, that may not hold for expressions that entered the language some time
ago. For example, although meld is a blend of melt and weld, the phrase melt and weld may not currently co-occur as frequently as the source words of
newly-coined expressions. Thus, an important step to support further research on blends
is to develop a dataset of recent neologisms that are judged to be lexical blends.
staycation n. A stay-at-home vacation. Also: stay-cation.
—staycationer n.

Example Citation:
Amy and Adam Geurden of Hollandtown, Wis., had planned a long summer of short, fun getaways with their kids, Eric, 6, Holly, 3, and Jake, 2. In the works were water-park visits, roller-coaster rides, hiking adventures and a whirlwind weekend in Chicago. Then Amy did the math: their Chevy Suburban gets 17 miles to the gallon and, with gas prices topping $4, the family would have spent about $320 on fill-ups alone. They’ve since scrapped their plans in favor of a “staycation” around the backyard swimming pool.
—Linda Stern, “Try Freeloading Off Friends!,” Newsweek, May 26, 2008

Table 3.2: The Wordspy definition, and first citation given, for the blend staycation.

To develop a dataset of recently-coined blends we drew on www.wordspy.com, a popular website documenting English neologisms (and a small number of rare or specialized
terms) that have been recently used in a recordable medium such as a newspaper or
book, and that (typically) are not found in currently available dictionaries. A (partial)
sample entry from Wordspy is given in Table 3.2. The words on this website satisfy our
goal of being new; however, they include many kinds of neologisms, not just blends. We
thus annotated the dataset to identify the blends and their source words. (In cases where
multiple source words were found to be equally acceptable, all source words judged to
be valid were included in the annotation.) Most expressions in Wordspy include both a
definition and an example usage, making the task fairly straightforward.
Blend type                         Frequency   Example
Simple 2-word sequential blends    351         digifeiter (digital counterfeiter)
Proper nouns                       50          Japanimation (Japanese animation)
Affixes                            61          prevenge (pre- revenge)
Common 1-letter prefix             10          e-business (electronic business)
Non-source word material           7           aireoke (air guitar karaoke)
w2 contributes a prefix            10          theocon (theological conservative)
Foreign word                       4           sousveillance (French sous, meaning under, and English surveillance)
Non-sequential blends              6           entertoyment (entertainment blended with toy)
w1 contributes a suffix            5           caponomics (salary cap economics)
Multiple source words              6           MoSoSo (mobile social software)
Other                              5           CUV (car blended with initialism SUV)

Table 3.3: Types of blends and their frequency in Wordspy data.

As of 17 July 2008, Wordspy contained 1,186 single-word entries. The first author of this study (also the author of this thesis) annotated each of these words as a blend or not a blend, and indicated the source words for each blend. To ensure validity of the annotation
task, the second author similarly annotated 100 words randomly sampled from the 1,186.
On this subset of 100 words, observed agreement on both the blend/non-blend annotation
and the component source word identification was 92%, with an unweighted kappa score
of .84. On four blends, the annotators gave different variants of the same source word;
for example, fuzzy buzzword and fuzz buzzword for the blend fuzzword. These items were
counted as agreements, and all variants were considered correct source words.
Given the high level of agreement between the annotators, only one person annotated
all 1,186 items. Of these, 515 words were judged to be blends, with 351 being simple 2-word
sequential blends whose source words are not proper nouns (this latter type of blend
being the focus of this study). Table 3.3 shows the variety of blends encountered in the
Wordspy data, organized according to a categorization scheme we devised. Of the simple
2-word sequential blends, we restrict our experimental dataset to the 324 items whose
entries included a citation of their usage, as we have evidence that they have in fact been
used; moreover, such blends may be less likely to be nonce-formations—expressions which
are used once but do not become part of the language. The usage data in the citations
can also be used in the future for semantic features based on contextual information. We
refer to this new dataset of 324 items as Wordsplend (a blend of Wordspy and blend ).
3.3  Materials and methods

3.3.1  Experimental expressions
The dataset used in the preliminary version of this study (Cook and Stevenson, 2007)
consisted of expressions from the Macquarie Dictionary (Delbridge, 1981) with an etymology entry indicating that they are blends. All of our statistical features were devised
using the development portion of this dataset, enabling us to use the full Wordsplend
dataset for testing. To compare our current results to those in our preliminary study,
we also perform experiments on a subset of the Macquarie dataset. We are uncertain as
to whether a number of expressions indicated to be blends in the Macquarie Dictionary
are in fact blends. For example, it does not match our intuition that clash is a blend of
clap and dash. We created a second dataset of confirmed blends, Mac-Conf, consisting
of only those blends from Macquarie that are found in at least one of two additional
dictionaries with an etymology entry indicating that they are blends. We report results
on the 30 expressions in the unseen test portion of Mac-Conf.
3.3.2  Experimental resources
We generate candidate sets using two different lexicons: the CELEX lexicon (Baayen
et al., 1995),2 and a wordlist created from the Web 1T 5-gram Corpus (Brants and
Franz, 2006). These are discussed further just below. The frequency information needed
to calculate the frequency features is extracted from the Web 1T 5-gram Corpus. The
length, contribution, and phonology features, as well as the syllable structure filter, are
calculated on the basis of the source words themselves, or are derived from information in
CELEX (when CELEX is the lexicon in use).3 We compute semantic similarity between
the source words using Jiang and Conrath’s (1997) measure in the WordNet-Similarity
package (Pedersen et al., 2004), and we compute semantic relatedness of the pair using
the cosine between word co-occurrence vectors using software provided by Mohammad
and Hirst (2006).
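As a rough stand-in for these two resources, similar quantities can be computed with NLTK's WordNet interface and a simple cosine over co-occurrence counts; the restriction to noun senses and the Brown-corpus information content file are assumptions of this sketch, not details of the system described here.

import math
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content counts from the Brown corpus

def jcn_similarity(word1, word2):
    # Maximum Jiang-Conrath similarity over the noun senses of the two words.
    best = 0.0
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            try:
                best = max(best, s1.jcn_similarity(s2, brown_ic))
            except Exception:
                pass  # skip sense pairs for which the measure is undefined
    return best

def cosine_relatedness(vec1, vec2):
    # Cosine between two word co-occurrence vectors (dicts: context word -> count).
    dot = sum(count * vec2.get(context, 0) for context, count in vec1.items())
    norm = math.sqrt(sum(v * v for v in vec1.values())) * math.sqrt(sum(v * v for v in vec2.values()))
    return dot / norm if norm else 0.0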
2 From CELEX, we use lemmas as potential source words, as it is uncommon for a source word to be an inflected form—there are no such examples in our development data.

3 Note that it would be possible to automatically infer the phonological and syllabic information required for our features using automatic approaches for text-to-phoneme conversion and syllabification (e.g., Bartlett et al., 2008). Although such techniques currently provide noisy information, phonological and syllabic information for the blend itself could also be inferred, allowing the development of features that exploit this information. We leave exploring such possibilities for future work.

We conduct separate experiments with the two different lexicons for candidate set creation. We began by using CELEX, because it contains rich phonological information that some of our features draw on. However, in our analysis of the results, we noted that for many expressions the correct candidate pair is not in the candidate set. Many
of the blends in Wordsplend are formed from words which are themselves new words,
often coined for concepts related to the Internet, such as download, for example; such
words are not listed in CELEX. This motivated us to create a lexicon from a recent
dataset (the Web 1T 5-gram Corpus) that would be expected to contain many of these
new coinages. To form a lexicon from this corpus, we extracted the 100K most frequent
words, restricted to lowercase and all-alphabetic forms. Using this lexicon we expect the
correct source word pair to be in the candidate set for more expressions. However, this
comes at the expense of potentially larger candidate sets, due to the larger lexicon size.
Furthermore, since this lexicon does not contain phonological or syllabic representations
of each word, we cannot extract three features: the feature for the syllable heuristic, and
the two features that capture the tendency for the second source word to be longer than
the first in terms of phonemes and syllables. (We do calculate the phonological similarity
between the two candidate source words, in terms of graphemes.)
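A minimal sketch of this lexicon construction step is given below, assuming the corpus's unigram counts are available as tab-separated word/count lines; the file name and format are assumptions of the sketch.

def build_web1t_lexicon(unigram_file, size=100000):
    # Collect the `size` most frequent lowercase, all-alphabetic word types from a
    # file of tab-separated word/count lines (one word type per line).
    counts = {}
    with open(unigram_file, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            if word.isalpha() and word.islower():
                counts[word] = counts.get(word, 0) + int(count)
    return set(sorted(counts, key=counts.get, reverse=True)[:size])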
3.3.3  Experimental methods
Since each of our features is designed to have a high value for a correct source word pair
and a low value otherwise, we can simply sum the features for each candidate pair to
get a score for each pair indicating its degree of goodness as a source word pair for the
blend under consideration. However, since our various features have values falling on
differing ranges, we first normalize the feature values by subtracting the mean of that
feature within that candidate set and dividing by the corresponding standard deviation.
We also take the arctan of each resulting feature value to reduce the influence of outliers.
We then sum the feature values for each candidate pair, and order the pairs within each
candidate set according to this sum. This ranks the pairs in terms of decreasing degree
of goodness as a source word pair. We refer to this method as the feature ranking
approach.
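The feature ranking approach amounts to a few lines of computation per candidate set; the sketch below follows the description above (per-feature z-scores within the candidate set, arctan, sum, sort), with the data structures chosen for illustration.

import math

def rank_candidates(candidate_features):
    # `candidate_features` maps each candidate source word pair to its list of raw
    # feature values; all candidates of a blend have the same number of features.
    pairs = list(candidate_features)
    n_features = len(next(iter(candidate_features.values())))
    scores = {pair: 0.0 for pair in pairs}
    for i in range(n_features):
        values = [candidate_features[p][i] for p in pairs]
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
        for p in pairs:
            # z-score within the candidate set, then arctan to dampen outliers
            scores[p] += math.atan((candidate_features[p][i] - mean) / std)
    # pairs in decreasing order of goodness as a source word pair for the blend
    return sorted(pairs, key=scores.get, reverse=True)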
We also use a machine learning approach applied to the features in a training regimen.
Our task can be viewed as a classification problem in which each candidate pair is either
a positive instance (the correct source word pair) or a negative instance (an incorrect
source word pair). However, a standard machine learning algorithm does not directly
apply because of the structure of the problem space. In classification, we typically look for
a hyperplane that separates the positive and negative training examples. In the context
of our problem, this corresponds to separating all the correct candidate pairs (for all
blends in our dataset) from all the incorrect candidate pairs. However, such an approach
is undesirable as it ignores the structure of the candidate sets; it is only necessary to
separate the correct source word pair for a given blend from the corresponding incorrect
candidate pairs (i.e., for the same blend). This is also in line with the formulation of
our features, which are designed to give relatively higher values to correct candidate
pairs than incorrect candidate pairs within the candidate set for a given blend; it is not
necessarily the case that the feature values for the correct candidate pair for a given
blend will be higher than those for an incorrect candidate pair for another blend. In
other words, the features are designed to give values that are relative to the candidates
for a particular blend.
To address this issue, we use a version of the perceptron algorithm similar to that
proposed by Shen and Joshi (2005). In this approach, the classifier is trained by only adjusting the perceptron weight vector when the correct candidate pair is scored lower than
the incorrect pairs for the target blend (not across all the candidate pairs for all blends).
Furthermore, to accommodate the large variation in candidate set size, we use an uneven
margin—in this case the distance between the weighted sum of the feature vector for a
correct and incorrect candidate pair—of 1/(|correct cand. pairs| · |incorrect cand. pairs|). We therefore
learn a single weight vector such that, within each candidate set, the correct candidate
pairs are scored higher than the incorrect candidate pairs by a factor of this margin.
When updating the weight vector, we multiply the update that we add to the weight
vector by a factor of this margin to prevent the classifier from being overly influenced
by large candidate sets. During testing, each candidate pair is ranked according to the
weighted sum of its feature vector. To evaluate this approach, on each of Wordsplend
and Mac-Conf we perform 10-fold cross-validation with 10 random restarts. In these
experiments, we use our syllable heuristic as a feature, rather than as a filter, to allow
the learner to weight it appropriately.
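One plausible reading of this training procedure is sketched below; the number of epochs, the initialization, and the exact update rule are assumptions filled in for illustration, and the sketch is not intended as a faithful reimplementation of Shen and Joshi's (2005) algorithm.

def train_uneven_margin_perceptron(candidate_sets, n_features, epochs=10):
    # `candidate_sets` holds, for each blend, a pair (correct, incorrect) of lists of
    # feature vectors for the correct and incorrect candidate source word pairs.
    w = [0.0] * n_features
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    for _ in range(epochs):
        for correct, incorrect in candidate_sets:
            # uneven margin: smaller for blends with larger candidate sets
            margin = 1.0 / (len(correct) * len(incorrect))
            for xc in correct:
                for xi in incorrect:
                    if dot(w, xc) < dot(w, xi) + margin:
                        # scale the update by the margin so that large candidate
                        # sets do not dominate the weight vector
                        w = [wj + margin * (c - i) for wj, c, i in zip(w, xc, xi)]
    return w

# At test time, each candidate pair x is ranked by dot(w, x) within its candidate set.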
3.3.4  Evaluation metrics
We evaluate our methods according to two measures: accuracy and mean reciprocal rank
(MRR). Under the accuracy measure, the system is scored as correct if it ranks one of the
correct source word pairs for a given blend first, and as incorrect otherwise. The MRR
gives the mean, over all blends, of the reciprocal of the rank of the highest-ranked correct source word pair.
Although accuracy is more stringent than MRR, we are interested in MRR to see where
the system ranks the correct source word pair in the case that it is not ranked first.
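Given, for each blend, the rank of its highest-ranked correct source word pair, both measures reduce to a few lines; the interface below is ours, for illustration only.

def accuracy_and_mrr(best_correct_ranks):
    # `best_correct_ranks` holds, for each blend, the 1-based rank of the highest
    # ranked correct source word pair within its candidate set.
    n = len(best_correct_ranks)
    accuracy = sum(1 for r in best_correct_ranks if r == 1) / n
    mrr = sum(1.0 / r for r in best_correct_ranks) / n
    return accuracy, mrr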
We compare the accuracy of our system against two baselines. The chance (random)
baseline is the accuracy obtained by randomly selecting a candidate pair from the candidate set. We also consider an informed baseline in which the feature ranking approach
is applied using just two of our features, the frequency of each candidate source word.
3.4  Experimental results

3.4.1  Candidate sets
Recall that we construct candidate sets using two different resources, CELEX and the
Web 1T 5-gram Corpus. In Section 3.4.1.1 we examine some properties of the candidate
sets created using CELEX (also referred to as the CELEX candidate sets), and then in
Section 3.4.1.2 we consider the candidate sets built from the Web 1T 5-gram Corpus.
                                  Wordsplend              Mac-Conf
Lexical resource or CS            % exps   Med. CS size   % exps   Med. CS size
CELEX                             78       -              83       -
CELEX CS                          76       117            83       121
CELEX CS after syllable filter    71       71             77       92
Web 1T lexicon                    92       -              -        -
Web 1T CS                         89       442            -        -

Table 3.4: % of expressions (% exps) with their source words in each lexical resource and candidate set (CS), and after applying the syllable heuristic filter on the CELEX CS, as well as median CS size, for both the Wordsplend and Mac-Conf datasets.
3.4.1.1  CELEX
Rows 2–4 of Table 3.4 present statistics for the CELEX candidate sets. First, in the
second row of this table (CELEX), we observe that only 78–83% of expressions have
both source words in CELEX. For the other 17–22% of expressions, our system is always
incorrect, since the CELEX candidate set cannot contain the correct source words. The
percentages reported in this row thus serve as an upper bound on the task for each
dataset.
The third row of Table 3.4 (CELEX CS) shows the percentage of expressions for which
the CELEX candidate set contains the correct source words. Note that in most cases, if
the source words are in CELEX, they are also in the CELEX candidate set. The only
expressions in Wordsplend for which that is not the case are those in which a source
word contributes a single letter to the blend. We could remove our restriction that each
source word contribute at least two letters; however, this would cause the candidate sets
to be much larger and likely reduce accuracy.
We now look at the effect of filtering the CELEX candidate sets to include only those
candidate pairs that are valid according to our syllable heuristic. This process results
in a 24–39% reduction in median candidate set size, but only excludes the source words
from the candidate set for a relatively small number of expressions (5–6%), as shown in
the fourth row of Table 3.4 (CELEX CS after syllable filter). We will further examine the
effectiveness of this heuristic when we consider the results for source word identification
in Section 3.4.2.
3.4.1.2  Web 1T 5-gram Corpus
Now we examine the candidate sets created using the lexicon derived from the Web 1T
5-gram Corpus.4 In the final two rows of Table 3.4 (Web 1T lexicon and Web 1T CS)
we see that, as expected, many more expressions have their source words in the Web 1T
lexicon than in CELEX, and furthermore, more expressions have their source words in
the candidate sets created using the Web 1T lexicon than in the candidate sets formed
from CELEX. This means that the upper bound for our task is much higher when using
the Web 1T lexicon than when using CELEX. However, this comes at the cost of creating
much larger candidate sets; we examine this trade-off more thoroughly below.
3.4.2  Source word identification
In the following subsections we present results using the feature ranking approach (Section 3.4.2.1), and analyze some of the errors the system makes in these experiments
(Section 3.4.2.2). We then consider results using the modified perceptron algorithm
(Section 3.4.2.3), and finally we compare our results against those from our preliminary
study (Cook and Stevenson, 2007) and human performance (Section 3.4.2.4).
4 Syllable structure information is not available for all words in the Web 1T lexicon, therefore we do not apply the syllable heuristic filter to the pairs in these candidate sets (see Section 3.3.2). We do not create candidate sets for Mac-Conf using the Web 1T lexicon since this lexicon was constructed specifically in response to the kinds of new words found in Wordsplend.
                                  Wordsplend (324)      Mac-Conf (30)
Features                          CELEX      WEB 1T     CELEX
Random Baseline                   6          3          1
Informed Baseline                 27         27         7
Frequency                         32*        32*        30*
Length/Contribution/Phonology     20         20         7
Semantic                          15         13         20
All                               38*        42*        37*
All+Syllable                      40*        -          37*

Table 3.5: % accuracy on blends in Wordsplend and Mac-Conf using the feature ranking approach. The size of each dataset is given in parentheses. The lexicon employed (CELEX or WEB 1T) is indicated. The best accuracy obtained using this approach for each dataset and lexicon is shown in boldface. Results that are significantly better than the informed baseline are indicated with *.
3.4.2.1  Feature ranking
Table 3.5 gives the accuracy using the feature ranking approach for both the random and
informed baselines (described in Section 3.3.4), each feature group, and the combination
of all features, on each dataset, using both the CELEX and Web 1T lexicons in the case of
Wordsplend. Feature groups and combinations marked with ∗ are significantly better
than the informed baseline at the .05 confidence level using McNemar’s Test. (McNemar’s
Test is a non-parametric test that can be applied to correlated, nominal data.)
We first note that the informed baseline is an improvement over the random baseline
in all cases, which points to the importance of word frequency in blend formation. We
also see that the informed baseline is quite a bit higher on Wordsplend than Mac-
Conf. Inspection of candidate sets—created from the CELEX lexicon—that include the
correct source words reveals that the average source word frequency for Wordsplend
is much higher than for Mac-Conf (118 million vs. 34 million). On the other hand, the
average for non-source words in the candidate sets is similar across these datasets (11M
vs. 9M). Thus, although source words are more frequent than non-source words for both
datasets, frequency is a much more reliable indicator of being a source word for truly
novel blends than for established blends. This finding emphasizes the need for a dataset
such as Wordsplend to evaluate methods for processing neologisms.
All of the individual feature groups outperform the random baseline. We also see
that our frequency features are better than the informed baseline. Although source word
frequency (the informed baseline) clearly plays an important role in forming interpretable
blends, this finding confirms that additional aspects of source word frequency beyond
their unigram counts also play an important role in blend formation. Also note that
the semantic features are substantially better than the informed baseline—although not
significantly so—on Mac-Conf, but not on Wordsplend. This result demonstrates
the importance of testing on true neologisms to have an accurate assessment of a method.
It also supports our future plan to explore alternative semantic features, such as those
that draw on the context of usage of a blend (as provided in Wordsplend).
We expect using all the features to give an improvement in performance over any
individual feature group, since they tap into very different types of information about
blends. Indeed the combination of all features (All) does perform better than the frequency features, supporting our hypothesis that the information provided by the different
feature groups is complementary.5
5 This difference is significant (p < 0.01) according to McNemar's test for the Wordsplend dataset using both the CELEX and Web 1T lexicons. The difference is not significant for Mac-Conf.

Looking at the results on Wordsplend using the Web 1T lexicon, we see that as expected, due to the larger candidate sets, the random baseline is lower than when using
the CELEX lexicon. However, the informed baseline, and each feature group used on
its own, give very similar results, with only a small difference observed for the semantic
features. The combination of all features gives slightly higher performance using the Web
1T lexicon than the CELEX lexicon, although again this difference is rather small.
Recall that we wanted to see if the use of our syllable heuristic filter to reduce candidate set size would have a negative impact on performance. Table 3.5 shows that the
accuracy on all features when we apply our syllable heuristic filter (All+Syllable) is at
least as good as when we do not apply the filter (All). This is the case even though
the syllable heuristic filter removes the correct source word pairs for 5–6% of the blends
(see Table 3.4). It seems that the words this heuristic excludes from consideration are
not those that the features rank highly, indicating that it is a reasonable method for
pruning candidate sets. Moreover, reducing candidate set size will enable future work to
explore features that are more expensive to extract than those currently used. Given the
promising results using the Web 1T lexicon, we also intend to examine ways to automatically estimate the syllable filtering heuristic for words for which we do not have syllable
structure information.
3.4.2.2  Error analysis
We now examine some cases where the system ranks an incorrect candidate pair first, to
try to determine why the system makes the errors it does. We focus on the expressions in
Wordsplend using the CELEX lexicon, as we are able to extract all of our features for
this experimental setup. First, we observe that when considering feature groups individually, the frequency features perform best; however, in many cases, they also contribute
to errors. This seems to be primarily due to (incorrect) candidate pairs that occur very
frequently together. For example, in the case of mathlete (math athlete), the candidate
pair male athlete co-occurs much more frequently than the correct source word pair (math
athlete), causing the system to incorrectly rank the source word pair male athlete first.
We observe a similar situation for cutensil (cute utensil ), where the candidate pair cup
and utensil often co-occur. In both these cases, phonological information for the blend
itself could help as, for example, cute ([kjut]) contributes more phonemes to cutensil
([kjutɛnsl̩]) than cup ([kʌp]).
Turning to the length, contribution, and phonology features, we see that although
many blends exhibit the properties on which these features are based, there are also
many blends which do not. For example, our first feature in this group captures the
property that the second source word tends to be longer than the first; however, this
is not the case for some blends, such as testilie (testify and lie). Furthermore, even
for blends for which the second source word is longer than the first, there may exist
a candidate pair that has a higher value for this feature than the correct source word
pair. In the case of banalysis (banal analysis), banal electrolysis is a better source
word pair according to this feature. These observations, and similar issues with other
length, contribution, and phonology features, likely contribute to the poor performance
of this feature group. Moreover, such findings motivate approaches such as our modified
perceptron algorithm—discussed in the following subsection—that learn a weighting for
the features.
Finally, for the semantic features, we find cases where a blend’s source words are
similar and related, but there is another (incorrect) candidate pair which is more similar
and related according to these features. For example, puggle, a blend of pug and beagle,
has the candidate source words push and struggle which are more semantically similar
and related than the correct source word pair. In this case, the part-of-speech of the
candidate source words, along with contextual knowledge indicating the part-of-speech
of the blend, may be useful; blending pug and beagle would result in a noun, while a
blend of push and struggle would likely be a verb. Another example is camikini, a blend
of camisole and bikini. Both of these source words are women’s garments, so we would
expect them to have a moderately high similarity. However, the semantic similarity
feature assigns this candidate pair the lowest possible score, since these words do not occur in the corpus from which this feature is estimated.

                     Wordsplend (324)      Mac-Conf (30)
Features             CELEX      WEB 1T     CELEX
Informed Baseline    23         24         7
All+Syllable         40*        37*        35*

Table 3.6: % accuracy on blends in Wordsplend and Mac-Conf using the modified perceptron algorithm. The size of each dataset is given in parentheses. The lexicon employed (CELEX or WEB 1T) is indicated. Results that are significantly better than the informed baseline are indicated with *.
3.4.2.3  Modified perceptron
Table 3.6 gives the average accuracy of the modified perceptron algorithm for the informed baseline and the combination of all features plus the feature corresponding to
the syllable heuristic, on each dataset, using both the CELEX and Web 1T lexicons in
the case of Wordsplend. We don’t compare this method directly against the results
using the feature ranking approach since our perceptron experiments are conducted using cross-validation, rather than a held-out test set methodology. Examining the results
using the combination of All+Syllable, we see that for each dataset and lexicon the mean
accuracy over the 10-fold cross-validation is significantly higher than that obtained using
the informed baseline, according to an unpaired t-test (p < 0.0001 in each case).
Interestingly, on Wordsplend using the combination of all features, we see higher
performance using the CELEX lexicon than the Web 1T lexicon. We hypothesize that
this is due to the training data in the latter case containing many more negative examples
(incorrect candidate pairs—due to the larger candidate sets). It is worth noting that,
despite the differing experimental methodologies, the results are in fact not very different
from those obtained in the feature ranking approach. One limitation of this perceptron
algorithm is that it assumes that the training data is linearly separable. In future work,
we will try other machine learning techniques, such as that described by Joachims (2002),
that do not make this assumption.
3.4.2.4  Discussion
We now compare the feature ranking result of 37% accuracy on Mac-Conf here to the best result on this dataset in our preliminary study, 27% accuracy, also obtained using feature ranking (Cook and Stevenson, 2007). To make this comparison, we should consider the
differing baselines and upper bounds across the experiments. The informed baseline in
our preliminary study on Mac-Conf is 13%, substantially higher than the 7% in the
current study. Recall that the first row of Table 3.4 shows the upper bound using the
CELEX lexicon on this dataset to be 83%. By contrast, in our preliminary work we only
use blends whose source words appear in the lexicon we used there (Macquarie), so the
upper bound for that study is 100%. Taking these factors into account, the best results in
our preliminary study correspond to a reduction in error rate (RER) over the informed
baseline of 0.16, while the feature ranking method here using the combination of all
features and the syllable heuristic filter achieves a much higher RER of 0.39. (Reduction in error rate = (accuracy − baseline) / (upper bound − baseline).)
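Substituting the figures above confirms these two values:

\mathrm{RER}_{\mathrm{preliminary}} = \frac{27 - 13}{100 - 13} \approx 0.16 \qquad \mathrm{RER}_{\mathrm{current}} = \frac{37 - 7}{83 - 7} \approx 0.39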
Lehrer (2003) finds human performance for determining the source words of blends
to be 34% to 79%—depending on the blends considered—which indicates the difficulty
of this task. (Note that the high level of interannotator agreement achieved in our
annotation task (Section 3.2) may seem surprising in the context of Lehrer’s results.
However, our task is much easier, since our annotators were given a definition of the
blend, while Lehrer’s subjects were not.) Our best accuracy on each dataset of 37%–42%
is quite respectable in comparison. These accuracies correspond to mean reciprocal ranks
of 0.47–0.51, while the random baseline on Wordsplend and Mac-Conf in terms of
this measure is 0.03–0.07. This indicates that even when our system is incorrect, the
correct source word pair is still ranked fairly high. Such information about the best
interpretations of a blend could be useful in semi-automated methods, such as computer-aided translation, where a human may not be familiar with a novel blend in the source
text. Moreover, a list of possible interpretations for a blend—ranked by their likelihood—
could be more useful for NLP systems for tasks such as machine translation than a single
most likely interpretation.
3.5  Blend identification
The statistical features we have developed may also be informative about whether or not
a word is in fact a blend—that is, we expect that if a novel word has “good” candidate
source words, then the word is more likely to be a blend than the result of another word
formation process. Since our features are designed to be high for a blend’s source words
and low for other word pairs, we hypothesize that the highest scoring candidate pairs for
blends will be higher than those of non-blends.
To test this hypothesis, we first create a dataset of non-blends from our earlier annotation, which found 671 non-blends out of the 1,186 Wordspy expressions (see Section 3.2).
From these words, we eliminate all those beginning with a capital letter (to exclude words
formed from proper nouns) or containing a non-letter character (to exclude acronyms and
initialisms). This results in 663 non-blends.
We create candidate sets for the non-blends using the CELEX lexicon. Using the
CELEX lexicon allows us to extract—and consider the contribution of—all of our length,
contribution, and phonology features, some of which are not available when using the Web
1T lexicon. The candidate sets resulting from using the CELEX lexicon were also much
smaller than when using the Web 1T lexicon. We calculate the features for the non-blends
[Figure 3.1: ROC curves for blend identification.]
as we did for the blends, and then order all expressions (both blends and non-blends)
according to the sum of the features for their highest-scoring candidate source word pair.
We use the same feature groups and combinations presented in Table 3.5. Rather than
set an arbitrary cut-off to distinguish blends from non-blends, we instead give receiver
operating characteristic (ROC) curves for some of these experiments. ROC curves plot
true positive rate versus false positive rate as the cut-off is varied; see Figure 3.1. The
top-left corner represents perfect classification, with points further towards the top-left
from the diagonal (a random classifier) being “better.” We see that the informed baseline
is a substantial improvement over a random classifier, while the combination All+Syllable
is a further improvement over the informed baseline. The individual feature groups (not
shown in Figure 3.1) do not perform as well as All+Syllable. In future work, we plan to
re-examine this task and develop methods specifically for identifying blends and other
types of neologism.
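For reference, an ROC curve of this kind can be traced by sweeping a cut-off over the observed scores; the sketch below assumes each expression's score is the feature sum of its highest-scoring candidate pair, and the function name is illustrative.

def roc_points(blend_scores, nonblend_scores):
    # True positive rate versus false positive rate as the score cut-off is swept
    # from high to low; expressions scoring at or above the cut-off are called blends.
    points = []
    for cutoff in sorted(set(blend_scores) | set(nonblend_scores), reverse=True):
        tpr = sum(s >= cutoff for s in blend_scores) / len(blend_scores)
        fpr = sum(s >= cutoff for s in nonblend_scores) / len(nonblend_scores)
        points.append((fpr, tpr))
    return points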
3.6  Related work
As discussed in Section 2.2, techniques generally used in the automatic acquisition of
syntactic and semantic properties of words are not applicable here, since they use corpus
statistics that cannot be accurately estimated for low-frequency items, such as the novel
lexical blends considered in this study (e.g., Hindle, 1990; Lapata and Brew, 2004; Joanis
et al., 2008). Other work that has used the context in which an unknown word occurs,
along with domain specific knowledge, to infer aspects of its meaning and syntax from
just one usage, or a small number of usages (e.g., Granger, 1977; Cardie, 1993; Hastings
and Lytinen, 1994), is also inapplicable; the domain-specific lexical resources that these
approaches rely on limit their applicability to general text.
Techniques for inferring lexical properties of neologisms can make use of information
that is typically not available in other lexical acquisition tasks—specifically, knowledge
of the processes through which neologisms are formed. Like the computational work
on neologisms discussed in Section 2.1 that concentrates on particular types of words,
this study also focuses on a specific word formation type, namely lexical blends. This
common word type has been unaddressed in computational linguistics except for our
previous work (Cook and Stevenson, 2007, 2010b).
In addition to knowledge about a word’s formation process, for many types of neologism, information about its phonological and orthographic content can be used to infer
aspects of its syntactic and semantic properties. This is the case for neologisms that are
composed of existing words or affixes (e.g., compounds and derivations) or partial orthographic or phonological material from existing words or affixes (e.g., acronyms, clippings,
and blends). For example, in the case of part-of-speech tagging, information about the
suffix of an unknown word can be used to determine its part-of-speech (e.g., Brill, 1994;
Ratnaparkhi, 1996; Mikheev, 1997, discussed in Section 2.1.1.1). For the task of inferring
the longform of an acronym, the letters which compose a given acronym can be used to
determine the most likely longform (e.g., Schwartz and Hearst, 2003; Nadeau and Turney,
2005; Okazaki and Ananiadou, 2006, discussed in Section 2.1.3.1).
The latter approach to acronyms is somewhat similar to the way in which we use
knowledge of the letters that make up a blend to form candidate sets and determine the
most likely source words. However, in the case of acronyms, each word in a longform
typically contributes only one letter to the acronym, while for blends, a source word
usually contributes more than one letter. At first glance, it may appear that this makes
the task of source word identification easier for blends, since there is more source word
material available to work with. However, acronyms have two properties that help in
their identification. First, there is less uncertainty in the “split” of an acronym, since
each letter is usually contributed by a separate word. By contrast, due to the large
variation in the amount of material contributed by the source words in blends, one
of the challenges in blend identification is to determine which material in the blend
belongs to each source word. Second, and more importantly, acronyms are typically
introduced in regular patterns (e.g., the longform followed by the acronym capitalized and
in parentheses) which can be exploited in acronym identification and longform inference;
in the case of blends there is no counterpart for this information.
3.7  Summary of contributions
This is the first computational study to consider lexical blends, a very frequent class
of new words. We propose a statistical model for inferring the source words of lexical
blends based largely on properties related to the recognizability of their source words.
We also introduce a method based on syllable structure for reducing the number of words
that are considered as possible source words. We evaluate our methods on two datasets,
one consisting of novel blends, the other containing established blends; in both cases
our features significantly outperform an informed baseline. We further show that our
methods for source word identification can also be used to distinguish blends from other
word types. We find evidence that blends tend to have candidate source word pairs
that are “good” according to our features while non-blends tend not to. In addition, we
annotate a dataset of newly-coined expressions which will support future research not
only on lexical blends, but on neologisms in general.
Chapter 4
Text message forms
Cell phone text messages—or SMS—contain many shortened and non-standard forms due
to a variety of factors, particularly the desire for rapid text entry (Grinter and Eldridge,
2001; Thurlow, 2003). Abbreviated forms may also be used because text messages are sometimes limited to 160 characters, although this is
not always the case. Furthermore, text messages are written in an informal register;
non-standard forms are used to reflect this, and even for personal style (Thurlow, 2003).
These factors result in tremendous linguistic creativity, and hence many novel lexical
items, in the language of text messaging, or texting language.
One interesting consideration is whether text messaging and conventional writing constitute two separate writing systems or variants of the same writing system. Sampson
(1985) claims that making distinctions between writing systems is as difficult as determining whether the speech of two communities corresponds to two different languages
or two dialects of the same language. Nevertheless, although we may not be able to
decisively answer this question, here we consider some of the differences between conventional writing and text messaging from the perspective of writing systems. Logographs
are symbols which represent morphemes or words. English logographs include @ (at) and
& (and ). Phonographs are symbols which represent sounds. The English writing system
is largely phonographic. Logographs in conventional writing can be used as phonographs
in text messaging. For example, numerals are widely used phonographically in text messaging forms, as in any1 (anyone) and b4 (before). Furthermore, in text messaging some
symbols can be used as word-level phonographs—e.g., r (are) and u (you)—but are not
typically used this way in conventional writing. Moreover, the use of logographs appears
to be more common in text messaging than in conventional writing. These differences
between conventional writing and text messaging pose difficulties for natural language
processing (NLP).
Normalization of non-standard forms is a challenge that must be tackled before other
types of NLP can take place (Sproat et al., 2001). In the case of text messages, text-to-speech synthesis may be particularly useful for the visually impaired. For texting
language, given the abundance of creative forms, and the wide-ranging possibilities for
creating new forms, normalization is a particularly important problem, and has indeed
received some attention in computational linguistics (e.g., Aw et al., 2006; Choudhury
et al., 2007; Kobus et al., 2008; Yvon, 2010). Indeed, Aw et al. show that normalizing text messages prior to translation (to another language) can improve the quality
of the resulting translation, emphasizing the importance of methods for text message
normalization.
In this chapter we propose an unsupervised noisy channel method for texting language normalization that gives performance on par with that of a supervised system.
We use the term unsupervised here to mean specifically that our proposed method does
not rely directly on gold-standard training data, i.e., pairs of text messages and their
corresponding standard forms. We pursue unsupervised approaches to this problem, as
a large collection of gold-standard training data is not readily available. One notable exception is Fairon and Paumier (2006), although this resource is in French. (The resource
used in our study, provided by Choudhury et al. (2007), is quite small in comparison.)
Furthermore, other forms of computer-mediated communication, such as Internet mes-
saging and microblogging, for example, Twitter,1 exhibit creative phenomena similar to
text messaging, although at a lower frequency (at least in the case of Internet messaging,
Ling and Baron, 2007). Moreover, technological changes, such as new input devices, are
likely to have an impact on the language of such media (Thurlow, 2003). On the other
hand, the rise of technology such as word prediction could reduce the use of abbreviations
in computer-mediated communication; however, it’s not clear such technology is widely
used (Grinter and Eldridge, 2001). An unsupervised approach, drawing on linguistic
properties of creative word formations, has the potential to be adapted for normalization
of text in other similar media—such as microblogging—without the cost of developing a
large training corpus. Moreover, normalization may be particularly important for such
media, given the need for applications such as translation and question answering.
We observe that many creative texting forms are the result of a small number of specific word formation processes. Rather than using a generic error model to capture all of
them, we propose a mixture model in which each word formation process is modeled explicitly according to linguistic observations specific to that formation. We do not consider
our method’s reliance on our observations about common word formation processes in
texting language to constitute a supervised approach—our method requires only general
observations about word formation processes and not specific gold-standard pairs of text
messages and their normalized forms. The remainder of this chapter is organized as follows: we present an analysis of a collection of texting forms in Section 4.1, which forms
the basis for the unsupervised model of text message normalization described in Section 4.2. We discuss the experimental setup and system implementation in Section 4.3,
and present results in Section 4.4. Finally, we discuss related work in Section 4.5 and
summarize the contributions of this study in Section 4.6.
1 http://twitter.com/
4.1  Analysis of texting forms
To better understand the creative processes present in texting language, we categorize the
word formation process of each texting form in our development data, which consists of
400 texting forms paired with their standard forms.2 Several iterations of categorization
were done in order to determine sensible categories, and ensure categories were used
consistently. Since this data is only to be used to guide the construction of our system,
and not for formal evaluation, only one judge categorized the expressions (the author of
this thesis, a native English speaker). The findings are presented in Table 4.1.
Stylistic variations, by far the most frequent category, exhibit non-standard spelling,
such as representing sounds phonetically. Subsequence abbreviations, also very frequent,
are composed of a subsequence of the graphemes in a standard form, often omitting
vowels. These two formation types account for approximately 66% of our development
data; the remaining formation types are much less frequent. Suffix clippings and prefix
clippings consist of a prefix or suffix, respectively, of a standard form, and in some cases
a diminutive ending; we also consider clippings which omit just a final g (e.g., talkin) or
initial h (e.g., ello) from a standard form as they are rather frequent. (Thurlow (2003) also
observes an abundance of g-clippings.) A single letter or digit can be used to represent
a syllable; we refer to these as syllabic letter/digit. Phonetic abbreviations are variants
of clippings and subsequence abbreviations where some sounds in the standard form are
represented phonetically. Several texting forms appear to be spelling errors; we took the
layout of letters on cell phone keypads into account when making this judgement. The
items that did not fit within the above texting form categories were marked as unclear.
Finally, for some expressions the given standard form did not appear to be appropriate.
For example, gal is a colloquial English word meaning roughly the same as girl, but was
annotated as a texting form of the standard form girl. Such cases were marked as errors.

2 Most texting forms have a unique standard form; however, some have multiple standard forms, e.g., will and well can both be shortened to wl. In such cases we choose the word formation process through which the texting form would be created from the most frequent standard form; in the case of frequency ties we choose arbitrarily among the categories corresponding to the most frequent standard forms.

Formation type             Frequency   Texting form   Standard form
Stylistic variation        152         betta          better
Subsequence abbreviation   111         dng            doing
Suffix clipping            24          hol            holiday
Syllabic letter/digit      19          neway          anyway
G-clipping                 14          talkin         talking
Phonetic abbreviation      12          cuz            because
H-clipping                 10          ello           hello
Spelling error             5           darliog        darling
Prefix clipping            4           morrow         tomorrow
Punctuation                3           b/day          birthday
Unclear                    34          mobs           mobile
Error                      12          gal            girl
Total                      400

Table 4.1: Frequency of texting forms in the development set by formation type.
No texting forms in our development data correspond to multiple standard form
words, for example, wanna for want to. (A small number of similar forms, however, appear with a single standard form word, and are marked as errors, e.g., the texting form
wanna annotated as the standard form want.) Acronyms and initialisms, such as LOL
and OMG, respectively, can also occur in text messaging, but are again very infrequent in
our development data. Previous approaches to text message normalization, such as Aw
et al. (2006) and Kobus et al. (2008), have considered issues related to texting forms corresponding to multiple standard forms. However, these approaches have limited means
for normalizing out-of-vocabulary texting forms. On the other hand, we focus specifically on creative formations in texting language. According to our development data,
such forms tend to have a one-to-one correspondence with a standard form. Moreover,
many texting forms which do correspond to multiple standard forms are well-established,
and could perhaps best be normalized through lexicon-based approaches. We therefore
assume that—for the purposes of this study—a texting form always corresponds to a
single standard form word.
It is important to note that some texting forms have properties of multiple categories,
for example, bak (back ) could be considered a stylistic variation or a subsequence abbreviation. At this stage we are only trying to get a sense as to the common word formation
processes in texting language, and therefore in such cases we simply attempt to assign
the most appropriate category.
The design of our model for text message normalization, presented in the following
section, uses properties of the observed formation processes.
4.2  An unsupervised noisy channel model for text message normalization
Let S be a sentence consisting of standard forms s1 s2 ...sn ; in this study the standard
forms si are regular English words. Let T be a sequence of texting forms t1 t2 ...tn , which
are the texting language realization of the standard form words, and may differ from the
standard forms. Given a sequence of texting forms T , the challenge is then to determine
the corresponding standard forms S.
Following Choudhury et al. (2007)—and various approaches to spelling error correction, such as, for example, Mays et al. (1991)—we model text message normalization
using a noisy channel. We want to find argmaxS P (S|T ). We apply Bayes rule and ignore the constant term P (T ), giving argmaxS P (T |S)P (S). Making the independence
assumption that each texting form ti depends only on the standard form word si , and
not on the context in which it occurs, as in Choudhury et al., we express P (T |S) as a
product of probabilities: argmaxS (∏i P (ti |si )) P (S).
We note in Section 4.1 that many texting forms are created through a small number
of specific word formation processes. Rather than model each of these processes at once
using a generic model for P (ti |si ), as in Choudhury et al., we instead create several such
models, each corresponding to one of the observed common word formation processes.
We therefore rewrite P (ti |si ) as ∑wf P (ti |si , wf )P (wf ), where wf is a word formation
process, e.g., subsequence abbreviation. Since, like Choudhury et al., we focus on the
word model, we simplify our model as below, to consider a single word si as opposed to
sequence of words S.
\mathrm{argmax}_{s_i} \sum_{wf} P(t_i|s_i, wf)\, P(wf)\, P(s_i)
We next explain the components of the model, P (ti |si , wf ), P (wf ), and P (si ), referred
to as the word model, word formation prior, and language model, respectively.
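A minimal sketch of how these components combine at normalization time is given below; the interfaces (a dictionary of word models, a prior over formation processes, and a unigram language model) are illustrative choices, and the heuristic digit and punctuation preprocessing described above is assumed to have already been applied to the texting form.

def normalize_word(t, lexicon, word_models, P_wf, P_s):
    # Return the standard form s maximizing the sum over wf of P(t|s, wf) P(wf) P(s).
    # `word_models` maps each modelled formation process wf to a function giving
    # P(t|s, wf); `P_wf` is the word formation prior; `P_s` gives the unigram
    # language model probability of a standard form.
    best, best_score = None, 0.0
    for s in lexicon:
        score = sum(model(t, s) * P_wf[wf] for wf, model in word_models.items()) * P_s(s)
        if score > best_score:
            best, best_score = s, score
    return best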
4.2.1  Word models
We now consider which of the word formation processes discussed in Section 4.1 to capture with a word model P (ti |si , wf ). Our choices here are based on the frequency of a
word formation process in the development data and how specific that process is. We
model stylistic variations and subsequence abbreviations simply due to their frequency.
We also choose to model suffix clippings since this word formation process is common
outside of text messaging (Kreidler, 1979; Algeo, 1991) and fairly frequent in our data.
Although g-clippings and h-clippings are moderately frequent, we do not model them,
as these very specific word formations are also (non-prototypical) subsequence abbreviations. The other less frequent formations—phonetic abbreviations, spelling errors, and
prefix clippings—are not modeled. On the one hand these word formation processes are
infrequent, and therefore modeling them explicitly is not expected to greatly improve
our system. On the other hand, these processes are somewhat similar to those we do
model, particularly stylistic variations. We therefore hypothesize that due to this similarity the system will perform reasonably well on these word formation processes that are
not modeled. Syllabic letters and digits, or punctuation, cannot be captured by any of
the word formation processes that we do model, and are therefore incorporated into our
model despite their low frequency in the development data. We capture these formations
heuristically by substituting digits with a graphemic representation (e.g., 4 is replaced
by for ), and removing punctuation, before applying the model.
4.2.1.1  Stylistic variations
We propose a probabilistic version of edit-distance—referred to here as edit-probability—
inspired by Brill and Moore (2000) to model P (ti |si , stylistic variation). To compute
edit-probability, we consider the probability of each edit operation—substitution, insertion, and deletion—instead of its cost, as in edit-distance. We then simply multiply the probabilities of edits as opposed to summing their costs. (Edit-probability could equivalently be thought of as a version of edit-distance in which the cost of each edit operation is its log probability and the costs are then summed as in the standard version of edit-distance.)

graphemes   w   i   th   ou   t
phonemes    w   ɪ   θ    aʊ   t

Table 4.2: Grapheme–phoneme alignment for without.
In this version of edit-probability, we allow two-character edits. Ideally, we would
compute the edit-probability of two strings as the sum of the edit-probability of each
partitioning of those strings into one or two character segments. However, following
Brill and Moore, we approximate this by the probability of the partition with maximum
probability. This allows us to compute edit-probability using a simple adaptation of
edit-distance, in which we consider edit operations spanning two characters at each cell
in the chart maintained by the algorithm.
We compute edit-probability between the graphemes of si and ti . When filling each
cell in the chart, we consider edit operations between segments of si and ti of length 0–2,
referred to as a and b, respectively. We also incorporate phonemic information when
computing edit-probability. In our lexicon, the graphemes and phonemes of each word
are aligned according to the method of Jiampojamarn et al. (2007). For example, the
alignment for without is given in Table 4.2. When computing the probability of each
cell, if a aligns with phonemes in si , we also consider those phonemes, p. For example,
considering the alignment in Table 4.2, if a were th we would consider the phoneme [θ];
however, if a were h, a would not align with any phonemes, and we would not consider
phonemic information. The probability of each edit operation is then determined by
three properties—the length of a, whether a aligns with any phonemes in si , and if so,
those phonemes p—as shown below:
|a| = 0 or 1, not aligned with si phonemes: Pg (b|a, position)
|a| = 2, not aligned with si phonemes: 0
|a| = 1 or 2, aligned with si phonemes: Pp,g (b|p, a, position)
where Pg (b|a, position) is the probability of texting form grapheme b given standard form
grapheme a at word position position, where position is the beginning, middle, or end of
the word; Pp,g (b|p, a, position) is the probability of texting form graphemes b given the
standard form phonemes p and graphemes a at word position position. a, b, and p can
be a single grapheme or phoneme, or a bigram.
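To make the dynamic program concrete, the following simplified sketch computes edit-probability under the maximum-probability approximation, allowing segments of length 0–2 at each cell. It omits the phonemic information and word-position conditioning described above, and the substitution model p_g is a caller-supplied stand-in rather than the distributions we actually estimate.

from functools import lru_cache

def edit_probability(s, t, p_g):
    """Edit-probability between standard form s and texting form t under the
    maximum-probability approximation of Brill and Moore (2000): only the most
    probable partitioning into segments of length 0-2 is kept.  p_g(a, b) is a
    caller-supplied estimate of P(texting segment b | standard segment a)."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Probability of the best alignment of s[:i] with t[:j].
        if i == 0 and j == 0:
            return 1.0
        scores = []
        for di in range(3):          # length of the standard-form segment a
            for dj in range(3):      # length of the texting-form segment b
                if di == dj == 0 or di > i or dj > j:
                    continue
                a, b = s[i - di:i], t[j - dj:j]
                scores.append(best(i - di, j - dj) * p_g(a, b))
        return max(scores) if scores else 0.0

    return best(len(s), len(t))

# Toy substitution model: identity edits are likely, deletions and other
# edits much less so.
def toy_p_g(a, b):
    if a == b:
        return 0.9
    return 0.05 if b == "" else 0.01

print(edit_probability("without", "wivout", toy_p_g))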
4.2.1.2 Subsequence abbreviations
We model subsequence abbreviations according to the equation below:
P (ti |si , subsequence abbreviation) =   c   if ti is a subsequence of si
                                          0   otherwise

where c is a constant.
Note that this is similar to the error model for spelling correction presented by Mays
et al. (1991), in which all words (in our terms, all si ) within a specified edit-distance of the
out-of-vocabulary word (ti in our model) are given equal probability. The key difference
is that in our formulation, we only consider standard forms for which the texting form is
potentially a subsequence abbreviation.
In combination with the language model, P (ti |si , subsequence abbreviation) assigns
a non-zero probability to each standard form si for which ti is a subsequence, according
to the likelihood of si (under the language model). The interaction of the models in this
way corresponds to our intuition that a standard form will be recognizable—and therefore
frequent—relative to the other words for which ti could be a subsequence abbreviation.
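This interaction can be sketched as follows: the subsequence test admits the candidate standard forms, and the unigram language model alone then ranks them, since the word model is constant over the candidates. The lexicon and probabilities below are toy values, not our actual resources.

def is_subsequence(t, s):
    """True if texting form t can be obtained from s by deleting characters."""
    chars = iter(s)
    return all(c in chars for c in t)

def rank_candidates(t, unigram_probs, n=10):
    """Rank the standard forms s for which t is a subsequence by P(s).
    unigram_probs is a toy stand-in for a unigram language model."""
    candidates = [(p, s) for s, p in unigram_probs.items() if is_subsequence(t, s)]
    return [s for _, s in sorted(candidates, reverse=True)[:n]]

toy_probs = {"text": 6e-4, "taxation": 5e-5, "tomorrow": 2e-4}
print(rank_candidates("txt", toy_probs))   # ['text', 'taxation']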
4.2.1.3 Suffix clippings
We model suffix clippings similarly to subsequence abbreviations.
P (ti |si , suffix clipping) =   c   if ti is a possible suffix clipping of si
                                 0   otherwise
Kreidler (1979) observes that clippings tend to be mono-syllabic and end in a consonant.
Furthermore, when they do end in a vowel, it is often of a regular form, such as telly for
television and breaky for breakfast. We therefore only consider P (ti |si , suffix clipping) if ti
is a suffix clipping according to the following heuristics: ti is mono-syllabic after stripping
any word-final vowels, and subsequently removing duplicated word-final consonants (e.g.,
telly becomes tel, which is a candidate suffix clipping). If ti is not a suffix clipping
according to these criteria, P (ti |si ) simply sums over all models except suffix clipping.
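A rough sketch of this candidacy test is given below. The syllable check is a crude vowel-group count standing in for a proper syllabifier, which the chapter does not spell out, so it should be read as an approximation rather than our implementation.

import re

def is_clipping_candidate(t):
    """Heuristic test that texting form t could be a suffix clipping: strip
    word-final vowels (treating y as a vowel), collapse a duplicated word-final
    consonant, and require the remainder to be mono-syllabic.  The syllable
    test simply counts vowel groups."""
    form = t.rstrip("aeiouy")                      # telly -> tell, breaky -> break
    form = re.sub(r"([^aeiou])\1$", r"\1", form)   # tell -> tel
    return len(re.findall(r"[aeiouy]+", form)) == 1

for form in ["telly", "breaky", "holiday"]:
    print(form, is_clipping_candidate(form))       # True, True, False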
4.2.2 Word formation prior
Our goal is an unsupervised system, and therefore we do not have access to gold-standard
texting form–standard form pairs. It is not clear how to estimate P (wf ) without such
data, so we simply assume a uniform distribution for P (wf ). We also consider estimating
P (wf ) using maximum likelihood estimates (MLEs) from our observations in Section 4.1.
This gives a model that is not fully unsupervised, since it relies on labelled training data.
However, we consider this a lightly-supervised method, since it only requires an estimate
of the frequency of the relevant word formation types.
4.2.3 Language model
Choudhury et al. (2007) find that using a bigram language model estimated over a balanced corpus of English had a negative effect on their results compared with a unigram
language model, which they attribute to the unique characteristics of text messaging
that were not reflected in the corpus. We therefore use a unigram language model for
P (si ), which also enables comparison with their results. Nevertheless, alternative language models, such as higher order n-gram models, could easily be used in place of our
unigram language model.
4.3 Materials and methods

4.3.1 Datasets
We use the data provided by Choudhury et al. (2007) which consists of texting forms—
extracted from a collection of 900 text messages—and their manually determined standard forms. Our development data—used for model development and discussed in Section 4.1—consists of the 400 texting form types that are not in Choudhury et al.’s held-out
test set, and that are not the same as one of their standard forms. The test data consists
of 1,213 texting forms and their corresponding standard forms. A subset of 303 of these
texting forms differ from their standard form.3 This subset is the focus of this study, but
we also report results on the full dataset.
3 Choudhury et al. report that this dataset contains 1,228 texting forms. We found it to contain
1,213 texting forms corresponding to 1,228 standard forms (recall that a texting form may have multiple
standard forms). There were similar inconsistencies with the subset of texting forms that differ from
their standard forms. Nevertheless, we do not expect these small differences to have an appreciable effect
on the results.
4.3.2 Lexicon
We construct a lexicon of potential standard forms such that it contains most words
that we expect to encounter in text messages, yet is not so large as to make it difficult
to identify the correct standard form. Our subjective analysis of the standard forms in
the development data is that they are frequent, non-specialized words. To reflect this
observation, we create a lexicon consisting of all single-word entries containing only alphabetic characters found in both the CELEX Lexical Database (Baayen et al., 1995)
and the CMU Pronouncing Dictionary.4 We remove all words of length one (except a and
I ) to avoid choosing, for example, the letter r as the standard form for the texting form
r. We further limit the lexicon to words in the 20K most frequent alphabetic unigrams,
ignoring case, in the Web 1T 5-gram Corpus (Brants and Franz, 2006). The resulting lexicon contains approximately 14K words, and excludes only three of the standard
forms—cannot, email, and online—for the 400 development texting forms.

4 http://www.speech.cs.cmu.edu/cgi-bin/cmudict
4.3.3 Model parameter estimation
MLEs for Pg (b|a, position)—needed to estimate P (ti |si , stylistic variation)—could be estimated from texting form–standard form pairs. However, since our system is unsupervised, such data cannot be used. We therefore assume that many texting forms, and
other similar creative shortenings, occur on the web. We develop a number of character
substitution rules, for example, s ⇒ z, and use them to create hypothetical texting forms
from standard words. We then compute MLEs for Pg (b|a, position) using the frequencies
of these derived forms on the web.
We create the substitution rules by examining examples in the development data,
considering fast speech variants and dialectal differences (e.g., voicing), and drawing on
our intuition. The derived forms are produced by applying the substitution rules to
the words in our lexicon. To avoid considering forms that are themselves words, we
eliminate any form found in a list of approximately 480K words taken from SOWPODS
(the wordlist used in many Scrabble tournaments),5 and the Moby Word Lists.6 Finally,
we obtain the frequency of the derived forms from the Web 1T 5-gram Corpus.
To estimate Pp,g (b|p, a, position), we begin by estimating two simpler distributions: Ph (b|a, position) and Pp (b|p, position). Ph (b|a, position) is estimated in the
same manner as Pg (b|a, position), except that two-character substitutions are allowed.
Pp (b|p, position) is estimated from the frequency of p, and its alignment with b, in a
version of CELEX in which the graphemic and phonemic representation of each word is
many–many aligned using the method of Jiampojamarn et al. (2007).7 Pp,g (b|p, a, position)
is then an evenly-weighted linear combination of the estimates of this distribution using only graphemic and phonemic information, Ph (b|a, position) and Pp (b|p, position),
respectively. Finally, we smooth each of Pg (b|a, position) and Pp,g (b|p, a, position) using
add-alpha smoothing.
We set the constant c in our word models for subsequence abbreviations and suffix
clippings such that Σ_{si} P (ti |si , wf )P (si ) = 1. We similarly normalize the product of the
stylistic variation word model and the language model, P (ti |si , stylistic variation)P (si ).8
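A minimal sketch of this normalization, assuming a word model that is a constant c over the standard forms passing the relevant test (the subsequence or clipping check) and zero elsewhere; the names are illustrative.

def normalization_constant(t, unigram_probs, condition):
    """Return c such that the sum over s of P(t | s, wf) P(s) equals 1 when
    P(t | s, wf) = c for standard forms s satisfying condition(t, s) and 0
    otherwise."""
    mass = sum(p for s, p in unigram_probs.items() if condition(t, s))
    return 1.0 / mass if mass > 0 else 0.0

# e.g., the constant for the subsequence abbreviation model of "txt":
# c = normalization_constant("txt", unigram_probs, is_subsequence)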
We use the frequency of unigrams (ignoring case) in the Web 1T 5-gram Corpus to
estimate our language model. We expect the language of text messaging to be more
similar to that found on the web than that in a balanced corpus of English, such as the
British National Corpus (Burnard, 2007).
5 http://en.wikipedia.org/wiki/SOWPODS
6 http://icon.shef.ac.uk/Moby/
7 We are very grateful to Sittichai Jiampojamarn for providing this alignment.
8 In our implementation of this model we in fact estimate P (si |ti , wf ) directly for subsequence abbreviations and suffix clippings rather than applying Bayes rule and calculating P (ti |si , wf )P (si ). The normalization is necessary to account for the constant factor of P (ti ) which is dropped from the denominator when Bayes rule is applied, but not when direct estimation is used. We present the model as in Section 4.2 to follow the presentation of previous models, such as Choudhury et al. (2007).
                        % accuracy
Model                In-top-1   In-top-10   In-top-20
Uniform                59.4       83.8        87.8
MLE                    55.4       84.2        86.5
Choudhury et al.       59.9       84.3        88.7

Table 4.3: % in-top-1, in-top-10, and in-top-20 accuracy on test data using both estimates for P (wf ). The results reported by Choudhury et al. (2007) are also shown.
4.3.4 Evaluation metrics
To evaluate our system, we consider three accuracy metrics: in-top-1, in-top-10, and in-top-20. (These are the same metrics used by Choudhury et al. (2007), although we refer
to them by different names.) In-top-n considers the system correct if a correct standard
form is in the n most probable standard forms. The in-top-1 accuracy shows how well
the system determines the correct standard form; the in-top-10 and in-top-20 accuracies
may be indicative of the usefulness of the output of our system in other tasks which could
exploit a ranked list of standard forms, such as machine translation.
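In code, the metric amounts to the following; the data structures are assumptions for illustration (a ranked candidate list per texting form, and a set of acceptable standard forms, since a texting form may have more than one).

def in_top_n_accuracy(rankings, gold, n):
    """Percentage of texting forms for which a correct standard form appears
    among the n most probable candidates.  rankings maps each texting form to
    a list of standard forms in decreasing order of probability; gold maps
    each texting form to the set of correct standard forms."""
    hits = sum(1 for t, ranked in rankings.items() if gold[t] & set(ranked[:n]))
    return 100.0 * hits / len(rankings)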
4.4 Results and discussion
In Table 4.3 we report the results of our system using both the uniform estimate and the
MLE of P (wf ). Note that there is no meaningful random baseline to compare against
here; randomly ordering the 14K words in our lexicon gives very low accuracy. The
results using the uniform estimate of P (wf )—a fully unsupervised system—are very
similar to the supervised results of Choudhury et al. (2007). Surprisingly, when we
estimate P (wf ) using MLEs from the development data—resulting in a lightly-supervised
system—the results are slightly worse than when using the uniform estimate of this
probability. One explanation for this is that the distribution of word formations is very
different for the testing data and development data. However, we observe the same
difference in performance between the two approaches on the development data where
we expect to have an accurate MLE for P (wf ) (results not shown). We hypothesize that
the ambiguity of the categories of texting forms (see Section 4.1) results in poor MLEs
for P (wf ), thus making a uniform distribution, and hence a fully-unsupervised approach,
more appropriate.
4.4.1 Results by formation type
We now consider in-top-1 accuracy for each word formation type, in Table 4.4. We
show results for the same word formation processes as in Table 4.1 (page 75), except
for h-clippings and punctuation, as no words of these categories are present in the test
data. We present results using the same experimental setup as before with a uniform
estimate of P (wf ) (All), and using just the model corresponding to the word formation
process (Specific), where applicable. (In this case our model then becomes, for each word
formation process wf , argmax_{si} P (ti |si , wf )P (si ).)
We first examine the top panel of Table 4.4 where we compare the performance on each
word formation type for both experimental conditions (Specific and All). We note
that the performance using the formation-specific model on subsequence abbreviations
and suffix clippings is better than that of the overall model. This is unsurprising since we
expect that when we know a texting form’s formation process, and invoke a corresponding
specific model, our system should outperform a model designed to handle a range of
formation types. However, this is not the case for stylistic variations; here the overall
model performs better than the specific model. We observed in Section 4.1 that some
texting forms do not fit neatly into our categorization scheme; indeed, many stylistic
variations are also analyzable as subsequence abbreviations. Therefore, the subsequence
abbreviation model may improve normalization of stylistic variations. This model, used
in isolation on stylistic variations, gives an in-top-1 accuracy of 33.1%, indicating that this may be the case.

Formation type              Frequency (n = 303)   Specific    All
Stylistic variation                 121             62.8      67.8
Subsequence abbreviation             65             56.9      46.2
Suffix clipping                      25             44.0      20.0
G-clipping                           56               -       91.1
Syllabic letter/digit                16               -       50.0
Unclear                              12               -        0.0
Spelling error                        5               -       80.0
Prefix clipping                       1               -        0.0
Phonetic abbreviation                 1               -        0.0
Error                                 1               -        0.0

Table 4.4: Frequency and % in-top-1 accuracy using the formation-specific model where applicable (Specific) and all models (All) with a uniform estimate for P (wf ), presented by formation type.
Comparing the performance of the individual word models on only word types that
they were designed for (column Specific in Table 4.4), we see that the suffix clipping
model is by far the lowest, indicating that in the future we should consider ways of
improving this word model. One possibility is to incorporate phonemic knowledge. For
example, both friday and friend have the same probability under P (ti |si , suffix clipping)
for the texting form fri, which has the standard form friday in our data. (The language
model, however, does distinguish between these forms.) However, if we consider the
phonemic representations of these words, friday might emerge as more likely. Syllable
structure information may also be useful, as we hypothesize that clippings will tend to be
formed by truncating a word at a syllable boundary. We may similarly be able to improve
our estimate of P (ti |si , subsequence abbreviation). For example, for the texting form txt
both text and taxation have the same probability under this distribution, but intuitively
text, the correct standard form in our data, seems more likely. We could incorporate
knowledge about the likelihood of omitting specific characters, as in Choudhury et al.
(2007), to improve this estimate.
We now examine the lower panel of Table 4.4, in which we consider the performance
of the overall model on the word formation types that are not explicitly modeled. The
very high accuracy on g-clippings indicates that since these forms are also a type of
subsequence abbreviation, we do not need to construct a separate model for them. We
in fact also conducted experiments in which g-clippings and h-clippings were modeled
explicitly, but found these extra models to have little effect on the results.
Recall from Section 4.2.1 our hypothesis that prefix clippings, spelling errors, and
phonetic abbreviations have common properties with formation types that we do model,
and therefore the system will perform reasonably well on them. Here we find preliminary evidence to support this hypothesis as the accuracy on these three word formation
types (combined) is 57.1%. However, we must interpret this result cautiously as it only
considers seven expressions. On the syllabic letter and digit texting forms the accuracy is 50.0%, indicating that our heuristic to replace digits in texting forms with an orthographic representation is reasonable.

The performance on types of expressions that we did not consider when designing the system—unclear and error—is very poor. However, this has little impact on the overall performance as these expressions are rather infrequent.

4.4.2 Results by Model

We now consider in-top-1 accuracy using each model on the 303 test expressions; results are shown in Table 4.5. No model on its own gives results comparable to those of the overall model (59.4%, see Table 4.3). This indicates that the overall model successfully combines information from the specific word formation models.

Model                       % in-top-1 accuracy
Stylistic variation                51.8
Subsequence abbreviation           44.2
Suffix clipping                    10.6

Table 4.5: % in-top-1 accuracy on the 303 test expressions using each model individually.
Each model used on its own gives an accuracy greater than the proportion of expressions of the word formation type for which the model was designed (compare accuracies
in Table 4.5 to the number of expressions of the corresponding word formation type in
the test data in Table 4.4). As we note in Section 4.1, the distinctions between the word
formation types are not sharp; these results show that the shared properties of word
formation types enable a model for a specific formation type to infer the standard form
of texting forms of other formation types.
4.4.3 All unseen data
Until now we have discussed results on our test data of 303 texting forms which differ
from their standard forms. We now consider the performance of our system on all 1,213
unseen texting forms, 910 of which are identical to their standard form. Since our model
was not designed with such expressions in mind, we slightly adapt it for this new task; if ti
is in our lexicon, we return that form as si , otherwise we apply our model as usual, using
the uniform estimate of P (wf ). This gives an in-top-1 accuracy of 88.2%, which is very
similar to the results of Choudhury et al. (2007) on this data of 89.1%. Note, however,
that Choudhury et al. only report results on this dataset using a uniform language model;9
since we use a unigram language model, it is difficult to draw firm conclusions about the
performance of our system relative to theirs.

9 Choudhury et al. do use a unigram language model for their experiments on the 303 texting forms which differ from their standard forms (see Section 4.2.3).
4.5 Related Work
Aw et al. (2006) model text message normalization as translation from the texting language into the standard language. Kobus et al. (2008) incorporate ideas from both
machine translation and automatic speech recognition for text message normalization.
However, both of the approaches of Aw et al. and Kobus et al. are supervised, and have
only limited means for normalizing texting forms that do not occur in the training data.
Yvon (2010) combines ideas from machine translation, automatic speech recognition,
and spelling error correction in his system for text message normalization, but again this
system is supervised.
The approach proposed in this chapter, like that of Choudhury et al. (2007), can be
viewed as a noisy-channel model for spelling error correction (e.g., Mays et al., 1991;
Brill and Moore, 2000), in which texting forms are seen as a kind of spelling error.
Furthermore, like our approach to text message normalization, approaches to spelling
correction have incorporated phonemic information (Toutanova and Moore, 2002).
The word model of the supervised approach of Choudhury et al. consists of hidden Markov models, which capture properties of texting language similar to those of
our stylistic variation model. We propose multiple word models—corresponding to frequent texting language formation processes—and an unsupervised method for parameter
estimation.
4.6 Summary of contributions
We analyze a sample of text messaging forms to determine frequent word formation
processes in texting language. This analysis is revealing as to the range of creative
phenomena occurring in text messages.
Based on the above observations, we construct an unsupervised noisy-channel model
for text message normalization. On an unseen test set of 303 texting forms that differ
from their standard form, our model achieves 59% accuracy, which is on par with that
obtained by the supervised approach of Choudhury et al. (2007) on the same data.
Our approach is well-suited to normalization of novel creative texting forms—unlike
previously-proposed supervised approaches to text message normalization—and has the
potential to be applied to other domains, such as microblogging.
Chapter 5
Ameliorations and pejorations
Amelioration and pejoration are common linguistic processes through which the meaning
of a word changes to have a more positive or negative evaluation, respectively, in the mind
of the speaker. Historical examples of amelioration and pejoration include nice, which in
Middle English meant ‘foolish’, and vulgar, originally meaning ‘common’. More recent
examples are sick (now having a sense meaning ‘excellent’, an amelioration), and retarded
(now having a sense meaning ‘of inferior quality’, a pejoration, and often considered
offensive).
Amelioration and pejoration seem to come about in a number of ways, with these
processes taking place at the level of both concepts, and word forms or senses. If a
community’s evaluation of some concept changes, this may then result in amelioration or
pejoration. This seems to be the case with words such as authentic, local, and organic,
particularly when used to describe food, all of which are properties of foods which have
recently become highly valued by certain communities. In these cases the amelioration of
a particular concept (namely, authentic, local, and organic food) results in amelioration
of certain words.
A further possibility for amelioration or pejoration is that a word form acquires a new
sense. Gay (in its adjectival form) is one such example; in this case the primary sense
of this word changed from ‘merry’ to ‘homosexual’, with the ‘merry’ sense being rather
uncommon nowadays. (More recently, gay has also come to be used in a sense—generally
considered offensive—meaning ‘of poor quality’.)
Pejoration in particular may also be caused by contamination through association
with a taboo concept. For example, toilet, as used in the phrase go to the toilet, was
originally a euphemistic borrowing from French (Allan and Burridge, 1991); however,
due to the taboo status of human waste it has lost its euphemistic status. Nowadays,
more euphemistic terms such as bathroom and loo are commonly used in American and
British English, respectively, in this context, and toilet has acquired a somewhat negative
status. In fact, euphemistic terms often become taboo terms—i.e., become pejorated—
due to their association with a taboo subject (Allan and Burridge, 1991). This example
with toilet also illustrates a related issue, namely, the potential lack of correspondence
between the evaluation of a word and the concept to which it refers. Although toilet
and bathroom both refer to the same physical place, toilet is somewhat more negative (in
contemporary Canadian usage). A similar situation is observed amongst near synonyms.
For example, lie and fib both refer to saying something that is not true, although lie is
much more negative than fib.
In the present study we consider amelioration and pejoration at the level of word
forms. Although it is certainly the case that concepts and particular word senses can
undergo amelioration and pejoration, we assume that these changes will be reflected in
the corresponding word forms.
Amelioration and pejoration are processes that change the semantic orientation of
a word, an aspect of lexical semantics that is of great interest nowadays. Much recent
computational work has looked at determining the sentiment or opinion expressed in
some text (see Pang and Lee, 2008, for an overview). A key aspect of many sentiment
analysis systems is a lexicon in which words or senses are annotated with semantic orientation. Such lexicons are often manually-crafted (e.g., the General Inquirer, Stone et al.,
1966). However, it is clearly important to have automatic methods to detect semantic changes that affect a word’s orientation in order to keep such lexicons up-to-date,
whether automatically- or manually-created. Indeed, there have been recent efforts to
automatically infer polarity lexicons from corpora (e.g., Hatzivassiloglou and McKeown,
1997; Turney and Littman, 2003) and from other lexicons (e.g., Esuli and Sebastiani,
2006; Mohammad et al., 2009), and to adapt existing polarity lexicons to specific domains (e.g., Choi and Cardie, 2009). Similarly, since appropriate usage of words depends
on knowledge of their semantic orientation, tools for detecting such changes would be
helpful for lexicographers in updating dictionaries.
Our hypothesis is that methods for automatically inferring polarity lexicons from
corpora can be used for detecting changes in semantic orientation, i.e., ameliorations
and pejorations. If the corpus-based polarity of a word is found to vary significantly
across two corpora which differ with respect to timespan, then that word is likely to
have undergone amelioration or pejoration. Moreover, this approach could be used to
find new word senses by applying it to corpora of recent text. Specifically, we adapt
an existing web-based method for calculating polarity (Turney and Littman, 2003) to
work on smaller corpora (since our corpora will be restricted by timespan), and apply
the method to words in the two corpora of interest.
5.1 Determining semantic orientation
Turney and Littman (2003) present web and corpus-based methods for determining the
semantic orientation of a target word. Their methods use either pointwise mutual information (PMI) or latent semantic analysis (Deerwester et al., 1990) to compare a target
word to known words of positive and negative polarity. Here we focus on a variant of
their PMI-based method. In preliminary experiments we find a PMI-based method to
outperform a method using latent semantic analysis, and therefore choose to focus on
PMI-based methods. (We discuss the use of latent semantic analysis further in Section 5.5.3.)
Turney and Littman manually build small sets of known positive and negative seed
words, and then determine the semantic orientation (SO) of a target word t by comparing
its association with the positive and negative seed sets, POS and NEG, respectively.
SO-PMI (t) = PMI (t, POS ) − PMI (t, NEG )    (5.1)
The association between the target and a seed set is then determined as below, where t
is the target, S = s1 , s2 ...sn is a seed set of n words, N is the number of words in the
corpus under consideration, and hits is the number of hits returned by a search engine
for the given query.
PMI (t, S) = log [ P (t, S) / ( P (t) P (S) ) ]    (5.2)
           ≈ log [ N · hits(t NEAR (s1 OR s2 OR ... OR sn )) / ( hits(t) · hits(s1 OR s2 OR ... OR sn ) ) ]    (5.3)
In this study we do not use web data, and therefore do not need to estimate frequencies using the number of hits returned by a search engine. We therefore estimate
PMI (t, S) using frequencies obtained directly from a corpus, as below, where freq(t, s) is
the frequency of t and s co-occurring within a five-word window, and freq(t) and freq(s)
are the frequency of t and s, respectively.
PMI (t, S) ≈ log [ N · Σ_{s∈S} freq(t, s) / ( freq(t) · Σ_{s∈S} freq(s) ) ]    (5.4)
We do not smooth these estimates. In this study, we only calculate the polarity of a word
t if it co-occurs at least five times with seed words—positive or negative—in the corpus
being used. Therefore the frequency of each word t is at least five so the denominator is
never zero. If t does not co-occur with any seed word s ∈ S, the numerator is zero, in which
case we simply set PMI (t, S) to a very low number (−∞). In this case t co-occurs at least
five times with the opposite seed set, and the resulting polarity is then the maximum
positive or negative polarity (∞ or −∞, respectively).
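The computation can be sketched as follows. The co-occurrence and frequency functions are assumed to be supplied by the caller (for example, counts within a five-word window over the corpus of interest); they are placeholders rather than an implementation described in this chapter.

import math

def so_pmi(target, pos_seeds, neg_seeds, cooc, freq, n_words, min_cooc=5):
    """Corpus-based SO-PMI as in equations (5.1) and (5.4).  cooc(t, s) is
    assumed to return the number of times t and s co-occur within a five-word
    window, freq(w) the corpus frequency of w, and n_words the corpus size.
    Returns None if the target co-occurs with seed words fewer than min_cooc
    times, mirroring the cut-off used in this study."""
    def pmi(t, seeds):
        joint = sum(cooc(t, s) for s in seeds)
        if joint == 0:
            return float("-inf")
        seed_mass = sum(freq(s) for s in seeds)
        return math.log(n_words * joint / (freq(t) * seed_mass))

    total = sum(cooc(target, s) for s in list(pos_seeds) + list(neg_seeds))
    if total < min_cooc:
        return None
    return pmi(target, pos_seeds) - pmi(target, neg_seeds)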
Turney and Littman focus on experiments using web data, the size of which allows
them to use very small, but reliable, seed sets of just seven words each. (Positive seeds:
good, nice, excellent, positive, fortunate, correct, superior ; negative seeds: bad, nasty,
poor, negative, unfortunate, wrong, inferior.) However, their small seed sets can cause
data sparseness problems when using the corpora of interest to us, which can be rather
small since they are restricted in time period. Therefore, we use the positive and negative
words from the General Inquirer (GI, Stone et al., 1966) as our seeds. Some words in
GI are listed with multiple senses, and the polarity of these senses may differ. To avoid
using seed words with ambiguous polarity, we select as seeds only those words which have
either positive or negative senses, but not both. This gives positive and negative seed sets
of 1621 and 1989 words, respectively, although at the cost of these seed words potentially
being less reliable indicators of polarity than those used by Turney and Littman. Note
that we obtain our seed words from a large manually-created lexicon, whereas Turney
and Littman use a much smaller amount of manual knowledge. This is a reflection of the
differing goals of our studies: Turney and Littman aim to automatically infer a polarity
lexicon similar to GI, whereas our goal is to use such a lexicon in order to identify
ameliorations and pejorations.
Corpus     Time period          Approximate size in millions of words
Lampeter   1640–1740                             1
CLMETEV    1710–1920                            15
BNC        Late 20th century                   100

Table 5.1: Time period and approximate size of each corpus.
5.2 Corpora
In investigating this method for amelioration and pejoration detection, we make use
of three British English corpora from differing time periods: the Lampeter Corpus of
Early Modern English Tracts (Lampeter, Siemund and Claridge, 1997), approximately
one million words of text from 1640–1740 taken from a variety of domains including
religion, politics, and law; the Corpus of Late Modern English Texts Extended Version
(CLMETEV, De Smet, 2005) consisting of fifteen million words of text from 1710–1920
concentrating on formal prose; and the British National Corpus (BNC, Burnard, 2007),
one hundred million words from a variety of primarily written sources from the late 20th
century. The size and time period of these three corpora are summarized in Table 5.1.
We first verify that our adapted version of Turney and Littman’s (2003) SO-PMI can
reliably predict human polarity judgements on these corpora. We calculate the polarity
of each item in GI that co-occurs at least five times with seed words in the corpus under
consideration. We calculate polarity using a leave-one-out methodology in which all items
in GI—except the target expression—are used as seed words. The results are shown in
Table 5.2. For each corpus, all accuracies are substantially higher than the baseline of
always choosing the most frequent class, negative polarity. Moreover, when we focus on
only those items with strong polarity—the top 25% most-polar items—the accuracies are
quite high, close to or over 90% in each case. Note that even with these restrictions on frequency and polarity, many items are still being classified—344 in the case of Lampeter, the smallest corpus. We conclude that using a very large set of potentially noisy seed words is useful for polarity measurement on even relatively small corpora.

                        Percentage most-polar items classified
                     Top 25%                          Top 50%
Corpus      % accuracy   Baseline    N       % accuracy   Baseline    N
Lampeter        88          54      344          84          53      688
CLMETEV         92          61      792          90          59     1584
BNC             94          72      883          93          64     1767

                     Top 75%                          100%
Corpus      % accuracy   Baseline    N       % accuracy   Baseline    N
Lampeter        79          52     1032          74          50     1377
CLMETEV         85          56     2376          80          55     3169
BNC             89          59     2650          82          55     3534

Table 5.2: % accuracy for inferring the polarity of expressions in GI using each corpus. The accuracy for classifying the items with absolute calculated polarity in the top 25% and 50% (top panel) and 75% and 100% (bottom panel) which co-occur at least five times with seed words in the corresponding corpus is shown. In each case, the baseline of always choosing negative polarity and the number of items classified (N) are also shown.

Corpus              % accuracy   Baseline
Lampeter                74          50
CLMETEV sample          73          50
BNC sample              70          47

Table 5.3: % accuracy and baseline using Lampeter and approximately one-million-word samples from CLMETEV and the BNC. The results using CLMETEV and the BNC are averaged over five random one-million-word samples.
The accuracy using Lampeter is substantially lower than that using CLMETEV,
which is in turn lower than that using the BNC. These differences could arise due to
the differences in size between these corpora; it could also be the case that because the
seed words are taken from a polarity lexicon created in the mid-twentieth century, they
are less accurate indicators of polarity in older corpora (Lampeter and CLMETEV) than
in corpora from the same time period (the BNC). To explore this, we randomly extract
approximately one-million-word samples from both CLMETEV and the BNC, to create
corpora from these time periods of approximately the same size as Lampeter. We then
estimate the polarity of the items in GI in the same manner as for the experiments
presented in Table 5.2 for the CLMETEV and BNC samples. We do this for five random
samples for each of CLMETEV and the BNC and average the results over these five
samples. These results are presented in Table 5.3 along with the results using Lampeter
for comparison. Interestingly, the results are quite similar in all three cases. We therefore
conclude that the differences observed between the three (full size) corpora in Table 5.2
are primarily due to the differences in size between these corpora. Furthermore, these
results show that the words from the GI lexicon—created in the mid-twentieth century—
can be effectively used to estimate polarity from corpora from other time periods.
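For reference, the leave-one-out evaluation used in these experiments can be sketched as below; polarity_fn stands for a scoring function such as the SO-PMI computation given earlier (with the corpus statistics bound in), and is assumed to return None when a word falls below the co-occurrence cut-off.

def leave_one_out_accuracy(gi_labels, polarity_fn):
    """Compute the polarity of each GI item with that item excluded from the
    seed sets, and compare the sign of the score with its GI label.
    gi_labels maps words to 'positive' or 'negative'."""
    pos = {w for w, lab in gi_labels.items() if lab == "positive"}
    neg = {w for w, lab in gi_labels.items() if lab == "negative"}
    correct = classified = 0
    for word, label in gi_labels.items():
        score = polarity_fn(word, pos - {word}, neg - {word})
        if score is None:
            continue
        classified += 1
        correct += (("positive" if score > 0 else "negative") == label)
    return 100.0 * correct / classified if classified else 0.0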
One further issue to consider is the lack of standard orthography in Early Modern
English. During this time period many words were spelled inconsistently. Ideally we
would normalize historical spellings to their modern forms to make them consistent with
the spellings in our polarity lexicon. Although we do not do this, the performance on the
Lampeter corpus—particularly when compared against the performance on similar-size
samples of Modern English, as in Table 5.3—shows that diachronic spelling differences
do not pose a serious problem for this task.
5.3 Results

5.3.1 Identifying historical ameliorations and pejorations
We have compiled a small dataset of words known to have undergone amelioration and
pejoration which we use here to evaluate our methods. Some examples are taken from
etymological dictionaries (Room, 1986) and from textbooks discussing semantic change
(Traugott and Dasher, 2002) and the history of the English language (Brinton and
Arnovick, 2005). We only consider those that are indicated as having undergone amelioration or pejoration in the eighteenth century or later (Room, 1986).1 We also search for
additional test expressions in editions of Shakespearean plays that contain annotation
as to words and phrases that are used differently in the play than they commonly are
now (Shakespeare, 2008a,b). Here we—perhaps naively—assume that the sense used by
Shakespeare was the predominant sense at the time the play was written, and consider
these as expressions whose predominant sense has undergone semantic change. The expressions taken from Shakespeare are restricted to words whose senses as used in the plays are recorded in the Oxford English Dictionary (OED).2 These expressions are further limited to those that two native English-speaking judges—the author of this thesis and the second author of this study—agree are ameliorations or pejorations.

1 Note that historical dictionaries, such as the Oxford English Dictionary (OED Online. Oxford University Press. http://dictionary.oed.com), do not appear to be appropriate for establishing the approximate date at which a word sense has become common because they give the date of the earliest known usage of a word sense, which could be much earlier than the widespread use of that sense. The etymological dictionary we use (Room, 1986) attempts to give the date at which the use of a particular sense became common.

2 OED Online. Oxford University Press. http://dictionary.oed.com

             Change identified      Polarity in corpora
Expression   from resources        Lampeter    CLMETEV     Change in polarity
ambition     amelioration           −0.76       −0.24            0.52
eager        amelioration           −1.09       −0.12            0.97
fond         amelioration            0.14        0.21            0.07
luxury       amelioration           −0.93        0.55            1.49
nice         amelioration           −2.48        0.36            2.84
*succeed     amelioration            0.81        0.06           −0.75
artful       pejoration              1.33       −0.38           −1.71
plainness    pejoration              1.65        1.04           −0.61

Table 5.4: The polarity in each corpus and change in polarity for each historical example of amelioration and pejoration. Note that succeed does not exhibit the expected change in polarity.
For all the identified test expressions, we assume that their original meaning will be
the predominant sense in Lampeter, while their ameliorated or pejorated sense will be
dominant in CLMETEV. After removing expressions with frequency five or less in either
Lampeter or CLMETEV, eight test items remain—six items judged as ameliorations and
two as pejorations. The results of applying our method for amelioration and pejoration
identification are shown in Table 5.4. Note that for seven out of eight expressions, the
calculated change in polarity is as expected from the lexical resources; the one exception is
succeed. The calculated polarity is significantly higher for the corpus which is expected to
have higher polarity (CLMETEV in the case of ameliorations, Lampeter for pejorations)
than for the other corpus using a one-tailed paired t-test (p = 0.024).
In this evaluation we used the corpora that we judged to best correspond to the time
periods immediately before and after the predominant sense of the test expressions had
undergone change. Nevertheless, for some of the test expressions, it could be that the
ameliorated or pejorated sense was more common during the time period of the BNC
than that of CLMETEV. However, conducting the same evaluation using the BNC as
opposed to CLMETEV in fact gives very similar results.
5.3.2 Artificial ameliorations and pejorations
We would like to determine whether our method is able to identify known ameliorations
and pejorations; however, as discussed in the previous section, the number of expressions
in our historical dataset thus far is small. We can nevertheless evaluate our method
on artificially created examples of amelioration and pejoration. One possibility for constructing such examples is to assume that the usages of words of opposite polarity in two
different corpora are in fact usages of the same word. For example, we could assume that
excellent in Lampeter and poor in CLMETEV are in fact the same word. This would
then be (an artificial example of) a word which has undergone pejoration. If our method
assigns lower polarity to poor in CLMETEV than to excellent in Lampeter, then it has
successfully identified this “pejoration”. This type-based approach to creating artificial
data is inspired by word sense disambiguation evaluations in which the token instances
of two distinct words are used to represent two senses of the same word (e.g., Schütze,
1992).
Selecting appropriate pairs of words to compare in such an evaluation poses numerous
difficulties. For example, it seems that strongly polar words with opposite polarity (e.g.,
excellent and poor ) would not be a realistic approximation to amelioration or pejoration.
(The degree of change in polarity in real examples of amelioration and pejoration varies,
and can be less drastic than that between excellent and poor.) Nevertheless, it is unclear
how to choose words to construct more plausible artificial examples. Therefore, given the
number of available items, we average the polarity of all the positive/negative expressions
in a given corpus with frequency greater than five and which co-occur at least once with
both positive and negative seed words. (We introduce the additional restriction—co-occurrence at least once with both positive and negative seed words—because expressions not meeting this condition have a polarity of either ∞ or −∞.) These results are shown in Table 5.5.

                                        Lampeter   CLMETEV    BNC
Average polarity of positive seeds        0.58       0.50     0.40
Average polarity of negative seeds       −0.74      −0.67    −0.76

Table 5.5: Average polarity of positive and negative words from GI in each corpus with frequency greater than five and which co-occur at least once with both positive and negative seed words in the indicated corpus.

For each corpus, the positive GI words have higher average polarity than the
negative GI words in all other corpora. (All differences are strongly significant in unpaired
t-tests: p ≪ 10^−5.)
pejoration, and estimate the polarity of this artificial example using any two of our
three corpora, the expected polarity of the positive senses of that artificial example
is higher than the expected polarity of the negative senses. This suggests that our
method can detect strong differences in polarity across corpora. However, as previously
mentioned, such strong changes in polarity—as represented by the average polarity of the
positive and negative GI expressions—may not be representative of typical ameliorations
or pejorations, which may exhibit more subtle changes in meaning and polarity. A further
limitation of this experiment is that the average polarity values calculated from the two
corpora could be influenced by outliers. In particular, a small number of strongly positive
or negative words could have a large influence on the average polarity. It could then be
the case that the polarities of arbitrarily chosen positive and negative words may not in
fact be expected to be different. Despite these limitations, these results do suggest that
our method is able to identify ameliorations and pejorations under idealized conditions,
and is worthy of further consideration.
5.3.3 Hunting for ameliorations and pejorations
Since we suggest our method as a way to discover potential new word senses that are
ameliorations and pejorations, we test this directly by comparing the calculated polarity
of words in a recent corpus, the BNC, to those in an immediately preceding time period,
CLMETEV. We consider the words with the largest increase and decrease in polarity
between the two corpora as candidate ameliorations and pejorations, respectively, and
then have human judges consider usages of these words to determine whether they are
in fact ameliorations and pejorations.
The expressions with the ten largest increases and decreases in polarity from CLMETEV to the BNC (restricted to expressions with frequency greater than five in each
corpus) are presented in Tables 5.6 and 5.7, respectively. Expressions with an increase
in polarity from CLMETEV to the BNC (Table 5.6) are candidate ameliorations, while
expressions with a decrease from CLMETEV to the BNC (Table 5.7) are candidate pejorations. We extract ten random usages of each expression—or all usages if the word has
frequency lower than ten—from each corpus, and then pair each usage from CLMETEV
with a usage from the BNC. This gives ten pairs of usages (or as many as are available)
for each expression, resulting in 190 total pairs.
We use Amazon Mechanical Turk (AMT, https://www.mturk.com/) to obtain judgements for each pair of usages. For each pair, a human judge is asked to decide whether
the usage from CLMETEV or the BNC is more positive/less negative, or whether the
two usages are equally positive/negative. A sample of this AMT polarity judgement task is presented in Table 5.8. We solicit responses from ten judges for each pair of usages, and pay $0.05 per judgement.

                Proportion of judgements for corpus of more positive usage
Expression           CLMETEV     BNC     Neither
bequeath               0.25      0.28      0.47
coerce                 0.38      0.20      0.42
costliness             0.41      0.24      0.35
disputable             0.30      0.43      0.27
empower                0.30      0.29      0.40
foreboding             0.19      0.39      0.42
hysteria               0.26      0.39      0.35
slothful               0.24      0.44      0.31
thoughtfulness         0.21      0.50      0.29
verification           0.27      0.27      0.46
Average                0.28      0.34      0.37

Table 5.6: Expressions with top 10 increase in polarity from CLMETEV to the BNC (candidate ameliorations). For each expression, the proportion of human judgements for each category is shown: CLMETEV usage is more positive/less negative (CLMETEV), BNC usage is more positive/less negative (BNC), neither usage is more positive or negative (Neither). Majority judgements are shown in boldface, as are correct candidate ameliorations according to the majority responses of the judges.

                Proportion of judgements for corpus of more positive usage
Expression           CLMETEV     BNC     Neither
adornment              0.43      0.27      0.33
disavow                0.37      0.22      0.41
dynamic                0.43      0.27      0.30
elaboration            0.26      0.38      0.36
fluent                 0.25      0.34      0.41
gladden                0.39      0.12      0.49
outrun                 0.30      0.38      0.31
skillful               0.43      0.27      0.29
synthesis              0.41      0.19      0.40
wane                   0.33      0.34      0.33
Average                0.36      0.27      0.36

Table 5.7: Expressions with top 10 decrease in polarity from CLMETEV to the BNC (candidate pejorations). For each expression, the proportion of human judgements for each category is shown: CLMETEV usage is more positive/less negative (CLMETEV), the BNC usage is more positive/less negative (BNC), neither usage is more positive or negative (Neither). Majority judgements are shown in boldface, as are correct candidate pejorations according to the majority responses of the judges.

Instructions:
• Read the two usages of the word disavow below.
• Based on your interpretation of those usages, select the best answer.
A: in a still more obscure passage he now desires to DISAVOW the circular or aristocratic tendencies with which some critics have naturally credited him .
B: the article went on to DISAVOW the use of violent methods :
• disavow is used in a more positive, or less negative, sense in A than B.
• disavow is used in a more negative, or less positive, sense in A than B.
• disavow is used in an equally positive or negative sense in A and B.
Enter any feedback you have about this HIT. We greatly appreciate you taking the time to do so.

Table 5.8: A sample of the Amazon Mechanical Turk polarity judgement task.
The judgements obtained from AMT are shown in Tables 5.6 and 5.7. For each
candidate amelioration or pejoration the proportion of responses that the usage from
CLMETEV, the BNC, or neither is more positive/less negative is shown. For each expression, the majority response is indicated in boldface. In the case of both candidate
ameliorations and pejorations, four out of ten items are correct according to the AMT
judgements; these expressions are also shown in boldface. Taking the AMT judgements
as a gold-standard, this corresponds to a precision of 40%. (We cannot calculate recall
because this would require manually identifying all of the ameliorations and pejorations
between the two corpora.) We also consider the average proportion of responses for
each category (CLMETEV usage is more positive/less negative, BNC usage is more
positive/less negative, neither usage is more positive or negative) for the candidate ameliorations and pejorations (shown in the last row of Tables 5.6 and 5.7, respectively).
Here we note that for candidate ameliorations the average proportion of responses that
the BNC usage is more positive is higher than the average proportion of responses that
the CLMETEV usage is more positive, and vice versa for candidate pejorations. This
is an encouraging result, but in one-tailed paired t-tests it is not found to be significant
for candidate ameliorations (p = 0.12), although it is marginally significant for candidate
pejorations (p = 0.05).
We also consider an evaluation methodology in which we ignore the judgements for
usage pairs for which the judgements are roughly uniformly distributed across the three
categories. For each usage pair, if the proportion of judgements of the most frequent
judgement is greater than 0.5 then this pair is assigned the category of the most frequent
judgement, otherwise we ignore the judgements for this pair. We then count these resulting judgements for each candidate amelioration and pejoration. In this alternative
evaluation, the overall results are quite similar to those presented in Tables 5.6 and 5.7.
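For concreteness, the filtering step in this alternative evaluation could be implemented roughly as follows; the category labels are those used in Tables 5.6 and 5.7, and the threshold of 0.5 is the one described above.

from collections import Counter

def filtered_majority(judgements, threshold=0.5):
    """Return the most frequent judgement for a usage pair if its proportion
    exceeds the threshold, otherwise None (the pair is ignored).  judgements
    is a list of labels such as 'CLMETEV', 'BNC', or 'Neither'."""
    label, count = Counter(judgements).most_common(1)[0]
    return label if count / len(judgements) > threshold else None

print(filtered_majority(["BNC"] * 6 + ["CLMETEV"] * 2 + ["Neither"] * 2))  # 'BNC'
print(filtered_majority(["BNC"] * 4 + ["CLMETEV"] * 3 + ["Neither"] * 3))  # None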
These results are not very strong from the perspective of a fully-automated system
for identifying ameliorations and pejorations; in the case of both candidate ameliorations
and pejorations only four of the ten items are judged as correct by humans. Nevertheless,
these results do indicate that this approach could be useful as a semi-automated tool to
help in the identification of new senses, particularly since the methods are inexpensive
to apply.
5.4 Amelioration or pejoration of the seeds
Our method for identifying ameliorations and pejorations relies on knowing the polarity of
a large number of seed words. However, a seed word itself may also undergo amelioration
or pejoration, and therefore its polarity may in fact differ from what we assume it to be in the seed sets, producing a noisy set of seed words. Here we explore the extent
to which noisy seed words—i.e., seed words labelled with incorrect polarity—affect the
performance of our method. We begin by randomly selecting n% of the positive seed
words, and n% of the negative seed words, and swapping these items in the seed sets. We
then conduct a leave-one-out experiment, using the same methodology as in Section 5.2,
in which we use the noisy seed words to calculate the polarity of all items in the GI lexicon
which co-occur at least five times with seed words in the corpus under consideration.
We consider each n in {5, 10, 15, 20}, and repeat each experiment five times, randomly
selecting the seed words whose polarity is changed in each trial. The average accuracy
over the five trials is shown in Figure 5.1.

[Figure 5.1: Average % accuracy for inferring the polarity of the items in GI for each corpus (Lampeter, CLMETEV, BNC) as the percentage of noisy seed words is varied from 0 to 20; the y-axis shows % accuracy.]
We observe a similar trend for all three corpora: the average accuracy decreases as the
percentage of noisy seed words increases. However, with a small amount of noise in the
seed sets, 5%, the reduction in absolute average accuracy is small, only 1–2 percentage
points, for each corpus. Furthermore, when the percentage of noisy seed words is increased to 20%, the absolute average accuracy is lowered by only 5–7 percentage points.
We conclude that by aggregating information from many seed words, our method for
determining semantic orientation is robust against a small amount of noise in the seed
sets.
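A sketch of the seed-swapping procedure used in these trials; the function name and interface are illustrative.

import random

def make_noisy_seeds(pos_seeds, neg_seeds, percent, rng=random):
    """Randomly relabel percent% of the positive seeds as negative and
    percent% of the negative seeds as positive."""
    pos, neg = list(pos_seeds), list(neg_seeds)
    to_neg = set(rng.sample(pos, int(len(pos) * percent / 100)))
    to_pos = set(rng.sample(neg, int(len(neg) * percent / 100)))
    noisy_pos = [w for w in pos if w not in to_neg] + sorted(to_pos)
    noisy_neg = [w for w in neg if w not in to_pos] + sorted(to_neg)
    return noisy_pos, noisy_neg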
5.5 More on determining semantic orientation
In this section we consider some alternative methods for determining semantic orientation
from a corpus to show that our chosen method for this task performs comparably to, or
better than, other proposed methods. Throughout this section we consider results using
the Lampeter corpus, our smallest corpus, because we are primarily interested in methods
that work well on small corpora.
5.5.1 Combining information from the seed words
In equation 5.2 (page 95) we present Turney and Littman’s (2003) “disjunction” method
for SO-PMI, so-called because it is estimated through web queries involving disjunction
(see equation 5.3, page 95). Turney and Littman also present a variant of SO-PMI
referred to as “product” in which the association between a target t and seed set S is
calculated as follows:
SO-PMI (t, S) = Σ_{s∈S} PMI (t, s)    (5.5)
In this variant the association is calculated between t and each seed word s. This is a
summation of logarithms which can be expressed as a logarithm of products, giving rise
to the name “product”.
We consider the product variant in preliminary experiments. We note that the raw
frequency of co-occurrence between many individual seed words and the target is zero.
Chapter 5. Ameliorations and pejorations
112
Following Turney and Littman we smooth frequencies using Laplace smoothing. We
consider a range of smoothing factors, but find this method to perform poorly in all
cases. The smoothed zero frequencies appear to have a very large effect on the calculated
polarities. We believe this to be the reason for the poor performance of this method.
The disjunction variant that we adopt for our method counts co-occurrence between the
target and any positive/negative seed word. It is unlikely that the target would have
a frequency of co-occurrence of zero with all the positive or negative seed words, and
therefore the method of smoothing used has less impact on the calculated polarities.
(Indeed, for our experiments it was not necessary to smooth the frequencies.) We believe
this is why the disjunction variant performs better in our experiments.
It is also worth noting that Turney and Littman find the product and disjunction
variants to perform similarly on their smallest corpus (which at approximately ten million words is much larger than our smallest corpus, Lampeter). The disjunction variant
is more efficient to calculate than the product variant in Turney and Littman’s experimental setup because it requires issuing fewer search engine queries. Turney and Littman
therefore argue that for small corpora disjunction is more appropriate. However, in our
experimental setup the two approaches require equal computational cost, so this reason
for choosing the disjunction variant does not apply.
5.5.2 Number of seed words
In our method for inferring semantic orientation we assume that it is necessary to use
a large number of seed words to compensate for the relatively small size of our corpora.
Here we support this assumption by considering the results of using a smaller number of
seed words.
We compare the accuracy for inferring the polarity of items in GI using the fourteen
seed words from Turney and Littman (2003) with the accuracy using the words from GI
as seed words (3610 seeds). Because of our frequency restriction that items must occur
five times with seed words, fewer items will be classified when using the Turney and
Littman (TL) seeds than when using the GI seeds. In order to compare the trade-off
between accuracy and number of items classified, we adapt the ideas of precision and
recall. Based on the confusion matrix in Table 5.9 representing the output of our task,
we define adapted versions of precision and recall as follows:
Precision = (a + d) / (a + b + c + d)    (5.6)

Recall = (a + d) / (a + b + c + d + e + f)    (5.7)

                          Real polarity
Assigned polarity     Positive    Negative
Positive                  a           b
Negative                  c           d
Unassigned                e           f

Table 5.9: Confusion matrix representing the results of our classification task used to define our adapted versions of precision and recall (given in equations 5.6 and 5.7).
Note that our definition of precision is in fact the same as what we have been referring
to as accuracy up to now in this chapter, because accuracy is calculated over only those
items that are classified.

Seed words    Num. seeds     P       R       F     Num. items classified
TL                14        0.64    0.02    0.04            87
GI              3610        0.74    0.28    0.41          1377

Table 5.10: Results for classifying items in GI in terms of our adapted versions of precision (P), recall (R), and F-measure (F) using the TL seeds and GI seeds. The number of seed words for each of TL and GI is given, along with the number of items that are classified using these seed words.
Table 5.10 shows results for classifying the items in GI using the TL seeds and GI
seeds. When using the GI seeds many more items meet the frequency cutoff and are
therefore classified. The precision using the GI seeds is somewhat higher than that
obtained using the TL seeds; however, the recall is much higher. In terms of F-measure,
the GI seeds give performance an order of magnitude better than the TL seeds.
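In terms of the counts in Table 5.9, the adapted measures can be computed as follows; the F-measure is assumed here to be the usual harmonic mean of the adapted precision and recall, which the chapter does not define explicitly.

def adapted_precision_recall_f(a, b, c, d, e, f):
    """Adapted precision and recall (equations 5.6 and 5.7) from the
    confusion-matrix counts of Table 5.9, where a and d are correctly
    classified items, b and c are misclassified items, and e and f are
    items left unclassified by the frequency cut-off."""
    precision = (a + d) / (a + b + c + d)
    recall = (a + d) / (a + b + c + d + e + f)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure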
Table 5.10 presents two very different possibilities for the number of seed words used; we now consider varying the number of seed words further. In contrast to GI, which assigns a binary polarity to each word, the polarity lexicon provided by Brooke et al. (2009) gives a strength of polarity ranging between −5 and +5 for each of 5469 lexical items, with an absolute value of 5 being the strongest polarity. Some words occur in Brooke et al.'s lexicon with multiple parts-of-speech and differing polarities corresponding to each part-of-speech. We ignore any word which has senses with both positive and negative polarity. For a word with multiple positive or negative senses, we then assign that word the polarity of its sense with the lowest absolute value. This results in a polarity lexicon which assigns a unique polarity to each word. For each value i from 1 to 5 we infer the polarity of items in GI using as seed words the words with polarity greater than or equal to i in this modified version of Brooke et al.'s polarity lexicon. The results are shown in Table 5.11. For these experiments, because the number of seed words used is much greater than in the previous experiments using the TL seeds, we consider the accuracy of only the 25% most-polar items, in order to focus on high-confidence items. We again consider results in terms of our adapted versions of precision, recall, and F-measure. The precision/accuracy values using seed words with a polarity of 2 or greater and of 3 or greater are slightly higher than the result on Lampeter using the GI lexicon (88%, Table 5.2, page 98). This indicates that it may be possible to improve the precision of our methods by
Polarity of seeds    Num. seeds    P       R       F       Num. items classified
≥ 1                  5126          0.87    0.08    0.14    324
≥ 2                  3412          0.89    0.06    0.11    248
≥ 3                  1694          0.90    0.04    0.08    160
≥ 4                  616           0.82    0.02    0.03    71
= 5                  201           0.93    0.00    0.00    14

Table 5.11: Precision (P), recall (R), F-measure (F), and number of items classified for the top 25% most-polar items in GI. Polarity is calculated using the items from Brooke et al.'s (2009) polarity lexicon with polarity greater than or equal to the indicated level as seed words; the total number of seed words is also given.
carefully selecting an appropriate number of seed words with strong polarity, although based on these findings we do not expect the improvement to be very large. Note that as the strength of polarity of the seed words is increased, and the number of seed words is correspondingly decreased, the recall and F-measure also decrease, because fewer items meet the frequency cutoff and are classified. Although there appears to be a gain in precision for seed words with polarity equal to 5, very few items are classified, resulting in very low recall and F-measure. Furthermore, given that only 14 items are classified in this case, we must interpret the observed increase in precision cautiously.
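The conversion of Brooke et al.'s lexicon into a set of seed words, as described above, can be sketched as follows. The input format is assumed to be an iterable of (word, part-of-speech, polarity) triples, and polarity thresholds are interpreted as thresholds on absolute polarity; both are illustrative assumptions rather than details of the released lexicon.

```python
def collapse_polarity_lexicon(entries):
    """Collapse a strength-of-polarity lexicon into one unique polarity per word.

    `entries` is assumed to be an iterable of (word, part_of_speech, polarity)
    triples with polarity in -5..+5.
    """
    by_word = {}
    for word, _pos, polarity in entries:
        by_word.setdefault(word, []).append(polarity)

    collapsed = {}
    for word, polarities in by_word.items():
        # Ignore words that have both positive and negative senses.
        if any(p > 0 for p in polarities) and any(p < 0 for p in polarities):
            continue
        # Otherwise keep the sense with the lowest absolute value of polarity.
        collapsed[word] = min(polarities, key=abs)
    return collapsed

def seeds_at_level(collapsed, i):
    """Positive and negative seed sets with absolute polarity >= i."""
    pos = {w for w, p in collapsed.items() if p >= i}
    neg = {w for w, p in collapsed.items() if p <= -i}
    return pos, neg
```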
5.5.3 Latent Semantic Analysis
In addition to their PMI-based method for determining semantic orientation, Turney and
Littman (2003) present a method for semantic orientation drawing on latent semantic
analysis (LSA, Deerwester et al., 1990):
\[ \text{SO-LSA}(t) = \sum_{p \in \mathit{POS}} \mathrm{LSA}(t, p) - \sum_{n \in \mathit{NEG}} \mathrm{LSA}(t, n) \tag{5.8} \]
According to SO-LSA, the semantic orientation of a word is the difference between its similarity (computed using LSA) to known positive seed words and its similarity to known negative seed words. The key difference between SO-LSA and SO-PMI is that SO-LSA assumes that words with similar polarity will tend to occur in similar contexts, while SO-PMI assumes that words with similar polarity will tend to co-occur.
Turney and Littman’s findings suggest that for small corpora, SO-LSA may be more
appropriate than SO-PMI. This approach therefore seems promising given our interest in
small corpora. Turney and Littman use an equal number of positive and negative seeds,
whereas in our study the size of the seed sets differ. We account for this by dividing the
association with the positive and negative seeds by the size of the positive and negative
seed sets, respectively, as below:
\[ \text{SO-LSA}(t) = \frac{\sum_{p \in \mathit{POS}} \mathrm{LSA}(t, p)}{|\mathit{POS}|} - \frac{\sum_{n \in \mathit{NEG}} \mathrm{LSA}(t, n)}{|\mathit{NEG}|} \tag{5.9} \]
The Lampeter corpus consists of 122 documents averaging approximately 9000 words
each. It is unlikely that such a large context will be useful for capturing information
related to polarity. Therefore, whereas Turney and Littman construct a term–document
matrix, we construct a term–sentence matrix. (The documents used by Turney and
Littman are rather short, with an average length of approximately 260 words.) In these
experiments we use the words from GI as seeds, and restrict our evaluation to words with
frequency greater than 5 in Lampeter. We consider a range of values of k (the dimensionality of the resulting term vectors); however, in no case are the results substantially
above the baseline. It may be the case that in our experimental setup the term–sentence
matrix is too sparse to effectively capture polarity. In the future we intend to consider
a term–paragraph matrix to address this. (We have not yet done this experiment as we
found some inconsistencies in the paragraph mark-up of Lampeter which we must resolve
before doing so.)
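As a concrete illustration of this setup, the sketch below builds a term–sentence matrix, reduces it to k dimensions with truncated SVD, and computes the normalized SO-LSA score of Equation 5.9 using cosine similarity as the LSA association measure. The use of scikit-learn, the choice of cosine similarity, and all parameter values are assumptions for illustration, not a record of the actual experimental code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def so_lsa_scores(sentences, pos_seeds, neg_seeds, k=100):
    """Normalized SO-LSA (Equation 5.9) over a term-sentence matrix.

    `sentences` is a list of sentence strings; `pos_seeds` and `neg_seeds`
    are sets of seed words. Returns a dict mapping each vocabulary word to
    its SO-LSA score. Note that k must be smaller than the number of
    sentences (and of terms) for the truncated SVD to be defined.
    """
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)      # (n_sentences, n_terms)

    # LSA term vectors: apply truncated SVD to the transposed matrix so that
    # each term is represented by a k-dimensional vector.
    svd = TruncatedSVD(n_components=k)
    term_vectors = svd.fit_transform(X.T)        # (n_terms, k)

    vocab = vectorizer.get_feature_names_out()
    index = {w: i for i, w in enumerate(vocab)}
    sims = cosine_similarity(term_vectors)       # term-term similarity matrix

    pos_idx = [index[w] for w in pos_seeds if w in index]
    neg_idx = [index[w] for w in neg_seeds if w in index]

    scores = {}
    for w, i in index.items():
        pos_assoc = sims[i, pos_idx].mean() if pos_idx else 0.0
        neg_assoc = sims[i, neg_idx].mean() if neg_idx else 0.0
        scores[w] = pos_assoc - neg_assoc        # division by |POS| and |NEG|
    return scores
```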
5.6 Summary of contributions
This is the first computational study of amelioration and pejoration. We adapt an established corpus-based method for identifying polarity for use on small corpora, and show
that it performs reasonably well. We then apply this method for determining polarity to
corpora of texts from differing time periods to identify ameliorations and pejorations. In
evaluations on a small dataset of historical ameliorations and pejorations, and artificial
examples of amelioration and pejoration, we show that our proposed method successfully
captures diachronic differences in polarity. We also apply this method to find words
which have undergone amelioration and pejoration in corpora. The results of this experiment indicate that although our proposed method does not perform well on the task of
fully-automatic identification of new senses, it may be useful as a semi-automated tool
for finding new word senses.
Chapter 6
Conclusions
This chapter summarizes the contributions this thesis has made and then describes a
number of directions for future work.
6.1 Summary of contributions
The hypothesis of this thesis is that knowledge about word formation processes and
types of semantic change can improve the automatic acquisition of aspects of the syntax
and semantics of neologisms. Evidence supporting this hypothesis has been found in
the studies of lexical blends, text messaging forms, and ameliorations and pejorations
presented in this thesis. In the study of lexical blends in Chapter 3, knowledge of the
ways in which blends are formed and how people interpret them is exploited in a method
for automatically inferring the source words of novel blends, an important step in the
inference of the syntax and semantics of a blend. Moreover, the same information is used
to distinguish blends from other types of neologisms. In Chapter 4, the common ways in which text messaging forms are created from their standard forms are exploited for the task of text message normalization. By considering word formation types in texting
language, we are able to develop an unsupervised model for text message normalization
that performs as well as a supervised approach. In Chapter 5 we consider the use of
knowledge of types of semantic change to identify new word senses. Specifically, by
drawing on knowledge of amelioration and pejoration—two types of semantic change—
we are able to identify words that have undergone these processes.
Computational work to date on lexical acquisition has concentrated on the context
in which an unknown word occurs to infer aspects of its lexical entry; typically such
approaches exploit statistical distributional information or expensive manually-crafted
lexical resources (see discussion in Chapter 2). However, in the case of neologisms, neither of these sources of information is necessarily available. New words are expected
to be low frequency due to the recency of their coinage, and therefore distributional
information is not reliable in this case. Approaches to lexical acquisition that rely heavily on lexical resources are typically limited to a particular domain; new words occur
throughout a language, and therefore such approaches are not generally applicable to
neologisms. This thesis finds evidence to support the hypothesis that knowledge of word
formation processes and types of semantic change can be exploited for lexical acquisition
of neologisms, where these other knowledge sources cannot be relied upon. This thesis
sets the stage for further research into lexical acquisition that considers word formation
processes and types of semantic change and what can be inferred from this information.
Moreover, since these methods are particularly well-suited to new words, this thesis will
encourage further research on neologisms, which have not been extensively considered in
computational linguistics.
Chapter 3 presents the first computational study of lexical blends. This frequent
new word type had been previously ignored in computational linguistics. We present
a statistical method for inferring the source words of a blend—an important first step
in the semantic interpretation of a blend—that draws on linguistic observations about
blends and cognitive factors that may play a role in their interpretation. The proposed
method achieves an accuracy of 40% on a test set of 324 novel unseen blends. We also
present preliminary methods for identifying an unknown word as a blend.
In our study of blends we find strikingly different results when our methods are
applied to newly-coined blends versus established blends found in a dictionary. This
finding emphasizes the importance of testing methods for processing neologisms on truly
new expressions. We annotate a set of 1,186 recently-coined expressions (including the
324 blends used in evaluation discussed above) for their word formation type, which will
support future research on neologisms.
We describe the first unsupervised approach to text messaging normalization in Chapter 4; normalization is an important step that must be taken before other NLP tasks, such
as machine translation, can be done. By considering common word formation processes
in text messaging, we are able to develop an unsupervised method which gives performance on par with that of a supervised system on the same dataset. Moreover, since our
approach is unsupervised, it can be adapted to other media without the cost of developing a manually-annotated training resource. This is particularly important given that
non-standard forms similar to those found in text messaging are common in other popular
forms of computer-mediated communication, such as Twitter (http://twitter.com).
Chapter 5 describes the first computational work focusing on the processes of amelioration and pejoration. In this work we adapt an established corpus-based method for
inferring polarity to the task of identifying ameliorations and pejorations. We propose
an unsupervised method for this task and show that our proposed method is able to
successfully identify historical ameliorations and pejorations, as well as artificial examples of amelioration and pejoration. We also apply this method to find words which
have undergone amelioration and pejoration in recent corpora. In addition to being the
first computational work on amelioration and pejoration, this study is one of only a
small number of computational studies of diachronic semantic change, an exciting new
interdisciplinary research direction.
6.2 Future directions
In this section we discuss a number of future directions related to each of the three studies
presented in Chapters 3–5.
6.2.1 Lexical blends
There are a number of ways in which the model for source word identification presented
in Section 3.1 could potentially be improved. The present results using the modified
perceptron algorithm are not an improvement over the rather unsophisticated feature
ranking approach. As discussed in Section 3.4.2.3, machine learning methods that do not
assume that the training data is linearly separable, such as that described by Joachims
(2002), may give an improvement over our current methods.
The features used in the proposed model do not take into account the context in which a given blend occurs. However, blends often occur with their source words nearby in a text, although, unlike acronyms (discussed in Section 2.1.3.1), there do not appear to be such clear textual indicators of the relationship between the words (e.g., a long form followed by its acronym in parentheses). Nevertheless, the words that occur in a window around a blend may be very informative as to the blend's source words. A simple feature capturing whether a candidate source word is used within a window of words around a usage of a blend may therefore be very powerful. Note that such contextual information could be exploited even if just one instance of a given blend is observed.
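A minimal sketch of such a contextual feature is given below. The function name, the window size, and the exact token-level matching are illustrative assumptions; this is not a feature of the model presented in Section 3.1, but the kind of feature proposed here.

```python
def in_context_window(blend, candidate, tokens, window=10):
    """Contextual feature for blend source word identification (a sketch).

    Returns True if `candidate` (a candidate source word) occurs within
    `window` tokens of any usage of `blend` in the tokenized text `tokens`.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == blend]
    for i in positions:
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if candidate in context:
            return True
    return False

# Such a Boolean (or count-valued) feature could be added to the feature set
# of the source word identification model described in Section 3.1.
```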
Contextual information could also be used to improve our approach to blend identification (see Section 3.5). If an unknown word is used with two of its candidate source
words occurring nearby in the text, and those candidate source words are likely source
words according to the model presented in Section 3.1, the unknown word may be likely
to be a blend.
In Chapter 3 we focus on source word identification because it is an important first
step in the semantic interpretation of blends. However, once a blend’s source words have
been identified, their semantic relationship must be determined in order to interpret the
expression. Classifying blends as one of a pre-determined number of semantic relationship
types is one way to approach this task.
Algeo (1977) gives a categorization of blends which consists of two broad categories: syntagmatic blends, such as webinar (web seminar), can be viewed as a contraction of two words that occur consecutively in text; associative blends, on the other hand, involve source words that are related in some way. For example, brunch combines breakfast and lunch, which are both types of meal, whereas a chocoholic is addicted to chocolate similarly to the way an alcoholic is addicted to alcohol.
A classification of blend semantics based on the ontological relationship between a
blend and its source words may be particularly useful for computational blend interpretation. Many syntagmatic blends are hyponyms of their second source word (e.g.,
webinar is a type of seminar). Some associative blends are hyponyms of the lowest common subsumer of their source words (e.g., breakfast and lunch are both hyponyms
of meal, as is brunch). However, the ontological relationship between chocolate, alcoholic, and chocoholic is less clear. Moreover, blends such as momic—a mom who is a
(stand up) comic—appear to be hyponyms of both their source words. These expressions
demonstrate that there are clearly many challenges involved in developing a classification
scheme for blends appropriate for their semantic interpretation; this remains a topic for
future work.
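To make the notion of a lowest common subsumer concrete, the following sketch looks it up in WordNet via NLTK. The choice of NLTK, the use of the first listed noun sense of each word, and the example words are all illustrative assumptions; a real system would need to handle sense disambiguation.

```python
from nltk.corpus import wordnet as wn  # requires the WordNet data, e.g. nltk.download('wordnet')

def lowest_common_subsumer(word1, word2):
    """Return the lowest common hypernym of the first noun senses of two words.

    A sketch only: it simply takes the first listed synset of each source word.
    """
    synsets1 = wn.synsets(word1, pos=wn.NOUN)
    synsets2 = wn.synsets(word2, pos=wn.NOUN)
    if not synsets1 or not synsets2:
        return None
    lch = synsets1[0].lowest_common_hypernyms(synsets2[0])
    return lch[0] if lch else None

# For example, lowest_common_subsumer('breakfast', 'lunch') yields a synset
# for 'meal', of which brunch would plausibly be a hyponym.
```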
6.2.2 Text messaging forms
The model for text message normalization presented in Chapter 4 uses a unigram language model. One potential way to improve the accuracy of this system is through the
use of a higher-order n-gram language model. However, Choudhury et al.’s (2007) study
of text message normalization finds that a bigram language model does not outperform
a unigram language model. On the other hand Choudhury et al. estimate their language
model from a balanced corpus of English (the British National Corpus, Burnard, 2007);
estimating this language model from a medium that is more similar to that of text messaging, such as the World Wide Web, may result in higher-order n-gram language models
that do outperform a unigram language model for this task.
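To illustrate where the language model enters such a system, the sketch below scores a normalization candidate under a simple noisy-channel formulation with either a unigram or a bigram language model. This is an illustrative formulation, not the model of Chapter 4: the helper names, the add-alpha smoothing, and the assumption that the counts are stored in collections.Counter objects are all assumptions made for the example.

```python
import math

def unigram_logprob(word, unigram_counts, total, alpha=1.0):
    """Add-alpha smoothed unigram log-probability (illustrative smoothing)."""
    vocab_size = len(unigram_counts)
    return math.log((unigram_counts[word] + alpha) / (total + alpha * vocab_size))

def bigram_logprob(prev, word, bigram_counts, unigram_counts, alpha=1.0):
    """Add-alpha smoothed bigram log-probability P(word | prev)."""
    vocab_size = len(unigram_counts)
    return math.log((bigram_counts[(prev, word)] + alpha) /
                    (unigram_counts[prev] + alpha * vocab_size))

def score_candidate(prev, candidate, channel_logprob,
                    unigram_counts, total, bigram_counts=None):
    """Noisy-channel score of a normalization candidate for one texting form.

    `channel_logprob` is log P(texting form | candidate). If `bigram_counts`
    is supplied, the language model conditions on the previous (already
    normalized) word instead of using a unigram model.
    """
    if bigram_counts is not None and prev is not None:
        lm = bigram_logprob(prev, candidate, bigram_counts, unigram_counts)
    else:
        lm = unigram_logprob(candidate, unigram_counts, total)
    return channel_logprob + lm
```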
We consider unsupervised approaches to text message normalization because we observe that similar abbreviated forms are common in other types of computer-mediated
communication, and want a model which can be easily adapted to such media without
the cost of developing a large manually-annotated resource. Twitter is a microblogging
service which has become very popular recently. Tweets—the messages a user posts
on Twitter—are limited to 140 characters. Perhaps due to this limited space, and also
possibly due to the desire to communicate in an informal register, tweets exhibit many
shortened and non-standard forms similar to text messages. Because Twitter has become so popular, there are many interesting opportunities to apply NLP technology to
this medium. For example, Twitter can be used to determine public opinion on some
product, or find trends in popular topics. However, in order to effectively apply methods
for such NLP tasks, the text must first be normalized. Adapting our proposed method
for text message normalization to Twitter is therefore a direction for future work that
could directly benefit the many applications processing tweets.
6.2.3 Ameliorations and pejorations
There are a number of ways to potentially improve the method for estimating polarity
used in Chapter 5. We intend to consider incorporating syntactic information, such as
the target expression’s part-of-speech, as well as linguistic knowledge about common
patterns that indicate polarity; for example, adjectives co-ordinated by but often have
opposite semantic orientation. Furthermore, although our experiments so far using LSA
to estimate polarity have not found this method to perform better than PMI-based
methods, we intend to further consider LSA. In particular, we intend to re-examine the
context used when constructing the co-occurrence matrix (i.e., sentence, paragraph, or
document).
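One such linguistic pattern can be extracted very simply. The sketch below collects adjective pairs coordinated by but (e.g., "simple but effective") from POS-tagged text; such pairs often have opposite semantic orientation and could supply additional polarity evidence. The input format (lists of (word, tag) pairs with Penn Treebank tags) and the function name are assumptions for illustration; this is not part of the method used in Chapter 5.

```python
def but_coordinated_adjectives(tagged_sentences):
    """Extract adjective pairs coordinated by 'but' from POS-tagged sentences.

    `tagged_sentences` is assumed to be an iterable of sentences, each a list
    of (word, tag) pairs with Penn Treebank tags (adjectives tagged JJ*).
    """
    pairs = []
    for sent in tagged_sentences:
        for i in range(1, len(sent) - 1):
            word, _ = sent[i]
            if word.lower() == 'but':
                (w1, t1), (w2, t2) = sent[i - 1], sent[i + 1]
                if t1.startswith('JJ') and t2.startswith('JJ'):
                    pairs.append((w1.lower(), w2.lower()))
    return pairs
```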
The corpora used in this study, although all consisting of British English, are not comparable, i.e., they were not constructed using the same or similar sampling strategies. It
is possible that any differences in polarity found between these corpora can be attributed
to differences in the composition of the corpora. In future work, we intend to evaluate
our methods on more comparable corpora; for example, the Brown Corpus (Kucera and
Francis, 1967) and Frown Corpus (Hundt et al., 1999)—comparable corpora of American
English from the 1960s and 1990s, respectively—could be used to study changes in polarity between these time periods in American English. We are also excited about applying
our methods to very recent corpora to identify new word senses.
In the present study we have considered amelioration and pejoration only across time.
However, words may have senses of differing polarity which are specific to a particular
speech community. In the future, we intend to apply our methods to comparable corpora
of the same language, but different geographical regions, such as the International Corpus
of English (http://ice-corpora.net/ice/) to identify words with differing semantic
orientation in these varieties of English.
We also intend to consider ways to improve the evaluation of our methods for identifying ameliorations and pejorations. We are working to enlarge our dataset of expressions
known to have undergone amelioration or pejoration in order to conduct a larger-scale
evaluation on historical examples of these processes. We further intend to conduct a
more wide-scale human evaluation of our experiments on hunting for ameliorations and
pejorations. In particular, in our current evaluation, each usage participates in only one
pairing; the exact pairings chosen therefore heavily influence the outcome. In the future
we will include each usage in multiple pairings in order to reduce this effect.
6.2.4 Corpus-based studies of semantic change
The meaning of a word can vary with respect to a variety of sociolinguistic variables,
such as time period, geographical location, sex, age, and socio-economic status. As
noted in Section 2.3, identifying new word senses is a major challenge for lexicography;
identifying unique regional word senses poses similar challenges. Automatic methods for
identifying words that vary in meaning along one of the aforementioned variables would
be very beneficial for lexicography focusing on specific speech communities (defined by
these variables), and also the study of language variation.
The study on automatically identifying ameliorations and pejorations presented in
Chapter 5 is a specific instance of this research problem. One of my long-term research goals is to develop accurate methods for detecting semantic change—including ameliorations and
pejorations, but also more general processes such as widening and narrowing—across
any sociolinguistic variable.
Bibliography
Steven Abney. 1991. Parsing by chunks. In Robert Berwick, Steven Abney, and Carol
Tenny, editors, Principle-Based Parsing: Computation and Psycholinguistics, pages
257–278. Kluwer Academic Publishers, Dordrecht, The Netherlands.
Beatrice Alex. 2006. Integrating language knowledge resources to extend the English
inclusion classifier to a new language. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), pages 2431–2436. Genoa,
Italy.
Beatrice Alex. 2008. Comparing corpus-based to web-based lookup techniques for automatic English inclusion detection. In Proceedings of the Sixth International Conference
on Language Resources and Evaluation (LREC 2008), pages 2693–2697. Marrakech,
Morocco.
John Algeo. 1977. Blends, a structural and systemic view. American Speech, 52(1/2):47–
64.
John Algeo. 1980. Where do all the new words come from? American Speech, 55(4):264–
277.
John Algeo, editor. 1991. Fifty Years Among the New Words. Cambridge University
Press, Cambridge.
John Algeo. 1993. Desuetude among new words. International Journal of Lexicography,
6(4):281–293.
Keith Allan and Kate Burridge. 1991. Euphemism & Dysphemism: Language Used as
Shield and Weapon. Oxford University Press, New York.
Stephen R. Anderson. 1992. A-Morphous Morphology. Cambridge University Press,
Cambridge.
B. T. Sue Atkins and Michael Rundell. 2008. The Oxford Guide to Practical Lexicography.
Oxford University Press, Oxford.
AiTi Aw, Min Zhang, Juan Xiao, and Jian Su. 2006. A phrase-based statistical model for
SMS text normalization. In Proceedings of the COLING/ACL 2006 Main Conference
Poster Sessions, pages 33–40. Sydney, Australia.
John Ayto, editor. 1990. The Longman Register of New Words, volume 2. Longman,
London.
John Ayto. 2006. Movers and Shakers: A Chronology of Words that Shaped our Age.
Oxford University Press, Oxford.
R. Harald Baayen, Richard Piepenbrock, and Leon Gulikers. 1995. The CELEX Lexical
Database (Release 2). Linguistic Data Consortium, Philadelphia.
R. Harald Baayen and Antoinette Renouf. 1996. Chronicling the Times: Productive
lexical innovations in an English newspaper. Language, 72(1):69–96.
Kirk Baker and Chris Brew. 2008. Statistical identification of English loanwords in
Korean using automatically generated training data. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), pages
1159–1163. Marrakech, Morocco.
Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: A case
study on verb-particles. In Proceedings of the Sixth Conference on Computational
Natural Language Learning (CoNLL-2002), pages 98–104. Taipei, Taiwan.
Clarence L. Barnhart. 1978. American lexicography, 1945–1973. American Speech, 53(2):83–140.
Clarence L. Barnhart, Sol Steinmetz, and Robert K. Barnhart. 1973. The Barnhart
Dictionary of New English Since 1963. Barnhart/Harper & Row, Bronxville, NY.
David K. Barnhart. 1985. Prizes and pitfalls of computerized searching for new words
for dictionaries. Dictionaries, 7:253–260.
David K. Barnhart. 2007. A calculus for new words. Dictionaries, 28:132–138.
Susan Bartlett, Grzegorz Kondrak, and Colin Cherry. 2008. Automatic syllabification
with structured SVMs for letter-to-phoneme conversion. In Proceedings of the 46th
Annual Meeting of the Association for Computational Linguistics (ACL-08): Human
Language Technologies, pages 568–576. Columbus, Ohio.
Laurie Bauer. 1983. English Word-formation. Cambridge University Press, Cambridge.
Beata Beigman Klebanov, Eyal Beigman, and Daniel Diermeier. 2009. Discourse topics
and metaphors. In Proceedings of the NAACL HLT 2009 Workshop on Computational
Approaches to Linguistic Creativity (CALC-2009), pages 1–8. Boulder, CO.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation.
Journal of Machine Learning Research, 3:993–1022.
Valerie M. Boulanger. 1997. What Makes a Coinage Successful?: The Factors Influencing
the Adoption of English New Words. Ph.D. thesis, University of Georgia.
Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Corpus version 1.1.
Eric Brill. 1994. Some advances in transformation-based part of speech tagging. In
Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 722–
727. Seattle, Washington.
Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel
spelling correction. In Proceedings of the 38th Annual Meeting of the Association for
Computational Linguistics, pages 286–293. Hong Kong.
Laurel Brinton and Leslie Arnovick, editors. 2005. The English Language: A Linguistic
History. Oxford University Press.
Julian Brooke, Milan Tofilosky, and Maite Taboada. 2009. Cross-linguistic sentiment
analysis: From English to Spanish. In Proceedings of Recent Advances in Natural
Language Processing (RANLP-2009). Borovets, Bulgaria.
Ian Brookes. 2007. New words and corpus frequency. Dictionaries, 28:142–145.
Lou Burnard. 2007. Reference guide for the British National Corpus (XML Edition).
Oxford University Computing Services.
Lyle Campbell. 2004. Historical Linguistics: An Introduction. MIT Press, Cambridge,
MA.
Claire Cardie. 1993. A case-based approach to knowledge acquisition for domain-specific
sentence analysis. In Proceedings of the Eleventh National Conference on Artificial
Intelligence, pages 798–803. Washington, DC.
Yejin Choi and Claire Cardie. 2009. Adapting a polarity lexicon using integer linear
programming for domain-specific sentiment classification. In Proceedings of the 2009
Conference on Empirical Methods in Natural Language Processing (EMNLP-2009),
pages 590–598. Singapore.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and
Anupam Basu. 2007. Investigation and modeling of the structure of texting language.
International Journal of Document Analysis and Recognition, 10(3/4):157–174.
Kenneth W. Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.
Paul Cook and Suzanne Stevenson. 2007. Automagically inferring the source words of
lexical blends. In Proceedings of the Tenth Conference of the Pacific Association for
Computational Linguistics (PACLING-2007), pages 289–297. Melbourne, Australia.
Paul Cook and Suzanne Stevenson. 2009. An unsupervised model for text message normalization. In Proceedings of the NAACL HLT 2009 Workshop on Computational
Approaches to Linguistic Creativity (CALC-2009), pages 71–78. Boulder, CO.
Paul Cook and Suzanne Stevenson. 2010a. Automatically identifying changes in the
semantic orientation of words. In Proceedings of the Seventh International Conference
on Language Resources and Evaluation (LREC 2010), pages 28–34. Valletta, Malta.
Paul Cook and Suzanne Stevenson. 2010b. Automatically identifying the source words
of lexical blends in English. Computational Linguistics, 36(1):129–149.
D. Allan Cruse. 2001. The lexicon. In Mark Aronoff and Janie Rees-Miller, editors, The
Handbook of Linguistics, pages 238–264. Blackwell Publishers Inc., Malden, MA.
Henrik De Smet. 2005. A corpus of Late Modern English texts. International Computer
Archive of Modern and Medieval English, 29:69–82.
Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and
Richard A. Harshman. 1990. Indexing by latent semantic analysis. Journal of the
American Society of Information Science, 41(6):391–407.
Arthur Delbridge, editor. 1981. The Macquarie Dictionary. Macquarie Library, Sydney.
Anne-Marie Di Sciullo and Edwin Williams. 1987. On the Definition of Word. MIT
Press, Cambridge, MA.
Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available
lexical resource for opinion mining. In Proceedings of the Fifth International Conference
on Language Resources and Evaluation (LREC 2006), pages 417–422. Genoa, Italy.
Cédrick Fairon and Sébastien Paumier. 2006. A translated corpus of 30,000 French
SMS. In Proceedings of the Fifth International Conference on Language Resources and
Evaluation (LREC 2006), pages 351–354. Genoa, Italy.
Dan Fass. 1991. met*: A method for discriminating metonymy and metaphor by computer. Computational Linguistics, 17(1):49–90.
Christiane Fellbaum, editor. 1998. Wordnet: An Electronic Lexical Database. MIT press,
Cambridge, MA.
Roswitha Fischer. 1998. Lexical Change in Present Day English: A Corpus-Based Study
of the Motivation, Institutionalization, and Productivity of Creative Neologisms. Gunter
Narr Verlag, Tübingen, Germany.
W. Nelson Francis and Henry Kucera. 1979. Manual of Information to accompany A standard Corpus of Present-Day Edited American English, for use with Digital Computers.
Brown University.
Roxana Girju, Dan Moldovan, Marta Tatu, and Daniel Antohe. 2005. On the semantics
of noun compounds. Computer Speech and Language, 19(4):479–496.
David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2005. English Gigaword Second
Edition. Linguistic Data Consortium, Philadelphia.
Richard H. Granger. 1977. FOUL-UP: A program that figures out the meanings of words
from context. In Proceedings of the Fifth International Joint Conference on Artificial
Intelligence, pages 172–178. Cambridge, MA.
Stefan Th. Gries. 2004. Shouldn’t it be breakfunch? A quantitative analysis of the
structure of blends. Linguistics, 42(3):639–667.
Stefan Th. Gries. 2006. Cognitive determinants of subtractive word-formation processes:
A corpus-based perspective. Cognitive Linguistics, 17(4):535–558.
Rebecca E. Grinter and Margery A. Eldridge. 2001. y do tngrs luv 2 txt msg. In
Proceedings of the Seventh European Conference on Computer-Supported Cooperative
Work (ECSCW ’01), pages 219–238. Bonn, Germany.
Orin Hargraves. 2007. Taming the wild beast. Dictionaries, 28:139–141.
Peter M. Hastings and Steven L. Lytinen. 1994. The ups and downs of lexical acquisition.
In Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 754–
759. Seattle, Washington.
Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the semantic
orientation of adjectives. In Proceedings of the Eighth Conference of the European
Chapter of the Association for Computational Linguistics, pages 174–181. Madrid,
Spain.
Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In
Proceedings of the Fourteenth International Conference on Computational Linguistics
(COLING 1992), pages 539–545. Nantes, France.
Donald Hindle. 1990. Noun classification from predicate-argument structures. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics,
pages 268–275. Pittsburgh, Pennsylvania.
Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. 2006. Reconsidering language identification for written language resources. In
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), pages 485–488. Genoa, Italy.
Marianne Hundt, Andrea Sand, and Paul Skandera. 1999. Manual of information to
accompany the Freiburg - Brown Corpus of American English (‘Frown’). http://
khnt.aksis.uib.no/icame/manuals/frown/INDEX.HTM.
Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In
Proceedings of Human Language Technologies: The Annual Conference of the North
American Chapter of the Association for Computational Linguistics (NAACL-HLT
2007), pages 372–379. Rochester, NY.
Jay J. Jiang and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference Research on
Computational Linguistics (ROCLING X), pages 19–33. Taipei, Taiwan.
Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In KDD
’02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 133–142. Edmonton, Canada.
Eric Joanis, Suzanne Stevenson, and David James. 2008. A general feature space for
automatic verb classification. Natural Language Engineering, 14(3):337–367. Also
published by Cambridge Journals Online on December 19, 2006.
Samuel Johnson. 1755. A Dictionary of the English Language. Richard Bentley.
Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, NJ.
Byung-Ju Kang and Key-Sun Choi. 2002. Effective foreign word extraction for Korean
information retrieval. Information Processing and Management, 38(1):91–109.
Michael H. Kelly. 1998. To “brunch” or to “brench”: Some aspects of blend structure.
Linguistics, 36(3):579–590.
Adam Kilgarriff. 1997. “I Don’t Believe in Word Senses”. Computers and the Humanities,
31(2):91–113.
Adam Kilgarriff and Gregory Grefenstette. 2003. Introduction to the special issue on the
Web as a corpus. Computational Linguistics, 29(3):333–347.
Adam Kilgarriff, Pavel Rychly, Pavel Smrz, and David Tugwell. 2004. The Sketch Engine.
In Proceedings of Euralex, pages 105–116. Lorient, France.
Adam Kilgarriff and David Tugwell. 2002. Sketching words. In Marie-Hélène Corréard,
editor, Lexicography and Natural Language Processing: A Festschrift in Honour of B.
T. S. Atkins, pages 125–137. Euralex.
Elizabeth Knowles and Julia Elliott, editors. 1997. The Oxford Dictionary of New Words.
Oxford University Press, Oxford.
Catherine Kobus, François Yvon, and Géraldine Damnati. 2008. Normalizing SMS: are
two metaphors better than one? In Proceedings of the 22nd International Conference
on Computational Linguistics (COLING 2008), pages 441–448. Manchester.
Charles W. Kreidler. 1979. Creating new words by shortening. English Linguistics,
13:24–36.
Saisuresh Krishnakumaran and Xiaojin Zhu. 2007. Hunting elusive metaphors using
lexical resources. In Proceedings of the HLT/NAACL 2007 Workshop on Computational
Approaches to Figurative Language, pages 13–20. Rochester, NY.
Haruo Kubozono. 1990. Phonological constraints on blending in English as a case for
phonology-morphology interface. Yearbook of Morphology, 3:1–20.
Henry Kucera and W. Nelson Francis. 1967. Computational Analysis of Present Day
American English. Brown University Press, Providence, Rhode Island.
George Lakoff and Mark Johnson. 1980. Metaphors We Live By. University of Chicago Press, Chicago.
Mirella Lapata and Chris Brew. 2004. Verb class disambiguation using informative priors.
Computational Linguistics, 30(1):45–73.
Mark Lauer. 1995. Designing Statistical Language Learners: Experiments on Noun Compounds. Ph.D. thesis, Macquarie University.
Adrienne Lehrer. 2003. Understanding trendy neologisms. Italian Journal of Linguistics,
15(2):369–382.
Rich Ling and Naomi S. Baron. 2007. Text messaging and IM: Linguistic comparison of
American college data. Journal of Language and Social Psychology, 26:291–298.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural
Language Processing. MIT Press, Cambridge, MA.
Eric Mays, Fred J. Damerau, and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 27(5):517–522.
Diana McCarthy, Bill Keller, and John Carroll. 2003. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL-SIGLEX Workshop on Multiword
Expressions: Analysis, Acquisition and Treatment, pages 73–80. Sapporo, Japan.
Erin McKean. 2007. Verbatim. A talk given at Google.
Linda G. Means. 1988. Cn yur cmputr raed ths? In Proceedings of the Second Conference
on Applied Natural Language Processing, pages 93–100. Austin, Texas.
Allan Metcalf. 2002. Predicting New Words. Houghton Mifflin Company, Boston, MA.
Allan Metcalf. 2007. The enigma of 9/11. Dictionaries, 28:160–162.
Andrei Mikheev. 1997. Automatic rule induction for unknown-word guessing. Computational Linguistics, 23(3):405–423.
Saif Mohammad, Cody Dunne, and Bonnie Dorr. 2009. Generating high-coverage semantic orientation lexicons from overtly marked words and a thesaurus. In Proceedings of
the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP2009), pages 599–608. Singapore.
Saif Mohammad and Graeme Hirst. 2006. Distributional measures of concept-distance: A
task-oriented evaluation. In Proceedings of the 2006 Conference on Empirical Methods
in Natural Language Processing (EMNLP-2006), pages 35–43. Sydney, Australia.
Rosamund Moon. 1987. The analysis of meaning. In John M. Sinclair, editor, Looking
Up: An Account of the COBUILD Project in Lexical Computing and the Development
of the Collins COBUILD English Language Dictionary, pages 86–103. Collins ELT,
London.
David Nadeau and Peter D. Turney. 2005. A supervised learning approach to acronym
identification. In Proceedings of the Eighteenth Canadian Conference on Artificial
Intelligence (AI’2005), pages 319–329. Victoria, Canada.
Ruth O’Donovan and Mary O’Neil. 2008. A systematic approach to the selection of
neologisms for inclusion in a large monolingual dictionary. In Proceedings of the 13th
Euralex International Congress, pages 571–579. Barcelona.
Naoaki Okazaki and Sophia Ananiadou. 2006. A term recognition approach to acronym
recognition. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics
(COLING-ACL 2006), pages 643–650. Sydney, Australia.
Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and
Trends in Information Retrieval, 2(1–2):1–135.
Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. Wordnet::Similarity
- Measuring the relatedness of concepts. In Demonstration Papers at the Human
Language Technology Conference of the North American Chapter of the Association
for Computational Linguistics (HLT-NAACL 2004), pages 38–41. Boston, MA.
Ingo Plag. 2003. Word-formation in English. Cambridge University Press, Cambridge.
Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing,
pages 133–142. Philadelphia, PA.
Philip Resnik, Aaron Elkiss, Ellen Lau, and Heather Taylor. 2005. The Web in theoretical linguistics research: Two case studies using the Linguist’s Search Engine. In
Proceedings of the 31st Meeting of the Berkeley Linguistics Society, pages 265–276.
Berkeley, CA.
Adrian Room. 1986. Dictionary of Changes in Meaning. Routledge and Kegan Paul,
London, New York.
Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2009. Semantic density analysis: Comparing word meaning across time and space. In Proceedings of the EACL 2009 Workshop on GEMS: GEometrical Models of Natural Language Semantics, pages 104–111.
Athens, Greece.
Geoffrey Sampson, editor. 1985. Writing Systems: A linguistic introduction. Stanford
University Press, Stanford, California.
Hinrich Schütze. 1992. Automatic word sense discrimination. Computational Linguistics,
24(1):97–123.
Ariel S. Schwartz and Marti A. Hearst. 2003. A simple algorithm for identifying abbreviation definitions in biomedical texts. In Proceedings of the Pacific Symposium on
Biocomputing (PSB 2003), pages 451–462. Lihue, HI.
William Shakespeare. 2008a. Hamlet. Edited by Joseph Pearce. Ignatius Press, San
Francisco.
William Shakespeare. 2008b. King Lear. Edited by Joseph Pearce. Ignatius Press, San
Francisco.
Jesse T. Sheidlower. 1995. Principles for the inclusion of new words in college dictionaries.
Dictionaries, 16:33–44.
Libin Shen and Aravind K. Joshi. 2005. Ranking and reranking with perceptron. Machine
Learning, 60(1):73–96.
Rainer Siemund and Claudia Claridge. 1997. The Lampeter Corpus of Early Modern
English Tracts. International Computer Archive of Modern and Medieval English,
21:61–70.
John Simpson. 2007. Neologism: The long view. Dictionaries, 28:146–148.
John M. Sinclair, editor. 1987. Looking Up: An Account of the COBUILD Project in
Lexical Computing and the Development of the Collins COBUILD English Language
Dictionary. Collins ELT, London.
Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for
automatic hypernym discovery. In Proceedings of the Nineteenth Annual Conference
on Neural Information Processing Systems (NIPS 2005), pages 1297–1304. Whistler,
Canada.
Richard Sproat, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and
Christopher Richards. 2001. Normalization of non-standard words. Computer Speech
and Language, 15:287–333.
Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie, editors.
1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press,
Cambridge, MA.
Eiichiro Sumita and Fumiaki Sugaya. 2006. Using the Web to disambiguate acronyms.
In Proceedings of the Human Language Technology Conference of the North American
Chapter of the Association for Computational Linguistics, pages 161–164. New York.
Crispin Thurlow. 2003. Generation txt? The sociolinguistics of young people’s text-messaging. Discourse Analysis Online, 1(1).
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003.
Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings
of the 2003 Human Language Technology Conference of the North American Chapter
of the Association for Computational Linguistics (HLT-NAACL 2003), pages 173–180.
Edmonton, Canada.
Kristina Toutanova and Robert C. Moore. 2002. Pronunciation modeling for improved
spelling correction. In Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics, pages 144–151. Philadelphia, PA.
Elizabeth C. Traugott and Richard B. Dasher. 2002. Regularity in Semantic Change.
Cambridge University Press, Cambridge.
Peter D. Turney and Michael L. Littman. 2003. Measuring praise and criticism: Inference
of semantic orientation from association. ACM Transactions on Information Systems
(TOIS), 21(4):315–346.
Justin Washtell. 2009. Co-Dispersion: A windowless approach to lexical association. In
Proceedings of the 12th Conference of the European Chapter of the Association for
Computational Linguistics (EACL 2009), pages 861–869. Athens, Greece.
Yorick Wilks. 1978. Making preferences more active. Artificial Intelligence, 11(3):197–
223.
Yorick Wilks and Roberta Catizone. 2002. Lexical tuning. In Proceedings of Computational Linguistics and Intelligent Text Processing, Third International Conference
(CICLING 2002), pages 106–125. Mexico City, Mexico.
François Yvon. 2010. Rewriting the orthography of SMS messages. Natural Language
Engineering, 16(2):133–159.
Uri Zernik, editor. 1991. Lexical Acquisition: Exploiting On-Line Resources to Build a
Lexicon. Lawrence Erlbaum Associates, Hillsdale, NJ.