
Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy


Barbara Rosario and Marti Hearst
School of Information Management & Systems
University of California, Berkeley
Berkeley, CA 94720-4600
{rosario,hearst}@sims.berkeley.edu

Abstract

We are developing corpus-based techniques for identifying semantic relations at an intermediate level of description (more specific than those used in case frames, but more general than those used in traditional knowledge representation systems). In this paper we describe a classification algorithm for identifying relationships between two-word noun compounds. We find that a very simple approach using a machine learning algorithm and a domain-specific lexical hierarchy successfully generalizes from training instances, performing better on previously unseen words than a baseline consisting of training on the words themselves.

1 Introduction

We are exploring empirical methods of determining semantic relationships between constituents in natural language. Our current project focuses on biomedical text, both because it poses interesting challenges, and because it should be possible to make inferences about propositions that hold between scientific concepts within biomedical texts (Swanson and Smalheiser, 1994).

One of the important challenges of biomedical text, along with most other technical text, is the proliferation of noun compounds. A typical article title is shown below; it consists of a cascade of four noun phrases linked by prepositions:

  Open-labeled long-term study of the efficacy, safety, and tolerability of subcutaneous sumatriptan in acute migraine treatment.

The real concern in analyzing such a title is in determining the relationships that hold between different concepts, rather than in finding the appropriate attachments (which is especially difficult given the lack of a verb). And before we tackle the prepositional phrase attachment problem, we must find a way to analyze the meanings of the noun compounds.

Our goal is to extract propositional information from text, and as a step towards this goal, we classify constituents according to which semantic relationships hold between them. For example, we want to characterize the treatment-for-disease relationship between the words of migraine treatment versus the method-of-treatment relationship between the words of aerosol treatment. These relations are intended to be combined to produce larger propositions that can then be used in a variety of interpretation paradigms, such as abductive reasoning (Hobbs et al., 1993) or inductive logic programming (Ng and Zelle, 1997).

Note that because we are concerned with the semantic relations that hold between the concepts, as opposed to the more standard, syntax-driven computational goal of determining left versus right association, this has the fortuitous effect of changing the problem into one of classification, amenable to standard machine learning classification techniques. We have found that we can use such algorithms to classify relationships between two-word noun compounds with a surprising degree of accuracy. A one-out-of-eighteen classification using a neural net achieves accuracies as high as 62%. By taking advantage of lexical ontologies, we achieve strong results on noun compounds for which neither word is present in the training set. Thus, we think this is a promising approach for a variety of semantic labeling tasks.

The remainder of this paper is organized as follows: Section 2 describes related work, Section 3 describes the semantic relations and how they were chosen, and Section 4 describes the data collection and ontologies. In Section 5 we describe the method for automatically assigning semantic relations to noun compounds, and report the results of experiments using this method. Section 6 concludes the paper and discusses future work.

2 Related Work

Several approaches have been proposed for empirical noun compound interpretation. Lauer and Dras (1994) point out that there are three components to the problem: identification of the compound from within the text, syntactic analysis of the compound
(left versus right association), and the interpretation of the underlying semantics. Several researchers have tackled the syntactic analysis (Lauer, 1995; Pustejovsky et al., 1993; Liberman and Sproat, 1992), usually using a variation of the idea of finding the subconstituents elsewhere in the corpus and using those to predict how the larger compounds are structured.

We are interested in the third task, interpretation of the underlying semantics. Most related work relies on hand-written rules of one kind or another. Finin (1980) examines the problem of noun compound interpretation in detail, and constructs a complex set of rules. Vanderwende (1994) uses a sophisticated system to extract semantic information automatically from an on-line dictionary, and then manipulates a set of hand-written rules with hand-assigned weights to create an interpretation. Rindflesch et al. (2000) use hand-coded rule-based systems to extract factual assertions from biomedical text. Lapata (2000) classifies nominalizations according to whether the modifier is the subject or the object of the underlying verb expressed by the head noun.1

1 Nominalizations are compounds whose head noun is a nominalized verb and whose modifier is either the subject or the object of the verb. We do not distinguish the NCs on the basis of their formation.

In the related sub-area of information extraction (Cardie, 1997; Riloff, 1996), the main goal is to find every instance of particular entities or events of interest. These systems use empirical techniques to learn which terms signal entities of interest, in order to fill in pre-defined templates. Our goals are more general than those of information extraction, and so this work should be helpful for that task. However, our approach will not solve issues surrounding previously unseen proper nouns, which are often important for information extraction tasks.

There have been several efforts to incorporate lexical hierarchies into statistical processing, primarily for the problem of prepositional phrase (PP) attachment. The current standard formulation is: given a verb followed by a noun and a prepositional phrase, represented by the tuple v, n1, p, n2, determine which of v or n1 the PP consisting of p and n2 attaches to, or is most closely associated with. Because the data is sparse, empirical methods that train on word occurrences alone (Hindle and Rooth, 1993) have been supplanted by algorithms that generalize one or both of the nouns according to class-membership measures (Resnik, 1993; Resnik and Hearst, 1993; Brill and Resnik, 1994; Li and Abe, 1998), but the statistics are computed for the particular preposition and verb.

It is not clear how to use the results of such analyses after they are found; the semantics of the relationship between the terms must still be determined. In our framework we would cast this problem as finding the relationship R(p, n2) that best characterizes the preposition and the NP that follows it, and then seeing if the categorization algorithm determines that there exists any relationship R′(n1, R(p, n2)) or R′(v, R(p, n2)).

The algorithms used in the related work reflect the fact that they condition probabilities on a particular verb and noun. Resnik (1993; 1995) uses classes in WordNet (Fellbaum, 1998) and a measure of conceptual association to generalize over the nouns. Brill and Resnik (1994) use Brill’s transformation-based algorithm along with simple counts within a lexical hierarchy in order to generalize over individual words. Li and Abe (1998) use a minimum description length-based algorithm to find an optimal tree cut over WordNet for each classification problem, finding improvements over both lexical association (Hindle and Rooth, 1993) and conceptual association, and equaling the transformation-based results. Our approach differs from these in that we are using machine learning techniques to determine which level of the lexical hierarchy is appropriate for generalizing across nouns.

3 Noun Compound Relations

In this work we aim for a representation that is intermediate in generality between standard case roles (such as Agent, Patient, Topic, Instrument) and the specificity required for information extraction. We have created a set of relations that are sufficiently general to cover a significant number of noun compounds, but that can be domain specific enough to be useful in analysis. We want to support relationships between entities that are shown to be important in cognitive linguistics; in particular, we intend to support the kinds of inferences that arise from Talmy’s force dynamics (Talmy, 1985). It has been shown that relations of this kind can be combined in order to determine the “directionality” of a sentence (e.g., whether or not a politician is in favor of, or opposed to, a proposal) (Hearst, 1990). In the medical domain this translates to, for example, mapping a sentence into a representation showing that a chemical removes an entity that is blocking the passage of a fluid through a channel.

The problem remains of determining what the appropriate kinds of relations are. In theoretical linguistics, there are contradictory views regarding the semantic properties of noun compounds (NCs). Levi (1978) argues that there exists a small set of semantic relationships that NCs may imply. Downing (1977) argues that the semantics of NCs cannot be exhausted by any finite listing of relationships. Between these two extremes lies Warren’s (1978) taxonomy of six major semantic relations organized into
a hierarchical structure.

We have identified the 38 relations shown in Table 1. We tried to produce relations that correspond to linguistic theories such as those of Levi and Warren, but in many cases these are inappropriate. Levi’s classes are too general for our purposes; for example, she collapses the “location” and “time” relationships into one single class “In”, and therefore field mouse and autumnal rain belong to the same class. Warren’s classification schema is much more detailed, and there is some overlap between the top levels of Warren’s hierarchy and our set of relations. For example, our “Cause (2-1)” for flu virus corresponds to her “Causer-Result” of hay fever, and our “Person Afflicted” (migraine patient) can be thought of as Warren’s “Belonging-Possessor” of gunman. Warren differentiates some classes also on the basis of the semantics of the constituents, so that, for example, the “Time” relationship is divided up into “Time-Animate Entity” of weekend guests and “Time-Inanimate Entity” of Sunday paper. Our classification is based on the kind of relationships that hold between the constituent nouns rather than on the semantics of the head nouns.

For the automatic classification task, we used only the 18 relations (indicated in bold in Table 1) for which an adequate number of examples were found in the current collection. Many NCs were ambiguous, in that they could be described by more than one semantic relationship. In these cases, we simply multi-labeled them: for example, cell growth is both “Activity” and “Change”, tumor regression is “Ending/reduction” and “Change”, and bladder dysfunction is “Location” and “Defect”. Our approach handles this kind of multi-labeled classification.
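To make the multi-label setup concrete, each NC can be encoded as a binary target vector over the relation inventory, with a 1 for every relation it expresses. The sketch below is ours, not the authors' code; the relation names and example labelings are taken from the text, and only a subset of the 18 relations is shown for brevity:

```python
# Minimal sketch of the multi-label encoding described above.
# The relation subset and example labelings come from the text;
# the helper itself is illustrative.

RELATIONS = ["Activity", "Change", "Ending/reduction", "Location", "Defect"]

def target_vector(labels, relations=RELATIONS):
    """Binary vector with a 1 for every relation the NC expresses."""
    return [1 if r in labels else 0 for r in relations]

# cell growth is both "Activity" and "Change"
print(target_vector({"Activity", "Change"}))   # [1, 1, 0, 0, 0]
# bladder dysfunction is "Location" and "Defect"
print(target_vector({"Location", "Defect"}))   # [0, 0, 0, 1, 1]
```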
Two relation types are especially problematic. Some compounds are non-compositional or lexicalized, such as vitamin k and e2 protein; others defy classification because the nouns are subtypes of one another. This group includes migraine headache, guinea pig, and hbv carrier. We placed all these NCs in a catch-all category. We also included a “wrong” category containing word pairs that were incorrectly labeled as NCs.2

2 The percentage of the word pairs extracted that were not true NCs was about 6%; some examples are: treat migraine, ten patient, headache more. We do not know, however, how many NCs we missed. The errors occurred when the wrong label was assigned by the tagger (see Section 4).

The relations were found by iterative refinement based on looking at 2245 extracted compounds (described in the next section) and finding commonalities among them. Labeling was done by the authors of this paper and a biology student; the NCs were classified out of context. We expect to continue development and refinement of these relationship types, based on what ends up clearly being useful “downstream” in the analysis.

The end goal is to combine these relationships in NCs with more than two constituent nouns, as in the example intranasal migraine treatment of Section 1.

4 Collection and Lexical Resources

To create a collection of noun compounds, we performed searches from MedLine, which contains references and abstracts from 4300 biomedical journals. We used several query terms, intended to span across different subfields. We retained only the titles and the abstracts of the retrieved documents. On these titles and abstracts we ran a part-of-speech tagger (Cutting et al., 1991) and a program that extracts only sequences of units tagged as nouns. We extracted NCs with up to 6 constituents, but for this paper we consider only NCs with 2 constituents.
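As an illustration of this extraction step, the sketch below (ours; the paper used the tagger of Cutting et al. (1991)) pulls maximal runs of noun-tagged tokens out of tagged text. The (word, tag) input format and the Penn-style "NN" tags are assumptions:

```python
# Sketch: extract maximal runs of noun-tagged tokens from tagged text,
# keeping sequences of 2-6 nouns as candidate NCs.

def extract_noun_sequences(tagged_tokens, min_len=2, max_len=6):
    sequences, run = [], []
    for word, tag in tagged_tokens:
        if tag.startswith("NN"):            # any noun tag
            run.append(word)
        else:                               # run of nouns has ended
            if min_len <= len(run) <= max_len:
                sequences.append(tuple(run))
            run = []
    if min_len <= len(run) <= max_len:      # flush a run at end of input
        sequences.append(tuple(run))
    return sequences

tagged = [("acute", "JJ"), ("migraine", "NN"), ("treatment", "NN"), (".", ".")]
print(extract_noun_sequences(tagged))       # [('migraine', 'treatment')]
```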
The Unified Medical Language System (UMLS) is a biomedical lexical resource produced and maintained by the National Library of Medicine (Humphreys et al., 1998). We use the MetaThesaurus component to map lexical items into unique concept IDs (CUIs).3 The UMLS also has a mapping from these CUIs into the MeSH lexical hierarchy (Lowe and Barnett, 1994); we mapped the CUIs into MeSH terms. There are about 19,000 unique main terms in MeSH, as well as additional modifiers. There are 15 main subhierarchies (trees) in MeSH, each corresponding to a major branch of medical ontology. For example, tree A corresponds to Anatomy, tree B to Organisms, and so on. The longer the name of the MeSH term, the longer the path from the root and the more precise the description. For example, migraine is C10.228.140.546.800.525, that is, C (a disease), C10 (Nervous System Diseases), C10.228 (Central Nervous System Diseases), and so on.

3 In some cases a word maps to more than one CUI; for the work reported here we arbitrarily chose the first mapping in all cases. In future work we will explore how to make use of all of the mapped terms.

We use the MeSH hierarchy for generalization across classes of nouns; we use it instead of the other resources in the UMLS primarily because of MeSH’s hierarchical structure. For these experiments, we considered only those noun compounds for which both nouns can be mapped into MeSH terms, resulting in a total of 2245 NCs.

5 Method and Results

Because we have defined noun compound relation determination as a classification problem, we can make use of standard classification algorithms. In particular, we used neural networks to classify across all relations simultaneously.
Name N Examples
Wrong parse (1) 109 exhibit asthma, ten drugs, measure headache
Subtype (4) 393 headaches migraine, fungus candida, hbv carrier,
giant cell, mexico city, t1 tumour, ht1 receptor
Activity/Physical process (5) 59 bile delivery, virus reproduction, bile drainage,
headache activity, bowel function, tb transmission
Ending/reduction 8 migraine relief, headache resolution
Beginning of activity 2 headache induction, headache onset
Change 26 papilloma growth, headache transformation,
disease development, tissue reinforcement
Produces (on a genetic level) (7) 47 polyomavirus genome, actin mrna, cmv dna, protein gene
Cause (1-2) (20) 116 asthma hospitalizations, aids death, automobile accident
heat shock, university fatigue, food infection
Cause (2-1) 18 flu virus, diarrhoea virus, influenza infection
Characteristic (8) 33 receptor hypersensitivity, cell immunity,
drug toxicity, gene polymorphism, drug susceptibility
Physical property 9 blood pressure, artery diameter, water solubility
Defect (27) 52 hormone deficiency, csf fistulas, gene mutation
Physical Make Up 6 blood plasma, bile vomit
Person afflicted (15) 55 aids patient, bmt children, headache group, polio survivors
Demographic attributes 19 childhood migraine, infant colic, women migraineur
Person/center who treats 20 headache specialist, headache center, diseases physicians,
asthma nurse, children hospital
Research on 11 asthma researchers, headache study, language research
Attribute of clinical study (18) 77 headache parameter, attack study, headache interview,
biology analyses, biology laboratory, influenza epidemiology
Procedure (36) 60 tumor marker, genotype diagnosis, blood culture,
brain biopsy, tissue pathology
Frequency/time of (2-1) (22) 25 headache interval, attack frequency,
football season, headache phase, influenza season
Time of (1-2) 4 morning headache, hour headache, weekend migraine
Measure of (23) 54 relief rate, asthma mortality, asthma morbidity,
cell population, hospital survival
Standard 5 headache criteria, society standard
Instrument (1-2) (33) 121 aciclovir therapy, chloroquine treatment,
laser irradiation, aerosol treatment
Instrument (2-1) 8 vaccine antigen, biopsy needle, medicine ginseng
Instrument (1) 16 heroin use, internet use, drug utilization
Object (35) 30 bowel transplantation, kidney transplant, drug delivery
Misuse 11 drug abuse, acetaminophen overdose, ergotamine abuser
Subject 18 headache presentation, glucose metabolism, heat transfer
Purpose (14) 61 headache drugs, hiv medications, voice therapy,
influenza treatment, polio vaccine
Topic (40) 38 time visualization, headache questionnaire, tobacco history,
vaccination registries, health education, pharmacy database
Location (21) 145 brain artery, tract calculi, liver cell, hospital beds
Modal 14 emergency surgery, trauma method
Material (39) 28 formaldehyde vapor, aloe gel, gelatin powder, latex glove,
Bind 4 receptor ligand, carbohydrate ligand
Activator (1-2) 6 acetylcholine receptor, pain signals
Activator (2-1) 4 headache trigger, headache precipitant
Inhibitor 11 adrenoreceptor blockers, influenza prevention
Defect in Location (21 27) 157 lung abscess, artery aneurysm, brain disorder

Table 1: The semantic relations defined via iterative refinement over a set of noun compounds. The relations
shown in boldface are those used in the experiments reported on here. Relation ID numbers are shown in
parentheses by the relation names. The second column shows the number of labeled examples for each class;
the last row shows a class consisting of compounds that exhibit more than one relation. The notation (1-2)
and (2-1) indicates the directionality of the relations. For example, Cause (1-2) indicates that the first noun
causes the second, and Cause (2-1) indicates the converse.
We ran the experiments creating models that used different levels of the MeSH hierarchy. For example, for the NC flu vaccination, flu maps to the MeSH term D4.808.54.79.429.154.349 and vaccination to G3.770.670.310.890. Flu vaccination for Model 4 would be represented by a vector consisting of the concatenation of the two descriptors showing only the first four levels: D4.808.54.79 G3.770.670.310 (see Table 2). When a word maps to a general MeSH term (like treatment, Y11), zeros are appended to the end of the descriptor to stand in place of the missing values (so, for example, treatment in Model 3 is Y 11 0, and in Model 4 is Y 11 0 0, etc.).

           flu                   vaccination
Model 2    D 4                   G 3
Model 3    D 4 808               G 3 770
Model 4    D 4 808 54            G 3 770 670
Model 5    D 4 808 54 79         G 3 770 670 310
Model 6    D 4 808 54 79 429     G 3 770 670 310 890

Table 2: Different lengths of the MeSH descriptors for the different models.
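The truncation and zero-padding can be made concrete as follows. This sketch is ours; it follows the token counting used in Table 2 and in the treatment example, where the subhierarchy letter counts as the first level:

```python
# Sketch: represent a word by the first `level` tokens of its MeSH
# descriptor, splitting the tree letter from the first number and
# padding with zeros when the descriptor is shorter than `level`.
import re

def mesh_levels(descriptor):
    """D4.808.54 -> ['D', '4', '808', '54'] (letter split off, per Table 2)."""
    head, *rest = descriptor.split(".")
    m = re.match(r"([A-Z]+)(\d+)", head)
    return [m.group(1), m.group(2)] + rest

def truncate(descriptor, level):
    """First `level` tokens, zero-padded for short descriptors."""
    tokens = mesh_levels(descriptor)
    return tokens[:level] + ["0"] * max(0, level - len(tokens))

print(truncate("D4.808.54.79.429.154.349", 4))   # ['D', '4', '808', '54']
print(truncate("G3.770.670.310.890", 4))         # ['G', '3', '770', '670']
print(truncate("Y11", 3))                        # ['Y', '11', '0']
```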
The numbers in the MeSH descriptors are categorical values; we represented them with indicator variables. That is, for each variable we calculated the number of possible categories c and then represented an observation of the variable as a sequence of c binary variables in which one binary variable was one and the remaining c − 1 binary variables were zero.
We also used a representation in which the words themselves were used as categorical input variables (we call this representation “lexical”). For this collection of NCs there were 1184 unique nouns and therefore the feature vector for each noun had 1184 components. In Table 3 we report the length of the feature vectors for one noun for each model. The entire NC was described by concatenating the feature vectors for the two nouns in sequence.

Model      Feature Vector
2          42
3          315
4          687
5          950
6          1111
Lexical    1184

Table 3: Length of the feature vectors for different models.
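A sketch of this indicator (one-hot) encoding, with toy category inventories; the helper names are ours:

```python
# Sketch: one-hot ("indicator") encoding of categorical features. Each
# feature position gets one binary slot per observed category; exactly
# one slot is 1 per observation. Inventories here are toy examples.

def build_vocab(column):
    """Sorted list of distinct categories observed at one position."""
    return sorted(set(column))

def one_hot(value, vocab):
    return [1 if value == v else 0 for v in vocab]

def encode_noun(tokens, vocabs):
    """Concatenate the indicator vectors of each descriptor token.
    The full NC vector is then the concatenation of the two nouns'
    vectors, as stated in the text above."""
    vec = []
    for value, vocab in zip(tokens, vocabs):
        vec.extend(one_hot(value, vocab))
    return vec

# Toy inventories for the first two positions of a MeSH descriptor:
vocabs = [build_vocab(["D", "G", "Y"]), build_vocab(["4", "3", "11"])]
print(encode_noun(["D", "4"], vocabs))   # [1, 0, 0, 0, 0, 1]
```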
The NCs represented in this fashion were used as input to a neural network. We used a feed-forward network trained with conjugate gradient descent. The network had one hidden layer, in which a hyperbolic tangent function was used, and an output layer representing the 18 relations. A logistic sigmoid function was used in the output layer to map the outputs into the interval (0, 1).

The number of units of the output layer was the number of relations (18) and therefore fixed. The network was trained for several choices of numbers of hidden units; we chose the best-performing networks based on training set error for each of the models. We subsequently tested these networks on held-out testing data.
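The paper gives no implementation beyond this description; the following is a minimal sketch of such a network under our own assumptions (layer sizes, initialization, and a per-output cross-entropy loss), using scipy's conjugate-gradient optimizer in place of the authors' training code:

```python
# Sketch (not the authors' code): feed-forward net with one tanh hidden
# layer and 18 sigmoid outputs, trained by conjugate gradient.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 42, 5, 18   # e.g. Model 2 inputs; hidden size was tuned

def unpack(theta):
    """Recover weight matrices from the flat parameter vector."""
    i = n_in * n_hid
    W1 = theta[:i].reshape(n_in, n_hid)
    b1 = theta[i:i + n_hid]
    j = i + n_hid + n_hid * n_out
    W2 = theta[i + n_hid:j].reshape(n_hid, n_out)
    return W1, b1, W2, theta[j:]

def forward(theta, X):
    W1, b1, W2, b2 = unpack(theta)
    H = np.tanh(X @ W1 + b1)                      # tanh hidden layer
    return 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))   # sigmoid outputs in (0, 1)

def loss(theta, X, Y):
    P = np.clip(forward(theta, X), 1e-10, 1.0 - 1e-10)
    # cross-entropy summed over the 18 outputs; Y may carry several 1s
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

# Toy stand-ins for the indicator-encoded NCs and their relation labels.
X = rng.integers(0, 2, (40, n_in)).astype(float)
Y = rng.integers(0, 2, (40, n_out)).astype(float)
theta0 = rng.normal(0.0, 0.1, n_in * n_hid + n_hid + n_hid * n_out + n_out)
fit = minimize(loss, theta0, args=(X, Y), method="CG",
               options={"maxiter": 25})
scores = forward(fit.x, X)   # per-relation scores, ranked for Acc1-Acc3
```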
We compared the results with a baseline in which logistic regression was used on the lexical features. Given the indicator variable representation of these features, this logistic regression essentially forms a table of log-odds for each lexical item. We also compared to a method in which the lexical indicator variables were used as input to a neural network. This approach is of interest to see to what extent, if any, the MeSH-based features affect performance. Note also that this lexical neural-network approach is feasible in this setting because the number of unique words is limited (1184) – such an approach would not scale to larger problems.

In Table 4 and in Figure 1 we report the results from these experiments. A neural network using lexical features only yields 62% accuracy on average across all 18 relations. A neural net trained on Model 6 using the MeSH terms to represent the nouns yields an accuracy of 61% on average across all 18 relations. Note that reasonable performance is also obtained for Model 2, which is a much more general representation. Table 4 shows that both methods achieve up to 78% accuracy at including the correct relation among the top three hypothesized.

Model              Acc1    Acc2    Acc3
Lexical: Log Reg   0.31    0.58    0.62
Lexical: NN        0.62    0.73    0.78
2                  0.52    0.65    0.72
3                  0.58    0.70    0.76
4                  0.60    0.70    0.76
5                  0.60    0.72    0.78
6                  0.61    0.71    0.76

Table 4: Test accuracy for each model, where the model number corresponds to the level of the MeSH hierarchy used for classification. Lexical: NN is the neural network on lexical features and Lexical: Log Reg is logistic regression on lexical features. Acc1 refers to how often the correct relation is the top-scoring relation, Acc2 refers to how often the correct relation is one of the top two according to the neural net, and so on. Guessing would yield a result of 0.077.
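Acc1 through Acc3 in Table 4 are top-k accuracies. A small sketch of the computation (function and variable names are ours):

```python
# Sketch: "AccK" = fraction of test NCs whose correct relation appears
# among the K highest-scoring network outputs.
import numpy as np

def acc_top_k(scores, gold, k):
    """scores: (n, n_relations) network outputs; gold: (n,) correct index."""
    top_k = np.argsort(-scores, axis=1)[:, :k]   # indices of the k best
    return np.mean([g in row for g, row in zip(gold, top_k)])

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.9]])
gold = np.array([2, 0])
print(acc_top_k(scores, gold, 1))   # 0.0  (neither gold label ranks first)
print(acc_top_k(scores, gold, 2))   # 1.0  (both are within the top two)
```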
Multi-class classification is a difficult problem (Vapnik, 1998). In this problem, a baseline in which the algorithm guesses yields about 5% accuracy. We see that our method is a significant improvement over the tabular logistic-regression-based approach, which yields an accuracy of only 31 percent. Additionally, despite the significant reduction in raw information content as compared to the lexical representation, the MeSH-based neural network performs as well as the lexical-based neural network. (And we again stress that the lexical-based neural network is not a viable option for larger domains.)

[Figure 1: Accuracies on the test sets for all the models (accuracy on the test set vs. levels of the MeSH hierarchy). The dotted line at the bottom is the accuracy of guessing (the inverse of the number of classes). The dash-dot line above this is the accuracy of logistic regression on the lexical data. The solid line with asterisks is the accuracy of our representation, when only the maximum output value from the network is considered. The solid line with circles is the accuracy of getting the right answer within the two largest output values from the neural network, and the last solid line with diamonds is the accuracy of getting the right answer within the first three outputs from the network. The three flat dashed lines are the corresponding performances of the neural network on lexical inputs.]

Figure 2 shows the results for each relation. MeSH-based generalization does better on some relations (for example 14 and 15) and lexical on others (7, 22). It turns out that the test set for relationship 7 (“Produces on a genetic level”) is dominated by NCs containing the words alleles and mrna, and that all the NCs in the training set containing these words are assigned relation label 7. A similar situation is seen for relation 22, “Time (2-1)”. In the test set examples the second noun is either recurrence, season or time. In the training set, these nouns appear only in NCs that have been labeled as belonging to relation 22.

[Figure 2: Accuracies for each class, for the lexical model. The numbers at the bottom refer to the class numbers in Table 1. Note the very high accuracy for the “mixed” relationship 20-27 (last bar on the right).]

On the other hand, if we look at relations 14 and 15, we find a wider range of words, and in some cases the words in the test set are not present in the training set. In relationship 14 (“Purpose”), for example, vaccine appears 6 times in the test set (e.g., varicella vaccine). In the training set, NCs with vaccine in them have also been classified as “Instrument” (antigen vaccine, polysaccharide vaccine), as “Object” (vaccine development), as “Subtype of” (opv vaccine) and as “Wrong” (vaccines using). Other words in the test set for 14 are varicella, which is present in the training set only in varicella serology, labeled as “Attribute of clinical study”, and drainage, which is in the training set only as “Location” (gallbladder drainage and tract drainage) and “Activity” (bile drainage). Other test set words such as immunisation and carcinogen do not appear in the training set at all.

In other words, it seems that the MeSH-based categorization does better when generalization is required. Additionally, this data set is “dense” in the sense that very few testing words are not present in the training data. This is of course an unrealistic situation, and we wanted to test the robustness of the method in a more realistic setting. The results reported in Table 4 and in Figure 1 were obtained by splitting the data into 50% training and 50% testing for each relation, and we had a total of 855 training points and 805 test points. Of these, only 75 examples in the testing set consisted of NCs in which both words were not present in the training set.

We decided to test the robustness of the MeSH-based model versus the lexical model in the case of unseen words; we are also interested in seeing the relative importance of the first versus the second noun. Therefore, we split the data into 5% training (73 data points) and 95% testing (1587 data points)
and partitioned the testing set into 4 subsets as follows (the numbers in parentheses are the numbers of points for each case; a sketch of this partitioning appears after the list):

• Case 1: NCs for which the first noun was not present in the training set (424)
• Case 2: NCs for which the second noun was not present in the training set (252)
• Case 3: NCs for which both nouns were present in the training set (101)
• Case 4: NCs for which both nouns were not present in the training set (810).
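The sketch below is ours; we assume "present in the training set" means the noun occurs anywhere in the training NCs:

```python
# Sketch: split test NCs into the four cases by whether each noun was
# seen during training. Data structures are illustrative.

def partition(test_ncs, train_ncs):
    vocab = {w for nc in train_ncs for w in nc}   # all training nouns
    cases = {1: [], 2: [], 3: [], 4: []}
    for n1, n2 in test_ncs:
        seen1, seen2 = n1 in vocab, n2 in vocab
        if seen1 and seen2:
            cases[3].append((n1, n2))       # both nouns seen
        elif not seen1 and not seen2:
            cases[4].append((n1, n2))       # both nouns unseen
        elif not seen1:
            cases[1].append((n1, n2))       # first noun unseen
        else:
            cases[2].append((n1, n2))       # second noun unseen
    return cases

train = [("flu", "vaccination"), ("migraine", "treatment")]
test = [("flu", "treatment"), ("aerosol", "therapy")]
print({k: len(v) for k, v in partition(test, train).items()})
# {1: 0, 2: 0, 3: 1, 4: 1}
```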

Model         All test   Case 1   Case 2   Case 3   Case 4
Lexical: NN   0.23       0.54     0.14     0.33     0.08
2             0.44       0.62     0.25     0.53     0.38
3             0.41       0.62     0.18     0.47     0.35
4             0.42       0.58     0.26     0.39     0.38
5             0.46       0.64     0.28     0.54     0.40
6             0.44       0.64     0.25     0.50     0.39

Table 5: Test accuracy for the four sub-partitions of the test set.

Table 5 and Figures 3 and 4 present the accuracies for these test set partitions. Figure 3 shows that the MeSH-based models are more robust than the lexical model when the number of unseen words is high and when the size of the training set is (very) small. In this more realistic situation, the MeSH models are able to generalize over previously unseen words. For unseen words, lexical reduces to guessing.4

4 Note that for unseen words, the baseline lexical-based logistic regression approach, which essentially builds a tabular representation of the log-odds for each class, also reduces to random guessing.

[Figure 3: Test accuracy vs. levels of the MeSH hierarchy. The unbroken lines represent the MeSH models’ accuracies (for the entire test set and for Case 4) and the dashed lines represent the corresponding lexical accuracies. The accuracies are smaller than in the previous case of Table 4 because the training set is much smaller, but the point of interest is the difference in the performance of MeSH vs. lexical in this more difficult setting. Note that lexical for Case 4 reduces to random guessing.]

Figure 4 shows the accuracy of the MeSH-based model for the four cases of Table 5. It is interesting to note that the accuracy for Case 1 (first noun not present in the training set) is much higher than the accuracy for Case 2 (second noun not present in the training set). This seems to indicate that the second noun is more important for the classification than the first one.

[Figure 4: Accuracy vs. levels of the MeSH hierarchy for the MeSH-based model for the four cases. All these curves refer to the case of getting exactly the right answer. Note the difference in performance between Case 1 (first noun not present in the training set) and Case 2 (second noun not present in the training set).]

6 Conclusions

We have presented a simple approach to corpus-based assignment of semantic relations for noun compounds. The main idea is to define a set of relations that can hold between the terms and use standard machine learning techniques and a lexical hierarchy to generalize from training instances to new examples. The initial results are quite promising.

In this task of multi-class classification (with 18 classes) we achieved an accuracy of about 60%. These results can be compared with Vanderwende (1994), who reports an accuracy of 52% with 13 classes, and Lapata (2000), whose algorithm achieves about 80% accuracy for a much simpler binary classification.

We have shown that a class-based representation performs as well as a lexical-based model despite the reduction of raw information content and
despite a somewhat errorful mapping from terms to concepts. We have also shown that representing the nouns of the compound by a very general representation (Model 2) achieves a reasonable performance of about 52% accuracy on average. This is particularly important in the case of larger collections with a much bigger number of unique words, for which the lexical-based model is not a viable option. Our results seem to indicate that we do not lose much in terms of accuracy using the more compact MeSH representation.

We have also shown how MeSH-based models outperform a lexical-based approach when the number of training points is small and when the test set consists of words unseen in the training data. This indicates that the MeSH models can generalize successfully over unseen words. Our approach handles “mixed-class” relations naturally. For the mixed class Defect in Location, the algorithm achieved an accuracy around 95% for both “Defect” and “Location” simultaneously. Our results also indicate that the second noun (the head) is more important in determining the relationships than the first one.

In the future we plan to train the algorithm to allow different levels for each noun in the compound. We also plan to compare the results to the tree cut algorithm reported in (Li and Abe, 1998), which allows different levels to be identified for different subtrees. We also plan to tackle the problem of noun compounds containing more than two terms.

Acknowledgments

We would like to thank Nu Lai for help with the classification of the noun compound relations. This work was supported in part by NSF award number IIS-9817353.
References

Eric Brill and Philip Resnik. 1994. A rule-based approach to prepositional phrase attachment disambiguation. In Proceedings of COLING-94.

Claire Cardie. 1997. Empirical methods in information extraction. AI Magazine, 18(4).

Douglass R. Cutting, Julian Kupiec, Jan O. Pedersen, and Penelope Sibun. 1991. A practical part-of-speech tagger. In The 3rd Conference on Applied Natural Language Processing, Trento, Italy.

P. Downing. 1977. On the creation and use of English compound nouns. Language, (53):810–842.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Timothy W. Finin. 1980. The Semantic Interpretation of Compound Nominals. Ph.D. dissertation, University of Illinois, Urbana, Illinois.

Marti A. Hearst. 1990. A hybrid approach to restricted text interpretation. In Paul S. Jacobs, editor, Text-Based Intelligent Systems: Current Research in Text Analysis, Information Extraction, and Retrieval, pages 38–43. GE Research & Development Center, TR 90CRD198.

Donald Hindle and Mats Rooth. 1993. Structural ambiguity and lexical relations. Computational Linguistics, 19(1).

Jerry R. Hobbs, Mark Stickel, Douglas Appelt, and Paul Martin. 1993. Interpretation as abduction. Artificial Intelligence, 63(1-2).

L. Humphreys, D.A.B. Lindberg, H.M. Schoolman, and G. O. Barnett. 1998. The Unified Medical Language System: An informatics research collaboration. Journal of the American Medical Informatics Association, 5(1):1–13.

Maria Lapata. 2000. The automatic interpretation of nominalizations. In AAAI Proceedings.

Mark Lauer and Mark Dras. 1994. A probabilistic model of compound nouns. In Proceedings of the 7th Australian Joint Conference on AI.

Mark Lauer. 1995. Corpus statistics meet the compound noun. In Proceedings of the 33rd Meeting of the Association for Computational Linguistics, June.

Judith Levi. 1978. The Syntax and Semantics of Complex Nominals. Academic Press, New York.

Hang Li and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.

Mark Liberman and Richard Sproat. 1992. The stress and structure of modified noun phrases in English. In I. Sag and A. Szabolcsi, editors, Lexical Matters. CSLI Lecture Notes No. 24, University of Chicago Press.

Henry J. Lowe and G. Octo Barnett. 1994. Understanding and using the Medical Subject Headings (MeSH) vocabulary to perform literature searches. Journal of the American Medical Association (JAMA), 271(4):1103–1108.

Hwee Tou Ng and John Zelle. 1997. Corpus-based approaches to semantic interpretation in natural language processing. AI Magazine, 18(4).

James Pustejovsky, Sabine Bergler, and Peter Anick. 1993. Lexical semantic techniques for corpus analysis. Computational Linguistics, 19(2).

Philip Resnik and Marti A. Hearst. 1993. Structural ambiguity and conceptual relations. In Proceedings of the ACL Workshop on Very Large Corpora, Columbus, OH.

Philip Resnik. 1993. Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. thesis, University of Pennsylvania, December. (Institute for Research in Cognitive Science report IRCS-93-42).

Philip Resnik. 1995. Disambiguating noun groupings with respect to WordNet senses. In Third Workshop on Very Large Corpora. Association for Computational Linguistics.

Ellen Riloff. 1996. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference, Menlo Park. AAAI Press / MIT Press.

Thomas Rindflesch, Lorraine Tanabe, John N. Weinstein, and Lawrence Hunter. 2000. Extraction of drugs, genes and relations from the biomedical literature. Pacific Symposium on Biocomputing, 5(5).

Don R. Swanson and N. R. Smalheiser. 1994. Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic disease. Neuroscience Research Communications, 15:1–9.

Len Talmy. 1985. Force dynamics in language and thought. In Parasession on Causatives and Agentivity, University of Chicago. Chicago Linguistic Society (21st Regional Meeting).

Lucy Vanderwende. 1994. Algorithm for automatic interpretation of noun sequences. In Proceedings of COLING-94, pages 782–788.

V. Vapnik. 1998. Statistical Learning Theory. Oxford University Press.

Beatrice Warren. 1978. Semantic Patterns of Noun-Noun Compounds. Acta Universitatis Gothoburgensis.
