regarding coreference. Specifically, we increase the
number of lexical features to nine to allow more
complex NP string matching operations. In addi-
tion, we include four new semantic features to al-
low finer-grained semantic compatibility tests. We
test for ancestor-descendant relationships in Word-
Net (SUBCLASS), for example, and also measure
the WordNet graph-traversal distance (WNDIST) be-
tween NP_j and NP_i. Furthermore, we add a new posi-
tional feature that measures the distance in terms of
the number of paragraphs (PARANUM) between the
two NPs.
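As an illustration of the two WordNet-based tests, the sketch below approximates SUBCLASS and WNDIST with NLTK's WordNet interface. This is our own minimal reading of the features, not the paper's implementation; in particular, the fallback distance for unrelated words is an arbitrary choice.

```python
# Hypothetical sketch of the SUBCLASS and WNDIST tests using NLTK's
# WordNet interface; the paper does not specify its implementation.
from nltk.corpus import wordnet as wn

def subclass(head_i: str, head_j: str) -> bool:
    """True if some noun sense of one head is a WordNet ancestor
    (hypernym) of some sense of the other."""
    for si in wn.synsets(head_i, pos=wn.NOUN):
        ancestors_i = set(si.closure(lambda s: s.hypernyms()))
        for sj in wn.synsets(head_j, pos=wn.NOUN):
            if sj in ancestors_i or si in set(sj.closure(lambda s: s.hypernyms())):
                return True
    return False

def wn_dist(head_i: str, head_j: str, max_dist: int = 20) -> int:
    """Shortest WordNet path length between any noun senses of the two
    heads; max_dist is an arbitrary fallback for unrelated words."""
    best = max_dist
    for si in wn.synsets(head_i, pos=wn.NOUN):
        for sj in wn.synsets(head_j, pos=wn.NOUN):
            d = si.shortest_path_distance(sj)
            if d is not None and d < best:
                best = d
    return best
```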
The most substantial changes to the feature set,
however, occur for grammatical features: we add 26
new features to allow the acquisition of more sophis-
ticated syntactic coreference resolution rules. Four
features simply determine NP type, e.g. are both
NPs definite, or pronouns, or part of a quoted string?
These features allow other tests to be conditioned on
the types of NPs being compared. Similarly, three
new features determine the grammatical role of one
or both of the NPs. Currently, only tests for clausal
subjects are made. Next, eight features encode tra-
ditional linguistic (hard) constraints on coreference.
For example, coreferent NPs must agree both in gen-
der and number (AGREEMENT); cannot SPAN one
another (e.g. “government” and “government offi-
cials”); and cannot violate the BINDING constraints.
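By way of illustration, the sketch below implements two of these hard constraints, AGREEMENT and SPAN, over a minimal NP representation of our own devising (the class and attribute names are hypothetical, not the paper's); the BINDING test, which requires parse information, is omitted.

```python
from dataclasses import dataclass

@dataclass
class NP:
    # Minimal hypothetical NP representation, for illustration only.
    start: int     # token offset of the first word
    end: int       # token offset one past the last word
    gender: str    # "masc", "fem", "neut", or "unknown"
    number: str    # "sg", "pl", or "unknown"

def agreement(np_i: NP, np_j: NP) -> bool:
    """Hard constraint: coreferent NPs must be compatible in both
    gender and number (unknown values are treated as compatible)."""
    def ok(a: str, b: str) -> bool:
        return a == b or "unknown" in (a, b)
    return ok(np_i.gender, np_j.gender) and ok(np_i.number, np_j.number)

def spans(np_i: NP, np_j: NP) -> bool:
    """True if one NP textually contains the other, as with
    "government" inside "government officials"; such pairs are
    barred from coreference."""
    return (np_i.start <= np_j.start and np_j.end <= np_i.end) or \
           (np_j.start <= np_i.start and np_i.end <= np_j.end)
```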
Still other grammatical features encode general lin-
guistic preferences either for or against coreference.
For example, an indefinite NP (that is not in appo-
sition to an anaphoric NP) is not likely to be coref-
erent with any NP that precedes it (ARTICLE). The
last subset of grammatical features encodes slightly
more complex, but generally non-linguistic heuris-
tics. For instance, the CONTAINS PN feature ef-
fectively disallows coreference between NPs that
contain distinct proper names but are not them-
selves proper names (e.g. “IBM executives” and
“Microsoft executives”).
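One possible realization of the CONTAINS PN test, assuming per-token proper-name tags are available; the helper and its input format are hypothetical, not the paper's.

```python
from typing import List, Tuple

Token = Tuple[str, bool]   # (word, is_proper_name); hypothetical format

def contains_pn_conflict(np_i: List[Token], np_j: List[Token],
                         i_is_pn: bool, j_is_pn: bool) -> bool:
    """True for pairs like "IBM executives" / "Microsoft executives":
    neither NP is itself a proper name, but each contains a proper
    name and the contained proper names differ."""
    if i_is_pn or j_is_pn:
        return False                      # test targets non-proper-name NPs
    pn_i = {w for w, is_pn in np_i if is_pn}
    pn_j = {w for w, is_pn in np_j if is_pn}
    return bool(pn_i) and bool(pn_j) and pn_i != pn_j

# Example from the text: this pair is barred from coreference.
assert contains_pn_conflict(
    [("IBM", True), ("executives", False)],
    [("Microsoft", True), ("executives", False)],
    i_is_pn=False, j_is_pn=False)
```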
Two final features make use of an in-house
naive pronoun resolution algorithm (PRO RESOLVE)
and a rule-based coreference resolution system
(RULE RESOLVE), each of which relies on the origi-
nal and expanded feature sets described above.
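Schematically, such system-output features simply wrap an auxiliary resolver's decision as a feature value; the sketch below shows one way this might look, with resolver interfaces that are entirely our own invention.

```python
def pro_resolve(np_i, np_j, pronoun_resolver) -> str:
    """Feature value: does the naive pronoun resolver select NP_i as
    the antecedent of pronoun NP_j? (Hypothetical interface.)"""
    return "Y" if pronoun_resolver.antecedent(np_j) is np_i else "N"

def rule_resolve(np_i, np_j, rule_system) -> str:
    """Feature value: does the rule-based coreference system place
    NP_i and NP_j in the same chain? (Hypothetical interface.)"""
    return "Y" if rule_system.same_chain(np_i, np_j) else "N"
```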
Results and discussion. Results using the ex-
panded feature set are shown in the All Features
block of Table 2. These and all subsequent results
also incorporate the learning framework changes
from Section 3. In comparison to the Baseline, we see statistically
significant increases in recall, but much larger de-
creases in precision. As a result, F-measure drops
precipitously for both learning algorithms and both
data sets. A closer examination of the results indi-
cates very poor precision on common nouns in com-
parison to that of pronouns and proper nouns. (See
the indented All Features results in Table 2.) In
particular, the classifiers acquire a number of low-
precision rules for common noun resolution, pre-
sumably because the current feature set is insuffi-
cient. For instance, a rule induced by RIPPER clas-
sifies two NPs as coreferent if the first NP is a proper
name, the second NP is a definite NP in the subject
position, and the two NPs have the same seman-
tic class and are at most one sentence apart from
each other. This rule covers 38 examples but has
18 exceptions, i.e. a precision of only 52.6%. In comparison, the Baseline sys-
tem obtains much better precision on common nouns
(i.e. 53.3 for MUC-6/RIPPER and 61.0 for MUC-
7/RIPPER with lower recall in both cases) where the
primary mechanism employed by the classifiers for
common noun resolution is their high-precision string
matching facility. Our results also suggest that data
fragmentation is likely to have contributed to the
drop in performance (i.e. we increased the number
of features without increasing the size of the training
set). For example, the decision tree induced from the
MUC-6 data set using the Soon feature set (Learn-
ing Framework results) has 16 leaves, each of which
contains 1728 instances on average; the tree induced
from the same data set using all of the 53 features,
on the other hand, has 86 leaves with an average of
322 instances per leaf.
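Note that the two trees cover essentially the same number of training instances (16 × 1728 ≈ 86 × 322 ≈ 27,700); the instances are simply spread across far more, far smaller leaves. This fragmentation effect is easy to verify on any induced tree; the sketch below computes the leaf count and mean instances per leaf, using scikit-learn's decision tree as a stand-in for the paper's C4.5 learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_stats(clf: DecisionTreeClassifier):
    """Return (number of leaves, mean training instances per leaf).
    Sketch only: scikit-learn's CART stands in for C4.5 here."""
    tree = clf.tree_
    is_leaf = tree.children_left == -1    # leaf nodes have no children
    n_leaves = int(np.sum(is_leaf))
    mean_per_leaf = float(np.mean(tree.n_node_samples[is_leaf]))
    return n_leaves, mean_per_leaf
```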
Hand-selected feature sets. As a result, we next
evaluate a version of the system that employs man-
ual feature selection: for each classifier/data set
combination, we discard features used primarily to
induce low-precision rules for common noun res-
olution and re-train the coreference classifier using
the reduced feature set. Here, feature selection does
not depend on a separate development corpus and