regarding coreference. Specifically, we increase the
number of lexical features to nine to allow more
complex NP string matching operations. In addi-
tion, we include four new semantic features to al-
low finer-grained semantic compatibility tests. We
test for ancestor-descendant relationships in Word-
Net (SUBCLASS), for example, and also measure
the WordNet graph-traversal distance (WNDIST) be-
tween NP_j and NP_i. Furthermore, we add a new posi-
tional feature that measures the distance in terms of
the number of paragraphs (PARANUM) between the
two NPs.
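As an illustration of the two WordNet-based tests, the sketch below approximates SUBCLASS and WNDIST with NLTK's WordNet interface. This is our own minimal reading of the features, not the paper's implementation; in particular, the fallback distance for unrelated words is an arbitrary choice.

```python
# Hypothetical sketch of the SUBCLASS and WNDIST tests using NLTK's
# WordNet interface; the paper does not specify its implementation.
from nltk.corpus import wordnet as wn

def subclass(head_i: str, head_j: str) -> bool:
    """True if some noun sense of one head is a WordNet ancestor
    (hypernym) of some sense of the other."""
    for si in wn.synsets(head_i, pos=wn.NOUN):
        ancestors_i = set(si.closure(lambda s: s.hypernyms()))
        for sj in wn.synsets(head_j, pos=wn.NOUN):
            if sj in ancestors_i or si in set(sj.closure(lambda s: s.hypernyms())):
                return True
    return False

def wn_dist(head_i: str, head_j: str, max_dist: int = 20) -> int:
    """Shortest WordNet path length between any noun senses of the two
    heads; max_dist is an arbitrary fallback for unrelated words."""
    best = max_dist
    for si in wn.synsets(head_i, pos=wn.NOUN):
        for sj in wn.synsets(head_j, pos=wn.NOUN):
            d = si.shortest_path_distance(sj)
            if d is not None and d < best:
                best = d
    return best
```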
The most substantial changes to the feature set,
however, occur for grammatical features: we add 26
new features to allow the acquisition of more sophis-
ticated syntactic coreference resolution rules. Four
features simply determine NP type, e.g. are both
NPs definite, or pronouns, or part of a quoted string?
These features allow other tests to be conditioned on
the types of NPs being compared. Similarly, three
new features determine the grammatical role of one
or both of the NPs. Currently, only tests for clausal
subjects are made. Next, eight features encode tra-
ditional linguistic (hard) constraints on coreference.
For example, coreferent NPs must agree both in gen-
der and number (AGREEMENT); cannot SPAN one
another (e.g. “government” and “government offi-
cials”); and cannot violate the BINDING constraints.
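By way of illustration, the sketch below implements two of these hard constraints, AGREEMENT and SPAN, over a minimal NP representation of our own devising (the class and attribute names are hypothetical, not the paper's); the BINDING test, which requires parse information, is omitted.

```python
from dataclasses import dataclass

@dataclass
class NP:
    # Minimal hypothetical NP representation, for illustration only.
    start: int     # token offset of the first word
    end: int       # token offset one past the last word
    gender: str    # "masc", "fem", "neut", or "unknown"
    number: str    # "sg", "pl", or "unknown"

def agreement(np_i: NP, np_j: NP) -> bool:
    """Hard constraint: coreferent NPs must be compatible in both
    gender and number (unknown values are treated as compatible)."""
    def ok(a: str, b: str) -> bool:
        return a == b or "unknown" in (a, b)
    return ok(np_i.gender, np_j.gender) and ok(np_i.number, np_j.number)

def spans(np_i: NP, np_j: NP) -> bool:
    """True if one NP textually contains the other, as with
    "government" inside "government officials"; such pairs are
    barred from coreference."""
    return (np_i.start <= np_j.start and np_j.end <= np_i.end) or \
           (np_j.start <= np_i.start and np_i.end <= np_j.end)
```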
Still other grammatical features encode general lin-
guistic preferences either for or against coreference.
For example, an indefinite NP (that is not in appo-
sition to an anaphoric NP) is not likely to be coref-
erent with any NP that precedes it (ARTICLE). The
last subset of grammatical features encodes slightly
more complex, but generally non-linguistic heuris-
tics. For instance, the CONTAINS PN feature ef-
fectively disallows coreference between NPs that
contain distinct proper names but are not them-
selves proper names (e.g. “IBM executives” and
“Microsoft executives”).
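One possible realization of the CONTAINS PN test, assuming per-token proper-name tags are available; the helper and its input format are hypothetical, not the paper's.

```python
from typing import List, Tuple

Token = Tuple[str, bool]   # (word, is_proper_name); hypothetical format

def contains_pn_conflict(np_i: List[Token], np_j: List[Token],
                         i_is_pn: bool, j_is_pn: bool) -> bool:
    """True for pairs like "IBM executives" / "Microsoft executives":
    neither NP is itself a proper name, but each contains a proper
    name and the contained proper names differ."""
    if i_is_pn or j_is_pn:
        return False                      # test targets non-proper-name NPs
    pn_i = {w for w, is_pn in np_i if is_pn}
    pn_j = {w for w, is_pn in np_j if is_pn}
    return bool(pn_i) and bool(pn_j) and pn_i != pn_j

# Example from the text: this pair is barred from coreference.
assert contains_pn_conflict(
    [("IBM", True), ("executives", False)],
    [("Microsoft", True), ("executives", False)],
    i_is_pn=False, j_is_pn=False)
```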
Two final features make use of an in-house
naive pronoun resolution algorithm (PRO RESOLVE)
and a rule-based coreference resolution system
(RULE RESOLVE), each of which relies on the origi-
nal and expanded feature sets described above.
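Schematically, such system-output features simply wrap an auxiliary resolver's decision as a feature value; the sketch below shows one way this might look, with resolver interfaces that are entirely our own invention.

```python
def pro_resolve(np_i, np_j, pronoun_resolver) -> str:
    """Feature value: does the naive pronoun resolver select NP_i as
    the antecedent of pronoun NP_j? (Hypothetical interface.)"""
    return "Y" if pronoun_resolver.antecedent(np_j) is np_i else "N"

def rule_resolve(np_i, np_j, rule_system) -> str:
    """Feature value: does the rule-based coreference system place
    NP_i and NP_j in the same chain? (Hypothetical interface.)"""
    return "Y" if rule_system.same_chain(np_i, np_j) else "N"
```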
Results and discussion. Results using the ex-
panded feature set are shown in the All Features
block of Table 2. These and all subsequent results
also incorporate the learning framework changes
from Section 3. In comparison to the Baseline, we see statistically
significant increases in recall, but much larger de-
creases in precision. As a result, F-measure drops
precipitously for both learning algorithms and both
data sets. A closer examination of the results indi-
cates very poor precision on common nouns in com-
parison to that of pronouns and proper nouns. (See
the indented All Features results in Table 2.) In
particular, the classifiers acquire a number of low-
precision rules for common noun resolution, pre-
sumably because the current feature set is insuffi-
cient. For instance, a rule induced by RIPPER clas-
sifies two NPs as coreferent if the first NP is a proper
name, the second NP is a definite NP in the subject
position, and the two NPs have the same seman-
tic class and are at most one sentence apart from
each other. This rule covers 38 examples but has
18 exceptions, i.e. a precision of only 52.6%. In comparison, the Baseline sys-
tem obtains much better precision on common nouns
(i.e. 53.3 for MUC-6/RIPPER and 61.0 for MUC-
7/RIPPER with lower recall in both cases) where the
primary mechanism employed by the classifiers for
common noun resolution is their high-precision string
matching facility. Our results also suggest that data
fragmentation is likely to have contributed to the
drop in performance (i.e. we increased the number
of features without increasing the size of the training
set). For example, the decision tree induced from the
MUC-6 data set using the Soon feature set (Learn-
ing Framework results) has 16 leaves, each of which
contains 1728 instances on average; the tree induced
from the same data set using all of the 53 features,
on the other hand, has 86 leaves with an average of
322 instances per leaf.
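Note that the two trees cover essentially the same number of training instances (16 × 1728 ≈ 86 × 322 ≈ 27,700); the instances are simply spread across far more, far smaller leaves. This fragmentation effect is easy to verify on any induced tree; the sketch below computes the leaf count and mean instances per leaf, using scikit-learn's decision tree as a stand-in for the paper's C4.5 learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_stats(clf: DecisionTreeClassifier):
    """Return (number of leaves, mean training instances per leaf).
    Sketch only: scikit-learn's CART stands in for C4.5 here."""
    tree = clf.tree_
    is_leaf = tree.children_left == -1    # leaf nodes have no children
    n_leaves = int(np.sum(is_leaf))
    mean_per_leaf = float(np.mean(tree.n_node_samples[is_leaf]))
    return n_leaves, mean_per_leaf
```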
Hand-selected feature sets. As a result, we next
evaluate a version of the system that employs man-
ual feature selection: for each classifier/data set
combination, we discard features used primarily to
induce low-precision rules for common noun res-
olution and re-train the coreference classifier using
the reduced feature set. Here, feature selection does
not depend on a separate development corpus and