Shallow Parsing with Conditional Random Fields

Fei Sha and Fernando Pereira


Department of Computer and Information Science
University of Pennsylvania
200 South 33rd Street, Philadelphia, PA 19104
(feisha|pereira)@cis.upenn.edu

Abstract

Conditional random fields for sequence labeling offer advantages over both generative models like HMMs and classifiers applied at each sequence position. Among sequence labeling tasks in language processing, shallow parsing has received much attention, with the development of standard evaluation datasets and extensive comparison among methods. We show here how to train a conditional random field to achieve performance as good as any reported base noun-phrase chunking method on the CoNLL task, and better than any reported single model. Improved training methods based on modern optimization algorithms were critical in achieving these results. We present extensive comparisons between models and training methods that confirm and strengthen previous results on shallow parsing and training methods for maximum-entropy models.
1 Introduction

Sequence analysis tasks in language and biology are often described as mappings from input sequences to sequences of labels encoding the analysis. In language processing, examples of such tasks include part-of-speech tagging, named-entity recognition, and the task we shall focus on here, shallow parsing. Shallow parsing identifies the non-recursive cores of various phrase types in text, possibly as a precursor to full parsing or information extraction (Abney, 1991). The paradigmatic shallow-parsing problem is NP chunking, which finds the non-recursive cores of noun phrases, called base NPs. The pioneering work of Ramshaw and Marcus (1995) introduced NP chunking as a machine-learning problem, with standard datasets and evaluation metrics. The task was extended to additional phrase types for the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000), which is now the standard evaluation task for shallow parsing.

Most previous work used two main machine-learning approaches to sequence labeling. The first approach relies on k-order generative probabilistic models of paired input sequences and label sequences, for instance hidden Markov models (HMMs) (Freitag and McCallum, 2000; Kupiec, 1992) or multilevel Markov models (Bikel et al., 1999). The second approach views the sequence labeling problem as a sequence of classification problems, one for each of the labels in the sequence. The classification result at each position may depend on the whole input and on the previous k classifications.[1]

The generative approach provides well-understood training and decoding algorithms for HMMs and more general graphical models. However, effective generative models require stringent conditional independence assumptions. For instance, it is not practical to make the label at a given position depend on a window of the input sequence as well as on the surrounding labels, since the inference problem for the corresponding graphical model would be intractable. Non-independent features of the inputs, such as capitalization, suffixes, and surrounding words, are important in dealing with words unseen in training, but they are difficult to represent in generative models.

The sequential classification approach can handle many correlated features, as demonstrated in work on maximum-entropy models (McCallum et al., 2000; Ratnaparkhi, 1996) and a variety of other linear classifiers, including winnow (Punyakanok and Roth, 2001), AdaBoost (Abney et al., 1999), and support-vector machines (Kudo and Matsumoto, 2001). Furthermore, these classifiers are trained to minimize some function related to labeling error, leading to smaller error in practice if enough training data are available. In contrast, generative models are trained to maximize the joint probability of the training data, which is not as closely tied to the accuracy metrics of interest if the actual data was not generated by the model, as is always the case in practice.

However, since sequential classifiers are trained to make the best local decision, unlike generative models they cannot trade off decisions at different positions against each other. In other words, sequential classifiers are myopic about the impact of their current decision on later decisions (Bottou, 1991; Lafferty et al., 2001). This forced the best sequential classifier systems to resort to heuristic combinations of forward-moving and backward-moving sequential classifiers (Kudo and Matsumoto, 2001).

Conditional random fields (CRFs) bring together the best of generative and classification models. Like classification models, they can accommodate many statistically correlated features of the inputs, and they are trained discriminatively. But like generative models, they can trade off decisions at different sequence positions to obtain a globally optimal labeling. Lafferty et al. (2001) showed that CRFs beat related classification models as well as HMMs on synthetic data and on a part-of-speech tagging task.

In the present work, we show that CRFs beat all reported single-model NP chunking results on the standard evaluation dataset, and are statistically indistinguishable from the previous best performer, a voting arrangement of 24 forward- and backward-looking support-vector classifiers (Kudo and Matsumoto, 2001). To obtain these results, we had to abandon the original iterative-scaling CRF training algorithm in favor of convex optimization algorithms with better convergence properties. We provide detailed comparisons between training methods.

The generalized perceptron proposed by Collins (2002) is closely related to CRFs, but the best CRF training methods seem to have a slight edge over the generalized perceptron.

[1] Ramshaw and Marcus (1995) used transformation-based learning (Brill, 1995), which for the present purposes can be thought of as a classification-based method.
2 Conditional Random Fields

We focus here on conditional random fields on sequences, although the notion can be used more generally (Lafferty et al., 2001; Taskar et al., 2002). Such CRFs define conditional probability distributions p(Y|X) of label sequences given input sequences. We assume that the random variable sequences X and Y have the same length, and use x = x_1 ... x_n and y = y_1 ... y_n for the generic input sequence and label sequence, respectively.

A CRF on (X, Y) is specified by a vector f of local features and a corresponding weight vector λ. Each local feature is either a state feature s(y, x, i) or a transition feature t(y, y', x, i), where y, y' are labels, x an input sequence, and i an input position. To make the notation more uniform, we also write

    s(y, y', x, i) = s(y', x, i)
    s(y, x, i) = s(y_i, x, i)
    t(y, x, i) = t(y_{i-1}, y_i, x, i)  if i > 1,  and 0  if i = 1

for any state feature s and transition feature t. Typically, features depend on the inputs around the given position, although they may also depend on global properties of the input, or be non-zero only at some positions, for instance features that pick out the first or last labels.

The CRF's global feature vector for input sequence x and label sequence y is given by

    F(y, x) = Σ_i f(y, x, i)

where i ranges over input positions. The conditional probability distribution defined by the CRF is then

    p_λ(Y|X) = exp(λ · F(Y, X)) / Z_λ(X)                                   (1)

where

    Z_λ(x) = Σ_y exp(λ · F(y, x))

Any positive conditional distribution p(Y|X) that obeys the Markov property

    p(Y_i | {Y_j}_{j≠i}, X) = p(Y_i | Y_{i-1}, Y_{i+1}, X)

can be written in the form (1) for appropriate choice of feature functions and weight vector (Hammersley and Clifford, 1971).

The most probable label sequence for input sequence x is

    ŷ = argmax_y p_λ(y|x) = argmax_y λ · F(y, x)

because Z_λ(x) does not depend on y. F(y, x) decomposes into a sum of terms for consecutive pairs of labels, so the most likely y can be found with the Viterbi algorithm.
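As an illustration of this decoding step (a minimal Python sketch under assumed array conventions, not the implementation used for the experiments reported below), the Viterbi recursion can be run over per-position score matrices score[i, y', y] = λ · f(y', y, x, i), with a dummy "start" row for position 0:

    import numpy as np

    def viterbi(score):
        """score[i, y_prev, y] = lambda . f(y_prev, y, x, i); position 0 reads its
        scores from a dummy 'start' row (y_prev = 0).  Returns the best labeling."""
        n, L, _ = score.shape
        delta = score[0, 0, :].copy()          # best score of a path ending in y at position 0
        back = np.zeros((n, L), dtype=int)     # back-pointers
        for i in range(1, n):
            cand = delta[:, None] + score[i]   # cand[y_prev, y]
            back[i] = cand.argmax(axis=0)
            delta = cand.max(axis=0)
        y = [int(delta.argmax())]              # follow back-pointers from the best final label
        for i in range(n - 1, 0, -1):
            y.append(int(back[i, y[-1]]))
        return y[::-1]

    # toy usage: 4 positions, 3 labels, random scores
    print(viterbi(np.random.default_rng(0).normal(size=(4, 3, 3))))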
We train a CRF by maximizing the log-likelihood of a given training set T = {(x_k, y_k) : k = 1, ..., N}, which we assume fixed for the rest of this section:

    L_λ = Σ_k log p_λ(y_k | x_k) = Σ_k [λ · F(y_k, x_k) − log Z_λ(x_k)]

To perform this optimization, we seek the zero of the gradient

    ∇L_λ = Σ_k [F(y_k, x_k) − E_{p_λ(Y|x_k)} F(Y, x_k)]                    (2)

In words, the maximum of the training-data likelihood is reached when the empirical average of the global feature vector equals its model expectation. The expectation E_{p_λ(Y|x)} F(Y, x) can be computed efficiently using a variant of the forward-backward algorithm. For a given x, define the transition matrix for position i as

    M_i[y, y'] = exp(λ · f(y, y', x, i))

Let f be any local feature, f_i[y, y'] = f(y, y', x, i), F(y, x) = Σ_i f(y_{i-1}, y_i, x, i), and let ∗ denote the component-wise matrix product. Then

    E_{p_λ(Y|x)} F(Y, x) = Σ_y p_λ(y|x) F(y, x)
                         = Σ_i α_{i-1} (f_i ∗ M_i) β_i^T / Z_λ(x)
    Z_λ(x) = α_n · 1^T

where α_i and β_i are the forward and backward state-cost vectors defined by

    α_i = α_{i-1} M_i  if 0 < i ≤ n,  and 1  if i = 0
    β_i^T = M_{i+1} β_{i+1}^T  if 1 ≤ i < n,  and 1  if i = n

Therefore, we can use a forward pass to compute the α_i and a backward pass to compute the β_i and accumulate the feature expectations.
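The following sketch (our own illustration with assumed array shapes, not the paper's Java implementation) carries out these two passes for precomputed transition matrices M_i and returns Z_λ(x) together with the pairwise marginals from which feature expectations are accumulated:

    import numpy as np

    def forward_backward(M):
        """M[i, y_prev, y] = exp(lambda . f(y_prev, y, x, i)); position 0 uses a dummy
        start row for y_prev.  Returns Z_lambda(x) and the pairwise marginals
        p(Y_{i-1} = y', Y_i = y | x)."""
        n, L, _ = M.shape
        alpha = np.zeros((n, L))
        beta = np.ones((n, L))
        alpha[0] = M[0, 0, :]                      # forward pass
        for i in range(1, n):
            alpha[i] = alpha[i - 1] @ M[i]
        for i in range(n - 2, -1, -1):             # backward pass
            beta[i] = M[i + 1] @ beta[i + 1]
        Z = alpha[-1].sum()
        pair = np.zeros((n, L, L))                 # alpha_{i-1}[y'] M_i[y', y] beta_i[y] / Z
        pair[0, 0, :] = alpha[0] * beta[0] / Z
        for i in range(1, n):
            pair[i] = alpha[i - 1][:, None] * M[i] * beta[i][None, :] / Z
        return Z, pair

    # the expectation of a local feature f is then sum_i sum_{y',y} pair[i, y', y] * f_i[y', y]
    Z, pair = forward_backward(np.exp(np.random.default_rng(1).normal(size=(5, 3, 3))))
    print(Z, pair.sum(axis=(1, 2)))                # each position's marginals sum to 1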
To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior (Chen and Rosenfeld, 1999):

    L'_λ = Σ_k [λ · F(y_k, x_k) − log Z_λ(x_k)] − ||λ||²/(2σ²) + const

with gradient

    ∇L'_λ = Σ_k [F(y_k, x_k) − E_{p_λ(Y|x_k)} F(Y, x_k)] − λ/σ²
3 Training Methods

Lafferty et al. (2001) used iterative scaling algorithms for CRF training, following earlier work on maximum-entropy models for natural language (Berger et al., 1996; Della Pietra et al., 1997). Those methods are very simple and guaranteed to converge, but as Minka (2001) and Malouf (2002) showed for classification, their convergence is much slower than that of general-purpose convex optimization algorithms when many correlated features are involved. Concurrently with the present work, Wallach (2002) tested conjugate-gradient and second-order methods for CRF training, showing significant training-speed advantages over iterative scaling on a small shallow-parsing problem. Our work shows that preconditioned conjugate gradient (CG) (Shewchuk, 1994) and limited-memory quasi-Newton (L-BFGS) (Nocedal and Wright, 1999) perform comparably on very large problems (around 3.8 million features). We compare those algorithms to generalized iterative scaling (GIS) (Darroch and Ratcliff, 1972), non-preconditioned CG, and voted perceptron training (Collins, 2002). All algorithms except the voted perceptron maximize the penalized log-likelihood: λ* = argmax_λ L'_λ. However, for ease of exposition, this discussion of training methods uses the unpenalized log-likelihood L_λ.

3.1 Preconditioned Conjugate Gradient

Conjugate-gradient (CG) methods have been shown to be very effective in linear and non-linear optimization (Shewchuk, 1994). Instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction.

CG methods can be accelerated by linearly transforming the variables with a preconditioner (Nocedal and Wright, 1999; Shewchuk, 1994). The purpose of the preconditioner is to improve the condition number of the quadratic form that locally approximates the objective function, so the inverse of the Hessian is a reasonable preconditioner. However, this is not applicable to CRFs for two reasons. First, the size of the Hessian is dim(λ)², leading to unacceptable space and time requirements for the inversion. In such situations, it is common to use instead the inverse of the diagonal of the Hessian. However, in our case the Hessian has the form

    H_λ := ∇²L_λ = − Σ_k { E[F(Y, x_k) F(Y, x_k)^T] − E[F(Y, x_k)] E[F(Y, x_k)]^T }

where the expectations are taken with respect to p_λ(Y|x_k). Therefore, every Hessian element, including the diagonal ones, involves the expectation of a product of global feature values. Unfortunately, computing those expectations is quadratic in sequence length, as the forward-backward algorithm can only compute expectations of quantities that are additive along label sequences.

We solve both problems by discarding the off-diagonal terms and approximating the expectation of the square of a global feature by the expectation of the sum of squares of the corresponding local features at each position. The approximated diagonal term H_f for feature f has the form

    H_f = E[ Σ_i f(Y, x_k, i)² ] − ( Σ_i Σ_{y,y'} f_i[y, y'] M_i[y, y'] / Z_λ(x_k) )²

If this approximation is semidefinite, which is trivial to check, its inverse is an excellent preconditioner for early iterations of CG training. However, when the model is close to the maximum, the approximation becomes unstable, which is not surprising since it is based on feature-independence assumptions that become invalid as the weights of interaction features move away from zero. Therefore, we disable the preconditioner after a certain number of iterations, determined from held-out data. We call this strategy mixed CG training.
3.2 Limited-Memory Quasi-Newton

Newton methods for nonlinear optimization use second-order (curvature) information to find search directions. As discussed in the previous section, it is not practical to obtain exact curvature information for CRF training. Limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature numerically from previous gradients and updates, avoiding the need for an exact Hessian inverse computation. Compared with preconditioned CG, L-BFGS can also handle large-scale problems but does not require a specialized Hessian approximation. An earlier study indicates that L-BFGS performs well in maximum-entropy classifier training (Malouf, 2002).

There is no theoretical guidance on how much information from previous steps we should keep to obtain sufficiently accurate curvature estimates. In our experiments, storing 3 to 10 pairs of previous gradients and updates worked well, so the extra memory required over preconditioned CG was modest. A more detailed description of this method can be found elsewhere (Nocedal and Wright, 1999).
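To make the training setup concrete, here is a small self-contained sketch (a toy stand-in of our own, not the Java trainer evaluated in Section 5) that hands the penalized log-likelihood L'_λ and its gradient (2) to an off-the-shelf L-BFGS routine; Z_λ and the feature expectation are computed by brute-force enumeration, which is feasible only for this tiny label set, and all names and sizes are assumptions:

    import numpy as np
    from itertools import product
    from scipy.optimize import minimize

    # toy problem: 2 labels, 3 positions, 4 weights; hypothetical local features
    L, N, D, SIGMA = 2, 3, 4, 1.0
    x = np.random.default_rng(1).normal(size=(N, D))    # stand-in observations
    y_gold = (0, 1, 1)                                   # stand-in reference labeling

    def feats(y_prev, y, x, i):
        f = np.zeros(D)
        f[y] += x[i, y]                  # toy "state" features
        f[2 + int(y_prev == y)] += 1.0   # toy "transition" features
        return f

    def global_F(ys):                    # F(y, x) = sum_i f(y_{i-1}, y_i, x, i)
        return sum(feats(0 if i == 0 else ys[i - 1], ys[i], x, i) for i in range(N))

    def neg_penalized_loglik(lam):
        scores = {ys: lam @ global_F(ys) for ys in product(range(L), repeat=N)}
        m = max(scores.values())
        Z = sum(np.exp(s - m) for s in scores.values())          # Z_lambda(x), up to exp(m)
        exp_F = sum(np.exp(scores[ys] - m) / Z * global_F(ys) for ys in scores)
        loglik = lam @ global_F(y_gold) - (np.log(Z) + m)        # log p_lambda(y_gold | x)
        grad = global_F(y_gold) - exp_F                          # gradient (2) for one instance
        loglik -= lam @ lam / (2 * SIGMA ** 2)                   # spherical Gaussian prior
        grad -= lam / SIGMA ** 2
        return -loglik, -grad

    # L-BFGS keeping 10 gradient/update pairs, the upper end of the range used above
    res = minimize(neg_penalized_loglik, np.zeros(D), jac=True,
                   method="L-BFGS-B", options={"maxcor": 10})
    print(res.x, -res.fun)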
3.3 Voted Perceptron

Unlike the other methods discussed so far, voted perceptron training (Collins, 2002) attempts to minimize the difference between the global feature vector for a training instance and the same feature vector for the best-scoring labeling of that instance according to the current model. More precisely, for each training instance the method computes a weight update

    λ_{t+1} = λ_t + F(y_k, x_k) − F(ŷ_k, x_k)                              (3)

in which ŷ_k is the Viterbi path

    ŷ_k = argmax_y λ_t · F(y, x_k)

Like the familiar perceptron algorithm, this algorithm repeatedly sweeps over the training instances, updating the weight vector as it considers each instance. Instead of taking just the final weight vector, the voted perceptron algorithm takes the average of the λ_t. Collins (2002) reported, and we confirmed, that this averaging reduces overfitting considerably.
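A compact sketch of this averaged variant follows (again our own illustration under assumed data structures, not the implementation evaluated later): each instance supplies precomputed local feature vectors feat[i][y'][y], the current model is decoded with Viterbi, update (3) is applied on mistakes, and the averaged weights are returned.

    import numpy as np

    def viterbi_decode(lam, feat):
        """feat[i][y_prev][y] = local feature vector f(y_prev, y, x, i); start uses y_prev = 0."""
        n, L = len(feat), len(feat[0][0])
        delta = np.array([lam @ feat[0][0][y] for y in range(L)])
        back = np.zeros((n, L), dtype=int)
        for i in range(1, n):
            cand = delta[:, None] + np.array([[lam @ feat[i][yp][y] for y in range(L)]
                                              for yp in range(L)])
            back[i], delta = cand.argmax(axis=0), cand.max(axis=0)
        y = [int(delta.argmax())]
        for i in range(n - 1, 0, -1):
            y.append(int(back[i, y[-1]]))
        return y[::-1]

    def global_F(y, feat):
        return sum(feat[i][0 if i == 0 else y[i - 1]][y[i]] for i in range(len(feat)))

    def averaged_perceptron(data, D, sweeps=2):
        lam, lam_sum, t = np.zeros(D), np.zeros(D), 0
        for _ in range(sweeps):
            for feat, y_gold in data:
                y_hat = viterbi_decode(lam, feat)
                if y_hat != list(y_gold):
                    lam = lam + global_F(y_gold, feat) - global_F(y_hat, feat)   # update (3)
                lam_sum, t = lam_sum + lam, t + 1
        return lam_sum / t           # averaged weights rather than the final lambda_t

    # toy usage: one instance with 3 positions, 2 labels, feature dimension 5
    rng = np.random.default_rng(0)
    toy = [[[rng.normal(size=5) for _ in range(2)] for _ in range(2)] for _ in range(3)]
    print(averaged_perceptron([(toy, (0, 1, 1))], D=5))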
4 Shallow Parsing

Figure 1 shows the base NPs in an example sentence. Following Ramshaw and Marcus (1995), the input to the NP chunker consists of the words in a sentence annotated automatically with part-of-speech (POS) tags. The chunker's task is to label each word with a label indicating whether the word is outside a chunk (O), starts a chunk (B), or continues a chunk (I). For example, the tokens in the first line of Figure 1 would be labeled BIIBIIOBOBIIO.
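As a small illustration (our own, with an assumed chunk-span representation), the encoding of chunk spans into these B/I/O tags is:

    def bio_tags(n_tokens, chunks):
        """chunks: list of (start, end) token spans, end exclusive, one per base NP."""
        tags = ["O"] * n_tokens
        for start, end in chunks:
            tags[start] = "B"
            for i in range(start + 1, end):
                tags[i] = "I"
        return tags

    # first line of Figure 1: 13 tokens, NPs spanning tokens [0,3), [3,6), [7,8), [9,12)
    print("".join(bio_tags(13, [(0, 3), (3, 6), (7, 8), (9, 12)])))   # BIIBIIOBOBIIO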
[Rockwell International Corp.] ['s Tulsa unit] said [it] signed [a tentative agreement] extending
[its contract] with [Boeing Co.] to provide [structural parts] for [Boeing] ['s 747 jetliners] .

Figure 1: NP chunks

4.1 Data Preparation

NP chunking results have been reported on two slightly different data sets: the original RM data set of Ramshaw and Marcus (1995), and the modified CoNLL-2000 version of Tjong Kim Sang and Buchholz (2000). Although the chunk tags in the RM and CoNLL-2000 data sets are somewhat different, we found no significant accuracy differences between models trained on these two data sets. Therefore, all our results are reported on the CoNLL-2000 data set. We also used a development test set, provided by Michael Collins, derived from WSJ section 21 tagged with the Brill (1995) POS tagger.

4.2 CRFs for Shallow Parsing

Our chunking CRFs have a second-order Markov dependency between chunk tags. This is easily encoded by making the CRF labels pairs of consecutive chunk tags. That is, the label at position i is y_i = c_{i-1}c_i, where c_i is the chunk tag of word i, one of O, B, or I. Since B must be used to start a chunk, the label OI is impossible. In addition, successive labels are constrained: y_{i-1} = c_{i-2}c_{i-1}, y_i = c_{i-1}c_i, and c_0 = O. These constraints on the model topology are enforced by giving appropriate features a weight of −∞, forcing all the forbidden labelings to have zero probability.
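For concreteness, one way to enumerate the pair labels and mark the forbidden transitions (a sketch of the constraint encoding described above, not the authors' feature setup) is:

    import numpy as np

    chunk_tags = ["O", "B", "I"]
    # pair labels c_{i-1}c_i; OI cannot occur because B must start every chunk
    labels = [a + b for a in chunk_tags for b in chunk_tags if a + b != "OI"]

    # a transition y_{i-1} = c_{i-2}c_{i-1} -> y_i = c_{i-1}c_i is allowed only when the
    # shared chunk tag c_{i-1} agrees; forbidden transitions get weight -inf
    trans = np.full((len(labels), len(labels)), -np.inf)
    for i, prev in enumerate(labels):
        for j, cur in enumerate(labels):
            if prev[1] == cur[0]:
                trans[i, j] = 0.0

    print(labels)   # ['OO', 'OB', 'BO', 'BB', 'BI', 'IO', 'IB', 'II']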
Our choice of features was mainly governed by computing power, since we do not use feature selection and all features are used in training and testing. We use the following factored representation for features:

    f(y_{i-1}, y_i, x, i) = p(x, i) q(y_{i-1}, y_i)                        (4)

where p(x, i) is a predicate on the input sequence x and current position i and q(y_{i-1}, y_i) is a predicate on pairs of labels. For instance, p(x, i) might be "word at position i is the" or "the POS tags at positions i − 1, i are DT, NN." Because the label set is finite, such a factoring of f(y_{i-1}, y_i, x, i) is always possible, and it allows each input predicate to be evaluated just once for many features that use it, making it possible to work with millions of features on large training sets.

Table 1 summarizes the feature set. For a given position i, w_i is the word, t_i its POS tag, and y_i its label. For any label y = c'c, c(y) = c is the corresponding chunk tag. For example, c(OB) = B. The use of chunk tags as well as labels provides a form of backoff from the very small feature counts that may arise in a second-order model, while allowing significant associations between tag pairs and input predicates to be modeled. To save time in some of our experiments, we used only the 820,000 features that are supported in the CoNLL training set, that is, the features that are on at least once. For our highest F score, we used the complete feature set, around 3.8 million in the CoNLL training set, which contains all the features whose predicate is on at least once in the training set. The complete feature set may in principle perform better because it can place negative weights on transitions that should be discouraged if a given predicate is on.

Table 1: Shallow parsing features

    q(y_{i-1}, y_i)                p(x, i)
    -----------------------------  ---------------------------------------------
    y_i = y                        true
    y_i = y, y_{i-1} = y'
    c(y_i) = c
    -----------------------------  ---------------------------------------------
    y_i = y                        w_i = w;  w_{i-1} = w;  w_{i+1} = w;
      or c(y_i) = c                w_{i-2} = w;  w_{i+2} = w;
                                   w_{i-1} = w', w_i = w;  w_{i+1} = w', w_i = w;
                                   t_i = t;  t_{i-1} = t;  t_{i+1} = t;
                                   t_{i-2} = t;  t_{i+2} = t;
                                   t_{i-1} = t', t_i = t;  t_{i-2} = t', t_{i-1} = t;
                                   t_i = t', t_{i+1} = t;  t_{i+1} = t', t_{i+2} = t;
                                   t_{i-2} = t'', t_{i-1} = t', t_i = t;
                                   t_{i-1} = t'', t_i = t', t_{i+1} = t;
                                   t_i = t'', t_{i+1} = t', t_{i+2} = t
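To illustrate how such factored features are generated (a sketch with our own predicate naming, not the 3.8-million-feature extractor used in the experiments), the p(x, i) templates of Table 1 can be instantiated as strings and later paired with the label predicates q:

    def input_predicates(words, tags, i):
        """Instantiate the p(x, i) templates of Table 1 at position i; out-of-range
        positions are padded so every template fires exactly once."""
        w = lambda k: words[k] if 0 <= k < len(words) else "<PAD>"
        t = lambda k: tags[k] if 0 <= k < len(tags) else "<PAD>"
        preds = ["true"]
        preds += [f"w[{d:+d}]={w(i + d)}" for d in (-2, -1, 0, 1, 2)]           # word unigrams
        preds += [f"w[-1,0]={w(i - 1)}|{w(i)}", f"w[0,+1]={w(i)}|{w(i + 1)}"]   # word bigrams
        preds += [f"t[{d:+d}]={t(i + d)}" for d in (-2, -1, 0, 1, 2)]           # tag unigrams
        preds += [f"t[{a},{b}]={t(i + a)}|{t(i + b)}" for a, b in
                  [(-2, -1), (-1, 0), (0, 1), (1, 2)]]                          # tag bigrams
        preds += [f"t[{a},{b},{c}]={t(i + a)}|{t(i + b)}|{t(i + c)}" for a, b, c in
                  [(-2, -1, 0), (-1, 0, 1), (0, 1, 2)]]                         # tag trigrams
        return preds

    words = "Rockwell International Corp. 's Tulsa unit said".split()
    tags = "NNP NNP NNP POS NNP NN VBD".split()
    print(input_predicates(words, tags, 3)[:6])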
4.3 Parameter Tuning

As discussed previously, we need a Gaussian weight prior to reduce overfitting. We also need to choose the number of training iterations, since we found that the best F score is attained while the log-likelihood is still improving. The reasons for this are not clear, but the Gaussian prior may not be enough to keep the optimization from making weight adjustments that slightly improve training log-likelihood but cause large F score fluctuations. We used the development test set mentioned in Section 4.1 to set the prior and the number of iterations.

4.4 Evaluation Metric

The standard evaluation metrics for a chunker are precision P (the fraction of output chunks that exactly match the reference chunks), recall R (the fraction of reference chunks returned by the chunker), and their harmonic mean, the F1 score F1 = 2PR/(P + R) (which we call just F score in what follows). The relationships between F score and labeling error or log-likelihood are not direct, so we report both F score and the other metrics for the models we tested. For comparisons with other reported results we use F score.
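A small sketch of this chunk-level scoring (our own simplified reference code over B/I/O tag sequences; the shared task's own evaluation script is the authoritative scorer) is:

    def chunks(tags):
        """Convert a B/I/O tag sequence into a set of (start, end) chunk spans."""
        spans, start = set(), None
        for i, tag in enumerate(tags + ["O"]):          # sentinel closes a trailing chunk
            if tag != "I" and start is not None:
                spans.add((start, i))
                start = None
            if tag == "B":
                start = i
        return spans

    def prf(gold_tags, pred_tags):
        gold, pred = chunks(gold_tags), chunks(pred_tags)
        correct = len(gold & pred)
        p = correct / len(pred) if pred else 0.0
        r = correct / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    # the BIIIIIII vs. OIIIIIII example of Section 5.3: recall 0 despite 87.5% tag accuracy
    print(prf(list("BIIIIIII"), list("OIIIIIII")))      # (0.0, 0.0, 0.0)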
4.5 Significance Tests

Ideally, comparisons among chunkers would control for feature sets, data preparation, training and test procedures, and parameter tuning, and estimate the statistical significance of performance differences. Unfortunately, reported results sometimes leave out details needed for accurate comparisons. We report F scores for comparison with previous work, but we also give statistical significance estimates using McNemar's test for those methods that we evaluated directly.

Testing the significance of F scores is tricky because the wrong chunks generated by two chunkers are not directly comparable. Yeh (2000) examined randomized tests for estimating the significance of F scores, and in particular the bootstrap over the test set (Efron and Tibshirani, 1993; Sang, 2002). However, bootstrap variances in preliminary experiments were too high to allow any conclusions, so we used instead a McNemar paired test on labeling disagreements (Gillick and Cox, 1989).
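A sketch of that paired test (our own illustration using the chi-square approximation with continuity correction; the counts below are hypothetical, not the ones behind Table 4) is:

    from scipy.stats import chi2

    def mcnemar(b, c):
        """b: tokens model A labels correctly but model B does not; c: the reverse.
        Returns the continuity-corrected chi-square statistic and its p-value under
        the null hypothesis that the disagreements are due to chance."""
        stat = (abs(b - c) - 1) ** 2 / (b + c)
        return stat, chi2.sf(stat, df=1)

    print(mcnemar(b=120, c=95))   # hypothetical disagreement counts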
    Model                                          F score
    ---------------------------------------------  -------
    SVM combination (Kudo and Matsumoto, 2001)     94.39%
    CRF                                            94.38%
    Generalized winnow (Zhang et al., 2002)        93.89%
    Voted perceptron                               94.09%
    MEMM                                           93.70%

Table 2: NP chunking F scores

    training method   time (min)   F score   L'_λ
    ---------------   ----------   -------   -----
    Precond. CG              130   94.19%    -2968
    Mixed CG                 540   94.20%    -2990
    Plain CG                 648   94.04%    -2967
    L-BFGS                    84   94.19%    -2948
    GIS                     3700   93.55%    -5668

Table 3: Runtime for various training methods

    null hypothesis              p-value
    --------------------------   -------
    CRF vs. SVM                  0.469
    CRF vs. MEMM                 0.00109
    CRF vs. voted perceptron     0.116
    MEMM vs. voted perceptron    0.0734

Table 4: McNemar's tests on labeling disagreements
5 Results

All the experiments were performed with our Java implementation of CRFs, designed to handle millions of features, on 1.7 GHz Pentium IV processors with Linux and IBM Java 1.3.0. Minor variants support the voted perceptron (Collins, 2002) and MEMMs (McCallum et al., 2000) with the same efficient feature encoding. GIS, CG, and L-BFGS were used to train CRFs and MEMMs.
5.1 F Scores

Table 2 gives representative NP chunking F scores for previous work and for our best model, with the complete set of 3.8 million features. The last row of the table gives the score for an MEMM trained with the mixed CG method using an approximate preconditioner. The published F score for the voted perceptron is 93.53% with a different feature set (Collins, 2002). The improved result given here is for the supported feature set; the complete feature set gives a slightly lower score of 94.07%. Zhang et al. (2002) reported a higher F score (94.38%) with generalized winnow using additional linguistic features that were not available to us.
5.2 Convergence Speed

All the results in the rest of this section are for the smaller supported set of 820,000 features. Figures 2a and 2b show how preconditioning helps training convergence. Since each CG iteration involves a line search that may require several forward-backward procedures (typically between 4 and 5 in our experiments), we plot the progress of the penalized log-likelihood L'_λ with respect to the number of forward-backward evaluations. The objective function increases rapidly, achieving close proximity to the maximum in a few iterations (typically 10). In contrast, GIS training increases L'_λ rather slowly, never reaching the value achieved by CG. The relative slowness of iterative scaling is also documented in a recent evaluation of training methods for maximum-entropy classification (Malouf, 2002). In theory, GIS would eventually converge to the L'_λ optimum, but in practice convergence may be so slow that L'_λ improvements fall below numerical accuracy, falsely indicating convergence.

[Figure 2: Training convergence for various methods. Both panels plot penalized log-likelihood L'_λ against the number of forward-backward evaluations: (a) preconditioned CG, mixed CG, and L-BFGS; (b) preconditioned CG, CG without preconditioner, and GIS.]
Mixed CG training converges slightly more slowly than preconditioned CG. On the other hand, CG without a preconditioner converges much more slowly than both preconditioned CG and mixed CG training. However, it is still much faster than GIS. We believe that the superior convergence rate of preconditioned CG is due to the use of approximate second-order information. This is confirmed by the performance of L-BFGS, which also uses approximate second-order information.[2]

Although there is no direct relationship between F score and log-likelihood, in these experiments F score tends to follow log-likelihood. Indeed, Figure 3 shows that preconditioned CG training improves test F scores much more rapidly than GIS training.

[Figure 3: Test F scores vs. training time. Test F score against the number of forward-backward evaluations for preconditioned CG, CG without preconditioner, and GIS.]

Table 3 compares run times (in minutes) for reaching a target penalized log-likelihood for various training methods with prior σ = 1.0. GIS is the only method that failed to reach the target, after 3,700 iterations. We cannot place the voted perceptron in this table, as it does not optimize log-likelihood and does not use a prior. However, it reaches a fairly good F score above 93% in just two training sweeps, but after that it improves more slowly, to a somewhat lower score, than preconditioned CG training.

[2] Although L-BFGS has a slightly higher penalized log-likelihood, its log-likelihood on the data is actually lower than that of preconditioned CG and mixed CG training.

5.3 Labeling Accuracy

The accuracy rate for individual labeling decisions is over-optimistic as an accuracy measure for shallow parsing. For instance, if the chunk BIIIIIII is labeled as OIIIIIII, the labeling accuracy is 87.5%, but recall is 0. However, individual labeling errors provide a more convenient basis for statistical significance tests. One such test is McNemar's test on paired observations (Gillick and Cox, 1989).

With McNemar's test, we compare the correctness of the labeling decisions of two models. The null hypothesis is that the disagreements (correct vs. incorrect) are due to chance. Table 4 summarizes the results of tests between the models for which we had labeling decisions. These tests suggest that MEMMs are significantly less accurate, but that there are no significant differences in accuracy among the other models.
6 Conclusions

We have shown that (log-)linear sequence labeling models trained discriminatively with general-purpose optimization methods are a simple, competitive solution to learning shallow parsers. These models combine the best features of generative finite-state models and discriminative (log-)linear classifiers, and do NP chunking as well as or better than "ad hoc" classifier combinations, which were the most accurate approach until now. In a longer version of this work we will also describe shallow parsing results for other phrase types. There is no reason why the same techniques cannot be used equally successfully for the other types or for other related tasks, such as POS tagging or named-entity recognition.

On the machine-learning side, it would be interesting to generalize the ideas of large-margin classification to sequence models, strengthening the results of Collins (2002) and leading to new optimal training algorithms with stronger guarantees against overfitting.

On the application side, (log-)linear parsing models have the potential to supplant the currently dominant lexicalized PCFG models for parsing by allowing much richer feature sets and simpler smoothing, while avoiding the label bias problem that may have hindered earlier classifier-based parsers (Ratnaparkhi, 1997). However, work in that direction has so far addressed only parse reranking (Collins and Duffy, 2002; Riezler et al., 2002). Full discriminative parser training faces significant algorithmic challenges in the relationship between parsing alternatives and feature values (Geman and Johnson, 2002) and in computing feature expectations.

Acknowledgments

John Lafferty and Andrew McCallum worked with the second author on developing CRFs. McCallum, helped by the second author, implemented the first conjugate-gradient trainer for CRFs, which convinced us that training of large CRFs on large datasets would be practical. Michael Collins helped us reproduce his generalized perceptron results and compare his method with ours. Erik Tjong Kim Sang, who has created the best online resources on shallow parsing, helped us with details of the CoNLL-2000 shared task. Taku Kudo provided the output of his SVM chunker for the significance test.
References

S. Abney. Parsing by chunks. In R. Berwick, S. Abney, and C. Tenny, editors, Principle-based Parsing. Kluwer Academic Publishers, 1991.

S. Abney, R. E. Schapire, and Y. Singer. Boosting applied to tagging and PP attachment. In Proc. EMNLP-VLC, New Brunswick, New Jersey, 1999. ACL.

A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 1996.

D. M. Bikel, R. L. Schwartz, and R. M. Weischedel. An algorithm that learns what's in a name. Machine Learning, 34:211-231, 1999.

L. Bottou. Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole. PhD thesis, Université de Paris XI, 1991.

E. Brill. Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics, 21:543-565, 1995.

S. F. Chen and R. Rosenfeld. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Carnegie Mellon University, 1999.

M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP 2002. ACL, 2002.

M. Collins and N. Duffy. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proc. 40th ACL, 2002.

J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470-1480, 1972.

S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE PAMI, 19(4):380-393, 1997.

B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.

D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. In Proc. AAAI 2000, 2000.

S. Geman and M. Johnson. Dynamic programming for parsing and estimation of stochastic unification-based grammars. In Proc. 40th ACL, 2002.

L. Gillick and S. Cox. Some statistical issues in the comparison of speech recognition algorithms. In International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 532-535, 1989.

J. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.

T. Kudo and Y. Matsumoto. Chunking with support vector machines. In Proc. NAACL 2001. ACL, 2001.

J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225-242, 1992.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML-01, pages 282-289, 2001.

R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proc. CoNLL-2002, 2002.

A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML 2000, pages 591-598, Stanford, California, 2000.

T. P. Minka. Algorithms for maximum-likelihood logistic regression. Technical Report 758, CMU Statistics Department, 2001.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.

V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In NIPS 13, pages 995-1001. MIT Press, 2001.

L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. In Proc. Third Workshop on Very Large Corpora. ACL, 1995.

A. Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Proc. EMNLP, New Brunswick, New Jersey, 1996. ACL.

A. Ratnaparkhi. A linear observed time statistical parser based on maximum entropy models. In C. Cardie and R. Weischedel, editors, EMNLP-2. ACL, 1997.

S. Riezler, T. H. King, R. M. Kaplan, R. Crouch, J. T. Maxwell III, and M. Johnson. Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proc. 40th ACL, 2002.

E. F. T. K. Sang. Memory-based shallow parsing. Journal of Machine Learning Research, 2:559-594, 2002.

J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994. URL http://www-2.cs.cmu.edu/~jrs/jrspapers.html#cg.

B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Eighteenth Conference on Uncertainty in Artificial Intelligence, 2002.

E. F. Tjong Kim Sang and S. Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. CoNLL-2000, pages 127-132, 2000.

H. Wallach. Efficient training of conditional random fields. In Proc. 6th Annual CLUK Research Colloquium, 2002.

A. Yeh. More accurate tests for the statistical significance of result differences. In COLING-2000, pages 947-953, Saarbruecken, Germany, 2000.

T. Zhang, F. Damerau, and D. Johnson. Text chunking based on a generalization of winnow. Journal of Machine Learning Research, 2:615-637, 2002.