Shallow Parsing With Conditional Random Fields
correlated features of the inputs, and they are trained discriminatively. But like generative models, they can trade off decisions at different sequence positions to obtain a globally optimal labeling. Lafferty et al. (2001) showed that CRFs beat related classification models as well as HMMs on synthetic data and on a part-of-speech tagging task.

In the present work, we show that CRFs beat all reported single-model NP chunking results on the standard evaluation dataset, and are statistically indistinguishable from the previous best performer, a voting arrangement of 24 forward- and backward-looking support-vector classifiers (Kudo and Matsumoto, 2001). To obtain these results, we had to abandon the original iterative scaling CRF training algorithm for convex optimization algorithms with better convergence properties. We provide detailed comparisons between training methods.

The generalized perceptron proposed by Collins (2002) is closely related to CRFs, but the best CRF training methods seem to have a slight edge over the generalized perceptron.

2 Conditional Random Fields

We focus here on conditional random fields on sequences, although the notion can be used more generally (Lafferty et al., 2001; Taskar et al., 2002). Such CRFs define conditional probability distributions p(Y|X) of label sequences given input sequences. We assume that the random variable sequences X and Y have the same length, and use x = x_1 ⋯ x_n and y = y_1 ⋯ y_n for the generic input sequence and label sequence, respectively.

A CRF on (X, Y) is specified by a vector f of local features and a corresponding weight vector λ. Each local feature is either a state feature s(y, x, i) or a transition feature t(y, y′, x, i), where y, y′ are labels, x an input sequence, and i an input position. To make the notation uniform, we write both kinds of local feature as f(y_{i−1}, y_i, x, i) and collect them into the global feature vector

\[ F(y, x) = \sum_i f(y_{i-1}, y_i, x, i) \]

where i ranges over input positions. The conditional probability distribution defined by the CRF is then

\[ p_\lambda(Y \mid X) = \frac{\exp\left(\lambda \cdot F(Y, X)\right)}{Z_\lambda(X)} \qquad (1) \]

where

\[ Z_\lambda(x) = \sum_y \exp\left(\lambda \cdot F(y, x)\right) \]

Any positive conditional distribution p(Y|X) that obeys the Markov property

\[ p(Y_i \mid \{Y_j\}_{j \neq i}, X) = p(Y_i \mid Y_{i-1}, Y_{i+1}, X) \]

can be written in the form (1) for an appropriate choice of feature functions and weight vector (Hammersley and Clifford, 1971).

The most probable label sequence for input sequence x is

\[ \hat{y} = \arg\max_y p_\lambda(y \mid x) = \arg\max_y \lambda \cdot F(y, x) \]

because Z_λ(x) does not depend on y. F(y, x) decomposes into a sum of terms for consecutive pairs of labels, so the most likely y can be found with the Viterbi algorithm.
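To make the decomposition concrete, here is a minimal sketch of Viterbi decoding (not from the paper; plain NumPy, assuming the local scores λ · f(y′, y, x, i) have already been collected into an array `score`, and that label index 0 serves as the start state):

```python
import numpy as np

def viterbi(score):
    """Most probable label sequence under a linear-chain model.

    score[i, y_prev, y] is assumed to hold lambda . f(y_prev, y, x, i),
    the local log-score for labeling position i with y when position
    i-1 is labeled y_prev; row y_prev = 0 of score[0] plays the role
    of the start state.
    """
    n, num_labels, _ = score.shape
    delta = np.full((n, num_labels), -np.inf)  # best score of a labeling of x[0..i] ending in y
    back = np.zeros((n, num_labels), dtype=int)
    delta[0] = score[0, 0]                     # transitions out of the start state
    for i in range(1, n):
        cand = delta[i - 1][:, None] + score[i]  # cand[y_prev, y]
        back[i] = cand.argmax(axis=0)
        delta[i] = cand.max(axis=0)
    y = [int(delta[n - 1].argmax())]             # best final label
    for i in range(n - 1, 0, -1):                # follow backpointers
        y.append(int(back[i, y[-1]]))
    return y[::-1]
```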
We train a CRF by maximizing the log-likelihood of a given training set T = {(x_k, y_k)}_{k=1}^N, which we assume fixed for the rest of this section:

\[ L_\lambda = \sum_k \log p_\lambda(y_k \mid x_k) = \sum_k \left[ \lambda \cdot F(y_k, x_k) - \log Z_\lambda(x_k) \right] \]

To perform this optimization, we seek the zero of the gradient

\[ \nabla L_\lambda = \sum_k \left[ F(y_k, x_k) - E_{p_\lambda(Y \mid x_k)} F(Y, x_k) \right] \qquad (2) \]

In words, the maximum of the training data likelihood is reached when the empirical average of the global feature vector equals its model expectation. The expectation E_{p_λ(Y|x)} F(Y, x) can be computed efficiently using a variant of the forward-backward algorithm. For a given x, define the transition matrix for position i as

\[ M_i[y, y'] = \exp\left(\lambda \cdot f(y, y', x, i)\right) \]

Let f be any local feature, f_i[y, y′] = f(y, y′, x, i), F(y, x) = Σ_i f(y_{i−1}, y_i, x, i), and let ∗ denote component-wise matrix product. Then

\[ E_{p_\lambda(Y \mid x)} F(Y, x) = \sum_y p_\lambda(y \mid x) F(y, x) = \sum_i \frac{\alpha_{i-1} (f_i \ast M_i)\, \beta_i^\top}{Z_\lambda(x)} \]

\[ Z_\lambda(x) = \alpha_n \cdot \mathbf{1}^\top \]

where α_i and β_i are the forward and backward state-cost vectors defined by

\[ \alpha_i = \begin{cases} \alpha_{i-1} M_i & 0 < i \le n \\ \mathbf{1} & i = 0 \end{cases} \qquad \beta_i^\top = \begin{cases} M_{i+1} \beta_{i+1}^\top & 1 \le i < n \\ \mathbf{1} & i = n \end{cases} \]

Therefore, we can use a forward pass to compute the α_i and a backward pass to compute the β_i and accumulate the feature expectations.
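As an illustration (a simplified NumPy sketch, not the authors' implementation): given the transition matrices M_i for one sequence, the α/β recursions and the pairwise marginals they induce are enough to accumulate any local-feature expectation. No log-space scaling is done here, which a real trainer would need for long sequences.

```python
import numpy as np

def forward_backward(M):
    """alpha/beta recursions for one input sequence.

    M[i - 1] is the transition matrix M_i[y, y'] = exp(lambda . f(y, y', x, i)).
    Returns alpha (rows 0..n), beta (rows 0..n) and Z_lambda(x) = alpha_n . 1.
    """
    n, num_labels, _ = M.shape
    alpha = np.zeros((n + 1, num_labels))
    beta = np.zeros((n + 1, num_labels))
    alpha[0] = 1.0                            # alpha_0 = 1
    beta[n] = 1.0                             # beta_n = 1
    for i in range(1, n + 1):
        alpha[i] = alpha[i - 1] @ M[i - 1]    # alpha_i = alpha_{i-1} M_i
    for i in range(n - 1, 0, -1):
        beta[i] = M[i] @ beta[i + 1]          # beta_i = M_{i+1} beta_{i+1}
    return alpha, beta, alpha[n].sum()

def expected_feature(M, f, alpha, beta, Z):
    """E_{p_lambda(Y|x)} F_f(Y, x) for one local feature, f[i - 1, y, y'] = f(y, y', x, i)."""
    total = 0.0
    for i in range(M.shape[0]):
        # pairwise marginal of the label pair at this position:
        # alpha_{i-1}[y] M_i[y, y'] beta_i[y'] / Z_lambda(x)
        marginal = np.outer(alpha[i], beta[i + 1]) * M[i] / Z
        total += (marginal * f[i]).sum()
    return total
```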
To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior (Chen and Rosenfeld, 1999):

\[ L'_\lambda = \sum_k \left[ \lambda \cdot F(y_k, x_k) - \log Z_\lambda(x_k) \right] - \frac{\|\lambda\|^2}{2\sigma^2} + \text{const} \]

with gradient

\[ \nabla L'_\lambda = \sum_k \left[ F(y_k, x_k) - E_{p_\lambda(Y \mid x_k)} F(Y, x_k) \right] - \frac{\lambda}{\sigma^2} \]
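For the gradient-based trainers in the next section it is convenient to package L′_λ and ∇L′_λ into one routine. The following sketch assumes a hypothetical helper counts_fn that runs forward-backward over the training set and returns the total log Z, the summed empirical feature vector, and the summed expected feature vector; the sign is flipped so that a generic minimizer can be used.

```python
import numpy as np

def neg_penalized_loglik(lam, data, sigma2, counts_fn):
    """Negative penalized log-likelihood -L'_lambda and its gradient.

    counts_fn(lam, data) is a hypothetical helper returning
      log_Z     = sum_k log Z_lambda(x_k),
      empirical = sum_k F(y_k, x_k),
      expected  = sum_k E_{p_lambda(Y|x_k)} F(Y, x_k).
    """
    log_Z, empirical, expected = counts_fn(lam, data)
    value = -(empirical @ lam - log_Z - lam @ lam / (2.0 * sigma2))
    grad = -(empirical - expected - lam / sigma2)
    return value, grad
```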
3 Training Methods

Lafferty et al. (2001) used iterative scaling algorithms for CRF training, following earlier work on maximum-entropy models for natural language (Berger et al., 1996; Della Pietra et al., 1997). Those methods are very simple and guaranteed to converge, but as Minka (2001) and Malouf (2002) showed for classification, their convergence is much slower than that of general-purpose convex optimization algorithms when many correlated features are involved. Concurrently with the present work, Wallach (2002) tested conjugate-gradient and second-order methods for CRF training, showing significant training-speed advantages over iterative scaling on a small shallow parsing problem. Our work shows that preconditioned conjugate gradient (CG) (Shewchuk, 1994) and limited-memory quasi-Newton (L-BFGS) (Nocedal and Wright, 1999) perform comparably on very large problems (around 3.8 million features). We compare those algorithms to generalized iterative scaling (GIS) (Darroch and Ratcliff, 1972), non-preconditioned CG, and voted perceptron training (Collins, 2002). All algorithms except voted perceptron maximize the penalized log-likelihood: λ* = arg max_λ L′_λ. However, for ease of exposition, this discussion of training methods uses the unpenalized log-likelihood L_λ.

3.1 Preconditioned Conjugate Gradient

Conjugate-gradient (CG) methods have been shown to be very effective in linear and non-linear optimization (Shewchuk, 1994). Instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction.

CG methods can be accelerated by linearly transforming the variables with a preconditioner (Nocedal and Wright, 1999; Shewchuk, 1994). The purpose of the preconditioner is to improve the condition number of the quadratic form that locally approximates the objective function, so the inverse of the Hessian is a reasonable preconditioner. However, this is not applicable to CRFs for two reasons. First, the size of the Hessian is dim(λ)^2, leading to unacceptable space and time requirements for the inversion. In such situations, it is common to use instead the (inverse of the) diagonal of the Hessian. However, in our case the Hessian has the form

\[ H_\lambda \stackrel{\text{def}}{=} \nabla^2 L_\lambda = - \sum_k \left\{ E\left[ F(Y, x_k) \times F(Y, x_k) \right] - E F(Y, x_k) \times E F(Y, x_k) \right\} \]

where the expectations are taken with respect to p_λ(Y|x_k). Therefore, every Hessian element, including the diagonal ones, involves the expectation of a product of global feature values. Unfortunately, computing those expectations is quadratic in sequence length, as the forward-backward algorithm can only compute expectations of quantities that are additive along label sequences.

We solve both problems by discarding the off-diagonal terms and approximating the expectation of the square of a global feature by the expectation of the sum of squares of the corresponding local features at each position. The approximated diagonal term H_f for feature f has the form

\[ H_f = E\, f(Y, x_k)^2 - \left( \sum_i \sum_{y, y'} f(y, y', x_k, i)\, \frac{M_i[y, y']}{Z_\lambda(x_k)} \right)^2 \]

If this approximation is semidefinite, which is trivial to check, its inverse is an excellent preconditioner for early iterations of CG training. However, when the model is close to the maximum, the approximation becomes unstable, which is not surprising since it is based on feature-independence assumptions that become invalid as the weights of interaction features move away from zero. Therefore, we disable the preconditioner after a certain number of iterations, determined from held-out data. We call this strategy mixed CG training.
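The diagonal approximation can be sketched as follows (dense arrays for clarity only; in practice both the features and the marginals are sparse, and this sketch only mirrors the verbal description above: the square of the global feature is replaced by the sum of squares of its local values, while the squared expectation is kept).

```python
import numpy as np

def approx_diagonal(marginals, feats):
    """Approximate diagonal curvature term, one value per feature (sketch).

    marginals[k][i, y, y'] : pairwise marginal p(Y_{i-1} = y, Y_i = y' | x_k)
    feats[k][i, y, y', f]  : local feature value f(y, y', x_k, i)
    For each feature, computes E[sum_i f_i^2] - (E[F_f])^2 summed over
    training sequences; its inverse serves as the CG preconditioner while
    the approximation stays well behaved.
    """
    diag = np.zeros(feats[0].shape[-1])
    for marg, f in zip(marginals, feats):
        w = marg[..., None]                          # broadcast marginal over features
        e_sum_sq = (w * f ** 2).sum(axis=(0, 1, 2))  # E[sum_i f(Y_{i-1}, Y_i, x_k, i)^2]
        e_f = (w * f).sum(axis=(0, 1, 2))            # E[F_f(Y, x_k)]
        diag += e_sum_sq - e_f ** 2
    return diag
```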
3.2 Limited-Memory Quasi-Newton

Newton methods for nonlinear optimization use second-order (curvature) information to find search directions. As discussed in the previous section, it is not practical to obtain exact curvature information for CRF training. Limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature numerically from previous gradients and updates, avoiding the need for an exact Hessian inverse computation. Compared with preconditioned CG, L-BFGS can also handle large-scale problems, but it does not require a specialized Hessian approximation. An earlier study indicates that L-BFGS performs well in maximum-entropy classifier training (Malouf, 2002).

There is no theoretical guidance on how much information from previous steps we should keep to obtain sufficiently accurate curvature estimates. In our experiments, storing 3 to 10 pairs of previous gradients and updates worked well, so the extra memory required over preconditioned CG was modest. A more detailed description of this method can be found elsewhere (Nocedal and Wright, 1999).
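In practice one can rely on an off-the-shelf L-BFGS implementation; for example, SciPy's minimizer accepts an objective that returns both the value and the gradient, and lets the number of stored gradient/update pairs be set directly. A sketch, reusing the hypothetical neg_penalized_loglik, counts_fn, data and feature dimension dim from the earlier sketches:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize the negative penalized log-likelihood with L-BFGS, storing a few
# past (gradient, update) pairs as discussed above. All names other than the
# SciPy API are assumptions carried over from the earlier sketches.
result = minimize(
    neg_penalized_loglik,
    x0=np.zeros(dim),                  # start from lambda = 0
    args=(data, 10.0, counts_fn),      # sigma^2 = 10.0 is an arbitrary choice here
    jac=True,                          # the objective returns (value, gradient)
    method="L-BFGS-B",
    options={"maxcor": 7},             # number of stored gradient/update pairs
)
lam_hat = result.x
```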
3.3 Voted Perceptron

Unlike the other methods discussed so far, voted perceptron training (Collins, 2002) attempts to minimize the difference between the global feature vector for a training instance and the same feature vector for the best-scoring labeling of that instance according to the current model. More precisely, for each training instance the method computes a weight update

\[ \lambda_{t+1} = \lambda_t + F(y_k, x_k) - F(\hat{y}_k, x_k) \qquad (3) \]

in which ŷ_k is the Viterbi path

\[ \hat{y}_k = \arg\max_y \lambda_t \cdot F(y, x_k) \]

Like the familiar perceptron algorithm, this algorithm repeatedly sweeps over the training instances, updating the weight vector as it considers each instance. Instead of taking just the final weight vector, the voted perceptron algorithm takes the average of the λ_t. Collins (2002) reported, and we confirmed, that this averaging reduces overfitting considerably.
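A sketch of the training loop (hypothetical helpers: global_F(y, x) returns the global feature vector F(y, x), and decode(lam, x) is a Viterbi decoder such as the one sketched in Section 2):

```python
import numpy as np

def averaged_perceptron(train, global_F, decode, dim, epochs=10):
    """Collins-style perceptron training with weight averaging (sketch).

    train    : list of (x, y) training pairs
    global_F : global_F(y, x) -> feature vector F(y, x) of length dim
    decode   : decode(lam, x) -> highest-scoring labeling under weights lam
    """
    lam = np.zeros(dim)
    lam_sum = np.zeros(dim)
    t = 0
    for _ in range(epochs):
        for x, y in train:
            y_hat = decode(lam, x)
            # update rule (3); it is a no-op when the Viterbi labeling is correct
            lam = lam + global_F(y, x) - global_F(y_hat, x)
            lam_sum += lam
            t += 1
    return lam_sum / t   # average of the lambda_t, which reduces overfitting
```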
4 Shallow Parsing

Figure 1 shows the base NPs in an example sentence. Following Ramshaw and Marcus (1995), the input to the NP chunker consists of the words in a sentence annotated automatically with part-of-speech (POS) tags. The chunker's task is to label each word with a label indicating whether the word is outside a chunk (O), starts a chunk (B), or continues a chunk (I). For example, the tokens in the first line of Figure 1 would be labeled BIIBIIOBOBIIO.

4.1 Data Preparation

NP chunking results have been reported on two slightly different data sets: the original RM data set of Ramshaw and Marcus (1995), and the modified CoNLL-2000 version of Tjong Kim Sang and Buchholz (2000). Although the chunk tags in the RM and CoNLL-2000 data sets are somewhat different, we found no significant accuracy differences between models trained on these two data sets. Therefore, all our results are reported on the CoNLL-2000 data set. We also used a development test set, provided by Michael Collins, derived from WSJ section 21 tagged with the Brill (1995) POS tagger.

4.2 CRFs for Shallow Parsing

Our chunking CRFs have a second-order Markov dependency between chunk tags. This is easily encoded by making the CRF labels pairs of consecutive chunk tags. That is, the label at position i is y_i = c_{i−1}c_i, where c_i is the chunk tag of word i, one of O, B, or I. Since B must be used to start a chunk, the label OI is impossible. In addition, successive labels are constrained: y_{i−1} = c_{i−2}c_{i−1}, y_i = c_{i−1}c_i, and c_0 = O. These constraints on the model topology are enforced by giving appropriate features a weight of −∞, forcing all the forbidden labelings to have zero probability.
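One way to realize this encoding is sketched below (not the authors' code): enumerate the chunk-tag pairs as CRF labels and mark every transition that violates the overlap constraint; the features selecting those transitions, and the labels that cannot start a sequence, then receive weight −∞ (or are simply pruned).

```python
# Chunk tags and the pair labels y_i = c_{i-1} c_i used by the second-order CRF.
CHUNK_TAGS = ["O", "B", "I"]
LABELS = [a + b for a in CHUNK_TAGS for b in CHUNK_TAGS if a + b != "OI"]  # OI cannot occur

# Labels allowed at the first position (c_0 = O, so the pair must start with O).
START_LABELS = [lab for lab in LABELS if lab[0] == "O"]

def forbidden(prev_label, label):
    """True if prev_label -> label violates the topology: successive labels
    y_{i-1} = c_{i-2} c_{i-1} and y_i = c_{i-1} c_i must agree on c_{i-1}."""
    return prev_label[1] != label[0]

# Transitions whose features get weight -infinity (zero probability).
BLOCKED = [(p, q) for p in LABELS for q in LABELS if forbidden(p, q)]
```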
Our choice of features was mainly governed by computing power, since we do not use feature selection and all features are used in training and testing. We use the following factored representation for features:

\[ f(y_{i-1}, y_i, x, i) = p(x, i)\, q(y_{i-1}, y_i) \qquad (4) \]

where p(x, i) is a predicate on the input sequence x and current position i, and q(y_{i−1}, y_i) is a predicate on pairs of labels. For instance, p(x, i) might be “word at position i is the” or “the POS tags at positions i − 1, i are …”.
Figure 1: NP chunks in the example sentence “Rockwell International Corp. ’s Tulsa unit said it signed a tentative agreement extending its contract with Boeing Co. to provide structural parts for Boeing ’s 747 jetliners .”
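To make the factored representation in (4) concrete, here is a small sketch (hypothetical predicate names and string encodings) that conjoins input predicates p(x, i) with label-pair predicates q(y_{i−1}, y_i) to produce local features:

```python
def input_predicates(x, i):
    """Example p(x, i) predicates for a token sequence x = [(word, pos), ...]."""
    word, pos = x[i]
    preds = ["w=" + word.lower(), "pos=" + pos]
    if i > 0:
        preds.append("pos[-1]|pos=" + x[i - 1][1] + "|" + pos)
    return preds

def local_features(x, i, y_prev, y):
    """Factored features f(y_{i-1}, y_i, x, i) = p(x, i) q(y_{i-1}, y_i):
    each active input predicate is conjoined with the label pair."""
    q = "y[-1]|y=" + y_prev + "|" + y
    return [p + "&" + q for p in input_predicates(x, i)]
```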
Figure 2: Penalized log-likelihood L′_λ vs. number of forward-backward evaluations. (a) CG (precond., mixed), L-BFGS; (b) CG (precond., plain), GIS.
Figure 3: Test F scores vs. training time (number of forward-backward evaluations), for preconditioned CG, CG without preconditioner, and GIS.

One such test is the McNemar test on paired observations (Gillick and Cox, 1989).

With McNemar's test, we compare the correctness of the labeling decisions of two models. The null hypothesis is that the disagreements (correct vs. incorrect) are due to chance. Table 4 summarizes the results of tests between the models for which we had labeling decisions. These tests suggest that MEMMs are significantly less accurate, but that there are no significant differences in accuracy among the other models.
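A sketch of the computation (one standard variant with a continuity correction; the test uses only the two discordant counts, i.e. tokens where exactly one of the two models is correct):

```python
from scipy.stats import chi2

def mcnemar(gold, pred_a, pred_b):
    """McNemar's test on paired per-token labeling decisions (sketch)."""
    # b: model A correct, model B wrong; c: model B correct, model A wrong.
    b = sum(1 for g, a, m in zip(gold, pred_a, pred_b) if a == g and m != g)
    c = sum(1 for g, a, m in zip(gold, pred_a, pred_b) if m == g and a != g)
    stat = (abs(b - c) - 1) ** 2 / float(b + c)   # continuity-corrected statistic
    p_value = chi2.sf(stat, df=1)                 # chi-square reference, 1 degree of freedom
    return stat, p_value
```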
6 Conclusions

We have shown that (log-)linear sequence labeling models trained discriminatively with general-purpose optimization methods are a simple, competitive solution to learning shallow parsers. These models combine the best of generative and discriminative approaches, and there is no reason why the same techniques cannot be used equally successfully for other phrase types or for other related tasks, such as POS tagging or named-entity recognition.

On the machine-learning side, it would be interesting to generalize the ideas of large-margin classification to sequence models, strengthening the results of Collins (2002) and leading to new optimal training algorithms with stronger guarantees against overfitting.

On the application side, (log-)linear parsing models have the potential to supplant the currently dominant lexicalized PCFG models for parsing by allowing much richer feature sets and simpler smoothing, while avoiding the label bias problem that may have hindered earlier classifier-based parsers (Ratnaparkhi, 1997). However, work in that direction has so far addressed only parse reranking (Collins and Duffy, 2002; Riezler et al., 2002). Full discriminative parser training faces significant algorithmic challenges in the relationship between parsing alternatives and feature values (Geman and Johnson, 2002) and in computing feature expectations.

Acknowledgments

John Lafferty and Andrew McCallum worked with the second author on developing CRFs. McCallum, helped by the second author, implemented the first conjugate-gradient trainer for CRFs, which convinced us that training of large CRFs on large datasets would be practical. Michael Collins helped us reproduce his generalized perceptron results and compare his method with ours. Erik Tjong Kim Sang, who has created the best online resources on shallow parsing, helped us with details of the CoNLL-2000 shared task. Taku Kudo provided the output of his SVM chunker for the significance test.
References

S. Abney. Parsing by chunks. In R. Berwick, S. Abney, and C. Tenny, editors, Principle-based Parsing. Kluwer Academic Publishers, 1991.

S. Abney, R. E. Schapire, and Y. Singer. Boosting applied to tagging and PP attachment. In Proc. EMNLP-VLC, New Brunswick, New Jersey, 1999. ACL.

A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 1996.

D. M. Bikel, R. L. Schwartz, and R. M. Weischedel. An algorithm that learns what's in a name. Machine Learning, 34:211–231, 1999.

L. Bottou. Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole. PhD thesis, Université de Paris XI, 1991.

E. Brill. Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics, 21:543–565, 1995.

S. F. Chen and R. Rosenfeld. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Carnegie Mellon University, 1999.

M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP 2002. ACL, 2002.

M. Collins and N. Duffy. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proc. 40th ACL, 2002.

J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.

S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE PAMI, 19(4):380–393, 1997.

B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.

D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. In Proc. AAAI 2000, 2000.

S. Geman and M. Johnson. Dynamic programming for parsing and estimation of stochastic unification-based grammars. In Proc. 40th ACL, 2002.

L. Gillick and S. Cox. Some statistical issues in the comparison of speech recognition algorithms. In International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 532–535, 1989.

J. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.

T. Kudo and Y. Matsumoto. Chunking with support vector machines. In Proc. NAACL 2001. ACL, 2001.

J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225–242, 1992.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML-01, pages 282–289, 2001.

R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proc. CoNLL-2002, 2002.

A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML 2000, pages 591–598, Stanford, California, 2000.

T. P. Minka. Algorithms for maximum-likelihood logistic regression. Technical Report 758, CMU Statistics Department, 2001.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.

V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In NIPS 13, pages 995–1001. MIT Press, 2001.

L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. In Proc. Third Workshop on Very Large Corpora. ACL, 1995.

A. Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Proc. EMNLP, New Brunswick, New Jersey, 1996. ACL.

A. Ratnaparkhi. A linear observed time statistical parser based on maximum entropy models. In C. Cardie and R. Weischedel, editors, EMNLP-2. ACL, 1997.

S. Riezler, T. H. King, R. M. Kaplan, R. Crouch, J. T. Maxwell III, and M. Johnson. Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proc. 40th ACL, 2002.

E. F. T. K. Sang. Memory-based shallow parsing. Journal of Machine Learning Research, 2:559–594, 2002.

J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994. URL http://www-2.cs.cmu.edu/~jrs/jrspapers.html#cg.

B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Eighteenth Conference on Uncertainty in Artificial Intelligence, 2002.

E. F. Tjong Kim Sang and S. Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. CoNLL-2000, pages 127–132, 2000.

H. Wallach. Efficient training of conditional random fields. In Proc. 6th Annual CLUK Research Colloquium, 2002.

A. Yeh. More accurate tests for the statistical significance of result differences. In COLING-2000, pages 947–953, Saarbruecken, Germany, 2000.

T. Zhang, F. Damerau, and D. Johnson. Text chunking based on a generalization of winnow. Journal of Machine Learning Research, 2:615–637, 2002.