Abstract
Syntactic chunking has been a well-defined and well-studied task since its introduction in 2000 as the CoNLL shared task. Although further effort has gone into improving chunking performance, the experimental data has been restricted, with few exceptions, to (part of) the Wall Street Journal data adopted in the shared task. It remains open how these successful chunking technologies can be extended to other data, which may differ in genre/domain and/or amount of annotation. In this paper we first train chunkers with three classifiers on three different data sets and test them on four data sets. We also vary the size of the training data systematically to show the data requirements of chunkers. Three findings emerge: there is no significant difference between these state-of-the-art classifiers; training on plentiful data from the same corpus (Switchboard) yields results comparable to those of Wall Street Journal chunkers, even when the underlying material is spoken; and the results obtained from a large amount of unmatched training data can be matched using a very modest amount of matched training data.
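As an illustration of how chunkers of this kind are typically compared, the sketch below scores predicted chunk tags against gold tags at the chunk level. It assumes the B-/I-/O tag encoding and the chunk-level F1 measure associated with the CoNLL-2000 shared task; the abstract does not state which metric the paper reports, and the function names `extract_chunks` and `chunk_f1` are hypothetical, chosen only for this example.

```python
from typing import List, Set, Tuple

# A chunk is (chunk type, start index, end index exclusive).
Chunk = Tuple[str, int, int]


def extract_chunks(tags: List[str]) -> Set[Chunk]:
    """Collect chunks from one sentence of B-/I-/O tags (assumed encoding)."""
    chunks: Set[Chunk] = set()
    start, ctype = None, None
    # A sentinel "O" closes any chunk still open at the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and ctype == tag[2:]
        if ctype is not None and not continues:
            chunks.add((ctype, start, i))
            start, ctype = None, None
        # "B-X" always opens a chunk; a stray "I-X" is treated leniently
        # as opening one too.
        if tag.startswith("B-") or (tag.startswith("I-") and ctype is None):
            start, ctype = i, tag[2:]
    return chunks


def chunk_f1(gold_sents: List[List[str]], pred_sents: List[List[str]]) -> float:
    """Chunk-level F1: a predicted chunk counts only if type and span match."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_chunks(gold), extract_chunks(pred)
        correct += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [["B-NP", "I-NP", "B-VP", "B-NP", "O"]]
    pred = [["B-NP", "I-NP", "B-VP", "O", "O"]]
    # Two of three gold chunks are recovered, with no spurious predictions.
    print(round(chunk_f1(gold, pred), 3))
```

Under these assumptions, the same scoring function could be applied to every pairing of training corpus and test corpus, and at each training-set size, so that matched and unmatched conditions are compared on equal terms.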
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xu, W., Carletta, J., Moore, J. (2006). Syntactic Chunking Across Different Corpora. In: Renals, S., Bengio, S., Fiscus, J.G. (eds) Machine Learning for Multimodal Interaction. MLMI 2006. Lecture Notes in Computer Science, vol 4299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965152_15
DOI: https://doi.org/10.1007/11965152_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69267-6
Online ISBN: 978-3-540-69268-3
eBook Packages: Computer Science, Computer Science (R0)