Abstract
In this paper, we present a corpus-based approach to tagging and chunking. The formalism is based on stochastic finite-state automata, so it can incorporate n-gram models or any stochastic finite-state automaton learnt with grammatical inference techniques. Because the models in our system are learnt automatically, the system is flexible and portable across languages and chunk definitions. To show the viability of our approach, we present tagging and chunking results for different combinations of bigrams and more complex automata learnt by means of the Error Correcting Grammatical Inference (ECGI) algorithm. The experiments were carried out on the Wall Street Journal corpus for English and on the Lexesp corpus for Spanish.
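As a rough illustration of the simplest model family mentioned above, the following sketch trains a bigram tagger from a tagged corpus and decodes with the Viterbi algorithm. It is not the system described in the paper: the toy corpus, the add-one smoothing, and all function names (`train_bigram_tagger`, `log_prob`, `viterbi`) are assumptions made for this example, and neither chunking nor ECGI-learnt automata are covered.

```python
import math
from collections import defaultdict

# Toy tagged sentences, invented for illustration (not taken from WSJ or Lexesp).
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
START = "<s>"  # hypothetical start-of-sentence marker


def train_bigram_tagger(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sent in sentences:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_prob(counts, context, event, n_events):
    """Add-one smoothed log P(event | context); smoothing is an assumption."""
    row = counts.get(context, {})
    return math.log((row.get(event, 0) + 1) / (sum(row.values()) + n_events))


def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence under the bigram model."""
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unknown words
    # best[i][t] = (log score of the best path ending in tag t at i, previous tag)
    best = [{}]
    for t in tags:
        score = log_prob(trans, START, t, n_tags) + log_prob(emit, t, words[0], n_words)
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + log_prob(trans, p, t, n_tags)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = (score + log_prob(emit, t, words[i], n_words), prev)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


model = train_bigram_tagger(TRAIN)
print(viterbi(["the", "dog", "sleeps"], *model))
```

In this sketch the tags play the role of automaton states and the smoothed relative frequencies play the role of transition and emission probabilities. A richer automaton could replace the bigram transition table without changing the decoding step.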
This work has been supported by the Spanish Research Project TIC97-0671-C02-01/02.
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pla, F., Molina, A., Prieto, N. (2000). An Integrated Statistical Model for Tagging and Chunking Unrestricted Text. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2000. Lecture Notes in Computer Science, vol 1902. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45323-7_3
DOI: https://doi.org/10.1007/3-540-45323-7_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41042-3
Online ISBN: 978-3-540-45323-9
eBook Packages: Springer Book Archive