Abstract
Identifying discourse relations in a text is essential for various tasks in Natural Language Processing, such as automatic text summarization, question-answering, and dialogue generation. The first step of this process is segmenting a text into elementary units. In this paper, we present a novel model of discourse segmentation based on sequential data labeling. Namely, we use Conditional Random Fields to train a discourse segmenter on the RST Discourse Treebank, using a set of lexical and syntactic features. Our system is compared to other statistical and rule-based segmenters, including one based on Support Vector Machines. Experimental results indicate that our sequential model outperforms current state-of-the-art discourse segmenters, with an F-score of 0.94. This performance level is close to the human agreement F-score of 0.98.
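To make the sequence-labeling framing concrete, the sketch below treats segmentation as assigning one label per token and then scores predicted boundaries with precision, recall, and F-score. It is an illustration only, not the authors' implementation: the two-label scheme (`B` for a token that begins a unit, `C` for a token that continues one), the choice to ignore the sentence-initial boundary when scoring, the function names, and the example sentence with its labels are all assumptions made here. No CRF is trained; the predicted labels are written by hand to stand in for the output of a sequence labeler.

```python
from typing import List, Tuple


def segments_from_labels(tokens: List[str], labels: List[str]) -> List[List[str]]:
    """Group tokens into units; a 'B' label starts a new unit (assumed scheme)."""
    segments: List[List[str]] = []
    for token, label in zip(tokens, labels):
        if label == "B" or not segments:
            segments.append([token])      # open a new unit
        else:
            segments[-1].append(token)    # extend the current unit
    return segments


def boundary_f_score(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F-score over sentence-internal boundaries.

    Position 0 is skipped because the first token always begins a unit
    (an assumption of this sketch, not a detail taken from the paper).
    """
    gold_b = {i for i, label in enumerate(gold) if label == "B" and i > 0}
    pred_b = {i for i, label in enumerate(pred) if label == "B" and i > 0}
    true_pos = len(gold_b & pred_b)
    precision = true_pos / len(pred_b) if pred_b else 0.0
    recall = true_pos / len(gold_b) if gold_b else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical example sentence with hand-written labels.
tokens = "The bank said it would cut rates because inflation fell .".split()
gold = ["B", "C", "C", "B", "C", "C", "C", "B", "C", "C", "C"]
pred = ["B", "C", "C", "C", "C", "C", "C", "B", "C", "C", "C"]  # misses one boundary

gold_segments = segments_from_labels(tokens, gold)
precision, recall, f_score = boundary_f_score(gold, pred)

for segment in gold_segments:
    print(" ".join(segment))
print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
```

In this toy case the prediction finds one of the two internal boundaries and proposes no spurious ones, so precision is 1.0, recall is 0.5, and the F-score is about 0.67. The F-scores of 0.94 and 0.98 reported in the abstract are the same kind of harmonic mean, computed over a full test set.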
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hernault, H., Bollegala, D., Ishizuka, M. (2010). A Sequential Model for Discourse Segmentation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2010. Lecture Notes in Computer Science, vol 6008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12116-6_26
DOI: https://doi.org/10.1007/978-3-642-12116-6_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12115-9
Online ISBN: 978-3-642-12116-6
eBook Packages: Computer Science (R0)