Abstract
In this paper, we discuss paradigms for evaluating open-domain semantic interpretation as they apply to the PASCAL Recognizing Textual Entailment (RTE) evaluation (Dagan et al. 2005). We focus on three aspects critical to a successful evaluation: creation of large quantities of reasonably good training data, analysis of inter-annotator agreement, and joint analysis of test item difficulty and test-taker proficiency (Rasch analysis). We found that although RTE does not correspond to a “real” or naturally occurring language processing task, it nonetheless provides clear and simple metrics, a tolerable cost of corpus development, good annotator reliability (with the potential to exploit the remaining variability), and the possibility of finding noisy but plentiful training material.
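As context for the agreement and difficulty analyses the abstract summarizes, here is a minimal sketch (with hypothetical data, not the paper's) of the two underlying computations: Cohen's kappa, a standard chance-corrected measure of inter-annotator agreement, and the Rasch model's item-response probability, which jointly relates test-taker proficiency and item difficulty.

```python
import math
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each annotator labeled independently
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

def rasch_p(theta, b):
    """Rasch model: probability that a test-taker with proficiency
    theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical TRUE/FALSE entailment judgments from two annotators
ann1 = ["T", "T", "F", "T", "F", "F", "T", "F"]
ann2 = ["T", "T", "F", "F", "F", "T", "T", "F"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.5
print(round(rasch_p(1.0, 0.0), 3))         # → 0.731
```

In a Rasch analysis, the proficiencies and difficulties themselves are estimated jointly from the response matrix; the function above only shows the model's core logistic form.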
References
Aberdeen, J., Condon, S., Doran, C., Harper, L., Oshika, B., Phillips, J.: Evaluation of speech-to-speech translation systems (2005) (unpublished manuscript)
Aberdeen, J., Hirschman, L., Walker, M.: Evaluation for DARPA Communicator spoken dialogue systems. In: Proceedings of the 2nd Conference on Language Resources and Evaluation (2000)
Bayer, S., Burger, J., Ferro, L., Henderson, J., Yeh, A.: MITRE’s submissions to the EU Pascal RTE challenge. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)
Bayer, S., Burger, J., Greiff, W., Wellner, B.: The MITRE logical form generation system. In: Proceedings of Senseval-3: The Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pp. 69–72 (2004)
Bond, T.G., Fox, C.M.: Applying the Rasch Model: Fundamental Measurement in the Human Sciences. University of Toledo Press (2001)
Bos, J., Markert, K.: Combining shallow and deep NLP methods for recognizing textual entailment. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)
Brachman, R.: (AA)AI: More than the sum of its parts. AAAI Presidential Address, presented at AAAI 2005 (2005)
Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation. Computational Linguistics 19 (1993)
Burger, J., Ferro, L.: Generating an entailment corpus from news headlines. In: ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, MI (2005)
Dagan, I., Glickman, O., Magnini, B.: The PASCAL recognizing textual entailment challenge. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)
Damianos, L., Wohlever, S., Kozierok, R., Ponte, J.: MiTAP for real users, real data, real problems. In: Proceedings of the Conference on Human Factors of Computing Systems, Fort Lauderdale, FL (2003)
Deshmukh, N., Duncan, R., Ganapathiraju, A., Picone, J.: Benchmarking human performance for continuous speech recognition. In: Proceedings of the Fourth International Conference on Spoken Language Processing, Philadelphia, Pennsylvania, USA, pp. 2486–2489 (1996)
Dolan, B., Brockett, C., Quirk, C.: Microsoft Research paraphrase corpus (2005), http://research.microsoft.com/research/nlp/msr_paraphrase.htm
Graff, D.: English Gigaword (2003), http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
Grishman, R., Sundheim, B.: Design of the MUC-6 evaluation. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD. NIST. Morgan Kaufmann, San Francisco (1995)
Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge (1997)
Henderson, J., Morgan, W.: Paris: an automated MT evaluation metric toolkit; and a survey of metric performance on the segment ranking task. Technical report, MITRE (2005) (to appear)
Hirschman, L.: The evolution of evaluation: Lessons from the message understanding conferences. Computer Speech and Language 12, 281–305 (1998)
Hirschman, L.: Language understanding evaluations: Lessons learned from MUC and ATIS. In: Proceedings of LREC 1998, Granada (1998)
Hirschman, L., Bates, M., Dahl, D., Fisher, W.M., Garafolo, J., Pallet, D.S., Hunicke-Smith, K., Price, P., Rudnicky, A., Tzoukermann, E.: Multisite data collection and evaluation in spoken language understanding. In: Proceedings of the DARPA Workshop on Human Language Technology, Princeton, NJ, pp. 19–24 (1993)
Hirschman, L., Light, M., Breck, E., Burger, J.D.: Deep Read: A reading comprehension system. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (1999)
Hirschman, L., Yeh, A., Blaschke, C., Valencia, A.: Overview of BioCreAtIvE: Critical assessment of information extraction for biology. BMC Bioinformatics 6(suppl. 1) (2005)
Joachims, T.: Learning to Classify Text Using Support Vector Machines. Kluwer, Dordrecht (2002)
Lange, R., Moran, J., Greiff, W., Ferro, L.: A probabilistic Rasch analysis of question answering evaluations. In: Proceedings of HLT-NAACL 2004, pp. 65–72 (2004)
Light, M., Mann, G.S., Riloff, E., Breck, E.: Analyses for elucidating current question answering technology. Natural Language Engineering 7, 325–342 (2001)
Morgan, A., Hirschman, L., Colosimo, M., Yeh, A., Colombe, J.: Gene name identification and normalization using a model organism database. Journal of Biomedical Informatics 37, 396–410 (2004)
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29 (2003)
Papineni, K., Roukos, S., Ward, T., Henderson, J., Reeder, F.: Corpus-based comprehensive and diagnostic MT evaluation: Initial Arabic, Chinese, French, and Spanish results. In: Proceedings of the 2002 Conference on Human Language Technology, San Diego, CA, pp. 124–127 (2002)
Sundheim, B.: Overview of results of the MUC-6 evaluation. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD. NIST. Morgan Kaufmann, San Francisco (1995)
Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. In: Proceedings of ICML 2000, 17th International Conference on Machine Learning (2000)
Walker, M., Aberdeen, J., Boland, J., Bratt, E., Garofolo, J., Hirschman, L., Le, A., Lee, S., Narayanan, S., Papineni, K., Pellom, B., Polifroni, J., Potamianos, A., Prabhu, P., Rudnicky, A., Sanders, G., Seneff, S., Stallard, D., Whittaker, S.: DARPA Communicator dialog travel planning systems: The June 2000 data collection. In: Proceedings of Eurospeech 2001, Aalborg, Denmark (2001)
Wellner, B., Ferro, L., Greiff, W., Hirschman, L.: Reading comprehension tests for computer-based understanding evaluation. Natural Language Engineering (2005) (to appear)
© 2006 Springer-Verlag Berlin Heidelberg
Cite this paper
Bayer, S., Burger, J., Ferro, L., Henderson, J., Hirschman, L., Yeh, A. (2006). Evaluating Semantic Evaluations: How RTE Measures Up. In: Quiñonero-Candela, J., Dagan, I., Magnini, B., d’Alché-Buc, F. (eds) Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment. MLCW 2005. Lecture Notes in Computer Science, vol. 3944. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11736790_18
Print ISBN: 978-3-540-33427-9
Online ISBN: 978-3-540-33428-6