Abstract
This article presents a minimally supervised approach to question classification on fine-grained taxonomies. We define an algorithm that automatically obtains a list of weighted terms for each class in the taxonomy, identifying the terms that are both highly related to a class and highly discriminative between classes. These lists are then applied to the task of question classification. The approach is based on the divergence of probability distributions of terms in plain text retrieved from the Web, so no corpus of questions is needed to train the classifier. Because the system relies purely on statistical information, it requires no additional linguistic resources or tools. The experiments were performed on English questions and their Spanish translations. The results show that our system surpasses current supervised approaches in this task, obtaining a significant improvement in the experiments carried out.
Notes
Text Retrieval Conference: http://trec.nist.gov/.
Cross Language Evaluation Forum: http://clef-campaign.org/.
NII-NACSIS Test Collection for IR Systems: http://research.nii.ac.jp/ntcir/.
Text Analysis Conference: http://www.nist.gov/tac/.
In our experiments, we extended the concept of term to unigrams, bigrams, and trigrams.
We employed Yahoo! Search BOSS: http://developer.yahoo.com/search/boss/.
Binary logarithms were used in our experiments.
The value \(-1\) is not included in the interval because it always produces a negative \(\omega \).
Freely available at http://trec.nist.gov/data/qa.html.
All the sets of questions and seeds employed in the evaluation are available at http://www.dlsi.ua.es/~dtomas/resources/.
We employed the Apache Lucene search engine: http://lucene.apache.org.
The set of folds was the same as that used in the experiments with SVM.
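Two of the notes above (terms extended to unigrams, bigrams, and trigrams; binary logarithms) can be made concrete with a minimal Python sketch. It is an illustration only, not the authors' implementation: the choice of Jensen-Shannon divergence as the divergence measure, the maximum-likelihood estimate of term probabilities, and the function names `ngrams`, `distribution`, and `js_divergence` are all assumptions introduced here.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return every unigram, bigram, and trigram (up to max_n) as a string."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def distribution(terms):
    """Maximum-likelihood probability distribution over a list of terms (assumed estimator)."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (assumed measure) using binary logarithms.

    With base-2 logarithms the value lies between 0 (identical
    distributions) and 1 (distributions with no shared terms).
    """
    vocab = set(p) | set(q)
    # Mixture distribution: the average of p and q over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m, in bits.
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical snippets standing in for text retrieved from the Web.
text_a = "the actor won the award".split()
text_b = "the river flows to the sea".split()
p = distribution(ngrams(text_a))
q = distribution(ngrams(text_b))
print(js_divergence(p, q))
```

A divergence close to 1 indicates that two sets of text share few terms, which is the kind of signal a term-weighting scheme based on distribution divergence could exploit to find discriminative terms.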
Acknowledgments
This research has been partially funded by the Spanish Government under project TEXTMESS 2.0 (TIN2009-13391-C04-01) and by the University of Alicante under project GRE10-33.
Cite this article
Tomás, D., Vicedo, J.L. Minimally supervised question classification on fine-grained taxonomies. Knowl Inf Syst 36, 303–334 (2013). https://doi.org/10.1007/s10115-012-0557-y
DOI: https://doi.org/10.1007/s10115-012-0557-y