Utility-Theoretic Ranking for Semiautomated Text Classification

Published: 22 July 2015

Abstract

Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive by validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
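To make the contrast described in the abstract concrete, the following is a minimal Python sketch, not the paper's actual formulation: it compares the confidence baseline with a utility-theoretic ranking in which each document is scored by its expected validation gain. The parameters gain_fix_false_pos and gain_fix_false_neg are illustrative placeholders; the paper instead derives the gains from the chosen effectiveness measure (e.g., F1) and the class distribution.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LabelledDoc:
    doc_id: str
    predicted_positive: bool   # binary label assigned by the automatic classifier
    confidence: float          # estimated probability that the assigned label is correct

def rank_by_lowest_confidence(docs: List[LabelledDoc]) -> List[LabelledDoc]:
    """Baseline SATC ranking: least-confident documents first."""
    return sorted(docs, key=lambda d: d.confidence)

def rank_by_expected_validation_gain(docs: List[LabelledDoc],
                                     gain_fix_false_pos: float,
                                     gain_fix_false_neg: float) -> List[LabelledDoc]:
    """Illustrative utility-theoretic ranking: score each document by
    P(assigned label is wrong) * gain of correcting that kind of error,
    then rank by decreasing expected gain."""
    def expected_gain(d: LabelledDoc) -> float:
        p_wrong = 1.0 - d.confidence
        gain = gain_fix_false_pos if d.predicted_positive else gain_fix_false_neg
        return p_wrong * gain
    return sorted(docs, key=expected_gain, reverse=True)

if __name__ == "__main__":
    docs = [
        LabelledDoc("d1", predicted_positive=True,  confidence=0.60),
        LabelledDoc("d2", predicted_positive=False, confidence=0.70),
        LabelledDoc("d3", predicted_positive=True,  confidence=0.95),
    ]
    # Baseline: least confident first -> d1, d2, d3
    print([d.doc_id for d in rank_by_lowest_confidence(docs)])
    # If correcting a false negative is worth more (e.g., a rare positive class),
    # d2 overtakes d1 despite its higher confidence -> d2, d1, d3
    print([d.doc_id for d in rank_by_expected_validation_gain(
        docs, gain_fix_false_pos=1.0, gain_fix_false_neg=4.0)])
```

The sketch shows why confidence-based ranking can be suboptimal: a document whose correction would contribute more to the chosen effectiveness measure can deserve a higher rank even when the classifier is less uncertain about it.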

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 10, Issue 1
July 2015
321 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/2808688
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 July 2015
Accepted: 01 March 2015
Revised: 01 February 2015
Received: 01 July 2014
Published in TKDD Volume 10, Issue 1


Author Tags

  1. Text classification
  2. cost-sensitive learning
  3. ranking
  4. semiautomated text classification
  5. supervised learning

Qualifiers

  • Research-article
  • Research
  • Refereed
