Utility-Theoretic Ranking for Semiautomated Text Classification

Published: 22 July 2015

Abstract

Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive by validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
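To make the contrast described in the abstract concrete, the following is a minimal Python sketch, not the paper's actual formulation: it compares the confidence baseline with a utility-theoretic ranking in which each document is scored by its expected validation gain. The parameters gain_fix_false_pos and gain_fix_false_neg are illustrative placeholders; the paper instead derives the gains from the chosen effectiveness measure (e.g., F1) and the class distribution.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LabelledDoc:
    doc_id: str
    predicted_positive: bool   # binary label assigned by the automatic classifier
    confidence: float          # estimated probability that the assigned label is correct

def rank_by_lowest_confidence(docs: List[LabelledDoc]) -> List[LabelledDoc]:
    """Baseline SATC ranking: least-confident documents first."""
    return sorted(docs, key=lambda d: d.confidence)

def rank_by_expected_validation_gain(docs: List[LabelledDoc],
                                     gain_fix_false_pos: float,
                                     gain_fix_false_neg: float) -> List[LabelledDoc]:
    """Illustrative utility-theoretic ranking: score each document by
    P(assigned label is wrong) * gain of correcting that kind of error,
    then rank by decreasing expected gain."""
    def expected_gain(d: LabelledDoc) -> float:
        p_wrong = 1.0 - d.confidence
        gain = gain_fix_false_pos if d.predicted_positive else gain_fix_false_neg
        return p_wrong * gain
    return sorted(docs, key=expected_gain, reverse=True)

if __name__ == "__main__":
    docs = [
        LabelledDoc("d1", predicted_positive=True,  confidence=0.60),
        LabelledDoc("d2", predicted_positive=False, confidence=0.70),
        LabelledDoc("d3", predicted_positive=True,  confidence=0.95),
    ]
    # Baseline: least confident first -> d1, d2, d3
    print([d.doc_id for d in rank_by_lowest_confidence(docs)])
    # If correcting a false negative is worth more (e.g., a rare positive class),
    # d2 overtakes d1 despite its higher confidence -> d2, d1, d3
    print([d.doc_id for d in rank_by_expected_validation_gain(
        docs, gain_fix_false_pos=1.0, gain_fix_false_neg=4.0)])
```

The sketch shows why confidence-based ranking can be suboptimal: a document whose correction would contribute more to the chosen effectiveness measure can deserve a higher rank even when the classifier is less uncertain about it.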

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 10, Issue 1
July 2015
321 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/2808688
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 July 2015
Accepted: 01 March 2015
Revised: 01 February 2015
Received: 01 July 2014
Published in TKDD Volume 10, Issue 1


Author Tags

  1. Text classification
  2. cost-sensitive learning
  3. ranking
  4. semiautomated text classification
  5. supervised learning

Qualifiers

  • Research-article
  • Research
  • Refereed
