Utility-Theoretic Ranking for Semi-Automated Text Classification

Berardi, Giacomo; Esuli, Andrea; Sebastiani, Fabrizio

doi:10.1145/2742548

arXiv:1503.00491 (cs)

[Submitted on 2 Mar 2015]

Abstract: Semi-Automated Text Classification (SATC) may be defined as the task of ranking a set $\mathcal{D}$ of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of $\mathcal{D}$ with the goal of increasing the overall labelling accuracy of $\mathcal{D}$, the expected increase is maximized. An obvious SATC strategy is to rank $\mathcal{D}$ so that the documents that the classifier has labelled with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method above, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
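
The contrast between the confidence-based baseline and a utility-theoretic ranking can be illustrated with a small sketch. This is not the authors' exact formulation: it assumes a single binary class, $F_1$ as the effectiveness measure, calibrated posterior probabilities, and known estimates of the contingency counts. The names `Doc`, `p_positive`, `rank_by_confidence`, and `rank_by_utility` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    predicted_positive: bool  # label assigned by the classifier
    p_positive: float         # assumed calibrated probability that the true label is positive


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from contingency counts; defined as 1.0 when there is nothing to find."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def rank_by_confidence(docs):
    """Baseline: documents labelled with the lowest confidence come first."""
    def confidence(d):
        return d.p_positive if d.predicted_positive else 1.0 - d.p_positive
    return sorted(docs, key=confidence)


def rank_by_utility(docs, tp: int, fp: int, fn: int):
    """Rank by expected validation gain: P(label is wrong) * gain from correcting it.

    tp, fp, fn are assumed estimates of the contingency counts for the
    automatically labelled set (for example, obtained by cross-validation).
    """
    base = f1(tp, fp, fn)
    # Correcting a false positive turns it into a true negative.
    gain_fp = f1(tp, max(fp - 1, 0), fn) - base
    # Correcting a false negative turns it into a true positive.
    gain_fn = f1(tp + 1, fp, max(fn - 1, 0)) - base

    def utility(d):
        if d.predicted_positive:
            return (1.0 - d.p_positive) * gain_fp
        return d.p_positive * gain_fn

    return sorted(docs, key=utility, reverse=True)


docs = [
    Doc("a", predicted_positive=True, p_positive=0.55),
    Doc("b", predicted_positive=False, p_positive=0.35),
    Doc("c", predicted_positive=True, p_positive=0.95),
]

print([d.doc_id for d in rank_by_confidence(docs)])
print([d.doc_id for d in rank_by_utility(docs, tp=50, fp=10, fn=40)])
```

With these illustrative counts, correcting a false negative improves $F_1$ roughly twice as much as correcting a false positive, so document `b` outranks `a` under the utility-based ordering even though the classifier is less confident about `a`. This is the kind of case in which ranking by confidence alone can be suboptimal.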

Comments:	Forthcoming in ACM Transactions on Knowledge Discovery from Data
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:1503.00491 [cs.LG]
	(or arXiv:1503.00491v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1503.00491
Journal reference:	Final version published in ACM Transactions on Knowledge Discovery from Data, 10(1):Article 6, 2015
Related DOI:	https://doi.org/10.1145/2742548

Submission history

From: Fabrizio Sebastiani
[v1] Mon, 2 Mar 2015 12:09:23 UTC (200 KB)
