
Cleaning crowdsourced labels using oracles for statistical classification

Published: 01 December 2018

Abstract

Crowdsourcing is now widely used to collect training data for classification problems. However, crowdsourced labels are often noisy, and there is a performance gap between classification with noisy labels and classification with ground-truth labels. In this paper, we consider how to apply oracle-based label cleaning to reduce this gap. We propose TARS, a label-cleaning advisor that provides two pieces of valuable advice to data scientists who need to train or test a model using noisy labels. First, in the model-testing stage, given a test dataset with noisy labels and a classification model, TARS can use the test data to estimate how well the model will perform with respect to the ground-truth labels. Second, in the model-training stage, given a training dataset with noisy labels and a classification algorithm, TARS can determine which label should be sent to an oracle for cleaning so that the model improves the most. For the first piece of advice, we propose an effective estimation technique and study how to compute confidence intervals to bound its estimation error. For the second, we propose a novel cleaning strategy along with two optimization techniques, and show that it is superior to existing cleaning strategies. We evaluate TARS on both simulated and real-world datasets. The results show that (1) TARS can use noisy test data to accurately estimate a model's true performance under various evaluation metrics, and (2) for the same cleaning budget, TARS improves model accuracy by a larger margin than existing cleaning strategies.
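
To make the first piece of advice concrete, here is a minimal sketch of how ground-truth performance might be estimated from noisy test labels. It is an illustration under stated assumptions, not the estimator or cleaning strategy proposed in the paper: it assumes binary labels and assumes that each noisy test label already comes with an estimated probability of being correct (for example, from a truth-inference model over the raw crowd answers). The names estimate_true_accuracy, pick_label_to_clean, and p_correct are hypothetical.

# A minimal sketch (not the TARS estimator). Assumes binary labels and that
# each noisy test label comes with a probability p_correct[i] of matching the
# (unknown) ground-truth label, e.g. estimated by a truth-inference model.
import numpy as np

def estimate_true_accuracy(pred, noisy_labels, p_correct):
    """Estimate accuracy w.r.t. ground truth using only noisy test labels.

    If a prediction agrees with the noisy label, it is correct with
    probability p_correct[i]; if it disagrees (binary case), it is correct
    with probability 1 - p_correct[i]. Averaging gives the expected accuracy.
    """
    pred = np.asarray(pred)
    noisy_labels = np.asarray(noisy_labels)
    p_correct = np.asarray(p_correct, dtype=float)
    agree = (pred == noisy_labels)
    return float(np.mean(np.where(agree, p_correct, 1.0 - p_correct)))

def pick_label_to_clean(p_correct):
    """Naive cleaning heuristic: send the most doubtful label to the oracle."""
    return int(np.argmin(np.asarray(p_correct)))

# Toy usage: four test examples, one suspicious crowdsourced label.
preds = [1, 0, 1, 1]
noisy = [1, 0, 0, 1]
p_ok  = [0.90, 0.80, 0.55, 0.95]
print(estimate_true_accuracy(preds, noisy, p_ok))  # 0.775
print(pick_label_to_clean(p_ok))                   # 2 (the least certain label)

Unlike this "most-doubtful-first" heuristic, the cleaning strategy described in the paper targets the label whose cleaning is expected to improve the trained model the most, and its performance estimator additionally comes with confidence intervals that bound the estimation error.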

Information

Published In

Proceedings of the VLDB Endowment, Volume 12, Issue 4
December 2018
140 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 December 2018
Published in PVLDB Volume 12, Issue 4

Qualifiers

  • Research-article

Contributors

M. Dolatshah, M. Teoh, J. Wang, and J. Pei

Article Metrics

  • Downloads (Last 12 months): 2
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 05 Mar 2025

Cited By
  • CORAL: Collaborative Automatic Labeling System Based on Large Language Models. Proceedings of the VLDB Endowment, 17(12):4401-4404, Nov 2024. DOI: 10.14778/3685800.3685885
  • MisDetect: Iterative Mislabel Detection using Early Loss. Proceedings of the VLDB Endowment, 17(6):1159-1172, May 2024. DOI: 10.14778/3648160.3648161
  • Making It Tractable to Detect and Correct Errors in Graphs. ACM Transactions on Database Systems, 49(4):1-75, Dec 2024. DOI: 10.1145/3702315
  • Rock: Cleaning Data by Embedding ML in Logic Rules. Companion of the 2024 International Conference on Management of Data, pages 106-119, Jun 2024. DOI: 10.1145/3626246.3653372
  • Recognizing Textual Entailment by Hierarchical Crowdsourcing with Diverse Labor Costs. 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), pages 453-458, May 2024. DOI: 10.1109/CSCWD61410.2024.10580016
  • A Survey on Data Quality Dimensions and Tools for Machine Learning (Invited Paper). 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), pages 120-131, Jul 2024. DOI: 10.1109/AITest62860.2024.00023
  • Opportunities and Challenges in Data-Centric AI. IEEE Access, 12:33173-33189, 2024. DOI: 10.1109/ACCESS.2024.3369417
  • Perovskite-based optoelectronic systems for neuromorphic computing. Nano Energy, 120:109169, Feb 2024. DOI: 10.1016/j.nanoen.2023.109169
  • Data cleaning and machine learning: a systematic literature review. Automated Software Engineering, 31(2), Jun 2024. DOI: 10.1007/s10515-024-00453-w
  • Retaining beneficial information from detrimental data for deep neural network repair. Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 48051-48069, Dec 2023. DOI: 10.5555/3666122.3668206
