
Cleaning crowdsourced labels using oracles for statistical classification

Published: 01 December 2018

Abstract

Crowdsourcing is now widely used to collect training data for classification problems. However, crowdsourced labels are often noisy, and there is a performance gap between classification with noisy labels and classification with ground-truth labels. In this paper, we consider how to apply oracle-based label cleaning to reduce this gap. We propose TARS, a label-cleaning advisor that provides two pieces of valuable advice to data scientists who need to train or test a model using noisy labels. First, in the model-testing stage, given a test dataset with noisy labels and a classification model, TARS can use the test data to estimate how well the model will perform with respect to the ground-truth labels. Second, in the model-training stage, given a training dataset with noisy labels and a classification algorithm, TARS can determine which label should be sent to an oracle for cleaning so that the model improves the most. For the first piece of advice, we propose an effective estimation technique and study how to compute confidence intervals to bound its estimation error. For the second, we propose a novel cleaning strategy along with two optimization techniques, and show that it is superior to existing cleaning strategies. We evaluate TARS on both simulated and real-world datasets. The results show that (1) TARS can use noisy test data to accurately estimate a model's true performance under various evaluation metrics, and (2) for the same cleaning budget, TARS improves model accuracy by a larger margin than existing cleaning strategies.
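
To make the first piece of advice concrete, here is a minimal sketch of how ground-truth performance might be estimated from noisy test labels. It is an illustration under stated assumptions, not the estimator or cleaning strategy proposed in the paper: it assumes binary labels and assumes that each noisy test label already comes with an estimated probability of being correct (for example, from a truth-inference model over the raw crowd answers). The names estimate_true_accuracy, pick_label_to_clean, and p_correct are hypothetical.

# A minimal sketch (not the TARS estimator). Assumes binary labels and that
# each noisy test label comes with a probability p_correct[i] of matching the
# (unknown) ground-truth label, e.g. estimated by a truth-inference model.
import numpy as np

def estimate_true_accuracy(pred, noisy_labels, p_correct):
    """Estimate accuracy w.r.t. ground truth using only noisy test labels.

    If a prediction agrees with the noisy label, it is correct with
    probability p_correct[i]; if it disagrees (binary case), it is correct
    with probability 1 - p_correct[i]. Averaging gives the expected accuracy.
    """
    pred = np.asarray(pred)
    noisy_labels = np.asarray(noisy_labels)
    p_correct = np.asarray(p_correct, dtype=float)
    agree = (pred == noisy_labels)
    return float(np.mean(np.where(agree, p_correct, 1.0 - p_correct)))

def pick_label_to_clean(p_correct):
    """Naive cleaning heuristic: send the most doubtful label to the oracle."""
    return int(np.argmin(np.asarray(p_correct)))

# Toy usage: four test examples, one suspicious crowdsourced label.
preds = [1, 0, 1, 1]
noisy = [1, 0, 0, 1]
p_ok  = [0.90, 0.80, 0.55, 0.95]
print(estimate_true_accuracy(preds, noisy, p_ok))  # 0.775
print(pick_label_to_clean(p_ok))                   # 2 (the least certain label)

Unlike this "most-doubtful-first" heuristic, the cleaning strategy described in the paper targets the label whose cleaning is expected to improve the trained model the most, and its performance estimator additionally comes with confidence intervals that bound the estimation error.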

Information

Published In

Proceedings of the VLDB Endowment, Volume 12, Issue 4
December 2018
140 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 December 2018
Published in PVLDB Volume 12, Issue 4

Qualifiers

  • Research-article

Contributors

M. Dolatshah, M. Teoh, J. Wang, and J. Pei

Article Metrics

  • Downloads (Last 12 months): 2
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 05 Mar 2025

Cited By
  • CORAL: Collaborative Automatic Labeling System Based on Large Language Models. Proceedings of the VLDB Endowment, 17(12):4401-4404, Nov 2024. DOI: 10.14778/3685800.3685885
  • MisDetect: Iterative Mislabel Detection using Early Loss. Proceedings of the VLDB Endowment, 17(6):1159-1172, May 2024. DOI: 10.14778/3648160.3648161
  • Making It Tractable to Detect and Correct Errors in Graphs. ACM Transactions on Database Systems, 49(4):1-75, Dec 2024. DOI: 10.1145/3702315
  • Rock: Cleaning Data by Embedding ML in Logic Rules. Companion of the 2024 International Conference on Management of Data, pages 106-119, Jun 2024. DOI: 10.1145/3626246.3653372
  • Recognizing Textual Entailment by Hierarchical Crowdsourcing with Diverse Labor Costs. 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), pages 453-458, May 2024. DOI: 10.1109/CSCWD61410.2024.10580016
  • A Survey on Data Quality Dimensions and Tools for Machine Learning (Invited Paper). 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), pages 120-131, Jul 2024. DOI: 10.1109/AITest62860.2024.00023
  • Opportunities and Challenges in Data-Centric AI. IEEE Access, 12:33173-33189, 2024. DOI: 10.1109/ACCESS.2024.3369417
  • Perovskite-based optoelectronic systems for neuromorphic computing. Nano Energy, 120:109169, Feb 2024. DOI: 10.1016/j.nanoen.2023.109169
  • Data cleaning and machine learning: a systematic literature review. Automated Software Engineering, 31(2), Jun 2024. DOI: 10.1007/s10515-024-00453-w
  • Retaining beneficial information from detrimental data for deep neural network repair. Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 48051-48069, Dec 2023. DOI: 10.5555/3666122.3668206
