
Scaling up crowd-sourcing to very large datasets: a case for active learning

Published: 01 October 2014
Abstract

    Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs).
    Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use, even for practitioners who are not machine learning experts. We draw on the theory of nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements.
    Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk and on 15 UCI datasets, show that our methods on average ask 1--2 orders of magnitude fewer questions than the baseline, and 4.5--44× fewer than existing active learning algorithms.
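    The bootstrap-driven active learning the abstract describes can be sketched roughly as follows. This is an illustrative stand-in, not the paper's actual algorithm: it uses a toy 1-D nearest-centroid learner and a committee-disagreement score, where the paper's methods are generic over the underlying classifier. The idea is the same, though: train many models on nonparametric bootstrap resamples of the labeled pool, and send to the crowd the unlabeled item on which the resampled models disagree most.

    ```python
    import random
    from collections import Counter

    def centroid_classifier(labeled):
        # Toy 1-D nearest-centroid learner used as a stand-in model.
        by_label = {}
        for x, y in labeled:
            by_label.setdefault(y, []).append(x)
        centroids = {y: sum(xs) / len(xs) for y, xs in by_label.items()}
        return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

    def bootstrap_disagreement(labeled, unlabeled, k=25, seed=0):
        # Train k models, each on a bootstrap resample (drawn with
        # replacement) of the labeled pool.
        rng = random.Random(seed)
        models = []
        for _ in range(k):
            resample = [rng.choice(labeled) for _ in labeled]
            models.append(centroid_classifier(resample))
        # Score each unlabeled item by how split the committee vote is:
        # 0.0 means unanimous, values near 0.5 mean maximal disagreement.
        scores = []
        for x in unlabeled:
            votes = Counter(m(x) for m in models)
            top = votes.most_common(1)[0][1]
            scores.append(1.0 - top / k)
        return scores

    labeled = [(0.0, "neg"), (1.0, "neg"), (9.0, "pos"), (10.0, "pos")]
    unlabeled = [0.5, 5.0, 9.5]
    scores = bootstrap_disagreement(labeled, unlabeled)
    # The item with the highest disagreement is the next crowd question;
    # here the boundary point 5.0 is the most contested.
    query = unlabeled[max(range(len(unlabeled)), key=scores.__getitem__)]
    ```

    Items far from the decision boundary get near-unanimous votes across resamples and are labeled by the classifier for free; only genuinely ambiguous items cost a crowd question, which is what lets labeling scale to datasets far larger than the crowd budget.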




        Published In

        Proceedings of the VLDB Endowment, Volume 8, Issue 2 (October 2014), 84 pages

        Publisher

        VLDB Endowment

        Qualifiers

        • Research-article


        Cited By

        • (2023) Testing conventional wisdom (of the crowd). Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, 237--248. doi:10.5555/3625834.3625857
        • (2023) STARS. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 10980--10988. doi:10.1609/aaai.v37i9.26301
        • (2023) VersaMatch: Ontology Matching with Weak Supervision. Proceedings of the VLDB Endowment, 16(6):1305--1318. doi:10.14778/3583140.3583148
        • (2023) The Battleship Approach to the Low Resource Entity Matching Problem. Proceedings of the ACM on Management of Data, 1(4):1--25. doi:10.1145/3626711
        • (2022) Deep indexed active learning for matching heterogeneous entity representations. Proceedings of the VLDB Endowment, 15(1):31--45. doi:10.14778/3485450.3485455
        • (2022) Impact of the Characteristics of Multi-source Entity Matching Tasks on the Performance of Active Learning Methods. The Semantic Web, 113--129. doi:10.1007/978-3-031-06981-9_7
        • (2021) Exploiting Heterogeneous Graph Neural Networks with Latent Worker/Task Correlation Information for Label Aggregation in Crowdsourcing. ACM Transactions on Knowledge Discovery from Data, 16(2):1--18. doi:10.1145/3460865
        • (2021) From Limited Annotated Raw Material Data to Quality Production Data. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 4114--4124. doi:10.1145/3459637.3481921
        • (2021) CrowdTC: Crowd-powered Learning for Text Classification. ACM Transactions on Knowledge Discovery from Data, 16(1):1--23. doi:10.1145/3457216
        • (2021) AdaReNet: Adaptive Reweighted Semi-supervised Active Learning to Accelerate Label Acquisition. Proceedings of the 14th PErvasive Technologies Related to Assistive Environments Conference, 431--438. doi:10.1145/3453892.3461321
