DOI: 10.1145/2611040.2611047

When in Doubt Ask the Crowd: Employing Crowdsourcing for Active Learning

Published: 02 June 2014

Abstract

Crowdsourcing has become ubiquitous in machine learning as a cost-effective method to gather training labels. In this paper, we examine the challenges that appear when employing crowdsourcing for active learning, in an integrated environment where an automatic method and human labelers work together towards improving their performance at a certain task. By using active learning techniques on crowd-labeled data, we optimize the performance of the automatic method towards better accuracy, while keeping costs low by gathering data on demand. To verify our proposed methods, we apply them to the task of deduplicating publications in a digital library by examining their metadata. We investigate the problems created by noisy labels produced by the crowd and explore methods to aggregate them. We analyze how different automatic methods are affected by the quantity and quality of the allocated resources, as well as by the instance selection strategies for each active learning round, aiming to strike a balance between cost and performance.

References

[1]
V. Ambati, S. Hewavitharana, S. Vogel, and J. Carbonell. Active learning with multiple annotations for comparable data classification task. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web. Association for Computational Linguistics, 2011.
[2]
V. Ambati, S. Vogel, and J. G. Carbonell. Active learning and crowd-sourcing for machine translation. In LREC, 2010.
[3]
J. Attenberg and F. Provost. Why label when you can search?: alternatives to active learning for applying human resources to build classification models under extreme class imbalance. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '10. ACM, 2010.
[4]
O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: a generic approach to entity resolution. The VLDB Journal, 18(1), Jan. 2009.
[5]
M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive Name Matching in Information Integration. IEEE Intelligent Systems, 18(5), 2003.
[6]
E. Chatzilari, S. Nikolopoulos, Y. Kompatsiaris, and J. Kittler. Active learning in social context for image classification. In 9th International Conference on Computer Vision Theory and Applications, VISAPP, 2014.
[7]
D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Mach. Learn., 15(2), 1994.
[8]
A. P. Dawid and A. M. Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Applied Statistics, 28(1), 1979.
[9]
G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proceedings of the 21st international conference on World Wide Web, WWW '12. ACM, 2012.
[10]
A. Doan, Y. Lu, Y. Lee, and J. Han. Object Matching for Information Integration: A Profiler-Based Approach. In IIWeb, 2003.
[11]
P. Donmez and J. G. Carbonell. Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 2008.
[12]
M. Georgescu, D. D. Pham, C. S. Firan, W. Nejdl, and J. Gaugaz. Map to humans and reduce error: crowdsourcing for deduplication applied to digital libraries. In Proceedings of the 21st ACM international conference on Information and knowledge management, CIKM '12. ACM, 2012.
[13]
X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, SIGMOD '05. ACM, 2005.
[14]
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 2009.
[15]
M. A. Hernández and S. J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Min. Knowl. Discov., 2(1), 1998.
[16]
E. Ioannou, C. Niederée, and W. Nejdl. Probabilistic Entity Linkage for Heterogeneous Information Spaces. In CAiSE, 2008.
[17]
P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP '10. ACM, 2010.
[18]
F. Laws, C. Scheible, and H. Schütze. Active learning with amazon mechanical turk. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.
[19]
M. Lease. On quality control and machine learning in crowdsourcing. In Human Computation, 2011.
[20]
Z. Miklós, N. Bonvin, P. Bouquet, M. Catasta, D. Cordioli, P. Fankhauser, J. Gaugaz, E. Ioannou, H. Koshutanski, A. Maña, C. Niederée, T. Palpanas, and H. Stoermer. From Web Data to Entities and Back. CAiSE, 2010.
[21]
A. Morris, Y. Velegrakis, and P. Bouquet. Entity Identification on the Semantic Web. In SWAP, 2008.
[22]
V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11, 2010.
[23]
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, 2002.
[24]
B. Settles. Active learning literature survey. University of Wisconsin, Madison, 2010.
[25]
V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '08. ACM, 2008.
[26]
A. Sheshadri and M. Lease. Square: A benchmark for research on computing crowd consensus. In First AAAI Conference on Human Computation and Crowdsourcing, 2013.
[27]
R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2008.
[28]
O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. Kalai. Adaptively learning the crowd kernel. In ICML, 2011.
[29]
S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In Computer Vision and Pattern Recognition, CVPR. IEEE, 2011.
[30]
L. von Ahn. Human computation. In CIVR, 2009.
[31]
J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11), 2012.
[32]
P. Welinder, S. Branson, P. Perona, and S. J. Belongie. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems, 2010.
[33]
J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, 2009.
[34]
Y. Yan, G. M. Fung, R. Rosales, and J. G. Dy. Active learning from crowds. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011.
[35]
H. Yang, A. Mityagin, K. M. Svore, and S. Markov. Collecting high quality overlapping labels at low cost. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. ACM, 2010.
[36]
M.-C. Yuen, L.-J. Chen, and I. King. A survey of human computation systems. In CSE (4), 2009.
[37]
L. Zhao, G. Sukthankar, and R. Sukthankar. Incremental relabeling for active learning with noisy crowdsourced annotations. In 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom). IEEE, 2011.

Cited By

  • (2019) What You Sow, So Shall You Reap! Toward Preselection Mechanisms for Macrotask Crowdsourcing. In Macrotask Crowdsourcing, pp. 163-188. DOI: 10.1007/978-3-030-12334-5_6. Online publication date: 7-Aug-2019.
  • (2016) Crowdlearning: A framework for collaborative and personalized learning. In 2016 IEEE Frontiers in Education Conference (FIE), pp. 1-9. DOI: 10.1109/FIE.2016.7757355. Online publication date: Oct-2016.
  • (2014) Profiling Flood Risk through Crowdsourced Flood Level Reports. In 2014 International Conference on IT Convergence and Security (ICITCS), pp. 1-4. DOI: 10.1109/ICITCS.2014.7021800. Online publication date: Oct-2014.

      Reviews

      Andrea F Paramithiotti

      In machine learning, computers run not according to programs set once and for all by humans, but by following instructions that change according to some given sets of rules. These rules, however, must still be laid out by humans in a long, complex, and costly process. This paper presents a method to ease that process using a well-known methodology called crowdsourcing.

      Most of the paper is devoted to the overall process description: first, candidate rules are selected and the crowd is asked to evaluate them (gather); then, rules are assigned to categories according to crowd judgment (aggregate); and, finally, rules are added to the existing set (select). The cycle is repeated as many times as needed; the gather-aggregate-select cycle is carried out by a computer algorithm, while the evaluation of rules is carried out by humans.

      The method is then experimentally applied to disambiguate references to scientific publications; the goal is labeling pairs of scientific publications as either duplicate or non-duplicate. The factors helping to achieve good results in a short time are reported; among them are the fields by which publications are categorized, the best voting strategy used to assign rules to categories, the optimal size of the crowd, and the number of sessions. A strategy to assess worker reliability is also described as a fundamental part of the method.

      The paper builds on previous work, so comprehensive references are given at the end. That being said, the paper looks to the future, too, as it also discusses ways to improve the process in view of its application to real-case scenarios. In this respect, the authors say that most work should be done to improve the strategy for agreement among individuals in the crowd, as well as in the choice of algorithms used in the computer part of the process.

      Online Computing Reviews Service



      Published In

      WIMS '14: Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14)
      June 2014
      506 pages
      ISBN:9781450325387
      DOI:10.1145/2611040

      In-Cooperation

      • Aristotle University of Thessaloniki

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 02 June 2014


      Author Tags

      1. Active Learning
      2. Crowdsourcing
      3. Human Computation
      4. Machine Learning

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      WIMS '14

      Acceptance Rates

      WIMS '14 Paper Acceptance Rate: 41 of 90 submissions, 46%
      Overall Acceptance Rate: 140 of 278 submissions, 50%

      Article Metrics

      • Downloads (last 12 months): 9
      • Downloads (last 6 weeks): 0
      Reflects downloads up to 23 Dec 2024

