
Scaling up crowd-sourcing to very large datasets: a case for active learning

Published: 01 October 2014
Abstract

    Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs).
    Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use, even for practitioners who are not machine learning experts. We draw on the theory of nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements.
    Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk and on 15 UCI datasets, show that our methods on average ask 1--2 orders of magnitude fewer questions than the baseline, and 4.5--44× fewer than existing active learning algorithms.
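    The bootstrap-driven active learning the abstract describes can be sketched roughly as follows. This is an illustrative stand-in, not the paper's actual algorithm: it uses a toy 1-D nearest-centroid learner and a committee-disagreement score, where the paper's methods are generic over the underlying classifier. The idea is the same, though: train many models on nonparametric bootstrap resamples of the labeled pool, and send to the crowd the unlabeled item on which the resampled models disagree most.

    ```python
    import random
    from collections import Counter

    def centroid_classifier(labeled):
        # Toy 1-D nearest-centroid learner used as a stand-in model.
        by_label = {}
        for x, y in labeled:
            by_label.setdefault(y, []).append(x)
        centroids = {y: sum(xs) / len(xs) for y, xs in by_label.items()}
        return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

    def bootstrap_disagreement(labeled, unlabeled, k=25, seed=0):
        # Train k models, each on a bootstrap resample (drawn with
        # replacement) of the labeled pool.
        rng = random.Random(seed)
        models = []
        for _ in range(k):
            resample = [rng.choice(labeled) for _ in labeled]
            models.append(centroid_classifier(resample))
        # Score each unlabeled item by how split the committee vote is:
        # 0.0 means unanimous, values near 0.5 mean maximal disagreement.
        scores = []
        for x in unlabeled:
            votes = Counter(m(x) for m in models)
            top = votes.most_common(1)[0][1]
            scores.append(1.0 - top / k)
        return scores

    labeled = [(0.0, "neg"), (1.0, "neg"), (9.0, "pos"), (10.0, "pos")]
    unlabeled = [0.5, 5.0, 9.5]
    scores = bootstrap_disagreement(labeled, unlabeled)
    # The item with the highest disagreement is the next crowd question;
    # here the boundary point 5.0 is the most contested.
    query = unlabeled[max(range(len(unlabeled)), key=scores.__getitem__)]
    ```

    Items far from the decision boundary get near-unanimous votes across resamples and are labeled by the classifier for free; only genuinely ambiguous items cost a crowd question, which is what lets labeling scale to datasets far larger than the crowd budget.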




        Published In

        Proceedings of the VLDB Endowment, Volume 8, Issue 2 (October 2014), 84 pages

        Publisher

        VLDB Endowment

        Qualifiers

        • Research-article


        Cited By

        • (2023) Testing conventional wisdom (of the crowd). Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, 237--248. doi:10.5555/3625834.3625857
        • (2023) STARS. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 10980--10988. doi:10.1609/aaai.v37i9.26301
        • (2023) VersaMatch: Ontology Matching with Weak Supervision. Proceedings of the VLDB Endowment, 16(6):1305--1318. doi:10.14778/3583140.3583148
        • (2023) The Battleship Approach to the Low Resource Entity Matching Problem. Proceedings of the ACM on Management of Data, 1(4):1--25. doi:10.1145/3626711
        • (2022) Deep indexed active learning for matching heterogeneous entity representations. Proceedings of the VLDB Endowment, 15(1):31--45. doi:10.14778/3485450.3485455
        • (2022) Impact of the Characteristics of Multi-source Entity Matching Tasks on the Performance of Active Learning Methods. The Semantic Web, 113--129. doi:10.1007/978-3-031-06981-9_7
        • (2021) Exploiting Heterogeneous Graph Neural Networks with Latent Worker/Task Correlation Information for Label Aggregation in Crowdsourcing. ACM Transactions on Knowledge Discovery from Data, 16(2):1--18. doi:10.1145/3460865
        • (2021) From Limited Annotated Raw Material Data to Quality Production Data. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 4114--4124. doi:10.1145/3459637.3481921
        • (2021) CrowdTC: Crowd-powered Learning for Text Classification. ACM Transactions on Knowledge Discovery from Data, 16(1):1--23. doi:10.1145/3457216
        • (2021) AdaReNet: Adaptive Reweighted Semi-supervised Active Learning to Accelerate Label Acquisition. Proceedings of the 14th PErvasive Technologies Related to Assistive Environments Conference, 431--438. doi:10.1145/3453892.3461321
