
Asterisk: Generating Large Training Datasets with Automatic Active Supervision

Published: 30 May 2020

Abstract

    Labeling datasets is one of the most expensive bottlenecks in machine learning data preprocessing. Organizations in many domains therefore apply weak supervision to produce noisy labels. However, because weak supervision relies on cheaper labeling sources, the quality of the generated labels is problematic. In this article, we present Asterisk, an end-to-end framework for generating high-quality, large-scale labeled datasets. The system first automatically generates heuristics to assign initial labels. The framework then applies a novel data-driven active learning process to enhance the labeling quality. We present an algorithm that learns the selection policy by accommodating the modeled accuracies of the heuristics along with the outcome of the generative model. Finally, the system employs the output of the active learning process to enhance the quality of the labels. To evaluate the proposed system, we report its performance against four state-of-the-art techniques. In collaboration with our industrial partner, IBM, we test the framework on a wide range of real-world applications. The experiments cover 10 datasets of varying sizes, the largest containing 11 million records. The results illustrate the effectiveness of the framework in producing high-quality labels and achieving high classification accuracy with minimal annotation effort.
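To make the pipeline concrete, here is a minimal sketch of the four stages the abstract describes, run on synthetic data. It is an illustration under stated assumptions, not the Asterisk implementation: the function names (make_heuristics, apply_heuristics, weighted_vote) are hypothetical, the generative label model is approximated by an accuracy-weighted vote, and the learned selection policy is replaced by plain least-confidence sampling. Only NumPy is required.

```python
# Minimal sketch of the four-stage pipeline described in the abstract,
# applied to synthetic data. NOT the authors' implementation: all function
# names are hypothetical, the generative label model is approximated by an
# accuracy-weighted vote, and the learned selection policy is replaced by
# plain least-confidence sampling.
import numpy as np

def make_heuristics(X_seed, y_seed):
    """Stage 1: auto-generate one threshold heuristic per feature,
    oriented so that it agrees with a small labeled seed set."""
    heuristics = []
    for j in range(X_seed.shape[1]):
        t = np.median(X_seed[:, j])
        above = X_seed[:, j] > t
        sign = 1 if y_seed[above].mean() >= 0.5 else -1
        heuristics.append((j, t, sign))
    return heuristics

def apply_heuristics(heuristics, X, band=0.25):
    """Vote matrix L[i, k] in {-1, 0, +1}; a heuristic abstains (0)
    within `band` of its threshold."""
    L = np.zeros((len(X), len(heuristics)), dtype=int)
    for k, (j, t, sign) in enumerate(heuristics):
        margin = X[:, j] - t
        L[:, k] = np.where(np.abs(margin) > band, sign * np.sign(margin), 0)
    return L

def weighted_vote(L, acc):
    """Stage 2 stand-in for the generative model: combine votes with
    log-odds weights derived from each heuristic's modeled accuracy."""
    w = np.log(acc / (1.0 - acc))
    return 1.0 / (1.0 + np.exp(-(L @ w)))   # P(y = 1 | votes)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)        # hidden ground truth
seed = rng.choice(len(X), size=30, replace=False)    # tiny labeled seed set

H = make_heuristics(X[seed], y[seed])
L = apply_heuristics(H, X)
acc = np.array([(L[seed, k] == 2 * y[seed] - 1).mean() for k in range(len(H))])
acc = np.clip(acc, 0.55, 0.95)                       # modeled accuracies
p = weighted_vote(L, acc)                            # probabilistic labels

# Stage 3: spend a small annotation budget on the least confident records.
budget = 50
queries = np.argsort(np.abs(p - 0.5))[:budget]
p[queries] = y[queries]                              # oracle answers

# Stage 4: emit the corrected, large-scale label set.
labels = (p > 0.5).astype(int)
print(f"final label accuracy: {(labels == y).mean():.3f}")
```

In the actual system, the selection policy is learned from the heuristics' modeled accuracies together with the generative model's output, rather than fixed in advance as in this sketch.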



    Published In

    ACM/IMS Transactions on Data Science, Volume 1, Issue 2
    May 2020, 169 pages
    ISSN: 2691-1922
    DOI: 10.1145/3403596
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 May 2020
    Online AM: 07 May 2020
    Accepted: 01 February 2020
    Revised: 01 December 2019
    Received: 01 July 2019
    Published in TDS Volume 1, Issue 2


    Author Tags

    1. Active learning
    2. Data labeling
    3. Heuristics design
    4. Machine learning

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2024) A Review of Automatic Lie Detection from Facial Features. Journal of Nonverbal Behavior 48, 1 (2024), 93--136. https://doi.org/10.1007/s10919-024-00451-2
    • (2023) Steered Training Data Generation for Learned Semantic Type Detection. Proceedings of the ACM on Management of Data 1, 2 (2023), 1--25. https://doi.org/10.1145/3589786
    • (2022) Nemo. Proceedings of the VLDB Endowment 15, 13 (2022), 4093--4105. https://doi.org/10.14778/3565838.3565859
    • (2022) Witan. Proceedings of the VLDB Endowment 15, 11 (2022), 2334--2347. https://doi.org/10.14778/3551793.3551797
    • (2022) BUDS+: Better Privacy with Converger and Noisy Shuffling. Digital Threats: Research and Practice 4, 2 (2022), 1--23. https://doi.org/10.1145/3491259