
Asterisk: Generating Large Training Datasets with Automatic Active Supervision

Published: 30 May 2020

Abstract

    Labeling datasets is one of the most expensive bottlenecks in machine learning data preprocessing. Organizations in many domains therefore apply weak supervision to produce noisy labels. However, because weak supervision relies on cheaper labeling sources, the quality of the generated labels is problematic. In this article, we present Asterisk, an end-to-end framework for generating high-quality, large-scale labeled datasets. The system first automatically generates heuristics to assign initial labels. The framework then applies a novel data-driven active learning process to enhance the labeling quality. We present an algorithm that learns the selection policy by accommodating the modeled accuracies of the heuristics along with the outcome of the generative model. Finally, the system employs the output of the active learning process to enhance the quality of the labels. To evaluate the proposed system, we report its performance against four state-of-the-art techniques. In collaboration with our industrial partner, IBM, we test the framework on a wide range of real-world applications. The experiments cover 10 datasets of varying sizes, the largest containing 11 million records. The results illustrate the effectiveness of the framework in producing high-quality labels and achieving high classification accuracy with minimal annotation effort.
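To make the pipeline concrete, here is a minimal sketch of the four stages the abstract describes, run on synthetic data. It is an illustration under stated assumptions, not the Asterisk implementation: the function names (make_heuristics, apply_heuristics, weighted_vote) are hypothetical, the generative label model is approximated by an accuracy-weighted vote, and the learned selection policy is replaced by plain least-confidence sampling. Only NumPy is required.

```python
# Minimal sketch of the four-stage pipeline described in the abstract,
# applied to synthetic data. NOT the authors' implementation: all function
# names are hypothetical, the generative label model is approximated by an
# accuracy-weighted vote, and the learned selection policy is replaced by
# plain least-confidence sampling.
import numpy as np

def make_heuristics(X_seed, y_seed):
    """Stage 1: auto-generate one threshold heuristic per feature,
    oriented so that it agrees with a small labeled seed set."""
    heuristics = []
    for j in range(X_seed.shape[1]):
        t = np.median(X_seed[:, j])
        above = X_seed[:, j] > t
        sign = 1 if y_seed[above].mean() >= 0.5 else -1
        heuristics.append((j, t, sign))
    return heuristics

def apply_heuristics(heuristics, X, band=0.25):
    """Vote matrix L[i, k] in {-1, 0, +1}; a heuristic abstains (0)
    within `band` of its threshold."""
    L = np.zeros((len(X), len(heuristics)), dtype=int)
    for k, (j, t, sign) in enumerate(heuristics):
        margin = X[:, j] - t
        L[:, k] = np.where(np.abs(margin) > band, sign * np.sign(margin), 0)
    return L

def weighted_vote(L, acc):
    """Stage 2 stand-in for the generative model: combine votes with
    log-odds weights derived from each heuristic's modeled accuracy."""
    w = np.log(acc / (1.0 - acc))
    return 1.0 / (1.0 + np.exp(-(L @ w)))   # P(y = 1 | votes)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)        # hidden ground truth
seed = rng.choice(len(X), size=30, replace=False)    # tiny labeled seed set

H = make_heuristics(X[seed], y[seed])
L = apply_heuristics(H, X)
acc = np.array([(L[seed, k] == 2 * y[seed] - 1).mean() for k in range(len(H))])
acc = np.clip(acc, 0.55, 0.95)                       # modeled accuracies
p = weighted_vote(L, acc)                            # probabilistic labels

# Stage 3: spend a small annotation budget on the least confident records.
budget = 50
queries = np.argsort(np.abs(p - 0.5))[:budget]
p[queries] = y[queries]                              # oracle answers

# Stage 4: emit the corrected, large-scale label set.
labels = (p > 0.5).astype(int)
print(f"final label accuracy: {(labels == y).mean():.3f}")
```

In the actual system, the selection policy is learned from the heuristics' modeled accuracies together with the generative model's output, rather than fixed in advance as in this sketch.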



    Published In

    ACM/IMS Transactions on Data Science, Volume 1, Issue 2
    May 2020, 169 pages
    ISSN: 2691-1922
    DOI: 10.1145/3403596
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 May 2020
    Online AM: 07 May 2020
    Accepted: 01 February 2020
    Revised: 01 December 2019
    Received: 01 July 2019
    Published in TDS Volume 1, Issue 2


    Author Tags

    1. Active learning
    2. Data labeling
    3. Heuristics design
    4. Machine learning

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2024) A Review of Automatic Lie Detection from Facial Features. Journal of Nonverbal Behavior 48, 1 (2024), 93--136. https://doi.org/10.1007/s10919-024-00451-2
    • (2023) Steered Training Data Generation for Learned Semantic Type Detection. Proceedings of the ACM on Management of Data 1, 2 (2023), 1--25. https://doi.org/10.1145/3589786
    • (2022) Nemo. Proceedings of the VLDB Endowment 15, 13 (2022), 4093--4105. https://doi.org/10.14778/3565838.3565859
    • (2022) Witan. Proceedings of the VLDB Endowment 15, 11 (2022), 2334--2347. https://doi.org/10.14778/3551793.3551797
    • (2022) BUDS+: Better Privacy with Converger and Noisy Shuffling. Digital Threats: Research and Practice 4, 2 (2022), 1--23. https://doi.org/10.1145/3491259