
Active Learning for Data Quality Control: A Survey

Published: 24 June 2024

Abstract

Data quality plays a vital role in scientific research and decision-making across industries. It is therefore crucial to incorporate a data quality control (DQC) process, which comprises various actions and operations to detect and correct data errors. The increasing adoption of machine learning (ML) techniques across domains has raised concerns about data quality in the ML field. Conversely, ML's capability to uncover complex patterns makes it well suited to the challenges involved in the DQC process. However, supervised learning methods demand abundant labeled data, while unsupervised learning methods rely heavily on the underlying distribution of the data. Active learning (AL) provides a promising middle ground by proactively selecting data points for inspection, thus reducing the data-labeling burden on domain experts. This survey therefore focuses on applying AL to DQC. Starting with a review of common data quality issues and solutions in the ML field, we aim to enhance the understanding of current quality assessment methods. We then present two scenarios illustrating the adoption of AL in DQC systems for the anomaly detection task, covering pool-based and stream-based approaches. Finally, we discuss the remaining challenges and research opportunities in this field.
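The pool-based scenario sketched in the abstract can be made concrete with a minimal query loop: an unsupervised detector scores every point in the pool, and the points whose scores lie closest to the decision threshold (those the detector is least certain about) are handed to an expert for labeling. The sketch below is illustrative only; the simple deviation-from-the-mean scorer and all names are assumptions standing in for whatever detector a real DQC system would use.

```python
# Hedged sketch of pool-based active learning for anomaly detection.
# The z-score-style detector is a placeholder assumption, not a method
# prescribed by the survey.

def anomaly_scores(pool):
    """Score each point by its absolute deviation from the pool mean
    (a stand-in for any unsupervised detector's anomaly score)."""
    mean = sum(pool) / len(pool)
    return [abs(x - mean) for x in pool]

def query_most_uncertain(pool, threshold, k):
    """Select the k indices whose scores lie closest to the decision
    threshold, i.e., the points the detector is least sure about."""
    scores = anomaly_scores(pool)
    order = sorted(range(len(pool)), key=lambda i: abs(scores[i] - threshold))
    return order[:k]

# Toy pool: mostly inliers near 0, plus two large-magnitude anomalies.
pool = [0.1, -0.2, 0.05, 0.3, -0.1, 5.0, 0.2, -4.8, 0.0]
queried = query_most_uncertain(pool, threshold=2.0, k=2)
# A domain expert (the oracle) would now label pool[i] for each i in
# queried; those labels refine the threshold before the next round.
```

In a stream-based variant the same uncertainty test would be applied to each arriving point immediately, querying the oracle only when the point falls near the threshold.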

References

[1]
Naoki Abe, Bianca Zadrozny, and John Langford. 2006. Outlier detection by active learning. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 504–509.
[2]
Charu C. Aggarwal. 2017. In data mining. In Outlier Analysis. Springer.
[3]
Charu C. Aggarwal, Xiangnan Kong, Quanquan Gu, Jiawei Han, and S. Yu Philip. 2014. Active learning: A survey. In Data Classification. Chapman and Hall/CRC, 599–634.
[4]
Hamed H. Aghdam, Abel Gonzalez-Garcia, Joost van de Weijer, and Antonio M. López. 2019. Active learning for deep detection neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3672–3680.
[5]
Gabriel Aguiar, Bartosz Krawczyk, and Alberto Cano. 2023. A survey on learning from imbalanced data streams: Taxonomy, challenges, empirical study, and reproducible experimental framework. Mach. Learn. (2023), 1–79.
[6]
Magnus Almgren and Erland Jonsson. 2004. Using active learning in intrusion detection. In Proceedings of the 17th IEEE Computer Security Foundations Workshop. IEEE, 88–98.
[7]
Laith Alzubaidi, Jinshuai Bai, Aiman Al-Sabaawi, Jose Santamaría, A. S. Albahri, Bashar Sami Nayyef Al-dabbagh, Mohammed A. Fadhel, Mohamed Manoufali, Jinglan Zhang, Ali H. Al-Timemy et al. 2023. A survey on deep learning tools dealing with data scarcity: Definitions, challenges, solutions, tips, and applications. J. Big Data 10, 1 (2023), 46.
[8]
Samaneh Aminikhanghahi and Diane J. Cook. 2017. A survey of methods for time series change point detection. Knowl. Info. Syst. 51, 2 (2017), 339–367.
[9]
Shin Ando and Chun Yuan Huang. 2017. Deep over-sampling framework for classifying imbalanced data. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD’17). Springer, 770–785.
[10]
Dana Angluin. 1988. Queries and concept learning. Mach. Learn. 2, 4 (1988), 319.
[11]
Michele Banko and Eric Brill. 2001. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. 26–33.
[12]
Ms. Aayushi Bansal, Dr. Rewa Sharma, and Dr. Mamta Kathuria. 2022. A systematic review on data scarcity problem in deep learning: Solution and applications. ACM Comput. Surveys (CSUR’22) 54, 10s (2022), 1–29. DOI:
[13]
Vic Barnett and Toby Lewis. 1984. Outliers in statistical data. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics.
[14]
Eric B. Baum and Kenneth Lang. 1992. Query learning can work poorly when a human oracle is used. In Proceedings of the 8th International Joint Conference on Neural Networks.
[15]
Mayukh Bhattacharjee, Hema Sri Kambhampati, Paula Branco, and Luis Torgo. 2021. Active learning for imbalanced domains: The ALOD and ALOD-RE algorithms. In Proceedings of the IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA’21). 1–10.
[16]
Ane Blázquez-García, Angel Conde, Usue Mori, and Jose A. Lozano. 2021. A review on outlier/anomaly detection in time series data. ACM Comput. Surveys 54, 3 (2021), 1–33.
[17]
Michael Bloodgood and John Grothendieck. 2015. Analysis of stopping active learning based on stabilizing predictions. Retrieved from https://arXiv:1504.06329.
[18]
Raymond Board and Leonard Pitt. 1989. Semi-supervised learning. Mach. Learn. 4, 1 (1989), 41–.
[19]
Hamza Bodor, Thai V. Hoang, and Zonghua Zhang. 2022. Little Help Makes a Big Difference: Leveraging Active Learning to Improve Unsupervised Time Series Anomaly Detection. Retrieved from https://arxiv.org/abs/2201.10323.
[20]
Azzedine Boukerche, Lining Zheng, and Omar Alfandi. 2020. Outlier detection: Methods, models, and classification. ACM Comput. Surveys 53, 3 (2020), 1–37.
[21]
Stephen Boyd, Corinna Cortes, Mehryar Mohri, and Ana Radovanovic. 2012. Accuracy at the top. Adv. Neural Info. Process. Syst. 25 (2012).
[22]
Mohammad Braei and Sebastian Wagner. 2020. Anomaly detection in univariate time-series: A survey on the state-of-the-art. Retrieved from https://arXiv:2004.00433.
[23]
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 93–104.
[24]
Klaus Brinker. 2003. Incorporating diversity in active learning with support vector machines. In Proceedings of the 20th International Conference on Machine Learning (ICML’03). 59–66.
[25]
Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Felix Naumann, and Hazar Harmouch. 2022. The effects of data quality on machine learning performance. Retrieved from https://arXiv:2207.14529.
[26]
Samuel Budd, Emma C. Robinson, and Bernhard Kainz. 2021. A survey on active learning and human-in-the-loop deep learning for medical image analysis. Med. Image Anal. 71 (2021), 102062.
[27]
Davide Cacciarelli and Murat Kulahci. 2024. Active learning for data streams: A survey. Mach. Learn. 113, 1 (2024), 185–239.
[28]
Xiangyong Cao, Jing Yao, Zongben Xu, and Deyu Meng. 2020. Hyperspectral image classification with convolutional neural network and active learning. IEEE Trans. Geosci. Remote Sens. 58, 7 (2020), 4604–4616.
[29]
Mauro Castelli, Luca Manzoni, Tatiane Espindola, Aleš Popovič, and Andrea De Lorenzo. 2021. Generative adversarial networks for generating synthetic features for Wi-Fi signal quality. Plos One 16, 11 (2021), e0260308.
[30]
Nicolo Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. 2004. Worst-case analysis of selective sampling for linear-threshold algorithms. Adv. Neural Info. Process. Syst. 17 (2004).
[31]
Shayok Chakraborty. 2020. Asking the right questions to the right users: Active learning with imperfect oracles. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 3365–3372.
[32]
Raghavendra Chalapathy and Sanjay Chawla. 2019. Deep learning for anomaly detection: A survey. Retrieved from https://arXiv:1901.03407.
[33]
Mathieu Chambefort, Raphaël Butez, Emilie Chautru, and Stephan Clémençon. 2022. Improving the quality control of seismic data through active learning. Retrieved from https://arXiv:2201.06616.
[34]
Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. ACM Comput. Surveys 41, 3 (2009), 1–58.
[35]
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. J. Artific. Intell. Res. 16 (2002), 321–357.
[36]
Giuseppe Chicco, Davide; Jurman. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 21, 1 (2020), 6.
[37]
Wei Chu, Martin Zinkevich, Lihong Li, Achint Thomas, and Belle Tseng. 2011. Unbiased online active learning in data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11). ACM, New York, NY, 195–203.
[38]
Quang-Vinh Dang. 2020. Active learning for intrusion detection systems. In Proceedings of the RIVF International Conference on Computing and Communication Technologies (RIVF’20). IEEE, 1–3.
[39]
T. T. Dang, Hyt Ngan, and L. Wei. 2015. Distance-based k-nearest neighbors outlier detection method in large-scale traffic data. In Proceedings of the IEEE International Conference on Digital Signal Processing.
[40]
Shubhomoy Das, Md Rakibul Islam, Nitthilan Kannappan Jayakodi, and Janardhan Rao Doppa. 2019. Active anomaly detection via ensembles: Insights, algorithms, and interpretability. Retrieved Jan. 27, 2019 from https://arxiv.org/abs/1901.08930.
[41]
Shubhomoy Das, Weng-Keen Wong, Thomas Dietterich, Alan Fern, and Andrew Emmott. 2016. Incorporating expert feedback into active anomaly discovery. In Proceedings of the IEEE 16th International Conference on Data Mining (ICDM’16). 853–858.
[42]
Shubhomoy Das, Weng-Keen Wong, Alan Fern, Thomas G. Dietterich, and Md Amran Siddiqui. 2017. Incorporating feedback into tree-based anomaly detection. Retrieved from https://arXiv:1708.09441.
[43]
Jesse Davis and Mark Goadrich. 2006. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML’06). ACM, New York, NY, 233–240.
[44]
Jifei Deng, Jie Sun, Wen Peng, Dianhua Zhang, and Valeriy Vyatkin. 2022. Imbalanced multiclass classification with active learning in strip rolling process. Knowl.-Based Syst. 255 (2022), 109754.
[45]
Debashree Devi, Saroj K. Biswas, and Biswajit Purkayastha. 2020. A review on solution to class imbalance problem: Undersampling approaches. In Proceedings of the International Conference on Computational performance evaluation (ComPE’20). IEEE, 626–631.
[46]
Jun Du and Charles X. Ling. 2010. Active learning with human-like noisy oracle. In Proceedings of the IEEE International Conference on Data Mining. IEEE, 797–802.
[47]
Murat Dundar, Balaji Krishnapuram, Jinbo Bi, and R. Bharat Rao. 2007. Learning classifiers when the training data is not IID. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’07), Vol. 2007. 756–61.
[48]
Suparna Dutta and Monidipa Das. 2023. Remote sensing scene classification under scarcity of labelled samples—A survey of the state-of-the-arts. Comput. & Geosci. 171 (2023), 105295. DOI:
[49]
Dmitry Efimov, Di Xu, Luyang Kong, Alexey Nefedov, and Archana Anandakrishnan. 2020. Using generative adversarial networks to synthesize artificial financial datasets. Retrieved from https://arXiv:2002.02271.
[50]
Laila El Jiani, Sanaa El Filali et al. 2022. Overcome medical image data scarcity by data augmentation techniques: A review. In Proceedings of the International Conference on Microelectronics (ICM’22). IEEE, 21–24.
[51]
Ahmed K Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2006. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19, 1 (2006), 1–16.
[52]
Justin Engelmann and Stefan Lessmann. 2021. Conditional wasserstein GAN-based oversampling of tabular data for imbalanced learning. Expert Syst. Appl. 174 (2021), 114582.
[53]
Seyda Ertekin, Jian Huang, Leon Bottou, and Lee Giles. 2007. Learning on the border: Active learning in imbalanced data classification. In Proceedings of the 16th ACM Conference on Conference on Information and Knowledge Management (CIKM’07). ACM, New York, NY, 127–136.
[54]
Seyda Ertekin, Jian Huang, and C. Lee Giles. 2007. Active learning for class imbalance problem. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 823–824.
[55]
Eleazar Eskin. 2000. Detecting errors within a corpus using anomaly detection. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics.
[56]
Conor Fahy, Shengxiang Yang, and Mario Gongora. 2022. Scarcity of labels in non-stationary data streams: A survey. ACM Comput. Surveys 55, 2 (2022), 1–39.
[57]
Meng Fang, Yuan Li, and Trevor Cohn. 2017. Learning how to active learn: A deep reinforcement learning approach. Retrieved from https://arXiv:1708.02383.
[58]
Abolfazl Farahani, Sahar Voghoei, Khaled Rasheed, and Hamid R. Arabnia. 2021. A brief review of domain adaptation. In Proceedings of the International Conference on Advances in Data Science and Information Engineering (ICDATA’20 and IKE’20). 877–894.
[59]
Alvaro Figueira and Bruno Vaz. 2022. Survey on synthetic data generation, evaluation methods and GANs. Mathematics 10, 15 (2022), 2733.
[60]
Benoît Frénay and Michel Verleysen. 2013. Classification in the presence of label noise: A survey. IEEE Trans. Neural Netw. Learn. Syst. 25, 5 (2013), 845–869.
[61]
Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 3, 4 (1999), 128–135.
[62]
Alexander Freytag, Erik Rodner, and Joachim Denzler. 2014. Selecting influential examples: Active learning with expected model output changes. In Proceedings of the 13th European Conference on Computer Vision (ECCV’14). Springer, 562–577.
[63]
Yifan Fu, Xingquan Zhu, and Bin Li. 2013. A survey on instance selection for active learning. Knowl. Info. Syst. 35 (2013), 249–283.
[64]
Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian active learning with image data. In Proceedings of the International Conference on Machine Learning. PMLR, 1183–1192.
[65]
Joao Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues. 2004. Learning with drift detection. In Proceedings of the 17th Brazilian Symposium on Artificial Intelligence (SBIA’04). Springer, 286–295.
[66]
Joao Gama, Raquel Sebastiao, and Pedro Pereira Rodrigues. 2013. On evaluating stream learning algorithms. Mach. Learn. 90 (2013), 317–346.
[67]
Guojun Gan and Michael Kwok-Po Ng. 2017. K-means clustering with outlier removal. Pattern Recogn. Lett. 90 (2017), 8–14.
[68]
Aurélien Géron. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc.
[69]
Mohsen Ghassemi, Anand D. Sarwate, and Rebecca N. Wright. 2016. Differentially private online active learning with applications to anomaly detection. In Proceedings of the ACM Workshop on Artificial Intelligence and Security. 117–128.
[70]
Markus Goldstein. 2012. FastLOF: An expectation-maximization based local outlier detection algorithm. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR’12). IEEE, 2282–2285.
[71]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Adv. Neural Info. Process. Syst. 27 (2014).
[72]
Venkat Gudivada, Amy Apon, and Junhua Ding. 2017. Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. Int. J. Adv. Softw. 10, 1 (2017), 1–20.
[73]
Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. 2016. Robust random cut forest based anomaly detection on streams. In Proceedings of the International Conference on Machine Learning. PMLR, 2712–2721.
[74]
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 1321–1330.
[75]
Nitin Gupta, Shashank Mujumdar, Hima Patel, Satoshi Masuda, Naveen Panwar, Sambaran Bandyopadhyay, Sameep Mehta, Shanmukha Guttula, Shazia Afzal, Ruhi Sharma Mittal, and Vitobha Munigala. 2021. Data Quality for Machine Learning Tasks. ACM, New York, NY, 4040–4041.
[76]
Nico Görnitz, Marius Kloft, Konrad Rieck, and Ulf Brefeld. 2013. Toward supervised anomaly detection. J. Artific. Intell. Res. 46 (2013), 235–262.
[77]
Robbie Haertel, Eric Ringger, Kevin Seppi, James Carroll, and Peter McClanahan. 2008. Assessing the costs of sampling methods in active learning for annotation. In Proceedings of the Association for Computational Linguistics (ACL’08). 65–68.
[78]
Bohnishikha Halder, Md Manjur Ahmed, Toshiyuki Amagasa, Nor Ashidi Mat Isa, Rahat Hossain Faisal, and Md Mostafijur Rahman. 2022. Missing information in imbalanced data stream: Fuzzy adaptive imputation approach. Appl. Intell. 52, 5 (2022), 5561–5583.
[79]
Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The unreasonable effectiveness of data. IEEE Intell. Syst. 24, 2 (2009), 8–12.
[80]
Steve Hanneke. 2014. Theory of disagreement-based active learning. Found. and Trends® in Mach. Learn. 7, 2–3 (2014), 131–309. DOI:
[81]
Shuji Hao, Jing Lu, Peilin Zhao, Chi Zhang, Steven C. H. Hoi, and Chunyan Miao. 2017. Second-order online active learning and its applications. IEEE Trans. Knowl. Data Eng. 30, 7 (2017), 1338–1351.
[82]
Sahand Hariri, Matias Carrasco Kind, and Robert J Brunner. 2019. Extended isolation forest. IEEE Trans. Knowl. Data Eng. 33, 4 (2019), 1479–1489.
[83]
Ville Hautamaki, Ismo Karkkainen, and Pasi Franti. 2004. Outlier detection using k-nearest neighbour graph. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR’04), Vol. 3. IEEE, 430–433.
[84]
Douglas M. Hawkins. 1980. Identification of Outliers. Vol. 11. Springer.
[85]
Victoria Hodge and Jim Austin. 2004. A survey of outlier detection methodologies. Artific. Intell. Rev. 22 (2004), 85–126.
[86]
T. Ryan Hoens, Robi Polikar, and Nitesh V. Chawla. 2012. Learning from streaming data with concept drift and imbalance: An overview. Progr. Artific. Intell. 1 (2012), 89–101.
[87]
Boshuang Huang, Kobi Cohen, and Qing Zhao. 2018. Active anomaly detection in heterogeneous processes. IEEE Trans. Info. Theory 65, 4 (2018), 2284–2301.
[88]
Chengqiang Huang, Yulei Wu, Yuan Zuo, Ke Pei, and Geyong Min. 2018. Towards experienced anomaly detector through reinforcement learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[89]
ISO. 2008. ISO/IEC 25012-Software engineering: Software product quality requirements and evaluation (SQuaRE). Data Quality Model. Retrieved from https://www.iso.org/standard/35736.html.
[90]
Wen Jin, Anthony K. H. Tung, Jiawei Han, and Wei Wang. 2006. Ranking outliers using symmetric neighborhood relationship. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 577–593.
[91]
Justin M. Johnson and Taghi M. Khoshgoftaar. 2019. Survey on deep learning with class imbalance. J. Big Data 6, 1 (2019), 1–54.
[92]
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of GANs for improved quality, stability, and variation. Retrieved from https://arXiv:1710.10196.
[93]
Harsurinder Kaur, Husanbir Singh Pannu, and Avleen Kaur Malhi. 2019. A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM Comput. Surveys 52, 4 (2019), 1–36.
[94]
Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. Retrieved from https://arXiv:1312.6114.
[95]
Eduard T. Klapwijk, Ferdi Van De Kamp, Mara Van Der Meulen, Sabine Peters, and Lara M. Wierenga. 2019. Qoala-T: A supervised-learning tool for quality control of FreeSurfer segmented MRI data. Neuroimage 189 (2019), 116–129.
[96]
Edwin M. Knox and Raymond T. Ng. 1998. Algorithms for mining distancebased outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases. Citeseer, 392–403.
[97]
Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. 2017. On convergence and stability of GANs. Retrieved from https://arXiv:1705.07215 (2017).
[98]
Bartosz Krawczyk, Leandro L. Minku, João Gama, Jerzy Stefanowski, and Michał Woźniak. 2017. Ensemble learning for data stream analysis: A survey. Info. Fusion 37 (2017), 132–156.
[99]
Punit Kumar and Atul Gupta. 2020. Active learning query strategies for classification, regression, and clustering: A survey. J. Comput. Sci. Technol. 35 (2020), 913–945.
[100]
Donghwoon Kwon, Hyunjoo Kim, Jinoh Kim, Sang C. Suh, Ikkyun Kim, and Kuinam J. Kim. 2019. A survey of deep learning-based network anomaly detection. Cluster Comput. 22 (2019), 949–961.
[101]
Tze Leung Lai. 1995. Sequential changepoint detection in quality control and dynamical systems. J. Roy. Stat. Soc.: Ser. B (Methodol.) 57, 4 (1995), 613–644.
[102]
Longyuan Li, Junchi Yan, Haiyang Wang, and Yaohui Jin. 2021. Anomaly detection of time series with smoothness-inducing sequential variational auto-encoder. IEEE Trans. Neural Netw. Learn. Syst. 32, 3 (2021), 1177–1191.
[103]
Weixin Liang, Girmaw Abebe Tadesse, Daniel Ho, L. Fei-Fei, Matei Zaharia, Ce Zhang, and James Zou. 2022. Advances, challenges and opportunities in creating data for trustworthy AI. Nature Mach. Intell. 4, 8 (2022), 669–677.
[104]
Zicheng Liao, Yizhou Yu, and Baoquan Chen. 2010. Anomaly detection in GPS data based on visual analytics. In Proceedings of the IEEE Symposium on Visual Analytics Science and Technology. IEEE, 51–58.
[105]
Ray Liere and Prasad Tadepalli. 1996. The use of active learning in text categorization. In Proceedings of the AAAI Symposium on Machine Learning in Information Access. Citeseer.
[106]
Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In Proceedings of the 8th IEEE International Conference on Data Mining. IEEE, 413–422.
[107]
Ming-Yu Liu and Oncel Tuzel. 2016. Coupled generative adversarial networks. Adv. Neural Info. Process. Syst. 29 (2016), 469–477.
[108]
Sanmin Liu, Shan Xue, Jia Wu, Chuan Zhou, Jian Yang, Zhao Li, and Jie Cao. 2021. Online active learning for drifting data streams. IEEE Trans. Neural Netw. Learn. Syst. 34, 1 (2021), 186–200.
[109]
Manuel Lopes, Francisco Melo, and Luis Montesano. 2009. Active learning for reward estimation in inverse reinforcement learning. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD’09). Springer, Berlin, 31–46.
[110]
Mohammad Lotfollahi, Mohsen Naghipourfar, Fabian J. Theis, and F. Alexander Wolf. 2020. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics 36, Supplement_2 (2020), i610–i617.
[111]
Chen Change Loy, Tao Xiang, and Shaogang Gong. [n.d.]. Stream-based active unusual event detection. In Proceedings of the Asian Confernce on Computer Vision (ACCV’10) (Lecture Notes in Computer Science). Springer, Berlin, 161–175.
[112]
Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. 2018. Learning under concept drift: A review. IEEE Trans. Knowl. Data Eng. 31, 12 (2018), 2346–2363.
[113]
Yingzhou Lu, Huazheng Wang, and Wenqi Wei. 2023. Machine learning for synthetic data generation: A review. Retrieved from https://arXiv:2302.04062.
[114]
Batta Mahesh. 2020. Machine learning algorithms-a review. Int. J. Sci. Res. 9 (2020), 381–386.
[115]
Elisa Marcelli, Tommaso Barbariol, and Gian Antonio Susto. 2022. Active learning-based isolation forest (ALIF): Enhancing anomaly detection in decision support systems. Retrieved from https://arXiv:2207.03934.
[116]
Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning. ACM Comput. Surveys 54, 6 (2021), 1–35.
[117]
Sebastian Mieruch, Serdar Demirel, Simona Simoncelli, Reiner Schlitzer, and Steffen Seitz. 2021. SalaciaML: A deep learning approach for supporting ocean data quality control. Front. Marine Sci. 8 (2021), 611742.
[118]
Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. Retrieved from https://arXiv:1411.1784.
[119]
Douglas C. Montgomery. 2019. Introduction to Statistical Quality Control. John Wiley & Sons.
[120]
Nour Moustafa and Jill Slay. 2015. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the Military Communications and Information Systems Conference (MilCIS’15). 1–6.
[121]
Stephen Mussmann and Percy Liang. 2018. On the relationship between data efficiency and error for uncertainty sampling. In Proceedings of the International Conference on Machine Learning. PMLR, 3674–3682.
[122]
Mona Nashaat, Aindrila Ghosh, James Miller, Shaikh Quader, Chad Marston, and Jean-Francois Puget. 2018. Hybridization of active learning and data programming for labeling large industrial datasets. In Proceedings of the IEEE International Conference on Big Data (Big Data’18). IEEE, 46–55.
[123]
Ali Bou Nassif, Manar Abu Talib, Qassim Nasir, and Fatima Mohamad Dakalbab. 2021. Machine learning for anomaly detection: A systematic review. IEEE Access 9 (2021), 78658–78700.
[124]
Andrew Y. Ng, Stuart J. Russell et al. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML’00), Vol. 1. 2.
[125]
Christopher Nixon, Mohamed Sedky, and Mohamed Hassan. 2023. Salad: An exploration of split active learning based unsupervised network data stream anomaly detection using autoencoders. Authorea Preprints (2023). Retrieved from https://advance.sagepub.com/doi/full/10.36227/techrxiv.14896773.v1
[126]
Min-hwan Oh and Garud Iyengar. 2019. Sequential anomaly detection using inverse reinforcement learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1480–1490.
[127]
Kalia Orphanou, Jahna Otterbacher, Styliani Kleanthous, Khuyagbaatar Batsuren, Fausto Giunchiglia, Veronika Bogina, Avital Shulner Tal, Alan Hartman, and Tsvi Kuflik. 2022. Mitigating bias in algorithmic systems—A fish-eye view. Comput. Surveys 55, 5 (2022), 1–37.
[128]
Rajendra Pamula, Jatindra Kumar Deka, and Sukumar Nandi. 2011. An outlier detection method based on clustering. In Proceedings of the 2nd International Conference on Emerging Applications of Information Technology. IEEE, 253–256.
[129]
Lujia Pan, Jianfeng Zhang, Patrick P.C. Lee, Marcus Kalander, Junjian Ye, and Pinghui Wang. 2020. Proactive microwave link anomaly detection in cellular data networks. Comput. Netw. 167 (2020), 106969.
[130]
Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. 2021. Deep learning for anomaly detection: A review. ACM Comput. Surveys 54, 2 (2021), 1–38.
[131]
Guansong Pang, Chunhua Shen, and Anton van den Hengel. 2019. Deep anomaly detection with deviation networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 353–362.
[132]
Kunkun Pang, Mingzhi Dong, Yang Wu, and Timothy Hospedales. 2018. Meta-learning transferable active learning policies by deep reinforcement learning. Retrieved from https://arXiv:1806.04798.
[133]
Gilberto Pastorello, Dan Gunter, Housen Chu, Danielle Christianson, Carlo Trotta, Eleonora Canfora, Boris Faybishenko, You-Wei Cheah, Norm Beekwilder, Stephen Chan et al. 2017. Hunting data rogues at scale: Data quality control for observational data in research infrastructures. In Proceedings of the IEEE 13th International Conference on e-Science (e-Science’17). IEEE, 446–447.
[134]
Tomáš Pevnỳ. 2016. Loda: Lightweight on-line detector of anomalies. Mach. Learn. 102, 2 (2016), 275–304.
[135]
Marco A. F. Pimentel, David A. Clifton, Lei Clifton, and Lionel Tarassenko. 2014. A review of novelty detection. Signal Process. 99 (2014), 215–249.
[136]
Leo L. Pipino, Yang W. Lee, and Richard Y. Wang. 2002. Data quality assessment. Commun. ACM 45, 4 (2002), 211–218.
[137]
Antonella D. Pontoriero, Giovanna Nordio, Rubaida Easmin, Alessio Giacomel, Barbara Santangelo, Sameer Jahuar, Ilaria Bonoldi, Maria Rogdaki, Federico Turkheimer, Oliver Howes et al. 2021. Automated data quality control in FDOPA brain PET imaging using deep learning. Comput. Methods Programs Biomed. 208 (2021), 106239.
[138]
Maria Priestley, Fionntán O’Donnell, and Elena Simperl. 2023. A survey of data quality requirements that matter in ML development pipelines. ACM J. Data Info. Qual. 15, 2 (2023), 1–39.
[139]
Piyush Rai, Avishek Saha, Hal Daumé III, and Suresh Venkatasubramanian. 2010. Domain adaptation meets active learning. In Proceedings of the NAACL HLT Workshop on Active Learning for Natural Language Processing. 27–32.
[140]
Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, and Xin Wang. 2021. A survey of deep active learning. ACM Comput. Surveys 54, 9 (2021), 1–40.
[141]
Nicholas Roy and Andrew McCallum. 2001. Toward optimal active learning through monte carlo estimation of error reduction. In Proceedings of the International Conference on Machine Learning (ICML’01). 441–448.
[142]
Stefania Russo, Moritz Lürig, Wenjin Hao, Blake Matthews, and Kris Villez. 2020. Active learning for anomaly detection in environmental data. Environ. Model. Softw. 134 (2020), 104869.
[143]
Carlos Sáez, Nekane Romero, J. Alberto Conejero, and Juan M. García-Gómez. 2021. Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset. J. Amer. Med. Info. Assoc. 28, 2 (2021), 360–364.
[144]
Takaya Saito and Marc Rehmsmeier. 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PloS One 10, 3 (2015), e0118432.
[145]
Vignesh Sampath, Iñaki Maurtua, Juan Jose Aguilar Martin, and Aitor Gutierrez. 2021. A survey on generative adversarial networks for imbalance problems in computer vision tasks. J. Big Data 8 (2021), 1–59.
[146]
Andrew I. Schein and Lyle H. Ungar. 2007. Active learning for logistic regression: An evaluation. Mach. Learn. 68, 3 (2007), 235–265.
[147]
Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. 2022. Anomaly detection in time series: A comprehensive evaluation. Proc. VLDB Endow. 15, 9 (2022), 1779–1797.
[148]
Christopher Schröder and Andreas Niekler. 2020. A survey of active learning for text classification using deep neural networks. Retrieved from https://arXiv:2008.07267.
[149]
Christopher Schröder, Andreas Niekler, and Martin Potthast. 2021. Uncertainty-based query strategies for active learning with transformers. Retrieved from https://arXiv:2107.05687.
[150]
Ozan Sener and Silvio Savarese. 2017. Active learning for convolutional neural networks: A core-set approach. Retrieved from https://arXiv:1708.00489.
[151]
Burr Settles. 2009. Active learning literature survey. Retrieved from http://digital.library.wisc.edu/1793/60660
[152]
Burr Settles. 2011. From theories to queries: Active learning in practice. In Proceedings of the Active Learning and Experimental Design Workshop in Conjunction with AISTATS. JMLR Workshop and Conference Proceedings, 1–18.
[153]
Burr Settles, Mark Craven, and Lewis Friedland. 2008. Active learning with real annotation costs. In Proceedings of the NIPS Workshop on Cost-sensitive Learning, Vol. 1.
[154]
Burr Settles, Mark Craven, and Soumya Ray. 2007. Multiple-instance active learning. Adv. Neural Info. Process. Syst. 20 (2007).
[155]
H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by committee. In Proceedings of the 5th Annual Workshop on Computational Learning Theory (COLT’92). ACM, New York, NY, 287–294.
[156]
Shweta Sharma, Anjana Gosain, and Shreya Jain. 2022. A review of the oversampling techniques in class imbalance problem. In Proceedings of the International Conference on Innovative Computing and Communications (ICICC’21). Springer, 459–472.
[157]
Haotian Shi, Haoren Wang, Chengjin Qin, Liqun Zhao, and Chengliang Liu. 2020. An incremental learning system for atrial fibrillation detection based on transfer learning and active learning. Comput. Methods Progr. Biomed. 187 (2020), 105219.
[158]
Fatimah Sidi, Payam Hassany Shariat Panahy, Lilly Suriani Affendey, Marzanah A. Jabar, Hamidah Ibrahim, and Aida Mustapha. 2012. Data quality: A survey of data quality dimensions. In Proceedings of the International Conference on Information Retrieval and Knowledge Management. 300–304.
[159]
Simona Simoncelli, Paolo Oliveri, Gelsomina Mattia, and Volodymyr Myroshnychenko. 2020. SeaDataCloud temperature and salinity historical data collection for the Mediterranean Sea (Version 2).
[160]
Ikbal Taleb, Mohamed Adel Serhani, and Rachida Dssouli. 2018. Big data quality: A survey. In Proceedings of the IEEE International Congress on Big Data (BigData’18). IEEE, 166–173.
[161]
Jian Tang, Zhixiang Chen, Ada Wai-Chee Fu, and David Cheung. 2001. A robust outlier detection scheme for large data sets. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining. 6–8.
[162]
Youbao Tang, Jinzheng Cai, Le Lu, Adam P. Harrison, Ke Yan, Jing Xiao, Lin Yang, and Ronald M. Summers. 2018. CT image enhancement using stacked generative adversarial networks and transfer learning for lesion segmentation improvement. In Proceedings of the International Workshop on Machine Learning in Medical Imaging. Springer, 46–54.
[163]
Alexander G. Tartakovsky, Aleksey S. Polunchenko, and Grigory Sokolov. 2012. Efficient computer network anomaly detection by changepoint detection methods. IEEE J. Select. Top. Signal Process. 7, 1 (2012), 4–11.
[164]
Hui Yie Teh, Andreas W. Kempa-Liehr, and Kevin I-Kai Wang. 2020. Sensor data quality: A systematic review. J. Big Data 7, 1 (2020), 1–49.
[165]
Fadi Thabtah, Suhel Hammoud, Firuz Kamalov, and Amanda Gonsalves. 2020. Data imbalance in classification: Experimental evaluation. Info. Sci. 513 (2020), 429–441.
[166]
Siddharth Thakur, Jaytrilok Choudhary, and Dhirendra Pratap Singh. 2021. A survey on missing values handling methods for time series data. In Proceedings of the Scandinavian Conference on Information Systems: Intelligent Systems (SCIS’21). Springer, 435–443.
[167]
Holger Trittenbach, Adrian Englhardt, and Klemens Böhm. 2021. An overview and a benchmark of active learning for outlier detection with one-class classifiers. Expert Syst. Appl. 168 (2021), 114372.
[168]
Gido M. van de Ven, Tinne Tuytelaars, and Andreas S. Tolias. 2022. Three types of incremental learning. Nature Mach. Intell. 4, 12 (2022), 1185–1197.
[169]
Jesper E. Van Engelen and Holger H. Hoos. 2020. A survey on semi-supervised learning. Mach. Learn. 109, 2 (2020), 373–440.
[170]
Susana M. Vieira, Uzay Kaymak, and João M. C. Sousa. 2010. Cohen’s kappa coefficient as a performance measure for feature selection. In Proceedings of the International Conference on Fuzzy Systems. IEEE, 1–8.
[171]
Zhiqiang Wan, Yazhou Zhang, and Haibo He. 2017. Variational autoencoder based synthetic data generation for imbalanced learning. In Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI’17). IEEE, 1–7.
[172]
Meng Wang and Xian-Sheng Hua. 2011. Active learning in multimedia annotation and retrieval: A survey. ACM Trans. Intell. Syst. Technol. 2, 2 (2011), 1–21.
[173]
Shuo Wang, Leandro L. Minku, and Xin Yao. 2018. A systematic study of online class imbalance learning with concept drift. IEEE Trans. Neural Netw. Learn. Syst. 29, 10 (2018), 4802–4821.
[174]
Wenlu Wang, Pengfei Chen, Yibin Xu, and Zilong He. 2022. Active-MTSAD: Multivariate time series anomaly detection with active learning. In Proceedings of the 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 263–274.
[175]
Xiaogang Wang, Xiaoxu Ma, and W. Eric L. Grimson. 2008. Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. IEEE Trans. Pattern Anal. Mach. Intell. 31, 3 (2008), 539–555.
[176]
Yao Wang, Zhaowei Wang, Zejun Xie, Nengwen Zhao, Junjie Chen, Wenchi Zhang, Kaixin Sui, and Dan Pei. 2020. Practical and white-box anomaly detection through unsupervised and active learning. In Proceedings of the 29th International Conference on Computer Communications and Networks (ICCCN’20). IEEE, 1–9.
[177]
Gary M. Weiss and Foster Provost. 2003. Learning when training data are costly: The effect of class distribution on tree induction. J. Artific. Intell. Res. 19 (2003), 315–354.
[178]
Ashenafi Zebene Woldaregay, Eirik Årsand, Taxiarchis Botsis, David Albers, Lena Mamykina, and Gunnar Hartvigsen. 2019. Data-driven blood glucose pattern classification and anomalies detection: Machine-learning applications in type 1 diabetes. J. Med. Internet Res. 21, 5 (2019), e11030.
[179]
Tong Wu and Jorge Ortiz. 2021. RLAD: Time series anomaly detection through reinforcement learning and active learning. Retrieved from https://arxiv.org/abs/2104.00543
[180]
Yanping Yang and Guangzhi Ma. 2010. Ensemble-based active learning for class imbalance problem. J. Biomed. Sci. Eng. 3, 10 (2010), 1022.
[181]
Yang Yang, Da-Wei Zhou, De-Chuan Zhan, Hui Xiong, and Yuan Jiang. 2019. Adaptive deep models for incremental learning: Considering capacity scalability and sustainability. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 74–82.
[182]
Donggeun Yoo and In So Kweon. 2019. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 93–102.
[183]
Mengran Yu and Shiliang Sun. 2020. Policy-based reinforcement learning for time series anomaly detection. Eng. Appl. Artific. Intell. 95 (2020), 103919.
[184]
Hang Zhang, Weike Liu, Jicheng Shan, and Qingbao Liu. 2018. Online active learning paired ensemble for concept drift and class imbalance. IEEE Access 6 (2018), 73815–73828.
[185]
Liumei Zhang, Baoyu Tan, Tianshi Liu, and Xiaoqun Sun. 2019. Classification study for the imbalanced data based on Biased-SVM and the modified over-sampling algorithm. In Journal of Physics: Conference Series, Vol. 1237. IOP Publishing, 022052.
[186]
Xiaoxuan Zhang, Tianbao Yang, and Padmini Srinivasan. 2016. Online asymmetric active learning with imbalanced data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2055–2064.
[187]
Peilin Zhao and Steven C. H. Hoi. 2013. Cost-sensitive online active learning with application to malicious URL detection. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’13). ACM, New York, NY, 919–927.
[188]
Dequan Zheng, Fenghuan Li, and Tiejun Zhao. 2016. Self-adaptive statistical process control for anomaly detection in time series. Expert Syst. Appl. 57 (2016), 324–336.
[189]
Xingquan Zhu, Peng Zhang, Xiaodong Lin, and Yong Shi. 2007. Active learning from data streams. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM’07). IEEE, 757–762.
[190]
Yizhe Zhu, Martin Renqiang Min, Asim Kadav, and Hans Peter Graf. 2020. S3VAE: Self-supervised sequential VAE for representation disentanglement and data generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6538–6547.
[191]
Amir Ziai. 2021. Active learning for network intrusion detection. In Data Science: Theory, Algorithms, and Applications. Springer, 3–14.
[192]
Indrė Žliobaitė, Albert Bifet, Bernhard Pfahringer, and Geoffrey Holmes. 2013. Active learning with drifting streaming data. IEEE Trans. Neural Netw. Learn. Syst. 25, 1 (2013), 27–39.
[193]
Indrė Žliobaitė, Mykola Pechenizkiy, and Joao Gama. 2016. An overview of concept drift applications. In Big Data Analysis: New Algorithms for a New Society. Springer, 91–114.

Cited By

  • A Survey on Data Quality Dimensions and Tools for Machine Learning (Invited Paper). In 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), 120–131. DOI: 10.1109/AITest62860.2024.00023. Online publication date: 15 July 2024.
  • Active learning for industrial applications. Quality Engineering, 1–10. DOI: 10.1080/08982112.2024.2402376. Online publication date: 17 September 2024.


Published In

Journal of Data and Information Quality, Volume 16, Issue 2
June 2024
135 pages
EISSN: 1936-1963
DOI: 10.1145/3613602

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 June 2024
Online AM: 11 May 2024
Accepted: 12 April 2024
Revised: 19 February 2024
Received: 10 August 2023
Published in JDIQ Volume 16, Issue 2


Author Tags

  1. data quality control
  2. active learning
  3. query strategy
  4. anomaly detection
  5. machine learning

Qualifiers

  • Survey

Funding Sources

  • European Union’s Horizon research and innovation program via the CLARIFY project
  • BLUECLOUD 2026
  • ENVRI-FAIR
  • ENVRI-Hub Next
  • EVERSE
  • BioDT
  • Dutch research council via the LTER-LIFE project

