OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams

Published: 03 May 2024

Abstract

Extracting insights from relational data streams in a timely manner is an active research topic. Data streams present unique challenges, such as distribution drifts, outliers, emerging classes, and changing features, which have recently been described as open environment challenges for machine learning. While existing studies address incremental learning for data streams, their evaluations are mostly conducted on synthetic datasets. A natural question is therefore what these open environment challenges look like in practice and how existing incremental learning algorithms perform on real-world relational data streams. To fill this gap, we develop an Open Environment Benchmark named OEBench to evaluate open environment challenges in real-world relational data streams. Specifically, we investigate 55 real-world relational data streams and establish that open environment scenarios are indeed widespread, posing significant challenges for stream learning algorithms. Through benchmarks with existing incremental learning algorithms, we find that increased data quantity does not consistently enhance model accuracy in open environment scenarios, where machine learning models can be significantly compromised by missing values, distribution drifts, or anomalies in real-world data streams. Current techniques are insufficient to effectively mitigate the challenges brought by open environments, and more research is needed to address them. All datasets and code are open-sourced at https://github.com/Xtra-Computing/OEBench.
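The abstract's central observation, that accumulating more stream data does not consistently help once the distribution drifts, can be illustrated with a minimal, hypothetical sketch (not the OEBench code). It runs a prequential (test-then-train) evaluation loop over a synthetic labeled stream with one abrupt drift; the toy learner, stream generator, and all names below are assumptions made for illustration only.

```python
import random

class MajorityClassLearner:
    """Toy incremental model: predicts the majority label seen so far."""
    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return 0
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def drifting_stream(n, seed=0):
    """Labels are mostly 0 before the midpoint and mostly 1 after it."""
    rng = random.Random(seed)
    for t in range(n):
        p_one = 0.1 if t < n // 2 else 0.9  # abrupt concept drift at n/2
        yield t, int(rng.random() < p_one)

def prequential_accuracy(model, stream):
    """Test-then-train: score each example before learning from it."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test first...
        model.learn(x, y)                      # ...then train
        total += 1
    return correct / total

acc = prequential_accuracy(MajorityClassLearner(), drifting_stream(2000))
print(f"prequential accuracy: {acc:.3f}")
```

Because the learner keeps accumulating pre-drift statistics, its accuracy collapses after the drift point, so the extra data actively hurts it; this mirrors, in miniature, the benchmark's finding that more data is not always better in open environments.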


Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 6
February 2024
369 pages

Publisher

VLDB Endowment

Publication History

Published: 03 May 2024
Published in PVLDB Volume 17, Issue 6


Qualifiers

  • Research-article
