research-article

Concept Evolution Detecting over Feature Streams

Authors:

Xindong Wu

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 8

Article No.: 209, Pages 1 - 32

https://doi.org/10.1145/3678012

Published: 21 August 2024

Abstract

The explosion of data volume has gradually shifted big data processing from the static batch mode to the online streaming mode. Streaming data can be divided into instance streams (the feature space remains fixed while instances arrive over time), feature streams (the instance space remains fixed while features arrive over time), or a combination of both. In general, learning from online streaming data faces two main challenges: infinite length and concept change. Feature stream learning has recently received much attention. However, existing feature stream learning methods focus on feature selection or classification and ignore how concepts change over time. To the best of our knowledge, this is the first work to study concept evolution detection over feature streams. Specifically, we first give a formal definition of concept evolution over feature streams, which includes three types: concept emerging, concept drift, and concept forgetting. We then design a novel framework for detecting concept evolution over feature streams that consists of a sliding window, an improved density peak-based clustering algorithm, and a weighted bipartite graph-based concept detection method. Extensive experiments on several synthetic and high-dimensional datasets demonstrate the ability of the proposed method to cluster and detect concept evolution over feature streams.
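
The pipeline named in the abstract (sliding window, clustering, weighted bipartite graph) can be illustrated with a small Python sketch. This is not the authors' algorithm: the abstract gives no formulas, so the Jaccard edge weight, the thresholds `match` and `stable`, and the function names `sliding_windows`, `bipartite_weights`, and `detect_evolution` are all assumptions made for illustration. The clustering step itself is assumed to happen elsewhere and to yield, for each window, a list of clusters represented as sets of instance ids.

```python
from itertools import product


def sliding_windows(n_features, size, step=1):
    """Yield tuples of feature indices for each window (hypothetical helper)."""
    for start in range(0, n_features - size + 1, step):
        yield tuple(range(start, start + size))


def jaccard(a, b):
    """Overlap of two instance-id sets; used here as an assumed edge weight."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bipartite_weights(old_clusters, new_clusters):
    """Weighted bipartite graph: one edge per (old cluster, new cluster) pair."""
    return {
        (i, j): jaccard(old, new)
        for (i, old), (j, new) in product(
            enumerate(old_clusters), enumerate(new_clusters)
        )
    }


def detect_evolution(old_clusters, new_clusters, match=0.5, stable=0.9):
    """Label changes between two consecutive windows.

    Assumed rules (not taken from the paper):
      - a new cluster whose best edge is below `match` is "emerging";
      - a new cluster whose best edge is in [match, stable) is "drift";
      - an old cluster whose best edge is below `match` is "forgetting".
    """
    weights = bipartite_weights(old_clusters, new_clusters)
    events = []
    for j in range(len(new_clusters)):
        best = max((weights[i, j] for i in range(len(old_clusters))), default=0.0)
        if best < match:
            events.append(("emerging", j))
        elif best < stable:
            events.append(("drift", j))
    for i in range(len(old_clusters)):
        best = max((weights[i, j] for j in range(len(new_clusters))), default=0.0)
        if best < match:
            events.append(("forgetting", i))
    return events


# Clusters found in two consecutive windows over the same fixed instance space.
old = [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9}]
new = [{0, 1, 2, 3}, {4, 5, 6, 10}, {11, 12, 13}]
events = detect_evolution(old, new)
```

In this toy input the first cluster is unchanged, the second keeps three of its four members, the third new cluster overlaps nothing seen before, and the old cluster `{8, 9}` has no counterpart, so the sketch reports one drift, one emerging, and one forgetting event. Choosing set overlap as the edge weight is only one option; a distance between cluster centers would fit the same structure.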

Index Terms

Concept Evolution Detecting over Feature Streams
1. Computing methodologies
  1. Machine learning
    1. Learning settings
      1. Online learning settings

Published In

ACM Transactions on Knowledge Discovery from Data Volume 18, Issue 8

September 2024

700 pages

EISSN:1556-472X

DOI:10.1145/3613713

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 August 2024

Online AM: 13 July 2024

Accepted: 09 July 2024

Revised: 02 July 2024

Received: 04 January 2024

Published in TKDD Volume 18, Issue 8

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Science Foundation of Anhui Province of China

Bibliometrics

Total Citations: 0

Total Downloads: 126

Downloads (Last 12 months): 126

Downloads (Last 6 weeks): 49

Reflects downloads up to 06 Oct 2024
