research-article

Concept Evolution Detecting over Feature Streams

Published: 21 August 2024

Abstract

The explosion of data volume has gradually shifted big data processing from static batch mode to online streaming. Streaming data can be divided into instance streams (the feature space remains fixed while instances increase over time), feature streams (the instance space is fixed while features arrive over time), or both. Generally, learning from online streaming data faces two main challenges: infinite length and concept change. Recently, feature stream learning has received much attention. However, existing feature stream learning methods focus on feature selection or classification and ignore how concepts change over time. To the best of our knowledge, this is the first work to study concept evolution detection over feature streams. Specifically, we first give a formal definition of concept evolution over feature streams, which includes three types: concept emerging, concept drift, and concept forgetting. Then, we design a novel framework for detecting concept evolution over feature streams that consists of a sliding window, an improved density peak-based clustering algorithm, and a weighted bipartite graph-based concept detection method. Extensive experiments on several synthetic and high-dimensional datasets demonstrate the ability of our method to cluster and to detect concept evolution over feature streams.
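
To make the detection step more concrete, the following is a minimal sketch of how a weighted bipartite graph between the clusters of two consecutive windows could be used to label the three evolution types. It is an illustration under stated assumptions, not the method of the article: the abstract does not specify the edge weights or decision rules, so the Jaccard weight, the thresholds `low` and `high`, and the names `jaccard`, `bipartite_weights`, and `label_evolution` are all hypothetical. It also assumes that each cluster is a set of instance ids, which the fixed instance space of a feature stream makes comparable across windows, and that the clustering step may leave some instances unclustered.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap weight between two clusters given as sets of instance ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bipartite_weights(
    old: List[Set[int]], new: List[Set[int]]
) -> Dict[Tuple[int, int], float]:
    """Weighted bipartite graph: edge (i, j) links old cluster i to new cluster j."""
    return {(i, j): jaccard(o, n) for i, o in enumerate(old) for j, n in enumerate(new)}


def label_evolution(
    old: List[Set[int]],
    new: List[Set[int]],
    low: float = 0.3,   # hypothetical threshold: below this, clusters are unrelated
    high: float = 0.8,  # hypothetical threshold: at or above this, cluster is stable
) -> List[Tuple[str, int]]:
    """Label clusters by their best-matching edge across the two windows."""
    w = bipartite_weights(old, new)
    events: List[Tuple[str, int]] = []
    # New clusters: no related old cluster -> emerging; partial match -> drift.
    for j in range(len(new)):
        best = max((w[i, j] for i in range(len(old))), default=0.0)
        if best < low:
            events.append(("emerging", j))
        elif best < high:
            events.append(("drift", j))
    # Old clusters with no related new cluster are treated as forgotten.
    for i in range(len(old)):
        best = max((w[i, j] for j in range(len(new))), default=0.0)
        if best < low:
            events.append(("forgetting", i))
    return events


# Clusters over 12 fixed instances (ids 0..11) in two consecutive windows.
old_clusters = [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9}]
new_clusters = [{0, 1, 2, 3}, {4, 5, 6, 8}, {10, 11}]
print(label_evolution(old_clusters, new_clusters))
```

In this toy input, the first cluster is unchanged, the second shares three of its four instances with its predecessor and is labeled as drift, the third new cluster has no counterpart and is labeled as emerging, and the old cluster `{8, 9}` has no sufficiently related successor and is labeled as forgetting. The thresholds would need tuning on real data.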

    Information

    Published In

    ACM Transactions on Knowledge Discovery from Data  Volume 18, Issue 8
    September 2024
    700 pages
    EISSN:1556-472X
    DOI:10.1145/3613713

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 August 2024
    Online AM: 13 July 2024
    Accepted: 09 July 2024
    Revised: 02 July 2024
    Received: 04 January 2024
    Published in TKDD Volume 18, Issue 8

    Author Tags

    1. Online learning
    2. feature streams
    3. stream learning
    4. concept evolution detecting

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Science Foundation of Anhui Province of China
