Multi-Label Punitive kNN with Self-Adjusting Memory for Drifting Data Streams

Published: 11 November 2019

Abstract

    In multi-label learning, data may simultaneously belong to more than one class. When multi-label data arrive as a stream, the challenges of multi-label learning are joined by those of data stream mining, which requires algorithms fast and flexible enough to match both the speed and the evolving nature of the stream. This article presents a punitive k-nearest neighbors algorithm with a self-adjusting memory (MLSAMPkNN) for multi-label, drifting data streams. The memory adjusts in size to contain only the current concept, and a novel punitive system identifies and penalizes errant data examples early, removing them from the window. By retaining only data that are both current and beneficial, MLSAMPkNN adapts quickly and efficiently to changes within the data stream while maintaining low computational complexity. Additionally, the punitive removal mechanism offers increased robustness to data-level difficulties present in data streams, such as class imbalance and noise. The experimental study compares the proposal to 24 algorithms on 30 real-world and 15 artificial multi-label data streams, using six multi-label metrics, evaluation time, and memory consumption. The superior performance of the proposed method is validated through non-parametric statistical analysis, confirming both high accuracy and low time complexity. MLSAMPkNN is a versatile classifier, capable of excellent performance in diverse stream scenarios.
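    The abstract's two core mechanisms — a sliding-window memory and punitive eviction of stored examples that repeatedly mislead predictions — can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the class name, the parameters `k`, `max_window`, and `penalty_limit`, and the coarse whole-label-set penalty rule are all hypothetical (the actual MLSAMPkNN penalizes per-label errors and adjusts its window size dynamically to the current concept).

    ```python
    from collections import Counter

    class PunitiveMLkNN:
        """Toy sketch of a punitive, windowed multi-label kNN (not MLSAMPkNN itself)."""

        def __init__(self, k=3, max_window=50, penalty_limit=2):
            self.k = k
            self.max_window = max_window
            self.penalty_limit = penalty_limit
            self.window = []  # entries: [features, label_set, penalty_count]

        def _neighbors(self, x):
            # indices of the k nearest window entries by squared Euclidean distance
            dist = lambda i: sum((a - b) ** 2 for a, b in zip(x, self.window[i][0]))
            return sorted(range(len(self.window)), key=dist)[:self.k]

        def predict(self, x):
            # per-label majority vote among the k nearest neighbors
            if not self.window:
                return set()
            idx = self._neighbors(x)
            votes = Counter(label for i in idx for label in self.window[i][1])
            return {label for label, c in votes.items() if c > len(idx) / 2}

        def partial_fit(self, x, y):
            # test-then-train: penalize neighbors whose label sets disagree with y,
            # evict any entry that has misled too often, then store the new example
            if self.window:
                for i in self._neighbors(x):
                    if self.window[i][1] != set(y):
                        self.window[i][2] += 1
            self.window = [e for e in self.window if e[2] <= self.penalty_limit]
            self.window.append([list(x), set(y), 0])
            if len(self.window) > self.max_window:
                self.window.pop(0)  # forget the oldest example
    ```

    Feeding a stream through `partial_fit` while querying `predict` between updates mimics the prequential (test-then-train) evaluation used for data streams; the eviction step is what lets the window shed examples from an outdated concept instead of waiting for them to age out.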




    Published In

    ACM Transactions on Knowledge Discovery from Data  Volume 13, Issue 6
    December 2019
    282 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/3366748

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 November 2019
    Accepted: 01 August 2019
    Revised: 01 April 2019
    Received: 01 November 2018
    Published in TKDD Volume 13, Issue 6


    Author Tags

    1. Multi-label classification
    2. concept drift
    3. data stream
    4. nearest neighbor

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • 2018 VCU Presidential Research Quest Fund and an Amazon AWS Machine Learning Research award


    Cited By

    • (2024) Math Word Problem Generation via Disentangled Memory Retrieval. ACM Transactions on Knowledge Discovery from Data 18, 5 (2024), 1–21. DOI: 10.1145/3639569
    • (2024) Balancing efficiency vs. effectiveness and providing missing label robustness in multi-label stream classification. Knowledge-Based Systems 289 (2024), 111489. DOI: 10.1016/j.knosys.2024.111489
    • (2024) An oversampling algorithm of multi-label data based on cluster-specific samples and fuzzy rough set theory. Complex & Intelligent Systems (2024). DOI: 10.1007/s40747-024-01498-w
    • (2024) A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework. Machine Learning 113, 7 (2024), 4165–4243. DOI: 10.1007/s10994-023-06353-6
    • (2023) A Weighted Ensemble Classification Algorithm Based on Nearest Neighbors for Multi-Label Data Stream. ACM Transactions on Knowledge Discovery from Data 17, 5 (2023), 1–21. DOI: 10.1145/3570960
    • (2023) Integrating Global and Local Feature Selection for Multi-Label Learning. ACM Transactions on Knowledge Discovery from Data 17, 1 (2023), 1–37. DOI: 10.1145/3532190
    • (2023) Online Semi-Supervised Classification on Multilabel Evolving High-Dimensional Text Streams. IEEE Transactions on Systems, Man, and Cybernetics: Systems 53, 10 (2023), 5983–5995. DOI: 10.1109/TSMC.2023.3275298
    • (2023) Interpretable SAM-kNN Regressor for Incremental Learning on High-Dimensional Data Streams. Applied Artificial Intelligence 37, 1 (2023). DOI: 10.1080/08839514.2023.2198846
    • (2023) A survey on machine learning for recurring concept drifting data streams. Expert Systems with Applications 213 (2023), 118934. DOI: 10.1016/j.eswa.2022.118934
    • (2023) AdaDeepStream: streaming adaptation to concept evolution in deep neural networks. Applied Intelligence 53, 22 (2023), 27323–27343. DOI: 10.1007/s10489-023-04812-0
