License: arXiv.org perpetual non-exclusive license
arXiv:2312.02901v1 [cs.LG] 05 Dec 2023

C. Garcia: Instituto Federal de Santa Catarina (IFSC) - Campus Caçador, Caçador, Brazil. Email: cristiano.garcia@ifsc.edu.br
R. Abilio: Instituto Federal de São Paulo (IFSP) - Campus Boituva, Boituva, Brazil. Email: ramon.abilio@ifsp.edu.br
A. L. Koerich: École de Technologie Supérieure (ÉTS), Université du Québec, Montréal, QC, Canada. Email: alessandro.koerich@etsmtl.ca
A. Britto Jr.: Universidade Estadual de Ponta Grossa (UEPG), Ponta Grossa, Brazil. Email: alceu@ppgia.pucpr.br
C. Garcia, A. Britto Jr., J. P. Barddal (corresponding author): Pós-Graduação em Informática (PPGIa), Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba, Brazil

Concept Drift Adaptation in Text Stream Mining Settings: A Comprehensive Review

Cristiano Mesquita Garcia    Ramon Simões Abilio    Alessandro Lameiras Koerich    Alceu de Souza Britto Jr    Jean Paul Barddal(*)
Abstract

Due to the advent and increase in the popularity of the Internet, people have been producing and disseminating textual data in several ways, such as reviews, social media posts, and news articles. As a result, numerous researchers have been working on discovering patterns in textual data, especially because social media posts function as social sensors, indicating people’s opinions, interests, etc. However, most tasks regarding natural language processing are addressed using traditional machine learning methods and static datasets. This setting can lead to several problems, such as an outdated dataset, which may not correspond to reality, and an outdated model, whose performance degrades over time. Concept drift, which corresponds to changes in data distribution and patterns, aggravates these issues. In a text stream scenario, it is even more challenging due to the stream’s characteristics, such as the high speed and data arriving sequentially. In addition, models for this type of scenario must adhere to the constraints mentioned above while learning from the stream by storing texts for a limited time and consuming low memory. In this study, we performed a systematic literature review regarding concept drift adaptation in text stream scenarios. Considering well-defined criteria, we selected 40 papers to unravel aspects such as text drift categories, types of text drift detection, model update mechanism, the addressed stream mining tasks, types of text representations, and text representation update mechanism. In addition, we discussed drift visualization and simulation and listed real-world datasets used in the selected papers. Therefore, this paper comprehensively reviews concept drift adaptation in text stream mining scenarios.

Keywords: Concept drift, Text stream mining, Model update schemes, Text representation methods, Textual datasets

1 Introduction

Machine learning (ML) has been increasingly researched as processing power has increased and storage has become cheaper. The development of frameworks and libraries, such as Weka (Hall et al., 2009) and Scikit-Learn (Pedregosa et al., 2011), has enabled the rapid development and deployment of ML models and their applications. Moreover, TensorFlow (Abadi et al., 2015), Keras (Chollet et al., 2015), PyTorch (Paszke et al., 2019), and HuggingFace (Wolf et al., 2019) are more contemporary enablers that are related to deep learning models and generally rely on graphics processing units (GPUs) to expedite the training process. Therefore, there has been an increase in the development of ML applications, such as credit scoring (Barddal et al., 2020), emotion recognition (Delazeri et al., 2022), and cryptocurrency pricing prediction (Garcia et al., 2019a).

Software, sensors, processes, and humans generate data, the primary raw resource for developing ML models. Humans, in particular, produce a considerable amount of unstructured data on the Internet, especially on social media, where users upload pictures and post opinions regarding anything, including products, artists, and politicians. Therefore, social networks have been considered a low-cost, rapid source of information, with the collected data utilized for election prediction (Dwi Prasetyo and Hauff, 2015; Brito and Adeodato, 2023; Tsai et al., 2019), stance analysis (Bondielli et al., 2022), and event detection (Suprem et al., 2019a), etc.

Texts are unstructured data. Most ML approaches expect numbers as input, so texts cannot be directly used as input for ML methods. To overcome this limitation, text must be processed, cleaned, sometimes standardized, and converted to fixed-size numerical vector representations. The conversion from unstructured to structured data is also known as feature extraction (Ahuja et al., 2019; Thuma et al., 2023). Recent natural language processing (NLP) advances have simplified text-based real-life applications. It is worth mentioning Word2Vec (Mikolov et al., 2013b), which is a neural network-based approach for generating word embeddings (vector representations), and BERT (Devlin et al., 2018), a bidirectional transformer-based modeling architecture that can be applied in tasks such as sequence-to-sequence learning, e.g., language translation, text generation, and text classification. One advantage of the aforementioned methods is their reuse capability. Several pre-trained models are available on the Internet in specialized hubs such as HuggingFace (https://huggingface.co/models). A pre-trained model can aid in extracting features from text, which are then used as input for a classifier, e.g., a sentiment classifier. The time necessary to develop the final ML model can be drastically reduced if training a representation-learning model from scratch is unnecessary.
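As a minimal illustration of converting text into a fixed-size numerical vector, the sketch below uses the hashing trick over a bag of words. This technique is chosen here only for brevity and is not one of the methods named above; the function names `tokenize` and `hashed_bow` and the default of 16 features are assumptions made for the example.

```python
import re
import zlib


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashed_bow(text, n_features=16):
    """Map a text of any length to a count vector with n_features entries.

    zlib.crc32 is used instead of the built-in hash() because the latter
    is randomized per process for strings, which would make the vectors
    differ between runs.
    """
    vector = [0] * n_features
    for token in tokenize(text):
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


# Texts of different lengths yield vectors of the same size.
short = hashed_bow("Great phone")
longer = hashed_bow("Great phone, great battery, terrible screen")
```

Because the vector size is fixed in advance, no vocabulary has to be stored, at the cost of possible collisions between different tokens.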

Although the aforementioned approaches were initially designed for batch learning, it is possible to use pre-trained models to extract features in data stream scenarios. Data streams are considered a collection of sequential data that arrives one item at a time, or in small batches, in temporal order (Bifet et al., 2018). Thus, for ML models in data streams, there are challenges such as learning from the data the instant it arrives, adapting the model in case of pattern change, and keeping it concise. Text streams represent a continuous flow of textual data, such as social media updates, news articles, customer reviews, or online discussions. Several social networks and news agencies provide application programming interfaces (APIs) that function as a text stream. Twitter (https://twitter.com/) is an example of a social media platform that offers API access. MOA (Bifet et al., 2010) and RiverML (Montiel et al., 2021) have been enablers of experimentation and development of methods for stream mining.

Pattern change is commonly referred to as concept drift in the literature. Concept drift is a phenomenon that occurs in data subject to non-stationary processes (Bifet et al., 2018; Gama et al., 2014). In real life, for example, changes may occur in temperature or customer purchasing patterns across given analyzed periods. Concept drift imposes several difficulties for ML models, e.g., if concept drifts are not captured and managed by the model, its performance will degrade over time. It can be even more challenging for ML approaches that require the processing of text streams due to the constraints inherent to streaming learning settings, such as the speed of the stream. In text streams, concept drift occurs when the underlying patterns and relationships within the textual data shift, making previously learned models or approaches ineffective. Concept drift in text streams arises from the dynamic, evolving nature of language, with trends, changing context and sentiment, and diversity of data sources. Therefore, understanding and addressing concept drift is crucial for maintaining ML models’ accuracy, relevance, and ethical integrity for text stream processing.

Processing text and learning in stream scenarios are challenging due to the requirements for ML models to function effectively in such scenarios. The requirements include: (i) learning from the data as it arrives; (ii) discarding the data after learning from it; (iii) performing all operations in a single-pass fashion (Gama et al., 2004; Bifet et al., 2018). In addition, NLP-related activities can be challenging in stream scenarios, such as maintaining an updated and concise vocabulary, and updating representations when possible. Therefore, the constraints of streaming scenarios are more restrictive when handling text streams.
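The three requirements above can be sketched as a test-then-train loop, in which each text is first used for prediction, then for learning, and is not stored afterwards. The classifier below is a deliberately naive token-count model invented for this illustration (the class and function names are hypothetical), not a method from the reviewed literature.

```python
from collections import Counter, defaultdict


class TokenCountClassifier:
    """Toy incremental classifier: scores a class by summed token counts."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()

    def predict(self, tokens):
        if not self.label_counts:
            return None  # nothing has been learned yet

        def score(label):
            return sum(self.token_counts[label][t] for t in tokens)

        # Ties are broken in favour of the most frequent label so far.
        return max(self.label_counts,
                   key=lambda label: (score(label), self.label_counts[label]))

    def learn(self, tokens, label):
        self.token_counts[label].update(tokens)
        self.label_counts[label] += 1


def prequential_accuracy(stream, model):
    """Predict each item, then learn from it, visiting every item once."""
    correct = total = 0
    for text, label in stream:             # (i) learn as the data arrives
        tokens = text.lower().split()
        if model.predict(tokens) == label:
            correct += 1
        total += 1
        model.learn(tokens, label)          # (iii) single pass over the data
        # (ii) the raw text is not kept after this point
    return correct / total if total else 0.0


toy_stream = iter([
    ("good movie", "pos"),
    ("bad movie", "neg"),
    ("good plot", "pos"),
    ("bad acting", "neg"),
])
accuracy = prequential_accuracy(toy_stream, TokenCountClassifier())
```

Note that the token counters of this toy model grow without bound; a realistic text stream learner would also need to prune its vocabulary to stay concise.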

This study provides a systematic review regarding concept drift detection in text streams. This systematic review aims to unravel the most common approaches to managing concept drift, updating the model to recover from concept drifts, text representation methods, datasets, and applications in challenging scenarios such as text streams. This work is organized as follows: Section 2 introduces data stream mining and presents the aspects of concept drift, semantic shift, and concept drift detectors. Section 3 details the protocol for this systematic review. Section 4 presents and discusses the results. Section 5 lists and describes the available real-world datasets. Section 6 discusses concept drift visualization and drift simulation settings. Section 7 concludes the study and emphasizes future directions.

2 Background

According to Bifet et al. (2018), “data streams are an algorithmic abstraction to support real-time analytics”. Data streams consist of data items that arrive continuously and are temporally ordered. In traditional data mining, a data collection must be available so that the ML model can learn patterns from it and perform the desired task. However, there are several constraints in Data Stream Mining (DSM). Because the data arrives continuously and streams are potentially infinite, storing the data to learn from it later can become infeasible. Thus, the ML model must learn from the data and discard it within a short period (Bifet et al., 2018). In addition, Bifet et al. (2018) mentioned that there are two main challenges for ML models when handling data streams: (a) learning from the data the instant it arrives; and (b) being able to adapt in case the data evolves. Because these challenges must be addressed quickly and with minimal processing, the outcome is an approximate model rather than a precise model. Because data streams arrive continuously and rapidly and can be infinite, the data generation process may undergo significant changes over time, which are reflected in the data distribution. These changes, namely concept drift, increase the challenges of managing data and text streams.

Concept drift in text streams is formally described as follows. Let a text stream $T=\{X_{1},X_{2},X_{3},...\}$ be a potentially infinite sequence of input texts $X_{i}$, where $i$ is the text index. Assuming a textual data stream in a classification task, each text may be accompanied by its label $y$, thus becoming a sequence of pairs $(X,y)$. Therefore, according to Gama et al. (2014), a concept drift is deemed to have occurred if

$$\exists X : p_{t_{0}}(X,y) \neq p_{t_{1}}(X,y), \tag{1}$$

in which $p_{t_{0}}(X,y)$ is the joint distribution between $X$ and the label $y$ at time $t_{0}$.

According to Gama et al. (2014), “data is expected to evolve”. Thus, the data distribution can change as time passes. These changes are referred to as concept drift. There are two primary types of drifts in data distribution: (i) real concept drift, where the relationship between $X$ (input data) and $y$ (class) changes, and (ii) virtual concept drift, where the data distribution in $X$ changes, but $p(y|X)$ does not change, meaning that the boundaries are unchanged. Real concept drift can occur even if the data distribution in $X$ does not change.
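The distinction between the two drift types can be written out with the product rule, using the same symbols as Eq. (1). This is a standard decomposition given here for illustration rather than a formula quoted from the cited works:

```latex
\begin{align*}
\text{product rule:} \quad & p_{t}(X, y) = p_{t}(y \mid X)\, p_{t}(X) \\
\text{real drift:} \quad & p_{t_{0}}(y \mid X) \neq p_{t_{1}}(y \mid X) \\
\text{virtual drift:} \quad & p_{t_{0}}(X) \neq p_{t_{1}}(X)
  \ \text{ and } \ p_{t_{0}}(y \mid X) = p_{t_{1}}(y \mid X)
\end{align*}
```

Either change can alter the joint distribution compared in Eq. (1), but only a change in the first factor moves the class boundaries.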

In addition, Gama et al. (2014) highlighted four different types of concept drift dynamics over time. The four categories are as follows: (a) abrupt, where the data distribution changes from $t_{i}$ to $t_{i+1}$; (b) incremental, where the data distribution changes from $t_{i}$ to $t_{i+\Delta}$, where $\Delta>1$; (c) gradual, where the data distribution switches between different means until remaining in the last distribution; and lastly (d) reoccurring, where the data distribution changes and later switches back to the first data distribution observed. Fig. 1 depicts the aforementioned types concerning the dynamics over time.
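The four dynamics can be simulated with a one-dimensional stream whose mean moves between an old concept (0) and a new concept (1). The sketch below is only an illustration; the change points, the transition width, and the function names are arbitrary assumptions.

```python
import random


def abrupt(t, change_at=50):
    """The mean switches from 0 to 1 in a single step."""
    return 0.0 if t < change_at else 1.0


def incremental(t, start=40, width=20):
    """The mean moves from 0 to 1 through intermediate values."""
    if t < start:
        return 0.0
    if t >= start + width:
        return 1.0
    return (t - start) / width


def gradual(t, rng, start=40, width=20):
    """The stream alternates between 0 and 1; 1 becomes more likely over time."""
    probability_new = incremental(t, start, width)
    return 1.0 if rng.random() < probability_new else 0.0


def reoccurring(t, change_at=40, back_at=60):
    """The new concept appears for a while, then the first one returns."""
    return 1.0 if change_at <= t < back_at else 0.0


rng = random.Random(0)  # fixed seed so the gradual stream is reproducible
streams = {
    "abrupt": [abrupt(t) for t in range(100)],
    "incremental": [incremental(t) for t in range(100)],
    "gradual": [gradual(t, rng) for t in range(100)],
    "reoccurring": [reoccurring(t) for t in range(100)],
}
```

The difference between incremental and gradual drift is visible in the values themselves: the incremental stream passes through intermediate means, whereas the gradual stream only ever emits one of the two concepts.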

Figure 1: Dynamics of concept drift over time. Adapted from Gama et al. (2014).

When it comes to text, different aspects of drifts may emerge, such as a word gaining or losing meanings over time, known as semantic shift. According to Kutuzov et al. (2018), semantic shift constitutes “the evolution of word meaning over time”. Fig. 2 depicts examples of semantic shifts that occurred across decades and centuries (Hamilton et al., 2016b). Fig. 2 was generated using Word2Vec representations (Mikolov et al., 2013b) and t-SNE (van der Maaten and Hinton, 2008) for dimensionality reduction, according to Hamilton et al. (2016b). In the 1850s, awful had a positive connotation, as depicted in Fig. 2 (c). The surrounding words, e.g., majestic and solemn, corroborate the previous statement. However, in the 1900s, the word awful shifted to a negative connotation due to its proximity to the words terrible and horrible.

Several works have been proposed to measure the evolution of a word’s meaning over time (Di Carlo et al., 2019; Belotti et al., 2020; Ryzhova et al., 2021). Some papers provide semantic shift detection methods that measure the cosine distance between the word embeddings in one period and the embeddings of the same words in a previous period (Amba Hombaiah et al., 2021). If the distance exceeds a certain threshold, a semantic shift is deemed to have occurred. Other approaches may use embedding alignment across time slices, such as orthogonal Procrustes (Hamilton et al., 2016a) and compass alignment (Belotti et al., 2020). However, semantic shift detection is mostly addressed with traditional ML methods, i.e., outside of the streaming context. This means that, for most of the approaches, there are no constraints on processing and storage.
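The threshold-based comparison described above can be sketched as follows. The two-dimensional vectors and the threshold of 0.5 are invented for illustration, and the sketch assumes that the embeddings of both periods already live in the same (aligned) vector space.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def detect_semantic_shift(previous, current, threshold=0.5):
    """Return the words whose embedding moved more than `threshold`."""
    shifted = {}
    for word in previous.keys() & current.keys():
        distance = cosine_distance(previous[word], current[word])
        if distance > threshold:
            shifted[word] = distance
    return shifted


# Hypothetical embeddings for two periods (toy values, not real data).
previous_period = {"awful": [1.0, 0.0], "table": [1.0, 1.0]}
current_period = {"awful": [0.0, 1.0], "table": [1.0, 1.0]}
shifted_words = detect_semantic_shift(previous_period, current_period)
```

In this toy example only "awful" is reported, because its vector rotated by 90 degrees while the vector for "table" did not move.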

Approaches capable of handling text streaming become relevant in a world where enormous quantities of data are generated each second. Therefore, this review focuses exclusively on approaches applied to text stream scenarios. In addition, despite works that depict semantic shifts over long periods, works such as Stewart et al. (2017) demonstrate that semantic shifts may occur not only in decades or centuries but also in a shorter period, e.g., weeks.

Figure 2: Semantic shift across several decades or centuries. Adapted from Hamilton et al. (2016b).

Concept drift detectors are methods used for detecting changes in data distribution, and they can be beneficial in performing both concept drift and semantic shift detection. These types of detectors were initially developed in statistics. However, there is no guarantee that such methods would work specifically in streaming scenarios because some may not work in a one-pass fashion (Bifet et al., 2018). In Gama et al. (2014), the authors categorized concept drift detection methods into four classes: (i) sequential analysis; (ii) control charts; (iii) monitoring two distributions; and (iv) context-based methods, which are also called heuristic methods. Sequential analysis corresponds to a scenario in which two subsets of data are generated sequentially by processes bound to different unknown distributions, e.g., $P_{0}$ and $P_{1}$. According to Gama et al. (2014), “when the underlying distribution changes from $P_{0}$ to $P_{1}$ at point $w$, the probability of observing certain subsequences under $P_{1}$ is expected to be significantly higher than that under $P_{0}$”. This means that a statistical test, for example, can be used to detect such a change. Two primary representatives of this category are the cumulative sum (CUSUM) test (Page, 1954) and the Page-Hinkley test (Page, 1954), which is a variant of the CUSUM test (Gama et al., 2014; Bifet et al., 2018).
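A compact sketch of the Page-Hinkley test for detecting an increase in the mean of a stream is given below, in its common textbook form. The parameter values (`delta`, `threshold`, `min_instances`) are illustrative assumptions and would need tuning for a real stream, e.g., a stream of prediction errors.

```python
class PageHinkley:
    """Page-Hinkley test for an increase in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0, min_instances=10):
        self.delta = delta              # magnitude of change that is tolerated
        self.threshold = threshold      # alarm threshold (often called lambda)
        self.min_instances = min_instances
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0           # m_t: cumulative deviation from the mean
        self.minimum = 0.0              # M_t: smallest m_t observed so far

    def update(self, x):
        """Consume one value; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n        # running mean, one pass
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        if self.n < self.min_instances:
            return False
        return self.cumulative - self.minimum > self.threshold


# Toy stream: the mean jumps from 0 to 1 at index 100.
detector = PageHinkley()
toy_stream = [0.0] * 100 + [1.0] * 100
first_alarm = next(
    (i for i, x in enumerate(toy_stream) if detector.update(x)), None
)
```

The detector stores only four numbers regardless of the stream length, which is what makes this family of tests attractive under one-pass constraints.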

The second category proposed by Gama et al. (2014) is control charts, also known as statistical process control (SPC). Control charts correspond to “standard statistical techniques to monitor and control the quality of a product during continuous manufacturing” (Gama et al., 2014). In this case, the data are received over time and are input to the model, and the model’s error is used to determine the state of the system. The system states are as follows: (i) in-control, which indicates that the system is stable; (ii) drift detection, which signifies that the error increased significantly compared to the historical error; and (iii) warning, which indicates that the error increased but not enough to raise a detection. Drift and warning are generally associated with a statistical confidence of 99% and 95%, respectively. An example of this category is the exponentially weighted moving average (EWMA) (Ross et al., 2012).

The third category regards monitoring two distributions. Methods in this category, according to Gama et al. (2014), “typically use a fixed reference window that summarizes the past information and a sliding detection window over the most recent examples”. In this scenario, a drift is considered to have occurred if the distributions of the windows are statistically different. An example of a method that embeds a concept drift detector from this category is the Very Fast Decision Tree (VFDT) (Gama et al., 2006). A dedicated concept drift detector that fits in this category is Adaptive Windowing (ADWIN) (Bifet and Gavalda, 2007). ADWIN is a distribution-free concept drift detector suited for detecting drifts in streams of real values or bits (Bifet et al., 2018). It maintains a window with the most recent items, from which subwindows are compared. If the means of these subwindows differ by more than a threshold based on Hoeffding’s bound, a drift is flagged (Gama et al., 2014). ADWIN is computationally more expensive in time and memory than sequential analysis detectors; however, it is simpler to use because the user does not need to specify a cutoff parameter (Gama et al., 2014; Bifet et al., 2018). In addition, ADWIN provides more precise change points (Gama et al., 2014).
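The subwindow comparison can be sketched with a naive adaptive window that checks every split of the current window against a Hoeffding-style bound. This is a simplified illustration that assumes values in [0, 1] and rescans the whole window at each step; it is not the bucket-based ADWIN algorithm, and the class name and default parameters are assumptions.

```python
import math
from collections import deque


class NaiveAdaptiveWindow:
    """Flags drift when two subwindows have sufficiently different means."""

    def __init__(self, delta=0.002, max_size=500, min_sub=5):
        self.delta = delta                    # confidence parameter
        self.window = deque(maxlen=max_size)  # most recent values
        self.min_sub = min_sub                # smallest subwindow considered

    def _epsilon(self, n0, n1):
        """Hoeffding-style cut threshold for subwindows of sizes n0 and n1."""
        m = 1.0 / (1.0 / n0 + 1.0 / n1)       # harmonic-mean style size
        n = n0 + n1
        return math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))

    def update(self, x):
        """Add a value in [0, 1]; return True if a drift is flagged."""
        self.window.append(x)
        values = list(self.window)
        n = len(values)
        total = sum(values)
        head = 0.0
        for split in range(1, n):
            head += values[split - 1]
            n0, n1 = split, n - split
            if n0 < self.min_sub or n1 < self.min_sub:
                continue
            mean_old = head / n0
            mean_new = (total - head) / n1
            if abs(mean_old - mean_new) > self._epsilon(n0, n1):
                # Drop the older subwindow and keep only the recent items.
                for _ in range(split):
                    self.window.popleft()
                return True
        return False


# Toy stream of bits: the mean jumps from 0 to 1 at index 200.
detector = NaiveAdaptiveWindow()
toy_stream = [0.0] * 200 + [1.0] * 200
first_drift = next(
    (i for i, x in enumerate(toy_stream) if detector.update(x)), None
)
```

The per-step cost of this sketch grows linearly with the window size, which illustrates why windowed comparison is more expensive than sequential analysis unless a compressed window representation is used.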

The last category, i.e., context-based, regards specific approaches that use characteristics intrinsic to ML methods to perform drift detection or adaptation. For example, in Leite et al. (2012); Soares et al. (2019); Garcia et al. (2019b), the authors proposed a method that balances incremental learning and forgetting using fuzzy granular computation. Whenever a new instance arrives, its similarity to each existing granule, i.e., a group of instances that share similar properties, is calculated. The similarity is complete, or partial whenever attributes are missing in the new instance. The new instance is assigned to the chosen granule if the similarity exceeds a certain threshold. However, if no granule can match the newly seen instance, i.e., a drift occurs, a new granule is created to accommodate the new instance. In addition, a pre-defined parameter controls how often stale granules are checked and possibly deleted to maintain the model’s conciseness.

The common metrics used to evaluate and compare concept drift detection methods, according to Bifet et al. (2018), are as follows: (i) mean time between false alarms (MTFA), which assesses the frequency with which a method raises false alarms; (ii) false alarm rate (FAR), which is the inverse of MTFA; (iii) mean time to detection (MTD), which assesses how quickly the method detects and responds to drift once it occurs; (iv) missing detection rate (MDR), which determines how frequently the method fails to warn when drift occurs; and (v) average run length (ARL), which is the time it takes to raise the alarm once a drift occurs. ARL integrates MTD and MTFA (Bifet et al., 2018). Additional metrics, such as Mean Time Rate (MTR) (Wares et al., 2019; Bifet, 2017), may emerge in the literature; however, the primary focus is on missing drifts, hits, time/iterations until detecting an actual drift, or a combination of such factors. MTR, for instance, is analogous to ARL (Wares et al., 2019).
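To make the first four metrics concrete, the sketch below computes them from the positions of known drifts and of the alarms raised by a detector. The matching rule is an assumption made for this example: the first alarm between a drift and the next drift counts as its detection, and every other alarm counts as a false alarm. Time is measured in number of instances.

```python
import math


def evaluate_detector(true_drifts, detections, stream_length):
    """Compute MTD, MDR, FAR, and MTFA under a simple matching rule."""
    drifts = sorted(true_drifts)
    alarms = sorted(detections)

    # Alarms raised before the first drift are false alarms.
    first_drift = drifts[0] if drifts else stream_length
    false_alarms = sum(1 for a in alarms if a < first_drift)

    delays = []
    bounds = drifts + [stream_length]
    for start, end in zip(bounds, bounds[1:]):
        hits = [a for a in alarms if start <= a < end]
        if hits:
            delays.append(hits[0] - start)    # delay of the first alarm
            false_alarms += len(hits) - 1     # repeated alarms are false

    return {
        "MTD": sum(delays) / len(delays) if delays else math.inf,
        "MDR": 1 - len(delays) / len(drifts) if drifts else 0.0,
        "FAR": false_alarms / stream_length,
        "MTFA": stream_length / false_alarms if false_alarms else math.inf,
    }


# Toy example: two drifts, three alarms, 400 instances in total.
report = evaluate_detector(
    true_drifts=[100, 300], detections=[40, 110, 150], stream_length=400
)
```

In the toy example, the alarm at 110 detects the first drift with a delay of 10, the alarms at 40 and 150 are false alarms, and the second drift is missed.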

Typically, concept drift detectors are coupled to traditional or online ML systems by receiving the hits and errors of prediction. These concept drift detectors have two levels of alarms: warning and drift. In the most straightforward use, when a warning alarm is issued, either the input data are buffered or a new ML model is trained in the background, so that when the drift alarm occurs, a new model (trained using data from the buffer) replaces the outdated one. This learning strategy is called background learning (Gomes et al., 2017). Thus, the idea is to maintain an updated model based on the most recent/frequent data.
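The background learning strategy can be sketched as a small wrapper that reacts to the detector's alarm level. In this illustration, the alarm levels are passed in as strings ('stable', 'warning', 'drift') instead of coming from a real detector, the model is a trivial majority-class predictor, and discarding the background model when the stream returns to 'stable' is a design assumption of the sketch.

```python
from collections import Counter


class MajorityClassModel:
    """Trivial incremental model that predicts the most frequent label."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, label):
        self.counts[label] += 1

    def predict(self):
        return self.counts.most_common(1)[0][0] if self.counts else None


class BackgroundLearner:
    """Keeps an active model and, during warnings, a background model."""

    def __init__(self):
        self.active = MajorityClassModel()
        self.background = None

    def step(self, label, signal):
        """Process one labelled item given the detector's alarm level."""
        prediction = self.active.predict()
        if signal == "warning" and self.background is None:
            self.background = MajorityClassModel()   # start training anew
        elif signal == "drift":
            # Replace the outdated model with the one trained since the warning.
            if self.background is not None:
                self.active = self.background
            else:
                self.active = MajorityClassModel()
            self.background = None
        elif signal == "stable":
            self.background = None                   # the warning was a false alarm
        self.active.learn(label)
        if self.background is not None:
            self.background.learn(label)
        return prediction


# Toy stream: label "a" dominates, then the concept changes to "b".
events = (
    [("a", "stable")] * 5
    + [("b", "warning")] * 3
    + [("b", "drift")]
    + [("b", "stable")] * 2
)
learner = BackgroundLearner()
predictions = [learner.step(label, signal) for label, signal in events]
```

After the swap, the active model predicts "b" immediately, whereas a model that had simply kept accumulating counts would still predict "a" for the last two items.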

3 Systematic Review Protocol

This review followed the guidelines proposed in Kitchenham and Charters (2007), which comprise three steps: (i) planning the review, (ii) conducting the review, and (iii) reporting the review. Planning the review includes identifying the need for the review and formulating the research questions. In conducting the review, we select primary studies and perform data extraction and synthesis. Finally, in reporting the review, it is expected to disclose the results and findings. In this work, we used five sources of studies: IEEEXplore (https://ieeexplore.ieee.org/), Science Direct (https://www.sciencedirect.com/), ACM Digital Library (https://dl.acm.org/), Springer Link (https://link.springer.com/), and Scopus (https://www.scopus.com/). We devised a series of four questions to guide our research. The primary question, RQ1, takes precedence, while the remaining questions are derived from RQ1. Table 1 displays our research questions for reference.

Table 1: Research questions used in this work.
| ID | Research question |
|---|---|
| RQ1 | “How to handle concept drift using ML approaches having as source text streams?” |
| RQ2 | “Which type of application is addressed?” |
| RQ3 | “Which type of token/word/sentence representation is used in the study?” |
| RQ4 | “Which datasets were used to evaluate the proposed approach(es)?” |

The search query was developed considering RQ1. We also used a few synonyms to broaden the query. The reader can find additional information on the terms and synonyms in Table 2. RQ2 focuses on the applications the papers addressed when handling concept drift in textual streams. This question is crucial because it can illustrate various scenarios, the potential, and increased interest in specific problems. Besides the application, we wanted to know which ML methods were employed and how the models were updated, e.g., incrementally or regularly retrained. With RQ3, we intended to uncover the most common approaches to representing texts (or smaller parts, such as tokens, words, and sentences). Finally, RQ4 pursues insights into the existence of consolidated datasets for the field and their aspects, such as the level of labeling in the dataset, e.g., instance or token, the data mining task employed, e.g., clustering, classification, whether the dataset contains real-world data or it is synthesized, metrics used in those data mining tasks, and whether drifts are labeled in the dataset.

Table 2: Table containing keywords and respective synonyms.
| Keyword | Synonyms |
|---|---|
| concept drift | semantic shift, representation shift, semantic change |
| machine learning | - |
| text streams | textual streams, social network streams, Twitter streams, diachronic, text streaming |
| detection | - |

We developed the query presented below using Table 2. The terms semantic shift and representation shift are closely related to concept drift, especially in the textual context. Semantic shift (or semantic change), according to Bloomberg (1933), refers to “innovations which change the lexical meaning rather than the grammatical function of a form”. In turn, according to Fu et al. (2022), representation shift in NLP relates to changes in the vector representation. We included social network streams because they are currently a notable source of text streams produced directly by humans. We also used the term Twitter streams because Twitter is one of the most popular microblogging platforms and generated around 500 million tweets (posts) per day in 2022 (https://www.dsayce.com/social-media/tweets-day/). Furthermore, we included the term diachronic. When serving as an adjective for a dataset, diachronic refers to a dataset that contains data produced over time. The term machine learning was withdrawn because concept drift is mostly addressed by or in processes that use ML techniques. The query used in the search is: (“concept drift” OR “semantic shift” OR “representation shift” OR “semantic change”) AND (“text streams” OR “textual streams” OR “textual streaming” OR “social network streams” OR “twitter streams” OR “diachronic”) AND (“detection”). Each source has its own search parameters, but we prioritized full-text search in all of them.
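The structure of the query (synonyms joined with OR inside a group, groups joined with AND) can be reproduced programmatically, which helps when the same query must be adapted to several databases. The helper below is a hypothetical convenience for illustration; it emits straight double quotes and does not account for the syntax differences between search engines.

```python
def build_query(groups):
    """Join terms with OR inside each group and join the groups with AND."""
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


# Term groups copied from the query stated above.
term_groups = [
    ["concept drift", "semantic shift", "representation shift",
     "semantic change"],
    ["text streams", "textual streams", "textual streaming",
     "social network streams", "twitter streams", "diachronic"],
    ["detection"],
]
query = build_query(term_groups)
```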

3.1 Inclusion and Exclusion Criteria

The inclusion and exclusion criteria used in this paper are described below. It is crucial to note that we limited this review to papers published from 2018 onward because previous secondary studies tackle similar problems (Kutuzov et al., 2018; Tahmasebia et al., 2021; Patil et al., 2021; Montanelli and Periti, 2023). Kutuzov et al. (2018) evaluated several papers regarding diachronic word embeddings and semantic shifts. The authors approached several aspects, such as diachronic semantic relations and the sources of diachronic data for training and testing. Tahmasebia et al. (2021) developed a survey on computational approaches for lexical semantic change detection. They approached aspects such as the semantic change types and computational modeling of diachronic semantics. Patil et al. (2021) also developed a survey on concept drift detection for social media. The authors provided information on datasets and the evolution of techniques over time. Montanelli and Periti (2023) presented a survey on contextualized semantic shift detection regarding aspects such as time awareness, learning scheme, language model, training language, and corpus language. The surveys/reviews from Kutuzov et al. (2018); Tahmasebia et al. (2021); Montanelli and Periti (2023) evaluated semantic shift and diachronic aspects without specifically addressing streams and methods that respect the streaming processing constraints. Patil et al. (2021) approached a similar aspect as ours; however, we provide a deeper analysis of several characteristics, such as model update scheme, text representation methods and their update schemes when available, and datasets. Thus, a substantial difference between our systematic review and the aforementioned works is that we focus on papers that approach the problem of concept drift/semantic shift using text streams as a data source. Using streams as data sources requires specific approaches to overcome the stream processing constraints, as seen in Section 2. Therefore, we considered the inclusion and exclusion criteria listed in Table 3. It is also essential to note that this review protocol was last executed on September 2nd, 2023.

Table 3: Inclusion and Exclusion criteria used in this study.
| Ref | Inclusion criteria | Ref | Exclusion criteria |
|---|---|---|---|
| IC1 | The study is published in journals or conference proceedings | EC1 | The study is not primary |
| IC2 | The study is published from 2018 (inclusive) | EC2 | The study is not written in English |
| IC3 | The study presents a method for handling concept drift | EC3 | The study is incomplete |
| IC4 | The study uses text streams as data source | EC4 | The study is not an article |
| | | EC5 | The study is duplicated |
| | | EC6 | The study does not meet the inclusion criteria |

After gathering the returned papers, each researcher screened their abstracts to flag the inclusion or exclusion of each study. Concerning divergences, the researchers agreed to read the divergent papers carefully to have confidence in their decision. We used Cohen’s Kappa coefficient (McHugh, 2012) to measure the agreement level between the researchers.
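For reference, Cohen's Kappa compares the observed agreement between two raters with the agreement expected by chance. The sketch below computes it for two lists of screening decisions; the decisions are invented for illustration and are not the data of this review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set")
    n = len(rater_a)
    # Observed agreement: share of items with identical decisions.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always used the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical include/exclude decisions for ten abstracts.
researcher_1 = ["inc", "inc", "exc", "exc", "exc",
                "exc", "inc", "exc", "exc", "exc"]
researcher_2 = ["inc", "exc", "exc", "exc", "exc",
                "exc", "inc", "exc", "exc", "exc"]
kappa = cohens_kappa(researcher_1, researcher_2)
```

In this toy case, the raters agree on 9 of 10 abstracts (observed agreement 0.90), the chance agreement is 0.62, and the resulting Kappa is approximately 0.74.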

4 Results and Discussion

Fig. 3 overviews the paper selection process. We collected 662 papers, considering the research query. The final calculated Cohen’s Kappa coefficient reached 87.25%, which indicates a high agreement level between the researchers. In addition, the divergences were discussed after a thorough reading of the divergent papers, and a decision was reached on their inclusion or exclusion. After removing duplicates (n=132), non-article studies (n=4), non-primary studies (n=38), and unrelated studies (n=448), we retained 40 articles for a full reading and analysis.

Figure 3: Process of papers selection. Each rounded-corner rectangle on the right side corresponds to an exclusion criterion. The numbers of remaining studies after each elimination are presented on the left side.

Considering the process depicted in Fig. 3, the reader’s attention may be drawn to the high number of studies deemed unrelated after screening the abstracts. This occurred due to the query term diachronic, which relates to something that evolves, especially concerning language. Most approaches that handle language evolution cannot work in streaming environments (about 60% of the papers). Therefore, we excluded those studies from our paper selection.

Based on the information extracted from the selected papers using the research questions, we categorized the approaches for handling text drifts presented according to the following characteristics: (DC) text drift categories; (DD) drift detection types; (MU) model update; (TR) text representation; and (TRUS) text representation update scheme. Our proposed taxonomy is depicted in Fig. 4. In addition, Table 4 shows the selected papers according to our proposed taxonomy. Subsection 4.1 describes the main statistics of the selected papers.

Figure 4: A taxonomy regarding concept drift in text stream scenarios.

The selected papers are studied in detail considering the taxonomy presented in Fig. 4. Section 4.2 describes and categorizes the types of concept drift handled in the selected papers, i.e., Drift categories. Section 4.3 analyzes how the text-related concept drift detection is performed, i.e., in a model-adaptive way or explicitly, regarding the Drift detection in our proposed taxonomy. Section 4.4 describes how the ML models used in the papers are updated when handling a text stream, i.e., Model update in the taxonomy. We categorize the approaches according to the related Stream mining tasks, in addition to the applications and related metrics. Section 4.5 expands the information on the stream mining tasks presented in the papers. In Text representation, we uncover the text representation methods used in the papers, considering embeddings, frequency-based methods, and words directly. Section 4.6 describes the text representation methods used in the papers. For Text representation update mechanism, we analyzed whether and how the text representations are updated over time. Section 4.7 explores the update scheme of the text representation methods. All selected methods are studied under the taxonomy’s second level, i.e., text drift categories, text drift detection, model update, stream mining task, text representation, and text representation update mechanism. In addition, the methods can fit more than one characteristic below the second level.

Table 4: Selected papers ordered by year.
Method | (DC) | (DD) | (MU) | (SMT) | (TR) | (TRUS)
AWILDA (Murena et al., 2018) | r >>> td | e >>> st | i >>> o | tm | w | n >>> s
OBAL (Pohl et al., 2018) | r | a | i >>> b | class >>> cm | fb | n >>> s
CRQA (de Mello et al., 2018) | r | e >>> st | - | class >>> sa, nd >>> cdd | - | n >>> s
AIS-Clus (Abid et al., 2018) | r, fd >>> ciw | a | i >>> b, i >>> o | clust, class, gd >>> ed, nd >>> ce | w | n >>> s
- (Li et al., 2018) | r >>> td | e >>> dc | i >>> b | class >>> stclass | fb | n >>> s
- (Melidis et al., 2018) | fd >>> ciw | a, e >>> st¹ | i >>> b | class >>> s, class >>> sa | fb, w | n >>> s
MStream (Yin et al., 2018) | r >>> td | a | i >>> b | clust >>> stclust | fb | n >>> s
OurE.Drift (Hu et al., 2018) | r >>> td | e >>> dc | eu | class >>> stclass | fb | n >>> s
- (Hammer and Yazidi, 2018) | r >>> td | e >>> st | i >>> o | class >>> tc | w | n >>> s
- (He et al., 2018) | r | a | i >>> b | class | e | n >>> s
AIS-Clus (Abid et al., 2019) | r, fd >>> ciw | a | i >>> b, i >>> o | clust, class, gd >>> ed, nd >>> ce | w | n >>> s
LITMUS-ASSED (Suprem and Pu, 2019a) | r | a | i >>> b | gd >>> ed >>> ped | e | n >>> s
LITMUS (Suprem et al., 2019b) | r | a | i >>> b | gd >>> ed >>> ped | e | n >>> s
DCFS (Chamby-Diaz et al., 2019) | fd >>> cid | e >>> st | r >>> ad | class >>> s, nd >>> ce | fb | n >>> s
LITMUS (Suprem and Pu, 2019b) | r, v | e >>> dc | eu | gd >>> ed >>> ped | e | n >>> s
ESACOD (Wang et al., 2019) | r | e >>> st | r >>> ad | class, nd >>> ce | e | n >>> s
- (D’Andrea et al., 2019) | r | a | r >>> t | class >>> sa >>> sd | fb, e | n >>> r
- (Mohawesh et al., 2021) | r | e >>> st | i >>> o | class >>> frd | fb | n >>> r
- (Heusinger et al., 2020a) | r | a | i >>> o | class >>> ht | fb, e | n >>> s
OFSER (de Moraes and Gradvohl, 2021) | r | a | i >>> o | class >>> s | fb | n >>> s
- (Bechini et al., 2021) | r | a | i >>> b | class >>> sa >>> sd | fb, e | n >>> r
- (Sun et al., 2021) | r | e >>> dc | eu | class >>> stclass | fb | n >>> s
- (Amba Hombaiah et al., 2021) | r, s, fd >>> v | a | kce | class >>> ht | e | i >>> b
EStream (Rakib et al., 2021) | r >>> td | a | i >>> o | clust >>> stclust | fb, e | n >>> s
EWNStream+ (Yang et al., 2021) | r >>> td | a | i >>> b | clust >>> stclust | fb | n >>> s
GCTM (Van Linh et al., 2022) | r >>> td | a | i >>> b | tm | e | n >>> s
BSP (Nguyen et al., 2022) | r >>> td | a | i >>> b | tm | e | n >>> s
- (Heusinger et al., 2022) | r | e >>> st | i >>> b | class >>> ht | e, fb | n >>> s
DDAW (Rabiu et al., 2022) | r, v | e >>> dc | eu | class >>> sa | - | -
GOWSeqStream (Vo, 2022) | r >>> td | a | i >>> b | clust >>> stclust | e | n >>> s
GDWE (Lu et al., 2022) | r, s | a | i >>> b | class | e | i >>> b
- (Bravo-Marquez et al., 2022) | r | a | i >>> o | class >>> sa | fb | i >>> i
- (Bondielli et al., 2022) | r | a | i >>> b, r >>> t | class >>> stclass | fb | n >>> r
SMAFED (Kolajo et al., 2022) | r | a | i >>> b | class, clust, gd >>> ed | e | n >>> s
WIDID (Periti et al., 2022) | s | e >>> dc | i >>> b | nd >>> ssd | e | n >>> s
- (Li et al., 2022) | r >>> td | e >>> dc | r >>> td | class >>> stclass | e | n >>> s
FFCA index (Fenza et al., 2023) | r | e >>> dc | - | class >>> fnd | fb | n >>> s
TSDA-BERT (Susi and Shanthi, 2023) | r | e >>> dc | r >>> ad | class >>> sa | e | n >>> r
DDAW (Rabiu et al., 2023) | r, v | e >>> dc | eu | class >>> sa | f | -
textClust (Assenmacher and Trautmann, 2022) | r | a | i >>> b, i >>> o | clust | fb | i >>> b

Legends: >>> : a level down in the taxonomy.
(DC) Drift category → r: real drift; td: topic drift; fd: feature drift; cid: changes in important dimensions; ciw: changes in important words; v: vocabulary shift; vd: virtual drift; s: semantic shift.
(DD) Drift detection method → a: adaptive; e: explicit; st: statistical tests; dc: distance calculation.
(MU) Model update → eu: ensemble update; i: incremental; o: one input at a time; b: batches; kce: keep-compare-evolve; r: retraining; ad: after drift detected; t: time-to-time.
(SMT) Stream mining task → class: classification; cm: crisis management; fnd: fake news detection; frd: fake review detection; ht: hashtag prediction; sa: sentiment analysis; sd: stance detection; s: spam detection; stclass: short-text classification; clust: clustering; stclust: short-text clustering; gd: general detection; ed: event detection; ped: physical event detection; nd: novelty detection; cdd: concept drift detection; ce: concept evolution; ssd: semantic shift detection; tm: topic modeling.
(TR) Text representation → e: embedding; fb: frequency-based; w: words.
(TRUS) Text representation update scheme → i: incremental; b: batch; inst: instance; n: none; r: retrain; s: static.

¹: one version uses ADWIN to explicitly detect feature drift.

4.1 Main Statistics

We unraveled statistics on the selected papers regarding (i) the sources, (ii) years of publication, and (iii) venues of publication. Table 5 shows the number of selected papers by source. Scopus provided 37.5% of the selected papers for this work. We noted a steady interest across the years in streaming text applications susceptible to concept drift in its various possibilities. Considering the limited time range in our search, i.e., between 2018 and September 2023, we collected the respective number of papers: (2018) 10 papers; (2019) seven papers; (2020) two papers; (2021) six papers; (2022) 11 papers; and (2023) four papers. Considering the characteristics of the papers across the years, we cannot infer a trend. We hypothesize that this behavior occurred because the research area is still incipient.

Table 5: Number of selected papers according to the source.
Source Selected papers
ACM Digital Library 4
IEEE Xplore 6
Science Direct 6
Scopus 15
Springer Link 9
Total 40

Table 6 shows the venues that contributed the most to our search. The journal Expert Systems with Applications published three papers, followed by ACM SIGKDD, Evolving Systems, IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS), International Joint Conference on Artificial Intelligence (IJCAI), International Joint Conference on Neural Networks (IJCNN), and Neurocomputing, each with two papers.

Table 6: Venues where the selected papers were published.
Venues Appearances
Expert Systems with Applications 3
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2
Evolving Systems 2
IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS) 2
International Joint Conference on Artificial Intelligence (IJCAI) 2
International Joint Conference on Neural Networks (IJCNN) 2
Neurocomputing 2
3rd International Workshop on Computational Approaches to Historical Language Change 1
ACM International Conference on Distributed and Event-Based Systems 1
ACM Symposium on Document Engineering 1
Applied Intelligence 1
Asian Conference on Intelligent Information and Database Systems 1
Brazilian Conference on Intelligent Systems (BRACIS) 1
Chaos 1
Cognitive Computation 1
Computational Collective Intelligence 1
Computers, Materials and Continua 1
Computer Systems Science and Engineering 1
IEEE Access 1
IEEE Transactions on Big Data 1
IEEE Transactions on Cybernetics 1
International Conference of Reliable Information and Communication Technology 1
International Conference on Collaboration and Internet Computing (CIC) 1
International Conference on Information and Knowledge Management 1
International Journal of Computer Science (IAENG) 1
International Journal of Information Technology and Decision Making 1
Journal of Big Data 1
Neural Computing and Applications 1
Pattern Recognition Letters 1
Technological Forecasting and Social Change 1
Vietnam Journal of Computer Science 1
World Congress on Services 1

4.2 Drift Categories

Considering the categories of concept drift in text stream settings, we arrange them into (i) Feature drift; (ii) Real drift; (iii) Semantic shift; and (iv) Virtual drift. Fig. 5 depicts the arrangement.

Figure 5: Drift categories.

4.2.1 Feature drift

Feature drift considers the changes in the importance of features, signifying that, over time, a subset of features may become necessary for an ML model, while other subsets may become obsolete (Barddal et al., 2017). It constitutes a challenge for an ML model because incrementally defining the best feature set over time can be complex. In addition, the dependent ML model can have its performance degraded over time if the selected feature set is inadequate.

Considering the subcategories of feature drift depicted in Fig. 5, we describe the Changes in important words, Changes in important dimensions, and Vocabulary shift. Different types of features may be considered in ML approaches regarding text-related tasks. For instance, one approach is to consider the texts split into tokens or use direct techniques such as bag-of-words or TF-IDF. These techniques resort to counting tokens and measuring their overall importance, respectively.

Other approaches rely on numerical transformations of texts, such as Word2Vec (Mikolov et al., 2013a). Because token-based features such as bag-of-words and TF-IDF can be directly related to specific words (or tokens), changes in those features are regarded in this study as Changes in important words. In contrast, changes in numerical representations without a direct relation to words or tokens are regarded as Changes in important dimensions.

Finally, we consider vocabulary shift as one type of feature drift. Vocabulary shift (Amba Hombaiah et al., 2021) treats changes in the words of a vocabulary maintained by the approach as a type of text drift. Unlike the aforementioned subtypes of feature drift, vocabulary shift considers changes, i.e., the addition or removal of items, in the internal structure that stores the tokens. Amba Hombaiah et al. (2021) compared vocabularies built from yearly time slices, measuring the changes between vocabularies from different years.

Melidis et al. (2018), Chamby-Diaz et al. (2019), and Amba Hombaiah et al. (2021) addressed one of these aforementioned categories of feature drift directly. Melidis et al. (2018) proposed an ensemble-based method for predicting feature values in the next time point. Considering this case, the work is categorized as Changes in important words because their method uses a sketching mechanism to retain essential words in a fixed-size feature space, according to their occurrence count. In one version the authors presented, they utilized ADWIN (Bifet and Gavalda, 2007) to evaluate a significant decrease in word usage to decide when to remove it from the sketch.

Chamby-Diaz et al. (2019) proposed a feature selection method based on correlation suitable for data streams, categorized as Changes in important dimensions. Although the method was not developed specifically for use on text streams, the authors demonstrated its use on a text-related dataset, i.e., a spam dataset. Their method retains a covariance matrix coupled to a concept drift detector. Whenever it receives a warning signal, the covariance matrix is incrementally updated. When the concept drift detector triggers a drift signal, a one-pass algorithm computes feature-feature and feature-class correlations. Subsequently, a new Naive Bayes model is trained based on the new feature subset, which is chosen according to the merit of each feature subset from the correlation-based feature selection method (CFS) (Hall, 1999).

Unlike prior works, Amba Hombaiah et al. (2021) used vocabulary shift to estimate the changes in the usage of tokens across several years, i.e., between 2013 and 2019. The authors proposed sampling methods for updating BERT models (Devlin et al., 2018) to maintain the models’ usefulness in text-streaming scenarios. Initially, the authors emphasized that “vocabulary is the foundation of language models”. However, vocabularies can contain different types of representation, such as complete words and sub-word segments, e.g., wordpieces (Devlin et al., 2018). The authors analyzed the vocabulary shift considering the 40,000 most frequent tokens, accounting for hashtags and wordpieces. For hashtags, the vocabulary shift between 2013 and 2019 was 78.31%, while for wordpieces in the same period, the shift was 38.47%. The authors argued that these results justify the development of their incremental method. Furthermore, the authors stated that although larger vocabularies may lessen the vocabulary shift, they are more computationally costly and, therefore, potentially infeasible for real-world scenarios.
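To make the notion of vocabulary shift concrete, the sketch below compares the top-k vocabularies of two time slices. It assumes, for illustration only, that the shift is the share of the newer vocabulary that is absent from the older one; the exact formula used by Amba Hombaiah et al. (2021) may differ, and the function names and toy texts are hypothetical.

```python
from collections import Counter

def top_k_vocabulary(texts, k):
    """Return the set of the k most frequent whitespace-separated tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return {token for token, _ in counts.most_common(k)}

def vocabulary_shift(old_vocab, new_vocab):
    """Share of the new vocabulary that is absent from the old one (assumed definition)."""
    if not new_vocab:
        return 0.0
    return len(new_vocab - old_vocab) / len(new_vocab)

# Hypothetical toy slices standing in for two yearly tweet collections.
slice_old = ["#worldcup final tonight", "watching the #worldcup", "#usa team plays tonight"]
slice_new = ["#election debate tonight", "#usa #election results", "watching the debate"]
shift = vocabulary_shift(top_k_vocabulary(slice_old, 5), top_k_vocabulary(slice_new, 5))
```

Under this definition, a larger k tends to lower the measured shift, which is in line with the trade-off between vocabulary size and cost noted by the authors.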

4.2.2 Real drift

We consider real drift according to the definition in Gama et al. (2014), i.e., changes in $p(y|X)$ that can occur with or without changes in $p(X)$. Here, $X$ denotes the input features, $y$ corresponds to the class, and $p$ is the probability. Real drift in a classification task refers to a change in the class boundaries, which may be accompanied by changes in the data distribution of $X$. Among the selected papers, a few handle particular types of real concept drift, e.g., sentiment drift. Because they concern changes in $p(y|X)$, these papers were categorized as real drift.

This study considers topic drift as an extension of real drift. In the literature, topic drifts are encountered in applications regarding topic modeling, topic labeling, and short-text classification. Thus, a topic can drift through a change in the texts labeled as a particular topic, i.e., $p(y|X)$, through a change in the topic distribution in the stream, i.e., $p(X)$, or both simultaneously. In addition, it is common to use methods based on Latent Dirichlet Allocation (LDA) in short-text-related applications.

A significant number of papers (Pohl et al., 2018; Abid et al., 2018; He et al., 2018; Suprem and Pu, 2019a; Suprem et al., 2019b; Wang et al., 2019; D’Andrea et al., 2019; Mohawesh et al., 2021; Heusinger et al., 2020a; de Moraes and Gradvohl, 2021; Bechini et al., 2021; Sun et al., 2021; Amba Hombaiah et al., 2021; de Mello et al., 2018; Heusinger et al., 2022; Rabiu et al., 2022; Bravo-Marquez et al., 2022; Bondielli et al., 2022; Kolajo et al., 2022; Fenza et al., 2023; Susi and Shanthi, 2023; Assenmacher and Trautmann, 2022) regard exclusively real drifts. Most commonly, methods in this category either: (i) use concept drift detectors to detect drift and trigger the model update or (ii) update the model regularly.

Suprem and Pu (2019a), Suprem et al. (2019b), and Suprem and Pu (2019b) presented, from multiple perspectives, a system for detecting physical events with an emphasis on landslides, i.e., the sudden downward movement of masses of rock and earth on steep slopes. They combined data from social media (which is voluminous but not so trustworthy) and governmental reports (scarce but trustworthy) to train a model for landslide detection. The authors argued that the term landslide can suffer concept drift because of its use in different contexts, such as politics. In their case, the model is updated regularly, using the governmental reports as ground truth. In contrast, Mohawesh et al. (2021) and Heusinger et al. (2022) utilized concept drift detectors to detect drifts explicitly. Mohawesh et al. (2021) used ADWIN (Bifet and Gavalda, 2007), DDM (Gama et al., 2004), EDDM (Baena-García et al., 2006), and Page-Hinkley (Page, 1954; Sebastião and Fernandes, 2017) while evaluating fake review detection. The authors claimed that fake reviews could lead customers to make poor decisions. Fake review detection is also an adversarial problem: once models become better at detecting fake reviews, the unlawful reviewers change their patterns over time to evade the models. This adversarial aspect results in concept drift, which can cause the models’ performance to degrade over time.

Susi and Shanthi (2023) proposed a complete system for tweet collection, automated training data generation, and BERT (re)training for sentiment prediction and adaptation to sentiment drift, namely Twitter Sentiment Drift Analysis - BERT (TSDA-BERT). The authors used Apache Kafka to simulate the Twitter stream. A three-layer dense network on top of the BERT model performs the classification. Since the sentiment drift is verified using the predictions, we categorized this paper in the real drift category.

Assenmacher and Trautmann (2022) proposed a two-phase online method for textual clustering, namely textClust. This method leverages TF-IDF to decide the proximity of incoming text to microclusters. In addition, the authors take advantage of unigram and bigram representations and use cosine similarity to select the most suitable cluster for the incoming text when possible. Over time, in the offline phase, the method keeps the model concise by merging similar clusters and removing outdated ones. To identify outdated clusters, the authors applied a fading factor to the cluster weights. The authors mentioned that the fading factor helps the model handle concept drift.
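The role of a fading factor can be sketched as follows. This is a minimal illustration assuming exponential decay of cluster weights, a common choice in stream clustering; it is not claimed to be the exact update rule of textClust, and the names `fade`, `prune_stale_clusters`, and `min_weight` are hypothetical.

```python
def fade(weight, elapsed, fading_factor):
    """Exponentially decay a cluster weight over `elapsed` time units."""
    return weight * 2.0 ** (-fading_factor * elapsed)

def prune_stale_clusters(clusters, now, fading_factor, min_weight):
    """Decay every cluster weight to time `now` and drop clusters below `min_weight`.

    `clusters` maps a cluster id to a (weight, last_update_time) pair.
    """
    kept = {}
    for cluster_id, (weight, last_update) in clusters.items():
        faded = fade(weight, now - last_update, fading_factor)
        if faded >= min_weight:
            kept[cluster_id] = (faded, now)
    return kept

# Hypothetical micro-clusters: (weight, time of last update).
clusters = {"floods": (4.0, 0), "election": (1.0, 8), "old_meme": (1.0, 0)}
alive = prune_stale_clusters(clusters, now=10, fading_factor=0.2, min_weight=0.5)
```

Here the heavy cluster and the recently updated cluster survive, while the light cluster that received no new texts is removed, which is how fading lets a model forget outdated concepts without an explicit drift signal.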

Another significant number of papers (Murena et al., 2018; Li et al., 2018; Yin et al., 2018; Hu et al., 2018; Hammer and Yazidi, 2018; Rakib et al., 2021; Yang et al., 2021; Van Linh et al., 2022; Nguyen et al., 2022; Vo, 2022; Li et al., 2022) approach the Topic drift problem. Topic drift primarily refers to short-text-related tasks, which commonly require additional steps to provide satisfying results, e.g., a data enrichment step or the use of statistical information about the application context. Li et al. (2018) proposed a method for short-text classification using feature space extension. Probase (Wu et al., 2012), an open semantic network, is used for the extension. According to Li et al. (2018), Probase was selected because of the availability of several super-concepts. Thus, to enrich a short text, they could obtain more information from Probase, e.g., super-concepts(Apple) = [company, tech giant, large company, manufacturer], and add it to the short text. Rakib et al. (2021) developed EStream, a method for efficient short-text clustering. Their approach used lexical information, e.g., unigrams, bigrams, and biterms, and semantic information from GloVe (Pennington et al., 2014) to define the clusters. Changes in proximity between texts and clusters over time are used to determine whether a concept drift occurred.
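The enrichment step described for short texts can be sketched as below. The dictionary `SUPER_CONCEPTS` is a hypothetical stand-in for a concept knowledge base such as Probase (no real Probase API is used), and the `enrich` function and its `max_concepts` parameter are assumptions made for illustration.

```python
# Hypothetical stand-in for a concept knowledge base such as Probase.
SUPER_CONCEPTS = {
    "apple": ["company", "tech giant", "large company", "manufacturer"],
}

def enrich(short_text, knowledge_base, max_concepts=2):
    """Append up to `max_concepts` super-concepts for every known token."""
    tokens = short_text.lower().split()
    extra = []
    for token in tokens:
        extra.extend(knowledge_base.get(token, [])[:max_concepts])
    return tokens + extra

enriched = enrich("Apple unveils new phone", SUPER_CONCEPTS)
```

The enriched token list carries context that the four original words alone do not provide, which is the motivation for feature space extension in short-text settings.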

Both Murena et al. (2018) and Hu et al. (2018) used LDA (Blei et al., 2003) to address their challenges (topic modeling and short-text classification, respectively). Like Li et al. (2018), Hu et al. (2018) enriched data using external sources. They employed LDA to mine hidden information from these external sources and add the top representative words to the short texts. Drifts are flagged by calculating the semantic distance between each short text in the current and subsequent chunks. Murena et al. (2018), in turn, used LDA for topic modeling in document streams. In this case, the authors integrated ADWIN into LDA to detect topic drifts.

Li et al. (2022) presented a method for short-text classification in text stream scenarios. The authors enriched short texts by using representations from BERT and Word2Vec. Both were trained using massive corpora, which, according to the authors, should be highly consistent with the topics related to the datasets the authors evaluated. In addition, the authors proposed a distributed LSTM-based ensemble method that includes a concept drift factor. The concept drift factor is used to determine the importance of an LSTM layer in the final result.

4.2.3 Semantic shift

Semantic shift regards changes in the meaning of tokens over time. It is most commonly handled in papers that study linguistic changes over several years, decades, or even centuries. Generally, the datasets that support these tasks are termed diachronic. However, semantic changes can also occur within a short time, such as weeks (Stewart et al., 2017). The semantic shift was briefly introduced and discussed in Section 2.

Amba Hombaiah et al. (2021), Lu et al. (2022), and Periti et al. (2022) approach the problem of semantic shift. Amba Hombaiah et al. (2021) analyzed whether a semantic shift had occurred. In the specific task of hashtag prediction, the authors evaluated the shift in the top contextual words of the hashtags #china, #uk, and #usa, considering the years 2014 and 2017. The authors observed that, in 2014, the contextual words of #usa were related to the World Cup, while in 2017, they were related to US politics. In contrast, Periti et al. (2022) aimed at detecting semantic shifts incrementally. In this case, the authors applied clustering methods, such as affinity propagation, to generate clusters in time slices. The authors determined a semantic shift by measuring the distance between embedding sets using metrics such as Jensen-Shannon divergence (Nielsen, 2019) and the distance between prototype embeddings.

Lu et al. (2022) presented a word-level graph-based method to generate dynamic word embeddings. The core idea is to maintain long-term and short-term word-level knowledge graphs. These graphs preserve the co-occurrence between words, and the relations between words help identify semantic shifts. For semantic shift detection, the authors evaluated the closest words to apple (in the New York Times dataset) and network (in the Arxiv dataset). In addition, the authors evaluated their method on trend detection and text stream classification. In both Lu et al. (2022) and Periti et al. (2022), the words of interest for detecting semantic shifts must be known in advance.

4.2.4 Virtual drift

According to Gama et al. (2014), virtual drift regards changes in data distribution without changing the boundaries between classes. Using a similar notation as in Section 4.2.2, virtual drift happens when $p(X)$ changes but $p(y|X)$ does not. In addition, Gama et al. (2014) state that different definitions exist for virtual drift in the literature. Suprem and Pu (2019b), Rabiu et al. (2022), and Rabiu et al. (2023) illustrate this category. Virtual drifts must be tracked, particularly in cases where no class or cluster labels $y$ are available.
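The two definitions can be contrasted compactly as follows. The time subscripts $t$ and $t+1$ are a notation assumed here for illustration and do not appear in the text above.

```latex
% Real drift: the posterior changes between time t and t+1
% (the input distribution may or may not change).
\exists X : \; p_t(y \mid X) \neq p_{t+1}(y \mid X)

% Virtual drift: the input distribution changes, the posterior does not.
p_t(X) \neq p_{t+1}(X)
\quad \text{and} \quad
p_t(y \mid X) = p_{t+1}(y \mid X)
```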

In Suprem and Pu (2019b), the authors proposed a method for landslide detection. The method relies on social media data and governmental reports. Section 4.2.2 already cited this paper together with Suprem and Pu (2019a) and Suprem et al. (2019b). However, Suprem and Pu (2019b) explicitly emphasized their concern about handling the virtual drift problem. They highlighted that model fine-tuning is sufficient in this case, compared to model re-creation. Nonetheless, no reason for their concern about virtual drifts is provided. Rabiu et al. (2022) presented a two-component method for concept drift detection applied to sentiment analysis and opinion mining. Similar to Suprem and Pu (2019b), Rabiu et al. (2022, 2023) handle virtual drift. Although it is not explicit in the papers, the drift detection method uses two windows to evaluate possible concept drift based on a distance metric to be selected. Unlike most works, which couple a concept drift detector with a classifier and use the classification errors as a proxy for the detector, Rabiu et al. (2022, 2023) used the input data, thereby using the concept drift detector to check $p(X)$.
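A two-window check on the input distribution can be sketched as follows. This is an illustration rather than the method of Rabiu et al. (2022, 2023): the choice of Jensen-Shannon divergence, the add-one smoothing, the threshold value, and all function names are assumptions.

```python
import math
from collections import Counter

def token_distribution(texts, vocabulary):
    """Relative frequency of each vocabulary token with add-one smoothing."""
    counts = Counter(token for text in texts for token in text.lower().split())
    total = sum(counts[token] + 1 for token in vocabulary)
    return [(counts[token] + 1) / total for token in vocabulary]

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def input_drift_detected(reference_window, current_window, threshold):
    """Flag a change in p(X) when the windows' token distributions diverge."""
    vocabulary = sorted({token
                         for text in reference_window + current_window
                         for token in text.lower().split()})
    p = token_distribution(reference_window, vocabulary)
    q = token_distribution(current_window, vocabulary)
    return jensen_shannon_divergence(p, q) > threshold

# Hypothetical windows of short texts.
reference = ["great phone battery", "great camera phone"]
shifted = ["terrible election debate", "election results tonight"]
```

Because only the input texts are compared, no labels are needed, which matches the setting in which virtual drift is most relevant.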

4.3 Drift Detection Methods

We considered two categories for drift detection methods: Adaptive and Explicit. Fig. 6 depicts the categorization regarding the type of drift detection. In subsequent subsections, we describe selected papers from each drift detection scheme.

Figure 6: Drift detection categories.

4.3.1 Adaptive

Adaptive approaches rely on a model that updates itself continuously or from time to time, without explicitly detecting drift. This category is called blind adaptation in Gama et al. (2014). A substantial number of papers (Pohl et al., 2018; Abid et al., 2018; Melidis et al., 2018; Yin et al., 2018; He et al., 2018; Abid et al., 2019; Suprem and Pu, 2019a; Suprem et al., 2019b; D’Andrea et al., 2019; Heusinger et al., 2020a; de Moraes and Gradvohl, 2021; Bechini et al., 2021; Amba Hombaiah et al., 2021; Rakib et al., 2021; Yang et al., 2021; Van Linh et al., 2022; Nguyen et al., 2022; Vo, 2022; Lu et al., 2022; Bondielli et al., 2022; Kolajo et al., 2022; Assenmacher and Trautmann, 2022) consider adaptive approaches. Pohl et al. (2018) proposed a batch-based method with an application for crisis management in social media. The method is based on active learning and queries a user whenever the classifier fails to confidently determine whether the input text is relevant to the task. The authors selected two events corresponding to two subsets from a more extensive dataset, i.e., Colorado floods and Australian bushfires, and 1000 data points were labeled via crowd-sourcing. In addition, the authors mentioned that labeling data is particularly costly in streaming scenarios; however, a task such as crisis management still requires a human in the loop. According to the authors, the model can adapt itself to concept drift through the characteristics of the underlying ML technique. Although they applied their scheme using k-Nearest Neighbors (k-NN) and Support Vector Machines (SVM), any other classifier could be used. For k-NN and SVM, the authors claimed that the continuous recalculation of the boundaries results in drift adaptation.

Amba Hombaiah et al. (2021) split the social media data by year of publication. The work considers two datasets, corresponding to three different tasks: (i) 2014 Country hashtag prediction, (ii) 2017 Country hashtag prediction, and (iii) OffensEval 2019. The authors compared seven methods for each scenario: two static BERT models and five dynamic BERT models. Considering the static BERT models, one is trained with data from the previous year, and the other uses data from the current year. For example, considering the 2014 Country hashtag prediction task, one model is trained with tweets from 2013, and the other (a model checkpoint from the first model) is updated with an amount of data from 2014. The dynamic BERT models are fine-tuned using tweets sampled from the current year with different sampling methods, namely uniform random, weighted random, token embedding, sentence embedding, and token masked language modeling (MLM) loss. The sampling methods define different strategies for the model to overcome drifts/semantic shifts over time. In the uniform random method, the tweets from the current year are sampled uniformly at random. In the weighted random method, the tweets from the current year are sampled randomly, weighted by the number of wordpieces generated by the tokens in the current year’s tweets. The token embedding, sentence embedding, and token MLM loss methods differ from these two. The token embedding method assigns higher weights to tweets that contain new tokens and then samples randomly from the current year’s tweets. The sentence embedding method calculates the cosine distance between the updated and the current models. Both cosine distance and tweet length are used to determine a score and then perform the sampling. The token MLM loss method considers the last layer from the BERT model, masks out 15% of the tokens, and uses the surrounding words to predict the masked ones. A high loss value may indicate drift.
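The idea of favoring tweets that contain new tokens when sampling data for a model update can be sketched as follows. The weighting rule (one plus the number of unseen tokens), the function names, and the toy data are assumptions for illustration and do not reproduce the authors’ implementation.

```python
import random

def novelty_weights(tweets, known_vocabulary, base_weight=1.0):
    """Weight each tweet by `base_weight` plus its number of unseen tokens."""
    weights = []
    for tweet in tweets:
        new_tokens = {t for t in tweet.lower().split() if t not in known_vocabulary}
        weights.append(base_weight + len(new_tokens))
    return weights

def sample_for_update(tweets, known_vocabulary, sample_size, seed=0):
    """Draw tweets (with replacement) in proportion to their novelty weights."""
    rng = random.Random(seed)
    weights = novelty_weights(tweets, known_vocabulary)
    return rng.choices(tweets, weights=weights, k=sample_size)

# Hypothetical current-year tweets and previous-year vocabulary.
tweets = ["the game tonight", "new #greenpass rules"]
known = {"the", "game", "tonight", "new", "rules"}
batch = sample_for_update(tweets, known, sample_size=3)
```

With these weights, the second tweet is twice as likely to be drawn as the first, since it contains one token that the previous vocabulary lacks.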

Yin et al. (2018) proposed two algorithms for short-text stream clustering: MStream and MStreamF, a concise version that deletes outdated clusters. The algorithms receive document batches and are one-pass: the first document creates a new cluster, and each subsequent document is either assigned to one of the existing clusters or creates a new cluster. This assignment occurs after the batch is processed. The authors argued that concept drift is handled by assuming that the documents are generated by a Dirichlet Process Multinomial Mixture (DPMM) (Antoniak, 1974), from which they derive the probabilities of documents belonging to existing clusters.

D’Andrea et al. (2019), Bechini et al. (2021), and Bondielli et al. (2022) tackled two different problems in a similar way: stance detection about vaccination and about the Green Pass (as the EU Digital COVID Certificate is known), both in Italy. The authors of these works categorized the application as stance detection, a branch of sentiment analysis. In these cases, the tweets are classified into three classes: (i) in favor, (ii) neutral, or (iii) not in favor. D’Andrea et al. (2019) and Bechini et al. (2021) analyzed public opinion about vaccines in Italy based on tweets. D’Andrea et al. (2019) addressed concept drift by incrementally retraining the model, e.g., an SVM model. However, they emphasized that, on their dataset, incremental retraining could not outperform a static SVM regarding accuracy. In Bechini et al. (2021), concept drifts are handled similarly to D’Andrea et al. (2019), except that the tweets from the new batch are semantically weighted according to previous events. With this semantic scheme, the authors reached better values than other approaches, e.g., a static model, regular retraining, and DARK (Costa et al., 2017). Although Bechini et al. (2021) was published in 2021, it addressed regular vaccination, unrelated to Covid-19. In contrast, Bondielli et al. (2022) covered opinions about the Green Pass concerning Covid-19. The authors evaluated different schemes to handle concept drift, including retraining with sliding windows and an ensemble of classifiers. Complete retraining led to the best average accuracy. However, it also produced the largest feature space because of the data accumulation and the use of TF-IDF as the text encoding method, which generates a very high-dimensional representation.

Assenmacher and Trautmann (2022) presented an online method for textual clustering, i.e., textClust. In order to overcome concept drifts, the method leverages a fading factor. It helps the model to exclude stale clusters. In addition, there is another parameter $tr$ that dynamically determines the distance limit for a cluster to merge with another. This is also used to help determine whether a new input instance should be incorporated into a given cluster.

4.3.2 Explicit

Explicit approaches directly detect the drift, via statistical tests or distance calculation. As examples of statistical tests used in the selected papers, we mention the Page-Hinkley test (Page, 1954; Sebastião and Fernandes, 2017), and ADWIN (Bifet and Gavalda, 2007). As examples of distance calculation metrics, we cite the Jensen-Shannon divergence (Nielsen, 2019), the Kullback-Leibler divergence test (Kullback and Leibler, 1951), and the cosine distance. Approaches presented in the papers Murena et al. (2018); de Mello et al. (2018); Li et al. (2018); Hu et al. (2018); Hammer and Yazidi (2018); Chamby-Diaz et al. (2019); Suprem and Pu (2019b); Wang et al. (2019); Mohawesh et al. (2021); Sun et al. (2021); Heusinger et al. (2022); Rabiu et al. (2022); Periti et al. (2022); Li et al. (2022); Fenza et al. (2023); Susi and Shanthi (2023); Rabiu et al. (2023) explicitly handle concept drift.
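
As a reference for the distance calculations cited above, the sketch below computes the Kullback-Leibler divergence, the Jensen-Shannon divergence, and the cosine distance in plain Python. It assumes that the inputs are already discrete probability distributions (for the divergences) or dense vectors (for the cosine distance); how the selected papers build these inputs from text varies and is not reproduced here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

Unlike the Kullback-Leibler divergence, the Jensen-Shannon divergence is symmetric and remains finite when the two distributions do not share the same support, which can be convenient when vocabularies differ between batches.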

Concerning explicit detection using statistical tests, Mohawesh et al. (2021) tested four concept drift detectors: ADWIN (Bifet and Gavalda, 2007), DDM (Gama et al., 2004), EDDM (Baena-García et al., 2006), and the Page-Hinkley test (Page, 1954; Sebastião and Fernandes, 2017). We considered ADWIN a statistical test because, in the original paper, the authors indicated that their statistical test verifies whether the observed average in subwindows is above a defined threshold (Bifet and Gavalda, 2007). In addition, DDM (Gama et al., 2004) and EDDM (Baena-García et al., 2006) perform evaluations based on the statistical properties of a stream and thus are also considered statistical tests in this work. Mohawesh et al. (2021) simulated concept drift by splitting the temporally ordered dataset into five chunks and rearranging them. The concept drift detectors use, as a proxy, the accuracy calculated over the most recent input data, with a window size of 200. ADWIN and EDDM had the best accuracy (coupled with a classifier) among the scenarios tested in the study.
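
To make the error-monitoring idea concrete, the following is a minimal sketch of a Page-Hinkley detector for an increase in the mean of a monitored signal, here assumed to be a per-instance error indicator (0 for a correct prediction, 1 for an error). The default values of `delta` and `threshold` are illustrative assumptions, not the settings used by Mohawesh et al. (2021).

```python
class PageHinkley:
    """Page-Hinkley test for detecting an increase in the mean of a signal."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change (assumed value)
        self.threshold = threshold  # detection threshold, often called lambda (assumed value)
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0       # cumulative deviation from the running mean
        self.minimum = 0.0          # smallest cumulative deviation seen so far

    def update(self, value):
        """Consume one observation and return True if a drift is signaled."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cumulative += value - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


detector = PageHinkley()
stream = [0.0] * 200 + [1.0] * 20   # synthetic error signal with an abrupt change
flags = [detector.update(x) for x in stream]
```

In this synthetic stream, no drift is signaled while the error stays at zero, and the detector fires a few instances after the error jumps to one.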

Heusinger et al. (2022) proposed a method that uses random projection for dimensionality reduction using text streams as input. In their experiments, preprocessing was done offline for the whole dataset to generate TF-IDF and embedding representations. Thus, their process is not fully incremental, except for the dimensionality reduction method, which is incremental (in batches). Considering the real-world dataset, i.e., NSDQ, proposed in the same paper, the authors obtained a vector representation of 3442 dimensions using TF-IDF. Using their online dimensionality reduction method, NSDQ was projected onto 200 dimensions. The authors concluded that random projection could reduce the run time, even considering the offline preprocessing time. To detect concept drift, the authors used KSWIN (Raab et al., 2020), based on the Kolmogorov-Smirnov test (Kolmogorov, 1933; Smirnov, 1948). In this case, KSWIN monitors every dimension of the vector representation. In addition, the authors mentioned that different types of concept drift might be present because NSDQ is a real-world dataset (Heusinger et al., 2022). Their assessment of concept drift detection relies on true positives and false positives. However, it is unclear how both metrics were calculated due to the absence of labeled drifts in the dataset. The results indicated more concept drifts detected in the original space, an expected outcome because KSWIN monitors each dimension separately. Finally, the authors mentioned that models trained with original and projected feature spaces maintain the same level of accuracy. Both Suprem and Pu (2019b) and Heusinger et al. (2022) used t-SNE (van der Maaten and Hinton, 2008) plots to support the existence of concept drift in the datasets on which they applied their proposed methods.
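
The effect of monitoring every dimension separately can be sketched with a two-sample Kolmogorov-Smirnov statistic computed per dimension over a reference window and a recent window. This is a simplified illustration: KSWIN decides based on a significance level, whereas the sketch below compares the raw statistic to a hypothetical `threshold`, and the function names are our own.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def drifting_dimensions(reference, recent, threshold):
    """Return the indices of dimensions whose KS statistic exceeds the threshold.

    reference, recent: lists of vectors with the same dimensionality.
    """
    flagged = []
    for d in range(len(reference[0])):
        stat = ks_statistic([v[d] for v in reference], [v[d] for v in recent])
        if stat > threshold:
            flagged.append(d)
    return flagged
```

Because each dimension is tested independently, a representation with thousands of dimensions offers many more opportunities for a detection than a 200-dimensional projection, which is consistent with the larger number of drifts reported in the original space.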

Considering explicit detection with the aid of distance metrics, Li et al. (2018) developed a method for short-text classification in the presence of topic drifts. As explained in Section 4.2, the approach automatically enriches the short texts by using Probase. The topic drift detection is performed as follows: the short-text stream is received in chunks, and after they are clustered, the label distribution can be evaluated using the clusters. Subsequently, the distance between the cluster centers in sequential chunks is calculated using the cosine distance. According to the value obtained, the method categorizes the change as: (a) no drift; (b) noisy impact; or (c) topic drift. In addition, the authors simulated topic drifts by generating datasets with topic changes after fixed periods. Their detection method was compared to nine drift detectors. In terms of false alarms, missing drifts, and delay, the proposed method obtained high average rankings, being statistically equivalent (using the Bonferroni-Dunn test) to the best drift detectors in each metric.
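
The three-way decision can be pictured as two thresholds over the cosine distance between cluster centers of consecutive chunks. The sketch below is only an illustration of that idea: the threshold values, their ordering, and the function name `categorize_change` are assumptions and do not reproduce the exact rule of Li et al. (2018).

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def categorize_change(prev_center, curr_center,
                      noise_threshold=0.2, drift_threshold=0.5):
    """Map the distance between two cluster centers to one of three outcomes.

    Both thresholds are hypothetical values chosen for illustration.
    """
    distance = cosine_distance(prev_center, curr_center)
    if distance < noise_threshold:
        return "no drift"
    if distance < drift_threshold:
        return "noisy impact"
    return "topic drift"
```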

Rabiu et al. (2022, 2023) developed an ensemble classifier coupled to a novel mechanism for drift detection-based adaptive windows (DDAW). Their method suits text streams, especially users’ sentiments and opinions. Their approach can be divided into two components: (i) drift detection and (ii) classification. In many applications, classification errors are used as a proxy for the drift detector. However, the drift detection component compares the data distribution considering two windows. Thus, it is possible to measure drift by evaluating the dissimilarity between the windows. An intriguing aspect of this approach is that it allows for distance metrics and statistical tests. In the paper, the authors compared the Hellinger distance (Hellinger, 1909), Kullback-Leibler divergence (Kullback and Leibler, 1951), Total Variation distance, and the Kolmogorov-Smirnov test (Kolmogorov, 1933; Smirnov, 1948). Their approach, coupled with the Hellinger distance, obtained the best values regarding false alarms, detection rate, and accuracy, even compared to other drift detection methods, i.e., AEE (Kolter and Maloof, 2005), RDDM (Barros et al., 2017), and Page-Hinkley (Page, 1954; Sebastião and Fernandes, 2017). It is unmentioned how the drifts are labeled or whether the data is rearranged to simulate drifts.
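
A window-to-window comparison with the Hellinger distance can be sketched as follows. We assume, for illustration only, that each window is summarized by its token-frequency distribution over the joint vocabulary; the papers refer more generally to the data distribution, and the `threshold` parameter is hypothetical.

```python
import math
from collections import Counter


def term_distribution(window, vocabulary):
    """Relative frequency of each vocabulary token in a window of texts."""
    counts = Counter(token for text in window for token in text.split())
    total = sum(counts[token] for token in vocabulary)
    return [counts[token] / total if total else 0.0 for token in vocabulary]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def drift_between(reference_window, current_window, threshold):
    """Signal a drift when the two windows are more dissimilar than the threshold."""
    vocabulary = sorted(
        {tok for text in reference_window + current_window for tok in text.split()}
    )
    p = term_distribution(reference_window, vocabulary)
    q = term_distribution(current_window, vocabulary)
    return hellinger(p, q) > threshold
```

Since this comparison does not depend on classification errors, it can be applied before labels arrive, which is the main difference from error-based detectors.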

Suprem and Pu (2019b) developed a system for landslide detection, a physical event that causes destruction and for which there are no physical sensors to detect it. The authors combined data from social media and governmental agencies to perform the detection. Concept drift is detected using the Kullback-Leibler divergence test (Kullback and Leibler, 1951) to compare the distributions of two batches. The model is updated by generating or updating the classifiers to handle the concept drift.

Li et al. (2022) presented a distributed long short-term memory (LSTM)-based ensemble method for short-text classification in text stream scenarios. The short texts are enriched by using BERT and Word2Vec models. The LSTM-based method includes a concept drift factor used as a threshold to compare the distance between the LSTM layer trained with the previous batch and the layer trained with the current batch. If the concept drift factor is above the threshold, the weight of the current layer will be higher to generate the combined final output.

Fenza et al. (2023) proposed a fuzzy-formal-concept-analysis-based index for concept drift detection and applied the method to a fake news classification problem. Although the concept drift detection is not directly approached, the authors calculated the correlation between the classifier’s performance and the proposed index. The index is calculated from a fuzzy lattice, i.e., a fuzzy hierarchical knowledge structure, while the classifier’s performance is calculated using F-Score and accuracy. Their results demonstrated a high (Spearman’s and Pearson’s) correlation, between 69% and 87%. The authors claimed that the method has the potential to be used as a proxy for the model update process. However, the fuzzy lattice seems never to be updated, which may prevent the model from working properly over a long time.

Susi and Shanthi (2023) proposed a sentiment drift analysis system based on BERT models, namely TSDA-BERT. According to the authors, the system receives data in a sliding window fashion, with each window corresponding to four days. The authors calculated positive and negative scores per window based on the proportion of positive and negative tweets in the window. From these values, a sentiment drift measure is calculated by simply subtracting the number of negative tweets from the number of positive tweets. This measure is compared across time periods to detect sentiment drift: if the score changes from negative to positive, or vice versa, it indicates a drift.
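
The sign-change rule described above can be sketched in a few lines. The label strings, the function names, and the treatment of a measure of exactly zero (ignored, so the last non-zero sign is kept) are assumptions made for this illustration.

```python
def drift_measure(labels):
    """Number of positive tweets minus number of negative tweets in a window."""
    positives = sum(1 for label in labels if label == "positive")
    negatives = sum(1 for label in labels if label == "negative")
    return positives - negatives


def sentiment_drifts(windows):
    """Return the indices of windows where the measure changes sign.

    A measure of exactly zero is ignored (assumption): the comparison is
    always made against the last non-zero measure.
    """
    drifts = []
    previous = None
    for index, window in enumerate(windows):
        current = drift_measure(window)
        if current == 0:
            continue
        if previous is not None and previous * current < 0:
            drifts.append(index)
        previous = current
    return drifts


windows = [
    ["positive", "positive", "negative"],   # measure = +1
    ["positive", "negative", "negative"],   # measure = -1, sign change
    ["negative"],                           # measure = -1
    ["positive", "positive"],               # measure = +2, sign change
]
detected = sentiment_drifts(windows)
```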

4.4 Model Update Method in Text Stream Settings

We also looked closely for information regarding the model update scheme in the analyzed papers. Fig. 7 depicts the organization. We found four mechanisms: (i) Ensemble update, in which the base learners are substituted or removed over time; (ii) Incremental, in which the model incrementally learns from new data without a retraining process, further split according to the amount of data used to learn, i.e., one input at a time or in batches; (iii) Keep-compare-evolve, which corresponds to methods that generate and evolve new models to adapt to drifts and use the old model to measure the similarity between information from both models; and (iv) Retraining, which can occur after detecting a concept drift or from time to time, in which case drifts are not detected but the model still adapts to them.

Figure 7: Model update methods used when handling text streams bound to concept drift.

4.4.1 Ensemble update

In this category, the works propose techniques that create, update, and combine multiple models, the so-called ensembles. Over time, an ensemble can be updated by removing outdated base learners while adding new base learners trained on newly arrived data. In Suprem and Pu (2019b), the presented system for landslide detection uses batches to update the model. The landslide detector uses a classifier, which is an ensemble. The authors mentioned that they used two approaches for selecting base learners: relevancy and recency. When using relevancy, a k-NN search is performed to discover the most relevant base learners from a pool of trained base learners, considering the centroid of the data used to train these learners. In contrast, the recency scheme returns the most recent base learners to compose the ensemble. In addition, the weighting scheme can be configured as an unweighted, weighted, or model-weighted average. The unweighted average considers the base learners equally to provide an output. The weighted average considers weights provided by domain experts, and the model-weighted scheme considers the base learners’ prior performance to weigh them.
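
The three weighting schemes differ only in where the weights come from, as the following sketch shows. All numeric values are hypothetical, and we assume that each base learner outputs a score in [0, 1]; the actual output format in Suprem and Pu (2019b) may differ.

```python
def weighted_average(scores, weights=None):
    """Combine base-learner scores; weights=None yields the unweighted average."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


base_scores = [0.9, 0.4, 0.7]       # hypothetical outputs of three base learners
expert_weights = [3.0, 1.0, 1.0]    # hypothetical weights provided by domain experts
prior_accuracy = [0.8, 0.6, 0.9]    # hypothetical prior performance of each learner

unweighted = weighted_average(base_scores)
expert_weighted = weighted_average(base_scores, expert_weights)
model_weighted = weighted_average(base_scores, prior_accuracy)
```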

Sun et al. (2021) described an ensemble classification model for short text classification in environments bound to concept drift. The paper emphasizes three main aspects: a feature extension based on the short text features, a concept drift detection method, and an ensemble model. Considering the ensemble model, the authors used SVM as a base classifier. A new classifier is added when concept drift is detected. If the classifier pool is complete, the oldest classifier is removed to add the current one after being trained on the new batch.

Hu et al. (2018) proposed a short text stream classification method based on content expansion coupled with a concept drift detector. The expansion was performed by adding information from external sources, and 100 Wikipedia pages related to 50 keywords were selected, totaling 60,600 pages. The classification task in this study was performed using an ensemble of SVMs, in which each base learner is trained per chunk using the expanded texts. The number of base learners is limited to a specific parameter $H$: when this number is met, the oldest learner is replaced. In specific situations, the latest learner can replace an older learner trained using semantically similar chunks.
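
The bounded-pool policy shared by Sun et al. (2021) and Hu et al. (2018), in which the oldest learner leaves when the pool is full, can be sketched with a fixed-length queue. The class name, the representation of learners as callables, and the majority vote used to combine them are assumptions for illustration; the replacement of semantically similar learners mentioned for Hu et al. (2018) is not covered.

```python
from collections import Counter, deque


class BoundedEnsemble:
    """Ensemble that keeps at most max_learners base learners."""

    def __init__(self, max_learners):
        # A deque with maxlen silently discards the oldest item when full.
        self.learners = deque(maxlen=max_learners)

    def add(self, learner):
        """Add a learner trained on the newest chunk."""
        self.learners.append(learner)

    def predict(self, x):
        """Majority vote over the current learners (assumed combination rule)."""
        votes = Counter(learner(x) for learner in self.learners)
        return votes.most_common(1)[0][0]


ensemble = BoundedEnsemble(max_learners=2)
ensemble.add(lambda text: "sports")    # oldest learner, dropped below
ensemble.add(lambda text: "politics")
ensemble.add(lambda text: "politics")  # pool is full, so the oldest is replaced
prediction = ensemble.predict("election results announced")
```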

Rabiu et al. (2022, 2023) presented an ensemble method for classification. Particularly, Rabiu et al. (2023) tackles the sentiment classification problem. The ensemble model is updated over time by removing the worst base learner from the ensemble when it reaches the maximum number of base learners. To determine the worst base learner, a weighting calculation is performed by leveraging the base learner’s mean square error on the new input data, i.e., $\textrm{MSE}_{i}$, and the base learner’s mean square error on the data from the previous batch (reference data), i.e., $\textrm{MSE}_{r}$. The complete weight calculation for a base learner is performed as $weight = \frac{1}{\textrm{MSE}_{r} + \textrm{MSE}_{i} + \alpha}$, where $\alpha$ is a non-zero factor to avoid division by zero.
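
The weight formula and the removal of the worst base learner can be sketched as follows. The dictionary layout, the function names, and the default value of `alpha` are assumptions for illustration; only the weight expression itself comes from the description above.

```python
def learner_weight(mse_r, mse_i, alpha=1e-6):
    """weight = 1 / (MSE_r + MSE_i + alpha); alpha avoids division by zero."""
    return 1.0 / (mse_r + mse_i + alpha)


def drop_worst(ensemble, max_learners):
    """Remove the lowest-weight learner once the ensemble is full.

    ensemble: dict mapping a learner name to its (MSE_r, MSE_i) pair.
    """
    if len(ensemble) < max_learners:
        return dict(ensemble)
    worst = min(ensemble, key=lambda name: learner_weight(*ensemble[name]))
    return {name: errors for name, errors in ensemble.items() if name != worst}


errors = {"a": (0.1, 0.1), "b": (0.5, 0.4), "c": (0.2, 0.1)}
pruned = drop_worst(errors, max_learners=3)
```

Because the weight is the inverse of the summed errors, the learner with the largest combined error on the reference and new data receives the lowest weight and is the one removed.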

4.4.2 Incremental

The Incremental update scheme regards models capable of learning from new pieces of data without completely retraining the model. In our selection, several papers (Murena et al., 2018; Abid et al., 2018; Li et al., 2018; Melidis et al., 2018; Yin et al., 2018; Hammer and Yazidi, 2018; He et al., 2018; Abid et al., 2019; Suprem and Pu, 2019a, b; Mohawesh et al., 2021; Heusinger et al., 2020a; de Moraes and Gradvohl, 2021; Rakib et al., 2021; Yang et al., 2021; Van Linh et al., 2022; Nguyen et al., 2022; Heusinger et al., 2022; Vo, 2022; Lu et al., 2022; Bravo-Marquez et al., 2022; Kolajo et al., 2022; Periti et al., 2022; Assenmacher and Trautmann, 2022) employ incremental models to approach their applications. However, we distinguish between the manners in which the data are inputted into the model: (i) One input at a time; and (ii) In batches.

Heusinger et al. (2020a) proposed a method for dimensionality reduction using random projection. As already cited in Section 4.3.2, the process is not fully incremental. In this study, the authors utilized three classifiers: (i) Adaptive Robust Soft Learning Vector Quantization (Heusinger et al., 2020b), (ii) Adaptive Random Forest (Gomes et al., 2017), and (iii) Self-adjusting Memory k-NN (Losing et al., 2017). The dimensionality reduction method uses a window of size 1000. However, the classification methods process the data incrementally with One input at a time, except for the Self-adjusting Memory k-NN, for which Heusinger et al. (2020a) reported using five neighbors and a window size of 1000 to match the window size of the random projection. Mohawesh et al. (2021) also updated their models incrementally, using Stochastic Gradient Descent for the Support Vector Machine, Perceptron, and Logistic Regression algorithms. However, similar to Heusinger et al. (2020a), the process is not fully incremental because it uses TF-IDF and principal component analysis (PCA) for dimensionality reduction.

Abid et al. (2018, 2019) presented a method for text stream clustering called AIS-Clus, based on the artificial immune system (Kephart et al., 1994). This system has online and offline phases. The offline phase comprises receiving historical data to generate the first clusters. In the online phase, new data are divided into equal blocks, i.e., it works in batches. At the same time, each instance is evaluated individually, and the method is also capable of handling novel classes. Thus, this work could be categorized as Incremental in Batches or Incremental with One input at a time, depending on the point of view. Although it works in a clustering fashion, the method performs classification tasks.

Murena et al. (2018) presented the adaptive window-based incremental LDA (AWILDA), a method for topic modeling in document streams. This method contains two LDA models, one for topic modeling and another for drift detection, with the help of ADWIN. It receives the data in batches, making it possible for the approach to use ADWIN as a drift detector and to resort to LDA over the batch.

Assenmacher and Trautmann (2022) presented a stream text clustering method. The use of online and offline phases for algorithms that perform stream clustering is well known. The offline phase generally performs adjustments in the model, such as removing stale clusters and merging similar clusters. In the online phase, the method receives input data and assigns each new input to its most similar cluster. However, if no cluster is sufficiently similar, a new cluster is created to accommodate the incoming text. Due to these characteristics, this method could be categorized as Incremental with One input at a time. Interestingly, this method outperformed batch-based methods in the evaluation considered in the paper.
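
The online phase described above, i.e., assigning each text to its most similar cluster or opening a new one, can be sketched as follows. We assume, for illustration only, that clusters are stored as term-frequency dictionaries, that similarity is the cosine similarity, and that `min_similarity` is a fixed threshold; textClust itself uses a dynamically determined limit.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse term-frequency dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign(text, clusters, min_similarity):
    """Assign one text to a cluster (mutating `clusters`) and return its index."""
    vector = {}
    for token in text.lower().split():
        vector[token] = vector.get(token, 0.0) + 1.0

    best_index, best_similarity = None, 0.0
    for index, cluster in enumerate(clusters):
        similarity = cosine_similarity(vector, cluster)
        if similarity > best_similarity:
            best_index, best_similarity = index, similarity

    if best_index is None or best_similarity < min_similarity:
        clusters.append(vector)          # no cluster is similar enough
        return len(clusters) - 1

    for term, weight in vector.items():  # absorb the text into the best cluster
        clusters[best_index][term] = clusters[best_index].get(term, 0.0) + weight
    return best_index
```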

4.4.3 Keep-compare-evolve

The single representative of this model update category is Amba Hombaiah et al. (2021). As aforementioned, this study proposes three sampling methods to update the language models. The three methods, i.e., the Token Embedding Shift method, Sentence Embedding Shift method, and Token MLM Loss method, use both the current and previous models to evaluate changes and sample new data to fine-tune the current model. Thus, the larger the difference between the representations of a given text provided by the old and current models, the higher the chance of that text being selected for fine-tuning. Sampling matters in this specific case because fine-tuning on all the data is costly given the size of the BERT models. In addition, GPUs are necessary to speed up the training/update of these models.

4.4.4 Retraining

Some papers resort to the complete retraining of models. The retraining can be triggered by drift detection or periodically, typically after batch processing. As noted, Chamby-Diaz et al. (2019) proposed a dynamic feature selection method to handle feature drift, namely Dynamic Correlation-based Feature Selection (DCFS). This method uses concept drift detectors, such as ADWIN. Concept drift detectors generally provide two levels of signaling: warning and drift. Whenever a warning signal is outputted, DCFS updates the covariance matrix incrementally. The feature-feature and feature-class correlations are calculated when a drift signal is emitted. Thus, a new Naive Bayes model is trained from scratch using the feature subset selected according to the correlation-based feature selection (CFS).

Other works such as D’Andrea et al. (2019), Bechini et al. (2021), and Bondielli et al. (2022) also utilize the retraining scheme. All these papers compare approaches that resort to the retraining scheme. Retraining occurs regularly and considers data from events. However, the dataset used by the methods during the training step grows incrementally. For example, when event #10 concludes, the data related to this event are appended to the data regarding previous events. Thus, a new model can be trained on the dataset, which now contains the data about event #10. Among these three works, only Bondielli et al. (2022) uses an incremental approach, i.e., Complement Naive Bayes (Rennie et al., 2003) with partial fit. For this approach, however, the authors used TF-IDF for vectorization, which is not updated during the online monitoring after the first event. Thus, the process is not fully incremental. In addition, the authors did not mention any strategy for keeping the dataset at a feasible size after several incremental additions of batches.

The system proposed by Susi and Shanthi (2023), i.e., TSDA-BERT, also considers periodic retraining to overcome sentiment drift. Whenever a sentiment drift happens, the system uses a domain impact score, which calculates the impact of a tweet in the domain. The calculation considers the intersection of a tweet’s words and the domain-specific impact words. According to the authors, an impact above 0.5 indicates adherence to the domain. However, the authors do not explain how the domain-specific words are selected. Unlike D’Andrea et al. (2019), Bechini et al. (2021), and Bondielli et al. (2022), Susi and Shanthi (2023) provided a strategy to keep the training set at a feasible size. The tweets with higher adherence to the domain are included in the training set, and the same number of tweets is removed from it. Consequently, the training set always has the same size. The authors mentioned the utilization of at most 324,685 tweets in the training set. This training set is used for fine-tuning over time.
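
A fixed-size training set of this kind can be sketched with a bounded queue. Several details are assumptions, since the paper does not specify them: the impact score is taken as the fraction of a tweet's distinct words that belong to the domain-specific set, the removed tweets are the oldest ones, and the names `domain_impact` and `refresh_training_set` are hypothetical.

```python
from collections import deque


def domain_impact(tweet, domain_words):
    """Fraction of the tweet's distinct words found in domain_words (assumed form)."""
    words = set(tweet.lower().split())
    return len(words & domain_words) / len(words) if words else 0.0


def refresh_training_set(training_set, new_tweets, domain_words, threshold=0.5):
    """Append domain-adherent tweets; the deque's maxlen evicts the oldest ones."""
    for tweet in new_tweets:
        if domain_impact(tweet, domain_words) > threshold:
            training_set.append(tweet)
    return training_set


training_set = deque(["old one", "old two", "old three"], maxlen=3)
domain_words = {"vaccine", "dose"}
refresh_training_set(
    training_set,
    ["vaccine dose today", "nice weather outside today"],
    domain_words,
)
```

In this toy run, only the first new tweet exceeds the 0.5 threshold, so it enters the set and the oldest stored tweet is evicted, leaving the size unchanged.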

4.5 Stream Mining Tasks applied in Text Stream Settings

In Fig. 8, we organize the stream mining tasks addressed in the analyzed papers and their respective applications, considering the information obtained from the selected papers. In this study, we considered the following stream mining tasks: (a) Classification; (b) Clustering; (c) General detection; and (d) Topic modeling.

4.5.1 Classification

Classification is among the most common stream mining tasks. In general classification, the objective is to predict, with arbitrary accuracy, a single class from a small set of values for a given input. Some applications found in the papers addressing the classification task include (i) crisis management; (ii) fake news detection; (iii) fake review detection; (iv) hashtag prediction; (v) sentiment analysis; (vi) short-text classification; and (vii) spam detection.

Figure 8: Text stream mining tasks and applications found in the selected papers.

Regarding crisis management, Pohl et al. (2018) aimed to identify the relevant tweets about two environmental disasters: the Colorado floods and the Australian bushfires. It is considered a binary classification task because the model assesses whether or not a tweet is relevant, sometimes with a human in the loop. Their approach is evaluated regarding the average error and the number of queries. Because the method presented in Pohl et al. (2018) employs active learning strategies, the label uncertainty determines whether the system should query a user. Only Pohl et al. (2018) represents this application in the classification task.

Fake news detection is addressed in Fenza et al. (2023). The authors proposed an index based on fuzzy formal concept analysis, which correlates with the classifier’s performance. According to the authors, the fake news detection problem is generally tackled as a binary classification, where a model should classify news as fake or real. The authors evaluated three machine learning methods: Random Forest, Naive Bayes, and Passive-Aggressive (Crammer et al., 2006). Although the authors proposed the index, they did not couple it to the methods to trigger retraining. Three datasets containing news articles between 2018 and 2020 were used, i.e., NELA-GT-2018, NELA-GT-2019, and NELA-GT-2020 (see Section 5). According to the authors, only the Passive-Aggressive algorithm was tested online, and the news articles published between February and August 2018 were used as the training set, considering also the fuzzy lattice structure. The classifiers’ evaluation is performed using accuracy and F1-score. The evaluation of the proposed index happens through visual analysis, Pearson’s and Spearman’s correlation, and cosine similarity. The authors argue that their method would allow early drift detection but do not provide experiments or evidence.

Considering fake review detection, Mohawesh et al. (2021) tackled this task using three ML methods: SVM, logistic regression, and perceptron. The authors used four Yelp datasets, only one containing fake and genuine reviews. Mohawesh et al. (2021) noted that the datasets “were built based on an unknown filtering algorithm and web-scraper techniques to label each review as fake or genuine”. Because the idea is to determine whether or not a review is fake, it corresponds to a binary classification task. The ML methods are evaluated using accuracy and statistically assessed using the Nemenyi test (Nemenyi, 1963). The authors claimed that their work is the first to address concept drift in the fake review detection problem. Considering the selected papers, this is the only method that approaches fake review detection.

Hashtag prediction is addressed in Heusinger et al. (2020a), Amba Hombaiah et al. (2021), and Heusinger et al. (2022). Both Heusinger et al. (2020a) and Heusinger et al. (2022) used random projection as a dimensionality reduction method for text streams. Also, a dataset, i.e., NSDQ, is proposed for the problem because it generates high-dimensional data and can be reduced in real-time by random projection. Furthermore, this is the only real-world textual dataset addressed in these papers, while the others are synthetic. This dataset contains 15 classes, which makes the stream mining task a multiclass classification. The evaluation is performed in terms of accuracy, Cohen’s Kappa, and run time. In Amba Hombaiah et al. (2021), the sampling approaches for updating BERT are tested using two datasets: OffensEval 2019 and Country Hashtag Prediction. Approaching OffensEval constitutes a binary classification task; thus, the authors used the Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC) curve and F1 score. In contrast, addressing Country Hashtag Prediction corresponds to a multiclass classification task and is evaluated using micro-F1 score, macro-F1 score, and accuracy.

In sentiment analysis, the objective is to develop a model capable of inferring a user sentiment from text. According to Medhat et al. (2014), “sentiment analysis (SA) or opinion mining (OM) is the computational study of people’s opinions, attitudes and emotions toward an entity”. Similarly to sentiment analysis, stance detection regards the position of a given text’s author about a specific topic, considering the labels in favor, neutral/neither, and against, sometimes expressed in literature with different labels but with similar meanings (Bechini et al., 2021; Küçük and Can, 2020). The papers D’Andrea et al. (2019), Bechini et al. (2021), and Bondielli et al. (2022) approach stance detection, with D’Andrea et al. (2019) and Bechini et al. (2021) related to vaccination, and Bondielli et al. (2022) regarding the stance about the Green Pass, as mentioned in previous sections. The authors of these three works collected their datasets primarily from Twitter. As aforementioned, stance detection classifies texts into three labels, indicating that it is a multiclass classification task. In D’Andrea et al. (2019), the metrics utilized to evaluate the method are F1 score, precision, recall, AUC, and accuracy. In Bechini et al. (2021), models were evaluated using accuracy and F1 score, and Bondielli et al. (2022) used F1 score, accuracy, and the number of features in each model.

Bravo-Marquez et al. (2022) proposed a sentiment lexicon inductor for time-evolving environments in a sentiment analysis context. The authors claimed that sentiments could change over time, while new words in different sentiments can emerge. In addition, the lexicon would be static in a fully incremental system without sentiment induction. In this case, from a seed lexicon, the authors processed the dataset in a stream fashion and, at the same time, inferred sentiment from tokens absent in the lexicon. Although, in practice, the system outputs a value limited by a logistic function, we presented this paper in the classification section because of the sentiment analysis application. In addition, the authors tested their approach by deliberately changing lexicon sentiment scores and measuring how long the system would take to recognize the new sentiments. Finally, the authors used accuracy and Cohen’s Kappa to evaluate the classifiers applied together with their method.

Short-text classification is addressed in papers such as Li et al. (2018), Sun et al. (2021), and Li et al. (2022). Li et al. (2018) proposed a method for short text streams bound to concept drift. This method takes advantage of Probase for short text enrichment. The approach is evaluated in terms of time and accuracy. Sun et al. (2021) described a method for text stream classification based on feature extension and ensembles formed by ensembles. This method can handle concept drifts by calculating the distance between each short text in the previous and new batches. Li et al. (2022) proposed a method for short text classification in text stream scenarios. This method enriches text by using representations from BERT and Word2Vec. In addition, the method uses a Convolutional Neural Network (CNN) to extract high-level features. This method handles concept drift by resorting to a concept drift factor used in the system. The approaches in Li et al. (2018), Sun et al. (2021), and Li et al. (2022) are applied to the same datasets, i.e., Tweets, TagMyNews, and Snippets, in which the classes correspond to topics.

Some papers address spam detection in their experiments (Melidis et al., 2018; Chamby-Diaz et al., 2019; de Moraes and Gradvohl, 2021). Because the goal is to classify a piece of text into either non-spam or spam, the task is considered a binary classification task. The authors in Melidis et al. (2018) provided two mechanisms: an ensemble-based mechanism for predicting a feature’s probability of association with a given class, which considers that words might be subject to temporal trends, and a sketch-based mechanism that allows for memory-bounded feature space maintenance. The approach utilizes an ensemble composed of statistical techniques to account for feature periodicities. The ensemble consists of a Poisson model (Melidis et al., 2018), a Seasonal Poisson model (Holt, 2004), an Auto-regressive Integrated Moving Average (ARIMA) model (Box et al., 2015), and an Exponential Weighted Moving Average (EWMA) model (Nishida et al., 2012), to capture regular, seasonal, auto-correlated, and sudden trends. A sketch-based approach is designed to maintain a concise feature space. The authors tested three versions: a baseline sketch that retains only word and occurrence counts, a fading sketch that considers the importance of frequent words, and a drift-detector-based sketch, which uses ADWIN to detect the decrease in word usage. The approaches are compared regarding accuracy, Cohen’s Kappa, and run time. Chamby-Diaz et al. (2019) proposed a method for feature selection based on correlations to handle feature drifts in data stream scenarios. The method is not exclusive to spam detection, but the spam dataset is the only text-based dataset used by the authors. The method is evaluated in terms of accuracy. de Moraes and Gradvohl (2021) proposed a method for feature selection in binary text stream classification tasks, namely OFSER. The proposed method leverages adaptive regularization and weighs the input for each new instance. The regularization, according to the authors, decreases the impact of the feature drift. Despite being fast and having decent overall performance, their method depends on a parameter to define the number of features to be selected from the original set. The method runs on top of a naive Bayes classifier, chosen due to its simplicity and naive assumption of independence among the features. The approach is evaluated using F1 score, accuracy, memory consumption, and run time. Furthermore, due to “an undesired conservativeness of the Friedman test” (de Moraes and Gradvohl, 2021), it is statistically assessed using the Iman-Davenport test (Iman and Davenport, 1980) instead of the Friedman test (Friedman, 1937, 1940), and the Bergmann-Hommel’s procedure (Garcia and Herrera, 2008) instead of the Nemenyi test (Nemenyi, 1963). OFSER ranked among the three best approaches.

As expected, the most frequent metrics in this stream mining task are: accuracy, Cohen’s Kappa, F1 score, AUC, and run time. Although not all methods were assessed regarding run time, it is crucial to have values for this metric due to its use in streaming scenarios, where time and memory consumption are constrained.
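
Cohen’s Kappa corrects accuracy for the agreement expected by chance, which matters in imbalanced tasks such as spam detection. A minimal sketch of the computation follows, assuming two equally long label sequences; the labels in the example are hypothetical.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement (accuracy) and p_e is the agreement expected
    by chance from the label marginals. Undefined when p_e equals 1.
    """
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: accuracy is 0.75, but Kappa is lower.
truth = ["spam", "spam", "ham", "ham"]
preds = ["spam", "ham", "ham", "ham"]
print(cohen_kappa(truth, preds))
```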

4.5.2 Clustering

Clustering is a stream mining task in which the aim is to group data into intrinsic clusters according to their features (Bifet et al., 2018). The general idea is to minimize the similarity between different clusters and maximize the intra-cluster similarity (Bezerra et al., 2015). Unlike classification, in the clustering task, the labels are not available before the learning process. Since there is no ground truth, the learning process is named unsupervised, and alternative metrics are necessary (Bifet et al., 2018).

Three works approach the stream clustering task (Abid et al., 2018, 2019; Assenmacher and Trautmann, 2022). The first two papers presented similar approaches that use the artificial immune system (AIS) for text clustering, while the third presents textClust, a stream clustering method. Abid et al. (2018) developed a method for text stream clustering based on the AIS called AIS-Clus. It uses heuristics based on the AIS to cluster data efficiently and, by discovering these clusters, can also detect concept drift and feature evolution. The authors could also recognize new classes corresponding to the concept evolution task in the experiments. According to the authors, the AIS is analogous to the biological immune system because it receives an intruder, clones specific cells, and handles the intruder until it dies. In their approach, for each new input (analogized as antigen), a scoring function calculates its adherence to each cluster (analogized as a B-cell). The clonal selection makes copies of clusters that undergo a mutation process. Later, the negative selection mechanism makes it possible to detect noisy data. Their method does not start from scratch, having an initial static phase for preprocessed historical data clustering. The other phase is online stream processing, which receives the clusters from the first phase as input. The authors used a survival factor for each word in an aging-like scheme. Although it works in a clustering scheme, the method is evaluated regarding F1 score, accuracy, recall, and precision. More information is provided in Abid et al. (2019), which expands on Abid et al. (2018), and new experiments are executed. For example, to test the approach’s capacity to handle new classes, the authors arranged data in three datasets to simulate the emergence of new classes/events, one of which included texts in Arabic. 
When AIS-Clus is compared to other methods, i.e., CluStream and DenStream, it achieves the best results regarding the precision, recall, and number of clusters, functioning as a classifier as described in Abid et al. (2018).

Assenmacher and Trautmann (2022) presented an online method for textual clustering, namely textClust. The algorithm is available within the RiverML Python library (Montiel et al., 2021) (https://riverml.xyz/0.19.0/api/cluster/TextClust/). Over time, in the offline phase, the model is maintained concisely by merging similar clusters and removing outdated ones. A fading factor for the cluster weighting is used to determine cluster staleness. The method was evaluated regarding homogeneity, completeness, and Normalized Mutual Information. Homogeneity evaluates how well a clustering method assigns the data points to the clusters. Reaching 1 for homogeneity means that each cluster contains data points of a single class. On the other hand, completeness measures whether the data points of a given class were assigned to the same cluster. Reaching the value 1 for completeness means that the data points of each class were assigned to a single cluster. The authors support these statements by mentioning that “completeness scores tend to be lower than the homogeneity scores”, and that it “indicates that online clusters are quite pure with low entropy, but the topics are distributed over multiple clusters” (Assenmacher and Trautmann, 2022).

Most selected works that addressed a stream clustering task focused on short-text clustering. Rakib et al. (2021) proposed an efficient method for similarity-based short-text stream clustering called EStream. The method’s efficiency comes from utilizing an inverted index to find the most similar clusters. The authors tested lexical (unigram, bigram, and biterm) and semantic text representations (using a pre-trained GloVe (Pennington et al., 2014)). Their method has two steps: the online and the offline phases. First, each cluster is lexically represented as a cluster feature 4-sized vector consisting of the features (in unigram, bigram or biterm), their frequencies in the cluster, the number of texts in the cluster, and the cluster identifier. The semantic representation consists of the cluster vector and the cluster center, calculated from the average of the GloVe representation of the texts. EStream is compared in terms of Normalized Mutual Information (NMI), Homogeneity, and V-Measure. EStream had the best performance in 50% of the datasets used for evaluation. The authors highlighted that EStream requires less running time and that it stores more information than the other approaches, but that would be an acceptable trade-off (Rakib et al., 2021). They also highlighted that EStream might perform inadequately in more extensive texts.
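
The role of the inverted index can be illustrated with a small sketch. It assumes that clusters are summarized by feature frequencies and that similarity is approximated by the number of shared features; this overlap score is a stand-in chosen for brevity, not the similarity function of Rakib et al. (2021), and the class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class InvertedIndexClusters:
    """Toy cluster store with an inverted index from features to cluster ids."""

    def __init__(self):
        self.index = defaultdict(set)  # feature -> ids of clusters containing it
        self.features = {}             # cluster id -> Counter of feature frequencies
        self.sizes = {}                # cluster id -> number of texts in the cluster

    def add(self, cluster_id, tokens):
        """Add a text (list of features) to a cluster and update the index."""
        self.features.setdefault(cluster_id, Counter()).update(tokens)
        self.sizes[cluster_id] = self.sizes.get(cluster_id, 0) + 1
        for token in set(tokens):
            self.index[token].add(cluster_id)

    def most_similar(self, tokens):
        """Return the cluster sharing the most features, or None if none match.

        Only clusters reachable through the index are scored, so clusters with
        no feature in common with the text are never visited.
        """
        overlap = Counter()
        for token in set(tokens):
            for cluster_id in self.index.get(token, ()):
                overlap[cluster_id] += 1
        return overlap.most_common(1)[0][0] if overlap else None


store = InvertedIndexClusters()
store.add(0, ["stock", "market"])
store.add(1, ["football", "match"])
print(store.most_similar(["market", "crash"]))
```

The point of the design is that the lookup cost depends on the number of clusters sharing features with the incoming text rather than on the total number of clusters.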

Vo (2022) proposed a new method called GOWSeqStream, for short text stream clustering, using deep sequential methods, graph-of-words representation, and pre-trained word-embedding models. It uses subgraph mining to extract semantic information from the texts, although it lacks information on how to use it, considering even the number of sliding windows and the support. The method also utilizes Word2Vec representations to generate embeddings to serve as input for other deep encoders, such as GRU. The author also experimented using bidirectional LSTM, Doc2Vec, and BERT representations. These representations are utilized as input for a Dirichlet Process Mixture Model (DPMM). The method was compared with five approaches using three datasets; the proposed approach achieved the best values for two. The author also compared the representation generation; the best combination was with BERT and Bi-LSTM. In addition to English, the author used a Vietnamese text dataset as a final test. In this scenario, the proposed approach achieved the best results among the competitors. As in Rakib et al. (2021), the authors used the Normalized Mutual Information (NMI) as the primary evaluation metric.

Yang et al. (2021) proposed a new short text stream clustering method using an incremental word relation network. The authors highlighted their primary contribution as (a) a new method for real-time short text clustering using a bi-weighted relation: term frequency and co-occurrences, to overcome sparsity; (b) a fast method to locate core terms that represent text clusters sufficiently; (c) the mechanism to overcome topic drift, removing outdated relations and incrementally adding new terms and relations. In addition, the authors proposed a new data structure to represent the clusters, which they named cluster abstract. This data structure has five fields: an index, the number of short texts in clusters, the sum of timestamps, the squared timestamps sums, and a new attribute compared to EWNStream (their previous approach) called pd, containing a core term set. The method uses data windows and specific calculations to update the model to add new data, exclude outdated data, and merge clusters. Besides, the method has a decay scheme to control the forgetfulness of old clusters. In essence, the method develops a graph containing terms and relations, and the clusters are obtained from groups of closely related words. The method searches for a cluster abstract with the most intersection of words considering the input data to predict a cluster to newly inputted data. Using a dataset crawled by themselves, the authors compared their proposed method against EWNStream, MStream, Sumblr, and Dynamic Topic Model. EWNStream+ outperformed its previous version (achieving roughly 86% of NMI accuracy) and was approximately 30 percentage points better than MStream, the third in the ranking. In addition, the run time was very modest across different stream lengths.

Yin et al. (2018) proposed two text stream clustering algorithms: (a) MStream, a one-pass clustering method that utilizes the Dirichlet Process Multinomial Mixture (DPMM) model and an update process per batch; and (b) MStreamF, which deletes outdated clusters, maintaining a concise model. Considering the clustering process of the MStream algorithm, there is the assumption that the new documents arrive sequentially, and each is processed only once. The initial document generates a new cluster, and subsequent documents choose one of the existing clusters or create a new one. The authors’ updating process proves beneficial in the batch processing of text streams. The process is designed such that each document gets assigned and then temporarily deleted from the cluster so that the similarity of the other documents in the same batch is not impacted. After completing the batch process, all documents are assigned to their original cluster. For MStreamF, the authors developed a deleting scheme that works for batch processing by adding a new parameter $B_s$, which accounts for the number of batches. When the number of processed batches meets the $B_s$ parameter, the new batches are processed after the documents related to the oldest batch are deleted. As the iterations go by, it is expected that some clusters become empty, indicating that they are outdated and could be deleted. The approaches were assessed in terms of NMI, run time, and the number of clusters. They concluded that MStreamF is faster than MStream due to the conciseness of the former model. Comparing the proposed and the state-of-the-art models, MStream and MStreamF outperformed their competitors. MStreamF performed best with temporally ordered datasets, whereas MStream performed best with unordered datasets.
The run time of all algorithms increased linearly with the size of the datasets, while the single-pass algorithms were faster.
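
The batch-based deletion scheme of MStreamF can be sketched as follows. This is a simplified illustration that assumes the cluster assignment of each document is provided by the caller; it does not implement the DPMM-based assignment, and the class name and the `max_batches` parameter (which plays the role of the batch-count parameter) are hypothetical.

```python
from collections import Counter, deque


class BatchForgetting:
    """Keeps only the most recent batches and drops clusters that become empty."""

    def __init__(self, max_batches):
        self.max_batches = max_batches   # number of batches kept in the model
        self.batches = deque()           # stored batches, oldest first
        self.cluster_sizes = Counter()   # cluster id -> number of documents

    def process_batch(self, assignments):
        """Process a batch given as a list of cluster ids, one per document."""
        if len(self.batches) == self.max_batches:
            # Forget the documents of the oldest batch before the new one.
            for cluster_id in self.batches.popleft():
                self.cluster_sizes[cluster_id] -= 1
                if self.cluster_sizes[cluster_id] == 0:
                    del self.cluster_sizes[cluster_id]  # empty, hence outdated
        self.batches.append(list(assignments))
        self.cluster_sizes.update(assignments)


model = BatchForgetting(max_batches=2)
for batch in ([0, 0, 1], [1, 2], [2, 2]):
    model.process_batch(batch)
print(dict(model.cluster_sizes))
```

In this toy run, cluster 0 only received documents from the first batch, so it disappears once that batch is forgotten.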

In summary, the NMI, run time, and the number of clusters are the most often used metrics for stream clustering and short-text stream clustering. The latter may be considered a measure of conciseness, which directly corresponds to one of the constraints of streaming scenarios, i.e., memory consumption, and may indirectly impact run time. NMI, a Shannon-entropy-based metric, measures the similarity of two sets and, concerning clustering, the similarity of the ground-truth and the model-generated clusters (Yin et al., 2018; Emmons et al., 2016). Other metrics may appear, such as completeness, and homogeneity. Those metrics vary between 0 and 1, where the higher, the better. Homogeneity evaluates how well a clustering method assigns the data points to the clusters. A perfect homogeneity, i.e., 1, indicates that each cluster contains data points of a single class. As aforementioned, completeness evaluates whether the data points of a given class were assigned to the same cluster. A perfect completeness value suggests that the data points of each class were assigned to a single cluster.
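
The three entropy-based metrics can be computed directly from label counts. The sketch below uses the usual conditional-entropy definitions of homogeneity and completeness and, as an assumption, normalizes the mutual information by the arithmetic mean of the two entropies; other NMI normalizations exist, and the selected papers may use a different one.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """H(a | b) for two equally long label sequences."""
    n = len(a)
    joint, b_counts = Counter(zip(a, b)), Counter(b)
    return -sum((c / n) * log(c / b_counts[y]) for (_, y), c in joint.items())


def homogeneity(classes, clusters):
    """1 when every cluster contains data points of a single class."""
    h = entropy(classes)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(classes, clusters) / h


def completeness(classes, clusters):
    """1 when all data points of each class fall into a single cluster."""
    return homogeneity(clusters, classes)


def nmi(classes, clusters):
    """Mutual information normalized by the mean of the two entropies."""
    h_c, h_k = entropy(classes), entropy(clusters)
    if h_c + h_k == 0:
        return 1.0
    mutual_info = h_c - conditional_entropy(classes, clusters)
    return 2.0 * mutual_info / (h_c + h_k)


# Two classes split over four pure clusters: high homogeneity, low completeness.
classes, clusters = [0, 0, 1, 1], [0, 1, 2, 3]
print(homogeneity(classes, clusters), completeness(classes, clusters), nmi(classes, clusters))
```

The toy example reproduces the pattern of pure clusters with classes spread over several clusters: homogeneity is perfect while completeness is not.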

4.5.3 General detection

In this category, we grouped papers that tackled event detection and novelty detection. According to Faria et al. (2016), novelty detection is “the ability to identify an unlabeled instance (…) that differs significantly from the known concepts”. As suggested in Faria et al. (2016), we considered concept drift detection, semantic shift detection, and concept evolution as sub-categories of novelty detection.

We also considered physical event detection a sub-category of event detection. As mentioned previously, Suprem and Pu (2019a), Suprem et al. (2019b), and Suprem and Pu (2019b) described distinct aspects of a system for landslide detection. They utilized governmental reports as trustworthy sources and social media posts as social sensors (also named strong and weak signals, respectively). The system is described as fully autonomous and continuously evolving, making human intervention unnecessary. Although the works are similar in several aspects, there are minor variations in the evaluation metrics. In Suprem and Pu (2019a), the selected metrics are precision and F1 score. The event detection is assessed using false positives and false negatives, where the original variant of the system is used as ground truth. Suprem et al. (2019b) used F1 score, precision, recall, and the number of events detected as metrics. There is no ground truth regarding the number of events: the events are only counted. Suprem and Pu (2019b) used accuracy to evaluate classifiers’ performance across data windows.

Kolajo et al. (2022) proposed a framework for real-time event detection using social media as a data source. The interesting highlights in this paper regard the tweets’ enrichment for slang, abbreviations, and acronyms based on external sources. The method creates a local vocabulary using data from various external sources. In addition, they utilized spelling correction and emoticon replacement. The authors used an incremental clustering algorithm to cluster events and then rank these events based on important words for each event. The authors evaluated their method using two experiments: (a) comparing it to the General Social Media Feed Preprocessing Method (GSMFPM) to determine if the enrichment layer performs effectively; and (b) event detection from social media. In experiment (a), the authors represented the tweets using unigrams and bigrams, supposedly later converted to GloVe (unclear in the paper). Later, the vectors are applied as input to a Feedforward Neural Network (FNN) and a Convolutional Neural Network (CNN). These approaches are not incremental, thus presenting concerns about the process’ timeliness regarding real-time events. In this experiment, they measured the cross-entropy loss across the training epochs for both Twitter Sentiment Analysis and Naija datasets. Their method outperforms GSMFPM. The second experiment measures accuracy over events in social media, using precision, recall, and F1 score. The authors used a dataset called Event2012, which contains annotations about events. The proposed method obtained a higher F1 score than the other approaches.

Regarding novelty detection and its subdivision in this study, only one paper exclusively considers concept drift detection (de Mello et al., 2018). Three include the concept evolution problem (Abid et al., 2019; Chamby-Diaz et al., 2019; Wang et al., 2019), and another addresses semantic shift detection (Periti et al., 2022). Considering concept drift detection, de Mello et al. (2018) used cross-recurrence quantification analysis (CRQA) to detect concept drifts. The authors’ idea was to highlight the most significant hashtag-related events. Cross-recurrence quantification analysis is used to compare the changes in trajectory. This outcome is achieved by assessing the longest diagonal line of two consecutive windows and whether they follow the same generating process over time. All operations occur inside a system called TSViz. The experiments discussed in the paper are on drift detection related to hashtags from Brazilian politics. The authors concluded that the detected drifts trace directly back to facts from the news. According to the authors, recurrence analysis “characterizes the behavior of dynamical systems by reconstructing produced data in phase spaces”. The authors used Normalized Compression Distance (NCD) to compute the similarity among texts and Naive Bayes to perform sentiment analysis; however, the authors did not detail the classification process. The results were visually assessed.

The concept evolution problem regards the increase in the number of classes over time. For a model to be updated, it must internally account for these novel classes (Faria et al., 2016). Traditional ML methods require prior knowledge of the number of classes. Abid et al. (2018) and Abid et al. (2019) proposed a method for text stream clustering based on AIS. These papers were previously referenced in this work. They also managed concept evolution (under the name of novelty detection). These methods address the concept evolution problem by cloning and mutating existing clusters, a heuristic of the clonal selection principle. If the novel data do not fit into a cluster, they are sent to the outlier buffer, where they are examined periodically to detect novel classes. Abid et al. (2018) evaluated the quality of concept evolution handling using the $M_{new}$ metric, which measures the rate of novel class instances misclassified as from an existing class. In addition, the authors plotted the F1 score, accuracy, and recall over time, demonstrating the appearances of new classes and how their method recovers from concept evolution. The run time was not measured. Abid et al. (2019) employed a similar plot as Abid et al. (2018) for two datasets. In addition, they plotted the number of existing classes and identified classes by the method over time. The metric $M_{new}$ is also used, and the number of missed classes is computed.

Wang et al. (2019) proposed ESACOD, a framework for streaming classification with concept evolution and subject to concept drift. Their work aims to learn satisfying parametric Mahalanobis-based metrics in real-time. According to the authors, the objective is to identify a feature space projection in which its constraints generate properties of cohesion and separation (Wang et al., 2019). Cohesion is the ability of data points to occur close to others from the same class. In contrast, separation is the ability of data points to be distant from others from different classes (Wang et al., 2019). Their method trains an open-world classifier with a small dataset with an initial metric established. When new data arrives from the stream, the metric is applied to it, generating data in a new feature space, and the prediction is made afterward. If the prediction indicates that the data does not belong to a novel class, the prediction remains unchanged. On the contrary, if the classifier assumes the data are from a potentially novel class, the data are added to a buffer. When this buffer is filled, it is checked for concept evolution and concept drift. An arbitrary percentage (between 0 and 30%) of data with their respective labels is required. Finally, the evolution class metric is computed using paired constraints based on this randomly selected data. Later, a K-Means algorithm (Lloyd, 1982) is applied, and a label propagation (Zhu and Ghahramani, 2002) method is performed apparently to the other data in the buffer. If a concept drift or concept evolution is detected, a new classifier is trained with the data to replace the older classifier. The authors concluded that their approach could address the challenges of multiple novel class detection and stream classification bound to concept drift and with few labels available. The method is evaluated in terms of accuracy and run time. 
Concerning concept evolution, the metrics used are $M_{new}$, which measures the rate of novel class instances misclassified as an existing class; $F_{new}$, which measures the rate of existing class instances misclassified as a novel class; $A_{new}$, which is the accuracy of novel class classification; and $A_{known}$, which is the accuracy of known class classification.
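
The two error rates for novel class detection can be computed from per-instance novelty flags. This is a minimal sketch under the common definitions, in which $M_{new}$ is the percentage of novel class instances labeled as an existing class and $F_{new}$ is the percentage of existing class instances labeled as novel; the function name and the example flags are hypothetical.

```python
def novelty_error_rates(is_novel_true, is_novel_pred):
    """Return (M_new, F_new) as percentages.

    is_novel_true[i] is True when instance i belongs to a novel class.
    is_novel_pred[i] is True when the model flags instance i as novel.
    """
    pairs = list(zip(is_novel_true, is_novel_pred))
    n_novel = sum(1 for t, _ in pairs if t)
    n_known = len(pairs) - n_novel
    missed_novel = sum(1 for t, p in pairs if t and not p)  # novel seen as existing
    false_novel = sum(1 for t, p in pairs if p and not t)   # existing seen as novel
    m_new = 100.0 * missed_novel / n_novel if n_novel else 0.0
    f_new = 100.0 * false_novel / n_known if n_known else 0.0
    return m_new, f_new


# Hypothetical stream segment: two novel instances and four known ones.
truth = [True, True, False, False, False, False]
preds = [True, False, True, False, False, False]
print(novelty_error_rates(truth, preds))
```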

Regarding semantic shift detection, Periti et al. (2022) addressed this problem in an incremental way. The authors used incremental clustering techniques (such as affinity propagation) to generate representation clusters in time slices. The word contexts in the past are clustered into several clusters, serving as a memory for posterior observations. To generate representations, the authors tested BERT and Doc2Vec. BERT provides contextual representation, whereas Doc2Vec provides pseudo-contextual embeddings. The approach selects documents in which target words emerge, fine-tunes the embedding model to add new arriving documents, extracts the embeddings, clusters the representations, and refines the clusters by removing clusters of single or old representations. The authors tested their approach using representations generated by BERT and Doc2Vec, for two datasets from SemEval 2020: CCOHA and LatinISE. The authors evaluated alternatives based on affinity propagation. The incremental version of the affinity propagation (IAPNA) performed adequately on the LatinISE dataset using BERT representations and on the English dataset using Doc2Vec representations. In contrast, the affinity propagation a posteriori had satisfying results in the opposite situations. The authors were surprised that Doc2Vec obtained decent results and consumed less time than contextual models. It is important to notice that the target word must be known in advance to perform analyses.

4.5.4 Topic modeling

Topic modeling consists of statistical tools to examine textual data and identify the most relevant terms related to each theme. This approach facilitates the exploration of the interconnections among these themes and their temporal evolution (Blei, 2012). It is also considered a text mining task (Kherwa and Bansal, 2019). Four selected papers approach topic modeling (Murena et al., 2018; Hu et al., 2018; Van Linh et al., 2022; Nguyen et al., 2022).

Murena et al. (2018) proposed an approach mixing LDA and ADWIN to overcome the problem of topic modeling in document streams, entitled AWILDA. LDA (Blei et al., 2003) is a common method for topic modeling. The authors mentioned that LDA had gained much attention, and it also has an online version. However, one problem with the online version is setting window sizes because drifts may happen in a smaller period than the window size. Thus, the authors defined the window with the aid of an ADWIN module, which can assist in determining topic drifts and the new window for LDA to consider. Two classes of algorithms are mentioned: the passive, which updates a model for each observation, and the active algorithms, which attempt to detect the drift and update the model only when the drift is detected. We can draw parallels between these classes of algorithms and the detection methods presented in Section 4.3, i.e., adaptive and explicit, respectively. The authors’ idea was to separate the task of topic modeling and topic drift detection. There are two LDA models inside AWILDA: one for language modeling ($LDA_m$) and the other for drift detection ($LDA_d$). In this approach, for each document received from the stream, AWILDA computes the likelihood for $LDA_d$ and adds it to the ADWIN module. If a drift is detected, $LDA_m$ is trained on the subwindow ADWIN selects. $LDA_m$ is updated whenever a new document arrives from the stream.
The authors evaluated their proposed method using the perplexity metric for document modeling and the latency between the actual current drift and the detection. According to the authors, perplexity is “used by default in language modeling to measure the generalization capacity of a model on new data” (Murena et al., 2018). The authors concluded that AWILDA could recognize all drifts in the synthetic datasets and one version of the real-world dataset. In addition, the method can select the documents window to be used for updating. AWILDA can detect abrupt drifts and works sufficiently for gradual drifts. Compared to online LDA, it works similarly until a drift occurs. When drift occurs, AWILDA is retrained, which increases perplexity, but it outperforms the online LDA ultimately.
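
Perplexity is commonly computed as the exponential of the negative average per-token log-likelihood of held-out documents. The following minimal sketch assumes that the per-document log-likelihoods have already been produced by some topic model; the function name and the numbers in the example are illustrative.

```python
from math import exp, log


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-sum_d log p(w_d) / sum_d N_d); lower values indicate a better fit."""
    return exp(-sum(doc_log_likelihoods) / sum(doc_lengths))


# A model that assigns probability 1/4 to every token has perplexity 4.
lengths = [3, 5]
log_likelihoods = [n * log(0.25) for n in lengths]
print(perplexity(log_likelihoods, lengths))
```

A rise in this value after a drift therefore indicates that the current model assigns lower probability to the newly arriving documents.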

Hu et al. (2018) proposed a short text stream classification method that uses content expansion and includes a concept drift detector. According to the paper, the external sources must satisfy two criteria: to be large and sufficiently rich to comprise most contents in the short text stream that will be classified, and to be highly topic-consistent with the text stream. The method mines hidden information from the external corpus by using LDA because, according to the authors, LDA performs adequately on longer texts. From the LDA model, top representative words for the topics are selected to be added (once or several times) to a short text according to the topic distribution and word probability of belonging to a topic. The topic distribution represents each short text. The method was evaluated regarding accuracy (classification task) and the drifts, using false alarms, missing drifts, and delays. The datasets were arranged to simulate drift; however, the method is unspecified. The authors concluded that their approach surpasses the accuracy of all the competitors, demonstrating more stability. In addition, their approach could recover from drift earlier than other approaches and outperforms the competitors regarding delay and missing drifts.

Van Linh et al. (2022) proposed a graph convolutional method for topic modeling, considering short and noisy text streams. The authors leveraged Word2Vec representations and Wordnet knowledge graph to improve the predictions of their method, called GCTM. The authors claimed that their method could balance the knowledge graph and the knowledge obtained from the previous data batch. This ability can be valuable when handling concept drift. GCTM integrates a graph convolutional network (GCN) into an LDA model to exploit a knowledge graph, and both are updated simultaneously in the streaming environment. The authors tested their approach using six short text datasets and two regular text datasets. Using previous knowledge allows GCTM to output satisfying predictions and recover more quickly from concept drift. The authors simulated concept drift by rearranging the topics sequentially. The metrics selected for evaluation were the Log Predictive Probability (LPP) (Hoffman et al., 2013) and the Normalized Pointwise Mutual Information (NPMI) (Lau et al., 2014). These methods measure the model generalization and the coherence of the topics, respectively. GCTM is evaluated in two ways: utilizing Word2Vec (GCTM-W2V) and the knowledge from the Wordnet graph (GCTM-WN). GCTM-WN and GCTM-W2V outperform the competitors in LPP across all the datasets, even in the presence of concept drift. The authors also performed an ablation study.
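
NPMI-based topic coherence can be sketched from document co-occurrence counts. The example assumes that probabilities are estimated as document frequencies over a reference corpus and that a topic is scored by averaging NPMI over all pairs of its top words; the handling of the degenerate cases is a convention chosen here, and the tiny corpus is hypothetical.

```python
from itertools import combinations
from math import log


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of top words (requires at least two words)."""
    docs = [set(doc) for doc in documents]
    n_docs = len(docs)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in doc for w in words) for doc in docs) / n_docs

    scores = []
    for w_i, w_j in combinations(top_words, 2):
        p_ij = prob(w_i, w_j)
        if p_ij == 0.0:
            scores.append(-1.0)  # words never co-occur (limit value)
        elif p_ij == 1.0:
            scores.append(0.0)   # 0/0 case, treated as 0 here by assumption
        else:
            pmi = log(p_ij / (prob(w_i) * prob(w_j)))
            scores.append(pmi / (-log(p_ij)))
    return sum(scores) / len(scores)


corpus = [["a", "b"], ["a", "b"], ["c"], ["d"]]
print(npmi_coherence(["a", "b"], corpus))  # words that always co-occur
print(npmi_coherence(["a", "c"], corpus))  # words that never co-occur
```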

Nguyen et al. (2022) proposed an LDA-based topic modeling approach with mechanisms for balancing stability and plasticity, namely BSP. Stability-plasticity is a dilemma involving maintaining old knowledge (stability) and learning new knowledge (plasticity) (Nguyen et al., 2022; Gama et al., 2014). Balancing both prevents performance degradation under concept drift as well as catastrophic forgetting (Nguyen et al., 2022). The authors used TPS and iDropout combined into an LDA-based topic modeling method. TPS (Tran et al., 2021) aids the model with external knowledge, i.e., Word2Vec representations. iDropout (Nguyen et al., 2019) creates variables $\beta^{t}$, updated whenever a new mini-batch is inputted. Because both are different mechanisms, the authors modified the calculation of $\beta$ to comprise information from both mechanisms. They performed experiments on eight datasets: one long, two regular, and five short-text. The authors compared their method to six different approaches. The hyperparameters were selected using a grid search. Similar to Van Linh et al. (2022), the authors contrasted LPP and NPMI. The authors tested using the datasets shuffled and ordered chronologically whenever possible. Their method achieved the best values for LPP in four out of six datasets tested. It is worth noting that their method achieves satisfactory results very rapidly at the highest levels. Their method maintains high levels of performance while using chronological datasets. The authors tested the stability and plasticity by simulating drifts by sorting the topics in order of classes, similarly to Van Linh et al. (2022). BSP could reach the best values when testing for catastrophic forgetting and maintained the highest levels when recovering from concept drift. As in Van Linh et al. (2022), the authors performed an ablation study to understand the impact of some parameters.

4.6 Text Representation Methods

This subsection describes the text representation methods used in the selected papers. Fig. 9 depicts the three main categories: (i) Embedding-based methods, such as Word2Vec and BERT; (ii) Frequency-based methods, which include Bag-of-Words, TF-IDF; and (iii) Words.

Figure 9: Categories of text representation methods used in the papers.

Table 7 lists the text representation methods used across the papers. Seven approaches were categorized as frequency-based, six as embedding, and one as words. Two papers have not provided the text representation method, while one used file compression, which cannot be directly classified among the categories but could fit under words because the compression is performed over a file containing a set of words. de Mello et al. (2018) used file compression and calculated text similarity by using a formula that considers the size of the zipped file containing the two texts and the sizes of the zipped files containing each of the given texts separately. As mentioned, this method is named NCD (Cilibrasi and Vitányi, 2005). Although it appears more natural for files and images, NCD is also used for texts in some works. For example, NCD is listed as a similarity metric for structured data (Ontañón, 2020) and for texts (Pradhan et al., 2015), even in the presence of noise (Cebrián et al., 2007). At first sight, it may appear unreasonable because two different files containing distinct texts may result in similar file sizes. However, according to Ontañón (2020), if two given files are similar, compressing them together results in a file size close to that of compressing only one. Thus, the NCD calculation utilizes this aspect to determine the similarity between two files containing raw text.
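
The NCD formula can be sketched in a few lines. The example below uses zlib from the Python standard library as a stand-in compressor, which is an assumption: the compressor used by de Mello et al. (2018) may differ, and the example texts are illustrative.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).

    Values near 0 suggest similar texts; values near 1 suggest dissimilar ones.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


text_a = "the quick brown fox jumps over the lazy dog " * 20
text_b = "concept drift adaptation in text stream mining settings " * 20
print(ncd(text_a, text_a), ncd(text_a, text_b))
```

Compressing a text together with a copy of itself adds little to the compressed size, so the distance of a text to itself is much smaller than its distance to an unrelated text.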

Table 7: Text representation used in the studied papers.
| Text representation method | Papers | Category |
|---|---|---|
| Bag-of-words (Harris, 1954) | Li et al., 2018; Melidis et al., 2018; Chamby-Diaz et al., 2019; D’Andrea et al., 2019; de Moraes and Gradvohl, 2021; Bechini et al., 2021; Sun et al., 2021; Rakib et al., 2021; Heusinger et al., 2022; Bondielli et al., 2022; Assenmacher and Trautmann, 2022 | Frequency-based |
| BERT (Devlin et al., 2018) | Bechini et al., 2021; Amba Hombaiah et al., 2021; Vo, 2022; Periti et al., 2022; Li et al., 2022; Susi and Shanthi, 2023 | Embedding |
| Bigram | Rakib et al., 2021; Assenmacher and Trautmann, 2022 | Frequency-based |
| Biterm | Hu et al., 2018; Rakib et al., 2021 | Frequency-based |
| Co-occurrences | Yang et al., 2021 | Frequency-based |
| Doc2Vec (Le and Mikolov, 2014) | Vo, 2022; Periti et al., 2022 | Embedding |
| FastText (Bojanowski et al., 2016) | D’Andrea et al., 2019 | Embedding |
| GloVe (Pennington et al., 2014) | Suprem and Pu, 2019a; D’Andrea et al., 2019; Rakib et al., 2021; Nguyen et al., 2022; Kolajo et al., 2022 | Embedding |
| Graph-of-words | Yang et al., 2021; Van Linh et al., 2022; Vo, 2022 | - |
| Incremental Word Context | Bravo-Marquez et al., 2022 | Frequency-based |
| Sent2Vec (Moghadasi and Zhuang, 2020) | Kolajo et al., 2022 | Embedding |
| TF-IDF (Salton and Buckley, 1988) | Pohl et al., 2018; Li et al., 2018; Heusinger et al., 2020a; Bechini et al., 2021; Heusinger et al., 2022; Bondielli et al., 2022; Fenza et al., 2023; Rabiu et al., 2023; Assenmacher and Trautmann, 2022 | Frequency-based |
| Word2Vec (Mikolov et al., 2013b) | He et al., 2018; Suprem and Pu, 2019a; Suprem and Pu, 2019b; Wang et al., 2019; D’Andrea et al., 2019; Heusinger et al., 2020a; Van Linh et al., 2022; Nguyen et al., 2022; Heusinger et al., 2022; Vo, 2022; Li et al., 2022 | Embedding |
| Word frequency | Yin et al., 2018; Yang et al., 2021 | Frequency-based |
| Words | Murena et al., 2018; Abid et al., 2018; Hammer and Yazidi, 2018; Abid et al., 2019 | Words |

Regarding the representations, several papers used more than one method, sometimes combined, e.g., Bag-of-words + TF-IDF. In such cases, each method is listed separately in Table 7. In addition, Word2Vec and Bag-of-words (BOW) were used in ten papers and TF-IDF in seven papers. Finally, words are used directly in five papers as a representation method. We briefly describe the methods as follows, considering the chronological order of each method.

4.6.1 Bigram

Bigram adheres to the Bag-of-words concept, in which it is possible to organize texts in two dimensions: columns as words, and rows corresponding to documents. The cells contain the count of a given word in a specific document. The difference is that each column represents a pair of consecutive words instead of a single word. For example, the sentence “he has been here” will generate three columns: (he, has), (has, been), and (been, here). The challenge of using bag-of-words in streaming scenarios also applies to bigrams, i.e., the dimensions correspond to a fixed set of word pairs and do not evolve. Rakib et al. (2021) used three representation methods while testing their proposed method for short-text stream clustering: unigram, i.e., bag-of-words, bigram, and biterm. Assenmacher and Trautmann (2022) also used both unigram and bigram for text representations.
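The bigram construction described above can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and lowercasing; the function names are hypothetical and not taken from any of the selected papers.

```python
from collections import Counter
from typing import List, Tuple


def bigrams(text: str) -> List[Tuple[str, str]]:
    """Pairs of consecutive tokens, in sentence order."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def bigram_counts(documents: List[str]) -> List[Counter]:
    """One row per document: bigram -> count (the cells of the table)."""
    return [Counter(bigrams(doc)) for doc in documents]


if __name__ == "__main__":
    print(bigrams("he has been here"))
    # [('he', 'has'), ('has', 'been'), ('been', 'here')]
```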

4.6.2 Biterm

According to Hu et al. (2018), a biterm corresponds to an unordered word-pair co-occurrence. Furthermore, Hu et al. (2018) highlighted that biterms are sparser than regular bag-of-words, and utilized external sources to reduce the sparseness. Considering the biterm definition and using the same example as for bigrams, the biterms generated from the sentence “he has been here” would be (he, has), (he, been), (he, here), (has, been), (has, here), and (been, here). In the text stream scenario, biterms face challenges similar to those of bag-of-words and bigrams. To overcome this, Hu et al. (2018) developed an ensemble based on base learners trained using data chunks, each with its biterm topic model. Rakib et al. (2021) also used biterms as text representations. To evaluate their short-text stream clustering method, the authors used biterms, unigrams, and bigrams. Biterms performed better than bigrams and unigrams, considering normalized mutual information (NMI) values.
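For comparison with bigrams, the sketch below enumerates the biterms of a short text. It assumes whitespace tokenization and treats the whole text as a single context, which is reasonable only for short texts; the function name is illustrative.

```python
from itertools import combinations
from typing import List, Tuple


def biterms(text: str) -> List[Tuple[str, str]]:
    """All word pairs co-occurring in the text, regardless of adjacency."""
    tokens = text.lower().split()
    # combinations() keeps sentence order inside each pair; because biterms
    # are unordered, a real implementation may canonicalize each pair.
    return list(combinations(tokens, 2))


if __name__ == "__main__":
    print(biterms("he has been here"))
    # [('he', 'has'), ('he', 'been'), ('he', 'here'),
    #  ('has', 'been'), ('has', 'here'), ('been', 'here')]
```

A text with n tokens yields n(n−1)/2 biterms, compared with n−1 bigrams.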

4.6.3 Co-occurrences

Co-occurrences count simultaneous occurrences of two particular words. Yang et al. (2021) developed a bi-weighted word relation network that considers both the co-occurrences and the word frequencies. Although co-occurrences and word frequencies are not representations, we opted to include them as a single representation because they will be part of a graph, i.e., graph-of-words, which is an actual representation.

4.6.4 Graph-of-words

Graph-of-words (GOW) is a textual representation that transforms documents into graph-based structures (Vo, 2022). It can maintain long-term relationships between words, according to the author. After generating the graphs regarding specific documents, frequent subgraph mining techniques are applied, and later the mined frequent subgraphs are used as feature representations. In Vo (2022), GOW has two parameters: sliding window and minimum support. GOW appears to have the capability of being updated in real-time. However, its use with a pre-trained Word2Vec model (that can be outdated after an arbitrary period) makes the process not fully incremental. Although they did not use the terminology graph-of-words, Yang et al. (2021) developed a corpus-level word relation network, namely EWNStream+, which retains the co-occurrence counts and word frequencies. According to the authors, EWNStream+ is updated incrementally as data batches arrive. Van Linh et al. (2022) proposed a novel graph convolutional topic model (GCTM) based on graph convolutional networks and LDA. The initial graph is formed using words and their relations. GCTM is tested using Word2Vec representations and WordNet. GCTM does not support incremental training, implying that the text models can become obsolete.
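The sketch below illustrates how a word graph with co-occurrence counts and word frequencies can be updated one document at a time. It is not the GOW of Vo (2022), which additionally mines frequent subgraphs, nor EWNStream+; the class name, the whitespace tokenization, and the interpretation of the window as the number of following tokens linked to each word are assumptions made for illustration.

```python
from collections import Counter


class WordGraph:
    """Undirected word graph updated incrementally (illustrative sketch)."""

    def __init__(self, window: int = 2) -> None:
        self.window = window           # assumed: number of following tokens linked
        self.node_freq = Counter()     # word -> frequency
        self.edge_weight = Counter()   # (word_a, word_b) -> co-occurrence count

    def update(self, text: str) -> None:
        """Add one document to the graph without rebuilding it."""
        tokens = text.lower().split()
        self.node_freq.update(tokens)
        for i, word in enumerate(tokens):
            for other in tokens[i + 1:i + 1 + self.window]:
                if other != word:
                    # Sort the pair so that (a, b) and (b, a) share one edge.
                    self.edge_weight[tuple(sorted((word, other)))] += 1


if __name__ == "__main__":
    graph = WordGraph(window=1)
    for doc in ["he has been here", "he has left"]:
        graph.update(doc)
    print(graph.edge_weight.most_common(1))  # [(('has', 'he'), 2)]
```

Because each update only increments counters, a single new text can be absorbed without reprocessing earlier ones.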

4.6.5 Word frequency

Word frequency is the word count. Yang et al. (2021) included word frequency as part of their word relation network, which also considers the word co-occurrences. This representation is also used to determine whether a word is outdated in the graph representation.

4.6.6 Words

Several papers chose to use the words themselves rather than any text representation. Murena et al. (2018) presented AWILDA, an LDA-based method integrated with ADWIN for topic drift detection. The authors used the words lowercased and stemmed. Abid et al. (2018) and Abid et al. (2019) described AIS-Clus, an incremental clustering method. Initially, the authors used DBSCAN (Ester et al., 1996) to generate the clusters, and then sketches were developed to summarize each cluster. The sketches contain lists of words and outliers present in a cluster. Hammer and Yazidi (2018) presented a method to handle concept drift in an abruptly changing environment. The authors used words to monitor probabilities in topics. Considering the updating scheme, words could easily be added or removed from sketches. Therefore, we consider it possible to use words in streaming scenarios, although maintaining a list of words in every sketch, as demonstrated in Murena et al. (2018) and Abid et al. (2019), can become complex and time-consuming if the lists are not bounded to respect the constraints of data stream environments.

4.6.7 Bag-of-words

Bag-of-words (Harris, 1954) is probably one of the simplest methods for vectorizing texts, as it divides the text into tokens or words. In a table of rows and columns, the tokens function as columns while each row represents a text, such as a tweet. Each cell contains the count of the token of that column in the text of that row. An evident characteristic is that a unigram bag-of-words representation does not preserve the order of words, information that could be leveraged in some applications. In streaming scenarios, bag-of-words can prevent ML methods from performing properly. For example, if the bag-of-words representation is regenerated whenever a new text arrives, the number of columns may increase, and most ML methods cannot handle inputs whose dimensions change. Furthermore, even if the process runs in batches, the vocabulary may change. If the first batch defines the words used in the representation, new words appearing later, i.e., new dimensions, are not recognized.
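The limitation of a vocabulary fixed by the first batch can be illustrated with the following minimal sketch. It assumes whitespace tokenization, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def fit_vocabulary(batch: List[str]) -> Dict[str, int]:
    """Map each word seen in the first batch to a fixed column index."""
    words = sorted({tok for text in batch for tok in text.lower().split()})
    return {word: idx for idx, word in enumerate(words)}


def vectorize(text: str, vocab: Dict[str, int]) -> Tuple[List[int], List[str]]:
    """Count known words; words outside the vocabulary have no column."""
    vector = [0] * len(vocab)
    unseen = []
    for tok in text.lower().split():
        if tok in vocab:
            vector[vocab[tok]] += 1
        else:
            unseen.append(tok)
    return vector, unseen


if __name__ == "__main__":
    vocab = fit_vocabulary(["the game was great", "great game"])
    vector, unseen = vectorize("the new game was delayed", vocab)
    print(vector)  # [1, 0, 1, 1] for columns game, great, the, was
    print(unseen)  # ['new', 'delayed'] are silently lost
```

Keeping the vector length fixed preserves compatibility with the downstream model, but at the cost of ignoring every word that emerges after the first batch.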

4.6.8 TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistic from the information retrieval area used for determining the importance of words to a document or a set of documents (Salton and Buckley, 1988). The calculation considers the frequency of a term and the inverse document frequency, which defines how informative a term is across several documents. Generally, in the stream setting, TF-IDF is used to encode data batches. It is worth noting that the term frequency calculation is remarkably similar to the bag-of-words procedure. Thus, bag-of-words is commonly used together with TF-IDF. Pohl et al. (2018), Bechini et al. (2021), and Bondielli et al. (2022) used TF-IDF after obtaining a data batch to encode the terms and the texts from the stream. Li et al. (2018) utilized TF-IDF to generate the vector representations from the data batches so that a base learner could be trained to be incorporated into the ensemble. Heusinger et al. (2020a) and Heusinger et al. (2022) performed TF-IDF in an offline mode to generate a very high-dimensional vector so that they could test their dimensionality reduction strategy. Mohawesh et al. (2021) executed TF-IDF before all the processing. Later, the authors employed PCA to reduce the dimensionality of the datasets by selecting the 10,000 most meaningful components. Since TF-IDF works together with bag-of-words, it cannot be updated incrementally because the dimensions change. Assenmacher and Trautmann (2022) used TF-IDF to decide the proximity of incoming text to existing microclusters in the online phase. This calculation is also used in the offline phase, particularly when evaluating the merging of existing clusters. Fenza et al. (2023) used TF-IDF representations to generate the fuzzy lattice structure. Rabiu et al. (2023) leveraged TF-IDF to compute the input vectors to train base learners. An interesting aspect regards preprocessing in Rabiu et al. (2023): the authors utilized Stanford CoreNLP (Manning et al., 2014) for word segmentation, part-of-speech tagging, and stemming. The authors used only the first three tags of noun, verb, and adjective. According to the authors, these tags “carry the most valuable information regarding reviewed items”. However, no evidence is provided.
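As an illustration of batch-wise TF-IDF encoding, the sketch below uses one common variant, weight(t, d) = tf(t, d) · log(N / df(t)), where N is the number of documents in the batch and df(t) is the number of documents containing term t. The selected papers may rely on other variants (e.g., smoothed or normalized), so this is an assumption rather than the formulation of any specific paper.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(batch: List[str]) -> List[Dict[str, float]]:
    """TF-IDF weights for every document of a single data batch."""
    docs = [Counter(text.lower().split()) for text in batch]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())  # each document counts once per term
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        for counts in docs
    ]


if __name__ == "__main__":
    weights = tf_idf(["drift drift stream", "stream mining"])
    print(weights[0])  # "stream" appears in every document, so its weight is 0
```

Because both N and df(t) depend on the batch, the weights computed for one batch are not directly comparable with those of a later batch, which is one reason the representation is usually recomputed rather than updated.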

4.6.9 Word2Vec

Word2Vec (Mikolov et al., 2013a) corresponds to two distinct model architectures for learning distributed representations: Continuous Bag-of-words (CBoW) and Skip-gram. Both are neural network architectures, where the number of neurons is the same in the input and output layers, and the single hidden layer corresponds to the embedding size. Each neuron in the input and output layers corresponds to a word in the vocabulary. The representations, after training, are often obtained by taking the connection weights between a neuron (representing a word) in the input layer and the hidden layer. The difference between CBoW and Skip-gram is the training objective: CBoW aims at predicting a specific word given its surrounding words, whereas Skip-gram does the opposite, i.e., it predicts the surrounding words given the word in the middle (Mikolov et al., 2013a). The papers that utilize Word2Vec use it for text representation only. Li et al. (2022) leveraged Word2Vec to reduce data sparsity. The authors developed their method for short-text classification, and one of the general approaches for this problem is to enrich the data. The authors evaluated both Word2Vec and BERT for the short-text representation, which was later fed to a CNN to extract higher-level feature information. Although Word2Vec is a neural architecture, incremental versions are available through gensim (https://radimrehurek.com/gensim/) (Rehurek and Sojka, 2011) or other methods in the literature (Kaji and Kobayashi, 2017; May et al., 2017; Iturra-Bocaz and Bravo-Marquez, 2023).
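The two training objectives differ only in the direction of prediction: CBoW predicts a word from its surrounding words, and Skip-gram predicts the surrounding words from the word in the middle. The sketch below generates the training examples each architecture would see; it covers only example generation, not the neural network itself, and the function names and the symmetric window are assumptions.

```python
from typing import Iterator, List, Tuple


def context_windows(tokens: List[str], window: int = 1) -> Iterator[Tuple[str, List[str]]]:
    """Yield each word together with its surrounding words."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def cbow_examples(tokens: List[str], window: int = 1) -> List[Tuple[Tuple[str, ...], str]]:
    """CBoW: (surrounding words) -> word in the middle."""
    return [(tuple(ctx), center) for center, ctx in context_windows(tokens, window)]


def skipgram_examples(tokens: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """Skip-gram: word in the middle -> one surrounding word per example."""
    return [(center, c) for center, ctx in context_windows(tokens, window) for c in ctx]


if __name__ == "__main__":
    sentence = "he has been here".split()
    print(cbow_examples(sentence))
    print(skipgram_examples(sentence))
```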

4.6.10 Doc2Vec

Le and Mikolov (2014) proposed Doc2Vec to represent documents as distributed vectors. Doc2Vec is a generalization of Word2Vec. Similarly to Word2Vec, Doc2Vec comprises two architectures: Paragraph Vector - Distributed Memory (PV-DM) and Distributed bag-of-words version of Paragraph Vector (PV-DBOW). In PV-DM, the document vectors are trained together with the word vectors, while in PV-DBOW, the aim is to predict the words of a document from a document id. Periti et al. (2022) used a Doc2Vec model trained with the CCOHA and LatinISE datasets. The model is not updated during the process and may become obsolete as time passes. In Vo (2022), it is unclear whether the author utilized a pre-trained model, trained a model from scratch, or if the model was updated over time. Since Doc2Vec is a neural architecture, training and updating it can be computationally costly.

4.6.11 GloVe

Global Vectors (GloVe) is a method for generating co-occurrence-based word vector representations (Pennington et al., 2014). According to the authors, GloVe utilizes global matrix factorization and local context window methods. The method is trained in a batch manner. D’Andrea et al. (2019), Rakib et al. (2021), and Nguyen et al. (2022) used GloVe for semantic representation by using pre-trained models. Kolajo et al. (2022) claimed that GloVe is used for feature extraction. Suprem and Pu (2019a) mentioned that the proposed system, i.e., Adaptive Social Sensor Event Detection (ASSED), supports GloVe. The authors in the original paper (Pennington et al., 2014) did not describe any incremental or adaptive training. Therefore, the vector representations can become outdated over time, constituting a potential disadvantage in streaming scenarios.

4.6.12 FastText

FastText, presented in Bojanowski et al. (2016), extends Skip-gram, one of the Word2Vec architectures. Instead of considering only entire words, FastText considers subword partitions using character n-gram vectors. Using an example from the original paper, encoding the word where in a 3-gram fashion, with < and > marking the word boundaries, results in five n-grams: (<wh, whe, her, ere, re>). In addition, the approach incorporates the whole word, <where>, as a special sequence. This method of splitting words into n-grams helps the model handle words unseen in the training step, also named out-of-vocabulary (OOV) words. An incremental update method is not mentioned in the paper. D’Andrea et al. (2019) utilized a pre-trained FastText model (Bojanowski et al., 2016) as a text encoding method. In addition, FastText is used statically, implying that no method is presented in D’Andrea et al. (2019) for the incremental update of the text representations. However, D’Andrea et al. (2019) concatenated FastText representations to bag-of-words representations, generated in each step of an incremental procedure of accumulating data from past events.
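The subword decomposition can be sketched as below. The sketch assumes a single n-gram size and the boundary markers < and > from the original paper; the actual FastText implementation combines several n-gram sizes and hashes the n-grams, which is omitted here, and the function name is illustrative.

```python
from typing import List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Character n-grams of a word wrapped in boundary markers, plus the whole word."""
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]  # the full word is kept as a special sequence


if __name__ == "__main__":
    print(char_ngrams("where"))
    # ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

Because the vector of a word is built from the vectors of its n-grams, an out-of-vocabulary word still receives a representation as long as some of its n-grams were seen during training.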

4.6.13 BERT

Bidirectional Encoder Representations from Transformers (BERT) is a multi-purpose language model that enables several natural language processing tasks (Devlin et al., 2018), such as sentiment analysis, sequence-to-sequence, paraphrasing, and question answering. In addition, BERT can provide vector representations of text to be used in a particular downstream task. Bechini et al. (2021) used an Italian version of the pre-trained BERT model, i.e., AlBERTo (Polignano et al., 2019), for measuring semantic similarity between tweets. In Amba Hombaiah et al. (2021), BERT is the primary model. The authors tested different sampling methods for fine-tuning to pursue an incremental update of the model. BERT is also used as a text encoding method in Vo (2022), where the author enhanced short text clustering by combining the representations of a pre-trained BERT with a BiLSTM and a graph-of-words representation. Periti et al. (2022) used BERT for word representation generation in both English and Latin by using pre-trained models. Susi and Shanthi (2023) leveraged BERT and its variations at two stages. First, the authors used a pre-trained RoBERTa model (Liu et al., 2019) specifically suited for sentiment classification. The RoBERTa model enables automated training data generation. Second, another BERT model is fine-tuned in the system whenever a sentiment drift happens. Li et al. (2022) used BERT to enrich short texts. Short texts are very sparse, and, according to the authors, using embeddings may improve the representation quality. Among these papers, Amba Hombaiah et al. (2021) and Susi and Shanthi (2023) update the model over time. The update is achieved through fine-tuning strategies, which can enable the use of BERT in streaming scenarios, although fine-tuning may also become a bottleneck in the process.

4.6.14 Sent2Vec

Moghadasi and Zhuang (2020) proposed a sentence embedding method that considers the sentiment score behind the sentence. Kolajo et al. (2022) used the Sent2Vec embeddings to compute the semantic representation of the input texts and then cluster these texts. If a new tweet is different from the histograms of the clusters, a concept drift is deemed to have occurred, and a new cluster is created for it. Kolajo et al. (2022) did not describe an updating scheme. Thus, the Sent2Vec model can become obsolete over time, necessitating retraining.

4.6.15 Incremental Word Context

Bravo-Marquez et al. (2022) proposed a vector representation method for texts that can be considered a table-like representation, in which both the columns and the rows correspond to words. However, the columns (called contexts in the original paper) and the rows (called vocabulary) can have different sizes. The number of contexts defines the dimension size of the vector representation. The authors calculated the positive pointwise mutual information (PPMI) in each cell, considering the words and their co-occurrences. Although the vocabulary (rows) can be updated, if the contexts are fixed, as in bag-of-words, the representation may become obsolete once the context words become less frequent or stop appearing. Furthermore, if certain context words are exchanged with other words, the changed dimensions will not represent the same contexts, and this will be reflected in an ML model dependent on vector inputs.
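A word-by-context table with PPMI cells can be sketched as follows, with PPMI(w, c) = max(0, log(p(w, c) / (p(w) p(c)))) estimated from counts. This is not the implementation of Bravo-Marquez et al. (2022); the class name, the fixed list of contexts, the symmetric window, and the whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Iterable, List


class PPMITable:
    """Word-by-context counts with PPMI weighting (illustrative sketch)."""

    def __init__(self, contexts: Iterable[str], window: int = 2) -> None:
        self.contexts = list(contexts)         # fixed context words (columns)
        self._context_set = set(self.contexts)
        self.window = window
        self.pair_counts = Counter()           # (word, context) -> count
        self.word_counts = Counter()           # word -> pairs it took part in
        self.context_counts = Counter()        # context -> pairs it took part in
        self.total = 0                         # total number of counted pairs

    def update(self, text: str) -> None:
        """Update the counts with a single text, i.e., per instance."""
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            neighbours = tokens[max(0, i - self.window):i] + tokens[i + 1:i + 1 + self.window]
            for ctx in neighbours:
                if ctx in self._context_set:
                    self.pair_counts[(word, ctx)] += 1
                    self.word_counts[word] += 1
                    self.context_counts[ctx] += 1
                    self.total += 1

    def ppmi(self, word: str, ctx: str) -> float:
        joint = self.pair_counts[(word, ctx)]
        if joint == 0:
            return 0.0
        pmi = math.log(joint * self.total / (self.word_counts[word] * self.context_counts[ctx]))
        return max(0.0, pmi)

    def vector(self, word: str) -> List[float]:
        """One dimension per context word, so the size never changes."""
        return [self.ppmi(word, ctx) for ctx in self.contexts]


if __name__ == "__main__":
    table = PPMITable(contexts=["drift", "stream"], window=1)
    for doc in ["concept drift", "text stream", "data stream"]:
        table.update(doc)
    print(table.vector("concept"))  # non-zero only in the "drift" dimension
```

New vocabulary words obtain vectors as soon as they co-occur with a context word, whereas a context word that stops appearing keeps occupying a dimension that gradually carries no information.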

In this subsection, we analyzed the text representation methods as used in the selected papers. We do not extrapolate these analyses to other versions of the methods. Thus, when we state that a particular method works only in batches, the conclusion does not extend to incremental or adaptive versions of that method, when available.

4.7 Updating Mechanism of Text Representation Methods

We also consider the updating mechanism of the text representation methods. Observing how the text representation behaves over time in text stream scenarios is critical. Because of stream characteristics, i.e., fast and potentially infinite, a static text model is a problem. It is even more severe in text stream scenarios under concept drift because a representation vector may become obsolete, losing quality and, thus, negatively impacting the stream mining task. Therefore, we also obtained information on the text representation updating method. Fig. 10 depicts the organization regarding the updating scheme of text representation methods. We organized the methods into two categories: incremental and non-incremental. In Incremental, we consider that the representation method can be updated over time, whether in batches or instances. In Non-Incremental, we assume that the text representation method is either static during the entire process or requires complete retraining to be updated.

Figure 10: Categories of mechanisms for text representation updating found in the selected papers.

4.7.1 Incremental

We list text representation methods with incremental update capabilities, organized by whether they are updated in windows/batches or in instances. An update in windows/batches indicates that the text representation method requires a new amount of data to either be worth updating or satisfy a specific constraint of the text representation method. The uses of BERT in Amba Hombaiah et al. (2021) and Susi and Shanthi (2023) are examples of this category. In Amba Hombaiah et al. (2021), the BERT model is fine-tuned using texts selected by the sampling methods proposed by the authors. In Susi and Shanthi (2023), the fine-tuning is performed through an updated training set. The training set is updated whenever a sentiment drift is deemed to have occurred.

For incremental methods that can be updated in instances, it is unnecessary to accumulate data to update the text representation method: a single piece of information can be used for that. An example is Incremental Word Context (Bravo-Marquez et al., 2022). Furthermore, given a single new input, the Graph-of-Words (Yang et al., 2021) can be updated in real time.

4.7.2 Non-Incremental

The text representation methods that do not allow any update and must be retrained from scratch are bag-of-words, bigrams, biterm, and TF-IDF. In addition, a few text representation methods are kept static while in use in the text streams: FastText, Doc2Vec, and GloVe. Most are used as pre-trained models, and they can become obsolete after some time, demanding complete retraining to maintain the performance of the dependent ML model. Works such as Heusinger et al. (2022), Vo (2022), and Li et al. (2022) also leveraged static BERT and Word2Vec models.

5 Datasets

We also included a list of real-world datasets to which the methods for stream mining tasks from the selected papers were applied. The synthetic datasets were excluded since they are generally numeric or contain a sequence of unrecognizable topics. Considering Table 8, several datasets were used; however, most appeared in only one paper. In addition, some papers that shared datasets in common frequently shared authors (or co-authors) or the task, e.g., short-text classification and topic modeling. All the links in the column Information / Access were verified on 24th May 2023. In addition, some datasets are flagged as obtained by the authors. It can mean that the authors collected the datasets, manually or through APIs, but the datasets are not publicly available for download.

Table 8: List of datasets used in the papers and their respective resources, when available.
Dataset | Papers | Information / Access | Stream Mining Tasks
20NewsGroup | (Rabiu et al., 2022); (Vo, 2022); (Rabiu et al., 2023) | http://qwone.com/~jason/20Newsgroups/ | Short-text clustering, Classification
Arxiv | (Lu et al., 2022) | https://www.kaggle.com/datasets/Cornell-University/arxiv | Classification
CrisisLexT26 | (Pohl et al., 2018) | obtained from https://archive.org/details/twitterstream (Olteanu et al., 2015) | Crisis management
EmailingList | (Melidis et al., 2018); (de Moraes and Gradvohl, 2021) | http://mlkd.csd.auth.gr/datasets.html | Classification
EveTAR | (Abid et al., 2019) | http://qufaculty.qu.edu.qa/telsayed/evetar | Event detection
Guardian, The | (Wang et al., 2019) | obtained by the authors | Classification
Irish Times, The | (Van Linh et al., 2022); (Nguyen et al., 2022) | https://www.kaggle.com/datasets/therohk/ireland-historical-news | Topic modeling
NELA-GT-2018 | (Fenza et al., 2023) | https://doi.org/10.7910/DVN/ULHLCB (Nørregaard et al., 2019) | Classification
NELA-GT-2019 | (Fenza et al., 2023) | https://doi.org/10.7910/DVN/O7FWPO (Gruppi et al., 2020) | Classification
NELA-GT-2020 | (Fenza et al., 2023) | https://doi.org/10.7910/DVN/CHMUYZ (Gruppi et al., 2021) | Classification
New York Times, The | (Wang et al., 2019) | obtained by the authors | Classification
New York Times, The | (He et al., 2018) | https://ir-datasets.com/nyt.html | Classification
New York Times, The | (Van Linh et al., 2022) | http://archive.ics.uci.edu/ml/datasets/Bag+of+Words | Topic modeling
New York Times, The | (Lu et al., 2022) | https://www.dropbox.com/s/nifi5nj1oj0fu2i/data.zip?dl=0 | Classification
NOAA | (Suprem and Pu, 2019a); (Suprem et al., 2019b); (Suprem and Pu, 2019b) | not provided but probably from https://data.noaa.gov/dataset/ | Event detection
NSDQ | (Heusinger et al., 2020a); (Heusinger et al., 2022) | https://github.com/ChristophRaab/NASDAQ-Dataset | Classification
OffensEval | (Amba Hombaiah et al., 2021) | https://competitions.codalab.org/competitions/20011 | Classification
RCV1 | (He et al., 2018) | https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_rcv1.html | Classification
Reuters-21578 | (Murena et al., 2018) | https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/ | Topic modeling
SemEval2020 - Subtask 2 (CCOHA) | (Periti et al., 2022) | https://www.english-corpora.org/coha/ (Alatrash et al., 2020; Schlechtweg et al., 2020) | Semantic Shift Detection
SemEval2020 - Subtask 2 (LatinISE) | (Periti et al., 2022) | https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2506 (McGillivray and Kilgarriff, 2013; Schlechtweg et al., 2020) | Semantic Shift Detection
SO-T | (Rakib et al., 2021) | obtained by the authors | Short-text clustering
SpamAssassin | (de Moraes and Gradvohl, 2021) | http://mlkd.csd.auth.gr/datasets.html | Classification
SpamData | (Melidis et al., 2018); (Chamby-Diaz et al., 2019); (de Moraes and Gradvohl, 2021) | http://mlkd.csd.auth.gr/datasets.html | Classification
Ts-T, Tw, Tw-T, Tweets, Tweets-T | (Yin et al., 2018); (Rakib et al., 2021); (Vo, 2022); (Assenmacher and Trautmann, 2022) | https://trec.nist.gov/data/microblog.html | Short-text clustering
Tweets, TweetSet, Twitter | (Li et al., 2018) | obtained by the authors | Short-text classification, Event detection
Tweets, TweetSet, Twitter | (Sun et al., 2021) | obtained by the authors | Short-text classification, Event detection
Tweets, TweetSet, Twitter | (Kolajo et al., 2022) | obtained by the authors | Short-text classification, Event detection
Tweets, TweetSet, Twitter | (D’Andrea et al., 2019) | obtained by the authors | Stance detection
Tweets, TweetSet, Twitter | (Bechini et al., 2021) | obtained by the authors | Stance detection
Tweets, TweetSet, Twitter | (Bondielli et al., 2022) | obtained by the authors | Stance detection
Tweets, TweetSet, Twitter | (Amba Hombaiah et al., 2021) | https://archive.org/details/twitterstream | Classification
Tweets, TweetSet, Twitter | (Yang et al., 2021) | obtained by the authors | Short-text clustering
Tweets, TweetSet, Twitter | (de Mello et al., 2018) | obtained by the authors | Concept drift detection
Tweets, TweetSet, Twitter | (Li et al., 2022) | (Wang et al., 2014) | Short-text classification
Tweets, TweetSet, Twitter | (Assenmacher and Trautmann, 2022) | obtained by the authors | Short-text clustering
Tweets, TweetSet, Twitter | (Susi and Shanthi, 2023) | obtained by the authors | Sentiment drift detection
TwitterSentiment | (Melidis et al., 2018) | https://bit.ly/twitter-sentiment-link | Classification
UCI News | (Nguyen et al., 2022) | https://www.kaggle.com/datasets/uciml/news-aggregator-dataset | Topic modeling
Usenet1 | (de Moraes and Gradvohl, 2021) | http://mlkd.csd.auth.gr/datasets.html | Classification
Usenet2 | (de Moraes and Gradvohl, 2021) | http://mlkd.csd.auth.gr/datasets.html | Classification
USGS | (Suprem and Pu, 2019a); (Suprem et al., 2019b); (Suprem and Pu, 2019b) | not provided but probably from https://www.usgs.gov/products/data | Event detection
vg.no | (Hammer and Yazidi, 2018) | obtained by the authors | Classification
Yelp datasets | (Mohawesh et al., 2021) | https://www.yelp.com/dataset | Fake reviews detection

Regarding the datasets listed in Table 8, some may share the same name, such as Twitter and New York Times. However, it was impossible to assert that they are the same dataset. Thus, we added a new line in the table instead of aggregating data regarding a particular dataset. In addition, at least three APIs were referred to as sources for data collection: Twitter (https://developer.twitter.com/en/docs/twitter-api), The Guardian (https://open-platform.theguardian.com/), and The New York Times (https://developer.nytimes.com/apis). Since the queries can cover different date ranges and keywords, datasets with the same name may correspond to different data.

5.1 Datasets description

Below we provide short descriptions of each dataset listed in Table 8. We highlight that some datasets include raw texts, while a few contain the bag-of-words representation of texts, i.e., preprocessed texts.

5.1.1 20NewsGroup

This dataset contains approximately 20,000 newsgroup documents across 20 groups. The link provided in this paper offers three versions of this dataset, with slight variations.

5.1.2 Arxiv

According to Lu et al. (2022), this dataset contains approximately 2 million abstracts of papers published between 2007 and 2021.

5.1.3 CrisisLexT26

Pohl et al. (2018) cited that CrisisLexT26 (Olteanu et al., 2015) is a collection of datasets related to several crises worldwide. However, Pohl et al. (2018) used only the datasets related to the Colorado Floods, containing 751 relevant and 224 irrelevant tweets, and Australian Bushfires, containing 645 relevant and 408 irrelevant tweets.

5.1.4 EmailingList

This dataset contains 1500 samples with 913 dimensions, i.e., boolean bag-of-words, corresponding to email messages, to be classified as junk or interesting. According to Katakis et al. (2010), these samples were collected from Usenet posts existing inside the 20Newsgroup dataset.

5.1.5 EveTAR

EveTAR is an Arabic dataset that contains 1392 tweets on three terrorist events: (i) a suicide bombing in Ab, Yemen; (ii) Air strikes in Pakistan; and (iii) the Charlie Hebdo attack. This dataset is used in Abid et al. (2019) to evaluate the ability of AIS-Clus to receive texts and detect events in languages other than English.

5.1.6 Guardian, The

Wang et al. (2019) collected a news stream from The Guardian using the API. The dataset contains 10 categories and 40,000 samples, represented using Word2Vec with 300 dimensions.

5.1.7 Irish Times, The

The Irish Times dataset corresponds to a set of 1.6 million news headlines published by the Irish Times, distributed in six classes. It comprises 25 years of publications.

5.1.8 NELA-GT

NELA-GT (Nørregaard et al., 2019; Gruppi et al., 2020, 2021) corresponds to a series of datasets regarding news and media outlets. In addition, conspiracy sources are included in this dataset. The authors incorporated ground-truth ratings of aspects such as reliability, transparency, and bias. NELA-GT-2018 (Nørregaard et al., 2019) contains 713 thousand items from 194 media outlets and conspiracy sites; NELA-GT-2019 (Gruppi et al., 2020) contains 1.12 million media articles from 260 mainstream and alternative sources collected in 2019; NELA-GT-2020 (Gruppi et al., 2021) contains almost 1.8 million news stories from 519 sources.

Fenza et al. (2023) used these datasets for fake news detection, using the instances labeled as reliable and unreliable. These datasets were merged, but the temporal order was respected.

5.1.9 New York Times, The

Wang et al. (2019) used The New York Times’ public API to collect news articles published between January 2006 and January 2018. These news articles were encoded using Word2Vec, with 300 dimensions. He et al. (2018) used a dataset collected from The New York Times, containing news articles from 1987 to 2007, distributed in 26 categories (Sandhaus, 2008). Van Linh et al. (2022) used only the titles of news articles from The New York Times. The authors mentioned that the dataset contained 1,764,127 titles, with an average of five words per title. Lu et al. (2022) utilized a dataset collected from The New York Times containing 99,872 articles, dating from 1990 to 2016.

5.1.10 NOAA

National Oceanic and Atmospheric Administration (NOAA) is an agency in the United States government. It does not correspond directly to a dataset; however, Suprem and Pu (2019a), Suprem et al. (2019b), and Suprem and Pu (2019b) used NOAA reports as ground truth for the automatic classification of tweets. No details were offered about the reports’ processing or collection.

5.1.11 NSDQ

The NSDQ dataset (named after NASDAQ) corresponds to tweets regarding 15 companies listed in NASDAQ. NSDQ was compiled by the authors in Heusinger et al. (2020a) and Heusinger et al. (2022), and comprised the months of February to December 2019. This dataset contains 30,278 tweets.

5.1.12 OffensEval

Amba Hombaiah et al. (2021) used the OffensEval 2019 dataset (Zampieri et al., 2019). The dataset contains 14,000 tweets posted in 2019 divided into offensive and inoffensive.

5.1.13 RCV1

RCV1 (Lewis et al., 2004) is a dataset that contains 403,143 news articles from Reuters News published between 1996 and 1997. The news articles are divided into three classes: industries, topics, and regions. This dataset is organized hierarchically. From this dataset, He et al. (2018) obtained a corpus with 12 subtrees (labels).

5.1.14 Reuters-21578

Murena et al. (2018) used this dataset, which contains temporally ordered articles with their respective categories. According to Murena et al. (2018), it contains 12,902 news articles, each classified into several categories, totaling 90 categories.

5.1.15 SemEval2020 - Subtask 2

Periti et al. (2022) used the datasets corresponding to Task 2 of SemEval2020, regarding the semantic shift detection task. The datasets used are CCOHA (Alatrash et al., 2020) and LatinISE (McGillivray and Kilgarriff, 2013). CCOHA contains texts in English ranging approximately from 1810 to 2000, while LatinISE has Latin texts ranging from the 2nd century BC to the 21st century AD. Both have target words, which are words that can be monitored to detect the semantic shift. Among the datasets found in the selected papers, these two span the longest periods.

5.1.16 SO-T

Rakib et al. (2021) collected duplicated question titles regarding Python, Java, jQuery, R, and other programming languages/tools. In the paper, the authors carefully described the process of obtaining this dataset. In the end, this dataset contained 400,000 randomly selected pairs of question titles.

5.1.17 SpamAssassin and SpamData

These datasets correspond to emails collected from the Spam Assassin collection. They are represented as bag-of-words and distributed across two classes, ham and spam, in imbalanced proportions (80% and 20%, respectively). Both contain 9,324 instances; however, SpamAssassin (Katakis et al., 2008) has 40,000 features, while SpamData (Katakis et al., 2009) has 499. These datasets are reported to contain gradual drifts (de Moraes and Gradvohl, 2021).

5.1.18 Ts-T, Tw, Tw-T, Tweets, Tweets-T

Yin et al. (2018) used this dataset named Tweets, containing 30,332 tweets distributed into 269 groups, with 7.97 words per tweet on average. The authors also generated a variant dataset from Tweets, called Tweets-T, where the dataset is sorted by topic. Rakib et al. (2021) used the same dataset Tweets, called Ts-T. In Vo (2022), the same datasets presented in Yin et al. (2018) are named Tw and Tw-T, respectively.

5.1.19 Tweets, TweetSet, Twitter

Li et al. (2018) used a Tweets dataset containing approximately 400,000 tweets. They stated that the data were acquired in November and December 2012 using the Twitter API. The dataset in Sun et al. (2021) was also obtained through the Twitter API and consists of 803,613 short texts distributed in four categories. Kolajo et al. (2022) described the dataset used in their work as “Twitter sentiment analysis training corpus”, from which they filtered 10% of the data, totaling 104,857 tweets. D’Andrea et al. (2019) collected tweets by using a Java library named GetOldTweets (https://github.com/Jefferson-Henrique/GetOldTweets-java/). They collected 112,397 tweets posted between September 2016 and January 2017, using vaccine-related keywords. Bechini et al. (2021) extended the dataset obtained in D’Andrea et al. (2019) until September 2019, corresponding to 806,672 tweets. Bondielli et al. (2022) collected 486,688 tweets from July 2021 to December 2021 regarding the Green Pass, as the European Union Covid-19 Digital Certificate is known in Italy. Amba Hombaiah et al. (2021) used tweets to perform country hashtag prediction in two different years, 2014 and 2017, consisting of 472,000 and 407,000 tweets, respectively. The tweets were obtained from the Internet Archive (https://archive.org/details/twitterstream). Yang et al. (2021) experimented with their approach using a Twitter dataset, namely TweetSet, containing about 144,000 tweets posted in June 2019, distributed into 16 categories. de Mello et al. (2018) collected tweets by monitoring a set of users and hashtags, i.e., words with a # at the beginning that act as a tag for the tweet. The authors monitored, for instance, @dilmabr (former Brazilian president) and #dolar (Portuguese for dollar). The dataset size was not mentioned.

Although Twitter-based datasets were very frequent in the studied papers, Twitter’s API policies changed in February 2023, and the API became a paid service (see https://www.forbes.com/sites/jenaebarnes/2023/02/03/twitter-ends-its-free-api-heres-who-will-be-affected/?sh=36ad308a6266, accessed on September 17th, 2023).

5.1.20 TwitterSentiment

TwitterSentiment (or TSentiment, as in Melidis et al. (2018)) is a balanced dataset that contains 1.6 million tweets collected between April and June 2009. These tweets are labeled as positive or negative using distant supervision. In this case, emoticons were used for labeling.
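Distant supervision with emoticons can be illustrated with the sketch below. It is only an assumption-laden example: the emoticon lists, the function name `distant_label`, and the choice to discard tweets with mixed or missing emoticons are hypothetical and are not taken from the dataset's documentation.

```python
# Assumed emoticon lists; the actual lists used to build the dataset may differ.
POSITIVE_EMOTICONS = (":)", ":-)", ":D")
NEGATIVE_EMOTICONS = (":(", ":-(")


def distant_label(tweet):
    """Return (text_without_emoticons, label), or None if no clear signal exists."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        # Either no emoticon at all or conflicting emoticons: skip the tweet.
        return None
    # Remove the emoticons so a model cannot simply memorize the labeling cue.
    for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        tweet = tweet.replace(emoticon, "")
    label = "positive" if has_pos else "negative"
    return " ".join(tweet.split()), label
```

For instance, `distant_label("great game :)")` yields a positive example with the emoticon stripped, whereas a tweet without emoticons is left unlabeled.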

5.1.21 UCINews

The UCINews dataset contains 422,937 news articles collected between March and August 2014. Each article is categorized as business, science and technology, entertainment, or health. This data collection also includes the id, title, URL, publisher, story id, hostname, and timestamp of each article.

5.1.22 Usenet1 and Usenet2

Similarly to EmailingList, both Usenet1 and Usenet2 simulate a sequence of 1,500 emails from the 20NewsGroup dataset sent to a particular user, to be classified as junk or interesting (Katakis et al., 2008). According to de Moraes and Gradvohl (2021), both datasets have 100 features, each corresponding to a word.

5.1.23 USGS

United States Geological Survey (USGS) is a scientific agency from the United States. Similarly to NOAA, USGS reports do not correspond to datasets and are also used as ground truth to classify tweets automatically in Suprem and Pu (2019a), Suprem et al. (2019b), and Suprem and Pu (2019b).

5.1.24 vg.no

Vg.no is a Norwegian news website. Hammer and Yazidi (2018) obtained news from four topics: European Union, economy, sports, and entertainment. However, the authors did not mention the size of the collected dataset.

5.1.25 Yelp datasets

Mohawesh et al. (2021) used four real-world datasets, based on the datasets provided by Yelp, namely Yelp CHI, Yelp NYC, Yelp ZIP, and Yelp Consumer Electronics. The authors used Yelp CHI (Chicago) (Mukherjee et al., 2013), containing more than 67,000 reviews of restaurants and hotels, distributed between 2004 and 2012. Yelp NYC (Rayana and Akoglu, 2015) contains approximately 322,000 reviews of restaurants located in New York City. It comprises the years between 2004 and 2015. Yelp ZIP (Rayana and Akoglu, 2015) contains 608,598 reviews from New Jersey, Vermont, Connecticut, and Pennsylvania. Yelp Consumer Electronics (Barbado et al., 2019) contains almost 19,000 records evenly distributed between genuine and fake. These datasets include other data, such as user information, product information, rating, timestamp, and review, and were scraped/downloaded from Yelp.com.

Although SpamAssassin and EmailingList have known concept drifts (gradual and abrupt, according to http://mlkd.csd.auth.gr/concept_drift.html), none of the datasets has labeled concept drifts, because defining the specific points of drift is difficult and requires an in-depth study of each dataset. Thus, some authors only warned that the presence of concept drift in the datasets is unknown. Others attempted to force concept drifts by: (i) placing data partitions out of temporal order in the stream, e.g., data from 2011 and 2015 before data from 2012 (Mohawesh et al., 2021); or (ii) rearranging the data by sorting it by class or topic (Li et al., 2018). This aspect is discussed further in Section 6.

Therefore, since no dataset appears in more than three papers, we conclude that the research area of concept drift detection in textual streams lacks benchmark datasets. Furthermore, all the datasets used for classification are labeled at the instance level, i.e., individual sentences or tweets are labeled. In addition, the source of two of the most recurrent datasets, i.e., TagMyNews and Snippets (Phan et al., 2010), could not be found in the papers. Also, these datasets are closely related to short-text applications, which constitute an entirely new research area.

6 Concept Drift Visualization and Simulation

It is challenging to clearly express or prove the existence of concept drifts in a particular textual dataset. However, a few works attempt to justify the existence of drifts by resorting to plots. For example, Li et al. (2018) used normalized stacked bar plots to demonstrate the topic distribution over several batches (Fig. 4 in Li et al. (2018)).
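The quantity behind a normalized stacked bar plot is the share of each topic within each batch. A minimal sketch of that computation is given below; the function name `topic_shares` and the toy batches are hypothetical, and the code does not reproduce the procedure of Li et al. (2018).

```python
from collections import Counter


def topic_shares(batches):
    """Return, for each batch, the fraction of items that belong to each topic."""
    topics = sorted({topic for batch in batches for topic in batch})
    shares = []
    for batch in batches:
        counts = Counter(batch)
        total = len(batch)
        # Topics absent from a batch get a share of 0.0 so all bars align.
        shares.append({t: (counts[t] / total if total else 0.0) for t in topics})
    return shares


# Toy batches: each item is represented only by its topic label.
batches = [
    ["sports", "sports", "politics", "sports"],
    ["politics", "politics", "sports", "economy"],
]
shares = topic_shares(batches)
```

Each dictionary in `shares` sums to one, so the values can be passed directly to a stacked bar plot; a visible change in the shares between consecutive batches suggests a change in topic distribution.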

Bondielli et al. (2022) plotted the distribution of the stances across the analyzed period using a normalized stacked area plot, which is similar to the stacked bar plot used to show the topic distribution over time. The background color represents the stance of tweets about the Green Pass: positive (in blue), neutral (in white), and negative (in red). Following the same color code, the thicker line corresponds to the average stance at each moment in the timeline. This description relates to Fig. 3 in Bondielli et al. (2022).

Suprem and Pu (2019a), Heusinger et al. (2020a), and Heusinger et al. (2022), in turn, used dimensionality reduction methods, i.e., either t-SNE or PCA, to reduce high-dimensional representations to two dimensions, which can easily be plotted. Suprem and Pu (2019b), Heusinger et al. (2020a), and Heusinger et al. (2022) used t-SNE to confirm that there are drifts between texts of specific hashtags. Fig. 4 in Heusinger et al. (2020a) depicts the visual representation of concept drift. The data points of different colors in different positions indicate that texts regarding particular stock tickers have different patterns. However, the plot does not highlight temporal changes.

Suprem and Pu (2019a) used PCA for dimensionality reduction for plotting and suggesting a direction of drift based on data from 2014 and from four months in 2018. It is not possible to categorize the drifts shown by the images considering the literature presented in Section 2. Fig. 10 in Suprem and Pu (2019a) is a plot of text representations reduced to bi-dimensional vectors using PCA. The authors colored the data points according to the month or year of the posts’ timestamps. Posts from 2014 occupy the center left of the image, while the representations of the other posts published in 2018, identified as July, August, September, and October, occupy the center and bottom of the image. In addition, the authors drew an arrow to show the direction of the concept drift.

As aforementioned, concept drift in texts is common and can occur over time. However, depending on the characteristics of the approach and datasets, it may be challenging to run experiments because of uncertainty about whether drifts exist, where they occur, and how they behave over time. Therefore, some papers simulate drifts. For example, Murena et al. (2018), Van Linh et al. (2022), Li et al. (2022), and Rabiu et al. (2023) rearranged the topics sequentially in the stream. Thus, when a new topic emerges from the stream, it is considered a drift. Mohawesh et al. (2021) simulated drift by dividing the datasets into partitions and rearranging them in different orders. For example, one of the datasets is initially ordered temporally and divided into five partitions, i.e., $D1$, $D2$, ..., $D5$. Thus, in a specific scenario, the authors merged $D1-D3$ for training and used the other partitions, i.e., $D2$, $D4$, and $D5$, for testing sequentially. Although these strategies created concept drift scenarios and served the experiments in the aforementioned papers, both are unrealistic, especially considering the temporal aspects of the partitions in the latter example.
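The two simulation strategies described above can be sketched as follows. This is a hedged illustration rather than the code of any cited paper: the function names, the toy stream, and the item fields (`id`, `topic`) are hypothetical, and the training set is read as the union of the first and third partitions, an interpretation implied by the second partition being used for testing.

```python
def order_by_topic(stream):
    """Rearrange items so that topics appear one after another (first-seen order)."""
    first_seen = {}
    for item in stream:
        first_seen.setdefault(item["topic"], len(first_seen))
    # sorted() is stable, so the original order is preserved within each topic.
    return sorted(stream, key=lambda item: first_seen[item["topic"]])


def split_into_partitions(items, n_parts):
    """Split a time-ordered list into n_parts contiguous partitions."""
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


# Toy time-ordered stream with three interleaved topics.
stream = [{"id": i, "topic": t} for i, t in enumerate("ABABCABCCA")]

# Strategy 1: each topic change in the rearranged stream is treated as a drift.
by_topic = order_by_topic(stream)

# Strategy 2: train on partitions 1 and 3, then test on 2, 4, and 5 in sequence.
parts = split_into_partitions(stream, 5)
train = parts[0] + parts[2]
test_sequence = [parts[1], parts[3], parts[4]]
```

Note that the second strategy tests on data that precedes part of the training data in time, which is exactly why such scenarios are considered unrealistic for streaming settings.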

Ultimately, depending on the sort of text drift, it may not be easy to visualize due to several factors, such as the inherent high dimensionality of the most frequent text representations. In addition, visually representing changes in text behavior over time can be challenging. Furthermore, developing scenarios to force concept drift in text streams can be complex, depending on the type of text drift. Generally, the datasets are described in the papers; however, they sometimes lack evidence for the existence of text drift. Thus, it is necessary to resort to data rearrangement to simulate drifts and to data visualization to search for changes in temporal patterns. However, to maintain consistency, it may be essential to respect the temporal order, especially in streaming scenarios.

7 Conclusion and Future Directions

In this study, we performed a systematic literature review on concept drift adaptation, specifically in text streams. A text stream is a specialization of data streams in which several texts arrive sequentially at high speeds. Sequentially handling texts is a challenge due to the constraints of data stream settings, i.e., processing time and memory consumption. In addition, we can mention characteristics of text-related settings, such as vocabulary maintenance, natural language processing, and text representation maintenance.

We selected 40 papers and extracted information according to the defined criteria. We evaluated the papers regarding categories of drift, types of drift detection, the ML model update scheme, the stream mining tasks addressed, the text representation method utilized, and the update scheme of the text representation methods. We also noted the metrics used in each stream mining task.

Regarding categories of drift, we differentiated the types into real, virtual, feature drift, and semantic shift. Most works (37) approached the real drift problem, corresponding to the mapping changes between $X$ and $y$ over time. Only three works considered the virtual drift, and another three tackled the semantic shift problem. Please note that a work can approach more than one drift category simultaneously. Considering the drift detection method, we investigated the papers and observed that it is possible to categorize them into adaptive, where the method adapts to the concept drift without detecting it, and explicit, where there is an explicit concept drift detection that can trigger the ML model update.

Furthermore, we investigated the manner in which the methods and systems updated the machine learning models, when possible. We categorized the studied papers considering the ML update scheme into four groups: (i) ensemble update, (ii) incremental, (iii) keep-compare-evolve, and (iv) retraining. We also analyzed the applications approached in the papers according to a stream mining task categorization. The stream mining tasks found in the studies were categorized into classification, clustering, general detection, and topic modeling. Several applications were found, such as fake review detection, sentiment analysis, and novelty detection.

In addition, we organized and presented the text representation methods since they are crucial for text streams subject to concept drift. Fifteen text representation methods were identified, where Bag-of-words and Word2vec were the most frequent methods (each appeared in 11 studies). Moreover, when available, the update mechanisms of the text representations were also listed. Only two methods are fully incremental, while most studies used static text representation methods/language models. Therefore, it constitutes an open challenge.

Additionally, we listed the real-world datasets with their links when available and discussed concept drift visualization and drift simulation. Some papers argue that the datasets in use have drift, although such drifts are unlabeled or uncategorized. A few papers resorted to visualization techniques to justify the existence of drifts or to data rearrangement to simulate them. Concept drifts in text streams can manifest in various ways, including feature drift, semantic shift, real and virtual drifts, and topic drift. Thus, different approaches are required to manage these types of drifts.

During this study, we discovered aspects that can be addressed in future research. The research area lacks visualization methods that highlight the existence of text drift. There is no standard for generating those visualizations, especially regarding changes over time. Furthermore, different approaches for text drift simulation have been used in the literature. Standardization in these processes may be an advantage, enabling faster development of the research area. Semantic shift, in turn, can be studied in greater depth, drawing on linguistics. According to the papers on semantic shift detection studied in this work, a challenging aspect is that the target words are known a priori. Thus, methods that can indicate which words undergo semantic shift in text streams appear to be desirable. Additionally, we realized that most text representation methods are not updated during the process, and, as aforementioned, this constitutes a problem over time. Therefore, developing new easy-to-use representation methods that can be updated over time may benefit the research area.

Finally, as discussed in Section 5, there are no benchmark datasets for text drift detection in text stream scenarios. The authors in the selected papers collected many datasets; however, the most frequent datasets across the papers are related to short-text scenarios or topics. Thus, it is crucial to develop a benchmark dataset for text drift detection focused on text stream scenarios in the future.

In summary, this comprehensive review provides a detailed analysis and evaluation of concept drift adaptation methods in text stream scenarios, offering valuable insights that may help readers understand the strengths and weaknesses of the current methods and open issues that need to be addressed.

Declarations

Funding

Conflict of interest

The authors declare that they have no conflict of interest.

Availability of data and material

Not applicable.

Code availability

Not applicable.

Authors’ contributions

Not applicable.

References

  • Abadi et al. (2015) Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. URL https://www.tensorflow.org/, software available from tensorflow.org
  • Abid et al. (2018) Abid A, Jamoussi S, Hamadou AB (2018) Handling Concept Drift and Feature Evolution in Textual Data Stream using the Artificial Immune System. In: International Conference on Computational Collective Intelligence, Springer, pp 363–372
  • Abid et al. (2019) Abid A, Jamoussi S, Hamadou AB (2019) AIS-Clus: A Bio-Inspired Method for Textual Data Stream Clustering. Vietnam Journal of Computer Science 6(02):223–256
  • Ahuja et al. (2019) Ahuja R, Chug A, Kohli S, Gupta S, Ahuja P (2019) The Impact of Features Extraction on the Sentiment Analysis. Procedia Computer Science 152:341–348
  • Alatrash et al. (2020) Alatrash R, Schlechtweg D, Kuhn J, Im Walde SS (2020) CCOHA: Clean Corpus of Historical American English. In: Proceedings of the Twelfth Language Resources and Evaluation Conference, pp 6958–6966
  • Amba Hombaiah et al. (2021) Amba Hombaiah S, Chen T, Zhang M, Bendersky M, Najork M (2021) Dynamic Language Models for Continuously Evolving Content. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp 2514–2524
  • Antoniak (1974) Antoniak CE (1974) Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. The Annals of Statistics pp 1152–1174
  • Assenmacher and Trautmann (2022) Assenmacher D, Trautmann H (2022) Textual One-Pass Stream Clustering with Automated Distance Threshold Adaption. In: Asian Conference on Intelligent Information and Database Systems, Springer, pp 3–16
  • Baena-García et al. (2006) Baena-García M, del Campo-Ávila J, Fidalgo R, Bifet A, Gavalda R, Morales-Bueno R (2006) Early Drift Detection Method. In: Fourth International Workshop on Knowledge Discovery from Data Streams, Citeseer, vol 6, pp 77–86
  • Barbado et al. (2019) Barbado R, Araque O, Iglesias CA (2019) A Framework for Fake Review Detection in Online Consumer Electronics Retailers. Information Processing & Management 56(4):1234–1244
  • Barddal et al. (2017) Barddal JP, Gomes HM, Enembreck F, Pfahringer B (2017) A Survey on Feature Drift Adaptation: Definition, Benchmark, Challenges and Future Directions. Journal of Systems and Software 127:278–294
  • Barddal et al. (2020) Barddal JP, Loezer L, Enembreck F, Lanzuolo R (2020) Lessons Learned from Data Stream Classification applied to Credit Scoring. Expert Systems with Applications 162:113899
  • Barros et al. (2017) Barros RS, Cabral DR, Gonçalves Jr PM, Santos SG (2017) RDDM: Reactive Drift Detection Method. Expert Systems with Applications 90:344–355
  • Bechini et al. (2021) Bechini A, Bondielli A, Ducange P, Marcelloni F, Renda A (2021) Addressing Event-driven Concept Drift in Twitter Stream: a Stance Detection Application. IEEE Access 9:77758–77770
  • Belotti et al. (2020) Belotti F, Bianchi F, Palmonari M (2020) UNIMIB@ DIACR-Ita: Aligning Distributional Embeddings with a Compass for Semantic Change Detection in the Italian Language. EVALITA Evaluation of NLP and Speech Tools for Italian-December 17th, 2020 p 451
  • Bezerra et al. (2015) Bezerra E, Passos E, Goldschmidt R (2015) Data Mining: Conceitos, técnicas, algoritmos, orientações e aplicações. Campus, Rio de Janeiro, Brazil
  • Bifet (2017) Bifet A (2017) Classifier Concept Drift Detection and the Illusion of Progress. In: Artificial Intelligence and Soft Computing: 16th International Conference, ICAISC 2017, Zakopane, Poland, June 11-15, 2017, Proceedings, Part II 16, Springer, pp 715–725
  • Bifet and Gavalda (2007) Bifet A, Gavalda R (2007) Learning from Time-changing Data with Adaptive Windowing. In: Proceedings of the 2007 SIAM International Conference on Data Mining, SIAM, pp 443–448
  • Bifet et al. (2010) Bifet A, Holmes G, Pfahringer B, Kranen P, Kremer H, Jansen T, Seidl T (2010) MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering. In: Proceedings of the First Workshop on Applications of Pattern Analysis, PMLR, pp 44–50
  • Bifet et al. (2018) Bifet A, Gavalda R, Holmes G, Pfahringer B (2018) Machine Learning for Data Streams: with Practical Examples in MOA. MIT Press
  • Blei (2012) Blei DM (2012) Probabilistic topic models. Communications of the ACM 55(4):77–84
  • Blei et al. (2003) Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet Allocation. Journal of Machine Learning Research 3(Jan):993–1022
  • Bloomberg (1933) Bloomberg L (1933) Language. George Allen & Unwin
  • Bojanowski et al. (2016) Bojanowski P, Grave E, Joulin A, Mikolov T (2016) Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606
  • Bondielli et al. (2022) Bondielli A, Tortora GC, Ducange P, Macri A, Marcelloni F, Renda A (2022) Online Monitoring of Stance from Tweets: The case of Green Pass in Italy. In: 2022 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS), IEEE, pp 1–8
  • Box et al. (2015) Box GE, Jenkins GM, Reinsel GC, Ljung GM (2015) Time Series Analysis: Forecasting and Control. John Wiley & Sons
  • Bravo-Marquez et al. (2022) Bravo-Marquez F, Khanchandani A, Pfahringer B (2022) Incremental Word Vectors for Time-Evolving Sentiment Lexicon Induction. Cognitive Computation 14(1):425–441
  • Brito and Adeodato (2023) Brito K, Adeodato PJL (2023) Machine Learning for Predicting Elections in Latin America based on Social Media Engagement and Polls. Government Information Quarterly 40(1):101782
  • Cebrián et al. (2007) Cebrián M, Alfonseca M, Ortega A (2007) The Normalized Compression Distance is Resistant to Noise. IEEE Transactions on Information Theory 53(5):1895–1900
  • Chamby-Diaz et al. (2019) Chamby-Diaz JC, Recamonde-Mendoza M, Bazzan AL (2019) Dynamic Correlation-based Feature Selection for Feature Drifts in Data Streams. In: 2019 8th Brazilian Conference on Intelligent Systems (BRACIS), IEEE, pp 198–203
  • Chollet et al. (2015) Chollet F, et al. (2015) Keras. https://keras.io
  • Cilibrasi and Vitányi (2005) Cilibrasi R, Vitányi PM (2005) Clustering by Compression. IEEE Transactions on Information Theory 51(4):1523–1545
  • Costa et al. (2017) Costa J, Silva C, Antunes M, Ribeiro B (2017) Adaptive Learning for Dynamic Environments: A Comparative Approach. Engineering Applications of Artificial Intelligence 65:336–345
  • Crammer et al. (2006) Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y (2006) Online passive-aggressive algorithms. The Journal of Machine Learning Research 7
  • D’Andrea et al. (2019) D’Andrea E, Ducange P, Bechini A, Renda A, Marcelloni F (2019) Monitoring the Public Opinion about the Vaccination Topic from Tweets Analysis. Expert Systems with Applications 116:209–226
  • Delazeri et al. (2022) Delazeri BR, Vera LL, Barddal JP, Koerich AL, et al. (2022) Evaluation of Self-taught Learning-based Representations for Facial Emotion Recognition. In: 2022 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 1–8
  • Devlin et al. (2018) Devlin J, Chang M, Lee K, Toutanova K (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805
  • Di Carlo et al. (2019) Di Carlo V, Bianchi F, Palmonari M (2019) Training Temporal Word Embeddings with a Compass. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 33, pp 6326–6334
  • Dwi Prasetyo and Hauff (2015) Dwi Prasetyo N, Hauff C (2015) Twitter-based Election Prediction in the Developing World. In: Proceedings of the 26th ACM Conference on Hypertext & Social Media, pp 149–158
  • Emmons et al. (2016) Emmons S, Kobourov S, Gallant M, Börner K (2016) Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale. PloS one 11(7):e0159161
  • Ester et al. (1996) Ester M, Kriegel HP, Sander J, Xu X, et al. (1996) A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: KDD’96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp 226–231
  • Faria et al. (2016) Faria ER, Gonçalves IJ, de Carvalho AC, Gama J (2016) Novelty Detection in Data Streams. Artificial Intelligence Review 45:235–269
  • Fenza et al. (2023) Fenza G, Gallo M, Loia V, Petrone A, Stanzione C (2023) Concept-drift Detection Index based on Fuzzy Formal Concept Analysis for Fake News Classifiers. Technological Forecasting and Social Change 194:122640
  • Friedman (1937) Friedman M (1937) The use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. Journal of the American Statistical Association 32(200):675–701
  • Friedman (1940) Friedman M (1940) A Comparison of Alternative Tests of Significance for the Problem of m Rankings. The Annals of Mathematical Statistics 11(1):86–92
  • Fu et al. (2022) Fu CL, Chen ZC, Lee YR, Lee Hy (2022) AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks. arXiv preprint arXiv:2205.00305
  • Gama et al. (2004) Gama J, Medas P, Castillo G, Rodrigues P (2004) Learning with Drift Detection. In: Advances in Artificial Intelligence–SBIA 2004: 17th Brazilian Symposium on Artificial Intelligence, Sao Luís, Maranhão, Brazil, September 29-October 1, 2004. Proceedings 17, Springer, pp 286–295
  • Gama et al. (2006) Gama J, Fernandes R, Rocha R (2006) Decision Trees for Mining Data Streams. Intelligent Data Analysis 10(1):23–45
  • Gama et al. (2014) Gama J, Žliobaité I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A Survey on Concept Drift Adaptation. ACM Computing Surveys (CSUR) 46(4):1–37
  • Garcia et al. (2019a) Garcia C, Esmin A, Leite D, Škrjanc I (2019a) Evolvable Fuzzy Systems from Data Streams with Missing Values: With Application to Temporal Pattern Recognition and Cryptocurrency Prediction. Pattern Recognition Letters 128:278–282
  • Garcia et al. (2019b) Garcia C, Leite D, Škrjanc I (2019b) Incremental Missing-data Imputation for Evolving Fuzzy Granular Prediction. IEEE Transactions on Fuzzy Systems 28(10):2348–2362
  • Garcia and Herrera (2008) Garcia S, Herrera F (2008) An Extension on “Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Comparisons. Journal of Machine Learning Research 9(12)
  • Gomes et al. (2017) Gomes HM, Bifet A, Read J, Barddal JP, Enembreck F, Pfahringer B, Holmes G, Abdessalem T (2017) Adaptive Random Forests for Evolving Data Stream Classification. Machine Learning 106:1469–1495
  • Gruppi et al. (2020) Gruppi M, Horne BD, Adali S (2020) NELA-GT-2019: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles. CoRR abs/2003.08444, URL https://arxiv.org/abs/2003.08444
  • Gruppi et al. (2021) Gruppi M, Horne BD, Adalı S (2021) NELA-GT-2020: A Large Multi-Labelled News Dataset for the Study of Misinformation in News Articles. arXiv preprint arXiv:2102.04567
  • Hall et al. (2009) Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA Data Mining Software: an Update. ACM SIGKDD Explorations Newsletter 11(1):10–18
  • Hall (1999) Hall MA (1999) Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato
  • Hamilton et al. (2016a) Hamilton WL, Leskovec J, Jurafsky D (2016a) Cultural Shift or Linguistic Drift? Comparing Two Computational Measures of Semantic Change. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, NIH Public Access, vol 2016, p 2116
  • Hamilton et al. (2016b) Hamilton WL, Leskovec J, Jurafsky D (2016b) Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics
  • Hammer and Yazidi (2018) Hammer HL, Yazidi A (2018) Parameter Estimation in Abruptly Changing Dynamic Environments using Stochastic Learning Weak Estimator. Applied Intelligence 48(11):4096–4112
  • Harris (1954) Harris ZS (1954) Distributional Structure. Word 10(2-3):146–162
  • He et al. (2018) He Y, Li J, Song Y, He M, Peng H, et al. (2018) Time-evolving Text Classification with Deep Neural Networks. In: International Joint Conference on Artificial Intelligence, vol 18, pp 2241–2247
  • Hellinger (1909) Hellinger E (1909) Neue Begründung der Theorie Quadratischer Formen von Unendlichvielen Veränderlichen. Journal für die Reine und Angewandte Mathematik 1909(136):210–271
  • Heusinger et al. (2020a) Heusinger M, Raab C, Schleif FM (2020a) Analyzing Dynamic Social Media Data via Random Projection - a New Challenge for Stream Classifiers. In: 2020 IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS), IEEE, pp 1–8
  • Heusinger et al. (2020b) Heusinger M, Raab C, Schleif FM (2020b) Passive Concept Drift Handling via Momentum based Robust Soft Learning Vector Quantization. In: Advances in Self-Organizing Maps, Learning Vector Quantization, Clustering and Data Visualization: Proceedings of the 13th International Workshop, WSOM+ 2019, Barcelona, Spain, June 26-28, 2019 13, Springer, pp 200–209
  • Heusinger et al. (2022) Heusinger M, Raab C, Schleif FM (2022) Dimensionality Reduction in the Context of Dynamic Social Media Data Streams. Evolving Systems 13(3):387–401
  • Hoffman et al. (2013) Hoffman MD, Blei DM, Wang C, Paisley J (2013) Stochastic Variational Inference. Journal of Machine Learning Research
  • Holt (2004) Holt CC (2004) Forecasting Seasonals and Trends by Exponentially Weighted Moving Averages. International Journal of Forecasting 20(1):5–10
  • Hu et al. (2018) Hu X, Wang H, Li P (2018) Online Biterm Topic Model Based Short Text Stream Classification using Short Text Expansion and Concept Drifting Detection. Pattern Recognition Letters 116:187–194
  • Iman and Davenport (1980) Iman RL, Davenport JM (1980) Approximations of the Critical Region of the Friedman Statistic. Communications in Statistics-Theory and Methods 9(6):571–595
  • Iturra-Bocaz and Bravo-Marquez (2023) Iturra-Bocaz G, Bravo-Marquez F (2023) RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams. In: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
  • Kaji and Kobayashi (2017) Kaji N, Kobayashi H (2017) Incremental Skip-gram Model with Negative Sampling. arXiv preprint arXiv:1704.03956
  • Katakis et al. (2008) Katakis I, Tsoumakas G, Vlahavas I (2008) An Ensemble of Classifiers for Coping with Recurring Contexts in Data Streams. In: ECAI 2008, IOS Press, pp 763–764
  • Katakis et al. (2009) Katakis I, Tsoumakas G, Banos E, Bassiliades N, Vlahavas I (2009) An Adaptive Personalized News Dissemination System. Journal of Intelligent Information Systems 32:191–212
  • Katakis et al. (2010) Katakis I, Tsoumakas G, Vlahavas I (2010) Tracking Recurring Contexts using Ensemble Classifiers: an Application to Email Filtering. Knowledge and Information Systems 22:371–391
  • Kephart et al. (1994) Kephart JO, et al. (1994) A Biologically Inspired Immune System for Computers. In: Artificial Life IV: Proceedings of the Fourth International Workshop on the Synthesis and Simulation of Living Systems, vol 247, pp 130–139
  • Kherwa and Bansal (2019) Kherwa P, Bansal P (2019) Topic Modeling: A Comprehensive Review. EAI Endorsed Transactions on Scalable Information Systems 7(24)
  • Kitchenham and Charters (2007) Kitchenham B, Charters S (2007) Guidelines for Performing Systematic Literature Reviews in Software Engineering. Keele University and Durham University Joint Report
  • Kolajo et al. (2022) Kolajo T, Daramola O, Adebiyi AA (2022) Real-time Event Detection in Social Media Streams Through Semantic Analysis of Noisy Terms. Journal of Big Data 9(1):1–36
  • Kolmogorov (1933) Kolmogorov AN (1933) Sulla Determinazione Empirica di una Legge di Distribuzione. Giorn Dell’inst Ital Degli Att 4:89–91
  • Kolter and Maloof (2005) Kolter JZ, Maloof MA (2005) Using Additive Expert Ensembles to Cope with Concept Drift. In: Proceedings of the 22nd international conference on Machine learning, pp 449–456
  • Küçük and Can (2020) Küçük D, Can F (2020) Stance Detection: A Survey. ACM Computing Surveys (CSUR) 53(1):1–37
  • Kullback and Leibler (1951) Kullback S, Leibler RA (1951) On Information and Sufficiency. The Annals of Mathematical Statistics 22(1):79–86
  • Kutuzov et al. (2018) Kutuzov A, Øvrelid L, Szymanski T, Velldal E (2018) Diachronic Word Embeddings and Semantic Shifts: a Survey. arXiv preprint arXiv:1806.03537
  • Lau et al. (2014) Lau JH, Newman D, Baldwin T (2014) Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp 530–539
  • Le and Mikolov (2014) Le Q, Mikolov T (2014) Distributed Representations of Sentences and Documents. In: International Conference on Machine Learning, PMLR, pp 1188–1196
  • Leite et al. (2012) Leite D, Ballini R, Costa P, Gomide F (2012) Evolving Fuzzy Granular Modeling from Nonstationary Fuzzy Data Streams. Evolving Systems 3:65–79
  • Lewis et al. (2004) Lewis DD, Yang Y, Russell-Rose T, Li F (2004) RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research 5(Apr):361–397
  • Li et al. (2018) Li P, He L, Wang H, Hu X, Zhang Y, Li L, Wu X (2018) Learning from Short Text Streams with Topic Drifts. IEEE Transactions on Cybernetics 48(9):2697–2711
  • Li et al. (2022) Li P, Liu Y, Hu Y, Zhang Y, Hu X, Yu K (2022) A Drift-sensitive Distributed LSTM Method for Short Text Stream Classification. IEEE Transactions on Big Data 9(1):341–357
  • Liu et al. (2019) Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692
  • Lloyd (1982) Lloyd S (1982) Least Squares Quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137
  • Losing et al. (2017) Losing V, Hammer B, Wersing H (2017) Self-Adjusting Memory: How to deal with Diverse Drift Types. In: International Joint Conferences on Artificial Intelligence
  • Lu et al. (2022) Lu Y, Cheng X, Liang Z, Rao Y (2022) Graph-based Dynamic Word Embeddings. In: International Joint Conference on Artificial Intelligence
  • van der Maaten and Hinton (2008) van der Maaten L, Hinton G (2008) Visualizing Data using t-SNE. Journal of Machine Learning Research 9(86):2579–2605, URL http://jmlr.org/papers/v9/vandermaaten08a.html
  • Manning et al. (2014) Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D (2014) The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp 55–60
  • May et al. (2017) May C, Duh K, Van Durme B, Lall A (2017) Streaming Word Embeddings with the Space-saving Algorithm. arXiv preprint arXiv:1704.07463
  • McGillivray and Kilgarriff (2013) McGillivray B, Kilgarriff A (2013) Tools for Historical Corpus Research, and a Corpus of Latin. New Methods in Historical Corpus Linguistics 1(3):247–257
  • McHugh (2012) McHugh ML (2012) Interrater Reliability: the Kappa Statistic. Biochemia Medica 22(3):276–282
  • Medhat et al. (2014) Medhat W, Hassan A, Korashy H (2014) Sentiment Analysis Algorithms and Applications: A Survey. Ain Shams Engineering Journal 5(4):1093–1113
  • Melidis et al. (2018) Melidis DP, Spiliopoulou M, Ntoutsi E (2018) Learning under Feature Drifts in Textual Streams. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp 527–536
  • de Mello et al. (2018) de Mello RF, Rios RA, Pagliosa PA, Lopes CS (2018) Concept Drift Detection on Social Network Data using Cross-recurrence Quantification Analysis. Chaos: An Interdisciplinary Journal of Nonlinear Science 28(8):085719
  • Mikolov et al. (2013a) Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781
  • Mikolov et al. (2013b) Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013b) Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26
  • Moghadasi and Zhuang (2020) Moghadasi MN, Zhuang Y (2020) Sent2Vec: A New Sentence Embedding Representation with Sentimental Semantic. In: 2020 IEEE International Conference on Big Data (Big Data), IEEE, pp 4672–4680
  • Mohawesh et al. (2021) Mohawesh R, Tran S, Ollington R, Xu S (2021) Analysis of Concept Drift in Fake Reviews Detection. Expert Systems with Applications 169:114318
  • Montanelli and Periti (2023) Montanelli S, Periti F (2023) A Survey on Contextualised Semantic Shift Detection. arXiv preprint arXiv:2304.01666
  • Montiel et al. (2021) Montiel J, Halford M, Mastelini SM, Bolmier G, Sourty R, Vaysse R, Zouitine A, Gomes HM, Read J, Abdessalem T, et al. (2021) River: Machine Learning for Streaming Data in Python. The Journal of Machine Learning Research 22(1):4945–4952
  • de Moraes and Gradvohl (2021) de Moraes MB, Gradvohl ALS (2021) A Comparative Study of Feature Selection Methods for Binary Text Streams. Evolving Systems 12(4):997–1013
  • Mukherjee et al. (2013) Mukherjee A, Venkataraman V, Liu B, Glance N, et al. (2013) Fake Review Detection: Classification and Analysis of Real and Pseudo Reviews. UIC-CS-03-2013 Technical Report
  • Murena et al. (2018) Murena PA, Al-Ghossein M, Abdessalem T, Cornuéjols A (2018) Adaptive Window Strategy for Topic Modeling in Document Streams. In: 2018 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 1–7
  • Nemenyi (1963) Nemenyi PB (1963) Distribution-free Multiple Comparisons. PhD thesis, Princeton University
  • Nguyen et al. (2022) Nguyen T, Mai T, Nguyen N, Van LN, Than K (2022) Balancing Stability and Plasticity When Learning Topic Models from Short and Noisy Text Streams. Neurocomputing 505:30–43
  • Nguyen et al. (2019) Nguyen VS, Nguyen DT, Van LN, Than K (2019) Infinite Dropout for Training Bayesian Models from Data Streams. In: 2019 IEEE International Conference on Big Data (Big Data), IEEE, pp 125–134
  • Nielsen (2019) Nielsen F (2019) On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means. Entropy 21(5):485
  • Nishida et al. (2012) Nishida K, Hoshide T, Fujimura K (2012) Improving Tweet Stream Classification by Detecting Changes in Word Probability. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 971–980
  • Nørregaard et al. (2019) Nørregaard J, Horne BD, Adalı S (2019) NELA-GT-2018: A Large Multi-labelled News Dataset for the Study of Misinformation in News Articles. In: Proceedings of the International AAAI Conference on Web and Social Media, vol 13, pp 630–638
  • Olteanu et al. (2015) Olteanu A, Vieweg S, Castillo C (2015) What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In: Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp 994–1009
  • Ontañón (2020) Ontañón S (2020) An Overview of Distance and Similarity Functions for Structured Data. Artificial Intelligence Review 53(7):5309–5351
  • Page (1954) Page ES (1954) Continuous Inspection Schemes. Biometrika 41(1/2):100–115
  • Paszke et al. (2019) Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al. (2019) PyTorch: An Imperative Style, High-performance Deep Learning Library. Advances in Neural Information Processing Systems 32
  • Patil et al. (2021) Patil MA, Kumar S, Kumar S, Garg M (2021) Concept Drift Detection for Social Media: A Survey. In: 2021 3rd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), IEEE, pp 12–16
  • Pedregosa et al. (2011) Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830
  • Pennington et al. (2014) Pennington J, Socher R, Manning CD (2014) GloVe: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1532–1543
  • Periti et al. (2022) Periti F, Ferrara A, Montanelli S, Ruskov M (2022) What is Done is Done: an Incremental Approach to Semantic Shift Detection. In: Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, pp 33–43
  • Phan et al. (2010) Phan XH, Nguyen CT, Le DT, Nguyen LM, Horiguchi S, Ha QT (2010) A Hidden Topic-based Framework toward Building Applications with Short Web Documents. IEEE Transactions on Knowledge and Data Engineering 23(7):961–976
  • Pohl et al. (2018) Pohl D, Bouchachia A, Hellwagner H (2018) Batch-based Active Learning: Application to Social Media Data for Crisis Management. Expert Systems with Applications 93:232–244
  • Polignano et al. (2019) Polignano M, Basile P, De Gemmis M, Semeraro G, Basile V, et al. (2019) AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks based on Tweets. In: CEUR Workshop Proceedings, CEUR, vol 2481, pp 1–6
  • Pradhan et al. (2015) Pradhan N, Gyanchandani M, Wadhvani R (2015) A Review on Text Similarity Technique used in IR and its Application. International Journal of Computer Applications 120(9):29–34
  • Raab et al. (2020) Raab C, Heusinger M, Schleif FM (2020) Reactive Soft Prototype Computing for Concept Drift Streams. Neurocomputing 416:340–351
  • Rabiu et al. (2022) Rabiu I, Salim N, Nasser M, Saeed F, Alromema W, Awal A, Joseph E, Mishra A (2022) Ensemble Method for Online Sentiment Classification Using Drift Detection-Based Adaptive Window Method. In: International Conference of Reliable Information and Communication Technology, Springer, pp 117–128
  • Rabiu et al. (2023) Rabiu I, Salim N, Nasser M, Da’u A, Eisa TAE, Dalam MEE (2023) Drift Detection Method Using Distance Measures and Windowing Schemes for Sentiment Classification. CMC - Computers Materials & Continua 74(3):6001–6017
  • Rakib et al. (2021) Rakib MRH, Zeh N, Milios E (2021) Efficient Clustering of Short Text Streams using Online-offline Clustering. In: Proceedings of the 21st ACM Symposium on Document Engineering, pp 1–10
  • Rayana and Akoglu (2015) Rayana S, Akoglu L (2015) Collective Opinion Spam Detection: Bridging Review Networks and Metadata. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 985–994
  • Rehurek and Sojka (2011) Rehurek R, Sojka P (2011) Gensim – Python Framework for Vector Space Modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic 3(2):2
  • Rennie et al. (2003) Rennie JD, Shih L, Teevan J, Karger DR (2003) Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In: Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp 616–623
  • Ross et al. (2012) Ross GJ, Adams NM, Tasoulis DK, Hand DJ (2012) Exponentially Weighted Moving Average Charts for Detecting Concept Drift. Pattern Recognition Letters 33(2):191–198
  • Ryzhova et al. (2021) Ryzhova A, Ryzhova D, Sochenkov I (2021) Detection of Semantic Changes in Russian Nouns with Distributional Models and Grammatical Features. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference Dialogue
  • Salton and Buckley (1988) Salton G, Buckley C (1988) Term-weighting Approaches in Automatic Text Retrieval. Information Processing & Management 24(5):513–523
  • Sandhaus (2008) Sandhaus E (2008) The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia 6(12):e26752
  • Schlechtweg et al. (2020) Schlechtweg D, McGillivray B, Hengchen S, Dubossarsky H, Tahmasebi N (2020) SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. arXiv preprint arXiv:2007.11464
  • Sebastião and Fernandes (2017) Sebastião R, Fernandes JM (2017) Supporting the Page-Hinkley test with Empirical Mode Decomposition for Change Detection. In: Foundations of Intelligent Systems: 23rd International Symposium, ISMIS 2017, Warsaw, Poland, June 26-29, 2017, Proceedings 23, Springer, pp 492–498
  • Smirnov (1948) Smirnov N (1948) Table for Estimating the Goodness of Fit of Empirical Distributions. The Annals of Mathematical Statistics 19(2):279–281
  • Soares et al. (2019) Soares E, Garcia C, Poucas R, Camargo H, Leite D (2019) Evolving Fuzzy Set-based and Cloud-based Unsupervised Classifiers for Spam Detection. IEEE Latin America Transactions 17(09):1449–1457
  • Stewart et al. (2017) Stewart I, Arendt D, Bell E, Volkova S (2017) Measuring, Predicting and Visualizing Short-term Change in Word Representation and Usage in VKontakte Social Network. In: Eleventh International AAAI Conference on Web and Social Media
  • Sun et al. (2021) Sun G, Wang Z, Ding Z, Zhao J (2021) An Ensemble Classification Algorithm for Short Text Data Stream with Concept Drifts. IAENG International Journal of Computer Science 48(4)
  • Suprem and Pu (2019a) Suprem A, Pu C (2019a) ASSED: a Framework for Identifying Physical Events Through Adaptive Social Sensor Data Filtering. In: Proceedings of the 13th ACM International Conference on Distributed and Event-based Systems, pp 115–126
  • Suprem and Pu (2019b) Suprem A, Pu C (2019b) Event Detection in Noisy Streaming Data with Combination of Corroborative and Probabilistic Sources. In: 2019 IEEE 5th International Conference on Collaboration and Internet Computing (CIC), IEEE, pp 168–177
  • Suprem et al. (2019a) Suprem A, Musaev A, Pu C (2019a) Concept Drift Adaptive Physical Event Detection for Social Media Streams. In: World Congress on Services, Springer, pp 92–105
  • Suprem et al. (2019b) Suprem A, Musaev A, Pu C (2019b) Concept Drift Adaptive Physical Event Detection for Social Media Streams. In: World Congress on Services, Springer, pp 92–105
  • Susi and Shanthi (2023) Susi E, Shanthi A (2023) Sentiment Drift Detection and Analysis in Real Time Twitter Data Streams. Computer Systems Science & Engineering 46(1)
  • Tahmasebi et al. (2021) Tahmasebi N, Borin L, Jatowt A (2021) Survey of Computational Approaches to Lexical Semantic Change Detection. Computational Approaches to Semantic Change 6:1
  • Thuma et al. (2023) Thuma BS, de Vargas PS, Garcia C, de Souza Britto Jr A, Barddal JP (2023) Benchmarking Feature Extraction Techniques for Textual Data Stream Classification. In: 2023 International Joint Conference on Neural Networks (IJCNN)
  • Tran et al. (2021) Tran B, Nguyen AD, Van LN, Than K (2021) Dynamic Transformation of Prior Knowledge into Bayesian Models for Data Streams. IEEE Transactions on Knowledge and Data Engineering
  • Tsai et al. (2019) Tsai MH, Wang Y, Kwak M, Rigole N (2019) A Machine Learning based Strategy for Election Result Prediction. In: 2019 International Conference on Computational Science and Computational Intelligence (CSCI), IEEE, pp 1408–1410
  • Van Linh et al. (2022) Van Linh N, Bach TX, Than K (2022) A Graph Convolutional Topic Model for Short and Noisy Text Streams. Neurocomputing 468:345–359
  • Vo (2022) Vo T (2022) GOWSeqStream: an Integrated Sequential Embedding and Graph-of-words for Short Text Stream Clustering. Neural Computing and Applications 34(6):4321–4341
  • Wang et al. (2014) Wang Z, Shou L, Chen K, Chen G, Mehrotra S (2014) On Summarization and Timeline Generation for Evolutionary Tweet Streams. IEEE Transactions on Knowledge and Data Engineering 27(5):1301–1315
  • Wang et al. (2019) Wang Z, Tao H, Kong Z, Chandra S, Khan L (2019) Metric Learning Based Framework for Streaming Classification with Concept Evolution. In: 2019 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 1–8
  • Wares et al. (2019) Wares S, Isaacs J, Elyan E (2019) Data Stream Mining: Methods and Challenges for Handling Concept Drift. SN Applied Sciences 1:1–19
  • Wolf et al. (2019) Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, et al. (2019) Huggingface’s Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771
  • Wu et al. (2012) Wu W, Li H, Wang H, Zhu KQ (2012) Probase: A Probabilistic Taxonomy for Text Understanding. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp 481–492
  • Yang et al. (2021) Yang S, Huang G, Zhou X, Mak V, Yearwood J (2021) EWNStream+: Effective and Real-time Clustering of Short Text Streams using Evolutionary Word Relation Network. International Journal of Information Technology & Decision Making 20(01):341–370
  • Yin et al. (2018) Yin J, Chao D, Liu Z, Zhang W, Yu X, Wang J (2018) Model-based Clustering of Short Text Streams. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 2634–2642
  • Zampieri et al. (2019) Zampieri M, Malmasi S, Nakov P, Rosenthal S, Farra N, Kumar R (2019) SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). arXiv preprint arXiv:1903.08983
  • Zhu and Ghahramani (2002) Zhu X, Ghahramani Z (2002) Learning from Labeled and Unlabeled Data with Label Propagation. CMU CALD Tech Report CMU-CALD-02-107