
1 Introduction

Learning in evolving environments is a challenging task. This difficulty is caused not only by the speed and volume of data arrival, but also by changes in the underlying distribution that may occur. Intuitively, distributional changes may degrade the performance of classification models. To this end, adaptive learning algorithms utilize drift detection methods to detect such changes and then take appropriate actions [1]. Typically, classification models are updated or retrained when a drift has been detected. Alternatively, ensemble learning algorithms are employed in an attempt to maintain accuracy [26].

It follows that, in such a setting, drift detection methods resulting in fewer false positives and fewer false negatives are preferred. Such detectors should also detect drifts as soon as possible after they occur. A drift detector with a high false positive number (or rate) causes frequent retraining, leading to more resources being used [7, 8]. On the other hand, a drift detector with a high false negative number causes a decay in classification accuracy, since it misses drift points. Such oversights are costly and should be avoided in many applications, e.g. in fraud detection and emergency response settings. Moreover, drift detectors should detect drifts with the least possible delay. Accurately approximating a drift point, i.e. detecting the drift with a short delay, is important because it not only allows maximum use of the data, but also helps us understand how the drift happened. Such insights are crucial in Business Intelligence (BI) applications. Accordingly, false positives, false negatives and detection delay are considered as evaluation measures for drift detection methods [23, 24].

We introduce the Fast Hoeffding Drift Detection Method (FHDDM) based on the requirement that the accuracy of a classification model should stay steady, or increase, as more instances are processed; otherwise, the degradation in accuracy may indicate a concept drift. The FHDDM algorithm, in a novel way, uses a sliding window and Hoeffding’s inequality [9] to compare the maximum probability of correct predictions observed so far with the most recent probability of correct predictions, for the purpose of drift detection. We will show that the FHDDM algorithm results in shorter detection delays, fewer false positives and fewer false negatives when compared to the state-of-the-art.

The remainder of this paper is organized as follows: We review related work on concept drift detection in Sect. 2. We describe the Fast Hoeffding Drift Detection Method (FHDDM) algorithm in Sect. 3. Section 4 presents an approach for evaluating drift detectors on the basis of detection delay. We conduct our experiments on synthetic and real-world datasets in Sect. 5. Finally, we conclude the paper and discuss future work in Sect. 6.

2 Related Works

Gama et al. [1] classified concept drift detectors into three general groups: (1) Sequential Analysis based Methods: These methods sequentially evaluate prediction results as they become available, and alarm for drifts when a pre-defined threshold is met. The Cumulative Sum (CUSUM) [10] and Geometric Moving Average (GMA) [11] are members of this group. (2) Statistical based Methods: These methods probe statistical parameters, such as the mean and standard deviation of prediction results, to detect drifts in a stream. The Drift Detection Method (DDM) [12], Early Drift Detection Method (EDDM) [13] and Exponentially Weighted Moving Average (EWMA) [14] belong to this group. (3) Window based Methods: These methods usually use a fixed reference window summarizing past information and a sliding window summarizing the most recent information. A significant difference between the distributions of these windows suggests the occurrence of a drift. Statistical tests or mathematical inequalities, with the null hypothesis that the two distributions are equal, can be used to decide the level of difference. Kifer’s [15], Nishida’s [16] and Bach’s [17] methods, the Adaptive Windowing (ADWIN) [18], the Hoeffding Drift Detection Methods (HDDM\(_{\mathrm{A-test}}\) and HDDM\(_{\mathrm{W-test}}\)) [19], and the SeqDrift detectors [23, 24] are members of this group. As discussed in [1], drift detectors in the second and third groups have shown better performance and have frequently been considered as benchmarks in the literature [7, 13, 16, 18, 19]. We will, thus, compare our FHDDM with DDM, EDDM, ADWIN, HDDM\(_{\mathrm{A-test}}\) and HDDM\(_{\mathrm{W-test}}\). We describe each of them below:

DDM: Drift Detection Method – DDM, by Gama et al. [12], monitors the error-rate of the classification model to detect drifts. On the basis of the PAC learning model [20], the method assumes that the error-rate of a classifier decreases or stays constant as the number of instances increases; otherwise, it suggests the occurrence of a drift. Let \(p_t\) be the error-rate of the classifier, with standard deviation \(s_t = \sqrt{p_t(1 - p_t)/t}\), at time t. As instances are processed, DDM updates the two variables \(p_{min}\) and \(s_{min}\) when \(p_t + s_t < p_{min} + s_{min}\). DDM warns for a drift when \(p_t + s_t \ge p_{min} + 2 * s_{min}\), and it detects a drift when \(p_t + s_t \ge p_{min} + 3 * s_{min}\). The values of \(p_{min}\) and \(s_{min}\) are reset when a drift is detected.
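The DDM test can be summarized in a few lines of code. The following is a minimal sketch of the rules above, under the assumption of a short warm-up period before testing; it is an illustration, not the reference implementation.

```python
# Minimal sketch of the DDM test described above (illustration only).
import math

class DDMSketch:
    def __init__(self, min_instances=30):
        self.min_instances = min_instances   # warm-up before testing (assumed here)
        self.reset()

    def reset(self):
        self.t = 0
        self.p = 0.0                         # running error-rate estimate p_t
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error = 1 for a misclassification, 0 for a correct prediction."""
        self.t += 1
        self.p += (error - self.p) / self.t  # incremental mean of the error indicator
        s = math.sqrt(self.p * (1.0 - self.p) / self.t)
        if self.t < self.min_instances:
            return "in-control"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s >= self.p_min + 3.0 * self.s_min:   # drift level
            self.reset()
            return "drift"
        if self.p + s >= self.p_min + 2.0 * self.s_min:   # warning level
            return "warning"
        return "in-control"
```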

EDDM: Early Drift Detection Method – EDDM, by Baena-Garcia et al. [13], checks the distances between wrong predictions to detect concept drifts. The algorithm is based on the observation that a drift is likely when the distances between consecutive errors become smaller. EDDM calculates the average distance between two consecutive errors, i.e. \(p'_t\), and its standard deviation \(s'_t\) at time t. It updates the two variables \(p'_{max}\) and \(s'_{max}\) when \(p'_t + 2*s'_t > p'_{max} + 2*s'_{max}\). It warns for a drift if \((p'_t + 2 * s'_t) / (p'_{max} + 2*s'_{max}) < \alpha \), and it detects a drift if \((p'_t + 2 * s'_t) / (p'_{max} + 2*s'_{max}) < \beta \). The authors set \(\alpha \) and \(\beta \) to 0.95 and 0.90, respectively. The values of \(p'_{max}\) and \(s'_{max}\) are reset if a drift is detected.
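In code, the EDDM test amounts to tracking a running mean and standard deviation of the error-to-error distances. The sketch below is a simplified illustration of the rules above; the warm-up of 30 observed errors and the incremental variance update are our own choices, not taken from the original paper.

```python
# Simplified sketch of the EDDM test described above (illustration only).
import math

class EDDMSketch:
    ALPHA, BETA = 0.95, 0.90

    def __init__(self, min_errors=30):
        self.min_errors = min_errors         # warm-up in number of errors (assumed here)
        self.reset()

    def reset(self):
        self.t = 0                           # instances seen
        self.last_error_at = 0               # position of the previous misclassification
        self.n_errors = 0
        self.mean_dist = 0.0                 # p'_t: mean distance between consecutive errors
        self.m2 = 0.0                        # sum of squared deviations (for the variance)
        self.max_bound = 0.0                 # p'_max + 2 * s'_max

    def update(self, error):
        """error = 1 for a misclassification, 0 for a correct prediction."""
        self.t += 1
        if error != 1:
            return "in-control"
        distance = self.t - self.last_error_at
        self.last_error_at = self.t
        self.n_errors += 1
        delta = distance - self.mean_dist    # incremental (Welford) update of mean/std
        self.mean_dist += delta / self.n_errors
        self.m2 += delta * (distance - self.mean_dist)
        std_dist = math.sqrt(self.m2 / self.n_errors)
        bound = self.mean_dist + 2.0 * std_dist
        self.max_bound = max(self.max_bound, bound)
        if self.n_errors < self.min_errors:
            return "in-control"
        ratio = bound / self.max_bound
        if ratio < self.BETA:                # drift level
            self.reset()
            return "drift"
        if ratio < self.ALPHA:               # warning level
            return "warning"
        return "in-control"
```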

ADWIN: Adaptive Sliding Window – ADWIN, by Bifet et al. [18], slides a window w over the prediction results to detect drifts. It examines pairs of sufficiently large sub-windows of w, i.e. \(w_0\) with size \(n_0\) and \(w_1\) with size \(n_1\), where \(w_0 \cdot w_1 = w\), for drift detection. A significant difference between the means of the two sub-windows suggests a concept drift, i.e. \(|\mu _{w_0} - \mu _{w_1}| \ge \varepsilon \) where \(\varepsilon = \sqrt{\frac{1}{2m}\ln {\frac{4}{\delta '}}}\), m is the harmonic mean of \(n_0\) and \(n_1\), \(\delta ' = \delta /n\), \(\delta \) is the confidence level and n is the size of window w. After a drift is detected, elements are dropped from the tail of the window until no significant difference remains.
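As an illustration, the cut condition for a single split of a window of 0/1 prediction outcomes can be written as below. This is only a sketch of the test above; the actual ADWIN algorithm compresses the window into exponential-histogram buckets and checks many splits efficiently.

```python
# Illustration of the ADWIN cut condition for one split (not the full algorithm).
import math

def adwin_cut_detected(window, split, delta):
    """window: list of 0/1 prediction outcomes; split: index separating w0 and w1."""
    w0, w1 = window[:split], window[split:]
    n0, n1, n = len(w0), len(w1), len(window)
    if n0 == 0 or n1 == 0:
        return False
    m = 1.0 / (1.0 / n0 + 1.0 / n1)                  # harmonic mean term used in the bound
    delta_prime = delta / n
    eps = math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / delta_prime))
    return abs(sum(w0) / n0 - sum(w1) / n1) >= eps   # significant difference -> drift
```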

HDDM\(_{\mathrm{A-test}}\) and HDDM\(_{\mathrm{W-test}}\): Hoeffding Drift Detection Methods – HDDM\(_{\mathrm{A-test}}\) and HDDM\(_{\mathrm{W-test}}\) were proposed by Frias-Blanco et al. [19]. The former compares moving averages to detect drifts. The latter uses the EWMA forgetting scheme [14] to weight the moving averages; the weighted moving averages are then compared to detect drifts. In both cases, Hoeffding’s inequality [9] is used to set an upper bound on the level of difference between the averages. The authors noted that the first and second methods are best suited for detecting abrupt and gradual drifts, respectively.

The pros and cons of all these methods will be discussed in more detail in Sect. 5. However, during our preliminary experiments, we observed that the aforementioned methods may cause high numbers of false positives and false negatives. Some resulted in long detection delays, even though they had short detection runtimes. In the next section, we introduce our Fast Hoeffding Drift Detection Method (FHDDM), developed to address these shortcomings.

3 Fast Hoeffding Drift Detection Method

We present our Fast Hoeffding Drift Detection Method (FHDDM), which uses Hoeffding’s inequality [9] to detect drifts in evolving data streams. The FHDDM algorithm slides a window of size n over the classification results. It inserts a 1 into the window if the prediction is correct, and a 0 otherwise. As inputs are processed, it calculates the probability of observing 1s in the sliding window at time t, i.e. \(p_{t}^1\), and also keeps the maximum probability of observing 1s so far, i.e. \(p_{max}^{1}\). Equation (1) shows that if the value of \(p_{t}^1\) at time t is greater than the value of \(p_{max}^{1}\), then \(p_{max}^{1}\) is updated.

$$\begin{aligned} {\textit{if } } p_{max}^{1} < p_{t}^1 \Rightarrow p_{t}^1 \rightarrow p_{max}^{1} \end{aligned}$$
(1)

On the basis of the probably approximately correct (PAC) learning model [20], the accuracy of classification should increase or stay steady as the number of instances increases; otherwise, the possibility of facing a drift increases [12]. Thus, the value of \(p_{max}^{1}\) should increase or remain steady as instances are processed. In other words, the possibility of facing a concept drift increases if \(p_{max}^{1}\) does not change while \(p_{t}^1\) decreases over time. Eventually, as in Eq. (2), a significant difference between \(p_{max}^{1}\) and \(p_{t}^1\) indicates the occurrence of a drift in the stream.

$$\begin{aligned} \varDelta p = p_{max}^1 - p_{t}^1 \ge \varepsilon _d \Rightarrow {\textit{Drift := True}} \end{aligned}$$
(2)

We use Hoeffding’s inequality to define the value of \(\varepsilon _d\) in Eq. (4). Hoeffding’s inequality has the very attractive property of being independent of the probability distribution generating the data [9, 19, 21]. It provides an upper bound on the deviation between the mean of n random variables and its expected value.

Hoeffding’s Inequality Theorem: Let \(X_1, X_2, ..., X_n\) be n independent random variables such that \(X_i \in [0, 1]\), then with probability at most \(\delta \), the difference between the empirical mean \(\overline{X} = \frac{1}{n}\sum _{i=1}^{n}X_i\) and the true mean \(E[\overline{X}]\) is at least \(\varepsilon _H\), i.e. \(Pr(|\overline{X} - E[\overline{X}]| \ge \varepsilon _H) \le \delta \), where:

$$\begin{aligned} \varepsilon _H = \sqrt{\frac{1}{2n}\ln {\frac{2}{\delta }}} \end{aligned}$$
(3)

Corollary (FHDDM test): In a stream setting, assume \(p_{t}^{1}\) is the probability of observing 1s in a sequence of n random entries, each in \(\{0, 1\}\), at time t, and \(p_{max}^1\) is the maximum such probability observed so far. Let \(\varDelta p = p_{max}^{1} - p_{t}^{1} \ge 0\) be the difference between these two probabilities. Then, given the desired \(\delta \), i.e. the probability of error allowed, Hoeffding’s inequality guarantees that a drift has happened if \(\varDelta p \ge \varepsilon _d\), where:

$$\begin{aligned} \varepsilon _{d} = \sqrt{\frac{1}{2n}\ln {\frac{1}{\delta }}} \end{aligned}$$
(4)
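The constant inside the logarithm differs between Eqs. (3) and (4) because the FHDDM test is one-sided: a drift is signalled only when \(p_{t}^{1}\) drops below \(p_{max}^{1}\). As a sanity check (our reading of the bound, not spelled out above), setting the one-sided Hoeffding tail \(Pr(E[\overline{X}] - \overline{X} \ge \varepsilon ) \le e^{-2n\varepsilon ^2}\) equal to \(\delta \) and solving for \(\varepsilon \) gives

$$\begin{aligned} e^{-2n\varepsilon _d^2} = \delta \Rightarrow \varepsilon _d = \sqrt{\frac{1}{2n}\ln {\frac{1}{\delta }}} \end{aligned}$$

which is exactly the threshold of Eq. (4).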

Figure 1 depicts an illustrative example of the FHDDM algorithm. In this example, n and \(\delta \) are set to 10 and 0.2, respectively. Using Eq. (4), the value of \(\varepsilon _d\) is equal to 0.28. A real drift occurs right after the 12th instance. The values of \(p^1\) and \(p_{max}^{1}\) remain undefined and zero, respectively, until 10 elements have been inserted into the window. After reading the first 10 elements, the window contains seven 1s, so \(p_{10}^{1}\) is equal to 0.7, and \(p_{max}^{1}\) is also set to 0.7. The 1st element is dropped from the window before the 11th prediction result is inserted. Since this prediction result is 0, the value of \(p^{1}\) decreases to 0.6. The value of \(p_{max}^{1}\) stays the same, because it is greater than the current \(p^{1}\). This process continues until the 18th element is inserted. At that moment, the difference between \(p_{max}^{1}\) and \(p_{18}^{1}\) exceeds \(\varepsilon _d\), and the FHDDM algorithm alarms for a drift.

Fig. 1. Illustration of how FHDDM works
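A quick numeric check of this example, using only the quantities stated above (our own arithmetic):

```python
# Numeric check of the Fig. 1 example (n = 10, delta = 0.2).
import math

n, delta = 10, 0.2
eps_d = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
print(round(eps_d, 2))              # 0.28, as stated in the text

p_max = 0.7                         # best windowed accuracy observed so far
print(round(p_max - eps_d, 2))      # 0.42: an alarm requires p_t^1 <= 0.42,
                                    # i.e. at most four 1s in the 10-element window
```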

We present the pseudocode of the FHDDM approach in Algorithm 1. First, an object of FHDDM is instantiated and its Detect function is then called. The result of a prediction, i.e. p, is passed to the Detect function as an input in order to determine whether a drift has occurred, in line 11. The oldest element is dropped from the sliding window if it is full; then, the new element is pushed into it, as shown in lines 12 to 15. The algorithm returns False when the window does not yet contain enough elements, as depicted in lines 16 and 17. Next, the values of \(p^{1}, p_{max}^{1}\), and \(\varDelta {p}\) are calculated or updated (lines 19 to 23). If \(\varDelta {p} \ge \varepsilon _d\), the algorithm resets its parameters and alarms for a drift by returning True.

Algorithm 1. The FHDDM pseudocode
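As the original pseudocode is not reproduced here, the following is a minimal Python sketch of the logic described above. The line numbers mentioned in the text refer to the authors’ Algorithm 1, not to this sketch, and the names used below are our own.

```python
# Minimal sketch of FHDDM as described in this section (not the MOA implementation).
import math
from collections import deque

class FHDDM:
    def __init__(self, n=25, delta=1e-7):
        self.n = n
        self.eps_d = math.sqrt(math.log(1.0 / delta) / (2.0 * n))   # Eq. (4)
        self.reset()

    def reset(self):
        self.window = deque(maxlen=self.n)   # most recent n prediction results (1 = correct)
        self.p_max = 0.0                     # maximum windowed accuracy observed so far

    def detect(self, correct_prediction):
        """Push the latest prediction result (True/False); return True if a drift is detected."""
        self.window.append(1 if correct_prediction else 0)
        if len(self.window) < self.n:        # not enough elements in the window yet
            return False
        p_t = sum(self.window) / self.n      # p_t^1: probability of 1s in the window
        if self.p_max < p_t:                 # Eq. (1): keep the best probability seen so far
            self.p_max = p_t
        if self.p_max - p_t >= self.eps_d:   # Eq. (2): significant drop -> drift
            self.reset()
            return True
        return False
```

In a prequential (test-then-train) loop, the detector would be used roughly as follows, where stream, classifier and make_new_classifier are placeholders:

```python
detector = FHDDM(n=25, delta=1e-7)
for x, y in stream:                          # hypothetical stream of (features, label) pairs
    correct = (classifier.predict(x) == y)   # test first ...
    classifier.train(x, y)                   # ... then train
    if detector.detect(correct):
        classifier = make_new_classifier()   # e.g. replace or retrain the model after a drift
```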

Window-based approaches [15–18] usually compare two (sub)windows, e.g. \(w_1\) and \(w_2\), leading to considerable memory usage [1]. That is, one window maintains historic information (from the beginning) while the second maintains the most recent information. In contrast, FHDDM compares the current accuracy of the classifier with its best accuracy, i.e. its best experience, observed so far, using one sliding window of size n. Thus, it occupies only one register, i.e. \(p_{max}^{1}\), and a sliding window of size n, where \(n \ll |w_1|, |w_2|\). Finally, unlike [12–14], since we apply Hoeffding’s inequality, our method is independent of the probability distribution of the data. Hoeffding’s inequality only assumes that instances are independent of each other, which is what makes the bound independent of the probability distribution.

4 On Evaluation of Concept Drift Detectors

True positive, false positive and false negative numbers are useful for evaluating the performance of concept drift detectors. Intuitively, a drift detector with the highest true positive, the lowest false positive and the lowest false negative values is preferred. Huang et al. [7] and Bifet et al. [18] used three types of tests to measure the true positive, false positive, and false negative values of a drift detector. For instance, to measure false positives, they generated a stream of bits from a stationary Bernoulli distribution; each time the detector alarms for a drift, one false positive is counted. Thus, one may use three such tests to count true positive, false positive, and false negative numbers. However, an approach able to count all three in a single test, for a stream generated by any probability distribution, is preferable. To this end, we introduce an approach to count true positives, false positives and false negatives by defining the acceptable delay length \(\varDelta \). The acceptable delay length is a threshold that determines how far a detected drift may be from the true drift location while still being counted as a true positive. Considering the acceptable delay length \(\varDelta \), we describe the true positive, false positive and false negative calculations as follows (a sketch of the counting procedure is given after the list):

  • True Positive (TP): A drift detector truly detects a drift that occurred at time t if it alarms for it at any time in \([t - \varDelta , t + \varDelta ]\). We call this range the acceptable detection interval. Accordingly, the true positive rate is defined as the number of drifts correctly identified over the total number of drifts in a stream. For evaluating reactive concept drift detectors, the acceptable detection interval is \([t, t + \varDelta ]\).

  • False Positive (FP): A drift detector falsely alarms for a drift if it signals a drift outside of all acceptable detection intervals. The false positive rate is defined as the number of points incorrectly considered as drifts over the total number of points which are not drifts.

  • False Negative (FN): A drift detector falsely overlooks a drift that occurred at time t if it does not alarm at any time in \([t - \varDelta , t + \varDelta ]\). The false negative rate is defined as the number of drifts incorrectly left unidentified over the total number of drifts in a stream. For reactive concept drift detectors, the interval is \([t, t + \varDelta ]\).
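A minimal sketch of this counting procedure, assuming the true drift positions and the alarm positions are given as sorted lists of instance indices, and that each alarm may confirm at most one drift (our formulation):

```python
# Count TP, FP and FN for one stream given an acceptable delay length (sketch).
def count_tp_fp_fn(true_drifts, alarms, delay, reactive=False):
    """true_drifts and alarms are sorted lists of instance indices."""
    tp, fn = 0, 0
    matched = set()
    for d in true_drifts:
        lo = d if reactive else d - delay            # acceptable detection interval
        hi = d + delay
        hit = next((a for a in alarms if lo <= a <= hi and a not in matched), None)
        if hit is None:
            fn += 1                                  # drift left undetected within its interval
        else:
            tp += 1
            matched.add(hit)                         # each alarm confirms at most one drift
    fp = len(alarms) - len(matched)                  # alarms outside every acceptable interval
    return tp, fp, fn

# Example: drifts at 20,000 and 40,000; alarms at 20,100, 31,000 and 40,500; delay = 250
# count_tp_fp_fn([20000, 40000], [20100, 31000, 40500], 250)  ->  (1, 2, 1)
```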

Figure 2, as an example, illustrates how the true positive, false positive and false negative are counted. The upper stream shows the real locations of drifts, i.e. the squares with D inside, and the lower stream shows the result of detection at each location. The squares with T inside represent the drifts detected correctly (true positive), the squares with F inside represent the points incorrectly considered as drift points (false positive), and the squares with N inside indicate undiscovered drifts (false negative). The drift detector signals for a drift within the first acceptable detection interval and so the true positive number increases. Subsequently, it incorrectly alarms for a drift and the false positive number increases. Since the detector does not alarm for a drift within the second acceptable detection interval, the false negative number increases. The figure shows that the detector incorrectly alarms for a drift at the very end of the stream.

Fig. 2. Illustration of counting true positive, false positive and false negative

In data stream mining, resource usage will be high if the drift detector repeatedly raises false alarms. Further, the error-rate, or cost, of classification will be high if the drift detector cannot correctly locate drifts; in other words, the classification error-rate typically increases with the false negative number [7, 8, 18]. Therefore, false positives and false negatives are essential measures for evaluating concept drift detectors.

The detection delay may also be considered as a performance measure for drift detectors. A shorter detection delay means that less data is lost for learning, i.e. more instances from the new distribution can be used for learning. The detection runtime and memory usage of drift detectors can also be used as performance measures. Intuitively, a drift detector able to correctly find drifts with shorter delays while consuming fewer resources is preferred.

5 Experimental Analysis

We discuss our experimental results by comparing the performance of FHDDM against that of DDM, EDDM, ADWIN, HDDM\(_{\mathrm{A-test}}\) and HDDM\(_{\mathrm{W-test}}\). We ran the experiments on synthetic and real-world datasets often used in concept drift detection research [6, 7, 12–14, 19]. We considered Hoeffding Tree (HT), also known as VFDT, and Naive Bayes (NB) as our incremental classifiers; they are frequently used in the literature [6, 7, 12, 13, 16, 18, 19]. In all experiments, we ran the Hoeffding Tree with \(\delta = 10^{-7}, \tau = 0.05\) and \(n_{min} = 200\), as used in [21]. Instances are processed prequentially, which means they are first tested and then used for training. We used MOA [22], a framework for data stream mining, to implement FHDDM and compare it with the other drift detectors. Experiments were run on an Intel Core i5 @ 2.8 GHz with 16 GB of RAM running Apple OS X Yosemite.

5.1 Experiments on Synthetic Datasets

Synthetic Datasets – We generated three synthetic datasets, Sine1, Mixed and Circles, as originally described in [25] and used in the literature [12, 13, 16], each containing 100,000 instances with 2 classes. We also added 10 % noise to each dataset. In this way, we can assess how robust the drift detectors are against noisy data streams, i.e. their ability to distinguish noise from drift. One of the advantages of synthetic datasets is that the locations of the drifts are known. Therefore, we can measure the detection delay, true positive, false positive and false negative numbers (or rates). The datasets are described below:

  • Sine1 – with abrupt concept drift: The dataset has two attributes x and y uniformly distributed in [0, 1]. The classification function is \(y = sin(x)\). Before the first drift, instances under the curve are classified as positive and the others as negative. At a drift point, the classification is reversed. We placed the drifts at every 20,000 instances (a generator sketch for this dataset follows the list).

  • Mixed – with abrupt concept drift: The dataset has two numeric attributes x and y uniformly distributed in [0, 1], as well as two boolean attributes v and w. Instances are classified as positive if at least two of the three following conditions are satisfied: \(v, w, y < 0.5 + 0.3 * sin(2\pi x)\). The classification is reversed after each drift. Drifts happen at every 20,000 instances.

  • Circles – with gradual concept drift: The dataset has two attributes x and y uniformly distributed in [0, 1]. A circle \(\langle (x_c,y_c), r_c\rangle \) is defined by \((x - x_c)^2 + (y - y_c)^2 = r_c^2\), where \((x_c, y_c)\) is its centre and \(r_c\) is its radius. Four circles, \(\langle (0.2,0.5),0.15\rangle \), \(\langle (0.4,0.5),0.2\rangle \), \(\langle (0.6,0.5),0.25\rangle \) and \(\langle (0.8,0.5),0.3\rangle \), are used as classification functions in turn. Instances inside the circle are classified as positive. A drift happens when the classification function, i.e. the circle, changes. Drifts occur at every 25,000 instances.
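The following is a sketch of a Sine1-style generator with abrupt drifts every 20,000 instances and 10 % class noise, reconstructed from the description above (not the original generator):

```python
# Sketch of a Sine1-style stream: label reversal at every 20,000 instances, 10 % noise.
import math
import random

def sine1_stream(n_instances=100_000, drift_every=20_000, noise=0.10, seed=1):
    rng = random.Random(seed)
    for i in range(n_instances):
        x, y = rng.random(), rng.random()
        label = 1 if y <= math.sin(x) else 0   # positive below the curve y = sin(x)
        if (i // drift_every) % 2 == 1:        # reverse the classification after each drift
            label = 1 - label
        if rng.random() < noise:               # 10 % class noise
            label = 1 - label
        yield (x, y), label
```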

Experiments – We ran Hoeffding Tree (HT) and Naive Bayes (NB) with each drift detector 100 times and then averaged the detection delays, true positives, false positives, false negatives, detection runtimes (in milliseconds) and memory usage (in bytes) of the drift detectors, as well as the accuracies of the classifiers. The acceptable drift detection delay length, i.e. \(\varDelta \), was set to 250 for the Sine1 and Mixed datasets and to 1000 for the Circles dataset. We consider a longer \(\varDelta \) for the Circles dataset because it contains gradual concept drifts. Preliminary experiments and inspections confirmed that a longer \(\varDelta \) should be considered for gradual drifts, otherwise the false negative numbers would increase. We ran FHDDM with a sliding window of size 25 on the Sine1 and Mixed datasets, and with a sliding window of size 100 on the Circles dataset. We considered a wider sliding window for the Circles dataset to ensure that the window holds enough examples when facing gradual drifts. Preliminary inspections helped us adjust the window sizes for shorter detection delays, fewer false positives and fewer false negatives. Since FHDDM’s sliding windows are small and it compares \(p_{t}^1\) with \(p_{max}^1\), we need to set \(\delta \) to a small value to make sure that \(\varepsilon _d\) is big enough. We therefore set \(\delta \) to \(10^{-7}\) in our experiments. All other drift detectors were run with the default parameters as set in MOA (or as in the original papers).
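For reference, plugging these settings into Eq. (4) gives the thresholds implied by the chosen parameters (our own arithmetic):

```python
# Thresholds epsilon_d implied by Eq. (4) for the two experimental settings above.
import math

for n in (25, 100):
    eps_d = math.sqrt(math.log(1.0 / 1e-7) / (2.0 * n))
    print(n, round(eps_d, 3))   # n = 25 -> ~0.568 ; n = 100 -> ~0.284
```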

Table 1(a) presents the results of the experiments on the Sine1 dataset. FHDDM has the lowest false positive and false negative averages with both classifiers. HDDM\(_{\mathrm{W-test}}\) results in the lowest delay, followed by FHDDM with a small margin. DDM and EDDM exhibit the highest detection delays; they are also the only two drift detectors that produced false negatives. EDDM and ADWIN have considerable false positive averages. As shown in Table 1(b), the highest classification accuracies are achieved by FHDDM and HDDM\(_{\mathrm{W-test}}\) with both classifiers. ADWIN clearly has the longest runtimes and the highest memory usage, by considerable margins.

Table 1. Results of experiments on Sine1 dataset (10 % Noise)

We show the results of the experiments on the Mixed dataset in Table 2. FHDDM and ADWIN have the highest true positive averages without causing any false negatives. FHDDM has the smallest false positive averages, while EDDM and ADWIN have the highest. HDDM\(_{\mathrm{W-test}}\) and FHDDM have the shortest detection delays. As shown in Table 2(b), the highest classification accuracies are achieved by FHDDM with both classifiers. EDDM and ADWIN have the shortest and longest detection runtimes, respectively, and ADWIN occupies considerably more memory.

Table 2. Results of experiments on Mixed dataset (10 % Noise)

Tables 3(a) and (b) present the experimental results on the Circles dataset. FHDDM results in the shortest detection delay, the highest true positive, the lowest false positive and the lowest false negative averages with Hoeffding Tree. ADWIN has the shortest detection delay and the highest true positive average with Naive Bayes, followed by FHDDM. EDDM and ADWIN have the highest false positive averages. In terms of classification accuracy, the highest values are achieved by FHDDM with both classifiers. As in the previous experiments, EDDM has the shortest detection runtimes and ADWIN the highest memory usage.

Table 3. Results of experiments on Circles dataset (10 % Noise)

We compared FHDDM with the existing drift detection methods on synthetic datasets containing abrupt and gradual concept drifts. In summary, FHDDM had the first or second shortest detection delay, the highest true positive average, the lowest false positive average, and the lowest false negative average. Further, its detection runtime and memory usage were comparable to those of HDDM\(_{\mathrm{A-test}}\) and HDDM\(_{\mathrm{W-test}}\). Importantly, FHDDM led to the highest classification accuracies with both Hoeffding Tree and Naive Bayes.

5.2 Experiments on Real-World Datasets

Real-World Datasets – We considered the Airlines [26], Poker Hand [27] and Electricity [28] datasets, widely used in concept drift research [6, 7, 12, 13, 18, 19]. The preprocessed and normalized versions of the datasets are available on the MOA website. The datasets are described below:

  • Airlines: This dataset was created to be used as a non-stationary data stream for evaluating learning algorithms [26]. It contains 539,383 records of flight schedules defined by 7 attributes. The task is to predict whether a flight is delayed or not. Concept drift may appear as a result of changes in the flight schedules, e.g. changes in the day, time and length of flights.

  • Poker Hand: It comprises 1,000,000 instances with 11 attributes. Each instance is an example of a hand consisting of five playing cards drawn from a standard deck of 52. Each card is described by two attributes (suit and rank), giving ten predictive attributes. The class predicts the poker hand. Concept drift happens as the cards at hand, i.e. the poker hand, change [4].

  • Electricity: It has 45,312 instances, with 8 input attributes, recorded every half hour over a period of two years from the Australian New South Wales Electricity Market. The classification task is to predict a rise (Up) or a fall (Down) in the electricity price. Concept drift may happen because of changes in consumption habits, unexpected events and seasonality [29].

Experiments – The ground truth for drifts is not available for the real-world datasets. This implies that we do not know whether drifts occur in these datasets or where they occur [6, 7]. We therefore cannot measure the detection delay, true positive, false positive, and false negative numbers of the drift detectors in this section. We only evaluate the number of drifts detected and the accuracy of classification. All classifiers and drift detectors were run with their default parameters. For FHDDM, we only present the results obtained with a sliding window of size 25, since this size usually yielded better classification accuracies on the real-world datasets in our preliminary experiments, although the margins were small.

Table 4. The results of experiments on Airlines dataset
Table 5. The results of experiments on Poker Hand dataset
Table 6. The results of experiments on Electricity dataset

Tables 4, 5 and 6 summarize the results of the experiments on the aforementioned datasets. The classification accuracy improves when drift detectors are used. The classification accuracy with FHDDM is among the highest. FHDDM also detects fewer drifts than ADWIN, HDDM\(_{\mathrm{A-test}}\) and HDDM\(_{\mathrm{W-test}}\), while their classification accuracies are similar. As argued in [7], there are two possible explanations when a drift detector detects fewer drifts than the others while all lead to similar classification accuracies: (1) that drift detector caused fewer false positives than the others, or (2) the undetected drifts, i.e. false negatives, were less significant. The second case implies that detecting fewer drifts while obtaining a lower classification accuracy would suggest significant false negatives. Therefore, based on these arguments, it is more likely that FHDDM caused fewer false positives. Its detection runtime is also comparable with those of HDDM\(_{\mathrm{A-test}}\) and HDDM\(_{\mathrm{W-test}}\). In all cases, FHDDM resulted in shorter detection runtimes, lower memory usage and higher classification accuracies compared to ADWIN.

6 Conclusion and Future Work

Adapting classification learners is essential when they are used to learn from data in evolving environments. In this paper, we introduced a new concept drift detection method, called FHDDM, which uses Hoeffding’s inequality. The method is based on the premise that the accuracy of a classifier should increase or stay steady as more instances arrive; otherwise, drift points likely exist in the stream. FHDDM slides a window of size n over the stream and measures \(p_{t}^{1}\), i.e. the probability of correct classification predictions in the most recent n instances at time t. It updates the value of \(p_{max}^{1}\), which holds the maximum probability of correct predictions seen so far. A significant difference, bounded by Hoeffding’s inequality, between \(p_{t}^{1}\) and \(p_{max}^{1}\) suggests a drift. In addition, we introduced an approach to count the true positives, false positives and false negatives of drift detectors in evolving data streams by considering an acceptable detection delay.

We experimentally evaluated our method on synthetic and real-world datasets. The experiments on the synthetic datasets indicated that FHDDM detects drifts with shorter delays, leading to the highest true positive, the lowest false positive and the lowest false negative averages when compared to the state-of-the-art. On the real-world datasets, the classification accuracies of our method were consistently high.

In the future, we will investigate the performance of our FHDDM approach on imbalanced and highly noisy data streams, as well as streams containing outliers. We will also consider implementing an adaptable window size based on the trends of prediction results. In addition, we plan to study the sensitivity of FHDDM’s parameters, i.e. the size of the sliding window and the confidence level, along with those of other drift detectors, and consider their performance in different domains. It would also be worthwhile to compare FHDDM with other drift detectors, as proposed in [14, 24], amongst others. Finally, we intend to use our proposed method in anomaly detection and business intelligence applications.