
This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2016.2618909, IEEE Internet of
Things Journal
IEEE INTERNET OF THINGS JOURNAL, VOL. X, NO. X, XX 2016 1

Adaptive Clustering for Dynamic IoT Data Streams


Daniel Puschmann, Payam Barnaghi, Senior Member, IEEE, and Rahim Tafazolli, Senior Member, IEEE

Abstract—The emergence of the Internet of Things (IoT) has led to the production of huge volumes of real-world streaming data. We need effective techniques to process IoT data streams and to gain insights and actionable information from real-world observations and measurements. Most existing approaches are application or domain dependent. We propose a method which determines how many different clusters can be found in a stream based on the data distribution. After selecting the number of clusters, we use an online clustering mechanism to cluster the incoming data from the streams. Our approach remains adaptive to drifts by adjusting itself as the data changes. We benchmark our approach against state-of-the-art stream clustering algorithms on data streams with data drift. We show how our method can be applied in a use case scenario involving near real-time traffic data. Our results make it possible to cluster, label and interpret IoT data streams dynamically according to the data distribution. This enables adaptive online processing of large volumes of dynamic data based on the current situation. We show how our method adapts itself to the changes. We demonstrate how the number of clusters in a real-world data stream can be determined by analysing the data distributions.

Index Terms—Internet of Things, Stream Processing, Adaptive Clustering

I. INTRODUCTION

THE shift from the desktop computing era towards ubiquitous computing and the IoT has given rise to huge amounts of continuous data collected from the physical world. The data produced in the IoT context has several characteristics which make it different from other data used in common database systems and machine learning or data analytics.

IoT data can come from multiple heterogeneous sources and domains, for example numerical observations and measurements from different sensors or textual input from social media streams. Common data streams usually follow a Gaussian distribution over a long-term period. However, in IoT applications we need to consider short-term snapshots of the data, in which we can have a wider range of more sporadic distributions. Furthermore, the nature of IoT data streams is dynamic and their underlying data distribution can change over time. Another point is that the data comes in large quantities and is produced in real-time or close to real-time. This necessitates the development of IoT-specific data analytics solutions which can handle the heterogeneity, dynamicity and velocity of the data streams.

To group the data coming from the streams, we can use clustering or classification methods. Classification methods require supervised learning and need labelled training data. IoT applications usually produce huge amounts of data; however, these data lack labels, which makes classification methods infeasible. While clustering methods avoid this pitfall since they do not need supervised learning, they work best in offline scenarios where all data is present from the start and the data distribution remains fixed. In this paper, we propose a clustering method with the ability to cope with changes in the data stream, which makes it more suitable for IoT data streams.

Data is usually clustered according to different criteria, e.g. similarity and homogeneity. The clustering results in a data analysis scenario can be interpreted as categories in a dataset and can be used to assign data to various groups (i.e. clusters).

In this paper we discuss an adaptable clustering method that analyses the distribution of data and updates the cluster centroids according to the online changes in the data stream. This allows creating dynamic clusters and assigning data to these clusters not only by their features (e.g. geometric distances), but also by investigating how the data is distributed at a given time. We evaluate this clustering method against several state-of-the-art methods on evolving data streams.

To showcase the applicability of our work, we use a case study from an intelligent traffic analysis scenario. In this scenario we cluster the traffic sensor measurements according to features such as average speed of vehicles and number of cars. These clusters can then be analysed to assign them a label; for example, a cluster that always includes the highest number of cars, according to the overall density of the cars at a given time and/or the capacity of a street, will be given the "busy" tag. By further abstracting we can identify events such as traffic jams, which can be used as an input for automated decision making systems such as automatic rerouting via GPS navigators.

The remainder of the paper is organised as follows. In Section II we present the state of the art and discuss the benefits and drawbacks of different stream clustering algorithms. We present related work on analysing stream data with concept and data drifts. Section II also introduces the silhouette coefficient, which is chosen as the metric for measuring cluster quality, together with its mathematical background. In Section III, we introduce the concepts of our adaptive online clustering method, which automatically computes the best number of clusters based on the data distribution. Section IV describes the proposed adaptive clustering method in more technical detail. We present evaluations of our work in Section V. In Section V-A we compare our method against state-of-the-art methods on a synthesised data set. We have conducted a case study using traffic data and present the results in Section V-B. In Section VI, we discuss the significance of our work and outline future work.

D. Puschmann, P. Barnaghi and R. Tafazolli are with the Institute for Communication Systems at the University of Surrey.
Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
Manuscript received July 2016; revised September 2016. Month XX, 2016

2327-4662 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. RELATED WORK

Several approaches to the clustering problem exist; however, we will only take a closer look at one particular approach: Lloyd's algorithm, better known under the name k-means [1]. It should be noted that this particular approach has been selected to be improved for the purpose of streaming data clustering because of its simplicity. The concept of utilising the data distribution can also be applied to determining the parameter k for k-Median [2] or the number of classes in unsupervised multi-class support vector machines [3] and other clustering algorithms.

k-means splits a given data set into k different clusters. It does so by first choosing k random points within the data set as initial cluster centroids and then assigning each data point to the most suitable of these clusters while adjusting the centre. This process is repeated with the output as the new input arguments until the centroids converge towards stable points. Since the final result of the clustering is heavily dependent on the initial centroids, the whole process is carried out several times with different initial parameters. For a data set of fixed size this might not be a problem; however, in the context of streaming data this characteristic of the algorithm leads to heavy computational overload.

Relying on random restarts for convergence not only means that the procedure takes additional time; depending on the data set, k-means can also produce lower quality clusters. k-means++ [4] is a modification of k-means that intelligently selects the initial centroids based on randomised seeding. While the first centre is chosen randomly from a uniform distribution, each following centroid is selected with a probability weighted by its contribution to the overall potential.

STREAM [5] is a one-pass clustering algorithm which treats data sets that are too large to be processed in-memory as a data stream; however, the approach has shown limitations in cases where the data stream evolves over time, leading to misclustering. Aggarwal et al. [6] introduce their approach called CluStream, which is able to deal with these cases. CluStream is also able to give information about past clusters for a user-defined time horizon. Their solution works by dividing the stream clustering problem into an online micro-cluster component and an offline macro-clustering component. One drawback of Aggarwal et al.'s approach is that the number of clusters has to be either known in advance and fixed, or chosen by a user in each step, which means that human supervision has to be involved in the process.

Another well-known approach in stream clustering is StreamKM++ [7]. This approach is based on k-means++ [4]. However, again the number of clusters needs to be known beforehand. StreamKM++ constructs and maintains a core-set representing the data stream. After the data stream is processed, the core-set is clustered with k-means++. Because of that, StreamKM++ is not designed for evolving data streams.

There are several approaches to deal with the problem of identifying how many different clusters can be found in a data set. Chiang and Mirkin [8] have conducted an experimental study in which they compared their proposed method, ik-means, which chooses the right k, with seven other approaches. In their experiment the method which performs best in terms of choosing the number of clusters and cluster recovery works as follows: clusters are chosen based on new anomalies in the data, and a threshold based on Hartigan's rule [9] is used to eliminate small superfluous clusters.

DenStream was introduced by Cao et al. [10] to cluster streaming data under the conditions of changing data distributions and noise in data streams. DenStream creates and maintains dense micro-clusters in an online process. Whenever a clustering request is issued, a macro-cluster method (e.g. DBSCAN [11]) is used to compute the final cluster result on top of the micro-cluster centroids.

It should be noted that Chiang and Mirkin's experimental setting uses only data generated from clusters with a Gaussian distribution. We argue that data from the real world does not necessarily follow a Gaussian distribution. There is a large range of distributions which might fit the data better in different environments and applications, such as Cauchy, exponential or triangular distributions. In order to reflect this, our selection criterion for the number of clusters is the shape of the data distribution of the different data features.

Transferring the clustering problem from a fixed environment to streaming data brings another dimension into play for interpreting the data. This dimension is the situation in which the data is produced. For our purpose we define situation as the way the data is distributed in the data stream combined with statistical properties of the data stream in a time frame. This situation depends on both the location and the time. For example, categorising outdoor temperature readings into three different categories (i.e. cold, average, warm) is heavily dependent on the location, e.g. on the proximity to the equator. For example, what is considered hot weather in the UK is perceived differently somewhere in the Caribbean.

Similarly, our interpretation of data can change when we fix the location but look at measurements taken at different points in time. For example, consider a temperature reading of 10° in the UK. If this measurement was taken in winter, we would certainly consider it warm. If it was taken in summer, though, it would be considered a cold temperature. This phenomenon, where the same input data leads to a different outcome in the output, is known as concept drift [12], [13].

There are several existing methods and solutions focusing on concept drift; some of the recent works in this domain are reviewed in [14]. Over the last decade a lot of research was dedicated to handling concept drift in supervised learning scenarios, mainly utilising decision trees [15] or ensemble classifiers [16]; however, adaptation mechanisms in unsupervised methods have only recently started to be investigated [14].

There are different types of concept drift. If only the data distribution changes without any effect on the output, it is called virtual drift. Real concept drift denotes cases where the output for the same input changes. This usually has one of the following reasons. Either the perception of the categories or objectives has changed, or changes in the outcome are triggered by changes in the data distribution. We argue that in the IoT domain and especially in smart city applications, the latter type of concept drift is more important. In order to avoid confusion


between the different types of concept drift we introduce the term "data drift" to describe real concept drift that is caused by changes in the data stream.

Smith et al. [17] have developed a tool for creating data streams with data drifts through human interactions. In their experiments they have found that current unsupervised adaptation techniques such as the Nearest Centroid Classifier (NCC) [18] can fall victim to cyclic mislabelling, rendering the clustering results useless. While Smith et al. [17] found that semi-supervised (Semi-supervised NCC (SNCC)) and hybrid (Hybrid NCC (HNCC)) adaptations of the technique lead to more robust results, adaptive methods are also needed in scenarios for which labels are not available and therefore only unsupervised learning can be applied.

Cabanes et al. [19] introduce a method which constructs a synthetic representation of the data stream from which the data distribution can be estimated. Using a dissimilarity measure for comparing the data distributions, they are able to identify data drifts in the input streams. Their work is limited by the fact that they only present preliminary results and are still working on an adaptive version of their approach.

Estimating the data distribution is an essential step for identifying and handling data drifts. The data distribution can be calculated using Kernel Density Estimation (KDE) [20], [21]. The most important parameter for KDE is the bandwidth. There are different methods to choose this parameter automatically from the provided data. They include computationally light rules of thumb such as Scott's rule [22] and Silverman's rule [23] and computationally heavy methods such as cross-validation [24]. A detailed survey on bandwidth selection for KDE is provided in [25]. However, the easily computed rules are sufficient for most practical purposes.

Bifet et al. [26] introduced Massive Online Analysis (MOA), a framework for analysing evolving data streams with a broad range of techniques implemented for stream learning. Initially MOA only supported methods for stream classification; extensions of the framework have added additional functionalities for data stream analytics. Particularly interesting for the presented work is an extension of the framework which provides an easy way to compare different stream clustering algorithms. In addition to providing implementations of state-of-the-art stream clustering algorithms and evaluation measures, Kranen et al. [27] introduce new data generators for evolving streams based on randomised Radial Basis Functions (randomRBFGenerator). We compare our method against their implementations of DenStream [10] and CluStream [6], which are both designed to handle evolving data streams.

Stream clustering algorithms such as CluStream [6] and DenStream [10] stay adaptive to evolving data streams by splitting the clustering into offline and online parts. The online part continuously retrieves a representation of the data stream. This is done through the computation of micro-clusters. The micro-clusters allow for efficient and accurate computations of clusters by applying common clustering methods such as k-means++, DBSCAN [11] or similar methods as a macro-clustering step whenever a cluster request is issued by the end-user or an application which uses the stream clustering mechanism. This means that the actual clusters and the labels for the data items are only computed when a clustering request on the data stream is made.

This approach works in scenarios where the clustering result is not needed continuously. However, if the clustering result is needed on a continuous basis and the offline calculation of the data stream representation has to be issued at high frequency, the efficiency gain of these methods is lost, and the response time in online applications with large volumes of dynamic data is limited by the macro-clustering step. Therefore a new method with low computational complexity that can produce cluster results directly while processing the stream is required.

Our proposed solution to this problem is to create a clustering mechanism in which the centroids change and adapt to data drifts. We propose an adaptive method to re-calibrate and adjust the centroids.

A. Silhouette coefficient

The common metrics to evaluate the performance of clustering, such as homogeneity, completeness and v-measure, are mainly suitable for offline and static clustering methods where a ground truth in the form of class labels is available. However, in our method, as the centroids are adapted with the data drifts, these metrics will not provide an accurate view of the performance. In order to measure the effectiveness of our method we use the silhouette metric. The use of the silhouette coefficient as a criterion to choose the right value for the number of clusters has been proposed by Pravilovic et al. [28]. This metric is used in various works to measure the performance of clustering methods, including the MOA framework [26], [27] that is used in this work for the evaluation and comparisons.

The silhouette metric as a quality measure for clustering algorithms was initially proposed by Rousseeuw [29]. Intuitively it computes how well each data point fits into its assigned cluster compared to how well it would fit into the next best cluster (i.e. the cluster with the second smallest distance).

$$ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \tag{1} $$

The silhouette for one data point is defined in Equation 1, where i represents the data point, b(i) is the average distance to each of the points in the next best cluster and a(i) is the average distance to each of the points of the assigned cluster. The total silhouette score is obtained by taking the average of all s(i). From this definition we can see that the silhouette width s(i) is always between −1 and 1. The interpretation of the values is as follows: values closer to 1 represent a better categorisation of the data point to the assigned cluster, while a value close to −1 denotes a poor categorisation, i.e. the data point would have fitted better into the next-nearest cluster. It follows that a silhouette width of 0 is neutral, that is to say a data point with this value would fit equally well in both clusters. The average over all silhouette values s(i) is the overall silhouette score and can be used to compare the quality of different cluster results.

Rousseeuw [29] points out that single, strong outliers can lead to misleading results; therefore it has to be made sure that there are no singleton clusters in the results. In order to use the silhouette coefficient in a streaming setting, we have to define the time frame from which the data points are taken into account for measuring the cluster quality. A natural contender for that time frame is the time since the centroids were last re-calculated, since this is the point in time when we discovered a data drift and the new clustering has to adapt to the data stream from then on.

III. ADAPTIVE STREAMING CLUSTERING

Most of the stream clustering methods need to know in advance how many clusters can be found within a data stream, or at the very least depend on different parametrisations for their outcome. However, we are dealing with dynamic environments where the distribution of data streams can change over time. There is a need for adaptive clustering algorithms that adapt their parameters and the way they cluster the data based on changes of the data stream.

With the abundance of data produced, one of the main questions is not only what to do with the data but also what possibilities have not yet been considered. If new insights are obtained from the data, these can in turn inspire new applications and services. One can even go further and ignore any prior knowledge and assumptions (e.g. type of data categories of results) in order to retrieve such insights. However, previously known knowledge can influence the expectations and therefore can enhance and/or alter the results.

A. Finding the right number of clusters

One of the key problems in working with unknown data is how to determine the number of clusters that can be found in different segments of the data. We propose that the distribution of the data can give good indications of the categories. Since the data usually has several features, we look at the distribution of each feature. The shape of the probability distribution curve gives a good approximation of how many clusters we need to group the data.

Figure 1a shows the distribution of two different features (average speed and number of cars) from the use case described in Section V-B. Figure 1b shows the Probability Density Functions (PDFs) that were computed using KDE [20], [21] from the data shown in Figure 1a. We follow the intuition that a directional change, referred to as a turning point (tp), in the probability distribution can signify the beginning of a new category. The tps which ended up producing the best cluster result are visualised as arrows in Figure 1b. We split the PDF into areas of equi-probable distributions as visualised by the blue vertical lines. This idea is inspired by the Symbolic Aggregate Approximation (SAX) algorithm [30], where a Gaussian distribution is split into equi-probable areas that are used to map continuous data from streams to discrete symbolised representations. Following this approach we obtain smaller and denser clusters in areas with many data points, whereas in areas with fewer data points we get wider and sparser clusters.

Fig. 1: Distribution of the different features and their resulting PDFs. (a) Distribution of the features "average speed" (a) and "number of cars" (b). (b) PDF of the features "average speed" (a) and "number of cars" (b) split into equi-probable areas.

Since the number of areas in the PDF can be considered as a possible k for the k-means algorithm, the centres of these areas are considered as possible initial centroids. In contrast to the random initialisation of the original k-means [1], we propose a way to intelligently select the initial centroids. This makes the clustering part of the algorithm deterministic, and random restarts become unnecessary.

Since in general different features of a data stream do not follow the same distribution, the PDF curves obtained from different features contain more than one possible number for k and also provide different candidates for the centroids even if k happens to have the same value. Furthermore, the combination of the different feature distributions could also allow for combinations of optimal clusters which lie between the minimum and maximum number of turning points of the distribution functions. We test for:

$$ k \in [tp_{min},\ tp_{min} + tp_{max}] \tag{2} $$

How can we then decide which of these values of k and which centroid candidates lead to a better cluster result? In order to answer this question we need a metric with which we can compare the resulting clusters for different values of k when we apply the clustering mechanism to an initial sample set of size n. The metric must satisfy the following properties:
1) independence of k
2) independence of knowledge of any possible ground truth

Property one comes as no surprise. Since we have to compare cluster results with different k values, the metric must not be biased by the value of k. For instance, this would be the case if we chose variance as a comparison criterion. In this case there would be a strong bias towards higher values of k, since the variance within the clusters converges to zero as the number of clusters converges to the sample set size.

The second property is derived from the fact that our approach does not take prior knowledge into consideration. On the one hand the approach is domain independent. On the other hand one of the main objectives is to extract information which is inherent in the data itself, and which therefore should not be obstructed by assumptions of any kind. This property instantly excludes the majority of commonly used metrics for cluster evaluation. Evaluation criteria such as purity, homogeneity and completeness all evaluate the quality of the clusters by comparing the assigned labels of the data to the labels of the ground truth in one way or another.
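
As an illustration of the selection step described in this section, the following Python sketch estimates a Gaussian KDE for each feature, counts the turning points of the discretised PDF and derives the closed candidate range of Equation 2. It is a minimal sketch under stated assumptions and not the implementation evaluated in this paper: the function names, the grid size, the padding of three bandwidths, the normal-reference default bandwidth and the toy feature values are all hypothetical choices made for the example.

```python
import math
import statistics


def rule_of_thumb_bandwidth(samples):
    # Normal-reference rule of thumb; an assumed default, not the paper's setting.
    return 1.06 * statistics.stdev(samples) * len(samples) ** (-1 / 5)


def gaussian_kde(samples, bandwidth):
    """Return a callable Gaussian kernel density estimate of `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return pdf


def count_turning_points(values):
    """Count sign changes of the discrete slope, i.e. local maxima and minima."""
    count, prev_sign = 0, 0
    for left, right in zip(values, values[1:]):
        sign = (right > left) - (right < left)
        if sign == 0:
            continue  # flat step: keep the previous slope sign
        if prev_sign and sign != prev_sign:
            count += 1
        prev_sign = sign
    return count


def feature_turning_points(samples, bandwidth=None, grid_size=200):
    """Evaluate the KDE on a padded, evenly spaced grid and count turning points."""
    if bandwidth is None:
        bandwidth = rule_of_thumb_bandwidth(samples)
    pdf = gaussian_kde(samples, bandwidth)
    lo = min(samples) - 3 * bandwidth
    hi = max(samples) + 3 * bandwidth
    step = (hi - lo) / (grid_size - 1)
    return count_turning_points([pdf(lo + i * step) for i in range(grid_size)])


def candidate_ks(turning_points):
    """Closed interval [tp_min, tp_min + tp_max] of Equation 2."""
    tp_min, tp_max = min(turning_points), max(turning_points)
    return list(range(tp_min, tp_min + tp_max + 1))


# Hypothetical toy features: one dense region versus two dense regions.
speeds = [2.8, 2.9, 3.0, 3.1, 3.2]
counts = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
tps = [feature_turning_points(f, bandwidth=0.5) for f in (speeds, counts)]
print(tps, candidate_ks(tps))
```

The bandwidth is fixed explicitly in the usage lines so that the result does not depend on the rule-of-thumb default; each candidate k would then be scored with the comparison metric discussed in this section.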

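
To make the comparison metric concrete, a minimal Python sketch of the silhouette score from Equation 1 is given below. It is an illustrative sketch and not the MOA implementation used in the evaluation: the function names, the use of the Euclidean distance and the choice of s(i) = 0 for points in singleton clusters are assumptions of this example.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_distance(point, others):
    return sum(euclidean(point, o) for o in others) / len(others)


def silhouette_score(points, labels):
    """Average of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = list(clusters[label])
        own.remove(point)  # drop one copy of the point itself
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a = mean_distance(point, own)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            mean_distance(point, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical one-dimensional sample with two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good > bad)
```

Because the score needs only the points and their assigned clusters, the same function can be applied to clusterings obtained with different values of k on the same sample.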

The silhouette metric described in Section II-A satisfies both properties. In order to estimate the quality of the clusters, they are examined by computing how well the data points fit into their respective cluster in comparison to the next-nearest neighbour cluster.

B. Dealing with the data drift

In scenarios where the input data is fixed, once the k-means algorithm with random restarts converges, the clusters become fixed. The resulting clusters can then be reused to categorise similar data sets.

However, in the case of streaming data, two observations taken at two different (and widely separated) points in time do not necessarily have the same meaning and may consequently belong to different clusters. This in turn leads to a different clustering of the two observations. Identical data can have a different meaning when produced in a different situation. For example, imagine observing 50 people walking over a university campus at 3pm on a weekday. This would not be considered "busy" given the usual number of students in the area. Observing the same number of people at 3am would however be considered exceptional, giving the indication that a special event is happening.

We incorporate this into our clustering mechanism by adapting the centroids of the clusters based on the current distribution in the data stream. The data drift detection is triggered by changes in the statistical properties of the probability density function. The justification for our method is based on properties of stochastic convergence. Convergence in mean square implies convergence in probability, which in turn implies convergence in distribution [31]. The formula for convergence in mean square is given in Equation 3.

$$ \lim_{n \to \infty} E|X_n - X|^2 = 0 \tag{3} $$

During training we store the standard deviation and expected value of the data with the current distribution. When processing new incoming data, we track how the expected value and standard deviation change given the new values. Equation 3 states that, as the sequence X_n approaches infinite length, the expected squared difference between X_n and the random variable X (in our case defined by the previously computed PDF) approaches zero. However, if, as we get more and more values, the expected value instead converges such that E|X_n − X|^2 = ε for n ≫ 0 and some ε > 0, we can assume that the current time series data is no longer converging to the distribution that we predicted with the PDF estimation. Therefore we have a change in the underlying distribution of the data stream and trigger a recalibration of our method. If we can observe a higher quality in the new clusters, the old centroids will be adjusted.

A detailed description of the algorithm follows in the next section.

IV. ADAPTIVE STREAMING k-MEANS: A CLOSER LOOK

Algorithm 1 shows how the centroids in our adaptive method are computed. This takes place after a configurable initialisation period and is repeated at the beginning of each adjustment step. More information about the data collection and the adjustment can be found in Section V-B. Initially the probability density functions of each of the features of the data are computed using KDE [20], [21]. The continuous PDFs are represented by discrete arrays.

Algorithm 1 DETERMINECENTROIDS(A, k, n)
Require: Data matrix A = {a_0, a_1, ..., a_n} with each a_i being an array of length m containing all values of feature i
    % Therefore sample j is the data point [a_0[j], ..., a_n[j]]
Ensure: List C = {c_0, c_1, ..., c_max(tps)} of clusterings, with each c_i being a list of centroids with length k = tp_min + i
    pdf[] = ∅
    tps[] = ∅
    for i ← 1 to n do
        % Array containing the PDFs of each feature
        pdf[i] = gaussianKDE(a[i])
        % Array containing the number of turning points of the PDF
        tps[i] = countTurningPoints(pdf[i])
    end for
    C[] = ∅
    for i in range(min(tps), min(tps) + max(tps)) do
        betas[] = ∅
        % Each f represents the PDF of a feature
        for f in pdf do
            betas[] = findBetas(f, tps[i])
            C[i] = list of means between two adjacent betas
        end for
    end for
    return C

These PDF representations are then fed into Algorithm 2. Turning points can be determined by analysing the first derivative. They have the property that dy/dx = 0, where dx is the difference between two infinitely close x values of the PDF, dy is the difference between two infinitely close y values of the PDF and dy/dx is the slope of the PDF. This is a necessary but not sufficient criterion for having a turning point. Only if the sign of dy/dx changes from negative to positive or vice versa do we actually have a turning point in our function. These are just the definitions of local minimum and maximum points respectively. By finding these points we obtain the number of turning points of a feature PDF, and this number can be used to determine the right number of clusters.

We use the heuristic that the right number of clusters lies between the smallest number of turning points of a feature PDF and this number added to the maximum number of turning points found in any of the PDFs.

Once the number of turning points, and therefore the possible values for the number of clusters, has been computed for each feature, Algorithm 1 determines candidates for the initial centroids.

The computation then splits the PDF curve into equi-probable areas (similar to the SAX algorithm [30]). The boundaries of these areas are called beta points. Since we are interested in the

centres of these regions, the midpoints between two adjacent betas are computed and saved as
initial centroids.

Algorithm 2 COUNTTURNINGPOINTS(f)
Require: Array f representing a probability density distribution
Ensure: Number of turning points tps
 1: ∆y = εy ∗ max(f)
 2: % Little gradient should be recognised as no gradient
 3: ∆x = εx ∗ max(f)
 4: % Areas of no ascent/descent should only be counted if they last long enough
 5: for x in f do
 6:   if changedDirection(x, f, ∆y, ∆x) then
 7:     tps++
 8:   end if
 9: end for
10: return tps

Algorithm 3 STREAMINGKMEANS(D)
Require: Data stream D, length of data sequence used for initialisation l
Ensure: Continuous clustering of the data input stream
 1: % Initialisation phase
 2: for cCs in determineCentroids(D.next(l)) do
 3:   currentk = length(cCs)
 4:   nCs = kmeans(cCs, D.next(l), currentk)
 5:   if silhouette(nCs) > lastSil then
 6:     % We found a new best clustering
 7:     centroids = nCs
 8:     lastSil = silhouette(nCs)
 9:     k = currentk
10:   end if
11: end for
12: % Continuous clustering phase
13: loop
14:   if changeDetected(D) then
15:     centroids = determineCentroids()
16:     centroids =
17:       kmeans(centroids, D.getData(), k)
18:   else
19:     centroids =
20:       kmeans(centroids, D.getData(), k)
21:   end if
22: end loop

Once the candidates for the initial clusters are identified, one for each feature, a normal
k-means is run on the dataset with the initial centroids as starting points for the clustering.
The results are then compared by computing the silhouettes. For an in-depth description we
refer the reader to the paper by Rousseeuw [29]; however, a short elaboration on the
mathematical background is also given in Section II-A.
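
The selection among candidate initialisations (the initialisation phase of Algorithm 3) could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the helper names kmeans, silhouette and best_clustering, the fixed number of Lloyd iterations and the toy data are all hypothetical, and the silhouette is computed exhaustively rather than on a sample.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations started from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


def silhouette(points, labels):
    """Mean silhouette width over all points; 0.0 with fewer than two clusters."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(p, q) for j, q in enumerate(points)
                     if labels[j] == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        if labels[i] not in mean_dist:
            continue  # singleton cluster contributes a silhouette of 0
        a = mean_dist.pop(labels[i])   # mean distance within own cluster
        b = min(mean_dist.values())    # mean distance to nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_clustering(points, candidates):
    """Run k-means once per candidate centroid set and keep the best silhouette."""
    best = None
    for initial in candidates:
        centroids, labels = kmeans(points, initial)
        score = silhouette(points, labels)
        if best is None or score > best[0]:
            best = (score, centroids, labels)
    return best


# Toy data (assumed): three well-separated groups, candidates for k = 2 and k = 3.
blobs = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (0, 10), (0, 11), (1, 10)]
candidates = [[(0, 0), (10, 0)], [(0, 0), (10, 0), (0, 10)]]
score, centroids, labels = best_clustering(blobs, candidates)
print(len(centroids))  # 3
```

On this toy input the three-centroid candidate wins because merging two of the groups lowers the silhouette of every point in the merged cluster.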
Incoming data points are fed into the k-means with the current cluster centroids and assigned
to their nearest cluster. The cluster centroid of the assigned cluster is adjusted to reflect
the new centre of the cluster including the inserted value.

We give a brief complexity analysis of Algorithms 1 through 3. We start with Algorithm 2, since
it is used by Algorithm 1. Since the algorithm traverses the array f, which represents the PDF,
the complexity lies in O(length(f)). This array has exactly the same length as the array which
has been fed into the function gaussianKDE to compute the PDF representation. Algorithm 1 is
called with a matrix of dimensions n × l. Here n is the number of features and l is the length
of the initial data sequence. Therefore length(f) = l and we have a complexity of O(l) for
Algorithm 2. At the same time, the gaussianKDE function scales linearly with the size of the
input array, resulting in a complexity of O(l) as well.

For Algorithm 1 we first look at lines 3 to 6. Here, for each of the n feature vectors, both
gaussianKDE and Algorithm 2 are called. This results in a complexity of O(n · l).

We then examine lines 8 to 14. We can see that the outer for loop runs exactly max(tps) times,
which in practice is a small constant. The inner for loop runs once per PDF; since we compute
one PDF for each feature, this amounts to n iterations. The function findBetas traverses the
input array to find the beta values and therefore scales with the length of the input array.
Putting this information together results again in a complexity of O(n · l). Since both parts
of the algorithm have the same complexity, the total complexity equals O(n · l).

Algorithm 3 uses Algorithm 1 for finding the initial centroids. Running k-means with determined
initial centroids takes O(nkd) since no iterations of the algorithm are needed. Then for each
clustering we compute the silhouette score. Calculating the silhouette score is computationally
intensive since each distance pair has to be computed. Here we can apply the following steps to
increase the performance. Instead of calculating the distance pairs for each value of k, we can
initially compute the distance pair matrix and pass it to the silhouette calculation. For cases
where we have a very large number of data values, we perform random sampling to decrease the
number of distance pairs that have to be computed. In practice, sampling has been shown to
provide close approximations to the actual silhouette score at a fraction of the computational
cost. During the online clustering phase, assigning the nearest cluster to new incoming values
takes only O(1) time. Recalibrating the cluster centroids requires one call of Algorithm 1
(O(n · l)) and another run of k-means (O(nkd)).

V. EVALUATION

We test the proposed method both on synthesised data and on a real-world data set. We evaluate
the method against state-of-the-art methods using data sets which are generated in different
ways and discuss in which cases the method has an advantage over existing approaches and in
which cases it is outperformed by them.

We introduce a novel way of generating data streams with data drift. The drift is introduced
both by shifting the centroids in randomised intervals and by changing the data distribution
function used to randomly draw the data from the centroids.
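
To make these two kinds of drift concrete, here is a minimal Python sketch of a generator in this spirit. It is not the authors' published generation code: the function name drifting_stream, the probabilities shift_prob and switch_prob, the fixed spread, the step range and the uniform choice of centroid are illustrative assumptions, and only Gaussian, triangular and exponential draws from the standard library are used.

```python
import random


def drifting_stream(centroids, n_samples, shift_prob=0.01, switch_prob=0.005,
                    spread=0.1, seed=0):
    """Yield n_samples points drawn around centroids that drift over time.

    Assumed drift model: per dimension and per sample, with probability
    shift_prob all centroids move by a random step along that dimension, and
    with probability switch_prob the distribution used for that dimension is
    re-drawn at random.
    """
    rng = random.Random(seed)
    samplers = {
        "gaussian": lambda mu: rng.gauss(mu, spread),
        "triangular": lambda mu: rng.triangular(mu - spread, mu + spread, mu),
        "exponential": lambda mu: mu + rng.expovariate(1.0 / spread),
    }
    names = sorted(samplers)
    centres = [list(c) for c in centroids]   # copy so the input is not mutated
    dims = len(centres[0])
    current = [rng.choice(names) for _ in range(dims)]
    for _ in range(n_samples):
        for d in range(dims):
            if rng.random() < shift_prob:    # directional shift of random length
                step = rng.uniform(-1.0, 1.0)
                for c in centres:
                    c[d] += step
            if rng.random() < switch_prob:   # change of distribution function
                current[d] = rng.choice(names)
        centre = rng.choice(centres)         # assumption: centroids equally weighted
        yield tuple(samplers[current[d]](centre[d]) for d in range(dims))


stream = list(drifting_stream([(0.0, 0.0), (5.0, 5.0)], n_samples=1000))
print(len(stream), len(stream[0]))  # 1000 2
```

Because each dimension keeps its own entry in current, adding a feature only means adding a coordinate to every centroid.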

Here we can scale up the dimensionality of the generated data, with each feature having its own
distribution to draw the data from.

Finally, we show how our method can be used in a real-world case study by applying it to the
output of traffic sensors which measure the average speed and the number of cars in street
segments.

A. Synthesised Data

To evaluate our method we test it against two established stream clustering algorithms:
CluStream [6] (horizon = 1000, maxNumKernels = 100, kernelRadiFactor = 2) and DenStream [10]
(horizon = 1000, epsilon = 0.02, beta = 0.2, mu = 1, initPoints = 1000, offline = 2,
lambda = 0.25). For this we use two different ways of generating data streams with data drift.
The first one, randomRBFGenerator, was introduced by Kranen et al. [27]. Given an initial fixed
number of centroids and dimensions, the centroids are randomly generated and each is assigned a
standard deviation and a weight. New samples are generated as follows: using the weights of the
centroids as a selection criterion, one of the centroids is picked. By choosing a random
direction, the new sample is offset from the centroid by a vector whose length is drawn from a
Gaussian distribution with the standard deviation of the centroid. This creates clusters of
varying densities. Each time a sample is drawn, the centroids are moved with a constant speed,
initialised by an additional parameter, creating the data drift.

This, however, has the drawback that the data drift is not natural, as the centroids are
constantly shifting. We argue that during a short time frame, the data stream roughly follows a
certain distribution. The underlying distribution can then change between time frames,
triggered by situational changes. These changes can be recurring in time (for example in the
case of traffic during rush hours and off-peak times) or more sudden (for example traffic
congestion caused by an accident).

For that reason we introduce a novel way of generating data with data drift. The centroids are
selected through Latin Hypercube Sampling (LHS) [32]. The number of clusters and dimensions are
fixed beforehand. Similar to the method before, each centroid is assigned a standard deviation
and a weight. Furthermore, each dimension is given a distribution function, which is later used
to generate the data samples. Considering that each dimension represents a feature of a data
stream, this models the fact that in IoT applications we are dealing with largely heterogeneous
data streams in which the features do not follow the same data distribution. Our current
implementation supports triangular, Gaussian, exponential, and Cauchy distributions. The
implementation is easily expandable and can support other common or custom distributions. The
data generation code is available via our website at: http://iot.ee.surrey.ac.uk/

Data drift is added sporadically and independently for each dimension in two different ways.
The first is a directional change of random length. The second is that, over the course of
generating the data, the data distribution used is changed for one or more of the dimensions.
Both changes appear in random intervals.

We compare our method against CluStream and DenStream. Figure 2a shows the performance on data
generated by the randomRBFGenerator with data drift. The results for the data generated by the
introduced novel way with different numbers of features are shown in Figures 2b to 2d. 100
centroids have been used for the data generation. For the visualisation, the silhouette score
has been normalised to a range between 0 and 1 as done within the MOA framework
(http://www.cs.waikato.ac.nz/~abifet/MOA/API/_silhouette_coefficient_8java_source.html).

On the data produced by the randomRBFGenerator, our novel method consistently outperforms
CluStream by around 13%. DenStream performs better at times; however, the silhouette score of
DenStream drops below the levels of CluStream at times, suggesting that the method does not
adapt consistently to the drift within the data. As seen in Figures 2b to 2d, for the
synthesised data with different numbers of features, our novel method consistently performs
around 40% better than CluStream and more than 280% better than DenStream.

Fig. 2: Silhouette coefficient comparison on synthetic data sets. (a) Silhouette coefficient
for synthesised data generated by randomRBFGenerator; (b) Silhouette coefficient for
synthesised data with 3 features; (c) Silhouette coefficient for synthesised data with 4
features; (d) Silhouette coefficient for synthesised data with 5 features.

B. Case Study: Real-Time Traffic Data

To showcase how our approach can be applied to real-world scenarios, we use (near-)real-time
traffic data from the city of Aarhus (http://www.odaa.dk/dataset/realtids-trafikdata). 449
traffic sensors are deployed in the city which produce new values every five minutes. The data
is pulled and fed into an implementation of our clustering algorithm that is described in
Section IV. Before the value of k is computed and the initial centroids are determined, the
data is collected for one hour which equates to 5035 data points. The main data is collected
for a period of 5 months. Over the course of one
day, 122787 samples are collected. For the clustering we use number of cars and average speed
measured by the sensors. For the purpose of evaluation and visualisation, a timestamp and
location information are also added to each data point.

The intuition is that the clustering of traffic data is dependent on the time of day and the
location. This perspective allows us to cluster incoming data in a way that better reflects the
current situation. Figures 3a, 3b and 3c visualise the traffic density as a heat map in the
morning, at noon and in the evening in central Aarhus respectively. Light green areas in the
map show a low traffic density, while red areas indicate a high density. No colouring means
that there are no sensors nearby. In the morning (Figure 3a) there is only moderate traffic
density spread around the city. At noon (Figure 3b) we can see that two big centres of very
high density (strong red hue) have emerged. Figure 3c shows that in the evening there is now
only very low to moderate traffic density in the city.

Fig. 3: Traffic densities at different times in Aarhus. (a) Traffic Density Morning;
(b) Traffic Density Noon; (c) Traffic Density Evening.

Several reasons for data shift on varying degrees of temporal granularity are conceivable.
During the day the density of traffic changes according to the time. During rush hours, when
people are simultaneously trying to get to or back from work, the data distribution of traffic
data differs greatly from less busy times during working hours or at night. For the same
reasons, the data distribution is also quite different at weekends than on weekdays. During
holidays, e.g. around Christmas or Easter, the dynamics of traffic can change a great deal. All
these changes in the traffic data distribution lead to a need to reconsider what can be
categorised as a "busy" or "quiet" street. In other words, we are dealing with data drift in
these cases, as the same input leads to different output at different times.

Our approach deals with the data drift by re-calculating the centroids based on the current
distribution. This means that defining a street as "busy" or "quiet" is relative and depends on
the time of the day and the location. For example, 15 cars at a given time in one street could
mean "busy" while in another situation it could mean "very quiet". Similarly 15 cars in the
city centre can have a different meaning during the day than at night.

We use a ring buffer as a data cache that captures the data produced in the last hour. Whenever
a re-calculation process is triggered based on a detected data drift, because the data no
longer converges in mean square (see Section III-B), we use the silhouette coefficient score to
check if the new centroids lead to a better cluster quality. If that is the case, the current
centroids are then adjusted. The definition of the silhouette coefficient and its computation
can be found in Section II-A.

Figure 4 shows the computed silhouette coefficients for clustering the initial data sequence
with different numbers of clusters. The number of clusters is chosen according to the highest
value of the coefficient (here 0.450828 for 3 clusters), which is emphasised in bold. For
performance reasons, a sampling size of 1000 has been chosen to compute these values. It can be
decided based on the data input and the task at hand if the number of clusters should stay
constant through the remainder of clustering the data. In our use case scenario, we do not
change the number of clusters.

Number of Clusters    Silhouette Coefficient
3                     0.450828
4                     0.361470
5                     0.336280
6                     0.409701

Fig. 4: Silhouette Coefficients

Figure 5 shows how the centroids of the clusters change at different times on a working day and
on a Saturday for the different re-calculation times. The way the data is clustered differs
significantly between the two days. While the average speed remains roughly the same, the
number of vehicles varies a lot. Most prominently this difference can be seen in the centroids
of the cluster representing a high number of cars. For example, in Figure 5a the number of cars
is considerably higher at noon than in the evening. As a result of the changed centroids, data
points may be assigned to different clusters depending on the time, compared to using
non-adaptive clustering. For example, using the centroids of a working day at noon on the data
published during the same time on Saturday, 180 out of 3142 values would be assigned to a
different cluster.

In order to interpret how busy an area is, it is necessary to also take into consideration all
adjacent time points in that area. Therefore the output of our approach can be used to further
abstract from the data by matching patterns within a time frame in an area. The results can be
fed into an event detection and interpretation method. To clarify the reasoning behind this,
two examples are given. Let us consider a measurement of a fast moving car. Only if the
measurements in close time range are similar can we interpret this traffic as a good traffic
flow. If this is not the case, a different event has taken place, e.g. maybe an ambulance
rushed through the
street.

Another example would be the measurement of slow moving cars. If this is a singular
measurement, it could mean for example that somebody is driving slowly because s/he is
searching for a parking space. However, if the same measurement accumulates it could mean that
a traffic jam is taking place.

Fig. 5: Centroids adapting to changes in the data stream. Low, medium and high refer to the
level of traffic density the cluster is representing. (a) Cluster centroids for number of cars
on a working day; (b) Cluster centroids for average speed on a working day; (c) Cluster
centroids for number of cars on a Saturday; (d) Cluster centroids for average speed on a
Saturday.

In order to ensure that the adaptive method leads to meaningful results we have conducted
another experiment. We compare our adaptive stream clustering method with a non-adaptive
streaming version of the k-means algorithm, i.e. the centroids are never re-calculated. The
silhouette coefficients of the clusters in both settings are computed in equal time intervals.
Figure 6a shows how the silhouette coefficients compare over the course of one day. For
example, at 22/02/2014, 05:30:00 the non-adaptive approach scores a silhouette coefficient of
0.410 while the adaptive approach scores 0.538, an improvement of 31.2%. This means items
clustered by the adaptive method have a better convergence considering the distribution of the
data at that time.

Fig. 6: Silhouette coefficient on traffic data set. (a) Silhouette coefficient over the course
of one day based on hourly measures; (b) Silhouette coefficient over the course of one week
based on hourly measures.

Figure 6b shows how the silhouette coefficients compare over the course of one week. The
adaptive clustering mostly performs better than the non-adaptive. The cluster quality of the
adaptive solution follows a daily pattern. During the day the quality drops to levels of the
non-adaptive solution. This can be explained through the fact that during the night the traffic
flow is more clear cut, for example there are more cases where there is no traffic at all. The
adaptive solution is able to exploit this fact and adapt itself to produce better clusters that
are closer to the actual categories that can be found in the data streams. During the day,
there are many more data samples on the edge of clusters that could be clustered into either
one of the adjacent clusters, leading to a worse silhouette coefficient score. Here the quality
of the adaptive clustering at times drops to values near the quality of the non-adaptive.
However, the next iteration of the adaptive clustering improves the cluster quality again
automatically based on changes in the data distribution. At some points in time it drops below
the quality of the non-adaptive one: in these cases the quality quickly recovers to better
values again. Overall the mean of the silhouette coefficient is 0.41 in the non-adaptive
setting and 0.463 in the adaptive setting which translates to an average improvement of 12.2%
in cluster quality.

VI. CONCLUSION

In this paper we have introduced an adaptive clustering method that is designed for dynamic IoT
data streams. The method adapts to data drifts of the underlying data streams. The proposed
method is also able to determine the number of categories found inherently in the data stream
based on the data distribution and a cluster quality measure. The adaptive method works without
prior knowledge and is able to discover inherent categories from the data streams.

We have conducted a set of experiments using synthesised data and data taken from a traffic
use-case scenario where we analyse traffic measurements from the city of Aarhus. We run our
adaptive stream clustering method and compare it against a non-adaptive stream cluster
algorithm. Overall the clusters produced using an adaptive setting have an average improvement
of 12.2% in the cluster quality metric (i.e. silhouette coefficient) over the clusters produced
using a non-adaptive setting.

Compared to state-of-the-art stream cluster methods, our novel approach shows significant
improvements on synthesised data sets: against CluStream there are performance improvements
between 13% and 40%. On data generated by randomRBFGenerator, DenStream has better cluster
quality at a few points of the experiment, but is generally outperformed by our method. On the
other synthesised data streams, our novel approach shows an improvement of more than 280%
compared to DenStream.

The results of our clustering method can be used as an input for pattern and event recognition
methods and for analysing the real-world streaming data. To clarify our approach we
have used k-means as the underlying clustering mechanism; however, the concepts of our approach
can also be applied to other clustering methods. For the latter, the distribution analysis and
cluster update mechanisms can be directly adapted from the current work and only the cluster
and centroid adaptation mechanisms should be implemented for the other clustering solutions.

For future work we plan to apply the proposed solution to different types of multi-modal data
in the IoT domain. We will also investigate the concept drift and clustering updates based on
user requirement changes and target changes. In this work we proposed a clustering method
designed to deal with drifts in the data. For this we have not considered the spatial dimension
of the data. Spatial clustering and auto-correlation are important topics in data mining and we
aim to extend our work with solutions to this problem.

APPENDIX

We have applied our approach on an additional multivariate, real world data set well-known in
stream clustering and classification tasks, the forest cover types data set. The data set was
originally introduced by Blackard et al. [33] and is available online in the UCI Machine
Learning Repository (http://archive.ics.uci.edu/ml/datasets/Covertype). The forest cover types
for 30 x 30 meter cells were obtained from the US Forest Service (USFS) Region 2 Resource
Information System (RIS) data. The data set contains 10 continuous variables and has more than
580000 data samples.

Figure 7 shows that while DenStream performs better on average, the cluster quality drops down
to misclustering (values below 0.5) at times during the stream clustering. Our approach shows a
consistent cluster quality while processing the data stream.

Fig. 7: Forest cover type data set

ACKNOWLEDGMENT

The research leading to these results has received funding from the European Commission's
Seventh Framework Programme for the CityPulse project under grant agreement no. 609035.

REFERENCES

[1] S. P. Lloyd, “Least Squares Quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[2] P. S. Bradley, O. L. Mangasarian, and W. N. Street, “Clustering Via Concave Minimization,” Advances in Neural Information Processing Systems, pp. 368–374, 1997.
[3] L. Xu and D. Schuurmans, “Unsupervised and Semi-Supervised Multi-Class Support Vector Machines,” in Proc. of the 20th National Conference on Artificial Intelligence, 2005, pp. 904–910.
[4] D. Arthur, “k-Means++: The Advantages of Careful Seeding,” in Proc. of the 18th annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
[5] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan, “Clustering Data Streams,” Proc. 41st Annual Symposium on Foundations of Computer Science, pp. 359–366, 2000.
[6] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A Framework for Clustering Evolving Data Streams,” in Proc. of the 29th International Conference on Very Large Data Bases, 2003, pp. 81–92.
[7] M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, and C. Sohler, “StreamKM++: A Clustering Algorithm for Data Streams,” Journal of Experimental Algorithmics, vol. 17, no. 1, pp. 173–187, 2012.
[8] M. M.-t. Chiang and B. Mirkin, “Intelligent Choice of the Number of Clusters in k-Means Clustering: An Experimental Study with Different Cluster Spreads,” Journal of Classification, vol. 40, no. 1, pp. 3–40, 2010.
[9] J. A. Harrington, Clustering Algorithms. John Wiley and Sons, 1975.
[10] F. Cao, M. Ester, W. Qian, and A. Zhou, “Density-Based Clustering over an Evolving Data Stream with Noise,” in Conference on Data Mining, no. 2, 2006, pp. 328–339.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” in Conference on Knowledge Discovery & Data Mining, vol. 2, 1996, pp. 226–231.
[12] J. C. Schlimmer and R. H. Granger, “Incremental Learning from Noisy Data,” Machine Learning, vol. 1, no. 3, pp. 317–354, 1986.
[13] G. Widmer and M. Kubat, “Learning in the Presence of Concept Drift and Hidden Contexts,” Machine Learning, vol. 23, no. 1, pp. 69–101, 1996.
[14] J. Gama, I. Zliobaite, A. Bifet, M. Pecheniztkiy, and A. Bouchachia, “A Survey on Concept Drift Adaptation,” ACM Computing Surveys, vol. 46, no. 6, 2014.
[15] H. Yang and S. Fong, “Countering the Concept-Drift Problem in Big Data Using iOVFDT,” in IEEE International Congress on Big Data. IEEE, 2013, pp. 126–132.
[16] D. M. Farid, L. Zhang, A. Hossain, C. M. Rahman, R. Strachan, G. Sexton, and K. Dahal, “An Adaptive Ensemble Classifier for Mining Concept Drifting Data Streams,” Expert Systems with Applications, vol. 40, no. 15, pp. 5895–5906, 2013.
[17] J. Smith and N. Dulay, “Exploring Concept Drift using Interactive Simulations,” IEEE International Conference on Pervasive Computing and Communications Workshop (PERCOM Workshops), pp. 49–54, 2013.
[18] K. Förster, D. Roggen, and G. Tröster, “Unsupervised Classifier Self-Calibration through Repeated Context Occurrences: Is there Robustness against Sensor Displacement to Gain?” in IEEE International Symposium on Wearable Computers, 2009, pp. 77–84.
[19] G. Cabanes, Y. Bennani, and N. Grozavu, “Unsupervised Learning for Analyzing the Dynamic Behavior of Online Banking Fraud,” 2013 IEEE 13th International Conference on Data Mining Workshops, pp. 513–520, 2013.
[20] E. Parzen, “On Estimation of a Probability Density Function and Mode,” The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.
[21] M. Rosenblatt, “Remarks on Some Nonparametric Estimates of a Density Function,” The Annals of Mathematical Statistics, vol. 27, no. 3, pp. 832–837, 1956.
[22] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley and Sons, 1992.
[23] B. W. Silverman, “Density Estimation for Statistics and Data Analysis,” in Density Estimation for Statistics and Data Analysis, 1986.
[24] D. W. Scott and G. R. Terrell, “Biased and Unbiased Cross-Validation in Density Estimation,” Journal of the American Statistical Association, vol. 82, no. 400, pp. 1131–1146, 1987.
[25] M. C. Jones, J. S. Marron, and S. J. Sheather, “A Brief Survey of Bandwidth Selection for Density Estimation,” Journal of the American Statistical Association, vol. 91, no. 433, pp. 401–407, 1996.
[26] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “MOA: Massive Online Analysis,” Journal of Machine Learning Research, vol. 11, pp. 1601–1604, 2010.
[27] P. Kranen, H. Kremer, T. Jansen, T. Seidl, A. Bifet, G. Holmes, and B. Pfahringer, “Clustering Performance on Evolving Data Streams: Assessing Algorithms and Evaluation Measures within MOA,” in IEEE International Conference on Data Mining Workshops, 2010.
[28] S. Pravilovic, A. Appice, and D. Malerba, “Integrating Cluster Analysis to the ARIMA Model for Forecasting Geosensor Data,” Foundations of Intelligent Systems, pp. 234–243, 2014.
[29] P. J. Rousseeuw, “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[30] J. Lin, E. Keogh, S. Lonardi, and B. Chiu, “A Symbolic Representation of Time Series, with Implications for Streaming Algorithms,” in Proc. of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003, pp. 2–11.
[31] R. J. Serfling, Approximation Theorems of Mathematical Statistics. John Wiley and Sons, 2009, vol. 162.
[32] M. D. McKay, R. J. Beckman, and W. J. Conover, “Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code,” Technometrics, vol. 21, no. 2, pp. 239–245, 1979.
[33] J. A. Blackard and D. J. Dean, “Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables,” Computers and Electronics in Agriculture, vol. 24, no. 3, pp. 131–151, 1999.

Daniel Puschmann is currently pursuing his Ph.D. degree at the Institute for Communication
Systems, University of Surrey. His research is focused on information abstraction and
extracting actionable information from streaming data produced in the Internet of Things using
stream processing and machine learning techniques.

Payam Barnaghi is a Reader at the Institute for Communication Systems Research at the
University of Surrey. He is also the coordinator of the EU FP7 CityPulse project. His research
interests include machine learning, the Internet of Things, the Semantic Web, adaptive
algorithms, and information search and retrieval. He is a senior member of IEEE and a Fellow of
the Higher Education Academy.

Rahim Tafazolli is a professor and the Director of the Institute for Communication Systems,
University of Surrey. He has been active in research for more than 20 years and has authored
and co-authored more than 360 papers in refereed international journals and conferences. He is
the Founder and past Chairman of the IET International Conference on 3rd Generation Mobile
Communications. He is a Fellow of the IET and WWRF. He is Chairman of the EU Expert Group on
Mobile Platform (e-mobility SRA) and Chairman of the Post-IP working group in e-mobility, past
Chairman of WG3. He is a senior member of the IEEE.