
Efficient algorithms for sequence segmentation

Evimaria Terzi∗        Panayiotis Tsaparas†

∗ HIIT, Basic Research Unit, Department of Computer Science, University of Helsinki, Finland. Email: terzi@cs.helsinki.fi
† HIIT, Basic Research Unit, Department of Computer Science, University of Helsinki, Finland. Email: tsaparas@cs.helsinki.fi

Abstract

The sequence segmentation problem asks for a partition of the sequence into k non-overlapping segments that cover all data points such that each segment is as homogeneous as possible. This problem can be solved optimally using dynamic programming in O(n^2 k) time, where n is the length of the sequence. Since sequences in practice are very long, a quadratic algorithm is not an adequately fast solution. Here, we present an alternative constant-factor approximation algorithm with running time O(n^{4/3} k^{5/3}). We call this algorithm the DnS algorithm. We also consider the recursive application of the DnS algorithm, which results in a faster algorithm (O(n log log n) running time) with an O(log n) approximation factor, and we study the accuracy/efficiency tradeoff. Extensive experimental results show that these algorithms outperform other widely-used heuristics. The same algorithms can speed up solutions for other variants of the basic segmentation problem while maintaining their constant approximation factors. Our techniques can also be used in a streaming setting, with sublinear memory requirements.

1 Introduction

Recently, there has been increasing interest in the data-mining community in mining sequential data. This is due to the abundance of sequential datasets available for analysis, arising from applications in telecommunications, stock-market analysis, bioinformatics, text processing, click-stream mining and many more. The main problem associated with the analysis of these datasets is that they consist of a huge number of data points. The analysis of such data requires efficient and scalable algorithms.

A central problem related to time-series analysis is the construction of a compressed and concise representation of the data, so that it can be handled efficiently. One commonly used such representation is the piecewise-constant approximation. A piecewise-constant representation approximates a time series T of length n using k non-overlapping and contiguous segments that span the whole sequence. Each segment is represented by a single (constant) point, e.g., the mean of the points in the segment. We call this point the representative of the segment, since it represents the points in the segment. The error in this approximate representation is measured using some error function, e.g., the sum of squares. Different error functions may be used depending on the application. Given an error function, the goal is to find the segmentation of the sequence and the corresponding representatives that minimize the error in the representation of the underlying data. We call this problem a segmentation problem. Segmentation problems, particularly for multivariate time series, arise in many data mining applications, including bioinformatics [5, 15, 17] and context-aware systems [10].

This basic version of the sequence-segmentation problem can be solved optimally in time O(n^2 k) using dynamic programming [3], where n is the length of the sequence and k the number of segments. This quadratic algorithm, though optimal, is not satisfactory for data-mining applications where n is usually very large. In practice, faster heuristics are used. Though the latter are usually faster (O(n log n) or O(n)), there are no guarantees on the quality of the solutions they produce.

In this paper, we present a new divide and segment (DnS) algorithm for the sequence segmentation problem. The DnS algorithm has sub-quadratic running time, O(n^{4/3} k^{5/3}), and it is a 3-approximation algorithm for the segmentation problem. That is, the error of the segmentation it produces is provably no more than 3 times that of the optimal segmentation. Additionally, we explore several more efficient variants of the algorithm and we quantify the accuracy/efficiency tradeoff. More specifically, we define a variant that runs in time O(n log log n) and has an O(log n) approximation ratio. All algorithms can be made to use a sublinear amount of memory, making them applicable to the case where the data needs to be processed in a streaming fashion. We also propose an algorithm that requires logarithmic space and linear time, albeit with no approximation guarantees.

Extensive experiments on both real and synthetic datasets demonstrate that in practice our algorithms perform significantly better than the worst-case theoretical upper bounds. It is often the case that the more efficient variants of our algorithms are the ones that produce the best results, even though they are inferior in theory. In many cases our algorithms give results equivalent to those of the optimal algorithm. We also compare our algorithms against different popular heuristics that are known to work well in practice. Although these heuristics output results of good quality, our algorithms still perform consistently better.
This can often be achieved with a computational cost comparable to the cost of these heuristics. Finally, we show that the proposed algorithms can be applied to variants of the basic segmentation problem, like for example the one defined in [7]. We show that for this problem we achieve similar speedups for the existing approximation algorithms, while maintaining constant approximation factors.

1.1 Related Work. There is a large body of work that proposes and compares segmentation algorithms for sequential (mainly time-series) data. The papers related to this topic usually follow one of three directions: (i) propose heuristic algorithms for solving a segmentation problem faster than the optimal dynamic-programming algorithm; usually these algorithms are fast and perform well in practice; (ii) devise approximation algorithms with provable error bounds; and (iii) propose new variations of the basic segmentation problem. These variations usually impose some constraint on the structure of the representatives of the segments. Our work lies in the intersection of categories (i) and (ii), since we provide fast algorithms with bounded approximation factors. At the same time, we claim that our techniques can be used for solving problems proposed in category (iii) as well.

The bulk of the papers related to segmentation are in category (i). Since the optimal algorithm for solving the sequence segmentation problem is quadratic, faster heuristics that work well in practice are valuable. The most popular of these algorithms are the top-down and the bottom-up greedy algorithms. The first runs in time O(n) while the second needs time O(n log n). Both these algorithms work well in practice. In Section 5.1 we discuss them in detail and evaluate them experimentally. Online versions of the segmentation problem have also been studied [11, 16]. In this case, new data points arrive in an online fashion, and the goal of the segmentation algorithm is to output a good segmentation (in terms of representation error) at all points in time. In some cases, like for example in [11], it is assumed that the maximum tolerable error is also part of the input.

The most interesting work in category (ii) is presented in [8]. The authors present a fast segmentation algorithm with provable error bounds. Our work has similar motivation, but approaches the problem from a different point of view.

Variations of the basic segmentation problem have been studied extensively. In [7], the authors consider the problem of partitioning a sequence into k contiguous segments under the restriction that those segments are represented using only h < k distinct representatives. We will refer to this problem as the (k, h)-segmentation problem. Another restriction, of interest particularly in paleontological applications, is unimodality. In this variation the representatives of the segments are required to follow a unimodal curve, that is, a curve that changes curvature only once. The problem of finding unimodal segmentations is discussed in [9]. This problem can be solved optimally in polynomial time, using a variation of the basic dynamic-programming algorithm.

1.2 Roadmap. The rest of the paper is structured as follows. Section 2 provides the necessary definitions and the optimal dynamic-programming algorithm. In Section 3 we describe the basic DnS algorithm, and we analyze its running time and approximation ratio. In Section 4 we consider a recursive application of our approach, resulting in more efficient algorithms. Section 5 includes a detailed experimental evaluation of our algorithms and comparisons with other commonly used heuristics. Section 6 considers applications of our techniques to other segmentation problems. We conclude the paper in Section 7.

2 Preliminaries

Let T = (t_1, t_2, ..., t_n) be a d-dimensional sequence of length n with t_i ∈ R^d, i.e., t_i = (t_{i1}, t_{i2}, ..., t_{id}).

A k-segmentation S of a sequence of length n is a partition of {1, 2, ..., n} into k non-overlapping contiguous subsequences (segments), S = {s_1, s_2, ..., s_k}. Each segment s_i consists of |s_i| points. The representation of sequence T when segmentation S is applied to it collapses the values of the sequence within each segment s into a single value µ_s (e.g., the mean). We call this value the representative of the segment, and each point t ∈ s is "represented" by the value µ_s. Collapsing points into representatives results in less accuracy in the sequence representation. We measure this loss in accuracy using the error function E_p. Given a sequence T, the error of segmentation S is defined as

    E_p(T, S) = \left( \sum_{s \in S} \sum_{t \in s} |t - µ_s|^p \right)^{1/p}.

We consider the cases where p = 1, 2. For simplicity, we will sometimes write E_p(S) instead of E_p(T, S) when the sequence T is implied.

The segmentation problem asks for the segmentation that minimizes the error E_p. The representative of each segment depends on p. For p = 1 the optimal representative for each segment is the median of the points in the segment; for p = 2 the optimal representative of the points in a segment is their mean. Depending on the constraints one imposes on the representatives, one can consider several variants of the segmentation problem. We first consider the basic k-segmentation problem, where no constraints are imposed on the representatives of the segments. In Section 6 of the paper we consider the (k, h)-segmentation problem, a variant of the k-segmentation problem defined in [7], where only h distinct representatives can be used, for some h < k.
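To make the error definition concrete, the following minimal Python sketch (our own illustration, not code from the paper; the function name and the boundary-list representation are our choices) computes E_p(T, S) for p = 1, 2 on a one-dimensional sequence, using the median as representative for p = 1 and the mean for p = 2.

import numpy as np

def segmentation_error(T, boundaries, p=2):
    """E_p error of the segmentation of T defined by `boundaries`.

    T          : 1-d array of length n
    boundaries : sorted list of segment start indices, beginning with 0
    p          : 1 (median representatives) or 2 (mean representatives)
    """
    T = np.asarray(T, dtype=float)
    edges = list(boundaries) + [len(T)]
    total = 0.0
    for start, end in zip(edges[:-1], edges[1:]):
        seg = T[start:end]
        rep = np.median(seg) if p == 1 else seg.mean()   # optimal representative
        total += np.sum(np.abs(seg - rep) ** p)
    return total ** (1.0 / p)

# e.g. segmentation_error([1, 1, 1, 5, 5, 5], boundaries=[0, 3], p=2) == 0.0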
2.1 The segmentation problem. We now give a formal definition of the segmentation problem, and we describe the optimal algorithm for solving it. Let S_{n,k} denote the set of all k-segmentations of sequences of length n. For a sequence T and error measure E_p, we define the optimal segmentation as

    S_opt(T, k) = arg min_{S \in S_{n,k}} E_p(T, S).

That is, S_opt is the k-segmentation S that minimizes E_p(T, S). For a given sequence T of length n, the formal definition of the k-segmentation problem is the following:

PROBLEM 1. (OPTIMAL k-SEGMENTATION) Given a sequence T of length n, an integer value k, and the error function E_p, find S_opt(T, k).

Problem 1 is known to be solvable in polynomial time [3]. The solution consists of a standard dynamic-programming (DP) algorithm and can be computed in time O(n^2 k). The main recurrence of the dynamic-programming algorithm is the following:

(2.1)    E_p(S_opt(T[1...n], k)) = \min_{j < n} \{ E_p(S_opt(T[1...j], k-1)) + E_p(S_opt(T[j+1...n], 1)) \},

where T[i...j] denotes the subsequence of T that contains all points in positions from i to j, with i and j included. We note that the dynamic-programming algorithm can also be used in the case of weighted sequences, where each point is associated with a weight. Then the representatives (mean or median) are defined to be the weighted representatives (weighted mean or median).
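Recurrence (2.1) translates directly into an O(n^2 k)-time table-filling procedure. Below is a minimal Python sketch of this dynamic program for the E_2 error in its squared-error form (our own illustration rather than the authors' code); it uses prefix sums so that the cost of any candidate segment is obtained in constant time, and it returns the optimal segment boundaries by backtracking.

import numpy as np

def optimal_k_segmentation(T, k):
    """Optimal k-segmentation of a 1-d sequence under the (squared) E_2 error.

    Returns (boundaries, error): the segment start indices (0-based) and the
    total squared deviation from the segment means. Runs in O(n^2 k) time.
    """
    T = np.asarray(T, dtype=float)
    n = len(T)
    # prefix sums of values and squared values give O(1) segment costs
    ps = np.concatenate(([0.0], np.cumsum(T)))
    ps2 = np.concatenate(([0.0], np.cumsum(T ** 2)))

    def cost(i, j):            # squared error of segment T[i:j] around its mean
        s, s2, m = ps[j] - ps[i], ps2[j] - ps2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    E = [[INF] * (k + 1) for _ in range(n + 1)]   # E[j][m]: best error of T[0:j] with m segments
    B = [[0] * (k + 1) for _ in range(n + 1)]     # back-pointers: start of the last segment
    E[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):             # last segment is T[i:j]
                e = E[i][m - 1] + cost(i, j)
                if e < E[j][m]:
                    E[j][m], B[j][m] = e, i
    boundaries, j = [], n                          # recover boundaries by backtracking
    for m in range(k, 0, -1):
        j = B[j][m]
        boundaries.append(j)
    return sorted(boundaries), E[n][k]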
3 Divide and Segment for the k-segmentation problem

3.1 DIVIDE&SEGMENT algorithm. In this section we describe the DIVIDE&SEGMENT (DnS) algorithm for Problem 1. The algorithm is faster than the standard dynamic-programming algorithm and its approximation factor is constant. The main idea of the algorithm is to divide the problem into smaller subproblems, solve the subproblems optimally, and combine their solutions to form the final solution. Recurrence 2.1 is a building component of DnS. The output of the algorithm is a k-segmentation of the input sequence.

Algorithm 1 shows the outline of DnS. In step 1, the input sequence T is partitioned into χ disjoint subsequences. Each one of them is segmented optimally using dynamic programming. For subsequence T_i, the output of this step is a segmentation S_i of T_i and a set M_i of k weighted points. These are the representatives of the segments of segmentation S_i, weighted by the length of the segment they represent. All the χk representatives of the χ subsequences are concatenated to form the (weighted) sequence T'. Then the dynamic-programming algorithm is applied on T'. The k-segmentation of T' is output as the final segmentation.

Algorithm 1 The DnS algorithm
Input: Sequence T of n points, number of segments k, value χ.
Output: A segmentation of T into k segments.
1: Partition T into χ disjoint intervals T_1, ..., T_χ.
2: for all i ∈ {1, ..., χ} do
3:   (S_i, M_i) = DP(T_i, k)
4: end for
5: Let T' = M_1 ⊕ M_2 ⊕ ... ⊕ M_χ be the sequence defined by the concatenation of the representatives, weighted by the length of the interval they represent.
6: Return the optimal segmentation (S, M) of T' using the dynamic-programming algorithm.

The following example illustrates the execution of DnS.

EXAMPLE 1. Consider the time series of length n = 20 that is shown in Figure 1. We show the execution of the DnS algorithm for k = 2 and χ = 3. In step 1 the sequence is divided into three disjoint and contiguous intervals T_1, T_2 and T_3. Subsequently, the dynamic-programming algorithm is applied to each one of those intervals. The result of this step are the six weighted points on which dynamic programming is applied again. For this input sequence, the output 2-segmentation found by the DnS algorithm is the same as the optimal segmentation.

The running time of the algorithm is easy to analyze.

THEOREM 3.1. The running time of the DnS algorithm is at most O(n^{4/3} k^{5/3}) for χ = (n/k)^{2/3}.

Proof. Assume that DnS partitions T into χ equal-length intervals. The running time of the DnS algorithm as a function of χ is

    R(χ) = χ (n/χ)^2 k + (χk)^2 k = (n^2/χ) k + χ^2 k^3.

The minimum of the function R(χ) is achieved for χ_0 = (n/k)^{2/3}, and this gives R(χ_0) = 2 n^{4/3} k^{5/3}. □
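Algorithm 1 composes directly with a DP subroutine such as the one sketched in Section 2.1. The following Python fragment is a hedged illustration of the DnS wrapper for one-dimensional data under the squared E_2 error (the helper names and the weighted-point bookkeeping are our own assumptions, not the paper's code): it segments each interval optimally, forms the weighted sequence T', and segments T' again.

import numpy as np

def weighted_k_segmentation(x, w, k):
    """O(m^2 k) DP for the squared-E_2 k-segmentation of weighted points (x_i, w_i).

    Returns (starts, reps, sizes): segment start indices, the weighted mean of
    each segment, and the total weight of each segment."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    m = len(x)
    cw = np.concatenate(([0.0], np.cumsum(w)))
    cwx = np.concatenate(([0.0], np.cumsum(w * x)))
    cwx2 = np.concatenate(([0.0], np.cumsum(w * x * x)))

    def cost(i, j):    # weighted squared error of points i..j-1 around their weighted mean
        W, S, S2 = cw[j] - cw[i], cwx[j] - cwx[i], cwx2[j] - cwx2[i]
        return S2 - S * S / W

    INF = float("inf")
    E = [[INF] * (k + 1) for _ in range(m + 1)]
    B = [[0] * (k + 1) for _ in range(m + 1)]
    E[0][0] = 0.0
    for seg in range(1, k + 1):
        for j in range(seg, m + 1):
            for i in range(seg - 1, j):
                e = E[i][seg - 1] + cost(i, j)
                if e < E[j][seg]:
                    E[j][seg], B[j][seg] = e, i
    starts, j = [], m
    for seg in range(k, 0, -1):
        j = B[j][seg]
        starts.append(j)
    starts = sorted(starts)
    ends = starts[1:] + [m]
    reps = [(cwx[b] - cwx[a]) / (cw[b] - cw[a]) for a, b in zip(starts, ends)]
    sizes = [cw[b] - cw[a] for a, b in zip(starts, ends)]
    return starts, reps, sizes

def dns(T, k, chi):
    """Divide & Segment (Algorithm 1) for a 1-d sequence T, illustrative sketch."""
    T = np.asarray(T, float)
    intervals = np.array_split(T, chi)                 # step 1: chi disjoint intervals
    reps, weights, rep_starts, offset = [], [], [], 0
    for Ti in intervals:                               # steps 2-4: optimal segmentation per interval
        s, r, w = weighted_k_segmentation(Ti, np.ones(len(Ti)), min(k, len(Ti)))
        reps.extend(r); weights.extend(w)
        rep_starts.extend(offset + b for b in s)
        offset += len(Ti)
    # steps 5-6: segment the weighted sequence T' of (at most) chi*k representatives
    starts_in_Tprime, final_reps, _ = weighted_k_segmentation(
        np.asarray(reps), np.asarray(weights), min(k, len(reps)))
    boundaries_in_T = [rep_starts[i] for i in starts_in_Tprime]   # map back to positions in T
    return boundaries_in_T, final_reps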
Throughout this paper we assume that the representative of each segment can be computed in constant time in the dynamic-programming subroutine of the algorithm. For the E_2-error function, it is possible to compute the mean in constant time by storing the partial sums of squares and squares of partial sums. For the E_1-error function, we are not aware of any method that computes the median of a segment in constant time. In the sequel, we assume that in this case a preprocessing step of computing the medians of all possible segments is performed.

[Figure 1: Illustration of the DnS algorithm (plots omitted). The panels show: the input sequence T consisting of n = 20 points; the partition of the sequence into χ = 3 disjoint intervals T_1, T_2, T_3; the optimal solution of the k-segmentation problem in each partition (k = 2); the weighted sequence T' consisting of χ·k = 6 representatives; and the k-segmentation solved on T' (k = 2).]

We note that the DnS algorithm can also be used in the case where the data must be processed in a streaming fashion. Assuming that we have an estimate of the size of the sequence n, the algorithm processes the points in batches of size n/χ. For each such batch it computes the optimal k-segmentation, and stores the representatives. The space required is M(χ) = n/χ + χk. This is minimized for χ = \sqrt{n/k}, resulting in space M = 2\sqrt{nk}.
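The streaming mode of operation can be phrased as a single pass over batches of size n/χ. The sketch below is again our own illustration (it reuses the weighted_k_segmentation routine defined earlier and assumes an a-priori estimate of n); only the χk weighted representatives are kept in memory before the final segmentation.

import math

def dns_streaming(stream, n_estimate, k):
    """One-pass DnS: read points in batches of size n/chi with chi = sqrt(n/k)."""
    chi = max(1, int(math.sqrt(n_estimate / k)))
    batch_size = max(1, n_estimate // chi)
    reps, weights, batch = [], [], []
    for x in stream:                                   # stream is any iterable of numbers
        batch.append(x)
        if len(batch) == batch_size:
            _, r, w = weighted_k_segmentation(batch, [1.0] * len(batch), min(k, len(batch)))
            reps.extend(r); weights.extend(w)
            batch = []
    if batch:                                          # leftover points form the last batch
        _, r, w = weighted_k_segmentation(batch, [1.0] * len(batch), min(k, len(batch)))
        reps.extend(r); weights.extend(w)
    return weighted_k_segmentation(reps, weights, min(k, len(reps)))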
3.1.1 Analysis of the DnS algorithm. For the proof of the approximation factor of the DnS algorithm we first make the following observation.

OBSERVATION 1. Let S_i = S_opt(T_i, k) for i = 1, ..., χ, and S_opt = S_opt(T, k). If \bar{t} is the representative assigned to point t ∈ T by segmentation S_i after the completion of the for loop (Step 2) of the DnS algorithm, then we have

    \sum_{t \in T} d_p(t, \bar{t})^p = \sum_{i=1}^{χ} E_p(T_i, S_i)^p \le E_p(T, S_opt)^p.

Proof. For each interval T_i consider the segmentation points of S_opt that lie within T_i. These points, together with the starting and ending points of interval T_i, define a segmentation of T_i into k'_i segments with k'_i ≤ k. Denote this segmentation by S'_i. Then for every interval T_i and its corresponding segmentation S'_i defined as above we have that E_p(T_i, S_i) ≤ E_p(T_i, S'_i). This is true since S_i is the optimal k-segmentation for subsequence T_i and k'_i ≤ k. Thus we have

    E_p(T_i, S_i)^p \le E_p(T_i, S'_i)^p.

Summing over all T_i's we get

    \sum_{t \in T} d_p(t, \bar{t})^p = \sum_{i=1}^{χ} E_p(T_i, S_i)^p \le \sum_{i=1}^{χ} E_p(T_i, S'_i)^p = E_p(T, S_opt)^p. □

We now prove the approximation factors for the E_1 and E_2 error measures.
THEOREM 3.2. For a sequence T and error measure E_1, let OPT_1 = E_1(S_opt(T, k)) be the E_1-error of the optimal k-segmentation. Also let DnS_1 be the E_1-error of the k-segmentation output by the DnS algorithm. We have that DnS_1 ≤ 3 · OPT_1.

Proof. Let S be the segmentation of sequence T output by the DnS(T, k, χ) algorithm, and let µ_t be the representative assigned to point t ∈ T in S. Also, let λ_t denote the representative of t in the optimal segmentation S_opt(T, k). The E_1-error of the optimal segmentation is

    OPT_1 = E_1(S_opt(T, k)) = \sum_{t \in T} d_1(t, λ_t).

The E_1 error of the DnS algorithm is given by

    DnS_1 = E_1(T, S) = \sum_{t \in T} d_1(t, µ_t).

Now let \bar{t} be the representative of the segment to which point t is assigned after the completion of the for loop in Step 2 of the DnS algorithm. Due to the optimality of the dynamic-programming algorithm in Step 6 of Algorithm 1 we have

(3.2)    \sum_{t \in T} d_1(\bar{t}, µ_t) \le \sum_{t \in T} d_1(\bar{t}, λ_t).

We can now obtain the desired result as follows:

         DnS_1 = \sum_{t \in T} d_1(t, µ_t)
(3.3)          \le \sum_{t \in T} ( d_1(t, \bar{t}) + d_1(\bar{t}, µ_t) )
(3.4)          \le \sum_{t \in T} ( d_1(t, \bar{t}) + d_1(\bar{t}, λ_t) )
(3.5)          \le \sum_{t \in T} ( d_1(t, \bar{t}) + d_1(\bar{t}, t) + d_1(t, λ_t) )
(3.6)          \le 2 \sum_{t \in T} d_1(t, λ_t) + \sum_{t \in T} d_1(t, λ_t)
               = 3 · OPT_1.

Inequalities 3.3 and 3.5 follow from the triangle inequality, inequality 3.4 follows from Equation 3.2, and inequality 3.6 follows from Observation 1. □

Next we prove the 3-approximation result for E_2. For this, we need the following simple fact.

FACT 3.1. (DOUBLE TRIANGLE INEQUALITY) Let d be a distance metric. Then for points x, y and z we have

    d(x, y)^2 \le 2 · d(x, z)^2 + 2 · d(z, y)^2.

THEOREM 3.3. For a sequence T and error measure E_2, let OPT_2 = E_2(S_opt(T, k)) be the E_2-error of the optimal k-segmentation. Also let DnS_2 be the E_2-error of the k-segmentation output by the DnS algorithm. We have that DnS_2 ≤ 3 · OPT_2.

Proof. Consider the same notation as in Theorem 3.2. The E_2 error of the optimal dynamic-programming algorithm is

    OPT_2 = E_2(S_opt(T, k)) = \sqrt{ \sum_{t \in T} d_2(t, λ_t)^2 }.

Let S be the output of the DnS(T, k, χ) algorithm. The error of the DnS algorithm is given by

    DnS_2 = E_2(T, S) = \sqrt{ \sum_{t \in T} d_2(t, µ_t)^2 }.

The proof continues along the same lines as the proof of Theorem 3.2, but uses Fact 3.1 and the Cauchy-Schwarz inequality. Using the triangle inequality of d_2 we get

    DnS_2^2 = \sum_{t \in T} d_2(t, µ_t)^2
            \le \sum_{t \in T} ( d_2(t, \bar{t}) + d_2(\bar{t}, µ_t) )^2
            = \sum_{t \in T} d_2(t, \bar{t})^2 + \sum_{t \in T} d_2(\bar{t}, µ_t)^2 + 2 \sum_{t \in T} d_2(t, \bar{t}) · d_2(\bar{t}, µ_t).

From Observation 1 we have that

    \sum_{t \in T} d_2(t, \bar{t})^2 \le \sum_{t \in T} d_2(t, λ_t)^2 = OPT_2^2.

Using the above inequality, the optimality of dynamic programming in Step 6 of Algorithm 1, and Fact 3.1, we have

    \sum_{t \in T} d_2(\bar{t}, µ_t)^2 \le \sum_{t \in T} d_2(\bar{t}, λ_t)^2
                                       \le 2 \sum_{t \in T} ( d_2(t, \bar{t})^2 + d_2(t, λ_t)^2 )
                                       \le 4 \sum_{t \in T} d_2(t, λ_t)^2
                                       = 4 · OPT_2^2.
Finally, using the Cauchy-Schwarz inequality we get

    2 \sum_{t \in T} d_2(t, \bar{t}) · d_2(\bar{t}, µ_t) \le 2 \sqrt{ \sum_{t \in T} d_2(t, \bar{t})^2 } · \sqrt{ \sum_{t \in T} d_2(\bar{t}, µ_t)^2 }
                                                        \le 2 \sqrt{ OPT_2^2 } · \sqrt{ 4 · OPT_2^2 }
                                                        = 4 · OPT_2^2.

Combining all the above we conclude that DnS_2^2 ≤ 9 · OPT_2^2. □

4 Recursive DnS algorithm

The DnS algorithm applies the "divide-and-segment" idea once: it splits the sequence into subsequences, partitions each subsequence optimally, and then merges the results. We now consider the recursive DnS algorithm (RDnS), which recursively splits each of the subsequences, until no further splits are possible. Algorithm 2 shows the outline of the RDnS algorithm.

Algorithm 2 The RDnS algorithm
Input: Sequence T of n points, number of segments k, value χ.
Output: A segmentation of T into k segments.
1: if |T| ≤ B then
2:   Return the optimal partition (S, M) of T using the dynamic-programming algorithm.
3: end if
4: Partition T into χ intervals T_1, ..., T_χ.
5: for all i ∈ {1, ..., χ} do
6:   (S_i, M_i) = RDnS(T_i, k, χ)
7: end for
8: Let T' = M_1 ⊕ M_2 ⊕ ... ⊕ M_χ be the sequence defined by the concatenation of the representatives, weighted by the length of the interval they represent.
9: Return the optimal partition (S, M) of T' using the dynamic-programming algorithm.

The value B is a constant that defines the base case for the recursion. Alternatively, one could directly determine the depth ℓ of the recursive calls to RDnS. We will refer to such an algorithm as the ℓ-RDnS algorithm. For example, the simple DnS algorithm corresponds to the 1-RDnS algorithm. We also note that at every recursive call of the RDnS algorithm the number χ of intervals into which we partition the sequence may be a function of the sequence length. However, for simplicity we write χ instead of χ(n).

As a first step in the analysis of RDnS we consider the approximation ratio of the ℓ-RDnS algorithm. We can prove the following theorem.

THEOREM 4.1. The ℓ-RDnS algorithm is an O(2^ℓ) approximation algorithm for the E_1-error function, and an O(6^{ℓ/2}) approximation algorithm for the E_2-error function, with respect to Problem 1.

Proof. (Sketch) The proof in both cases follows by induction on the value of ℓ. The exact approximation ratio is 2^{ℓ+1} − 1 for E_1, and \sqrt{ (9/5) 6^ℓ − 4/5 } for E_2. We sketch the proof for E_1; the proof for E_2 follows along the same lines.

From Theorem 3.2, the theorem is true for ℓ = 1. Assume now that it is true for some ℓ ≥ 1. We will prove it for ℓ + 1. At the first level of recursion the (ℓ+1)-RDnS algorithm breaks the sequence T into χ subsequences T_1, ..., T_χ. For each one of these we call the ℓ-RDnS algorithm, producing a set R of χk representatives. Similarly to the proof of Theorem 3.2, let \bar{t} ∈ R denote the representative in R that corresponds to point t. Consider also the optimal segmentation of each of these intervals, and let O denote the corresponding set of χk representatives. Let \tilde{t} ∈ O denote the representative of point t in O. From the inductive hypothesis we have that

    \sum_{t \in T} d_1(t, \bar{t}) \le (2^{ℓ+1} − 1) \sum_{t \in T} d_1(t, \tilde{t}).

Now let µ_t be the representative of point t in the segmentation output by the (ℓ+1)-RDnS algorithm, and let λ_t denote the representative of point t in the optimal segmentation. Let RDnS_1 denote the E_1-error of the (ℓ+1)-RDnS algorithm, and OPT_1 the E_1-error of the optimal segmentation. We have that

    RDnS_1 = \sum_{t \in T} d_1(t, µ_t)    and    OPT_1 = \sum_{t \in T} d_1(t, λ_t).

From the triangle inequality we have that

    \sum_{t \in T} d_1(t, µ_t) \le \sum_{t \in T} d_1(t, \bar{t}) + \sum_{t \in T} d_1(\bar{t}, µ_t)
                               \le (2^{ℓ+1} − 1) \sum_{t \in T} d_1(t, \tilde{t}) + \sum_{t \in T} d_1(\bar{t}, µ_t).

From Observation 1 and Equation 3.2 we have that

    \sum_{t \in T} d_1(t, \tilde{t}) \le \sum_{t \in T} d_1(t, λ_t)    and    \sum_{t \in T} d_1(\bar{t}, µ_t) \le \sum_{t \in T} d_1(\bar{t}, λ_t).

Using the above inequalities and the triangle inequality we obtain

    RDnS_1 = \sum_{t \in T} d_1(t, µ_t)
           \le (2^{ℓ+1} − 1) \sum_{t \in T} d_1(t, λ_t) + \sum_{t \in T} d_1(\bar{t}, λ_t)
           \le (2^{ℓ+1} − 1) \sum_{t \in T} d_1(t, λ_t) + \sum_{t \in T} d_1(t, \bar{t}) + \sum_{t \in T} d_1(t, λ_t)
           \le 2^{ℓ+1} \sum_{t \in T} d_1(t, λ_t) + (2^{ℓ+1} − 1) \sum_{t \in T} d_1(t, \tilde{t})
           \le (2^{ℓ+2} − 1) \sum_{t \in T} d_1(t, λ_t)
           = (2^{ℓ+2} − 1) · OPT_1.

The proof for E_2 follows similarly. Instead of using the binomial identity as in the proof of Theorem 3.3, we obtain a cleaner recursive formula for the approximation error by applying the double triangle inequality. □

We now consider possible values for χ. First, we set χ to be a constant. We can prove the following theorem.

THEOREM 4.2. For any constant χ the running time of the RDnS algorithm is O(n), where n is the length of the input sequence. The algorithm can operate on data that arrive in a streaming fashion using O(log n) space.

Proof. (Sketch) The running time of the RDnS algorithm is given by the following recursion:

    R(n) = χ R(n/χ) + (χk)^2 k.

Solving the recursion we get that R(n) = O(n).

In the case that the data arrive in a stream, the algorithm can build the recursion tree online, in a bottom-up fashion. At each level of the recursion tree, we only need to maintain at most χk entries that correspond to the leftmost branch of the tree. The depth of the recursion is O(log n), resulting in O(log n) space overall. □

Therefore, for constant χ, we obtain an efficient algorithm, both in time and space. Unfortunately, we do not have any approximation guarantees, since the best approximation bound we can prove using Theorem 4.1 is O(n). We can, however, obtain significantly better approximation guarantees if we are willing to tolerate a small increase in the running time. We set χ = \sqrt{n}, where n is the length of the input sequence at each specific recursive call. That is, at each recursive call we split the sequence into \sqrt{n} pieces of size \sqrt{n}.

THEOREM 4.3. For χ = \sqrt{n} the RDnS algorithm is an O(log n) approximation algorithm for Problem 1 for both the E_1 and E_2 error functions. The running time of the algorithm is O(n log log n), using O(\sqrt{n}) space when operating in a streaming fashion.

Proof. (Sketch) It is not hard to see that after ℓ recursive calls the size of the input sequence is O(n^{1/2^ℓ}). Therefore, the depth of the recursion is O(log log n). From Theorem 4.1 we have that the approximation ratio of the algorithm is O(log n). The running time of the algorithm is given by the recurrence

    R(n) = \sqrt{n} · R(\sqrt{n}) + n k^3.

Solving the recurrence we obtain running time O(n log log n). The space required is bounded by the size of the top level of the recursion, and it is O(\sqrt{n}). □

The following corollary is an immediate consequence of the proof of Theorem 4.3, and it provides an accuracy/efficiency tradeoff.

COROLLARY 4.1. For χ = \sqrt{n}, the ℓ-RDnS algorithm is an O(2^ℓ) approximation algorithm for the E_1-error function, and an O(6^{ℓ/2}) approximation algorithm for the E_2-error function, with respect to Problem 1. The running time of the algorithm is O(n^{1+1/2^ℓ} + nℓ).

5 Experiments

5.1 Segmentation heuristics. Since sequence segmentation is a basic problem, particularly in time-series analysis, several algorithms have been proposed in the literature with the intention of improving on the running time of the optimal dynamic-programming algorithm. These algorithms have proved very useful in practice; however, no approximation bounds are known for them. For completeness we briefly describe them here.

The TOP-DOWN greedy algorithm (TD) starts with the unsegmented sequence (initially there is just a single segment) and introduces a new boundary at every greedy step. That is, in the i-th step it introduces the i-th segment boundary by splitting one of the existing i segments into two. The new boundary is selected in such a way that it minimizes the overall error. No change is made to the existing i − 1 boundary points. The splitting process is repeated until the number of segments of the output segmentation reaches k. This algorithm, or variations of it with different stopping conditions, is used in [4, 6, 14, 18]. The running time of the algorithm is O(nk).
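As an illustration of the top-down heuristic, the following Python sketch (our own, one-dimensional, E_2 error; the helper names are ours, and it omits the prefix-sum bookkeeping that a real O(nk) implementation would use) repeatedly applies the single split that reduces the error the most:

import numpy as np

def sse(seg):
    """Squared error of a segment represented by its mean."""
    seg = np.asarray(seg, float)
    return float(np.sum((seg - seg.mean()) ** 2)) if len(seg) else 0.0

def best_split(T, lo, hi):
    """Best single split of T[lo:hi]; returns (error_after_split, split_position)."""
    best = (sse(T[lo:hi]), hi)              # "no split" fallback
    for m in range(lo + 1, hi):
        e = sse(T[lo:m]) + sse(T[m:hi])
        if e < best[0]:
            best = (e, m)
    return best

def top_down(T, k):
    """Greedy TD heuristic: grow from 1 to k segments, never moving old boundaries."""
    T = np.asarray(T, float)
    boundaries = [0, len(T)]
    for _ in range(k - 1):
        gain, where = None, None
        for lo, hi in zip(boundaries[:-1], boundaries[1:]):
            e_split, m = best_split(T, lo, hi)
            g = sse(T[lo:hi]) - e_split
            if m < hi and (gain is None or g > gain):
                gain, where = g, m
        if where is None:                   # no further useful split exists
            break
        boundaries = sorted(boundaries + [where])
    return boundaries[:-1]                  # segment start indices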
In the BOTTOM-UP greedy algorithm (BU), each point initially forms a segment on its own. At each step, the two consecutive segments whose merging causes the smallest increase in the error are merged. The algorithm stops when k segments are formed. The complexity of the bottom-up algorithm is O(n log n).
BU performs well in terms of error and it has been used widely in time-series segmentation [9, 16].

The LOCAL ITERATIVE REPLACEMENT (LiR) and GLOBAL ITERATIVE REPLACEMENT (GiR) algorithms are randomized algorithms for sequence segmentation proposed in [10]. Both algorithms start with a random k-segmentation. At each step they pick one segment boundary (randomly or in some order) and search for the best position to put it back. The algorithms repeat these steps until they converge, i.e., until they can no longer improve the error of the output segmentation. The two algorithms differ in the types of replacements of the segmentation boundaries they are allowed to make. Consider a segmentation s_1, s_2, ..., s_k, and assume that both LiR and GiR pick segment boundary s_i for replacement. LiR is only allowed to put a new boundary between points s_{i−1} and s_{i+1}. On the other hand, GiR is allowed to put a new segment boundary anywhere on the sequence. Both algorithms run in time O(In), where I is the number of iterations necessary for convergence.

Although extensive experimental evidence shows that these algorithms perform well in practice, there is no known guarantee on their worst-case error ratio.

5.2 Experimental setup. We demonstrate the qualitative performance of the proposed algorithms via an extensive experimental study. For this, we compare the family of "divide-and-segment" algorithms with all the heuristics described in the previous subsection. We also explore the quality of the results given by RDnS compared to DnS for different parameters of the recursion (i.e., number of recursion levels, value of χ).

For the study we use two types of datasets: (a) synthetic and (b) real data. The synthetic data are generated as follows. First we fix the dimensionality d of the data. Then we select k segment boundaries, which are common for all the d dimensions. For the j-th segment of the i-th dimension we select a mean value µ_{ij}, which is uniformly distributed in [0, 1]. Points are then generated by adding a noise value sampled from the normal distribution N(µ_{ij}, σ^2). For the experiments we present here we have fixed the number of segments to k = 10. We have generated datasets with d = 1, 5, 10, and standard deviations varying from 0.05 to 0.9. The real datasets were downloaded from the UCR time-series data mining archive [12] (available at http://www.cs.ucr.edu/~eamonn/TSDMA/).

5.3 Performance of the DnS algorithm. Figures 2 and 3 show the performance of the different algorithms on the synthetic datasets. In particular, we plot the error ratio A/OPT, where A is the error of the solution found by each of the algorithms DnS, BU, TD, LiR and GiR, and OPT is the error of the optimal solution. The error ratio is shown as a function of the number of segments (Figure 2), or of the variance of the generated datasets (Figure 3). In all cases, the DnS algorithm consistently outperforms all the other heuristics, and the error it achieves is very close to that of the optimal algorithm. Note that, in contrast to the steady behavior of DnS, the quality of the results of the other heuristics varies with the different parameters, and no conclusions about their behavior on arbitrary datasets can be drawn.

This phenomenon is even more pronounced when we experiment with real data. Figure 4 is a sample of similar experimental results obtained using the datasets balloon, darwin, winding, xrates and phone from the UCR repository. DnS performs extremely well in terms of accuracy, and it is again very robust across different datasets and different values of k. Overall, GiR performs best among the rest of the heuristics. However, there are cases (e.g., the balloon dataset) where GiR is severely outperformed.

5.4 Exploring the benefits of the recursion. We additionally compare the basic DnS algorithm with different versions of RDnS. The first one, FULL-RDnS (full recursion), is the RDnS algorithm with the value of χ set to a constant. This algorithm runs in linear time (see Theorem 4.2); however, we have not derived any approximation bound for it (other than O(n)). The second one, SQRT-RDnS, is the RDnS algorithm with χ set to \sqrt{n}. At every recursive call of this algorithm the parent segment of size s is split into O(\sqrt{s}) subsegments of the same size. This variation of the recursive algorithm runs in time O(n log log n) and has approximation ratio O(log n) (see Theorem 4.3). We study experimentally the tradeoffs between the running time and the quality of the results obtained using the three different alternatives of the "divide-and-segment" method on synthetic and real datasets. We also compare the quality of those results with the results obtained using the GiR algorithm. We choose this algorithm for comparison since it has proved to be the best among all the other heuristics. In Figures 5 and 6 we plot the error ratios of the algorithms as a function of the number of segments and of the variance for the synthetic datasets. Figure 7 shows the experiments on the real datasets.

From the results we can make the following observations. First, all the algorithms of the divide-and-segment family perform extremely well, giving results close to the optimal segmentation and usually better than the results obtained by GiR. Full recursion (FULL-RDnS) does harm the quality of the results. However, we note that in order to study the full effect of recursion on the performance of the algorithm we set χ = 2, the minimum possible value. We believe that for larger values of χ the performance of FULL-RDnS would be closer to that of DnS (for which χ = (n/k)^{2/3}).
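For reference alongside Figures 2 and 3, the synthetic datasets of Section 5.2 are straightforward to regenerate; the following Python sketch (our own, with illustrative parameter names) produces data of that form:

import numpy as np

def generate_synthetic(n=1000, k=10, d=1, sigma=0.5, seed=0):
    """Piecewise-constant synthetic data: k segments with common boundaries
    across the d dimensions, segment means uniform in [0, 1], Gaussian noise."""
    rng = np.random.default_rng(seed)
    # k-1 interior boundaries, common to all dimensions
    cuts = np.sort(rng.choice(np.arange(1, n), size=k - 1, replace=False))
    edges = np.concatenate(([0], cuts, [n]))
    T = np.empty((n, d))
    for start, end in zip(edges[:-1], edges[1:]):
        mu = rng.uniform(0.0, 1.0, size=d)             # mean of this segment, per dimension
        T[start:end] = rng.normal(mu, sigma, size=(end - start, d))
    return T, edges[:-1]                                # data and true segment starts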
[Figure 2: Error ratio of the different algorithms with respect to OPT as a function of the number of segments. Plots omitted; panels show synthetic datasets with d = 1, 5, 10 and var = 0.5, with curves for DnS, BU, TD, LiR and GiR.]

[Figure 3: Error ratio of the different algorithms with respect to OPT as a function of the variance of the generated datasets. Plots omitted; panels show synthetic datasets with d = 1, 5, 10 and k = 10.]

[Figure 4: Error ratio of the different algorithms with respect to OPT as a function of the number of segments for different real datasets. Plots omitted; panels show the balloon, darwin, winding, shuttle, exchange-rates and phone datasets.]
[Figure 8: Error ratio of ℓ-RDnS for different numbers of recursion calls on real datasets (balloon, darwin, winding, phone). Plot omitted; the x-axis is the number of recursive calls (1 to 6).]

Finally, there are cases where SQRT-RDnS (and in some settings FULL-RDnS) performs even better than simple DnS. This phenomenon is due to the difference in the number and the positions of the splitting points the two algorithms pick for the division step. It appears that, in some cases, performing more levels of recursion helps the algorithm identify better segment boundaries, and thus produce segmentations of lower cost.

Figure 8 shows how the error of the segmentation output by ℓ-RDnS changes with the number of recursion levels, for four real datasets (balloon, darwin, phone and winding). Note that even for 5 levels of recursion the ratio never exceeds 1.008.

6 Applications to other segmentation problems

Here, we discuss the application of the simple DnS algorithm to a variant of the k-segmentation problem, namely the (k, h)-segmentation [7]. Similarly to the k-segmentation, the (k, h)-segmentation of a sequence T again asks for a partition of T into k segments. The main difference is that now the representatives of the segments are not chosen independently: in the (k, h)-segmentation only h < k distinct representatives can be used to represent the k segments. We have picked this problem to demonstrate the usefulness of the DnS algorithm because of the applicability of the (k, h)-segmentation to the analysis of long genetic sequences. For that kind of analysis, efficient algorithms for the (k, h)-segmentation problem are necessary.

Let S be a (k, h)-segmentation of the sequence T. For each segment s of the segmentation S, let ℓ_s be the representative of this segment (there are at most h distinct representatives). The error E_p of the (k, h)-segmentation is defined as

    E_p(T, S) = \left( \sum_{s \in S} \sum_{t \in s} |t - ℓ_s|^p \right)^{1/p}.

Let S_{n,k,h} denote the family of all segmentations of sequences of length n into k segments using h representatives. In a similar way to the k-segmentation, for a given sequence T of length n, error measure E_p, and given k, h ∈ N with h < k, the optimal (k, h)-segmentation is defined as

(6.7)    S_opt(T, k, h) = arg min_{S \in S_{n,k,h}} E_p(T, S).

Therefore, the optimal (k, h)-segmentation problem is defined as follows:

PROBLEM 2. (OPTIMAL (k, h)-SEGMENTATION) Given a sequence T of length n, integer values k and h with h < k ≤ n, and error function E_p, find S_opt(T, k, h).

6.1 Algorithms for the (k, h)-segmentation problem. The (k, h)-segmentation problem is known to be NP-hard for d ≥ 2 and h < k, since it contains clustering as a special case [7]. Approximation algorithms with provable approximation guarantees are presented in [7], and their running time is O(n^2 (k + h)). We now discuss two of the algorithms presented in [7]. We subsequently modify these algorithms so that they use the DnS algorithm as their subroutine.

Algorithm SEGMENTS2LEVELS: The algorithm initially solves the k-segmentation problem, obtaining a segmentation S. Then it solves the (n, h)-segmentation problem, obtaining a set L of h levels. Finally, the representative µ_s of each segment s ∈ S is assigned to the level in L that is closest to µ_s.

Algorithm CLUSTERSEGMENTS: As before, the algorithm initially solves the k-segmentation problem, obtaining a segmentation S. Each segment s ∈ S is represented by its representative µ_s weighted by the length of the segment |s|. Finally, a set L of h levels is produced by clustering the k weighted points into h clusters.

6.2 Applying DnS to the (k, h)-segmentation problem. Step 1 of both the SEGMENTS2LEVELS and CLUSTERSEGMENTS algorithms uses the optimal dynamic-programming algorithm for solving the k-segmentation problem. Using DnS instead, we can achieve the following approximation results:

THEOREM 6.1. If algorithm SEGMENTS2LEVELS uses DnS for obtaining the k-segmentation, and the clustering step is done using an α-approximation algorithm, then the overall approximation factor of SEGMENTS2LEVELS is (6 + α) for both the E_1 and E_2 error measures.

When the data points are of dimension 1 (d = 1), clustering can be solved optimally using dynamic programming and thus the approximation factor is 7 for both the E_1 and E_2 error measures. For d > 1 and for both the E_1 and E_2 error measures the best α is 1 + ε, using the algorithms proposed in [1] and [13] respectively.

THEOREM 6.2. Algorithm CLUSTERSEGMENTS that uses DnS for obtaining the k-segmentation has approximation factor \sqrt{29} for the E_2-error measure, and 11 for the E_1-error measure.
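To make the CLUSTERSEGMENTS pipeline concrete, here is a minimal Python sketch (our own illustration; it reuses the dns routine sketched in Section 3, and it uses plain weighted Lloyd iterations as one possible clustering step for E_2, not the (1 + ε)-approximation algorithms of [1, 13]):

import numpy as np

def weighted_kmeans_1d(points, weights, h, iters=100, seed=0):
    """Weighted Lloyd iterations on 1-d points: one possible clustering step."""
    pts, w = np.asarray(points, float), np.asarray(weights, float)
    rng = np.random.default_rng(seed)
    centers = rng.choice(pts, size=h, replace=False)
    for _ in range(iters):
        assign = np.argmin(np.abs(pts[:, None] - centers[None, :]), axis=1)
        for c in range(h):
            mask = assign == c
            if mask.any():
                centers[c] = np.average(pts[mask], weights=w[mask])
    return centers, assign

def cluster_segments(T, k, h, chi):
    """CLUSTERSEGMENTS with DnS in its first step (illustrative sketch).

    Returns (boundaries, levels): the k segment start positions in T and the
    level (one of h values) assigned to each segment."""
    boundaries, reps = dns(T, k, chi)                       # step 1: k-segmentation via DnS
    ends = boundaries[1:] + [len(T)]
    weights = [e - b for b, e in zip(boundaries, ends)]     # weight = segment length
    centers, assign = weighted_kmeans_1d(reps, weights, h)  # step 2: cluster into h levels
    levels = [centers[a] for a in assign]
    return boundaries, levels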
[Figure 5: Error ratio of the DnS and RDnS algorithms with respect to OPT for synthetic datasets, as a function of the number of segments. Plots omitted; panels show d = 1, 5, 10 with var = 0.5, with curves for DnS, Sqrt-RDnS, Full-RDnS and GiR.]

[Figure 6: Error ratio of the DnS and RDnS algorithms with respect to OPT for synthetic datasets, as a function of the variance. Plots omitted; panels show d = 1, 5, 10 with k = 10.]

[Figure 7: Error ratio of the DnS and RDnS algorithms with respect to OPT for real datasets, as a function of the number of segments. Plots omitted; panels show the balloon, darwin, winding, shuttle, exchange-rates and phone datasets.]
Notice that the clustering step of the CLUSTERSEGMENTS algorithm does not depend on n, and thus one can assume that clustering can be solved optimally in constant time, since usually k << n. However, if this step is solved approximately using the clustering algorithms of [1] and [13], the approximation ratios of the CLUSTERSEGMENTS algorithm that uses DnS for segmenting become 11 + ε for E_1 and \sqrt{29} + ε for E_2.

Given Theorem 3.1, and using the linear-time clustering algorithm for E_2 proposed in [13] and the linear-time version of the algorithm proposed in [2] for E_1, we get the following result:

COROLLARY 6.1. Algorithms SEGMENTS2LEVELS and CLUSTERSEGMENTS, when using DnS in their first step, run in time O(n^{4/3} k^{5/3}) for both the E_1 and E_2 error measures.

In a similar way one can derive the benefits of using the DnS and RDnS algorithms for other segmentation problems (like, for example, unimodal segmentations [9]).

7 Conclusions

In this paper we described a family of approximation algorithms for the k-segmentation problem. The most basic of those algorithms (DnS) works in time O(n^{4/3} k^{5/3}) and is a 3-approximation algorithm. We have described and analyzed several variants of this basic algorithm that are faster, but have worse approximation bounds. Furthermore, we quantified the accuracy versus speed tradeoff. Our experimental results on both synthetic and real datasets show that the proposed algorithms outperform other heuristics proposed in the literature and that the approximation achieved in practice is far below the bounds we obtained analytically.

Acknowledgments

We would like to thank Aris Gionis and Heikki Mannila for helpful discussions and advice.

References

[1] S. Arora, P. Raghavan, and S. Rao. Approximation schemes for Euclidean k-medians and related problems. In STOC, pages 106-113, 1998.
[2] V. Arya, N. Garg, R. Khandekar, K. Munagala, and V. Pandit. Local search heuristic for k-median and facility location problems. In STOC, pages 21-29. ACM Press, 2001.
[3] R. Bellman. On the approximation of curves by line segments using dynamic programming. Communications of the ACM, 4(6), 1961.
[4] P. Bernaola-Galvan, R. Roman-Roldan, and J. Oliver. Compositional segmentation and long-range fractal correlations in DNA sequences. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics, 53(5):5181-5189, 1996.
[5] H. J. Bussemaker, H. Li, and E. D. Siggia. Regulatory element detection using a probabilistic segmentation model. In ISMB, pages 67-74, 2000.
[6] D. Douglas and T. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Canadian Cartographer, 10(2):112-122, 1973.
[7] A. Gionis and H. Mannila. Finding recurrent sources in sequences. In RECOMB, pages 115-122, Berlin, Germany, 2003.
[8] S. Guha, N. Koudas, and K. Shim. Data-streams and histograms. In STOC, pages 471-475, 2001.
[9] N. Haiminen and A. Gionis. Unimodal segmentation of sequences. In ICDM, pages 106-113, 2004.
[10] J. Himberg, K. Korpiaho, H. Mannila, J. Tikanmäki, and H. Toivonen. Time series segmentation for context recognition in mobile devices. In ICDM, pages 203-210, 2001.
[11] E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In ICDM, pages 289-296, 2001.
[12] E. Keogh and T. Folias. The UCR time series data mining archive, 2002.
[13] A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS, pages 454-462, 2004.
[14] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, and J. Allan. Mining concurrent text and time series. In Proceedings of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining Workshop on Text Mining, pages 37-44, 2000.
[15] W. Li. DNA segmentation as a model selection process. In RECOMB, pages 204-210, 2001.
[16] T. Palpanas, M. Vlachos, E. J. Keogh, D. Gunopulos, and W. Truppel. Online amnesic approximation of streaming time series. In ICDE, pages 338-349, 2004.
[17] M. Salmenkivi, J. Kere, and H. Mannila. Genome segmentation using piecewise constant intensity models and reversible jump MCMC. In ECCB, pages 211-218, 2002.
[18] H. Shatkay and S. B. Zdonik. Approximate queries and representations for large data sequences. In ICDE '96: Proceedings of the Twelfth International Conference on Data Engineering, pages 536-545, 1996.