Inequalities 3.3 and 3.5 follow from the triangular inequality, inequality 3.4 follows from Equation 3.2, and inequality 3.6 follows from Observation 1.

Next we prove the 3-approximation result for E2. For this, we need the following simple fact.

FACT 3.1. (DOUBLE TRIANGULAR INEQUALITY) Let d be a distance metric. Then for points x, y and z we have

    d(x, y)^2 \le 2 \cdot d(x, z)^2 + 2 \cdot d(z, y)^2 .

Using the above inequality, the optimality of dynamic programming in Step 4 of the algorithm, and Fact 3.1, we have

    \sum_{t \in T} d_2(\bar{t}, \mu_t)^2
        \;\le\; \sum_{t \in T} d_2(\bar{t}, \lambda_t)^2
        \;\le\; 2 \sum_{t \in T} \left( d_2(t, \bar{t})^2 + d_2(t, \lambda_t)^2 \right)
        \;\le\; 4 \sum_{t \in T} d_2(t, \lambda_t)^2
        \;=\; 4 \cdot \mathrm{OPT}_2^2 .
Finally, using the Cauchy-Schwarz inequality, we get

    2 \sum_{t \in T} d_2(t, \bar{t}) \, d_2(\bar{t}, \mu_t)
        \;\le\; 2 \sqrt{\sum_{t \in T} d_2(t, \bar{t})^2} \cdot \sqrt{\sum_{t \in T} d_2(\bar{t}, \mu_t)^2}
        \;\le\; 2 \sqrt{\mathrm{OPT}_2^2} \cdot \sqrt{4 \cdot \mathrm{OPT}_2^2}
        \;=\; 4 \cdot \mathrm{OPT}_2^2 .

Combining all the above we conclude that

    \mathrm{DnS}_2^2 \;\le\; 9 \cdot \mathrm{OPT}_2^2 .
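For readability, here is how the three bounds combine. This restatement assumes, as the references to "the above inequality" and to the binomial identity in the proof of Theorem 3.3 suggest, that DnS_2^2 was expanded into the three sums below, and that \sum_{t \in T} d_2(t, \bar{t})^2 \le \mathrm{OPT}_2^2 is the analogue of Equation 3.2:

    \mathrm{DnS}_2^2
        \;\le\; \sum_{t \in T} d_2(t, \bar{t})^2 + \sum_{t \in T} d_2(\bar{t}, \mu_t)^2 + 2 \sum_{t \in T} d_2(t, \bar{t}) \, d_2(\bar{t}, \mu_t)
        \;\le\; \mathrm{OPT}_2^2 + 4 \cdot \mathrm{OPT}_2^2 + 4 \cdot \mathrm{OPT}_2^2
        \;=\; 9 \cdot \mathrm{OPT}_2^2 ,

so DnS_2 \le 3 \cdot \mathrm{OPT}_2, which is the claimed 3-approximation for E2.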
4 Recursive DnS algorithm

The DnS algorithm applies the "divide-and-segment" idea once: it splits the sequence into subsequences, partitions each subsequence optimally, and then merges the results. We now consider the recursive DnS algorithm (RDnS), which recursively splits each of the subsequences until no further splits are possible. Algorithm 2 shows the outline of the RDnS algorithm.

Algorithm 2 The RDnS algorithm
Input: Sequence T of n points, number of segments k, value χ.
Output: A segmentation of T into k segments.
1: if |T| ≤ B then
2:     Return the optimal partition (S, M) of T using the dynamic-programming algorithm.
3: end if
4: Partition T into χ intervals T1, . . . , Tχ.
5: for all i ∈ {1, . . . , χ} do
6:     (Si, Mi) = RDnS(Ti, k, χ)
7: end for
8: Let T' = M1 ⊕ M2 ⊕ · · · ⊕ Mχ be the sequence defined by the concatenation of the representatives, weighted by the length of the interval they represent.
9: Return the optimal partition (S, M) of T' using the dynamic-programming algorithm.

The value B is a constant that defines the base case of the recursion. Alternatively, one could directly fix the depth ℓ of the recursive calls to RDnS. We refer to such an algorithm as the ℓ-RDnS algorithm. For example, the simple DnS algorithm corresponds to the 1-RDnS algorithm. We also note that at every recursive call of the RDnS algorithm the number χ of intervals into which we partition the sequence may be a function of the sequence length. However, for simplicity we write χ instead of χ(n).
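To make Algorithm 2 concrete, the following is a minimal, self-contained Python sketch for one-dimensional points and the E2 error. The dynamic program below plays the role of Steps 2 and 9; all function and variable names are ours, and mapping the final segmentation back to positions of T (rather than positions in the representative sequence T') is omitted for brevity.

def optimal_segmentation(points, weights, k):
    """Optimal weighted k-segmentation of a 1-d sequence under the E2 error,
    via the O(n^2 k) dynamic program. Returns one (representative, weight)
    pair per segment, the representative being the weighted segment mean."""
    n = len(points)
    k = min(k, n)
    W = [0.0] * (n + 1)
    S = [0.0] * (n + 1)
    Q = [0.0] * (n + 1)
    for i, (x, w) in enumerate(zip(points, weights)):   # prefix sums
        W[i + 1] = W[i] + w
        S[i + 1] = S[i] + w * x
        Q[i + 1] = Q[i] + w * x * x

    def cost(a, b):  # weighted squared error of points[a:b] around their mean
        w = W[b] - W[a]
        return ((Q[b] - Q[a]) - (S[b] - S[a]) ** 2 / w) if w > 0 else 0.0

    INF = float("inf")
    E = [[INF] * (k + 1) for _ in range(n + 1)]         # E[i][j]: i points, j segments
    back = [[0] * (k + 1) for _ in range(n + 1)]
    E[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for a in range(j - 1, i):                   # last segment is points[a:i]
                c = E[a][j - 1] + cost(a, i)
                if c < E[i][j]:
                    E[i][j], back[i][j] = c, a
    segs, i = [], n
    for j in range(k, 0, -1):                           # recover the k segments
        a = back[i][j]
        w = W[i] - W[a]
        segs.append(((S[i] - S[a]) / w if w > 0 else points[a], w))
        i = a
    return list(reversed(segs))


def rdns(points, weights, k, chi, B=50):
    """Algorithm 2 (RDnS): recursively divide, segment, and merge (assumes chi >= 2)."""
    n = len(points)
    if n <= B:                                          # Steps 1-3: base case
        return optimal_segmentation(points, weights, k)
    size = -(-n // chi)                                 # Step 4: chi intervals
    reps, rep_w = [], []
    for s in range(0, n, size):                         # Steps 5-7: recurse on each interval
        for rep, w in rdns(points[s:s + size], weights[s:s + size], k, chi, B):
            reps.append(rep)                            # Step 8: weighted representatives
            rep_w.append(w)
    return optimal_segmentation(reps, rep_w, k)         # Step 9: dynamic program on T'

With a constant χ this corresponds to the Full-RDnS variant discussed in Section 5.4; recomputing χ from the current input length inside each call (e.g. χ = ⌈√n⌉) would give Sqrt-RDnS, and choosing B large enough that each of the χ pieces is solved directly makes the recursion stop after one level, which is the plain DnS algorithm (1-RDnS).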
As a first step in the analysis of RDnS we consider the approximation ratio of the ℓ-RDnS algorithm. We can prove the following theorem.

THEOREM 4.1. The ℓ-RDnS algorithm is an O(2^ℓ)-approximation algorithm for the E1-error function, and an O(6^{ℓ/2})-approximation algorithm for the E2-error function, with respect to Problem 1.

Proof. (Sketch) The proof in both cases follows by induction on the value of ℓ. The exact approximation ratio is 2^{ℓ+1} − 1 for E1, and \sqrt{\tfrac{9}{5}\, 6^{\ell} - \tfrac{4}{5}} for E2. We will sketch the proof for E1. The proof for E2 follows along the same lines.

From Theorem 3.2, we have that the theorem is true for ℓ = 1. Assume now that it is true for some ℓ ≥ 1. We will prove it for ℓ + 1. At the first level of recursion the (ℓ+1)-RDnS algorithm breaks the sequence T into χ subsequences T1, . . . , Tχ. For each one of these we call the ℓ-RDnS algorithm, producing a set R of χk representatives. Similar to the proof of Theorem 3.2, let t̄ ∈ R denote the representative in R that corresponds to point t. Consider also the optimal segmentation of each of these intervals, and let O denote the set of χk representatives. Let t̃ ∈ O denote the representative of point t in O. From the inductive hypothesis we have that

    \sum_{t \in T} d_1(t, \bar{t}) \;\le\; \left(2^{\ell+1} - 1\right) \sum_{t \in T} d_1(t, \tilde{t}) .

Now let µt be the representative of point t in the segmentation output by the (ℓ+1)-RDnS algorithm. Also let λt denote the representative of point t in the optimal segmentation. Let RDnS1 denote the E1-error of the (ℓ+1)-RDnS algorithm, and OPT1 denote the E1-error of the optimal segmentation. We have that

    \mathrm{RDnS}_1 = \sum_{t \in T} d_1(t, \mu_t) \qquad \text{and} \qquad \mathrm{OPT}_1 = \sum_{t \in T} d_1(t, \lambda_t) .

From the triangular inequality we have that

    \sum_{t \in T} d_1(t, \mu_t)
        \;\le\; \sum_{t \in T} d_1(t, \bar{t}) + \sum_{t \in T} d_1(\bar{t}, \mu_t)
        \;\le\; \left(2^{\ell+1} - 1\right) \sum_{t \in T} d_1(t, \tilde{t}) + \sum_{t \in T} d_1(\bar{t}, \mu_t) .

From Observation 1, and Equation 3.2, we have that

    \sum_{t \in T} d_1(t, \tilde{t}) \;\le\; \sum_{t \in T} d_1(t, \lambda_t)
    \qquad \text{and} \qquad
    \sum_{t \in T} d_1(\bar{t}, \mu_t) \;\le\; \sum_{t \in T} d_1(\bar{t}, \lambda_t) .

Using the above inequalities and the triangular inequality we obtain

    \mathrm{RDnS}_1 \;=\; \sum_{t \in T} d_1(t, \mu_t)
        \;\le\; \left(2^{\ell+1} - 1\right) \sum_{t \in T} d_1(t, \lambda_t) + \sum_{t \in T} d_1(\bar{t}, \lambda_t)
        \;\le\; \left(2^{\ell+1} - 1\right) \sum_{t \in T} d_1(t, \lambda_t) + \sum_{t \in T} d_1(t, \bar{t}) + \sum_{t \in T} d_1(t, \lambda_t)
        \;\le\; 2^{\ell+1} \sum_{t \in T} d_1(t, \lambda_t) + \left(2^{\ell+1} - 1\right) \sum_{t \in T} d_1(t, \tilde{t})
        \;\le\; \left(2^{\ell+2} - 1\right) \sum_{t \in T} d_1(t, \lambda_t)
        \;=\; \left(2^{\ell+2} - 1\right) \mathrm{OPT}_1 .
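The constant can be read off as a simple recurrence; this only restates the chain above. Writing \alpha_\ell for the exact E1 approximation ratio of the ℓ-RDnS algorithm, the derivation shows

    \alpha_{\ell+1} \;=\; 2\,\alpha_{\ell} + 1 , \qquad \alpha_1 = 3
    \quad\Longrightarrow\quad
    \alpha_{\ell} \;=\; 2^{\ell+1} - 1 ,
    \qquad \text{since } 2\left(2^{\ell+1} - 1\right) + 1 = 2^{\ell+2} - 1 ,

matching the ratio claimed at the beginning of the proof.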
The proof for E2 follows similarly. Instead of using the binomial identity as in the proof of Theorem 3.3, we obtain a cleaner recursive formula for the approximation error by applying the double triangular inequality.

We now consider possible values for χ. First, we set χ to be a constant. We can prove the following theorem.

THEOREM 4.2. For any constant χ the running time of the RDnS algorithm is O(n), where n is the length of the input sequence. The algorithm can operate on data that arrive in streaming fashion using O(log n) space.

Proof. (Sketch) The running time of the RDnS algorithm is given by the following recursion

    R(n) \;=\; \chi \, R\!\left(\frac{n}{\chi}\right) + (\chi k)^2 k .

Solving the recursion we get that R(n) = O(n).
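One way to see the O(n) bound, treating χ, k and the base-case size B as constants (a sketch of the standard unrolling, which the text does not spell out): the recursion tree has depth D = \log_{\chi}(n/B), level i contains \chi^{i} subproblems, and each of the \chi^{D} = n/B leaves is solved by the dynamic program in O(B^2 k) time, so

    R(n) \;=\; \sum_{i=0}^{D-1} \chi^{i} (\chi k)^2 k \;+\; \chi^{D} \cdot O(B^2 k)
        \;\le\; \frac{n}{B(\chi - 1)} (\chi k)^2 k \;+\; O(n B k)
        \;=\; O(n) .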
In the case that the data arrive in a stream, the algorithm can build the recursion tree online, in a bottom-up fashion. At each level of the recursion tree, we only need to maintain at most χk entries that correspond to the leftmost branch of the tree. The depth of the recursion is O(log n), resulting in O(log n) space overall.

Therefore, for constant χ, we obtain an efficient algorithm, both in time and space. Unfortunately, we do not have any approximation guarantees, since the best approximation bound we can prove using Theorem 4.1 is O(n). We can, however, obtain significantly better approximation guarantees if we are willing to tolerate a small increase in the running time. We set χ = √n, where n is the length of the input sequence at each specific recursive call. That is, at each recursive call we split the sequence into √n pieces of size √n.
THEOREM 4.3. For χ = √n the RDnS algorithm is an O(log n) approximation algorithm for Problem 1 for both E1 and E2 error functions. The running time of the algorithm is O(n log log n), using O(√n) space, when operating in a streaming fashion.

Proof. (Sketch) It is not hard to see that after ℓ recursive calls the size of the input segmentation is O(n^{1/2^ℓ}). Therefore, the depth of the recursion is O(log log n). From Theorem 4.1 we have that the approximation ratio of the algorithm is O(log n). The running time of the algorithm is given by the recurrence

    R(n) \;=\; \sqrt{n} \, R\!\left(\sqrt{n}\right) + n k^3 .

Solving the recurrence we obtain running time O(n log log n). The space required is bounded by the size of the top level of the recursion, and it is O(√n).
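Two short calculations make the stated bounds explicit. After ℓ levels of recursion each subproblem has size n^{1/2^ℓ}, which drops below the base-case size B once

    n^{1/2^{\ell}} \le B \;\iff\; \ell \ge \log_2 \frac{\log n}{\log B} = O(\log\log n) ,

and since at every level of the recursion the subproblem sizes sum to n, each level contributes O(n k^3) work, so

    R(n) \;=\; \sqrt{n}\, R(\sqrt{n}) + n k^3 \;=\; O(n k^3 \log\log n) \;=\; O(n \log\log n) \quad \text{for fixed } k .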
The following corollary is an immediate consequence of the proof of Theorem 4.3 and it provides an accuracy/efficiency tradeoff.

COROLLARY 4.1. For χ = √n, the ℓ-RDnS algorithm is an O(2^ℓ)-approximation algorithm for the E1-error function, and an O(6^{ℓ/2})-approximation algorithm for the E2-error function, with respect to Problem 1. The running time of the algorithm is O(n^{1+1/2^ℓ} + nℓ).

5 Experiments

5.1 Segmentation heuristics. Since sequence segmentation is a basic problem, particularly in time-series analysis, several algorithms have been proposed in the literature with the intention of improving on the running time of the optimal dynamic-programming algorithm. These algorithms have proved very useful in practice; however, no approximation bounds are known for them. For completeness we briefly describe them here.

The TOP-DOWN greedy algorithm (TD) starts with the unsegmented sequence (initially there is just a single segment) and introduces a new boundary at every greedy step. That is, in the i-th step it introduces the i-th segment boundary by splitting one of the existing i segments into two. The new boundary is selected in such a way that it minimizes the overall error. No change is made to the existing i − 1 boundary points. The splitting process is repeated until the number of segments of the output segmentation reaches k. This algorithm, or variations of it with different stopping conditions, are used in [4, 6, 14, 18]. The running time of the algorithm is O(nk).
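A minimal Python sketch of TD for one-dimensional points and the E2 error follows (all names are ours; the prefix sums give O(1) segment costs, so each of the k − 1 splits scans the n candidate cut positions once, in line with the O(nk) bound):

def top_down(points, k):
    """TD heuristic: greedily add the boundary that most reduces the E2 error."""
    n = len(points)
    S = [0.0] * (n + 1)
    Q = [0.0] * (n + 1)
    for i, x in enumerate(points):                # prefix sums of x and x^2
        S[i + 1] = S[i] + x
        Q[i + 1] = Q[i] + x * x

    def sse(a, b):  # squared error of points[a:b] around their mean
        if b - a <= 1:
            return 0.0
        s = S[b] - S[a]
        return (Q[b] - Q[a]) - s * s / (b - a)

    boundaries = [0, n]                           # segment i is points[boundaries[i]:boundaries[i+1]]
    for _ in range(k - 1):
        best_gain, best_cut = -1.0, None
        for i in range(len(boundaries) - 1):
            a, b = boundaries[i], boundaries[i + 1]
            base = sse(a, b)
            for cut in range(a + 1, b):           # best split of this segment
                gain = base - sse(a, cut) - sse(cut, b)
                if gain > best_gain:
                    best_gain, best_cut = gain, cut
        if best_cut is None:                      # nothing left to split
            break
        boundaries.append(best_cut)
        boundaries.sort()
    return boundaries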
mance of the proposed algorithms via an extensive experi- Theorem 4.2). However, we have not derived any approx-
mental study. For this, we compare the family of “divide- imation bound for it (other than O(n)). The second one,
and-segment” algorithms with all the heuristics described in √ S QRT-RD N S, is the RD N S algorithm when we set χ to be
the previous subsection. We also explore the quality of the n. At every recursive call of this
√ algorithm the parental
results given by RD N S compared to D N S for different pa- segment of size s is split into O( s) subsegments of the
rameters of the recursion (i.e., number of recursion levels, same size. This variation of the recursive algorithm runs in
value of χ). time O(n log log n) and has approximation ratio O(log n)
For the study we use two types of datasets: (a) synthetic (see Theorem 4.3). We study experimentally the tradeoffs
and (b) real data. The synthetic data are generated as follows: between the running time and the quality of the results ob-
First we fix the dimensionality d of the data. Then we tained using the three different alternatives of “divide-and-
select k segment boundaries, which are common for all the segment” methods on synthetic and real datasets. We also
d dimensions. For the j-th segment of the i-th dimension compare the quality of those results with the results obtained
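A correspondingly simple sketch of BU under the same assumptions (this plain version rescans all adjacent pairs and is quadratic; the O(n log n) bound quoted above requires keeping the candidate merge costs in a priority queue):

def bottom_up(points, k):
    """BU heuristic: repeatedly merge the adjacent pair with the cheapest merge."""
    segs = [(i, i + 1, 1, x, x * x) for i, x in enumerate(points)]  # (start, end, count, sum, sumsq)

    def err(s):                                   # E2 error of one segment
        _, _, c, sm, sq = s
        return sq - sm * sm / c

    def merged(s1, s2):
        return (s1[0], s2[1], s1[2] + s2[2], s1[3] + s2[3], s1[4] + s2[4])

    while len(segs) > k:
        best_i, best_cost = 0, float("inf")
        for i in range(len(segs) - 1):            # cost of merging neighbours i and i+1
            cost = err(merged(segs[i], segs[i + 1])) - err(segs[i]) - err(segs[i + 1])
            if cost < best_cost:
                best_i, best_cost = i, cost
        segs[best_i:best_i + 2] = [merged(segs[best_i], segs[best_i + 1])]
    return [(s[0], s[1]) for s in segs]           # (start, end) index pairs of the k segments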
The LOCAL ITERATIVE REPLACEMENT (LiR) and GLOBAL ITERATIVE REPLACEMENT (GiR) algorithms are randomized algorithms for sequence segmentation proposed in [10]. Both algorithms start with a random k-segmentation. At each step they pick one segment boundary (randomly or in some order) and search for the best position to put it back. The algorithms repeat these steps until they converge, i.e., until they can no longer improve the error of the output segmentation. The two algorithms differ in the types of replacements of segmentation boundaries they are allowed to make. Consider a segmentation s1, s2, . . . , sk, and assume that both LiR and GiR pick segment boundary si for replacement. LiR is only allowed to put a new boundary between points si−1 and si+1. On the other hand, GiR is allowed to put a new segment boundary anywhere on the sequence. Both algorithms run in time O(In), where I is the number of iterations necessary for convergence.

Although extensive experimental evidence shows that these algorithms perform well in practice, there is no known guarantee on their worst-case error ratio.
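A sketch of GiR under the same assumptions (unoptimized: it re-evaluates candidate positions from scratch, whereas the O(In) bound assumes incremental updates; LiR differs only in the restricted search window, as noted in the comment):

import random

def gir(points, k, max_iters=100, rng=random):
    """GiR heuristic: remove one boundary at a time and re-insert it at the best
    position anywhere; LiR would search only between the neighbouring boundaries."""
    n = len(points)
    S = [0.0] * (n + 1)
    Q = [0.0] * (n + 1)
    for i, x in enumerate(points):
        S[i + 1] = S[i] + x
        Q[i + 1] = Q[i] + x * x

    def sse(a, b):
        if b - a <= 1:
            return 0.0
        s = S[b] - S[a]
        return (Q[b] - Q[a]) - s * s / (b - a)

    def total(bounds):
        return sum(sse(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1))

    bounds = [0] + sorted(rng.sample(range(1, n), k - 1)) + [n]   # random start
    best = total(bounds)
    for _ in range(max_iters):
        improved = False
        for j in range(1, k):                     # each interior boundary in turn
            rest = bounds[:j] + bounds[j + 1:]
            cand_best, cand_err = bounds[j], best
            for pos in range(1, n):               # GiR: any position is allowed
                if pos in rest:
                    continue
                e = total(sorted(rest + [pos]))
                if e < cand_err:
                    cand_best, cand_err = pos, e
            if cand_err < best:
                bounds = sorted(rest + [cand_best])
                best = cand_err
                improved = True
        if not improved:                          # converged: no boundary can be improved
            break
    return bounds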
5.2 Experimental setup. We show the qualitative performance of the proposed algorithms via an extensive experimental study. For this, we compare the family of “divide-and-segment” algorithms with all the heuristics described in the previous subsection. We also explore the quality of the results given by RDnS compared to DnS for different parameters of the recursion (i.e., number of recursion levels, value of χ).

For the study we use two types of datasets: (a) synthetic and (b) real data. The synthetic data are generated as follows: First we fix the dimensionality d of the data. Then we select k segment boundaries, which are common for all the d dimensions. For the j-th segment of the i-th dimension we select a mean value µij, which is uniformly distributed in [0, 1]. Points are then generated by adding a noise value sampled from the normal distribution N(µij, σ²). For the experiments we present here we have fixed the number of segments to k = 10. We have generated datasets with d = 1, 5, 10, and standard deviations varying from 0.05 to 0.9. The real datasets were downloaded from the UCR time-series data mining archive [12]¹.

¹ The interested reader can find the datasets at http://www.cs.ucr.edu/∼eamonn/TSDMA/.
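The synthetic data described above can be generated with a few lines of Python/NumPy; the sequence length and the uniform random choice of boundaries are our own defaults where the text does not fix them:

import numpy as np

def generate_synthetic(n=1000, k=10, d=5, sigma=0.5, rng=None):
    """Synthetic data as in Section 5.2: k segments with boundaries shared by
    all d dimensions; each (segment, dimension) mean mu_ij ~ U[0, 1]; points
    are the mean plus N(0, sigma^2) noise."""
    rng = np.random.default_rng() if rng is None else rng
    cuts = np.sort(rng.choice(np.arange(1, n), size=k - 1, replace=False))
    bounds = np.concatenate(([0], cuts, [n]))
    data = np.empty((n, d))
    for j in range(k):
        a, b = bounds[j], bounds[j + 1]
        mu = rng.uniform(0.0, 1.0, size=d)        # mu_ij for segment j
        data[a:b, :] = mu + rng.normal(0.0, sigma, size=(b - a, d))
    return data, bounds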
5.3 Performance of the DnS algorithm. Figures 2 and 3 show the performance of the different algorithms on the synthetic datasets. In particular, we plot the error ratio A/OPT, where A is the error of the solution found by each of the algorithms DnS, BU, TD, LiR and GiR, and OPT represents the error of the optimal solution. The error ratio is shown as a function of the number of segments (Figure 2), or of the variance of the generated datasets (Figure 3). In all cases, the DnS algorithm consistently outperforms all the other heuristics, and the error it achieves is very close to that of the optimal algorithm. Note that, in contrast to the steady behavior of DnS, the quality of the results of the other heuristics varies for the different parameters, and no conclusions on their behavior on arbitrary datasets can be drawn.

This phenomenon is even more pronounced when we experiment with real data. Figure 4 is a sample of similar experimental results obtained using the datasets balloon, darwin, winding, xrates and phone from the UCR repository. DnS performs extremely well in terms of accuracy, and it is again very robust across different datasets for different values of k. Overall, GiR performs the best among the rest of the heuristics. However, there are cases (e.g., the balloon dataset) where GiR is severely outperformed.

5.4 Exploring the benefits of the recursion. We additionally compare the basic DnS algorithm with different versions of RDnS. The first one, Full-RDnS (full recursion), is the RDnS algorithm with the value of χ set to a constant. This algorithm runs in linear time (see Theorem 4.2); however, we have not derived any approximation bound for it (other than O(n)). The second one, Sqrt-RDnS, is the RDnS algorithm with χ set to √n. At every recursive call of this algorithm the parental segment of size s is split into O(√s) subsegments of the same size. This variation of the recursive algorithm runs in time O(n log log n) and has approximation ratio O(log n) (see Theorem 4.3). We study experimentally the tradeoffs between the running time and the quality of the results obtained using the three different alternatives of “divide-and-segment” methods on synthetic and real datasets. We also compare the quality of those results with the results obtained using the GiR algorithm; we choose this algorithm for comparison since it has proved to be the best among all the other heuristics. In Figures 5 and 6 we plot the error ratios of the algorithms as a function of the number of segments and of the variance for the synthetic datasets. Figure 7 shows the experiments on real datasets.

From the results we can make the following observations. First, all the algorithms of the divide-and-segment family perform extremely well, giving results close to the optimal segmentation and usually better than the results obtained by GiR. The full recursion (Full-RDnS) does harm the quality of the results. However, we note that in order to study the full effect of recursion on the performance of the algorithm we set χ = 2, the minimum possible value. We believe that for larger values of χ the performance of Full-RDnS would be closer to that of DnS (for which we have χ = (n/k)^{2/3}). Finally, there are cases where Sqrt-RDnS […]
[Figure 2 here. Three panels (Synthetic Datasets; d = 1, 5, 10; var = 0.5) plot Error Ratio versus Number of segments for DnS, BU, TD, LiR and GiR.]

Figure 2: Error ratio of different algorithms with respect to OPT as a function of the number of segments
[Figure 3 here. Three panels (Synthetic Datasets; d = 1, 5, 10; k = 10) plot Error Ratio versus variance for DnS, BU, TD, LiR and GiR.]

Figure 3: Error ratio of different algorithms with respect to OPT as a function of the variance of the generated datasets
[Figure 4 here. Panels (including the shuttle, exchange-rates and phone datasets) plot Error Ratio versus Number of segments for DnS, BU, TD, LiR and GiR.]

Figure 4: Error ratio of different algorithms with respect to OPT as a function of the number of segments for different real datasets
[Figure 5 here. Panels plot Error Ratio versus Number of segments for the DnS and RDnS algorithms on synthetic datasets.]

Figure 5: Error ratio of DnS and RDnS algorithms with respect to OPT for synthetic datasets.

[…] with h < k, the optimal (k, h)-segmentation is defined as

    (6.7) \qquad S_{\mathrm{opt}}(T, k, h) \;=\; \arg\min_{S \in \mathcal{S}_{n,k,h}} E_p(T, S) .

[…]
[Figure 6 here. Panels plot Error Ratio versus Variance for the DnS and RDnS algorithms on synthetic datasets.]

Figure 6: Error ratio of DnS and RDnS algorithms with respect to OPT for synthetic datasets.
[Figure 7 here. Panels (including the shuttle, exchange-rates and phone datasets) plot Error Ratio versus Number of segments for DnS, Sqrt-RDnS, Full-RDnS and GiR.]

Figure 7: Error ratio of DnS and RDnS algorithms with respect to OPT for real datasets
Notice that the clustering step of the CLUSTERSEGMENTS algorithm does not depend on n and thus one can assume that clustering can be solved optimally in constant time, since usually k ≪ n. However, if this step is solved approximately using the clustering algorithms of [1] and [13], the approximation ratios of the CLUSTERSEGMENTS algorithm that uses DnS for segmenting become 11 + ε for E1 and √29 + ε for E2.

Given Theorem 3.1, and using the linear-time clustering algorithm for E2 proposed in [13] and the linear-time version of the algorithm proposed in [2] for E1, we get the following result:

COROLLARY 6.1. Algorithms SEGMENTS2LEVELS and CLUSTERSEGMENTS, when using DnS in their first step, run in time O(n^{4/3} k^{5/3}) for both the E1 and E2 error measures.

In a similar way one can derive the benefits of using the DnS and RDnS algorithms for other segmentation problems (like, for example, unimodal segmentations [9]).

7 Conclusions

In this paper we described a family of approximation algorithms for the k-segmentation problem. The most basic of these algorithms (DnS) works in time O(n^{4/3} k^{5/3}) and is a 3-approximation algorithm. We have described and analyzed several variants of this basic algorithm that are faster but have worse approximation bounds. Furthermore, we quantified the accuracy versus speed tradeoff. Our experimental results on both synthetic and real datasets show that the proposed algorithms outperform other heuristics proposed in the literature and that the approximation achieved in practice is far below the bounds we obtained analytically.

Acknowledgments

We would like to thank Aris Gionis and Heikki Mannila for helpful discussions and advice.

References

[5] H. J. Bussemaker, H. Li, and E. D. Siggia. Regulatory element detection using a probabilistic segmentation model. In ISMB, pages 67–74, 2000.
[6] D. Douglas and T. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Canadian Cartographer, 10(2):112–122, 1973.
[7] A. Gionis and H. Mannila. Finding recurrent sources in sequences. In RECOMB, pages 115–122, Berlin, Germany, 2003.
[8] S. Guha, N. Koudas, and K. Shim. Data-streams and histograms. In STOC, pages 471–475, 2001.
[9] N. Haiminen and A. Gionis. Unimodal segmentation of sequences. In ICDM, pages 106–113, 2004.
[10] J. Himberg, K. Korpiaho, H. Mannila, J. Tikanmäki, and H. Toivonen. Time series segmentation for context recognition in mobile devices. In ICDM, pages 203–210, 2001.
[11] E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In ICDM, pages 289–296, 2001.
[12] E. Keogh and T. Folias. The UCR time series data mining archive, 2002.
[13] A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS, pages 454–462, 2004.
[14] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, and J. Allan. Mining concurrent text and time series. In Proceedings of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining Workshop on Text Mining, pages 37–44, 2000.
[15] W. Li. DNA segmentation as a model selection process. In RECOMB, pages 204–210, 2001.
[16] T. Palpanas, M. Vlachos, E. J. Keogh, D. Gunopulos, and W. Truppel. Online amnesic approximation of streaming time series. In ICDE, pages 338–349, 2004.
[17] M. Salmenkivi, J. Kere, and H. Mannila. Genome segmentation using piecewise constant intensity models and reversible jump MCMC. In ECCB, pages 211–218, 2002.
[18] H. Shatkay and S. B. Zdonik. Approximate queries and representations for large data sequences. In ICDE ’96: Proceedings of the Twelfth International Conference on Data Engineering, pages 536–545, 1996.