Abstract
The problem of finding unusual time series has recently attracted much attention, and several promising methods are now in the literature. However, virtually all proposed methods assume that the data reside in main memory. For many real-world problems this is not the case. For example, in astronomy, multi-terabyte time series datasets are the norm. Most current algorithms, when faced with data that cannot fit in main memory, resort to multiple scans of the disk or tape and are thus intractable. In this work we show how one particular definition of unusual time series, the time series discord, can be discovered with a disk-aware algorithm. The proposed algorithm is exact and requires only two linear scans of the disk with a tiny buffer of main memory. Furthermore, it is very simple to implement. We use the algorithm to provide further evidence of the effectiveness of the discord definition in areas as diverse as astronomy, web query mining, and video surveillance, and show the efficiency of our method on datasets which are many orders of magnitude larger than anything else attempted in the literature.
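The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a two-scan discord search could be organised, assuming a discord is a time series whose distance to its nearest neighbour is at least some range threshold. The threshold `r`, the names `find_discords` and `scan`, the use of Euclidean distance, and the in-memory list standing in for the disk are all illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Series = Sequence[float]


def euclidean(a: Series, b: Series) -> float:
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_discords(scan: Callable[[], Iterable[Series]],
                  r: float) -> List[Tuple[int, float]]:
    """Return (index, nearest-neighbour distance) for every series whose
    nearest neighbour is at least r away, largest distance first.

    `scan` is a hypothetical stand-in for one sequential read of the disk:
    each call must yield the series in the same order.
    """
    # Scan 1 (candidate selection): keep a small in-memory candidate set.
    # A series survives only if no series seen so far lies within r of it.
    candidates: Dict[int, Series] = {}
    for i, s in enumerate(scan()):
        is_candidate = True
        for j in list(candidates):
            if euclidean(s, candidates[j]) < r:
                del candidates[j]      # j has a close neighbour: not a discord
                is_candidate = False   # s also has a close neighbour (j)
        if is_candidate:
            candidates[i] = s

    # Scan 2 (refinement): compute each candidate's exact nearest-neighbour
    # distance and drop the false positives that slipped through scan 1.
    nn_dist: Dict[int, float] = {j: math.inf for j in candidates}
    for i, s in enumerate(scan()):
        for j in list(nn_dist):
            if i == j:
                continue               # skip the trivial self-match
            d = euclidean(s, candidates[j])
            if d < r:
                del nn_dist[j]
            elif d < nn_dist[j]:
                nn_dist[j] = d

    return sorted(nn_dist.items(), key=lambda kv: -kv[1])


# Usage: four near-identical series and one outlier at index 3.
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [0.05, 0.05]]
discords = find_discords(lambda: iter(data), r=1.0)
print(discords)
```

In this sketch any series whose nearest neighbour is at least `r` away is never evicted during the first scan, so the second scan only has to weed out false positives. If `r` is chosen too large the result is empty and the search would have to be repeated with a smaller value.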
Cite this article
Yankov, D., Keogh, E. & Rebbapragada, U. Disk aware discord discovery: finding unusual time series in terabyte sized datasets. Knowl Inf Syst 17, 241–262 (2008). https://doi.org/10.1007/s10115-008-0131-9