Article

Towards parameter-free data mining

Authors:

Stefano Lonardi,

Chotirat Ann Ratanamahatana

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 206 - 215

https://doi.org/10.1145/1014052.1014077

Published: 22 August 2004

Abstract

Most data mining algorithms require the setting of many input parameters. Working with parameter-laden algorithms carries two main dangers. First, incorrect settings may cause an algorithm to fail to find the true patterns. Second, and perhaps more insidiously, the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process.

Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data-mining paradigm. The results are motivated by observations in Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We show that this approach is competitive with or superior to state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering, with empirical tests on time series, DNA, text, and video datasets.
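The claim that an off-the-shelf compressor plus about a dozen lines of code is enough can be illustrated with a minimal sketch. It assumes a compression-based dissimilarity of the form C(xy) / (C(x) + C(y)), where C(s) is the compressed size of s, and it assumes Python's zlib as the compressor. The function names, the sample strings, and the nearest-neighbour helper are hypothetical choices made for illustration; none of them come from this page.

```python
import random
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Number of bytes in `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity (assumed form): C(xy) / (C(x) + C(y)).

    Smaller values suggest that x and y share structure the compressor
    can exploit; values close to 1 suggest they share little.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def nearest_label(query: bytes, labeled: Mapping[str, bytes]) -> str:
    """Return the label whose example is least dissimilar to `query`."""
    return min(labeled, key=lambda label: cdm(query, labeled[label]))


# Illustrative data only: two texts with a shared vocabulary, plus
# seeded pseudo-random bytes standing in for an unrelated sequence.
text_a = b"the quick brown fox jumps over the lazy dog. " * 40
text_b = b"the lazy dog jumps over the quick brown fox. " * 40
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(2000))

if __name__ == "__main__":
    print("text_a vs text_b:", round(cdm(text_a, text_b), 3))
    print("text_a vs noise: ", round(cdm(text_a, noise), 3))
    print("nearest to text_a:", nearest_label(text_a, {"text": text_b, "noise": noise}))
```

The measure itself takes no tuning parameters; the only choice left is the compressor, and any general-purpose one could be substituted for zlib in `compressed_size`.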

Index Terms

Towards parameter-free data mining
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Compression-based data mining of sequential data

The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, ...
Free parallel data mining
Mining uncertain data

As an important data mining and knowledge discovery task, association rule mining searches for implicit, previously unknown, and potentially useful pieces of information—in the form of rules revealing associative relationships—that are embedded in the ...

Published In

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

August 2004

874 pages

ISBN: 1581138881

DOI: 10.1145/1014052

General Chairs: Won Kim (Cyber Database Solutions), Ronny Kohavi (Amazon.com)

Program Chairs: Johannes Gehrke (Cornell University), William DuMouchel (AT&T Labs Research)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

KDD04: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 22 - 25, 2004

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Article Metrics

Total Citations: 388
Total Downloads: 4,504
Downloads (Last 12 months): 88
Downloads (Last 6 weeks): 6

Reflects downloads up to 01 Sep 2024

Cited By

Takamoto A, Hironaka S, Umemura K (2024). Composer Classification Using Maximum Probability Partitioning Based on Compression Principles. Transactions of the Japanese Society for Artificial Intelligence 39:2 (F-NA1_1-10). Online publication date: 1-Mar-2024.
https://doi.org/10.1527/tjsai.39-2_F-NA1
Schranz C, Mayr S, Bernhart S, Halmich C (2024). Nearest advocate: a novel event-based time delay estimation algorithm for multi-sensor time-series data synchronization. EURASIP Journal on Advances in Signal Processing 2024:1. Online publication date: 5-Apr-2024.
https://doi.org/10.1186/s13634-024-01143-1
Lee K, Lee S, Ko J, Okoshi T, Ko J, LiKamWa R (2024). Poster: A Memory Efficient Parameter-free Time-series Classification via gzip. Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services (690-691). Online publication date: 3-Jun-2024.
https://dl.acm.org/doi/10.1145/3643832.3661424
Chen L, Wang D, Zheng Y, Guo X (2024). Sequence Similarity Measurement for Multi-Human Motion Ability Assessment. 2024 IEEE International Conference on Mechatronics and Automation (ICMA) (363-368). Online publication date: 4-Aug-2024.
https://doi.org/10.1109/ICMA61710.2024.10633115
Chomiak T, Hu B (2024). Time-series forecasting through recurrent topology. Communications Engineering 3:1. Online publication date: 9-Jan-2024.
https://doi.org/10.1038/s44172-023-00142-8
Sun R, Fan Y (2024). Usage-aware representation learning for critical information identification in transportation networks. Transportation Research Part C: Emerging Technologies 160 (104538). Online publication date: Mar-2024.
https://doi.org/10.1016/j.trc.2024.104538
Wan X, Han X (2024). Efficient Top-k Frequent Itemset Mining on Massive Data. Data Science and Engineering 9:2 (177-203). Online publication date: 6-Feb-2024.
https://doi.org/10.1007/s41019-024-00241-2
Murai T, Koga H (2023). Improved Recurrence Plots Compression Distance by Learning Parameter for Video Compression Quality. Entropy 25:6 (953). Online publication date: 19-Jun-2023.
https://doi.org/10.3390/e25060953
Mamo N, Azzopardi J, Layfield C (2023). The myth of reproducibility: A review of event tracking evaluations on Twitter. Frontiers in Big Data 6. Online publication date: 5-Apr-2023.
https://doi.org/10.3389/fdata.2023.1067335
Takamoto A, Kohara Y, Yoshida M, Umemura K (2023). Improving the Accuracy and Efficiency of Compression-based Dissimilarity Measure using Information Quantity in Data Classification Problems. Transactions of the Japanese Society for Artificial Intelligence 38:1 (A-M71_1-15). Online publication date: 1-Jan-2023.
https://doi.org/10.1527/tjsai.38-1_A-M71
