DOI: 10.1145/1014052.1014077
Article

Towards parameter-free data mining

Published: 22 August 2004

Abstract

Most data mining algorithms require the setting of many input parameters. Two main dangers of working with parameter-laden algorithms are the following. First, incorrect settings may cause an algorithm to fail to find the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process.

Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data-mining paradigm. The results are motivated by observations in Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive with or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering, with empirical tests on time series, DNA, text, and video datasets.
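To make the "dozen or so lines of code" claim concrete, here is a minimal sketch of a compression-based dissimilarity built on an off-the-shelf compressor. It rests on several assumptions not stated in the abstract: zlib stands in for the compressor, the score is taken as the ratio C(xy) / (C(x) + C(y)) where C is compressed size, and the function names are hypothetical. It illustrates the general idea and should not be read as the authors' exact procedure.

```python
import zlib
from typing import List, Tuple


def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def compression_dissimilarity(x: bytes, y: bytes) -> float:
    """Ratio C(xy) / (C(x) + C(y)).

    Assumed form of the measure: the more structure x and y share, the
    better their concatenation compresses, and the smaller the ratio.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def nearest_label(query: bytes, labelled: List[Tuple[bytes, str]]) -> str:
    """1-nearest-neighbour classification under the dissimilarity above."""
    _, label = min(
        labelled,
        key=lambda item: compression_dissimilarity(query, item[0]),
    )
    return label
```

The only choice left to the user is the compressor itself, which is the sense in which such an approach avoids tunable parameters. Note that zlib only finds repeats within a 32 KB window, so for longer inputs a different compressor would probably be a better fit.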

    Published In

    KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2004
    874 pages
    ISBN: 1581138881
    DOI: 10.1145/1014052
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. anomaly detection
    2. clustering
    3. parameter-free data mining

    Qualifiers

    • Article

    Conference

    KDD04

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Cited By

    • (2024) Composer Classification Using Maximum Probability Partitioning Based on Compression Principles. Transactions of the Japanese Society for Artificial Intelligence 39:2 (F-NA1_1-10). DOI: 10.1527/tjsai.39-2_F-NA1. Online publication date: 1-Mar-2024
    • (2024) Nearest advocate: a novel event-based time delay estimation algorithm for multi-sensor time-series data synchronization. EURASIP Journal on Advances in Signal Processing 2024:1. DOI: 10.1186/s13634-024-01143-1. Online publication date: 5-Apr-2024
    • (2024) Poster: A Memory Efficient Parameter-free Time-series Classification via gzip. Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services (690-691). DOI: 10.1145/3643832.3661424. Online publication date: 3-Jun-2024
    • (2024) Sequence Similarity Measurement for Multi-Human Motion Ability Assessment. 2024 IEEE International Conference on Mechatronics and Automation (ICMA) (363-368). DOI: 10.1109/ICMA61710.2024.10633115. Online publication date: 4-Aug-2024
    • (2024) Time-series forecasting through recurrent topology. Communications Engineering 3:1. DOI: 10.1038/s44172-023-00142-8. Online publication date: 9-Jan-2024
    • (2024) Usage-aware representation learning for critical information identification in transportation networks. Transportation Research Part C: Emerging Technologies 160 (104538). DOI: 10.1016/j.trc.2024.104538. Online publication date: Mar-2024
    • (2024) Efficient Top-k Frequent Itemset Mining on Massive Data. Data Science and Engineering 9:2 (177-203). DOI: 10.1007/s41019-024-00241-2. Online publication date: 6-Feb-2024
    • (2023) Improved Recurrence Plots Compression Distance by Learning Parameter for Video Compression Quality. Entropy 25:6 (953). DOI: 10.3390/e25060953. Online publication date: 19-Jun-2023
    • (2023) The myth of reproducibility: A review of event tracking evaluations on Twitter. Frontiers in Big Data 6. DOI: 10.3389/fdata.2023.1067335. Online publication date: 5-Apr-2023
    • (2023) Improving the Accuracy and Efficiency of Compression-based Dissimilarity Measure using Information Quantity in Data Classification Problems. Transactions of the Japanese Society for Artificial Intelligence 38:1 (A-M71_1-15). DOI: 10.1527/tjsai.38-1_A-M71. Online publication date: 1-Jan-2023
