Efficiently supporting ad hoc queries in large datasets of time sequences

Published: 01 June 1997

Abstract

Ad hoc querying is difficult on very large datasets, since it is usually not possible to have the entire dataset on disk. While compression can be used to decrease the size of the dataset, compressed data is notoriously difficult to index or access.
In this paper we consider a very large dataset comprising multiple distinct time sequences. Each point in the sequence is a numerical value. We show how to compress such a dataset into a format that supports ad hoc querying, provided that a small error can be tolerated when the data is uncompressed. Experiments on large, real world datasets (AT&T customer calling patterns) show that the proposed method achieves an average of less than 5% error in any data value after compressing to a mere 2.5% of the original space (i.e., a 40:1 compression ratio), with these numbers not very sensitive to dataset size. Experiments on aggregate queries achieved a 0.5% reconstruction error with a space requirement under 2%.
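
The space figures above, and the idea of answering queries directly from a compressed form, can be illustrated with a small sketch. The abstract does not name the compression technique, so the rank-1 factorization used here (computed by power iteration) is only an assumed stand-in for a lossy format that allows random access. All function names and the toy data are hypothetical.

```python
import math

def compression_ratio(space_fraction):
    """Convert 'fraction of original space' into an N:1 ratio (0.025 -> 40:1)."""
    return 1.0 / space_fraction

def rank1_factor(X, iters=50):
    """Leading singular triple (u, sigma, v) of X via power iteration.

    Assumption: X is a non-zero list of equal-length rows with non-negative
    entries, so the uniform start vector is not orthogonal to the leading
    direction.
    """
    n, m = len(X), len(X[0])
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        u = [sum(X[i][j] * v[j] for j in range(m)) for i in range(n)]
        norm_u = math.sqrt(sum(a * a for a in u))
        u = [a / norm_u for a in u]
        w = [sum(X[i][j] * u[i] for i in range(n)) for j in range(m)]
        sigma = math.sqrt(sum(a * a for a in w))
        v = [a / sigma for a in w]
    return u, sigma, v

def cell(u, sigma, v, i, j):
    """Ad hoc point query: approximate X[i][j] without rebuilding X."""
    return sigma * u[i] * v[j]

def range_sum(u, sigma, v, rows, cols):
    """Aggregate query: approximate sum of X[i][j] over rows x cols."""
    return sigma * sum(u[i] for i in rows) * sum(v[j] for j in cols)

def mean_relative_error(X, u, sigma, v):
    """Average |approx - exact| / |exact| over the non-zero cells of X."""
    errs = [abs(cell(u, sigma, v, i, j) - x) / abs(x)
            for i, row in enumerate(X) for j, x in enumerate(row) if x != 0]
    return sum(errs) / len(errs)

def space_fraction(n, m):
    """Stored numbers (u, v, sigma) relative to the n*m original values."""
    return (n + m + 1) / (n * m)

if __name__ == "__main__":
    # Made-up data: 3 sequences x 4 time points, nearly proportional rows.
    X = [[10.0, 20.0, 30.0, 41.0],
         [5.0, 10.0, 15.0, 20.0],
         [2.0, 4.0, 6.5, 8.0]]
    u, sigma, v = rank1_factor(X)
    print("ratio for 2.5% space:", compression_ratio(0.025))
    print("X[0][3] approx:", cell(u, sigma, v, 0, 3))
    print("sum of sequences 0-1 approx:", range_sum(u, sigma, v, [0, 1], range(4)))
    print("mean relative error:", mean_relative_error(X, u, sigma, v))
```

On a matrix this small the factors take almost as much room as the data. In this sketch the stored size grows as n + m + 1 while the original grows as n * m, so the saving only becomes substantial when both the number of sequences and their length are large.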

Published In

ACM SIGMOD Record, Volume 26, Issue 2
June 1997
583 pages
ISSN: 0163-5808
DOI: 10.1145/253262
  • SIGMOD '97: Proceedings of the 1997 ACM SIGMOD international conference on Management of data
    June 1997
    594 pages
    ISBN: 0897919114
    DOI: 10.1145/253260

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 1997
Published in SIGMOD Volume 26, Issue 2

Qualifiers

  • Article
