
Approximate computation of multidimensional aggregates of sparse data using wavelets

Published: 01 June 1999

Abstract

Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries.
In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy.
We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.
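The general idea of answering aggregation queries from a truncated wavelet decomposition can be illustrated with a toy example. The sketch below is not the paper's construction: it assumes a small one-dimensional array held in memory, uses the orthonormal Haar wavelet, and keeps the k largest coefficients by magnitude. The function names (`haar_decompose`, `haar_reconstruct`, `keep_largest`, `approximate_range_sum`) are hypothetical and chosen only for illustration. The I/O-efficient handling of sparse high-dimensional arrays described in the abstract is not modeled here.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_decompose(values):
    """Orthonormal Haar transform of a list whose length is a power of two."""
    out = [float(v) for v in values]
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    length = n
    while length > 1:
        half = length // 2
        prev = out[:length]
        for i in range(half):
            a, b = prev[2 * i], prev[2 * i + 1]
            out[i] = (a + b) / SQRT2          # smooth (scaled average) part
            out[half + i] = (a - b) / SQRT2   # detail (scaled difference) part
        length = half
    return out


def haar_reconstruct(coeffs):
    """Inverse of haar_decompose."""
    out = list(coeffs)
    n = len(out)
    length = 1
    while length < n:
        prev = out[:2 * length]
        for i in range(length):
            s, d = prev[i], prev[length + i]
            out[2 * i] = (s + d) / SQRT2
            out[2 * i + 1] = (s - d) / SQRT2
        length *= 2
    return out


def keep_largest(coeffs, k):
    """Zero out all but the k coefficients of largest absolute value."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:k])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


def approximate_range_sum(coeffs, k, lo, hi):
    """Approximate sum of the original values at positions lo..hi (inclusive),
    using only the k largest coefficients."""
    approx = haar_reconstruct(keep_largest(coeffs, k))
    return sum(approx[lo:hi + 1])


if __name__ == "__main__":
    data = [0, 0, 5, 0, 0, 0, 0, 12]   # a mostly empty (sparse) toy array
    coeffs = haar_decompose(data)
    for k in (1, 2, 4, len(data)):
        print(k, approximate_range_sum(coeffs, k, 2, 7))
```

Raising k corresponds loosely to reading more stored coefficients to refine an answer. Because the basis in this sketch is orthonormal, the squared reconstruction error never increases as k grows, although the error of an individual range sum is not guaranteed to shrink at every step.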


Published In

ACM SIGMOD Record  Volume 28, Issue 2
June 1999
599 pages
ISSN:0163-5808
DOI:10.1145/304181

SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
June 1999
604 pages
ISBN:1581130848
DOI:10.1145/304182

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article
