DOI: 10.1145/304182.304199

Approximate computation of multidimensional aggregates of sparse data using wavelets

Published: 01 June 1999

Abstract

Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries.
In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy.
We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.
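
To make the idea of a compact, wavelet-based representation concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a one-dimensional array whose length is a power of two, an unnormalized Haar decomposition held entirely in memory, and a simple rule that keeps the k coefficients with the largest normalized magnitude. The function names (`haar_decompose`, `haar_reconstruct`, `keep_largest`, `approx_range_sum`) are hypothetical, and the sketch does not attempt the multidimensional, sparse, I/O-efficient construction the abstract describes.

```python
import math


def haar_decompose(values):
    """Unnormalized Haar transform of a list whose length is a power of two.

    Returns the overall average followed by the detail coefficients,
    ordered from the coarsest resolution to the finest.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    current = list(values)
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = diffs + details  # coarser levels go in front
        current = averages
    return current + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    current = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(current)]
        expanded = []
        for avg, diff in zip(current, diffs):
            expanded.extend([avg + diff, avg - diff])
        pos += len(current)
        current = expanded
    return current


def keep_largest(coeffs, k):
    """Keep the k coefficients with the largest normalized magnitude.

    Assumption: a coefficient is weighted by sqrt(size of its support),
    which corresponds to ranking by orthonormal Haar magnitude.
    Returns a sparse {index: value} map, the "compact" representation.
    """
    n = len(coeffs)

    def weight(i):
        support = n if i == 0 else n >> (i.bit_length() - 1)
        return abs(coeffs[i]) * math.sqrt(support)

    kept = sorted(range(n), key=weight, reverse=True)[:k]
    return {i: coeffs[i] for i in kept if coeffs[i] != 0}


def approx_range_sum(compact, n, lo, hi):
    """Approximate sum of entries lo..hi (inclusive) from the compact map."""
    dense = [compact.get(i, 0.0) for i in range(n)]
    return sum(haar_reconstruct(dense)[lo:hi + 1])


if __name__ == "__main__":
    data = [2, 2, 0, 2, 3, 5, 4, 4]
    coeffs = haar_decompose(data)
    # Retaining more coefficients refines the answer to the same query.
    for k in (1, 2, 4, 8):
        compact = keep_largest(coeffs, k)
        print(k, approx_range_sum(compact, len(data), 2, 5))
```

Running the script prints the estimate of the same range sum for an increasing number of retained coefficients, which loosely mirrors the refinement behaviour mentioned in the abstract: with all coefficients kept, the answer is exact. Reconstructing the whole array per query is done here only for brevity.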



Published In

SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
June 1999
604 pages
ISBN:1581130848
DOI:10.1145/304182

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article

Conference

SIGMOD/PODS99

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
