Article (DOI: 10.1145/564691.564745)

Dwarf: shrinking the PetaCube

Published: 03 June 2002

Abstract

Dwarf is a highly compressed structure for computing, storing, and querying data cubes. Dwarf identifies prefix and suffix structural redundancies and factors them out by coalescing their store. Prefix redundancy is high on dense areas of cubes but suffix redundancy is significantly higher for sparse areas. Putting the two together fuses the exponential sizes of high dimensional full cubes into a dramatically condensed data structure. The elimination of suffix redundancy yields an equally dramatic reduction in the computation of the cube because recomputation of the redundant suffixes is avoided. This effect is multiplied in the presence of correlation amongst attributes in the cube. A Petabyte 25-dimensional cube was shrunk this way to a 2.3 GB Dwarf Cube, in less than 20 minutes, a 1:400000 storage reduction ratio. Still, Dwarf provides 100% precision on cube queries and is a self-sufficient structure which requires no access to the fact table. What makes Dwarf practical is the automatic discovery, in a single pass over the fact table, of the prefix and suffix redundancies without user involvement or knowledge of the value distributions.

This paper describes the Dwarf structure and the Dwarf cube construction algorithm. Further optimizations are then introduced for improving clustering and query performance. Experiments with the current implementation include comparisons on detailed measurements with real and synthetic datasets against previously published techniques. The comparisons show that Dwarfs by far outperform these techniques on all counts: storage space, creation time, query response time, and updates of cubes.
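The idea of factoring out prefix and suffix redundancy can be illustrated with a small sketch. This is not the paper's single-pass construction algorithm: it is a simplified recursive build that memoizes sub-structures by the set of fact rows they aggregate, so that any two group-by paths covering the same rows at the same level share one stored node. The fact table, the `ALL` marker, and the function names `build` and `query` are all hypothetical choices made for this illustration.

```python
ALL = "*"  # hypothetical marker for "aggregate over this dimension"

# Toy fact table (assumed for illustration): (store, customer, product) -> price
fact = [
    (("S1", "C2", "P2"), 70),
    (("S1", "C3", "P1"), 40),
    (("S2", "C1", "P1"), 90),
    (("S2", "C1", "P2"), 50),
]

def build(fact, ids, depth, memo, stats):
    """Build the sub-structure for the fact rows `ids` from dimension `depth` on.

    Prefix sharing comes from the trie shape: each dimension value is stored
    once per parent node. Suffix coalescing comes from `memo`: the same row set
    at the same depth always yields the same sub-structure, so it is built once
    and then reused instead of being recomputed.
    """
    key = (depth, ids)
    if key in memo:
        stats["reused"] += 1
        return memo[key]
    stats["built"] += 1
    ndims = len(fact[0][0])
    if depth == ndims:
        node = sum(fact[i][1] for i in ids)  # leaf: aggregated measure
    else:
        groups = {}
        for i in ids:
            groups.setdefault(fact[i][0][depth], []).append(i)
        node = {
            value: build(fact, frozenset(rows), depth + 1, memo, stats)
            for value, rows in groups.items()
        }
        # The ALL cell aggregates every row that reaches this node.
        node[ALL] = build(fact, ids, depth + 1, memo, stats)
    memo[key] = node
    return node

def query(root, *coords):
    """Answer a point query by following one cell per dimension."""
    node = root
    for c in coords:
        node = node[c]
    return node

stats = {"built": 0, "reused": 0}
root = build(fact, frozenset(range(len(fact))), 0, {}, stats)

print(query(root, "S2", ALL, ALL))   # total for store S2: 140
print(query(root, ALL, ALL, "P1"))   # total for product P1: 130
print(root["S2"]["C1"] is root["S2"][ALL])  # True: one shared sub-structure
print(stats)
```

In this toy table, store S2 has a single customer, so the cells `S2, C1` and `S2, ALL` point at the same node. The sparser the data, the more often a row set repeats along different paths, which is in line with the abstract's remark that suffix redundancy is higher in sparse areas. Every query still returns an exact aggregate, and the structure is never consulted alongside the fact table once built.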



Published In

SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data
June 2002
654 pages
ISBN: 1581134975
DOI: 10.1145/564691
Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Conference

SIGMOD/PODS02

Acceptance Rates

SIGMOD '02 Paper Acceptance Rate: 42 of 240 submissions, 18%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%

