Article

Dwarf: shrinking the PetaCube

Authors:

Yannis Sismanis,

Antonios Deligiannakis,

Nick Roussopoulos,

Yannis Kotidis

SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data

Pages 464 - 475

https://doi.org/10.1145/564691.564745

Published: 03 June 2002

Abstract

Dwarf is a highly compressed structure for computing, storing, and querying data cubes. Dwarf identifies prefix and suffix structural redundancies and factors them out by coalescing their storage. Prefix redundancy is high in dense areas of cubes, while suffix redundancy is significantly higher in sparse areas. Exploiting the two together reduces the exponential sizes of high-dimensional full cubes to a dramatically condensed data structure. Eliminating suffix redundancy yields an equally dramatic reduction in the computation of the cube, because recomputation of the redundant suffixes is avoided. This effect is multiplied in the presence of correlation among attributes in the cube. A petabyte 25-dimensional cube was shrunk this way to a 2.3 GB Dwarf cube in less than 20 minutes, a 1:400000 storage reduction ratio. Still, Dwarf provides 100% precision on cube queries and is a self-sufficient structure that requires no access to the fact table. What makes Dwarf practical is the automatic discovery, in a single pass over the fact table, of the prefix and suffix redundancies without user involvement or knowledge of the value distributions.

This paper describes the Dwarf structure and the Dwarf cube construction algorithm. Further optimizations are then introduced for improving clustering and query performance. Experiments with the current implementation include detailed comparisons, on real and synthetic datasets, against previously published techniques. The comparisons show that Dwarf by far outperforms these techniques on all counts: storage space, creation time, query response time, and updates of cubes.
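The two kinds of redundancy named in the abstract can be made concrete with a toy sketch. The Python below is not the Dwarf construction algorithm (which, per the abstract, discovers the redundancies in a single pass over the fact table); it naively materializes the full cube of a small, hypothetical fact table and then counts storage nodes in two ways: with shared prefixes only, and with identical sub-structures also stored once. The table contents, the `"*"` marker for a rolled-up dimension, and all function names are assumptions made for illustration.

```python
from itertools import product

ALL = "*"  # assumed marker for a rolled-up ("all values") dimension

# Hypothetical fact table: (store, customer, product, price)
rows = [
    ("S1", "C2", "P2", 70),
    ("S1", "C3", "P1", 40),
    ("S2", "C1", "P1", 90),
    ("S2", "C1", "P2", 50),
]

def full_cube(rows):
    """Map every group-by cell (with ALL wildcards) to its summed measure."""
    cube = {}
    for *dims, measure in rows:
        # Each fact contributes to 2^d cells: every dimension is kept or rolled up.
        for mask in product((False, True), repeat=len(dims)):
            cell = tuple(ALL if rolled else v for v, rolled in zip(dims, mask))
            cube[cell] = cube.get(cell, 0) + measure
    return cube

def build_trie(cube):
    """Prefix sharing: cells with a common prefix share one path from the root."""
    root = {}
    for cell, total in cube.items():
        node = root
        for value in cell[:-1]:
            node = node.setdefault(value, {})
        node[cell[-1]] = total  # the last level stores the aggregates directly
    return root

def count_nodes(node):
    """Count trie nodes; identical subtrees are counted once per occurrence."""
    return 1 + sum(count_nodes(c) for c in node.values() if isinstance(c, dict))

def coalesce(node, pool):
    """Suffix coalescing: structurally identical subtrees are stored only once."""
    if not isinstance(node, dict):
        return node
    key = tuple(sorted((k, coalesce(c, pool)) for k, c in node.items()))
    return pool.setdefault(key, key)

cube = full_cube(rows)
trie = build_trie(cube)
pool = {}
coalesce(trie, pool)

print("cube cells:", len(cube))
print("nodes with prefix sharing only:", count_nodes(trie))
print("nodes after suffix coalescing:", len(pool))
```

In this sparse toy table, several sub-structures under different prefixes (for example, everything below store `S2` regardless of customer) hold identical aggregates, so the coalesced node count is lower than the prefix-only count. This mirrors the abstract's observation that suffix redundancy dominates in sparse areas.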

Index Terms

Dwarf: shrinking the PetaCube
1. Information systems

Recommendations

Fast Computation of Iceberg Dwarf
SSDBM '04: Proceedings of the 16th International Conference on Scientific and Statistical Database Management

Iceberg Dwarf (IceDwarf for short) combines the strength of Iceberg-Cube and Dwarf. It exploits the elegant Dwarf structure for cube tuple store and eliminates those unsatisfied sub-dwarfs. By only storing nontrivial cube tuples, IceDwarf reduces the size ...
A clustered Dwarf structure to speed up queries on data cubes
DaWaK'07: Proceedings of the 9th international conference on Data Warehousing and Knowledge Discovery

Dwarf is a highly compressed structure, which compresses the cube by eliminating the semantic redundancies while computing a data cube. Although it has high compression ratio, Dwarf is slower in querying and more difficult in updating due to its ...

Published In

SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data

June 2002

654 pages

ISBN:1581134975

DOI:10.1145/564691

Conference Chair: Bongki Moon (University of Wisconsin - Madison)
General Chair: David DeWitt (University of Wisconsin - Madison)
Program Chair: Michael Franklin (University of California, Berkeley)

Copyright © 2002 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMOD/PODS02: International Conference on Management of Data and Symposium on Principles of Database Systems

June 3 - 6, 2002

Madison, Wisconsin

Acceptance Rates

SIGMOD '02 Paper Acceptance Rate 42 of 240 submissions, 18%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 143
Total Downloads: 1,130
Downloads (Last 12 months): 16
Downloads (Last 6 weeks): 3

Reflects downloads up to 13 Jan 2025

Cited By

Phan-Luong V (2024). Mining Interesting Aggregate Tuples. Intelligent Systems and Applications, pp. 229-243. DOI: 10.1007/978-3-031-47715-7_16. Online publication date: 30-Jan-2024
https://doi.org/10.1007/978-3-031-47715-7_16
John S, Koch C (2022). High-Dimensional Data Cubes. Proceedings of the VLDB Endowment 15:13, pp. 3828-3840. DOI: 10.14778/3565838.3565839. Online publication date: 1-Sep-2022
https://dl.acm.org/doi/10.14778/3565838.3565839
Caetano A, Hirata C, Silva R (2022). A comparative study of cluster-based Big Data Cube implementations. Future Generation Computer Systems 133, pp. 240-253. DOI: 10.1016/j.future.2022.03.024. Online publication date: Aug-2022
https://doi.org/10.1016/j.future.2022.03.024
Battle L, Scheidegger C (2021). A Structured Review of Data Management Technology for Interactive Visualization and Analysis. IEEE Transactions on Visualization and Computer Graphics 27:2, pp. 1128-1138. DOI: 10.1109/TVCG.2020.3028891. Online publication date: Feb-2021
https://doi.org/10.1109/TVCG.2020.3028891
Ji X, Yan X, Ren K, Wang X, Tang B (2021). Effective and Efficient Summarization for Non-hierarchical Data. 2021 Ivannikov Ispras Open Conference (ISPRAS), pp. 100-106. DOI: 10.1109/ISPRAS53967.2021.00019. Online publication date: Dec-2021
https://doi.org/10.1109/ISPRAS53967.2021.00019
Wang Z, Cashman D, Li M, Li J, Berger M, Levine J, Chang R, Scheidegger C (2021). NeuralCubes: Deep Representations for Visual Data Exploration. 2021 IEEE International Conference on Big Data (Big Data), pp. 550-561. DOI: 10.1109/BigData52589.2021.9671390. Online publication date: 15-Dec-2021
https://doi.org/10.1109/BigData52589.2021.9671390
Wang Q, You J, Zou B, Chen Y, Huang X, Jia L (2021). Reduced Quotient Cube: Maximize Query Answering Capacity in OLAP. IEEE Access, pp. 1-1. DOI: 10.1109/ACCESS.2021.3120278. Online publication date: 2021
https://doi.org/10.1109/ACCESS.2021.3120278
Phan-Luong V (2021). A Complete Index Base for Querying Data Cube. Intelligent Systems and Applications, pp. 486-500. DOI: 10.1007/978-3-030-82196-8_36. Online publication date: 3-Aug-2021
https://doi.org/10.1007/978-3-030-82196-8_36
Guan X, Xie C, Han L, Zeng Y, Shen D, Xing W (2020). MAP-Vis: A Distributed Spatio-Temporal Big Data Visualization Framework Based on a Multi-Dimensional Aggregation Pyramid Model. Applied Sciences 10:2, article 598. DOI: 10.3390/app10020598. Online publication date: 14-Jan-2020
https://doi.org/10.3390/app10020598
Kim A, Lakshmanan L, Srivastava D (2020). Summarizing Hierarchical Multidimensional Data. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 877-888. DOI: 10.1109/ICDE48307.2020.00081. Online publication date: Apr-2020
https://doi.org/10.1109/ICDE48307.2020.00081
