research-article

High-Dimensional Data Cubes

Published: 01 September 2022

Abstract

This paper introduces an approach to supporting high-dimensional data cubes at interactive query speeds and at moderate storage cost. The approach is based on binary(-domain) data cubes that are judiciously partially materialized; the missing information can be quickly reconstructed using statistical or linear programming techniques. This enables new applications such as exploratory data analysis for feature engineering and other fields of data science. Moreover, it removes the need to compromise when building a data cube: all columns that we might ever wish to use can be included as dimensions. Compared to traditional data cubes, our approach also speeds up certain dice, roll-up, and drill-down operations on data cubes with hierarchical dimensions.
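
As a rough illustration of the idea of partial materialization with reconstruction, the toy sketch below stores only two small cuboids of a binary-domain cube and then answers one query exactly by roll-up and another only up to bounds. It is not the paper's algorithm: the dimension names, the data, and the helpers `cuboid`, `rollup`, and `bounds_11` are all hypothetical, and the Fréchet bounds used here are simply what a linear program constrained by the two one-dimensional marginals would return in this smallest case.

```python
from collections import Counter

DIMS = ("a", "b", "c")  # hypothetical binary dimensions of the base table


def cuboid(rows, keep):
    """Count rows grouped by the binary dimensions listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    return Counter(tuple(r[i] for i in idx) for r in rows)


def rollup(cub, dims, keep):
    """Project a materialized cuboid over `dims` onto the subset `keep`."""
    idx = [dims.index(d) for d in keep]
    out = Counter()
    for cell, n in cub.items():
        out[tuple(cell[i] for i in idx)] += n
    return out


def bounds_11(n_x, n_y, total):
    """Frechet bounds on count(x=1, y=1) given only the two 1-D marginals."""
    return max(0, n_x + n_y - total), min(n_x, n_y)


# Made-up base table: one tuple of bits (a, b, c) per row.
rows = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 0, 1)]

# Assumption: only the cuboids {a, b} and {c} are materialized.
materialized = {
    ("a", "b"): cuboid(rows, ("a", "b")),
    ("c",): cuboid(rows, ("c",)),
}

# A query on {a} is covered by {a, b}, so a roll-up answers it exactly.
a_marg = rollup(materialized[("a", "b")], ("a", "b"), ("a",))

# A query on {a, c} is not covered by any stored cuboid; the cell
# (a=1, c=1) can only be bounded from the marginals on a and on c.
c_marg = materialized[("c",)]
total = sum(c_marg.values())
lo, hi = bounds_11(a_marg[(1,)], c_marg[(1,)], total)

# For comparison only: the value a full scan of the base table would give.
true_value = cuboid(rows, ("a", "c"))[(1, 1)]
print(f"count(a=1, c=1) lies in [{lo}, {hi}]; full scan gives {true_value}")
```

Materializing further cuboids that overlap the queried dimensions would add constraints and narrow the interval, which is the trade-off between storage and reconstruction quality that the abstract alludes to.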

Cited By

  • (2024) The Moments Method for Approximate Data Cube Queries. Proceedings of the ACM on Management of Data 2:2 (1-23). DOI: 10.1145/3651147. Online publication date: 14-May-2024
  • (2023) Aggregation and Exploration of High-Dimensional Data Using the Sudokube Data Cube Engine. Companion of the 2023 International Conference on Management of Data (175-178). DOI: 10.1145/3555041.3589729. Online publication date: 4-Jun-2023

Published In

Proceedings of the VLDB Endowment  Volume 15, Issue 13
September 2022
278 pages
ISSN:2150-8097

Publisher

VLDB Endowment
