A Session-Based Approach to Fast-But-Approximate Interactive Data Cube Exploration

Published: 13 February 2018

Abstract

With the proliferation of large datasets, sampling has become pervasive in data analysis. Sampling has numerous benefits, from reducing computation time and cost to increasing the scope of interactive analysis. A popular task in data science that is well suited to sampling is the computation of fast-but-approximate aggregations over sampled data. Aggregation is a foundational building block of data analysis, with the data cube as its primary construct. We observe that such aggregation queries are typically issued in an ad-hoc, interactive setting. In contrast to one-off queries, a typical query session consists of a series of quick queries, interspersed with the user inspecting the results and formulating the next query. The similarity between the queries in a session opens up opportunities for reusing the computation of not just query results, but also error estimates. Error estimates must be provided alongside sampled results for the results to be meaningful. We propose Sesame, a rewrite and caching framework that accelerates the entire interactive session of aggregation queries over sampled data. We focus on two unique and computationally expensive aspects of this use case, query speculation in the presence of sampling and error computation, and provide novel strategies for result and error reuse. We demonstrate that our approach outperforms conventional sampled aggregation techniques by at least an order of magnitude, without modifying the underlying database.
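To make the idea of reusing both results and error estimates concrete, the following is a minimal sketch and not Sesame's actual algorithm. It assumes each cached group keeps a count, a mean, and a sum of squared deviations, and it combines those summaries with the standard pairwise variance-merging formula so that a coarser follow-up query gets its estimate and standard error from the cache rather than from the sample. The `Stats` class, the `group_by` and `roll_up` helpers, and the column names are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    """Mergeable summary of sampled values: count, mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x: float) -> "Stats":
        # Welford-style update with a single new value.
        n = self.n + 1
        delta = x - self.mean
        mean = self.mean + delta / n
        return Stats(n, mean, self.m2 + delta * (x - mean))

    def merge(self, other: "Stats") -> "Stats":
        # Pairwise combination of two partial summaries.
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Stats(n, mean, m2)

    def std_error(self) -> float:
        # Standard error of the sample mean (no finite-population correction).
        if self.n < 2:
            return float("nan")
        return math.sqrt(self.m2 / (self.n - 1) / self.n)


def group_by(rows, key_cols, value_col):
    """Aggregate sampled rows into per-group Stats keyed by the chosen columns."""
    out = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        out[key] = out.get(key, Stats()).add(row[value_col])
    return out


def roll_up(cached, keep_positions):
    """Answer a coarser group-by from cached Stats without rescanning the sample."""
    out = {}
    for key, stats in cached.items():
        coarse = tuple(key[i] for i in keep_positions)
        out[coarse] = out.get(coarse, Stats()).merge(stats)
    return out


# Hypothetical sampled rows; the schema is made up for this example.
sample = [
    {"region": "east", "product": "a", "sales": 10.0},
    {"region": "east", "product": "a", "sales": 14.0},
    {"region": "east", "product": "b", "sales": 12.0},
    {"region": "west", "product": "a", "sales": 20.0},
    {"region": "west", "product": "b", "sales": 24.0},
]

fine = group_by(sample, ["region", "product"], "sales")  # first query in the session
by_region = roll_up(fine, [0])                           # follow-up query served from the cache
direct = group_by(sample, ["region"], "sales")           # same query recomputed from the sample
```

In this sketch, `by_region` and `direct` agree on the mean and the standard error for every region, which is the property that lets a session avoid recomputing error estimates for a roll-up. A real system would also have to handle sampling weights, other aggregates, and cache eviction, none of which are modeled here.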

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 12, Issue 1
Special Issue (IDEA) and Regular Papers
February 2018
363 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3178542

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 February 2018
Accepted: 01 March 2017
Revised: 01 January 2017
Received: 01 December 2015
Published in TKDD Volume 12, Issue 1

Author Tags

  1. Error Reuse
  2. aggregation
  3. faceted exploration
  4. interactive visualization
  5. session

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2021) A Structured Review of Data Management Technology for Interactive Visualization and Analysis. IEEE Transactions on Visualization and Computer Graphics 27:2 (1128-1138). DOI: 10.1109/TVCG.2020.3028891. Online publication date: Feb-2021
  • (2021) LAQP: Learning-based approximate query processing. Information Sciences 546 (1113-1134). DOI: 10.1016/j.ins.2020.09.070. Online publication date: Feb-2021
  • (2020) Amplifying Domain Expertise in Clinical Data Pipelines. JMIR Medical Informatics 8:11 (e19612). DOI: 10.2196/19612. Online publication date: 5-Nov-2020
  • (2020) AGAMI: Scalable Visual Analytics over Multidimensional Data Streams. 2020 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT) (57-66). DOI: 10.1109/BDCAT50828.2020.00020. Online publication date: Dec-2020
  • (2019) Hillview. Proceedings of the VLDB Endowment 12:11 (1442-1457). DOI: 10.14778/3342263.3342279. Online publication date: 1-Jul-2019
  • (2019) IDEAaS: Interactive Data Exploration As-a Service. 2019 IEEE World Congress on Services (SERVICES) (345-348). DOI: 10.1109/SERVICES.2019.00096. Online publication date: Jul-2019
  • (2019) A Relevance-based approach for Big Data Exploration. Future Generation Computer Systems. DOI: 10.1016/j.future.2019.05.056. Online publication date: May-2019
  • (2018) AQP++. Proceedings of the 2018 International Conference on Management of Data (1477-1492). DOI: 10.1145/3183713.3183747. Online publication date: 27-May-2018
  • (2017) A Unified Correlation-based Approach to Sampling Over Joins. Proceedings of the 29th International Conference on Scientific and Statistical Database Management (1-12). DOI: 10.1145/3085504.3085524. Online publication date: 27-Jun-2017
