research-article

Rapid sampling for visualizations with ordering guarantees

Authors: Aditya Parameswaran, Sam Madden, and Ronitt Rubinfeld

Proceedings of the VLDB Endowment, Volume 8, Issue 5

Pages 521 - 532

https://doi.org/10.14778/2735479.2735485

Published: 01 January 2015

Abstract

Visualizations are frequently used to understand trends and gather insights from datasets, but they often take a long time to generate. In this paper, we focus on the problem of rapidly generating approximate visualizations while preserving crucial visual properties of interest to analysts. We focus primarily on sampling algorithms that preserve the visual property of ordering; our techniques also apply to some other visual properties. For instance, our algorithms can very rapidly generate an approximate bar chart in which the comparison between any two bars is correct. We formally show that our sampling algorithms are generally applicable and provably optimal in theory, in that they do not take more samples than necessary to generate the visualizations with ordering guarantees. They also work well in practice, correctly ordering output groups while taking orders of magnitude fewer samples and much less time than conventional sampling schemes.
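To make the idea of an ordering guarantee concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes every value lies in a known range, samples each group with replacement, and uses Hoeffding's inequality with a simple union bound over groups and rounds. The names `hoeffding_half_width` and `sample_until_ordered`, and the parameters `delta`, `batch`, and `max_rounds`, are hypothetical and introduced only for illustration. Sampling continues only for groups whose confidence intervals still overlap another group's interval; once all intervals are pairwise disjoint, the bars can be ordered with probability at least 1 - `delta` under these assumptions.

```python
import math
import random


def hoeffding_half_width(n, value_range, delta):
    """Half-width of a two-sided Hoeffding interval for a mean of n bounded samples."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def sample_until_ordered(groups, value_range, delta=0.05, batch=50,
                         max_rounds=2000, seed=0):
    """Sample groups until all confidence intervals are pairwise disjoint.

    groups maps a group name to a list of values within a span of value_range.
    Returns (names sorted by estimated mean, samples taken per group,
    whether every pair of groups was separated). Illustrative sketch only.
    """
    rng = random.Random(seed)
    names = list(groups)
    sums = {g: 0.0 for g in names}
    counts = {g: 0 for g in names}
    # Union bound (an assumption of this sketch): split delta across every
    # (group, round) interval that might be computed.
    per_interval_delta = delta / (len(names) * max_rounds)
    active = set(names)  # groups whose position is still ambiguous

    for _ in range(max_rounds):
        if not active:
            break
        # Draw a batch (with replacement) only from the ambiguous groups.
        for g in names:
            if g in active:
                sums[g] += sum(rng.choice(groups[g]) for _ in range(batch))
                counts[g] += batch
        # Recompute every group's confidence interval.
        bounds = {}
        for g in names:
            mean = sums[g] / counts[g]
            hw = hoeffding_half_width(counts[g], value_range, per_interval_delta)
            bounds[g] = (mean - hw, mean + hw)
        # A group stays active while its interval overlaps any other interval.
        active = {
            g for g in names
            if any(h != g
                   and bounds[g][0] <= bounds[h][1]
                   and bounds[h][0] <= bounds[g][1]
                   for h in names)
        }

    order = sorted(names, key=lambda g: sums[g] / counts[g])
    return order, counts, not active


# Toy data (made up for this example): three bars with means 0.2, 0.5, 0.8.
toy_groups = {
    "low": [0.1, 0.3] * 500,
    "mid": [0.4, 0.6] * 500,
    "high": [0.7, 0.9] * 500,
}
order, counts, resolved = sample_until_ordered(toy_groups, value_range=1.0)
print(order, counts, resolved)
```

In this sketch, groups that are already separated from all others stop consuming samples, so the effort concentrates on the bars that are hardest to tell apart. The union bound used here is deliberately conservative; it is a design choice of the example, not a claim about the bounds proved in the paper.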

Index Terms

Rapid sampling for visualizations with ordering guarantees
1. Information systems
  1. Data management systems
    1. Database management system engines

Index terms have been assigned to the content through auto-classification.

Published In

Proceedings of the VLDB Endowment, Volume 8, Issue 5 (January 2015), 181 pages

ISSN: 2150-8097

Publisher

VLDB Endowment
