
Rapid sampling for visualizations with ordering guarantees

Published: 01 January 2015 Publication History
Abstract

    Visualizations are frequently used as a means to understand trends and gather insights from datasets, but often take a long time to generate. In this paper, we focus on the problem of rapidly generating approximate visualizations while preserving crucial visual properties of interest to analysts. Our primary focus will be on sampling algorithms that preserve the visual property of ordering; our techniques will also apply to some other visual properties. For instance, our algorithms can be used to generate an approximate visualization of a bar chart very rapidly, where the comparisons between any two bars are correct. We formally show that our sampling algorithms are generally applicable and provably optimal in theory, in that they do not take more samples than necessary to generate the visualizations with ordering guarantees. They also work well in practice, correctly ordering output groups while taking orders of magnitude fewer samples and much less time than conventional sampling schemes.
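To make the idea of sampling until a bar chart's ordering is trustworthy more concrete, here is a minimal Python sketch. It is not the paper's algorithm, which the abstract does not spell out; it is one plausible illustration built on stated assumptions: values are scaled to [0, 1], samples are drawn with replacement, confidence intervals come from Hoeffding's inequality with a union bound over groups and rounds, and the names `sample_until_ordered`, `batch`, and `max_rounds` are hypothetical.

```python
import math
import random


def sample_until_ordered(groups, delta=0.05, batch=50, max_rounds=10_000, seed=0):
    """Sample each group until all confidence intervals are pairwise disjoint.

    groups: dict mapping a group name to a list of values in [0, 1] (assumption).
    Returns (names sorted by estimated mean, total number of samples drawn).
    """
    rng = random.Random(seed)
    names = list(groups)
    k = len(names)
    total = dict.fromkeys(names, 0.0)         # running sum of sampled values
    count = dict.fromkeys(names, 0)           # samples drawn per group
    width = dict.fromkeys(names, math.inf)    # current interval half-width
    active = set(names)                       # groups that still need samples

    for r in range(1, max_rounds + 1):
        # Failure budget for this round; sum over r of 1/(r(r+1)) is 1, so the
        # union bound over all groups and rounds stays within delta.
        delta_r = delta / (k * r * (r + 1))
        for g in names:
            if g not in active:
                continue
            for _ in range(batch):
                total[g] += rng.choice(groups[g])
            count[g] += batch
            # Hoeffding half-width for the mean of count[g] values in [0, 1].
            width[g] = math.sqrt(math.log(2.0 / delta_r) / (2.0 * count[g]))

        mean = {g: total[g] / count[g] for g in names}
        # A group stays active while its interval overlaps any other interval.
        active = {
            g for g in names
            if any(h != g and abs(mean[g] - mean[h]) <= width[g] + width[h]
                   for h in names)
        }
        if not active:
            break  # every pairwise comparison is now separated

    return sorted(names, key=mean.get), sum(count.values())


if __name__ == "__main__":
    demo = {"c": [0.7, 0.9], "a": [0.1, 0.3], "b": [0.4, 0.6]}
    order, used = sample_until_ordered(demo)
    print(order, used)
```

The design choice worth noting is that groups whose intervals are already separated from all others stop being sampled, so effort concentrates on the bars that are hardest to tell apart. If `max_rounds` is reached first, the returned ordering is only a best estimate and carries no guarantee.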


      Published In

      Proceedings of the VLDB Endowment  Volume 8, Issue 5
      January 2015
      181 pages

      Publisher

      VLDB Endowment

      Qualifiers

      • Research-article
