Adaptive sampling for rapidly matching histograms

Published: 01 June 2018

Abstract

In exploratory data analysis, analysts often need to identify histograms that possess a specific distribution, among a large class of candidate histograms, e.g., find countries whose income distribution is most similar to that of Greece. This distribution could be a new one that the user is curious about, or a known distribution from an existing histogram visualization. At present, this process of identification is brute-force, requiring the manual generation and evaluation of a large number of histograms. We present FastMatch: an end-to-end approach for interactively retrieving the histogram visualizations most similar to a user-specified target, from a large collection of histograms. The primary technical contribution underlying FastMatch is a probabilistic algorithm, HistSim, a theoretically sound sampling-based approach to identify the top-k closest histograms under ℓ1 distance. While HistSim can be used independently, within FastMatch we couple HistSim with a novel system architecture that is aware of practical considerations, employing asynchronous block-based sampling policies. FastMatch obtains near-perfect accuracy with up to 35× speedup over approaches that do not use sampling on several real-world datasets.
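To make the matching task concrete, the following is a minimal sketch of a naive baseline, not HistSim itself: it draws one fixed-size uniform sample of rows, builds a normalized histogram per candidate from that sample, and ranks candidates by ℓ1 distance to the target. It makes no attempt at the adaptive stopping or probabilistic guarantees the abstract attributes to HistSim. The function names (`l1_distance`, `normalize`, `top_k_by_sampling`) and the `(candidate, bin)` row layout are assumptions made for illustration.

```python
import random
from collections import Counter


def l1_distance(p, q):
    """l1 distance between two normalized histograms (dicts: bin -> probability)."""
    bins = set(p) | set(q)
    return sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in bins)


def normalize(counts):
    """Turn raw bin counts into a probability distribution."""
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()} if total else {}


def top_k_by_sampling(rows, target, k, sample_size, seed=0):
    """Rank candidates by l1 distance to `target`, using a uniform row sample.

    rows: list of (candidate, bin) pairs (hypothetical layout).
    target: normalized histogram as a dict bin -> probability.
    Candidates that never appear in the sample are not ranked.
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))

    # Build one histogram of bin counts per candidate from the sampled rows.
    counts = {}
    for candidate, bin_ in sample:
        counts.setdefault(candidate, Counter())[bin_] += 1

    # Estimate each candidate's distance to the target and keep the k closest.
    dists = {c: l1_distance(normalize(h), target) for c, h in counts.items()}
    return sorted(dists, key=dists.get)[:k]


if __name__ == "__main__":
    rows = (
        [("A", "low")] * 50 + [("A", "high")] * 50
        + [("B", "low")] * 90 + [("B", "high")] * 10
        + [("C", "low")] * 55 + [("C", "high")] * 45
    )
    target = {"low": 0.5, "high": 0.5}
    # Sampling every row makes the estimate exact for this toy input.
    print(top_k_by_sampling(rows, target, k=2, sample_size=len(rows)))
```

With a smaller `sample_size` the ranking becomes an estimate whose reliability depends on how many sampled rows each candidate receives, which is the gap a guarantee-bearing method is meant to close.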

Published In

Proceedings of the VLDB Endowment  Volume 11, Issue 10
June 2018
248 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Qualifiers

  • Research-article
