research-article

Adaptive sampling for rapidly matching histograms

Authors:

Aditya Parameswaran

Proceedings of the VLDB Endowment, Volume 11, Issue 10

Pages 1262 - 1275

https://doi.org/10.14778/3231751.3231753

Published: 01 June 2018

Abstract

In exploratory data analysis, analysts often need to identify histograms that possess a specific distribution among a large class of candidate histograms, e.g., to find countries whose income distribution is most similar to that of Greece. This distribution could be a new one that the user is curious about, or a known distribution from an existing histogram visualization. At present, this identification process is brute-force, requiring the manual generation and evaluation of a large number of histograms. We present FastMatch: an end-to-end approach for interactively retrieving the histogram visualizations most similar to a user-specified target from a large collection of histograms. The primary technical contribution underlying FastMatch is HistSim, a probabilistic, theoretically sound, sampling-based algorithm that identifies the top-k closest histograms under ℓ₁ distance. While HistSim can be used independently, within FastMatch we couple it with a novel system architecture that is aware of practical considerations, employing asynchronous block-based sampling policies. On several real-world datasets, FastMatch obtains near-perfect accuracy with up to 35× speedup over approaches that do not use sampling.
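To make the matching task concrete, the following is a minimal sketch of ranking candidate histograms by ℓ₁ distance to a target after normalizing each histogram to a distribution. It is not HistSim: the optional `sample_fraction` parameter applies plain uniform row sampling with no stopping rule or probabilistic guarantee, and every name in it (`normalize`, `l1_distance`, `top_k_closest`, and the `(candidate, bin_value)` row layout) is a hypothetical choice made for illustration.

```python
from collections import Counter
import heapq
import random


def normalize(counts, bins):
    """Turn raw bin counts (mapping bin -> count) into a distribution over `bins`."""
    total = sum(counts.get(b, 0) for b in bins)
    if total == 0:
        return [0.0] * len(bins)
    return [counts.get(b, 0) / total for b in bins]


def l1_distance(p, q):
    """l1 distance between two distributions given as equal-length lists."""
    return sum(abs(a - b) for a, b in zip(p, q))


def top_k_closest(rows, target, bins, k, sample_fraction=1.0, seed=0):
    """Rank candidates by l1 distance between their histogram and the target.

    rows: iterable of (candidate, bin_value) pairs (assumed layout).
    target: mapping bin -> count describing the target histogram.
    sample_fraction: below 1.0, each row is kept independently with this
    probability (naive uniform sampling, an assumption of this sketch).
    """
    rng = random.Random(seed)
    counts = {}
    for candidate, value in rows:
        if sample_fraction >= 1.0 or rng.random() < sample_fraction:
            counts.setdefault(candidate, Counter())[value] += 1

    target_dist = normalize(target, bins)
    scored = [
        (l1_distance(normalize(c, bins), target_dist), name)
        for name, c in counts.items()
    ]
    # Smallest distances first; ties are broken by candidate name.
    return heapq.nsmallest(k, scored)
```

With `sample_fraction=1.0` this scans every row, which corresponds to the no-sampling baseline the abstract compares against. Lowering it trades accuracy for speed, but without the correctness guarantees the abstract attributes to HistSim.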


Published In


Proceedings of the VLDB Endowment Volume 11, Issue 10

June 2018

248 pages

ISSN:2150-8097

Editors: Jian Pei (Simon Fraser University), Sihem Amer-Yahia (University of Grenoble Alpes, CNRS)


Publisher

VLDB Endowment

