DOI: 10.1145/1645953.1646101

Fast and effective histogram construction

Published: 02 November 2009

Abstract

Histogram construction or sequence segmentation is a basic task with applications in database systems, information retrieval, and knowledge management. Its aim is to approximate a sequence by line segments. Unfortunately, the quadratic algorithm that derives an optimal histogram for Euclidean error lacks the desired scalability. Therefore, sophisticated approximation algorithms have been recently proposed, while several simple heuristics are used in practice. Still, these solutions fail to resolve the efficiency-quality tradeoff in a satisfactory manner. In this paper we take a fresh view on the problem. We propose conceptually clear and scalable algorithms that efficiently derive high-quality histograms. We experimentally demonstrate that existing approximation schemes fail to deliver the desired efficiency and conventional heuristics do not fare well on the side of quality. On the other hand, our schemes match or exceed the quality of the former and the efficiency of the latter.
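
For orientation, the quadratic baseline mentioned in the abstract can be sketched as a textbook dynamic program. This is not the algorithms proposed in the paper, only an illustration of the optimal method they are compared against. It assumes a piecewise-constant histogram in which each bucket is represented by its mean, and it minimizes the sum of squared errors, which is equivalent to minimizing Euclidean error. The function name `v_optimal_histogram` and its parameters are hypothetical.

```python
from itertools import accumulate


def v_optimal_histogram(values, num_buckets):
    """Return (sse, buckets) for an error-optimal piecewise-constant histogram.

    buckets is a list of half-open index ranges (start, end) over `values`.
    Runs in O(n^2 * B) time, i.e. quadratic in the sequence length n.
    """
    n = len(values)
    if not 1 <= num_buckets <= n:
        raise ValueError("num_buckets must be between 1 and len(values)")

    # Prefix sums let us compute any bucket's squared error in O(1).
    prefix = [0.0] + list(accumulate(values))
    prefix_sq = [0.0] + list(accumulate(v * v for v in values))

    def sse(i, j):
        # Squared error of replacing values[i:j] by their mean.
        s = prefix[j] - prefix[i]
        sq = prefix_sq[j] - prefix_sq[i]
        return max(sq - s * s / (j - i), 0.0)

    inf = float("inf")
    # best[b][j]: minimum error covering the first j values with b buckets.
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    # cut[b][j]: start index of the last bucket in that optimal solution.
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0

    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                candidate = best[b - 1][i] + sse(i, j)
                if candidate < best[b][j]:
                    best[b][j] = candidate
                    cut[b][j] = i

    # Walk the cut table backwards to recover the bucket boundaries.
    buckets = []
    j = n
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        buckets.append((i, j))
        j = i
    buckets.reverse()
    return best[num_buckets][n], buckets
```

Calling `v_optimal_histogram([1, 1, 1, 5, 5, 9, 9, 9], 3)` splits the sequence at the two value changes and reports zero error. The triple loop is what limits scalability on long sequences, which is the tradeoff the abstract describes.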

Published In

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
November 2009
2162 pages
ISBN: 9781605585123
DOI: 10.1145/1645953

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. efficiency
  2. histograms
  3. segmentation

Qualifiers

  • Research-article

Conference

CIKM '09

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 13
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 16 Oct 2024

Cited By

  • (2021) A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration. Data Science and Engineering, 6:1 (86-101). DOI: 10.1007/s41019-020-00149-7. Online publication date: 15-Jan-2021
  • (2019) Histogram partitioning algorithms for adaptive and autonomous threshold estimation in cognitive radio–based industrial wireless sensor networks. Transactions on Emerging Telecommunications Technologies, 30:10. DOI: 10.1002/ett.3679. Online publication date: 15-Oct-2019
  • (2017) Faster BlockMax WAND with Variable-sized Blocks. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (625-634). DOI: 10.1145/3077136.3080780. Online publication date: 7-Aug-2017
  • (2013) Entropy-based histograms for selectivity estimation. Proceedings of the 22nd ACM international conference on Information & Knowledge Management (1939-1948). DOI: 10.1145/2505515.2505756. Online publication date: 27-Oct-2013
  • (2013) Efficient and scalable monitoring and summarization of large probabilistic data. Proceedings of the 2013 SIGMOD/PODS Ph.D. symposium (61-66). DOI: 10.1145/2483574.2483586. Online publication date: 23-Jun-2013
  • (2011) Fast and accurate computation of equi-depth histograms over data streams. Proceedings of the 14th International Conference on Extending Database Technology (69-80). DOI: 10.1145/1951365.1951376. Online publication date: 21-Mar-2011
