DOI: 10.1145/2882903.2915230

Distributed Wavelet Thresholding for Maximum Error Metrics

Published: 14 June 2016

Abstract

Modern data analytics involve simple and complex computations over enormous numbers of data records. The volume of data and increasingly stringent response-time requirements place growing emphasis on the efficiency of approximate query processing. A major challenge over the past years has been the efficient construction of fixed-space synopses that provide a deterministic quality guarantee, often expressed in terms of a maximum error metric. For data reduction, wavelet decomposition has proved to be a very effective tool, as it can successfully approximate sharp discontinuities and provide accurate answers to queries. However, existing polynomial-time wavelet thresholding schemes that minimize maximum error metrics have time and space complexities that are impractical for large datasets. To provide a practical solution to the problem, we develop parallel algorithms that take advantage of key properties of the wavelet decomposition and allocate tasks to multiple workers. To that end, we present i) a general framework for the parallelization of existing dynamic programming (DP) algorithms, ii) a parallel version of one such DP-based algorithm and iii) a new parallel greedy algorithm for the problem. To the best of our knowledge, this is the first attempt to scale algorithms for wavelet thresholding for maximum error metrics via a state-of-the-art distributed runtime. Our extensive experiments on both real and synthetic datasets over Hadoop show that the proposed algorithms achieve linear scalability and superior running-time performance compared to their centralized counterparts. Furthermore, our distributed greedy algorithm outperforms the distributed version of the current state-of-the-art dynamic programming algorithm by a factor of 2 to 4, without compromising the quality of results.
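To make the problem statement concrete, the following is a minimal single-machine sketch of Haar wavelet decomposition and of thresholding under the maximum absolute error metric. It is not the distributed greedy or DP algorithm of the paper: the function names (`haar_decompose`, `haar_reconstruct`, `max_abs_error`, `naive_greedy_threshold`) and the sample data are hypothetical, and the greedy loop is a deliberately naive baseline that re-evaluates the error from scratch at every step.

```python
from typing import List, Tuple


def haar_decompose(data: List[float]) -> List[float]:
    """Unnormalized Haar transform by pairwise averaging and differencing.

    Assumes len(data) is a power of two. Index 0 holds the overall average;
    indices [half, 2*half) hold the detail coefficients of each level.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = [0.0] * n
    level = list(data)
    while len(level) > 1:
        half = len(level) // 2
        averages = [(level[2 * i] + level[2 * i + 1]) / 2 for i in range(half)]
        details = [(level[2 * i] - level[2 * i + 1]) / 2 for i in range(half)]
        coeffs[half:2 * half] = details
        level = averages
    coeffs[0] = level[0]
    return coeffs


def haar_reconstruct(coeffs: List[float]) -> List[float]:
    """Invert haar_decompose: each average splits into (avg + d, avg - d)."""
    level = [coeffs[0]]
    while len(level) < len(coeffs):
        half = len(level)
        details = coeffs[half:2 * half]
        nxt: List[float] = []
        for avg, det in zip(level, details):
            nxt.extend((avg + det, avg - det))
        level = nxt
    return level


def max_abs_error(data: List[float], coeffs: List[float]) -> float:
    """Maximum absolute error of the synopsis `coeffs` against `data`."""
    approx = haar_reconstruct(coeffs)
    return max(abs(d - a) for d, a in zip(data, approx))


def naive_greedy_threshold(data: List[float], budget: int) -> Tuple[List[float], float]:
    """Naive baseline (an assumption, not the paper's algorithm): repeatedly
    zero out the coefficient whose removal gives the smallest maximum
    absolute error, until at most `budget` non-zero coefficients remain."""
    current = haar_decompose(data)
    kept = {i for i, c in enumerate(current) if c != 0.0}
    while len(kept) > budget:
        best_i, best_err = -1, float("inf")
        for i in sorted(kept):
            trial = list(current)
            trial[i] = 0.0
            err = max_abs_error(data, trial)
            if err < best_err:
                best_i, best_err = i, err
        current[best_i] = 0.0
        kept.remove(best_i)
    return current, max_abs_error(data, current)


if __name__ == "__main__":
    sample = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
    synopsis, err = naive_greedy_threshold(sample, budget=3)
    print(synopsis, err)
```

Each greedy step in this sketch reconstructs the full signal for every candidate coefficient, which is the kind of cost that becomes impractical on large datasets and that motivates the parallel algorithms described in the abstract.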

Cited By

  • (2022) Efficient two-dimensional Haar+ synopsis construction for the maximum absolute error measure. The VLDB Journal - The International Journal on Very Large Data Bases, 28:5 (675-701). DOI: 10.1007/s00778-019-00551-2. Online publication date: 11-Mar-2022
  • (2019) Scaling the Construction of Wavelet Synopses for Maximum Error Metrics. IEEE Transactions on Knowledge and Data Engineering, 31:9 (1794-1808). DOI: 10.1109/TKDE.2018.2867185. Online publication date: 1-Sep-2019
  • (2018) Approximate Query Processing: What is New and Where to Go? Data Science and Engineering, 3:4 (379-397). DOI: 10.1007/s41019-018-0074-4. Online publication date: 14-Sep-2018


Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN: 9781450335317
DOI: 10.1145/2882903

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. approximate query processing
  2. distributed data management
  3. greedy thresholding
  4. wavelet synopses

Qualifiers

  • Research-article

Funding Sources

  • European Commission

Conference

SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
