DOI: 10.1145/2791347.2791349

A novel approach for approximate aggregations over arrays

Published: 29 June 2015

Abstract

Approximate aggregation has been a popular approach for interactive data analysis and decision making, especially on large-scale datasets. While there is clearly a need to apply this approach for scientific datasets comprising massive arrays, existing algorithms have largely been developed for relational data, and cannot handle both dimension-based and value-based predicates efficiently while maintaining accuracy. In this paper, we present a novel approach for approximate aggregations over array data, using bitmap indices or bitvectors as the summary structure, as they preserve both spatial and value distribution of the data. We develop approximate aggregation algorithms using only the bitvectors and certain additional pre-aggregation statistics (equivalent to a 1-dimensional histogram) that we require. Another key development is choosing a binning strategy that can improve aggregation accuracy -- we introduce a v-optimized binning strategy and its weighted extension, and present a bitmap construction algorithm with such binning. We compare our method with other existing methods including sampling and multi-dimensional histograms, as well as the use of other binning strategies with bitmaps. We demonstrate both high accuracy and efficiency of our approach. Specifically, we show that in most cases, our method is more accurate than other methods by at least one order of magnitude. Despite achieving much higher accuracy, our method can require significantly less storage than multi-dimensional histograms.
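
To make the idea of aggregating from bitvectors plus per-bin statistics concrete, here is a minimal sketch. It is not the paper's algorithm, whose details the abstract does not give. It assumes a 1-dimensional array, one bitvector per value bin stored as a Python integer, and a stored mean per bin standing in for the pre-aggregation statistics. The names build_index, approx_sum, and edges are hypothetical.

```python
from bisect import bisect_right

def build_index(values, edges):
    """Build one bitvector (a Python int) per bin, plus the mean of each bin.

    `edges` holds the interior bin boundaries: bin i covers
    edges[i-1] <= v < edges[i], with open-ended first and last bins.
    """
    nbins = len(edges) + 1
    bitvectors = [0] * nbins
    counts = [0] * nbins
    sums = [0.0] * nbins
    for pos, v in enumerate(values):
        b = bisect_right(edges, v)
        bitvectors[b] |= 1 << pos      # bit `pos` marks array position `pos`
        counts[b] += 1
        sums[b] += v
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return bitvectors, means

def approx_sum(bitvectors, means, start, stop, bins):
    """Estimate SUM over positions [start, stop), restricted to `bins`."""
    # Dimension-based predicate: a mask with bits start..stop-1 set.
    region = ((1 << (stop - start)) - 1) << start
    total = 0.0
    for b in bins:                     # value-based predicate, aligned to bins
        hits = bin(bitvectors[b] & region).count("1")   # popcount
        total += hits * means[b]       # each hit is approximated by the bin mean
    return total

values = [1.0, 2.0, 9.0, 10.0, 1.5, 9.5, 2.5, 10.5]
edges = [5.0]                          # two bins: v < 5 and v >= 5
bitvectors, means = build_index(values, edges)

estimate = approx_sum(bitvectors, means, 0, 4, bins=[0, 1])
exact = sum(values[0:4])
print(estimate, exact)                 # prints "23.0 22.0"
```

The raw values are never read at query time: the estimate uses only the bitvectors and the per-bin means, so the error comes from how far values spread within each bin. This is why the choice of binning strategy affects accuracy.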

Published In

SSDBM '15: Proceedings of the 27th International Conference on Scientific and Statistical Database Management
June 2015
390 pages
ISBN:9781450337090
DOI:10.1145/2791347
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Funding Sources

  • NSF
  • DOE Office of Science, Advanced Scientific Computing Research

Conference

SSDBM 2015

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%

Article Metrics

  • Downloads (last 12 months): 12
  • Downloads (last 6 weeks): 2

Reflects downloads up to 15 Oct 2024.

Cited By

  • (2022) ReSKY: Efficient Subarray Skyline Computation in Array Databases. Distributed and Parallel Databases, 40(2-3):261-298. DOI: 10.1007/s10619-022-07419-5. Online publication date: 17-Jul-2022.
  • (2022) Hierarchical Bitmap Indexing for Range Queries on Multidimensional Arrays. Database Systems for Advanced Applications, pages 509-525. DOI: 10.1007/978-3-031-00123-9_40. Online publication date: 8-Apr-2022.
  • (2021) A scalable array storage for efficient maintenance of future data. The Journal of Supercomputing. DOI: 10.1007/s11227-020-03554-x. Online publication date: 2-Jan-2021.
  • (2021) Advances in MapReduce Big Data Processing: Platform, Tools, and Algorithms. Artificial Intelligence and IoT, pages 105-128. DOI: 10.1007/978-981-33-6400-4_6. Online publication date: 13-Feb-2021.
  • (2020) Segmented In-Advance Data Analytics for Fast Scientific Discovery. IEEE Transactions on Cloud Computing, 8(2):432-442. DOI: 10.1109/TCC.2016.2541142. Online publication date: 1-Apr-2020.
  • (2020) MoHA: A Composable System for Efficient In-Situ Analytics on Heterogeneous HPC Systems. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-16. DOI: 10.1109/SC41405.2020.00086. Online publication date: Nov-2020.
  • (2019) Accelerating array joining with integrated value-index. Proceedings of the 31st International Conference on Scientific and Statistical Database Management, pages 145-156. DOI: 10.1145/3335783.3335790. Online publication date: 23-Jul-2019.
  • (2018) COMPASS. Proceedings of the 30th International Conference on Scientific and Statistical Database Management, pages 1-12. DOI: 10.1145/3221269.3223033. Online publication date: 9-Jul-2018.
  • (2018) Research on SVM Multi-Classification Based on Particle Swarm Algorithm. 2018 International Symposium on Computer, Consumer and Control (IS3C), pages 270-273. DOI: 10.1109/IS3C.2018.00075. Online publication date: Dec-2018.
  • (2018) What-If Analysis with Conflicting Goals: Recommending Data Ranges for Exploration. 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 89-100. DOI: 10.1109/ICDE.2018.00018. Online publication date: Apr-2018.
