Sharing data and work across concurrent analytical queries

Published: 01 July 2013

Abstract

Today's data deluge enables organizations to collect massive amounts of data and analyze them with an ever-increasing number of concurrent queries. Traditional data warehouses (DWs) struggle with this task because of their query-centric model: each query is optimized and executed independently, which results in high contention for resources. Modern DWs therefore depart from the query-centric model in favor of execution models that share common data and work. Our goal is to show when and how a DW should employ sharing. We experimentally evaluate two sharing methodologies, based on their original prototype systems, that exploit work-sharing opportunities among concurrent queries at run-time: Simultaneous Pipelining (SP), which shares intermediate results of common sub-plans, and Global Query Plans (GQP), which build and evaluate a single query plan with shared operators.
First, after a short review of sharing methodologies, we show that SP and GQP are orthogonal techniques. SP can be applied to the shared operators of a GQP, reducing response times by 20%-48% in workloads with numerous common sub-plans. Second, we corroborate previous results on the negative impact of SP on performance under low concurrency. We attribute this behavior to a bottleneck caused by the push-based communication model of SP. We show that pull-based communication for SP eliminates the overhead of sharing altogether for low concurrency, and scales better on multi-core machines than push-based SP, further reducing response times by 82%-86% for high concurrency. Third, we perform an experimental analysis of SP, GQP and their combination, and show when each one is beneficial. We identify a trade-off between low and high concurrency. In the former case, traditional query-centric operators with SP perform better, while in the latter case, GQP with shared operators enhanced by SP gives the best results.
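
The difference between push-based and pull-based sharing of a common sub-plan can be illustrated with a toy, single-threaded sketch. It is not the paper's implementation: the abstract does not describe the mechanisms in detail, so the sketch assumes that "push" means the producer copies every result into a private buffer per consumer, and "pull" means the producer writes results once into a shared structure that consumers read themselves. All names (`scan_and_filter`, `run_push`, `run_pull`, the `stats` counters) are hypothetical.

```python
from collections import deque


def scan_and_filter(rows, predicate, stats):
    """Hypothetical common sub-plan: a scan plus a selection, evaluated once."""
    for row in rows:
        stats["rows_scanned"] += 1
        if predicate(row):
            yield row


def run_push(rows, predicate, consumers):
    """Push-based sharing (assumed model): the single producer copies each
    result into a private buffer for every consumer."""
    stats = {"rows_scanned": 0, "copies": 0}
    buffers = [deque() for _ in consumers]
    for row in scan_and_filter(rows, predicate, stats):
        for buf in buffers:
            buf.append(dict(row))  # producer-side copy, one per consumer
            stats["copies"] += 1
    results = [consume(buf) for consume, buf in zip(consumers, buffers)]
    return results, stats


def run_pull(rows, predicate, consumers):
    """Pull-based sharing (assumed model): the producer writes each result
    once into a shared list; consumers read it without producer-side copies."""
    stats = {"rows_scanned": 0, "copies": 0}
    shared = list(scan_and_filter(rows, predicate, stats))
    results = [consume(shared) for consume in consumers]
    return results, stats


if __name__ == "__main__":
    rows = [{"id": i, "amount": i * 10} for i in range(6)]
    consumers = [
        lambda rs: sum(r["amount"] for r in rs),  # query 1: total amount
        lambda rs: len(rs),                        # query 2: row count
    ]
    print(run_push(rows, lambda r: r["amount"] >= 30, consumers))
    print(run_pull(rows, lambda r: r["amount"] >= 30, consumers))
```

In both variants the sub-plan scans the input once regardless of the number of consumers. In the push variant the producer's copying work grows with the number of consumers, which is one plausible reading of the bottleneck the abstract attributes to push-based SP.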

Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 9
July 2013
180 pages

Publisher

VLDB Endowment

Cited By

  • (2024) RTScan: Efficient Scan with Ray Tracing Cores. Proceedings of the VLDB Endowment 17:6 (1460-1472). DOI: 10.14778/3648160.3648183. Online publication date: 3-May-2024
  • (2021) SIMD-MIMD cocktail in a hybrid memory glass. Proceedings of the 14th ACM International Conference on Systems and Storage (1-12). DOI: 10.1145/3456727.3463782. Online publication date: 14-Jun-2021
  • (2021) Resource-efficient Shared Query Execution via Exploiting Time Slackness. Proceedings of the 2021 International Conference on Management of Data (1797-1810). DOI: 10.1145/3448016.3457282. Online publication date: 9-Jun-2021
  • (2020) Exploiting Sharing Join Opportunities in Big Data Multiquery Optimization with Flink. Complexity 2020. DOI: 10.1155/2020/6617149. Online publication date: 1-Jan-2020
  • (2020) BinDex: A Two-Layered Index for Fast and Robust Scans. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (909-923). DOI: 10.1145/3318464.3380563. Online publication date: 11-Jun-2020
  • (2020) Cache-Based Multi-Query Optimization for Data-Intensive Scalable Computing Frameworks. Information Systems Frontiers 23:1 (35-51). DOI: 10.1007/s10796-020-09995-2. Online publication date: 4-Mar-2020
  • (2020) A Chunk-Based Hash Table Caching Method for In-Memory Hash Joins. Web Information Systems Engineering – WISE 2020 (376-389). DOI: 10.1007/978-3-030-62008-0_26. Online publication date: 20-Oct-2020
  • (2019) AStream. Proceedings of the 2019 International Conference on Management of Data (607-622). DOI: 10.1145/3299869.3319884. Online publication date: 25-Jun-2019
  • (2018) Big data multi-query optimisation with Apache Flink. International Journal of Web Engineering and Technology 13:1 (78-97). DOI: 10.5555/3272336.3272340. Online publication date: 1-Jan-2018
  • (2018) OLTPshare. Proceedings of the VLDB Endowment 11:12 (1769-1780). DOI: 10.14778/3229863.3229866. Online publication date: 1-Aug-2018
