Cache-Based Multi-Query Optimization for Data-Intensive Scalable Computing Frameworks

Information Systems Frontiers

Abstract

In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work, for example scanning and processing the same subset of data. Instead of optimizing jobs independently, which may result in redundant and wasteful processing, multi-query optimization techniques can be employed to save a considerable amount of cluster resources. In this work, we introduce a novel method combining in-memory cache primitives and multi-query optimization, to improve the efficiency of data-intensive, scalable computing frameworks. By careful selection and exploitation of common (sub)expressions, while satisfying memory constraints, our method transforms a batch of queries into a new, more efficient one which avoids unnecessary recomputations. To find feasible and efficient execution plans, our method uses a cost-based optimization formulation akin to the multiple-choice knapsack problem. Extensive experiments on a prototype implementation of our system show significant benefits of worksharing for both TPC-DS workloads and detailed micro-benchmarks.
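The abstract likens plan selection to the multiple-choice knapsack problem. The sketch below illustrates that textbook problem only, not the authors' actual formulation, which this preview does not show. It assumes candidates are grouped so that at most one is cached per group, that each candidate has an integer memory weight and a saved cost, and that there is a single memory budget. The function name `select_candidates` and all numbers are hypothetical.

```python
from typing import List, Tuple

# A candidate is (memory_weight, saved_cost). Hypothetical illustration values.
Candidate = Tuple[int, float]


def select_candidates(groups: List[List[Candidate]], budget: int) -> Tuple[float, List[int]]:
    """Choose at most one candidate per group, maximizing total saved cost
    with total weight <= budget (dynamic programming over integer weights).

    Returns (best_value, choice), where choice[g] is the index chosen in
    group g, or -1 when nothing is chosen from that group.
    """
    n = len(groups)
    # best[g][w]: best value using the first g groups with total weight <= w.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    # pick[g][w]: index chosen in group g-1 for that optimum, or -1 for none.
    pick = [[-1] * (budget + 1) for _ in range(n + 1)]

    for g in range(1, n + 1):
        for w in range(budget + 1):
            best[g][w] = best[g - 1][w]  # option: skip this group entirely
            for i, (weight, value) in enumerate(groups[g - 1]):
                if weight <= w and best[g - 1][w - weight] + value > best[g][w]:
                    best[g][w] = best[g - 1][w - weight] + value
                    pick[g][w] = i

    # Walk back through the table to recover the chosen index per group.
    choice = [-1] * n
    w = budget
    for g in range(n, 0, -1):
        i = pick[g][w]
        choice[g - 1] = i
        if i >= 0:
            w -= groups[g - 1][i][0]
    return best[n][budget], choice


groups = [
    [(4, 10.0), (2, 6.0)],   # group 0: two alternative candidates
    [(3, 7.0), (5, 12.0)],   # group 1
    [(2, 3.0)],              # group 2
]
result = select_candidates(groups, budget=7)
print(result)  # (18.0, [1, 1, -1])
```

With a budget of 7, the best choice here takes the smaller candidate from group 0 and the larger one from group 1, leaving group 2 uncached. The classical problem requires exactly one item per class; allowing "none" is an assumption made here so that a tight budget always has a feasible answer.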

Notes

  1. Our method can be easily extended to share similar join operators, for example by applying the “equivalence classes” approach used in (Zhou et al. 2007). Although this is technically simple, our current optimization problem formulation would end up discarding such potential SEs because of their large memory footprints. Hence, we currently exclude such SEs from consideration.

  2. For the sake of readability, we omit the description of several other optimizations – such as the removal of duplicate predicates – that we have implemented.

  3. In light of the end-to-end MQO process, the last phase amounts to rewriting the queries in the input set to use the selected CEs. Such a rewrite can introduce additional work, which we currently neglect in our modeling approach: query rewriting involves highly selective operations with low cost. We therefore assume the dominating cost to be that of reading from RAM, which we found experimentally to be true.

  4. The source code of our prototype is available as an open source contribution at https://github.com/DistributedSystemsGroup/spark-sql-worksharing

  5. Note that the caching operator in Apache Spark is a transformation. As a consequence, it takes effect only upon the first call to an action, that is, with the first (rewritten) query. Thus, the first query effectively “pays the price” for caching.

  6. The attentive reader might have noticed that our method also eventually spills some of the cached data to disk. This is explained by two effects: i) Apache Spark dynamically adjusts at runtime the amount of memory dedicated to storing cached data, and thus overrides the 50% setting we use in our experiments; ii) our methodology relies on cardinality estimation to compute the weight of a CE, so estimation errors might induce the system to spill some records to disk.

  7. Data compression techniques can be helpful in this case, but we defer their analysis to future work.
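
The lazy behaviour described in note 5, where the first query pays for materializing the cache, can be mimicked in plain Python. This is a toy illustration and not Spark's API: the class `LazyCache` and the function `expensive_scan` are hypothetical stand-ins for a cached dataset and for the shared work.

```python
from typing import Callable, List, Optional


class LazyCache:
    """Toy stand-in for a lazily materialized cached dataset (not Spark's API)."""

    def __init__(self, compute: Callable[[], List[int]]) -> None:
        self._compute = compute
        self._data: Optional[List[int]] = None
        self.materializations = 0  # how many times the shared work actually ran

    def get(self) -> List[int]:
        # Declaring the cache does no work; the first read fills it.
        if self._data is None:
            self._data = self._compute()
            self.materializations += 1
        return self._data


def expensive_scan() -> List[int]:
    # Placeholder for the shared scan/filter that several queries need.
    return [x * x for x in range(5)]


shared = LazyCache(expensive_scan)        # declared, but nothing computed yet
before = shared.materializations          # 0
q1 = sum(shared.get())                    # first "query": computes and stores
q2 = max(shared.get())                    # second "query": reuses stored data
print(before, q1, q2, shared.materializations)  # 0 30 16 1
```

The shared work runs once, triggered by the first query, and later queries only read the stored result.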

References

  • Agrawal, S., Chaudhuri, S., & Narasayya, V. R. (2000). Automated selection of materialized views and indexes in sql databases. VLDB, 2000, 496–505.

  • Agrawal, P., Kifer, D., & Olston, C. (2008). Scheduling shared scans of large data files. Proceedings of the VLDB Endowment, 1(1), 958–969.

  • Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., et al. (2015). Spark sql: Relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383–1394. ACM.

  • Arumugam, S., Dobra, A., Jermaine, C.M., Pansare, N., Perez, L. (2010). The datapath system: A data-centric analytic processing engine for large data warehouses. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ‘10, pp. 519–530. ACM, New York, NY, USA. https://doi.org/10.1145/1807167.1807224.

  • Azim, T., Karpathiotakis, M., & Ailamaki, A. (2017). Recache: Reactive caching for fast analytics over heterogeneous data. Proceedings of the VLDB Endowment, 11(3).

  • Baril, X., Bellahsene, Z. (2003). Selection of materialized views: A cost-based approach. In: Advanced Information Systems Engineering, pp. 665–680. Springer.

  • Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U.A., Pasquin, R. (2011). Incoop: Mapreduce for incremental computations. In: Proceedings of the 2nd ACM Symposium on Cloud Computing, p. 7. ACM.

  • Candea, G., Polyzotis, N., & Vingralek, R. (2009). A scalable, predictable join operator for highly concurrent data warehouses. Proc. VLDB Endow, 2(1), 277–288. https://doi.org/10.14778/1687627.1687659.

  • Candea, G., Polyzotis, N., & Vingralek, R. (2011). Predictable performance and high query concurrency for data analytics. The VLDB Journal, 20(2), 227–248. https://doi.org/10.1007/s00778-011-0221-2.

  • Dalvi, N.N., Sanghai, S.K., Parsan, R., Sudarshan, S. (2001). Pipelining in multi-query optimization. In: ACM PODS, pp. 59–70. ACM.

  • Databricks: Spark sql performance test (2018). https://github.com/databricks/spark-sql-perf

  • Dean, J., & Ghemawat, S. (2008). Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.

  • Derakhshan, R., Dehne, F.K., Korn, O., Stantic, B. (2006). Simulated annealing for materialized view selection in data warehousing environment. In: Databases and applications, pp. 89–94.

  • Dursun, K., Binnig, C., Cetintemel, U., Kraska, T. (2017). Revisiting reuse in main memory database systems. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1275–1289. ACM.

  • Elghandour, I., & Aboulnaga, A. (2012). Restore: Reusing results of mapreduce jobs. Proceedings of the VLDB Endowment, 5(6), 586–597.

  • El-Helw, A., Raghavan, V., Soliman, M. A., Caragea, G., Gu, Z., & Petropoulos, M. (2015). Optimization of common table expressions in mpp database systems. Proceedings of the VLDB Endowment, 8(12), 1704–1715.

  • Finkelstein, S. (1982). Common expression analysis in database applications. In: Proceedings of the 1982 ACM SIGMOD international conference on Management of data, pp. 235–245. ACM.

  • Floratou, A., Megiddo, N., Potti, N., Özcan, F., Kale, U., Schmitz-Hermes, J. (2016). Adaptive caching in big sql using the hdfs cache. In: Proceedings of the Seventh ACM Symposium on Cloud Computing, pp. 321–333. ACM.

  • Giannikis, G., Alonso, G., & Kossmann, D. (2012). Shareddb: Killing one thousand queries with one stone. Proc. VLDB Endow, 5(6), 526–537. https://doi.org/10.14778/2168651.2168654.

  • Goldstein, J., Larson, P.Å. (2001). Optimizing queries using materialized views: a practical, scalable solution. In: ACM SIGMOD Record, vol. 30, pp. 331–342. ACM.

  • Gunda, P. K., Ravindranath, L., Thekkath, C. A., Yu, Y., & Zhuang, L. (2010). Nectar: Automatic management of data and computation in datacenters. OSDI, 10, 1–8.

  • Harizopoulos, S., Shkapenyuk, V., Ailamaki, A. (2005). Qpipe: A simultaneously pipelined relational query engine. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ‘05, pp. 383–394. ACM, New York, NY, USA. https://doi.org/10.1145/1066157.1066201.

  • Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D. (2007). Dryad: distributed data-parallel programs from sequential building blocks. In: ACM SIGOPS Operating Systems Review, vol. 41, pp. 59–72. ACM.

  • Ivanova, M. G., Kersten, M. L., Nes, N. J., & Gonçalves, R. A. (2010). An architecture for recycling intermediates in a column-store. ACM Transactions on Database Systems (TODS), 35(4), 24.

  • Kalnis, P., Mamoulis, N., & Papadias, D. (2002). View selection using randomized search. Data & Knowledge Engineering, 42(1), 89–111.

  • Kellerer, H., Pferschy, U., & Pisinger, D. (2004). Introduction to NP-completeness of knapsack problems. Berlin: Springer.

  • Li, B., Mazur, E., Diao, Y., McGregor, A., Shenoy, P. (2011). A platform for scalable one-pass analytics using mapreduce. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 985–996. ACM.

  • Li, B., Mazur, E., Diao, Y., McGregor, A., & Shenoy, P. (2012). Scalla: a platform for scalable one-pass analytics using mapreduce. ACM Transactions on Database Systems (TODS), 37(4), 27.

  • Merkle, R.C. (1980). Protocols for public key cryptosystems. Security and Privacy, IEEE Symposium on p. 122.

  • Michiardi, P., Carra, D., Migliorini, S. (2019). In-memory caching for multi-query optimization of data-intensive scalable computing workloads. In: Workshops of the EDBT/ICDT Joint Conference, EDBT/ICDT-WS.

  • Mistry, H., Roy, P., Sudarshan, S., Ramamritham, K. (2001). Materialized view selection and maintenance using multi-query optimization. In: ACM SIGMOD Record, vol. 30, pp. 307–318. ACM.

  • Nagel, F., Boncz, P., Viglas, S.D. (2013). Recycling in pipelined query evaluation. In: Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pp. 338–349. IEEE.

  • Nykiel, T., Potamias, M., Mishra, C., Kollios, G., & Koudas, N. (2010). Mrshare2: Sharing across multiple queries in mapreduce. Proc. VLDB Endow, 3(1–2), 494–505. https://doi.org/10.14778/1920841.1920906.

  • Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.G. (2015). Making sense of performance in data analytics frameworks. In: 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pp. 293–307. USENIX Association.

  • Psaroudakis, I., Athanassoulis, M., & Ailamaki, A. (2013). Sharing data and work across concurrent analytical queries. Proc. VLDB Endow, 6(9), 637–648. https://doi.org/10.14778/2536360.2536364.

  • Roy, P., Seshadri, S., Sudarshan, S., Bhobe, S. (2000). Efficient and extensible algorithms for multi query optimization. In: ACM SIGMOD Record, vol. 29, pp. 249–260. ACM.

  • Sellis, T. K. (1988). Multiple-query optimization. ACM Trans. Database Syst., 13(1), 23–52. https://doi.org/10.1145/42201.42203.

  • Shim, J., Scheuermann, P., Vingralek, R. (1999). Dynamic caching of query results for decision support systems. In: IEEE SSDBM, SSDBM ‘99, pp. 254–. IEEE.

  • Silva, Y.N., Larson, P.A., Zhou, J. (2012). Exploiting common subexpressions for cloud query processing. In: Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pp. 1337–1348. IEEE.

  • Sinha, P., & Zoltners, A. A. (1979). The multiple-choice knapsack problem. Operations Research, 27(3), 503–515.

  • Wang, G., & Chan, C. Y. (2013). Multi-query optimization in mapreduce framework. Proc. VLDB Endow, 7(3), 145–156. https://doi.org/10.14778/2732232.2732234.

  • Yang, J., Karlapalem, K., & Li, Q. (1997). Algorithms for materialized view design in data warehousing environment. VLDB, 97, 25–29.

  • Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pp. 2–2. USENIX Association.

  • Zhang, C., Yang, J. (1999). Genetic algorithm for materialized view selection in data warehouse environments. In: DataWarehousing and Knowledge Discovery, pp. 116–125. Springer.

  • Zhou, J., Larson, P.A., Freytag, J.C., Lehner, W. (2007). Efficient exploitation of similar subexpressions for query processing. In: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pp. 533–544. ACM.

  • Zhu, C., Zhu, Q., Zuzarte, C., & Ma, W. (2016). Optimization of generic progressive queries based on dependency analysis and materialized views. Information Systems Frontiers, 18(1), 205–231.

Acknowledgements

This work was partially supported by the Italian National Group for Scientific Computation (GNCS-INDAM) and by “Progetto di Eccellenza” of the Computer Science Dept., Univ. of Verona, Italy.

Author information

Corresponding author

Correspondence to Damiano Carra.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Michiardi, P., Carra, D. & Migliorini, S. Cache-Based Multi-Query Optimization for Data-Intensive Scalable Computing Frameworks. Inf Syst Front 23, 35–51 (2021). https://doi.org/10.1007/s10796-020-09995-2