DOI: 10.1145/3318464.3389758

Lambada: Interactive Data Analytics on Cold Data Using Serverless Cloud Infrastructure

Published: 31 May 2020

    Abstract

    Serverless computing has recently attracted a lot of attention from research and industry due to its promise of ultimate elasticity and operational simplicity. However, there is no consensus yet on whether or not the approach is suitable for data processing. In this paper, we present Lambada, a serverless distributed data processing framework designed to explore how to perform data analytics on serverless computing. In our analysis, supported with extensive experiments, we show in which scenarios serverless makes sense from an economic and performance perspective. We address several important technical questions that need to be solved to support data analytics and present examples from several domains where serverless offers a cost and performance advantage over existing solutions.

    Supplementary Material

    MP4 File (3318464.3389758.mp4)
    Presentation Video

    Published In

    SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
    June 2020
    2925 pages
    ISBN: 9781450367356
    DOI: 10.1145/3318464
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cloud computing
    2. data lake
    3. elasticity
    4. interactive analytics
    5. serverless computing
    6. serverless functions

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '20
    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2024) MinFlow. Proceedings of the 22nd USENIX Conference on File and Storage Technologies, 311-328. DOI: 10.5555/3650697.3650716. Online publication date: 27-Feb-2024.
    • (2024) Serverless computing based on dynamic-addressable session. SCIENTIA SINICA Informationis 54:3, 582. DOI: 10.1360/SSI-2023-0155. Online publication date: 11-Mar-2024.
    • (2024) Data pipeline approaches in serverless computing: a taxonomy, review, and research trends. Journal of Big Data 11:1. DOI: 10.1186/s40537-024-00939-0. Online publication date: 11-Jun-2024.
    • (2024) Vexless: A Serverless Vector Data Management System Using Cloud Functions. Proceedings of the ACM on Management of Data 2:3, 1-26. DOI: 10.1145/3654990. Online publication date: 30-May-2024.
    • (2024) Optimus: Warming Serverless ML Inference via Inter-Function Model Transformation. Proceedings of the Nineteenth European Conference on Computer Systems, 1039-1053. DOI: 10.1145/3627703.3629567. Online publication date: 22-Apr-2024.
    • (2024) FUYAO: DPU-enabled Direct Data Transfer for Serverless Computing. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 431-447. DOI: 10.1145/3620666.3651327. Online publication date: 27-Apr-2024.
    • (2024) FaaSGraph: Enabling Scalable, Efficient, and Cost-Effective Graph Processing with Serverless Computing. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 385-400. DOI: 10.1145/3620665.3640361. Online publication date: 27-Apr-2024.
    • (2024) DirectFaaS: A Clean-Slate Network Architecture for Efficient Serverless Chain Communications. Proceedings of the ACM on Web Conference 2024, 2759-2767. DOI: 10.1145/3589334.3645333. Online publication date: 13-May-2024.
    • (2024) TIMBER: On supporting data pipelines in Mobile Cloud Environments. 2024 25th IEEE International Conference on Mobile Data Management (MDM), 93-102. DOI: 10.1109/MDM61037.2024.00032. Online publication date: 24-Jun-2024.
    • (2024) FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2109-2122. DOI: 10.1109/ICDE60146.2024.00168. Online publication date: 13-May-2024.
