DOI: 10.1145/3415958.3433082
research-article

Scalable Execution of Big Data Workflows using Software Containers

Published: 27 November 2020

    Abstract

    Big Data processing involves handling large and complex data sets and combining different tools, frameworks, and other processes that help organisations make sense of data collected from various sources. These operations, collectively referred to as Big Data workflows, need to exploit the elasticity of cloud infrastructures in order to scale. In this paper, we present the design and prototype implementation of a Big Data workflow approach that uses software container technologies and message-oriented middleware (MOM) to enable highly scalable workflow execution. We demonstrate the approach in a use case and evaluate it in a set of experiments that show its practical applicability for the scalable execution of Big Data workflows. Furthermore, we compare the scalability of our approach with that of Argo Workflows, one of the most prominent tools in the area of Big Data workflows.
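
    The general idea of decoupling workflow steps through message queues, so that each step can be scaled by adding replicas, can be illustrated with a minimal sketch. This is not the paper's implementation: in-process queues stand in for the message-oriented middleware, threads stand in for containers, and the names `run_step`, `run_workflow`, and `replicas`, as well as the two toy steps, are assumptions made purely for illustration.

```python
import queue
import threading

STOP = object()  # sentinel telling one worker replica to shut down


def run_step(func, inbox, outbox, replicas):
    """Start `replicas` workers that all consume from the same inbox queue.

    Threads are a stand-in for containers; the queues are a stand-in
    for a message broker (both are illustrative assumptions).
    """
    def worker():
        while True:
            msg = inbox.get()
            if msg is STOP:
                break
            outbox.put(func(msg))

    threads = [threading.Thread(target=worker) for _ in range(replicas)]
    for t in threads:
        t.start()
    return threads


def run_workflow(records, replicas=3):
    """Hypothetical two-step workflow: normalise each record, then measure it."""
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    step1 = run_step(lambda s: s.strip().lower(), q_in, q_mid, replicas)
    step2 = run_step(len, q_mid, q_out, replicas)

    for record in records:
        q_in.put(record)

    # Drain step 1 completely before telling step 2 to stop, so that
    # no message is still in flight when the sentinels arrive.
    for _ in step1:
        q_in.put(STOP)
    for t in step1:
        t.join()
    for _ in step2:
        q_mid.put(STOP)
    for t in step2:
        t.join()

    results = []
    while not q_out.empty():
        results.append(q_out.get())
    # Replicas finish in a nondeterministic order, so sort for a stable result.
    return sorted(results)
```

    Because the steps share nothing but queues, changing `replicas` alters the degree of parallelism without touching the step logic, which is the property a queue-based design of this kind relies on for scaling.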

    Published In

    MEDES '20: Proceedings of the 12th International Conference on Management of Digital EcoSystems
    November 2020
    170 pages
    ISBN:9781450381154
    DOI:10.1145/3415958
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Big Data workflows
    2. Domain-specific languages
    3. Software containers

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    MEDES '20
    MEDES '20: 12th International Conference on Management of Digital EcoSystems
    November 2 - 4, 2020
    Virtual Event, United Arab Emirates

    Acceptance Rates

    MEDES '20 Paper Acceptance Rate: 19 of 27 submissions, 70%
    Overall Acceptance Rate: 267 of 682 submissions, 39%

    Cited By

    • (2022) Sustainable Big Data Analytics Process Pipeline Using Apache Ecosystem. Encyclopedia of Data Science and Machine Learning, 1247-1259. DOI: 10.4018/978-1-7998-9220-5.ch073. Online publication date: 14-Oct-2022
    • (2022) Dataclouddsl: Textual and Visual Presentation of Big Data Pipelines. 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), 1165-1171. DOI: 10.1109/COMPSAC54236.2022.00183. Online publication date: Jun-2022
    • (2022) Matching-based Scheduling of Asynchronous Data Processing Workflows on the Computing Continuum. 2022 IEEE International Conference on Cluster Computing (CLUSTER), 58-70. DOI: 10.1109/CLUSTER51413.2022.00021. Online publication date: Sep-2022
    • (2022) Supporting Semantic Data Enrichment at Scale. Technologies and Applications for Big Data Value, 19-39. DOI: 10.1007/978-3-030-78307-5_2. Online publication date: 29-Apr-2022
    • (2021) Big Data Workflows: Locality-Aware Orchestration Using Software Containers. Sensors 21:24 (8212). DOI: 10.3390/s21248212. Online publication date: 8-Dec-2021
    • (2021) Locality-Aware Workflow Orchestration for Big Data. Proceedings of the 13th International Conference on Management of Digital EcoSystems, 62-70. DOI: 10.1145/3444757.3485106. Online publication date: 1-Nov-2021
    • (2021) Conceptualization and scalable execution of big data workflows using domain-specific languages and software containers. Internet of Things 16 (100440). DOI: 10.1016/j.iot.2021.100440. Online publication date: Dec-2021
