DOI: 10.1145/3337821.3337864
Research Article (Public Access)

Cooperative Job Scheduling and Data Allocation for Busy Data-Intensive Parallel Computing Clusters

Published: 05 August 2019
  Abstract

    In data-intensive parallel computing clusters, it is important to provide deadline-guaranteed service to jobs while minimizing resource usage (e.g., network bandwidth and energy). Under the current computing framework, which first allocates data and then schedules jobs, these objectives are difficult to achieve simultaneously in a busy cluster with many jobs. We model the problem of achieving the objectives simultaneously as an integer program and propose a heuristic Cooperative job Scheduling and data Allocation method (CSA). CSA reverses the order of data allocation and job scheduling in the current computing framework, i.e., it changes data-first-job-second to job-first-data-second. This enables CSA to proactively consolidate tasks that request more data in common onto the same server during deadline-aware scheduling, and to consolidate tasks onto as few servers as possible to maximize energy savings. It also facilitates the subsequent data allocation step, which places each data block on the server that hosts most of the block's requester tasks, thereby maximizing data locality and reducing bandwidth consumption. CSA further includes a recursive schedule refinement process that adjusts the job and data allocation schedules to improve system performance on the three objectives and to balance data locality and energy savings according to specified weights. We implemented CSA and a number of previous job schedulers on Apache Hadoop on a real supercomputing cluster. Trace-driven experiments in simulation and on the real cluster show that CSA outperforms the other schedulers in providing deadline-guaranteed and resource-efficient service.
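    To make the job-first-data-second idea concrete, below is a minimal Python sketch of the two greedy steps the abstract describes: deadline-ordered tasks are first consolidated onto servers that already host tasks requesting overlapping data (and onto as few servers as possible), and each data block is then placed on the server hosting most of its requester tasks. All names, the slot-based capacity model, and the tie-breaking rules are illustrative assumptions; this is not the paper's actual CSA algorithm, which is driven by an integer-programming model and also includes a recursive schedule refinement step.

    ```python
    from collections import defaultdict
    from dataclasses import dataclass

    # Illustrative sketch of the job-first-data-second idea from the abstract.
    # All names, the slot-based capacity model, and the greedy tie-breaking
    # rules are assumptions for exposition, not the paper's CSA formulation.

    @dataclass
    class Task:
        job_id: str
        task_id: str
        blocks: set       # data blocks this task requests
        deadline: float   # completion deadline (earlier = more urgent)

    def schedule_tasks(tasks, servers, slots_per_server):
        """Step 1 (job scheduling first): place deadline-ordered tasks so that
        tasks sharing requested data are consolidated onto the same server,
        and as few servers as possible are activated (energy savings)."""
        placement = {}                     # task_id -> server
        server_blocks = defaultdict(set)   # blocks requested by tasks on a server
        server_load = defaultdict(int)     # number of tasks assigned to a server
        active = []                        # servers already hosting tasks

        for task in sorted(tasks, key=lambda t: t.deadline):
            # Prefer an already-active server with the largest data overlap.
            candidates = [s for s in active if server_load[s] < slots_per_server]
            if candidates:
                best = max(candidates,
                           key=lambda s: len(task.blocks & server_blocks[s]))
            else:
                # Activate the next server that still has free slots
                # (raises StopIteration if the cluster is full).
                best = next(s for s in servers if server_load[s] < slots_per_server)
                active.append(best)
            placement[task.task_id] = best
            server_blocks[best] |= task.blocks
            server_load[best] += 1
        return placement

    def allocate_data(tasks, placement):
        """Step 2 (data allocation second): put each block on the server that
        hosts the most of its requester tasks, maximizing data locality."""
        counts = defaultdict(lambda: defaultdict(int))   # block -> server -> count
        for task in tasks:
            for block in task.blocks:
                counts[block][placement[task.task_id]] += 1
        return {block: max(per_server, key=per_server.get)
                for block, per_server in counts.items()}

    if __name__ == "__main__":
        tasks = [
            Task("j1", "t1", {"b1", "b2"}, deadline=5),
            Task("j1", "t2", {"b2", "b3"}, deadline=5),
            Task("j2", "t3", {"b4"},       deadline=10),
        ]
        placement = schedule_tasks(tasks, ["s1", "s2", "s3"], slots_per_server=2)
        print(placement)                         # t1 and t2, which share b2, land on s1
        print(allocate_data(tasks, placement))   # b1, b2, b3 -> s1; b4 -> s2
    ```

    In this toy run, scheduling first co-locates the two tasks that share block b2, so the subsequent data allocation can place that block on the very server where both requesters run, which is the locality benefit the abstract attributes to reversing the two steps.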

    Cited By

    • Cooperative Job Scheduling and Data Allocation in Data-Intensive Parallel Computing Clusters. IEEE Transactions on Cloud Computing (2022), 1-14. https://doi.org/10.1109/TCC.2022.3206206

    Published In

    ICPP '19: Proceedings of the 48th International Conference on Parallel Processing
    August 2019
    1107 pages
    ISBN:9781450362955
    DOI:10.1145/3337821
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    In-Cooperation

    • University of Tsukuba

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • CNS
    • NSF
    • OAC
    • CCF
    • Microsoft Research Faculty Fellowship
    • ACI

    Conference

    ICPP 2019

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%
