
Modeling and analysis of dynamic coscheduling in parallel and distributed environments

Published: 01 June 2002

Abstract

Scheduling in large-scale parallel systems has been and continues to be an important and challenging research problem. Several key factors, including the increasing use of off-the-shelf clusters of workstations to build such parallel systems, have resulted in the emergence of a new class of scheduling strategies, broadly referred to as dynamic coscheduling. Unfortunately, both the design space and the performance space of these emerging scheduling strategies are quite large, due in part to the numerous dynamic interactions among the different components of the parallel computing environment, as well as the wide range of applications and systems that can comprise the parallel environment. This in turn makes it difficult to fully explore the benefits and limitations of the various proposed dynamic coscheduling approaches for large-scale systems solely with the use of simulation and/or experimentation.

To gain a better understanding of the fundamental properties of different dynamic coscheduling methods, we formulate a general mathematical model of this class of scheduling strategies within a unified framework that allows us to investigate a wide range of parallel environments. We derive a matrix-analytic analysis based on a stochastic decomposition and a fixed-point iteration. A large number of numerical experiments are performed, in part to examine the accuracy of our approach. These numerical results are in excellent agreement with detailed simulation results. Our mathematical model and analysis are then used to explore several fundamental design and performance tradeoffs associated with the class of dynamic coscheduling policies across a broad spectrum of parallel computing environments.
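The abstract names a matrix-analytic analysis and a fixed-point iteration without giving equations. As a generic illustration only, and not the paper's model, the sketch below applies the textbook fixed-point iteration for the rate matrix R of a continuous-time quasi-birth-death (QBD) process, the minimal nonnegative solution of A0 + R·A1 + R²·A2 = 0. The two-phase queue, its rates, and every function name here are assumptions made up for the example.

```python
# Illustrative sketch: fixed-point iteration for the rate matrix R of a
# continuous-time QBD process, i.e. the minimal nonnegative solution of
#     A0 + R*A1 + R^2*A2 = 0.
# The example queue and all names are hypothetical, not taken from the paper.

def matmul(a, b):
    """Multiply two small dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def qbd_rate_matrix(a0, a1, a2, tol=1e-12, max_iter=100000):
    """Iterate R <- (A0 + R^2 A2)(-A1^{-1}) from R = 0 until successive
    iterates differ by less than tol in every entry."""
    n = len(a0)
    neg_a1_inv = [[-x for x in row] for row in inverse(a1)]
    r = [[0.0] * n for _ in range(n)]
    for _ in range(max_iter):
        r2a2 = matmul(matmul(r, r), a2)
        rhs = [[a0[i][j] + r2a2[i][j] for j in range(n)] for i in range(n)]
        nxt = matmul(rhs, neg_a1_inv)
        diff = max(abs(nxt[i][j] - r[i][j]) for i in range(n) for j in range(n))
        r = nxt
        if diff < tol:
            return r
    raise RuntimeError("fixed-point iteration did not converge")

def residual(a0, a1, a2, r):
    """Return A0 + R*A1 + R^2*A2, which should be close to the zero matrix."""
    ra1 = matmul(r, a1)
    r2a2 = matmul(matmul(r, r), a2)
    n = len(a0)
    return [[a0[i][j] + ra1[i][j] + r2a2[i][j] for j in range(n)] for i in range(n)]

# Hypothetical example: a single-server queue whose arrival rate switches
# between two environment phases (rates 1 and 3), with service rate 4.
A0 = [[1.0, 0.0], [0.0, 3.0]]      # transitions one level up (arrivals)
A1 = [[-6.0, 1.0], [2.0, -9.0]]    # transitions within a level
A2 = [[4.0, 0.0], [0.0, 4.0]]      # transitions one level down (services)
R = qbd_rate_matrix(A0, A1, A2)
```

Starting from the zero matrix keeps every iterate nonnegative, which is why this simple scheme lands on the minimal solution when the queue is stable. In the single-phase case the same function reduces to the M/M/1 queue, where R is the scalar ratio of arrival rate to service rate.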



Published In

ACM SIGMETRICS Performance Evaluation Review, Volume 30, Issue 1
Measurement and modeling of computer systems
June 2002
286 pages
ISSN: 0163-5999
DOI: 10.1145/511399

SIGMETRICS '02: Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
June 2002
299 pages
ISBN: 1581135319
DOI: 10.1145/511334
Publisher

Association for Computing Machinery, New York, NY, United States
