Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Load Balancing with Job-Size Testing: Performance Improvement or Degradation?

Published: 17 April 2024 Publication History
  • Get Citation Alerts
  • Abstract

    In the context of decision making under explorable uncertainty, scheduling with testing is a powerful technique used in the management of computer systems to improve performance via better job-dispatching decisions. Upon job arrival, a scheduler may run some testing algorithm against the job to extract some information about its structure, e.g., its size, and properly classify it. The acquisition of such knowledge comes with a cost because the testing algorithm delays the dispatching decisions, though this is under control. In this article, we analyze the impact of such extra cost in a load balancing setting by investigating the following questions: does it really pay off to test jobs? If so, under which conditions? Under mild assumptions connecting the information extracted by the testing algorithm in relationship with its running time, we show that whether scheduling with testing brings a performance degradation or improvement strongly depends on the traffic conditions, system size and the coefficient of variation of job sizes. Thus, the general answer to the above questions is non-trivial and some care should be considered when deploying a testing policy. Our results are achieved by proposing a load balancing model for scheduling with testing that we analyze in two limiting regimes. When the number of servers grows to infinity in proportion to the network demand, we show that job-size testing actually degrades performance unless short jobs can be predicted reliably almost instantaneously and the network load is sufficiently high. When the coefficient of variation of job sizes grows to infinity, we construct testing policies inducing an arbitrarily large performance gain with respect to running jobs untested.

    References

    [1]
    Susanne Albers and Alexander Eckl. 2020. Explorable uncertainty in scheduling with non-uniform testing times. In Approximation and Online Algorithms: 18th International Workshop, WAOA 2020, Virtual Event, September 9–10, 2020, Revised Selected Articles. Springer-Verlag, Berlin, 127–142.
    [2]
    Susanne Albers and Alexander Eckl. 2021. Scheduling with testing on multiple identical parallel machines. In Algorithms and Data Structures: 17th International Symposium, WADS 2021, August 9-11, 2021, Proceedings. Springer-Verlag, Berlin, 29–42.
    [3]
    Saed Alizamir, Francis de Véricourt, and Peng Sun. 2013. Diagnostic accuracy under congestion. Management Science 59, 1 (2013), 157–171.
    [4]
    Jonatha Anselmi. 2020. Combining size-based load balancing with round-robin for scalable low latency. IEEE Transactions on Parallel and Distributed Systems 31, 4 (2020), 886–896.
    [5]
    J. Anselmi and J. Doncel. 2019. Asymptotically optimal size-interval task assignments. IEEE Transactions on Parallel and Distributed Systems 30, 11 (nov2019), 2422–2433.
    [6]
    S. Asmussen and. 2003. Applied Probability and Queues. Springer. 86013173
    [7]
    Eitan Bachmat and Josu Doncel. 2021. Size-based routing policies: Non-asymptotic analysis and design of decentralized systems. Sensors 21, 8 (2021). https://www.mdpi.com/1424-8220/21/8/2701
    [8]
    Eitan Bachmat, Josu Doncel, and Hagit Sarfati. 2019. Performance and stability analysis of the task assignment based on guessing size routing policy. In 2019 IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). 1–13.
    [9]
    Eitan Bachmat and Hagit Sarfati. 2010. Analysis of SITA policies. Perform. Eval. 67, 2 (Feb.2010), 102–120.
    [10]
    Cynthia Bailey Lee, Yael Schwartzman, Jennifer Hardy, and Allan Snavely. 2005. Are user runtime estimates inherently inaccurate?. In Job Scheduling Strategies for Parallel Processing, Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn (Eds.). Springer Berlin, Berlin, 253–263.
    [11]
    Luis Diego Briceno, Bhavesh Khemka, Howard Jay Siegel, Anthony A. Maciejewski, Christopher Groër, Gregory Koenig, Gene Okonski, and Steve Poole. 2011. Time utility functions for modeling and evaluating resource allocations in a heterogeneous computing system. In IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. 7–19.
    [12]
    João M. P. Cardoso, José Gabriel F. Coutinho, and Pedro C. Diniz. 2017. Chapter 5 - Source code transformations and optimizations. In Embedded Computing for High Performance, João M. P. Cardoso, José Gabriel F. Coutinho, and Pedro C. Diniz (Eds.). Morgan Kaufmann, Boston, 137–183.
    [13]
    Wenyan Chen, Kejiang Ye, Yang Wang, Guoyao Xu, and Cheng-Zhong Xu. 2018. How does the workload look like in production cloud? Analysis and clustering of workloads on Alibaba cluster trace. In 2018 IEEE 24th Int. Conf. on Parallel and Distributed Systems (ICPADS). 102–109.
    [14]
    Robert J. Adler, Raisa E. Feldman, and Murad S. Taqqu (Eds.). 1998. A Practical Guide to Heavy Tails: Statistical Techniques and Applications. Birkhauser Boston Inc.
    [15]
    Mark Van der Boor, Sem C. Borst, Johan S. H. Van Leeuwaarden, and Debankur Mukherjee. 2022. Scalable load balancing in networked systems: A survey of recent advances. SIAM Rev. 64, 3 (2022), 554–622.
    [16]
    Sheng Di, Derrick Kondo, and Walfredo Cirne. 2012. Characterization and comparison of cloud versus grid workloads. In 2012 IEEE International Conference on Cluster Computing. 230–238.
    [17]
    Fanny Dufossé, Christoph Dürr, Noël Nadal, Denis Trystram, and Óscar C. Vásquez. 2022. Scheduling with a processing time oracle. Applied Mathematical Modelling 104 (2022), 701–720.
    [18]
    Christoph Dürr, Thomas Erlebach, Nicole Megow, and Julie Meißner. 2018. Scheduling with explorable uncertainty. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018)(Leibniz International Proceedings in Informatics (LIPIcs), Vol. 94), Anna R. Karlin (Ed.). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 30:1–30:14.
    [19]
    Christoph Dürr, Thomas Erlebach, Nicole Megow, and Julie Meißner. 2020. An adversarial model for scheduling with testing. Algorithmica 82, 12 (2020), 3630–3675.
    [20]
    Muhammad El-Taha and Bacel Maddah. 2006. Allocation of service time in a multiserver system. Management Science 52, 4 (2006), 623–637.
    [21]
    Yuping Fan, Paul Rich, William E. Allcock, Michael E. Papka, and Zhiling Lan. 2017. Trade-off between prediction accuracy and underestimation rate in job runtime estimates. In 2017 IEEE International Conference on Cluster Computing (CLUSTER). 530–540.
    [22]
    Dror G. Feitelson. 2015. Workload Modeling for Computer Systems Performance Evaluation (1st ed.). Cambridge University Press, USA.
    [23]
    Ana Gainaru, Hongyang Sun, Guillaume Aupy, Yuankai Huo, Bennett A. Landman, and Padma Raghavan. 2019. On-the-fly scheduling versus reservation-based scheduling for unpredictable workflows. The International Journal of High Performance Computing Applications 33, 6 (2019), 1140–1158.
    [24]
    Mor Harchol-Balter. 2002. Task assignment with unknown duration. J. ACM 49, 2 (2002), 260–288.
    [25]
    Mor Harchol-Balter, Mark E. Crovella, and Cristina D. Murta. 1999. On choosing a task assignment policy for a distributed server system. J. Parallel and Distrib. Comput. 59, 2 (1999), 204–228.
    [26]
    Mor Harchol-Balter and Allen B. Downey. 1996. Exploiting process lifetime distributions for dynamic load balancing. In Proceedings of the Int. Conf. on Measurement and Modeling of Computer Systems (Philadelphia, Pennsylvania, USA) (SIGMETRICS ’96). ACM, New York, NY, USA, 13–24.
    [27]
    Mor Harchol-Balter, Alan Scheller-Wolf, and Andrew R. Young. 2009. Surprising results on task assignment in server farms with high-variability workloads. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (Seattle, WA, USA) (SIGMETRICS ’09). ACM, New York, NY, USA, 287–298.
    [28]
    Esa Hyytiä and Rhonda Righter. 2024. Towards the optimal dynamic size-aware dispatching. Performance Evaluation 164 (2024), 102396.
    [29]
    Retsef Levi, Thomas Magnanti, and Yaron Shaposhnik. 2019. Scheduling with testing. Management Science 65, 2 (2019), 776–793.
    [30]
    Alex F. Mills, Nilay Tanık Argon, and Serhan Ziya. 2013. Resource-based patient prioritization in mass-casualty incidents. Manufacturing & Service Operations Management 15, 3 (2013), 361–377.
    [31]
    Mina Naghshnejad and Mukesh Singhal. 2020. A hybrid scheduling platform: A runtime prediction reliability aware scheduling platform to improve HPC scheduling performance. The Journal of Supercomputing 76 (2020), 28 pages.
    [32]
    G. R. Nudd, D. J. Kerbyson, E. Papaefstathiou, S. C. Perry, J. S. Harper, and D. V. Wilcox. 2000. Pace—A toolset for the performance prediction of parallel and distributed systems. The International Journal of High Performance Computing Applications 14, 3 (2000), 228–251.
    [33]
    Alan Scheller-Wolf and Karl Sigman. 1997. Delay moments for FIFO GI/GI/s queues. Queueing Systems 25, 1/4 (Jan.1997), 77–95.
    [34]
    Dan Tsafrir, Yoav Etsion, and Dror G. Feitelson. 2007. Backfilling using system-generated predictions rather than user runtime estimates. IEEE Transactions on Parallel and Distributed Systems 18, 6 (2007), 789–803.
    [35]
    Y. Wiseman, K. Schwan, and P. Widener. 2004. Efficient end to end data exchange using configurable compression. In 24th Int. Conf. on Distributed Computing Systems. 228–235.
    [36]
    Carl Witt, Marc Bux, Wladislaw Gusew, and Ulf Leser. 2019. Predictive performance modeling for distributed batch processing using black box monitoring and machine learning. Information Systems 82 (2019), 33–52.
    [37]
    Bo Zhang and Bert Zwart. 2013. Steady-state analysis for multiserver queues under size interval task assignment in the quality-driven regime. Math. of Op. Res. 38, 3 (2013), 504–525.
    [38]
    Darko Zivanovic, Milan Pavlovic, Milan Radulovic, Hyunsung Shin, Jongpil Son, Sally A. Mckee, Paul M. Carpenter, Petar Radojković, and Eduard Ayguadé. 2017. Main memory in HPC: Do we need more or could we live with less? ACM Trans. Archit. Code Optim. 14, 1, Article 3 (Mar.2017), 26 pages.
    [39]
    Salah Zrigui, Raphael Y. de Camargo, Arnaud Legrand, and Denis Trystram. 2022. Improving the performance of batch schedulers using online job runtime classification. J. Parallel and Distrib. Comput. 164 (2022), 83–95.

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Modeling and Performance Evaluation of Computing Systems
    ACM Transactions on Modeling and Performance Evaluation of Computing Systems  Volume 9, Issue 2
    June 2024
    103 pages
    ISSN:2376-3639
    EISSN:2376-3647
    DOI:10.1145/3613566
    • Editor:
    • Leana Golubchik
    Issue’s Table of Contents

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 April 2024
    Online AM: 04 March 2024
    Accepted: 25 February 2024
    Revised: 17 January 2024
    Received: 17 October 2023
    Published in TOMPECS Volume 9, Issue 2

    Check for updates

    Author Tags

    1. Load balancing
    2. dispatching policies
    3. size-based routing
    4. scheduling with testing
    5. explorable uncertainty

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 87
      Total Downloads
    • Downloads (Last 12 months)87
    • Downloads (Last 6 weeks)11
    Reflects downloads up to 26 Jul 2024

    Other Metrics

    Citations

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Full Text

    View this article in Full Text.

    Full Text

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media