We first review prior works based on their scheduling objectives.
4.1.1 Efficiency.
As discussed in Section
2.1.2, the main objective for scheduling an inference workload is to improve its efficiency. This includes reducing inference latency and monetary cost and improving prediction accuracy. The challenge is that tradeoffs exist among these efficiency goals. Below, we discuss techniques to improve each goal individually as well as to balance them jointly.
(1) Accuracy efficiency. Improving prediction accuracy is a perpetual objective of an inference system. Although the prediction accuracy for a fixed input and model (with deterministic inference execution, which is the usual case) is determined, the scheduling system can achieve better accuracy by selecting among different models and allocating resources across models more intelligently. To achieve this, one approach is to collect a set of models and select the best one to predict the result for each input query. The scheduling decisions include model selection and resource allocation among the candidates. Ease.ml [
102] leverages the input and output shape information of the query sample to automatically select the model. It estimates the potential accuracy improvement of each candidate model and then picks the one with the highest estimate for actual inference. It also formulates the cost-aware model selection process under both single-tenant and multi-tenant settings with multi-armed bandits and Bayesian Optimization. Another effective approach is the model ensemble, which combines the prediction results from multiple models to improve prediction accuracy and generalization. Clipper [24] examines the benefits of model ensembles in computer vision tasks and applies a linear ensemble method that computes a weighted average of the base model predictions. The linear weights are determined by bandit- and learning-based approaches. Rafiki [
169] leverages an RL model to determine the model set for the ensemble. This model is also used to identify the final optimal model combinations and tune critical parameters, e.g., batch size.
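To make the linear ensemble concrete, the following is a minimal sketch of weighted-average ensembling with a simple feedback-driven weight update. The `LinearEnsemble` class, its reward rule, and the assumption that each base model returns a class-probability vector are illustrative and are not Clipper's or Rafiki's actual implementations, where the weights are maintained by bandit-, learning-, or RL-based policies.

```python
import numpy as np

class LinearEnsemble:
    def __init__(self, models, lr=0.1):
        self.models = models                    # callables: input -> class-probability vector
        self.weights = np.ones(len(models)) / len(models)
        self.lr = lr

    def predict(self, x):
        preds = np.stack([m(x) for m in self.models])   # shape: (n_models, n_classes)
        combined = self.weights @ preds                  # weighted average of base predictions
        return int(np.argmax(combined)), preds

    def feedback(self, preds, true_label):
        # Shift weight toward models whose top-1 prediction matched the ground truth.
        correct = (preds.argmax(axis=1) == true_label).astype(float)
        self.weights = self.weights + self.lr * correct
        self.weights /= self.weights.sum()               # keep the weights normalized
```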
(2) Latency efficiency. An inference system should deliver a satisfactory response time, even under bursty and fluctuating query loads. In this section, we discuss scheduling techniques that improve latency efficiency from two perspectives: (1) how to reduce the latency of a fixed number of inference requests, and (2) how to allocate resources efficiently to meet the latency requirement. The latency requirement challenges the scheduler to decide which jobs to prioritize during job assignment and rearrangement, and this objective can be achieved by carefully optimizing resource allocation.
Because a single request typically leaves a GPU underutilized, it is common to launch multiple inference execution instances concurrently to reduce request latency as much as possible. The inference scheduler can therefore make scaling decisions according to the request density to maintain a satisfactory latency. Clipper [
24] conducts linear scaling of inference instances and uses separate Docker containers to isolate different models. It replicates the model containers according to the number of queries and applies adaptive batching independently for each model because their execution times vary. MArk [199, 200] scales the inference instances with cloud services. It selects and combines different cloud services such as AWS EC2 and Lambda based on their prices and scaling abilities. It also proactively monitors the system load and request queuing situation and leverages Lambda to scale up instances when some requests would otherwise violate their latency demands. InferLine [
23] targets pipelined inference workloads with multiple stages. It monitors the query frequency to each model and makes scaling decisions for each component separately to maintain the latency SLOs even during sharp load bursts.
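As a concrete illustration of such latency-driven scaling, the sketch below computes a replica count from the observed request rate, queue length, and tail latency. The capacity model and the thresholds are assumptions for illustration only and do not reproduce the policies of Clipper, MArk, or InferLine.

```python
import math

def decide_replicas(request_rate, per_replica_rate, queue_len, p99_latency_ms,
                    slo_ms, current_replicas, max_replicas=64):
    # Baseline: enough replicas to absorb the observed request rate.
    needed = math.ceil(request_rate / per_replica_rate)
    # Scale out aggressively if tail latency approaches the SLO or the queue builds up;
    # scale in conservatively when latency is far below the SLO.
    if p99_latency_ms > 0.9 * slo_ms or queue_len > 2 * current_replicas:
        needed = max(needed, current_replicas + 1)
    elif p99_latency_ms < 0.5 * slo_ms:
        needed = min(needed, current_replicas - 1)
    return max(1, min(max_replicas, needed))

# Example: load is rising and p99 latency is near the 100 ms SLO, so scale out.
print(decide_replicas(request_rate=450, per_replica_rate=100, queue_len=20,
                      p99_latency_ms=95.0, slo_ms=100.0, current_replicas=4))
```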
A number of works aim to provide bounded latency for inference execution at the system level by exploiting its deterministic execution. Clockwork [
51] observes that many DL inference models have deterministic performance because the underlying computations are deterministic. Thus, it guarantees deterministic latency by alleviating the uncertainty introduced by other components of the system. To overcome the uncertainty from memory and cache usage, hardware interactions, and other uncontrollable external performance variations, Clockwork consolidates the configurations across all system layers during inference execution: it proactively controls memory allocation and deallocation and disables concurrent execution of multiple inference workloads to eliminate interference. Reducing execution parallelism removes interference from other tasks but inevitably lowers throughput and resource utilization. To address this issue, Abacus [26] aims to guarantee SLOs for query requests under GPU co-location scenarios. It proactively controls the execution sequence and co-location situation, rather than relying on the default randomly ordered execution overlap. Given the explicit order and the specific co-located operators on GPUs, Abacus can estimate the running time under co-location from an early offline profiling stage. Based on this estimation, the query controller schedules all simultaneous inference workloads to guarantee QoS by searching for the optimal execution sequence of DNN operators. ParM [
87] migrates the concept of erasure codes from distributed computing to model inference systems and uses learning-based coded computation to introduce redundancy, thereby supporting the recovery of inference executions that suffer from tail latency or failures.
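The following sketch illustrates how profiled, deterministic per-model latencies can drive SLO-aware scheduling: requests that can no longer meet their deadlines are dropped early and the rest are ordered earliest-deadline-first. The profile table, request format, and drop policy are illustrative assumptions rather than Clockwork's or Abacus's actual mechanisms.

```python
# Hypothetical per-model execution-time profiles in milliseconds.
PROFILED_MS = {"resnet50": 12.0, "bert-base": 35.0}

def schedule(requests, now_ms):
    """requests: list of (model_name, deadline_ms). Keep only requests that can
    still finish before their deadline, ordered earliest-deadline-first."""
    feasible = [(model, deadline) for model, deadline in requests
                if now_ms + PROFILED_MS[model] <= deadline]
    return sorted(feasible, key=lambda r: r[1])

# Example: at t = 100 ms, a bert-base query due at 120 ms is dropped early.
print(schedule([("resnet50", 130.0), ("bert-base", 120.0)], now_ms=100.0))
```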
Some solutions proactively schedule the inference workloads and rearrange the execution sequence at the job level. Irina [173] models the satisfaction of latency demands as a scheduling problem. By leveraging preemption for DL inference workloads, Irina dynamically decides whether to preempt an ongoing query and launch a later-arriving one, which brings a significant reduction in the average completion time of inference workloads. The main challenge is that existing ML frameworks are not designed for preemption during execution. Irina carefully manages the preemption process by adding an exit node to the existing dataflow graph of the inference workload, thus enabling safe preemption at arbitrary moments. Effective scheduling also benefits from more runtime information about the inference workloads. Kube-Knots [
158] makes predictions about the resource utilization of each inference workload from two aspects. From the spatial aspect, Kube-Knots discovers the correlations across different resource-utilization metrics and then forecasts future resource utilization. From the temporal aspect, Kube-Knots predicts the peak inference usage and tries to avoid co-locating jobs that could attain peak consumption of the same resources simultaneously.
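A minimal sketch of the peak-aware co-location idea follows: given each job's predicted peak usage per resource, jobs are co-located only if their peaks do not jointly exceed capacity. The input format and the normalized capacity map are assumptions for illustration, not Kube-Knots's implementation.

```python
def can_colocate(predicted_peaks, capacity):
    """predicted_peaks: list of dicts, e.g. {"gpu_util": 0.7, "gpu_mem": 0.5}.
    capacity: per-resource capacity, normalized to 1.0 per device."""
    for resource, cap in capacity.items():
        total_peak = sum(job.get(resource, 0.0) for job in predicted_peaks)
        if total_peak > cap:
            return False          # peaks on this resource would collide
    return True

# Example: two jobs whose GPU-utilization peaks would together exceed one device.
jobs = [{"gpu_util": 0.7, "gpu_mem": 0.4}, {"gpu_util": 0.5, "gpu_mem": 0.3}]
print(can_colocate(jobs, {"gpu_util": 1.0, "gpu_mem": 1.0}))   # False
```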
(3) Cost-efficiency. Monetary cost becomes one of the main concerns when deploying DL inference workloads on public cloud resources. Considering the varied compute capabilities and prices of different types of resources and services, several schedulers implement mechanisms to achieve cost-efficient inference. MArk [199, 200] analyzes the cost of utilizing different levels of resource abstraction in Amazon Web Services (AWS) and Google Cloud Platform (GCP) for inference. It finds that Infrastructure-as-a-Service (IaaS) provides better cost efficiency than Container-as-a-Service (CaaS), while Function-as-a-Service (FaaS) can compensate for the relatively long cold-start latency of IaaS at a higher price. Small instances with advanced CPUs and burstable instances are also recommended. For GPU instances, the cost can also be greatly reduced by batch processing. Given the different levels of capability, scalability, and pricing, MArk greedily selects the most cost-effective type of instance and leverages spot instances for cost saving. AutoDeep [
104] considers not only the resource type in the cloud but also the device placement for DL inference. It leverages Bayesian Optimization to find a near-optimal cloud configuration and Deep Reinforcement Learning to find a near-optimal device placement. Constrained by the SLO requirements, AutoDeep performs joint optimization to minimize the total cost of inference execution. Kairos [98] aims to maximize throughput under a cost budget and SLO requirement by efficiently distributing queries among cloud instances and finding a high-throughput heterogeneous configuration. iGniter [178] builds a lightweight analytical performance model to explicitly capture the performance interference among workloads and proposes a cost-efficient GPU resource provisioning strategy to guarantee SLOs. Besides, spot instances can also be leveraged to minimize cost. Cocktail [
52] develops a distributed weighted auto-scaling policy and leverages spot instances to minimize cost. Similarly, SpotServe [117] leverages the autoregressive nature of LLMs and introduces stateful inference recovery, which allows inference engines on cheap preemptible instances to commit their progress at the token level, rather than at the request level as in prior work.
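The sketch below illustrates greedy, SLO-constrained instance selection in the spirit of the cost-aware provisioning discussed above. The instance catalog, prices, and throughputs are hypothetical, and spot-instance and FaaS fallback logic are omitted.

```python
import math

INSTANCES = [
    # (name, price in $/hour, sustained throughput in req/s, p99 latency in ms)
    ("cpu.small",  0.10,  40, 180.0),
    ("cpu.burst",  0.12,  55, 150.0),
    ("gpu.large",  0.90, 600,  35.0),
]

def provision(request_rate, slo_ms):
    """Pick the cheapest fleet (instance type + count) that sustains the load within the SLO."""
    best = None
    for name, price, tput, p99 in INSTANCES:
        if p99 > slo_ms:
            continue                                   # this instance type cannot meet the SLO
        count = math.ceil(request_rate / tput)         # replicas needed to absorb the load
        cost = count * price
        if best is None or cost < best[2]:
            best = (name, count, cost)
    return best

print(provision(request_rate=500, slo_ms=100))         # ('gpu.large', 1, 0.9)
```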
Besides monetary cost, improving the energy efficiency of a GPU datacenter is also critical. While this objective has been extensively explored for training workloads (Section
3.1.1), it is relatively less studied for the inference workloads. Some works [
60,
130] provide some energy characterizations of production DL inference clusters. Kube-Knots [
158] presents simple energy-efficiency comparisons of inference workloads between GPUs and CPUs. It remains necessary to comprehensively explore energy optimization for different DL inference models on different types of compute resources and to design more sophisticated energy-saving mechanisms that take latency and resource utilization into account. Clover [
97] achieves lower carbon emissions by using mixed-quality model variants.
(4) Tradeoffs between accuracy, latency, and cost. The objectives of accuracy, latency, and cost are not independent. Improving one goal may compromise another if the solution is not designed properly. Besides, users may have their own specific expectations for the different objectives. This motivates researchers to explore the tradeoffs between these objectives and devise more flexible and comprehensive scheduling systems.
The adoption of multiple models can improve inference accuracy but might also increase response latency and cost. Several works track the latency and prediction accuracy of different models and implement mechanisms for the scheduler to select the most appropriate ones. Clipper [24] introduces a model selection abstraction, which supports both single-model selection and model-ensemble selection. It executes the inference on all the models and combines their results, continuously observing the corresponding accuracy and latency feedback to make the selection with a best-effort search method. Model-Switching [
203] pursues the tradeoff between computational cost and service accuracy by proactively switching between model variants to improve response accuracy under the latency constraint. By maximizing the ratio of correct predictions returned within the deadline, it selects among model variants with different computational demands and accuracies. Cocktail [52] balances cost against accuracy and latency on the public cloud via optimization of the model ensemble. With a dynamic model selection policy that searches for models tightly within the accuracy and latency bounds, Cocktail reduces the number of candidates in the ensemble and accomplishes fast and efficient model selection. Tabi [170] is optimized for discriminative language models via a multi-level inference engine, which uses a calibrated confidence score to decide whether to return the results of small models directly or to re-route queries to LLMs. MOSEL [63] is designed for multi-modal models and carefully picks input modalities for each request based on user-defined performance and accuracy requirements.
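A minimal sketch of latency-constrained variant switching follows: for each request, the most accurate profiled variant that fits the remaining latency budget is chosen, degrading to the cheapest variant when the budget is too tight. The variant table is hypothetical, and the policy is a simplification of the approaches above rather than any single system's mechanism.

```python
VARIANTS = [
    # (name, profiled latency in ms, validation accuracy)
    ("resnet18",   8.0, 0.70),
    ("resnet50",  15.0, 0.76),
    ("resnet152", 40.0, 0.78),
]

def pick_variant(remaining_budget_ms):
    feasible = [v for v in VARIANTS if v[1] <= remaining_budget_ms]
    if not feasible:
        return VARIANTS[0][0]                       # degrade to the cheapest variant
    return max(feasible, key=lambda v: v[2])[0]     # most accurate feasible variant

print(pick_variant(20.0))    # 'resnet50'
print(pick_variant(5.0))     # 'resnet18' (budget too tight; degrade)
```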
Some schedulers allow users to specify their demands regarding accuracy, latency, and cost and make scheduling decisions directly according to those demands. Tolerance Tiers [54] exposes the effort the system can offer toward each objective and lets users programmatically select their demands. Observing that improving the accuracy of some extreme requests can greatly increase latency, Tolerance Tiers relaxes the accuracy demand to improve latency and service cost. Each tier defines an error tolerance, which indicates the tolerable accuracy loss, together with an optimization objective. Tolerance Tiers then optimizes the objective under the constraint of the maximum error tolerance. INFaaS [141, 179] also asks users for their throughput and accuracy demands. It generates variants of existing models with different specific parameters (e.g., batch size, hardware configurations, hardware-specific parameters). After one-time profiling of each variant, INFaaS selects the model variant to serve users' requests based on its resource consumption and the profiling information. Since each model variant may have a different inference latency and monetary cost, INFaaS applies two levels of autoscaling to guarantee latency requirements and improve cost efficiency under varying loads. In addition to autoscaling at the hardware (VM) level, INFaaS monitors load by maintaining a state machine for each model variant and supports scaling between model variants with different types and replica counts. INFaaS makes the selection via a heuristic-based approach, which selects the variant with the minimum cost while meeting the SLO constraint, or upgrades existing variants to higher-throughput ones to fulfill bursty queries.
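The heuristic variant selection described above can be sketched as follows: pick the cheapest variant that satisfies the accuracy, latency, and throughput demands, and fall back to the highest-throughput qualifying variant under bursty load. The variant table, cost units, and fallback rule are illustrative assumptions rather than INFaaS's actual policy.

```python
MODEL_VARIANTS = [
    # (name, cost per 1k queries, accuracy, p99 latency in ms, throughput in req/s)
    ("mobilenet-cpu",   0.02, 0.71, 60.0,  80),
    ("resnet50-cpu",    0.05, 0.76, 90.0,  40),
    ("resnet50-gpu-b8", 0.20, 0.76, 25.0, 900),
]

def select_variant(min_accuracy, slo_ms, request_rate):
    feasible = [v for v in MODEL_VARIANTS
                if v[2] >= min_accuracy and v[3] <= slo_ms and v[4] >= request_rate]
    if feasible:
        return min(feasible, key=lambda v: v[1])[0]      # cheapest feasible variant
    # Burst case: no single variant keeps up; upgrade to the highest-throughput variant
    # that still meets accuracy and latency, and rely on replication for the rest.
    candidates = [v for v in MODEL_VARIANTS if v[2] >= min_accuracy and v[3] <= slo_ms]
    return max(candidates, key=lambda v: v[4])[0] if candidates else None

print(select_variant(0.75, slo_ms=100, request_rate=30))    # 'resnet50-cpu'
print(select_variant(0.75, slo_ms=100, request_rate=500))   # 'resnet50-gpu-b8'
```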
4.1.2 System Throughput.
Another important objective for scheduling inference workloads is to improve system throughput. The techniques to achieve this goal are summarized as follows:
(1) Batching execution. One common approach is to batch multiple inference queries and execute them concurrently. Handling individual inference queries usually leads to GPU underutilization; hence, batching inference can effectively improve utilization and reduce system overhead. As with job queuing in parallel job scheduling, batching multiple queries can delay the execution of requests that arrive earlier and possibly jeopardize the SLO requirement. Setting a proper batch size is critical to balance such delays against system throughput. Most schedulers dynamically adjust this hyperparameter based on the actual SLO requirement and queuing situation.
First, some schedulers adopt heuristic methods to tune the batch size. Clipper [24] and Rafiki [169] apply the practical Additive-Increase-Multiplicative-Decrease (AIMD) algorithm to adjust the inference batch size. Specifically, the batch size is additively increased by a fixed amount until the latency of processing a batch exceeds the latency requirement, and is then multiplicatively decreased by a fixed percentage. Clipper finds that AIMD is simple yet effective and adapts to changes in a model's throughput. It also aggressively delays the execution of queries under moderate load to the subsequent batch, which can bring a significant throughput increase for some models.
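A minimal sketch of such an AIMD batch-size controller is shown below; the additive step, multiplicative backoff factor, and batch cap are illustrative values rather than the settings used by Clipper or Rafiki.

```python
class AIMDBatchSizer:
    def __init__(self, slo_ms, step=2, backoff=0.8, max_batch=128):
        self.slo_ms = slo_ms
        self.step = step
        self.backoff = backoff
        self.max_batch = max_batch
        self.batch_size = 1

    def update(self, observed_batch_latency_ms):
        if observed_batch_latency_ms > self.slo_ms:
            # Multiplicative decrease after an SLO violation.
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        else:
            # Additive increase while batches still fit within the SLO.
            self.batch_size = min(self.max_batch, self.batch_size + self.step)
        return self.batch_size

# Example: grow the batch size until a batch exceeds the 50 ms SLO, then back off.
sizer = AIMDBatchSizer(slo_ms=50.0)
for latency in [10.0, 20.0, 35.0, 55.0]:
    print(sizer.update(latency))    # 3, 5, 7, 5
```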
Second, some schedulers propose optimization-based methods to balance inference delay and throughput. MArk [199, 200] considers the maximum time a query can be delayed, profiles the processing rate without batching, and searches for the optimal batch size under the SLO constraint. Nanily [157] derives an upper bound on the batch size from the maximum remaining time of the requests, calculated as the remaining time to the deadline minus the least queuing time for the available resources. It then chooses the batch size whose inference execution time is equal or close to this maximum remaining time. DyBatch [207] considers the fairness of the delay experienced by each independent workload when batching. It implements fine-grained batching schemes along with fairness-driven scheduling, which can compensate for the slowdown deviation of small inference workloads. DyBatch organizes the workload batches in a time-sharing manner and selects the batch with the lowest resource utilization to run next, thus maintaining fairness and minimizing the slowdown of each workload.
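The following sketch derives a batch size from the maximum remaining time, assuming a profiled linear latency model latency(b) = alpha + beta * b; this linear model and the parameter names are illustrative simplifications rather than Nanily's actual formulation.

```python
def max_batch_for_deadline(deadline_ms, now_ms, queueing_ms, alpha_ms, beta_ms,
                           max_batch=128):
    """Largest batch size whose execution still fits the maximum remaining time."""
    remaining_ms = (deadline_ms - now_ms) - queueing_ms   # time left for the actual execution
    if remaining_ms <= alpha_ms:
        return 1                                          # no slack for batching: run immediately
    # Largest b satisfying alpha + beta * b <= remaining time.
    return max(1, min(max_batch, int((remaining_ms - alpha_ms) / beta_ms)))

# Example with a hypothetical profile: 5 ms fixed overhead plus 2 ms per sample.
print(max_batch_for_deadline(deadline_ms=200, now_ms=100, queueing_ms=15,
                             alpha_ms=5.0, beta_ms=2.0))  # 40
```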
(2) Caching and reusing. Another widely used strategy is caching and reusing the prediction results across different requests. The scheduler selects the request that benefits most from caching and allocates proper resources. This can be done at two levels.
The first direction is to perform optimization at the query level. To provide fast responses to different queries, the inference system can cache the inference executions and prediction results for bursty queries. Clipper [24] maintains a prediction cache keyed by the target model and the query input. It can then produce results for some queries without evaluating the model, increasing the inference throughput. Clipper also applies an LRU cache eviction policy to optimize caching efficiency. However, this approach may be less effective when queries show little similarity in practice, which leads to high cache miss rates and frequent evictions.
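A minimal sketch of such a prediction cache with LRU eviction is given below; the `(model, input_key)` keying and the capacity value are illustrative assumptions rather than Clipper's exact implementation.

```python
from collections import OrderedDict

class PredictionCache:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.cache = OrderedDict()           # (model, input_key) -> cached prediction

    def get(self, model, input_key):
        key = (model, input_key)
        if key not in self.cache:
            return None                      # cache miss: the caller runs the model
        self.cache.move_to_end(key)          # mark as most recently used
        return self.cache[key]

    def put(self, model, input_key, prediction):
        key = (model, input_key)
        self.cache[key] = prediction
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used entry
```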
The second direction is to perform optimization at the device level. The inference scheduling system keeps models resident in GPU memory, thus increasing system throughput. Gillman et al. [44] propose caching the DL models themselves instead of the inference results. The system schedules which models are loaded into the limited GPU memory to maximize the probability of serving an incoming request without swapping models in and out of memory, thereby accelerating inference by eliminating cold-start latency on cache hits. The caching and eviction policy considers many runtime aspects of DL inference workloads, including model size, access frequency, model accuracy, and speed. This work also discusses future directions toward more dynamic caching mechanisms and policies, such as framework-level GPU memory-friendly optimization, proactive loading and eviction, and cluster-level GPU memory allocation. To address the limitation of GPU memory, TrIMS [27] organizes the memory sharing of different models in a more systematic design. TrIMS reconciles the lifecycle of model memory consumption and carefully handles cache misses and evictions. It also considers multi-node, isolation, and fairness issues during sharing. Extensive evaluations on different models show its general ability to improve inference throughput by mitigating model loading overhead.
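As a rough illustration of model-level caching, the sketch below scores cached models by the factors mentioned above (size, access frequency, accuracy, and load speed) and evicts the lowest-value models until the incoming model fits; the scoring formula is an assumption for illustration, not the policy of Gillman et al. or TrIMS.

```python
def eviction_order(cached_models, gpu_mem_free_mb, incoming_model_mb):
    """Return names of cached models to evict (lowest value first) so the incoming model fits."""
    def keep_score(m):
        # Higher score = more worth keeping resident: frequently used, slow-to-load,
        # accurate models justify their memory footprint.
        return (m["hits_per_min"] * m["load_time_s"] * m["accuracy"]) / m["size_mb"]

    victims, freed = [], gpu_mem_free_mb
    for m in sorted(cached_models, key=keep_score):       # consider least valuable models first
        if freed >= incoming_model_mb:
            break
        victims.append(m["name"])
        freed += m["size_mb"]
    return victims

# Example: evict just enough low-value models to fit a 3 GB incoming model.
cached = [
    {"name": "bert-large", "size_mb": 1300, "hits_per_min": 2,  "accuracy": 0.88, "load_time_s": 4.0},
    {"name": "resnet50",   "size_mb": 100,  "hits_per_min": 90, "accuracy": 0.76, "load_time_s": 0.5},
]
print(eviction_order(cached, gpu_mem_free_mb=2000, incoming_model_mb=3000))  # ['bert-large']
```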
(3) System configuration tuning. Besides the optimization techniques detailed above, some schedulers leverage end-to-end configuration tuning to improve system throughput. Morphling [163] formulates the optimal configuration search as a few-shot learning problem. It adopts model-agnostic meta-learning (MAML) [37] to train an offline meta-model of inference serving performance under varied hardware and runtime configurations, and then performs online few-shot learning to predict the service performance. Based on the prediction, Morphling auto-tunes the resource provisioning configurations and makes better scheduling decisions. RRL [135] concentrates on optimizing parallelism configurations at different levels, including request-level parallelism and intra-request (inter-op and intra-op) parallelism, which strongly impact the latency of the entire system. RRL utilizes a region-based RL method to tune the parallelism configurations and reduce inference processing latency, exploiting the similarity in system performance between different configurations within a similar parallelism setting. Shepherd [
202] exploits the insight that aggregating request streams into moderately sized groups greatly improves predictability, permitting high resource utilization as well as scalability. Symphony [
15] utilizes a non-work-conserving scheduling algorithm capable of achieving high batch efficiency while also enabling robust autoscaling.