In the single-machine non-clairvoyant scheduling problem, the goal is to minimize the total completion time of jobs whose processing times are unknown a priori. We revisit this well-studied problem and consider the question of how to effectively use (possibly erroneous) predictions of the processing times. We study this question from ground zero by first asking what constitutes a good prediction; we then propose a new measure to gauge prediction quality and design scheduling algorithms with strong guarantees under this measure. Our approach to derive a prediction error measure based on natural desiderata could find applications for other online problems.
1 Introduction
Non-clairvoyance, where the scheduler is not aware of the exact processing time of a job a priori, is a highly desired property in the design of scheduling algorithms. Due to its myriad practical applications, non-clairvoyant scheduling has been extensively studied in various settings in the scheduling literature [15, 18, 31]. With no access to the processing times (i.e., job sizes), non-clairvoyant algorithms inherently suffer from worse performance guarantees than the corresponding clairvoyant algorithms. For example, in the most basic version of scheduling, we have a set of jobs that need to be scheduled on a single machine with the goal of minimizing the total completion time of all jobs. In the non-clairvoyant setting, the job sizes are unknown to the algorithm and only become known after a job has completed; here, the Round-Robin algorithm that divides the machine equally among all incomplete jobs is 2-competitive [29], and this is known to be the best possible. In contrast, in the clairvoyant setting where job sizes are known a priori, the Shortest Job First (SJF) algorithm that schedules jobs in non-decreasing order of their sizes is known to be optimal.
Practitioners often face scheduling problems that lie somewhere in between clairvoyant and non-clairvoyant settings. While it might be impossible to know the exact job sizes, rather than assuming non-clairvoyance, it is possible to estimate job sizes based on their features using a predictor [2, 26, 30]. However, such an estimation can be error-prone. Can one use the possibly erroneous predicted job sizes to improve the performance of scheduling algorithms?
Augmenting traditional algorithms with machine-learned predictions is a fascinating and newly emerging line of work. In particular, this paradigm is applicable to online algorithms, which typically focus on obtaining worst-case guarantees against uncertain future inputs and thus settle for pessimistic bounds. Recent works have shown that, using predictions (that may be incorrect), one can provably improve the guarantees of traditional online algorithms for caching [17, 25, 32], ski-rental [3, 13, 20], scheduling [8, 20, 28], load balancing [21, 23], secretary problem [5], metrical task systems [4], set cover [9], flow and matching [22], knapsack [16, 34], and many others.
In this article, we continue the study of learning-augmented algorithms for single-machine non-clairvoyant scheduling. This problem, where an algorithm has access to predictions of each job size, was first investigated in Reference [20]. Without making any assumptions on the prediction quality, they design a non-clairvoyant algorithm that satisfies two important properties, namely, consistency and robustness. Consistency means that the guarantees of the algorithm improve with good predictions; in particular, the algorithm obtains a competitive ratio better than 2 if the predictions are good. Robustness ensures that the algorithm gracefully handles bad predictions, i.e., even if the predictions are adversarially bad, the competitive ratio stays bounded. For any \(\lambda \in (0, 1)\), they design an algorithm that guarantees robustness of \(\frac{2}{1-\lambda }\) and consistency of \(\frac{1}{\lambda }.\)1
1.1 The Need for a New Error Measure
Although Reference [20] demonstrates an appealing tradeoff between consistency and robustness for non-clairvoyant scheduling, a closer look reveals some brittleness in the result. Here, we discuss the issue at a high level and delve into it in more detail in the next section, when we formally define the problem and the old and new error notions.
The main issue stems from the total completion time objective. Since this objective measures the total waiting time of all jobs, a shorter job could delay more jobs. In fact, different jobs can have different effects on how much they delay other jobs. The objective is thus neither linear nor quadratic in the job sizes.2
In Reference [20], it is assumed that the algorithm has a prediction \(\hat{p}_j\) of each job size \(p_j\). The quality of the prediction is the sum of the prediction errors of individual jobs, i.e., \(\ell _1(p, \hat{p}) = \sum _{j} |\hat{p}_j - p_j|\). Intuitively, such a linear error measure is incompatible with the completion time objective and may fail to distinguish good predictions from poor ones; in fact, small perturbations in the predictions can result in large changes to the optimal solution. Consequently, the results in Reference [20] are forced to be pessimistic and have a high dependence on the error term. In particular, they show that scheduling the jobs in non-decreasing order of their predicted sizes (SPJF) leads to a cost of at most \({\rm\small OPT}+ (n-1) \cdot \ell _1(p, \hat{p})\) and that this is tight, where \({\rm\small OPT}\) is the cost of the optimum solution and n is the number of jobs.
We examine the \(\ell _1(\cdot , \cdot)\) error measure and show that it violates a natural and desirable Lipschitz-like property for the total completion time objective. This prompts the search for a different error measure based on two desiderata (see Section 2.2). Our new error measure better captures the sensitive nature of the objective and allows us to obtain an algorithm with total cost at most \((1+\epsilon) {\rm\small OPT}+ O_\epsilon (1) \cdot \nu (p, \hat{p})\) where \(\nu (\cdot , \cdot)\) is the measure we propose.
In practice, job sizes are predicted using black-box machine learned models that utilize various features of the jobs (e.g., history) and may be expensive to train. While it is impossible to precisely define the goodness of a prediction, intuitively, an effective error measure should neither tag bad predictions as good nor ignore predictions that could improve the objective.
1.2 Our Contributions
Under the new notion of error (denoted \(\nu (\cdot , \cdot)\)), we give the following results, stated informally below. For a job j, let \(p_j\) be its actual size and let \(\hat{p}_j\) be its predicted size. We assume all jobs are available for scheduling from time 0. Let \({\rm\small OPT}\) be the optimum solution.
(1) We obtain a non-clairvoyant algorithm that is \(O(1)\)-robust (with no dependency on \(\epsilon\)) and \((1+\epsilon)\)-consistent for any sufficiently small \(\epsilon \gt 0\) w.h.p., if no subset of \(O(\frac{1}{\epsilon ^3}\log n)\) jobs dominates the objective. (Theorem 3 and Corollary 1)
(2) We obtain a non-clairvoyant algorithm that is \(O (\frac{1}{\epsilon })\)-robust and \((1+\epsilon)\)-consistent in expectation for any sufficiently small \(\epsilon \gt 0\). More precisely, the cost of the algorithm is at most \((1+\epsilon) {\rm\small OPT}+ O (\frac{1}{\epsilon ^3}\log \frac{1}{\epsilon }) \nu (p, \hat{p})\). (Theorem 4)
In contrast, Reference [20] obtains an algorithm that is \(O(\frac{1}{\epsilon })\)-robust and whose cost is at most \((1+\epsilon) {\rm\small OPT}+ (1 + \epsilon) (n-1) \cdot \ell _1(p, \hat{p})\). Since our error measure satisfies \(\ell _1(p, \hat{p}) \le \nu (p, \hat{p}) \le n \cdot \ell _1(p, \hat{p})\), our algorithm never has an asymptotically worse dependence on the prediction quality for any fixed \(\epsilon \gt 0\) and is often sharper.
(3) We show that for any sufficiently small \(\epsilon , \gamma \gt 0\), no algorithm can have a smaller objective than \((1+ \epsilon) {\rm\small OPT}+ O(1 / \epsilon ^{1- \gamma }) \nu (p, \hat{p})\). (Theorem 6)
We now discuss the high-level ideas. The main challenge is how to determine if a prediction is reliable or not before completing all jobs. If the predictions are somewhat reliable, then we can more or less follow them; otherwise, we will essentially have to rely on non-clairvoyant algorithms such as Round-Robin. Therefore, we repeatedly take a small sample of jobs over the course of the algorithm and partially process them. Informally, we estimate the median remaining size of jobs and estimate the prediction error considering job sizes up to the estimated median. Unfortunately, this estimation is not free, since we have to partially process the sampled jobs and it can delay all the existing jobs. Therefore, we are forced to stop sampling once there are very few jobs left. Depending on how long we sample, we obtain the first and second results.
Due to the dynamic nature of our algorithm, the analysis turns out to be considerably non-trivial. In a nutshell, we never see the true error until we finish a job. Nevertheless, we still have to decide whether to follow the predictions. The mismatch between partial errors we perceive and the actual errors makes it challenging to charge our algorithm’s cost to the optimum and the error; special care is needed throughout the analysis to avoid overcharging. We note that unlike our algorithm, Reference [20] uses a static algorithm that linearly combines following the predictions and Round-Robin.
To summarize, our work demonstrates that a more refined error measure makes it possible to find quality solutions for a broader class of predictions, and that the search for such measures can lead to new algorithmic ideas.
1.3 Other Related Work
Designing learning-augmented algorithms falls into the new beyond-worst-case algorithm design paradigm [33]. Starting with the work of Kraska et al. [19] on using ML predictions to speed up indexing, there have been many efforts to leverage ML predictions to better handle common instances that are found in practice. In addition to the aforementioned works, there also exist works on frequency counting [1, 11, 14], membership testing [27, 35], and online learning [10]. A recent work [12] shows how to speed up bipartite matching algorithms using ML predictions.
For single-machine scheduling in the clairvoyant setting, Shortest Remaining Processing Time (SRPT) is known to be optimal for minimizing the total completion time; it is in fact also optimal for minimizing the total flow/response time.3 If all jobs arrive at time 0, then SJF coincides with SRPT. In the non-clairvoyant setting, when jobs have different arrival times, no algorithm is \(O(1)\)-competitive for minimizing the total flow time, but Round-Robin is known to be \(O(1)\)-competitive when compared to the optimum schedule running on a machine with speed less than \(1/2 - \epsilon\), for any \(\epsilon \gt 0\). For a survey on online scheduling algorithms, see Reference [31]. Very recently, References [6, 7] obtained algorithms with \(O(1)\)-competitive ratio for total flow time if every job’s size is within a constant factor of its prediction.
1.4 Roadmap
In Section 2, we formally define our non-clairvoyant scheduling problem. In the same section, we continue to discuss what desiderata constitute a good measure of prediction error and propose a new measure meeting the desiderata. We also discuss other—both existing and candidate—measures and show that they fail to satisfy the desiderata. We present our algorithm in Section 3 and its analysis in Section 4. The lower bounds are presented in Section 5.
2 Formulation and Basic Properties
2.1 Non-clairvoyant Scheduling
Let J denote a set of n jobs. In the classical single-machine non-clairvoyant scheduling setting, each job \(j \in J\) has an unknown size or processing time \(p_j\). The processing time is known only after the job is complete. A job j completes when it has received \(p_j\) time units of processing, and we denote j’s completion time as \(C_j\). A job may be preempted at any time and resumed at a later time without any cost. Our goal is to find a schedule that completes all jobs and minimizes the total completion time of all jobs, i.e., \(\sum _{j \in J} C_j\). In the clairvoyant case, an algorithm knows the \(p_j\)’s in advance.
In the clairvoyant case, it is well-known that the Shortest Remaining Processing Time First (SRPT) algorithm minimizes the total completion time and becomes identical to the Shortest Job First (SJF) algorithm when all jobs arrive at time 0, which is the setting we consider in this article. In the non-clairvoyant case, the Round-Robin algorithm achieves a competitive ratio of 2, which is known to be optimal [29].
For any subset \(Z \subseteq J\) of jobs, we let \({\rm\small OPT}(\lbrace x_j\rbrace _{j \in Z})\) denote the minimum objective to complete all jobs in Z when each job \(j \in Z\) has size \(x_j\) and is known to the algorithm, i.e., it is the completion time of SJF when \(x_j\) is the size of job j. Here, we can think of \({\rm\small OPT}\) as a function that takes as input a multiset of non-negative job sizes and returns the minimum objective to complete all jobs with the job sizes in the set. Note that this is well-defined, as SJF is oblivious to job identities. If \(x_j\) is j’s true size, i.e., \(p_j\), for notational convenience, then we use \({\rm\small OPT}(Z) := {\rm\small OPT}(\lbrace p_j\rbrace _{j \in Z})\); in particular, \({\rm\small OPT}:= {\rm\small OPT}(J)\).
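Since \({\rm\small OPT}(\lbrace x_j\rbrace)\) is simply the total completion time of SJF on the multiset of sizes, it is straightforward to compute. The following minimal Python sketch (the function name is ours, chosen for illustration) makes this concrete and is reused in later examples.

```python
def opt_total_completion_time(sizes):
    """Total completion time of SJF on a multiset of job sizes.

    SJF runs jobs in non-decreasing order of size, so each job's completion
    time is the running sum of all sizes scheduled so far.
    """
    total, elapsed = 0, 0
    for x in sorted(sizes):
        elapsed += x      # completion time of the job just scheduled
        total += elapsed  # its contribution to the objective
    return total

# OPT is oblivious to job identities; only the multiset of sizes matters.
assert opt_total_completion_time([3, 1, 2]) == 1 + 3 + 6  # completions 1, 3, 6
```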
We consider the learning-augmented scheduling problem where the algorithm has access to predictions for each job size; let \(\hat{p}_j\) denote the predicted size of job j. We emphasize that we make no assumptions regarding the validity of the predictions and they may even be adversarial. As in the usual non-clairvoyant scheduling setup, the true processing size \(p_j\) of job j is revealed only after the job has received \(p_j\) amount of processing time. In the learning-augmented setting, the competitive ratio of an algorithm \(\mathcal {A}\) is a function of the prediction error. Our goal is to design an algorithm that satisfies the dual notions of robustness and consistency.
2.1.1 Properties of \({\rm\small OPT}\).
The following fact is well-known and follows from the definition of \({\rm\small OPT}\), i.e., SJF:
The following properties are simple consequences of SJF:
2.2 Prediction Error
A key question in the design of algorithms with predictions is how to define the prediction error, i.e., how to quantify the quality of predictions. While this definition can be problem-dependent, it must be algorithm-independent. For the non-clairvoyant scheduling problem, before we dive into a definition, we identify two desirable properties that we want of any such definition. Let \({\rm\small ERR}(\lbrace p_j\rbrace _{j \in J}, \lbrace \hat{p}_j\rbrace _{j \in J})\) denote the prediction error for an instance with true sizes \(\lbrace p_j\rbrace\) and predicted job sizes \(\lbrace \hat{p}_j\rbrace\); note that an algorithm knows the \(\hat{p}_j\)’s but not the \(p_j\)’s.
The first property is monotonicity, i.e., if more job sizes are predicted correctly, then the error must decrease. Monotonicity is natural, as better predictions are expected to decrease the error.
The second property is a Lipschitz-like condition, which states that a prediction \(\lbrace \hat{p}_j\rbrace _{j \in J}\) is deemed good (as measured by \({\rm\small ERR}(\cdot , \cdot)\)) only if the optimal solution of the predicted instance is close to the true optimal solution. Indeed, if the optimal solution of a predicted instance differs significantly from the true optimal solution, i.e., if \(|{\rm\small OPT}(\lbrace \hat{p}_j\rbrace _{j \in J}) - {\rm\small OPT}(\lbrace p_j\rbrace _{j \in J})|\) is large, then the property requires that a good error measure assign a large error to such predictions. Intuitively, this property allows us to effectively distinguish between good and bad predictions.
A natural way to define the prediction error is as the \(\ell _1\)-norm between the predicted and the true job sizes, i.e., \(\ell _1(p, \hat{p}) = {\rm\small ERR}(\lbrace p_j\rbrace _{j \in J}, \lbrace \hat{p}_j\rbrace _{j \in J}) = \sum _{j \in J} |p_j - \hat{p}_j|\), as was done in Reference [20]. While this error definition satisfies monotonicity, it is not Lipschitz. Indeed, consider the following simple problem instance. Let \(\epsilon \gt 0\) be a constant. The true job sizes are given by \(p_1 = 1+\epsilon\) and \(p_j = 1, \forall j \in J \setminus \lbrace 1\rbrace\). The predicted job sizes are given by \(\hat{p}_1 = 1+3\epsilon\) and \(\hat{p}_j = 1, \forall j \in J \setminus \lbrace 1\rbrace\). Let \(\hat{q}\) be another set of predicted job sizes given by \(\hat{q}_1 = 1-\epsilon\) and \(\hat{q}_j = 1, \forall j \in J \setminus \lbrace 1\rbrace\). By construction, \(\ell _1(p, \hat{p}) = 2\epsilon = \ell _1(p, \hat{q})\). However, by the nature of the total completion time objective, there is a significant difference in the quality of these two predictions. Formally, \({\rm\small OPT}(\lbrace \hat{p}_j\rbrace _{j \in J}) - {\rm\small OPT}(\lbrace p_j\rbrace _{j \in J}) = 2 \epsilon\), whereas \({\rm\small OPT}(\lbrace p_j\rbrace _{j \in J}) - {\rm\small OPT}(\lbrace \hat{q}_j\rbrace _{j \in J}) = (n-1)\cdot \epsilon \gg \ell _1(p, \hat{q})\). Intuitively, the lack of Lipschitzness means that the \(\ell _1(\cdot , \cdot)\) error measure cannot distinguish between the \(\lbrace \hat{p}\rbrace\) and \(\lbrace \hat{q}\rbrace\) predictions, although \(\lbrace \hat{p}\rbrace\) is arguably a much better prediction for this instance.
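The gap in this example is easy to verify numerically. The short check below reuses the opt_total_completion_time helper sketched in Section 2.1.1; the concrete values of n and \(\epsilon\) are illustrative.

```python
n, eps = 1000, 0.01

p     = [1 + eps] + [1.0] * (n - 1)      # true sizes
p_hat = [1 + 3 * eps] + [1.0] * (n - 1)  # slightly overestimates job 1
q_hat = [1 - eps] + [1.0] * (n - 1)      # slightly underestimates job 1

l1 = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
print(l1(p, p_hat), l1(p, q_hat))        # both equal 2*eps

gap_p = opt_total_completion_time(p_hat) - opt_total_completion_time(p)
gap_q = opt_total_completion_time(p) - opt_total_completion_time(q_hat)
print(gap_p, gap_q)  # gap_p stays ~2*eps, while gap_q grows linearly with n
```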
However, to satisfy the Lipschitz property, one can consider simply defining the prediction error as \({\rm\small ERR}(\lbrace p_j\rbrace _{j \in J}, \lbrace \hat{p}_j\rbrace _{j \in J}) = |{\rm\small OPT}(\lbrace \hat{p}_j\rbrace _{j \in J}) - {\rm\small OPT}(\lbrace p_j\rbrace _{j \in J})|\). Unfortunately, this may not be monotone. Indeed, consider a simple instance where the predictions are a reassignment of the true job sizes to the jobs, i.e., the job sizes are predicted correctly but the job identities are permuted. In this case, we have \(|{\rm\small OPT}(\lbrace \hat{p}_j\rbrace _{j \in J}) - {\rm\small OPT}(\lbrace p_j\rbrace _{j \in J})| = 0\). However, an improvement to any of the predictions will only result in a different optimum, and hence a non-zero error. In other words, this definition does not satisfy monotonicity.
These examples motivate a different notion of prediction error.
Our error measure follows by adding the RHS of these two inequalities.
It is easy to see that this definition, besides being symmetric and non-negative, also satisfies both monotonicity and the Lipschitz property. While this may not be the unique such definition, it is simple. Further, we are not aware of any other error measures, including those used in the previous work [9, 20], that satisfy the two desired properties. For more details, see Section 2.2.2.
When the scheduling instance is clear from context, we drop the arguments and let \(\nu = \nu (J; \lbrace p_j\rbrace , \lbrace \hat{p}_j\rbrace)\). Note that in case all the predicted job sizes are overestimates (or underestimates) of the true sizes, then we have \(\nu (J; \lbrace p_j\rbrace , \lbrace \hat{p}_j\rbrace) = |{\rm\small OPT}(\lbrace \hat{p}_j\rbrace _{j \in J}) - {\rm\small OPT}(\lbrace p_j\rbrace _{j \in J})|\).
2.2.1 Surrogate Error.
For the sake of analysis, we define a surrogate (prediction) error where we measure the error for overestimated and underestimated jobs separately. The surrogate error is a lower bound on the prediction error in Definition 3. While it does not satisfy Lipschitzness, it will nevertheless turn out to be a useful tool for the analysis.
Again, when the scheduling instance is clear from context, we drop the arguments and let \(\eta = \eta (J; \lbrace p_j\rbrace , \lbrace \hat{p}_j\rbrace)\). We first show that the surrogate error can be used to lower bound the prediction error.
A key advantage of the surrogate error \(\eta\) is that it is easier to decompose as opposed to \(\nu\). As our analysis carefully charges our algorithm’s cost in each round to the error and the optimum, decomposability will be very useful to avoid overcharging.
2.2.2 Comparisons with Other Error Measures.
We compare our new error measure with others, including those in References [9, 20]. First, we observe that our error measure is always lower bounded by the \(\ell _1(p, \hat{p})\) error utilized by Reference [20] but is at most a factor of n larger.
Thus, our error measure lends itself to asymptotically stronger algorithmic guarantees than the \(\ell _1(p, \hat{p})\) measure. In Reference [20], the cost of their algorithm is shown to be bounded by \((1+\epsilon){\rm\small OPT}+ (1 + \epsilon) \cdot (n-1) \ell _1(p, \hat{p})\). In the following sections, we obtain an algorithm whose cost is bounded by \((1+\epsilon) {\rm\small OPT}+ O_\epsilon (1) \cdot \nu\). By Proposition 6, our bound is asymptotically never worse than that in Reference [20] and can often be sharper.
Next, we discuss the error measure used by Bamas et al. [9] in their primal-dual framework. Their measure, which we call \(\eta _{BMS}\), can be defined as the cost of SPJF5 minus the optimum. It is easy to see that \(\eta _{BMS}\) is neither monotone nor Lipschitz. This is because SPJF yields an optimal schedule as long as jobs have the same order both in their true sizes and estimated sizes, i.e., \(p_j \le p_i\) if and only if \(\hat{p}_j \le \hat{p}_i\). Further, it is hard to compare our error measure to \(\eta _{BMS}\), as the latter does not directly factor in estimated job sizes but measures the cost for running an algorithm (that is based on the prediction) on the actual input. However, the following example shows that \(\eta _{BMS}\) can be excessively large even if a single job size is mispredicted: The true job sizes are given by \(p_j = 1 \; \forall j \in J \setminus \lbrace n\rbrace\) and \(p_n = n^2\). All job sizes are predicted correctly, except job n, where \(\hat{p}_n = 0\). Then, \(\nu = 1 + \cdots + n-1 + (n + n^2) - (1 + \cdots + n-1) = n + n^2\), while \(\eta _{BMS} \ge n^2 \cdot n - (1 + \cdots + n - 1 + n + n^2) = \Omega (n^3)\). Here, \(n^2 \cdot n\) comes from the fact that job n completes first under SPJF.
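To make the last example concrete, the sketch below computes the cost of SPJF (blindly following the predictions) against the optimum for this instance, again reusing the opt_total_completion_time helper from Section 2.1.1; the function name spjf_cost and the choice n = 100 are ours.

```python
n = 100
p     = [1.0] * (n - 1) + [float(n * n)]  # the last job is huge
p_hat = [1.0] * (n - 1) + [0.0]           # only the huge job is mispredicted

def spjf_cost(true_sizes, predicted_sizes):
    """Total completion time when jobs are run in order of predicted size."""
    order = sorted(range(len(true_sizes)), key=lambda j: predicted_sizes[j])
    total, elapsed = 0.0, 0.0
    for j in order:
        elapsed += true_sizes[j]
        total += elapsed
    return total

eta_bms = spjf_cost(p, p_hat) - opt_total_completion_time(p)
print(eta_bms)  # Omega(n^3): the huge job is scheduled first and delays everyone
```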
Finally, we note that the error measure cannot be oblivious to job identities. For example, consider the Earth Mover’s distance between the true job sizes and the estimated job sizes, i.e., find a min-cost matching between the two multisets \(\lbrace p_j\rbrace _{j \in J}\) and \(\lbrace \hat{p}_i\rbrace _{i \in J}\), where matching \(p_i\) to \(\hat{p}_j\) incurs cost \(|p_i - \hat{p}_j|\). It is easy to see that such measures assign zero error when the two multisets are identical, even though the predictions may be incorrect for individual jobs.
Subsequent to the publication of the conference version of this article, recent work [24] proposes a novel model that considers predictions of the relative order of job sizes instead of predictions of the actual job sizes.
3 Algorithm
In this section, we present our algorithm for scheduling with predictions. Our algorithm runs in rounds. To formalize, we need to set up some notation. We let \(J_k\) be the set of unfinished (alive) jobs at the beginning of round k, where \(k \ge 1\). Let \(n_k := |J_k|\). Let \(q_{k,j}\) be the amount of processing done on job j in round k. We define
•
\(p_{k,j} = p_j- \sum ^{k-1}_{w=1}q_{w,j}\): the true remaining size of j at the beginning of round k.
•
\(\hat{p}_{k,j} = \max \lbrace 0, \hat{p}_j - \sum ^{k-1}_{w=1}q_{w,j}\rbrace\): the predicted remaining size of j at the beginning of round k.
Note that if a job j has been processed by more than its predicted size \(\hat{p}_j\) before the kth round, then we have \(\hat{p}_{k,j} = 0\).
Our algorithm employs two subprocedures in each round to estimate the median \(m_k\) of the true remaining size of jobs in \(J_k\) and the magnitude of the error in the round. We first present the subprocedures and then present our main algorithm. We assume \(n \ge 2\) throughout, since otherwise all of our results follow immediately.
3.1 Median Estimation
To streamline the analysis, we will assume without loss of generality that all remaining sizes are distinct.6 Let \(\tilde{m}_k\) denote our estimate of the true median \(m_k\). Recall that Round-Robin processes all alive jobs equally at each time.
Algorithm 1 takes a sample S of the remaining jobs and returns as \(\tilde{m}_k\) the median of the jobs in S in terms of their remaining size. Note that this can be done by completing half of the jobs in S using Round-Robin. The sampling with replacement is done as follows: When we take job j as a sample, we pretend to create a new job in S with size equal to \(p_{k, j}\). Thus, S could contain multiple “copies” originating from the same job in \(J_k\); however, we will pretend that they are distinct jobs and they get exactly the same processor share in Round-Robin.
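The following Python sketch captures the sampling idea behind Algorithm 1. In the actual algorithm the median of the sample is discovered by running Round-Robin on the sampled copies until half of them complete; the sketch simply reads the remaining sizes, which suffices to illustrate the estimator. The function name is ours, and the sample size \(\lceil \frac{\log 2n}{\delta ^2} \rceil\) matches the accounting in Section 4.4; treat the details as illustrative.

```python
import math
import random

def estimate_median_remaining_size(remaining_sizes, n, delta=1/50):
    """Sketch of Algorithm 1: sample jobs with replacement and return the
    median remaining size within the sample as the estimate of the median.

    In the real algorithm this value is revealed by running Round-Robin on
    the sampled copies until half of them complete; here we just look the
    sizes up, which is enough to illustrate the estimator.
    """
    sample_size = math.ceil(math.log(2 * n) / delta ** 2)
    sample = sorted(random.choice(remaining_sizes) for _ in range(sample_size))
    return sample[len(sample) // 2]  # median of the sample
```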
When the condition in the following lemma holds, we will say that \(\tilde{m}_k\) is a \((1+\delta)\) order-approximation of \(m_k\), or a \((1+\delta)\)-approximation for brevity.
3.2 Error Estimation
Next, we would like to see if the prediction for the remaining jobs in round k is accurate enough to follow closely. However, measuring the error of the predictions even by running all jobs in a small sample to completion could take too much time. Thus, we estimate the error of the remaining jobs by capping all remaining sizes and predicted sizes at \((1+2\epsilon) \tilde{m}_k\). The error we seek to estimate is below.
Recall that by Proposition 1, the error \(\eta _k\) can be rewritten as \(\eta _k = \sum _{i \le j \in J_k} \min \lbrace d_{k,i}, d_{k,j}\rbrace\). Hence, to estimate \(\eta _k\), we sample pairs of jobs from \(J_k\) and for each sampled job j calculate \(d_{k,j}\). Since we only need to run job j for at most \((1+2\epsilon)\tilde{m}_k\) to compute \(d_{k,j}\), this step does not incur too much additional cost. Algorithm 2 describes it formally.
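A minimal sketch of this estimator follows. We assume here that \(d_{k,j}\) is the per-job prediction error after capping both the remaining size and the remaining predicted size at \((1+2\epsilon) \tilde{m}_k\), as described above; the sample size of \(O(\frac{1}{\epsilon ^2}\log n)\) pairs and the rescaling by the number of pairs are likewise illustrative choices.

```python
import math
import random

def estimate_error(rem_sizes, rem_preds, m_tilde, eps, n):
    """Sketch of Algorithm 2: estimate eta_k = sum_{i <= j} min(d_i, d_j)
    by sampling pairs of jobs and rescaling the sample average."""
    cap = (1 + 2 * eps) * m_tilde

    def d(j):
        # capped error of job j; running j for at most cap units reveals min(p_{k,j}, cap)
        return abs(min(rem_sizes[j], cap) - min(rem_preds[j], cap))

    n_k = len(rem_sizes)
    num_samples = math.ceil(math.log(n) / eps ** 2)  # illustrative sample size
    avg = sum(min(d(random.randrange(n_k)), d(random.randrange(n_k)))
              for _ in range(num_samples)) / num_samples
    return avg * (n_k * (n_k + 1) // 2)  # rescale to all pairs i <= j (including i == j)
```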
For any \(k \in [K]\), we say for brevity that \(\tilde{\eta }_k\) is a \((1+\epsilon)\)-approximation of \(\eta _k\) if it satisfies
Given the methods to estimate the median size of all jobs in \(J_k\) and the remaining jobs’ error in round k, we now describe our algorithm running in rounds \(k \ge 1\). We note that for a fixed \(\epsilon \lt 1/10\), the following algorithm yields \((1 + O(\epsilon))\)-consistency. To obtain the desired \((1+\epsilon)\)-consistency, we can later scale \(\epsilon\) by the necessary constant factor to obtain Theorem 3.
If there are enough jobs alive for accurate sampling, then we use our estimators to estimate the median and the error. If the estimated error is big, then we say that the current round is an RR round and run Round-Robin to process all jobs equally up to \(2 \tilde{m}_k\) units7; this is intuitive, as our estimator indicates that the prediction is unreliable. If not, then we closely follow the prediction. We only consider jobs that are predicted to be small and process them in increasing order of their (remaining) predicted size. To allow for a small prediction error, we allow a job to get processed \(3 \epsilon \tilde{m}_k\) more units than its remaining predicted size. In this case, we say that the current round is a non-RR round. Finally, when there are few jobs left, we run Round-Robin to complete all remaining jobs; this is the final round, indexed by \(K+1\).
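The round structure is summarized by the runnable toy simulation below. The sampling cutoff, the RR-round error threshold, and the in-round estimators are simplified stand-ins (the real Algorithm 3 obtains \(\tilde{m}_k\) and \(\tilde{\eta }_k\) by partially processing sampled jobs and fixes the constants so that the analysis goes through); processing the jobs of an RR round sequentially is justified by footnote 7. All names here are ours.

```python
import math

def simulate_rounds(p, p_hat, eps=0.05, sample_cutoff=10):
    """Toy simulation of the round structure of Algorithm 3.

    The thresholds and estimators below are illustrative stand-ins; the
    function returns the total completion time of the produced schedule.
    """
    rem = dict(enumerate(map(float, p)))          # true remaining sizes
    rem_hat = dict(enumerate(map(float, p_hat)))  # predicted remaining sizes
    t = total = 0.0

    def process(j, amount):
        nonlocal t, total
        work = min(amount, rem[j])
        t += work
        rem[j] -= work
        rem_hat[j] = max(0.0, rem_hat[j] - work)
        if rem[j] == 0.0:
            total += t                            # job j completes at time t
            del rem[j], rem_hat[j]

    while len(rem) > sample_cutoff:               # paper: Theta(eps^-3 log n) cutoff
        alive = list(rem)
        # stand-ins for Algorithms 1 and 2 (median and error estimation)
        m_tilde = sorted(rem.values())[len(alive) // 2]
        cap = (1 + 2 * eps) * m_tilde
        d = {j: abs(min(rem[j], cap) - min(rem_hat[j], cap)) for j in alive}
        eta = sum(min(d[i], d[j]) for a, i in enumerate(alive) for j in alive[a:])
        small = [j for j in alive if rem_hat[j] <= (1 + eps) * m_tilde]
        if eta >= eps * m_tilde * len(alive) ** 2 or not small:
            for j in alive:                       # RR round (order is arbitrary, fn. 7)
                process(j, 2 * m_tilde)
        else:                                     # non-RR round: follow the prediction
            for j in sorted(small, key=lambda j: rem_hat[j]):
                process(j, rem_hat[j] + 3 * eps * m_tilde)

    while rem:                                    # final round: plain Round-Robin
        k, j = len(rem), min(rem, key=rem.get)
        slice_ = rem[j]
        t += k * slice_                           # all k alive jobs share the machine
        total += t
        for i in rem:
            rem[i] -= slice_
        del rem[j]
    return total
```

With accurate predictions, non-RR rounds dominate and the schedule closely tracks the predicted order; with unreliable predictions, the estimated error triggers RR rounds and the schedule degrades gracefully toward Round-Robin.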
The following easy observations will be useful for our analysis later:
4 Analysis
To streamline the presentation of our analysis, we will first make the following simplifying assumptions; we will remove these assumptions later in Section 4.4.
Note that Assumption 1(iv) can only hurt the algorithm. For the analysis, we extend the definition of \(\eta _k\).
Note that \(\eta _k = \eta _k(J_k)\).
4.1 Robustness
In this section, we show that our algorithm always yields a constant approximation assuming that our median and error estimation subroutines succeed. This guarantee holds in all cases even if the predicted job sizes are arbitrarily bad or even adversarially chosen.
Key to the analysis is to show that a constant fraction of jobs complete in each round.
Intuitively, from Lemma 3, we know that a large number of jobs must complete in each round and, further, since we assume that \(\tilde{m}_k\) approximates the true median well, many of those jobs have remaining sizes at least \(\tilde{m}_k\). We next show that \(\Omega (n_k)\) jobs must have remaining size at least \(\tilde{m}_k\) and hence the optimal solution must incur a total cost of at least \(\Omega (\sum _{k} \tilde{m}_k n_k^2)\).
Next, we upper bound our algorithm’s cost. Let \(A_k\) be the total delay incurred by our algorithm in round k. When our algorithm processes a job j in round k, all the other alive jobs are forced to wait; \(A_k\) counts this total waiting time. Formally, we define \(A_k := \sum _{i \ne j \in J_k} D_k(i,j)\) where \(D_k(i,j)\) is the total amount of processing received by job i in round k while job j is still alive. We observe the following simple upper bound:
Observe that we use Round-Robin in the final round \(K+1\), which is known to be 2-competitive. Hence, to complete all the remaining jobs, by considering each job’s contribution to the objective, our algorithm’s cost can be upper bounded as follows:
By Lemmas 4 and 6, the algorithm’s cost is at most \(2 \cdot 266 ~{\rm\small OPT}(J \setminus J_{K+1}) + 2{\rm\small OPT}(J_{K+1}) + {\rm\small OPT}(J) \le 535~ {\rm\small OPT}(J)\),9 where the last inequality follows from Proposition 2(iv). This completes the proof of Theorem 1.
4.2 Consistency
In this section, we show that Algorithm 3 also utilizes good predictions to obtain improved guarantees. We analyze the delay incurred by our algorithm in RR rounds and non-RR rounds separately.
4.2.1 RR Round.
Intuitively, using the fact that the error is huge in each RR round, we can upper bound our algorithm’s total delay in all RR rounds by the error. Recall \(A_k \le \sum _{j \in J_k} q_{k, j} \cdot n_{k,j}\). As \(\delta\) is set to an absolute constant (\(\delta = 1/50\)), we will hide it in asymptotic notation. (Recall the surrogate error \(\eta\) from Definition 4.) This section is devoted to showing the following lemma:
From Lemma 5, the total delay \(A_k\) incurred in round \(k \in [K]\) in our algorithm is at most \(2 \tilde{m}_k n_k^2\). Thus, our goal is to carefully identify a part of the surrogate error of magnitude \(\Omega (\epsilon \tilde{m}_k n_k^2)\) to charge \(A_k\) to, in each RR round k.
In the following lemma, we consider three types of jobs and show that the error is big enough for at least one job type. The job types are: (i) those completing in round k, (ii) whose remaining predicted sizes become 0 in the round, and (iii) whose remaining predicted sizes are 0 and that do not complete in this round. We need to be careful when extracting some error for type (iii) jobs as they may reappear as type (iii) jobs in subsequent RR rounds. This is why we measure the error by pretending their remaining sizes are \(2 \tilde{m}_k\), exactly the amount by which the jobs each are processed in the round.
We next show that the above errors add up to \(O(\eta)\).
4.2.2 Non-RR Rounds.
This section is devoted to proving the following lemma that bounds our algorithm’s total delay in non-RR rounds (NRR). As we do not have sufficiently large errors in non-RR rounds, we will have to bound the delay by both \({\rm\small OPT}\) and \(\nu\). Note that \({\rm\small OPT}- (\sum _{i \in J} p_i)\) is the total delay cost of the optimal schedule.
We begin our analysis of consistency for non-RR rounds by proving the following lemma, which shows how much error we can use for each pair of jobs.
Knowing how much error we can use for each pair of jobs, we are now ready to give an overview of the analysis. Recall that we let \(D_k(i, j)\) denote the delay i causes to j in round k. Note that in a non-RR round, \(D_k(i, j) = q_{k,i}\) if j is still alive while i gets processed; otherwise, \(D_k(i, j) = 0\).
1. Total delay involving jobs with zero remaining predicted sizes. We show that the delay involving the following jobs across all non-RR rounds is at most \(O(\epsilon) \cdot {\rm\small OPT}\):
which overrides the definition of the same notation given in Lemma 8. Fix a job \(i \in \hat{Z}_k\). Such a job i gets processed by at most \(3 \epsilon \tilde{m}_k\), since it is processed up to \(\hat{p}_{k,i} + 3 \epsilon \tilde{m}_k\). Further, if a job j gets processed before i, then it must be that \(\hat{p}_{k,j} = 0\), so j can delay i by at most \(3\epsilon \tilde{m}_k\) in the round. Similarly, job i can delay any other job by at most \(3\epsilon \tilde{m}_k\) in the round. It is an easy exercise to see that the total delay involving a job i with \(\hat{p}_{k,i} = 0\) is at most \(3\epsilon \tilde{m}_k n_k\). As there are at most \(n_k\) jobs remaining in this round,
2. Total delay involving jobs that execute but do not complete. We show the total delay across all non-RR rounds is at most \(O(\epsilon) {\rm\small OPT}+ O(1/ \epsilon ^2)\nu\). To precisely articulate what we aim to prove, let
which, roughly speaking, are the jobs with relatively small non-zero remaining predicted sizes that execute but do not complete in round k. If \(i \in U_k\), then \(\hat{p}_{k, i} \gt 0\) and \(\hat{p}_{k+1, i} = 0\) and therefore the family \(\lbrace U_k\rbrace _{k \in [K]}\) is disjoint.
The following bounds the total delay incurred due to jobs in \(U_k\).
Note that in \(D_{k, i}\), the first term is how much other jobs delay i and the second is how much job i delays other jobs: job i delays job j in the round by exactly \(\hat{p}_{k,i} + 3 \epsilon \tilde{m}_k\) if j is still alive when the algorithm stops processing i in the round. The proof is a bit subtle and is deferred to Section 4.3 but the intuition is the following: Suppose we made a bad mistake by working on job \(i \in U_k\) in round k—we thought the job was small based on its prediction but it turned out to be big. This means that job i’s processing delays many jobs in \(J_k\), which we could have avoided had we known that i was in fact big. Thus, to charge the delay, we show that the considerable underprediction of job i creates a huge error as it makes a large difference in how much i delays other big jobs.
3. Total delay due to the other jobs. Finally, we consider delay not considered by the two cases. Let us see for which pairs of jobs we did not consider their pairwise delay. For job i to delay job j in a non-RR round k, i must be processed, meaning that \(\hat{p}_{k, i} \le (1+\epsilon) \tilde{m}_k\). Since Case 1 already considered \(\hat{p}_{k, i} = 0\), \(i \in \hat{Z}_k\), we assume \(\hat{p}_{k, i} \gt 0\). Further, if i does not complete in round k, then we already covered the delay in Case 2 as \(i \in U_k\). Thus, we only need to consider the case when \(i \in V_k\), where \(V_k\) is defined as follows:
Putting the pieces together. Note that the delay incurred between every pair of jobs i and j in every non-RR round k falls into at least one of the above three categories. Thus, from Equations (2), (3), and (4), the total pairwise delay in non-RR rounds is at most
We are now ready to give the final upper bound on the objective of our algorithm, which is obtained by combining the upper bound in Lemma 7 and Equation (5) and by factoring in the total job size, \(\sum _{j \in J} p_j\). Note that \(2{\rm\small OPT}(J_{K+1})\) comes from the last round \(K+1\), where the remaining jobs \(J_{K+1}\) are processed by Round-Robin, which is known to be 2-competitive.
4.4 Removing the Assumptions
Our goal here is to extend Theorem 2 by removing Assumption 1. First, we can remove assumption (iv) by pretending that jobs have not been processed during the estimation procedures. Thus, we might waste processing on a job after it has completed, purely for the purpose of this simulation.
Removing the other assumptions needs more care. We say that a bad event \(\mathscr{B}_k\) occurs in round k if \(\tilde{m}_k\) fails to be \((1+\delta)\)-approximate or \(\tilde{\eta }_k\) fails to be \((1+\frac{\delta ^2\epsilon }{32})\)-approximate. By Lemma 1 and Lemma 2, \(\mathscr{B}_k\) occurs with probability at most \(2/ n^2\). If \(\mathscr{B}_k\) does not occur, then we know that a constant fraction of jobs complete in round k, thanks to Lemma 3. Thus, if no bad events occur, then we have \(K = O(\log n)\). Hence, by a union bound, the probability that at least one bad event occurs is at most \(O((\log n) / n^2)\).
To remove Assumption 1(i) and (ii), we note that any non-idle algorithm, including ours, is n-approximate. Thus, in expectation, the above bad events can only increase the objective by \(\lceil (\log n / n^2) \cdot n \cdot {\rm\small OPT}\rceil\), which is negligible.
We now remove Assumption 1(iii) by factoring in the extra delays due to estimating \(m_k\) and \(\eta _k\), assuming no bad events occur. In the median estimation, we took a sample S of size \(\lceil \frac{\log 2n}{\delta ^2} \rceil\) and processed every job in S by exactly \(\tilde{m}_k\). So, the maximum delay due to the processing is at most \((\tilde{m}_k) \cdot |S| \cdot |J_k| = O((\log n) \tilde{m}_k n_k)\). Similarly, in estimating \(\eta _k\), we took a sample P of size \(O (\frac{1}{\epsilon ^2} \log n)\) and processed both jobs in each pair in P up to \((1+2\epsilon) \tilde{m}_k\) units. Thus, this processing causes total extra delay at most \(2 (1+2\epsilon) \tilde{m}_k \cdot |P| \cdot |J_k| = O (\frac{1}{\epsilon ^2} (\log n) \tilde{m}_k n_k)\).
Hence, the extra delay cost due to the estimation is bounded by
with probability \(1 - O((\log n) / n^2)\). Thus, the extra delay is negligible with high probability. The above discussion, Theorem 1, and Theorem 2, with \(\epsilon\) scaled appropriately by a constant factor, yield:
4.5 Guarantees in Expectation
Previously, we showed high probability guarantees on our algorithm’s objective. However, high probability guarantees inherently require \(\Omega (\log n)\) samples and, therefore, we are forced to stop sampling once the number of jobs alive becomes \(o(\log n)\). Here, we show that we can further continue to sample, if we only need guarantees in expectation, until we have \(O (\frac{1}{\epsilon ^3} \log \frac{1}{\epsilon })\) jobs unfinished.
Towards this end, we slightly change the algorithm.
(1) Reduce the sample sizes: For estimating \(m_k\) take a sample of size \(\lceil \frac{1}{\delta ^2} \log 2n_k \rceil\) in Algorithm 1 and for estimating \(\eta _k\) take a sample of size \(\lceil \frac{32^2}{\delta ^4\epsilon ^2} \log n_k \rceil\) in Algorithm 2. Note that the sample size depends on \(n_k\).
(2) Run Round-Robin concurrently: We split each instantaneous time unit, running Round-Robin for an \(\epsilon\) fraction and our algorithm for the remaining \((1 - \epsilon)\) fraction (see the sketch after this list). More precisely, we pretend that we only run each of the algorithms without changing the jobs’ processing times. Let \(A_j(t)\) and \(B_j(t)\) denote how much Round-Robin and our algorithm have processed job j by time t, respectively. Then, job j completes at the first time when \(A_j(t) + B_j(t) = p_j\). However, we pretend that j is unfinished in our algorithm’s execution until our algorithm has processed j for its full processing time; we do the same for Round-Robin. This way, we can analyze each of the two algorithms as if only it were being run. Since this simulation of the hybrid algorithm can only slow down the execution of our algorithm by a factor of \((1 - \epsilon)\), intuitively, the bound in Theorem 3 only increases by a factor of \(1 / (1 - \epsilon)\), which has no effect on our asymptotic bounds. But by running the 2-competitive Round-Robin concurrently, our final schedule is always \(2 / \epsilon\)-competitive.
(3) Stop sampling when \(n_k = O(\frac{1}{\epsilon ^3} \log \frac{1}{\epsilon })\) (Line 2, Algorithm 3): This is doable, as we can tolerate higher probabilities of bad events, thanks to the concurrent execution of Round-Robin.
(4) In the final round \(K+1\), process all jobs in increasing order of their predicted size: As we only have \(|J_{K+1}| = O\left(\frac{1}{\epsilon ^3} \log \frac{1}{\epsilon } \right)\) jobs left, following the prediction blindly will not hurt much!
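The following toy simulation illustrates the time-sharing bookkeeping in item (2). For brevity it uses SPJF as a stand-in for our algorithm, and the step size dt is an artifact of the discretization; both are assumptions made purely for illustration.

```python
def hybrid_completion_times(p, p_hat, eps=0.1, dt=1e-3):
    """Each instant is split: the virtual Round-Robin gets an eps fraction and
    the virtual prediction-following schedule (SPJF here) gets the rest.
    Each virtual schedule treats job j as alive until *it alone* has given j
    its full size p_j, but j actually completes once A_j + B_j reaches p_j."""
    n = len(p)
    A = [0.0] * n                      # work done on each job by virtual Round-Robin
    B = [0.0] * n                      # work done on each job by virtual SPJF
    completion = [None] * n
    spjf_order = sorted(range(n), key=lambda j: p_hat[j])
    t = 0.0
    while None in completion:
        rr_alive = [j for j in range(n) if A[j] < p[j]]
        for j in rr_alive:             # Round-Robin splits its eps*dt share equally
            A[j] += eps * dt / len(rr_alive)
        for j in spjf_order:           # SPJF spends its (1-eps)*dt on one job
            if B[j] < p[j]:
                B[j] += (1 - eps) * dt
                break
        t += dt
        for j in range(n):
            if completion[j] is None and A[j] + B[j] >= p[j]:
                completion[j] = t      # actual completion time of job j
    return completion

print(hybrid_completion_times([1.0, 2.0, 4.0], [1.0, 2.0, 4.0]))
```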
To show this, we first need the claim below. As before, we use \(\mathscr{B}_k\) to denote the bad event in round k. Our first goal is to upper bound the probability that any bad event occurs.
5 Lower Bounds
We first show that our analysis of Algorithm 3 is tight. Recall Theorem 2, which states that the algorithm’s objective is at most \((1+\epsilon) {\rm\small OPT}+ \frac{1}{\epsilon ^2} \nu\) if for any subset \(Z \subseteq J\) of jobs whose size is polylogarithmic in n, \({\rm\small OPT}(Z) = o({\rm\small OPT})\).
The following theorem concerns the tradeoff between \({\rm\small OPT}\) and \(\nu\) in the consistency bound:
6 Conclusions and Future Directions
In this article, we defined a new prediction error measure based on natural desiderata. We believe that the new measure could be useful for other optimization problems with ML predictions where the \(\ell _1\)-norm measure is not suitable. We remark that our contributions are primarily theoretical and focus on asymptotic analysis; obtaining practical algorithms that achieve similar guarantees while incurring a small overhead (say, over Round-Robin) is an interesting question. Other future research directions include finding a deterministic algorithm with similar guarantees, obtaining a better dependence on \(\nu\), and extending the error notion to the setting where jobs have different arrival times.
Acknowledgments
Part of this work was done while Sungjin Im was visiting Google Research. We thank Joan Boyar and Martin Kristian Lorenzen for bringing to our attention some issues in an earlier version of the article.
Footnotes
1. Here, \(\alpha\)-robustness and \(\beta\)-consistency mean that the algorithm’s cost is at most \(\alpha\) times the optimum for all inputs but improves to at most a \(\beta\) factor when the prediction coincides with the actual input. See Definition 2.
2. For a concrete example, consider n jobs that have unit sizes with sufficiently small perturbations. The derivative of the objective is n with respect to the length of the smallest job, yet it is 1 with respect to that of the largest job.
3. In the setting where job j has a release time \(r_j\), the flow time of a job is defined as \(C_j - r_j\), where \(C_j\) is the completion time of job j in the schedule. If all jobs are available at time 0, then the flow time coincides with the completion time.
4. An instance here is specified by both the predicted job sizes and the true job sizes.
5. Shortest Predicted Job First (SPJF) is the algorithm that blindly follows the predictions.
6. For instance, this can be achieved almost surely by adding small random perturbations to the initial job sizes. So, if a tiny random number \(\iota _j \gt 0\) is added to \(p_j\), then we pretend that j has a remaining size of \(\iota _j\) as soon as it actually completes. This perturbation has negligible effects on the objective.
7. In fact, the jobs can be processed in an arbitrary order as long as they are processed up to \(2\tilde{m}_k\) units.
8. Since all alive jobs are equally processed at each time in an RR round, the number of alive jobs can change while a job is being processed, which is not the case in non-RR rounds.
9. Note that \(\sum _{j \in J} p_j\) is already factored into the upper bound of \(A_k\) stated in Lemma 5. Thus, we can show that the algorithm’s cost is at most \(534\, {\rm\small OPT}(J)\).
A Appendix
A.1 Concentration Bounds
Let \(X_1 , \ldots , X_{\ell }\) be independent random variables such that \(0 \le X_i \le 1\) for all \(i \in [\ell ]\). We define the empirical mean of these variables by \(\bar{X} = \frac{1}{\ell } (X_1 + \cdots + X_{\ell })\). Then,
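The displayed inequality did not survive extraction. A standard bound that matches this setup (independent variables in \([0,1]\) and their empirical mean over \(\ell\) samples) is Hoeffding’s inequality, stated below as a reasonable stand-in for the bound the analysis relies on:
\[
\Pr\left[\, \left| \bar{X} - \mathbb{E}[\bar{X}] \right| \ge t \,\right] \;\le\; 2\exp\left(-2\ell t^2\right) \quad \text{for every } t \gt 0.
\]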
References
[1] Anders Aamand, Piotr Indyk, and Ali Vakilian. 2019. (Learned) frequency estimation algorithms under Zipfian distribution. arXiv:1908.05198.
Maryam Amiri and Leyli Mohammad-Khanli. 2017. Survey on prediction models of applications for resources provisioning in cloud. J. Netw. Comput. Applic. 82 (2017), 93–113.
Antonios Antoniadis, Christian Coester, Marek Elias, Adam Polak, and Bertrand Simon. 2020. Online metric algorithms with untrusted predictions. In ICML. 345–355.
Antonios Antoniadis, Themis Gouleakis, Pieter Kleer, and Pavel Kolev. 2020. Secretary and online matching problems with machine learned advice. In NeurIPS. 7933–7944.
Étienne Bamas, Andreas Maggiori, Lars Rohwedder, and Ola Svensson. 2020. Learning augmented energy minimization via speed scaling. In NeurIPS. 15350–15359.
Michael Dinitz, Sungjin Im, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. 2021. Faster matchings via learned duals. In NeurIPS. 10393–10406.
Thomas Lavastida, Benjamin Moseley, R. Ravi, and Chenyang Xu. 2021. Learnable and instance-robust predictions for online matching, flows and load balancing. In ESA. 59:1–59:17.
Andréa Matsunaga and José A. B. Fortes. 2010. On the use of machine learning to predict the time and resources consumed by applications. In CCGRID. 495–504.
Ilia Pietri, Gideon Juve, Ewa Deelman, and Rizos Sakellariou. 2014. A performance model to estimate execution time of scientific workflows on the Cloud. In Workshop on Workflows in Support of Large-Scale Science. 11–19.
Kirk Pruhs, Jirí Sgall, and Eric Torng. 2004. Online scheduling. In Handbook of Scheduling - Algorithms, Models, and Performance Analysis, Joseph Y.-T. Leung (Ed.). Chapman and Hall/CRC.
Bo Sun, Ali Zeynali, Tongxin Li, Mohammad Hassan Hajiesmaili, Adam Wierman, and Danny H. K. Tsang. 2020. Competitive algorithms for the online multiple knapsack problem with application to electric vehicle charging. Proc. ACM Meas. Anal. Comput. Syst. 4, 3 (2020), 51:1–51:32.