Optimal Scheduling Algorithm For Distributed-Memory Machines
Abstract—Task scheduling is one of the key elements in any distributed-memory machine (DMM), and an efficient algorithm can help reduce the interprocessor communication time. As optimal scheduling of tasks to DMMs is a strong NP-hard problem, many heuristic algorithms have been introduced in the literature. This paper presents a Task Duplication based Scheduling (TDS) algorithm which can schedule directed acyclic graphs (DAGs) with a complexity of O(|V|²), where |V| is the number of tasks in the DAG. This algorithm generates an optimal schedule for a class of DAGs which satisfy a simple cost relationship. The performance of the algorithm has been observed by applying it to some practical DAGs and by comparing it with other existing scheduling schemes in terms of schedule length and algorithm complexity.
Index Terms—Directed acyclic graph, distributed-memory machines, optimal scheduling algorithms, task duplication, task scheduling.
1 INTRODUCTION

The proposed algorithm is described in Section 2 and the results are shown in Section 3. Finally, Section 4 provides the conclusions.

2 TDS ALGORITHM

The motivation behind this work is to introduce a fast algorithm which can schedule tasks such that they finish their execution in optimal time. In this paper, it is assumed that the task graph is available in the form of a DAG defined by the tuple (V, E, W, c), where V is the set of tasks and E is the set of edges. The set W consists of computation costs, and each task i ∈ V has a computation cost represented by W(i). Similarly, c is the set of communication costs, and each edge from task i to task j, e_{i,j} ∈ E, has a cost c_{i,j} associated with it. Without loss of generality, it can be assumed that the DAG has one entry node and one exit node. If there are multiple entry or exit nodes, the multiple nodes can always be connected through a dummy node which has zero computation cost and zero-cost communication edges.

A task is an indivisible unit of work and is nonpreemptive. The underlying target architecture is assumed to be homogeneous, and the communication cost between any pair of processors, for a fixed message length, is the same. The number of processors is assumed to be unbounded, and an I/O coprocessor is available, so that computation and communication can be performed concurrently. It is assumed that memory space is not a constraint and that all tasks are stored at each processor; thus, different processors can execute duplicate copies of the same task using the same initial data. It is also assumed that there are no faults in the system and that the data generated by duplicate copies of the same task are consistent with each other.

The TDS algorithm schedules the tasks based on certain parameters, and the mathematical expressions to evaluate these parameters for a node i of the DAG are given below:

    pred(i) = {j | e_{j,i} ∈ E}    (1)
    succ(i) = {j | e_{i,j} ∈ E}    (2)
    est(i) = 0, if pred(i) = ∅     (3)

The computation of the earliest start and completion times proceeds in a top-down fashion, starting at the entry node and terminating at the exit node. The latest allowable start and completion times are determined in a bottom-up fashion, in which the process starts from the exit node and terminates at the entry node. For each task i, a favorite predecessor fpred(i) is assigned using (6), which signifies that assigning the task and its favorite predecessor to the same processor will result in a lower parallel time. The level of any node is the length of the longest path from the node to the exit node; when calculating the level, or the length of a path, communication times are ignored and only computation times are taken into account. The level of the entry node is thus the sum of the computation costs along the longest linear path, and the schedule length can never be lower than the level of the entry node of the DAG.

The TDS algorithm is shown to yield optimal results if the condition given below is satisfied by all the join nodes (nodes having more than one predecessor) of the DAG.

CONDITION 1. Let m and n be the predecessor tasks of task i which have the highest and second-highest values of {(ect(j) + c_{j,i}) | j ∈ pred(i)}, respectively. Then, one of the following must be satisfied:

• W(m) ≥ c_{n,i}, if est(m) ≥ est(n), or
• W(m) ≥ (c_{n,i} + est(n) − est(m)), if est(m) < est(n).

This condition can be satisfied if the DAG is of coarse grain and the communication requirements are low, but the DAG need not be of coarse grain for the condition to hold. If the condition is satisfied, an optimal solution is guaranteed, as proven in Section 3.

The pseudocode in Fig. 2 shows the steps involved in the algorithm. In step one, the task graph is traversed to compute the est, ect, and fpred of each node. Step two involves computing last, lact, and level for all the nodes of the task graph. These two steps are performed using (1)-(11).

Input:
    DAG (V, E, W, c)
    pred(i): Set of parents of task i.
    succ(i): Set of successor tasks of task i.
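As an illustration of steps one and two, the following is a minimal Python sketch written for this presentation; it is not the authors' Fig. 2 pseudocode. It assumes the DAG is given as predecessor/successor adjacency dicts with cost maps W and c, plus a topological order topo, and it reconstructs the recurrences for ect, fpred, last, lact, and level from their textual descriptions above, since only (1)-(3) are reproduced here.

# Illustrative sketch of steps one and two of TDS (not the paper's Fig. 2).
# W[i]: computation cost of task i; c[(j, i)]: communication cost of edge j -> i.
def compute_parameters(topo, pred, succ, W, c):
    est, ect, fpred = {}, {}, {}
    for i in topo:                                 # top-down pass
        if not pred[i]:                            # entry node, eq. (3)
            est[i], fpred[i] = 0, None
        else:
            # Favorite predecessor: the parent maximizing ect(j) + c(j, i).
            j = max(pred[i], key=lambda p: ect[p] + c[(p, i)])
            fpred[i] = j
            # Co-locating i with j saves c(j, i); the start time is then
            # bounded by ect(j) and by the other parents' arrival times.
            others = [ect[p] + c[(p, i)] for p in pred[i] if p != j]
            est[i] = max([ect[j]] + others)
        ect[i] = est[i] + W[i]
    last, lact, level = {}, {}, {}
    for i in reversed(topo):                       # bottom-up pass
        if not succ[i]:                            # exit node
            lact[i], level[i] = ect[i], W[i]
        else:
            # A successor that favors i may run on the same processor,
            # so no communication delay is charged for it.
            lact[i] = min(last[s] if fpred[s] == i else last[s] - c[(i, s)]
                          for s in succ[i])
            level[i] = W[i] + max(level[s] for s in succ[i])
        last[i] = lact[i] - W[i]
    return est, ect, last, lact, level, fpred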
Step 3, shown in Fig. 3, generates the task clusters and is based on the parameters computed in steps one and two and on the array queue. The elements of the array queue are the nodes of the task graph sorted in smallest-level-first order. Each cluster is assigned to a different processor, and the generation of a cluster is initiated from the first task in the array queue which has not yet been assigned to a processor. The cluster is completed by performing a search, similar to a depth-first search, starting from this initial task. The search traces the path from the initial task, selected from queue, to the entry node by following the favorite predecessors. If the favorite predecessor is unassigned, i.e., not yet assigned to a processor, it is selected. Otherwise, the favorite predecessor may or may not be duplicated onto the current processor. Suppose i is the current task and its favorite predecessor j has been assigned to another processor. In this case, before task j is duplicated onto the current processor, it is important to examine whether task j is critical for i: task j is critical for task i if the condition (last(i) − lact(j)) < c_{j,i} is satisfied. If j is not critical, the process of cluster generation can continue through any other unassigned predecessor of i, which helps reduce the number of tasks that are duplicated. In case j is critical, the other predecessors which have not yet been assigned to a processor are examined to see whether they could initially have been the favorite predecessor; this could have happened if, for another task k (k ∈ pred(i), k ≠ j), (ect(k) + c_{k,i}) = (ect(j) + c_{j,i}). If such a task k exists, the path to the entry node is traced through task k; otherwise, task j is duplicated on the current processor. The generation of a cluster terminates once the path reaches the entry node. The next cluster starts from the first unassigned task in queue. If all tasks are assigned to a processor, the algorithm terminates.
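Continuing the sketch above (again illustrative, not the paper's Fig. 3 pseudocode), the cluster-generation step can be expressed as follows; queue is assumed to hold the tasks in smallest-level-first order, and each returned cluster corresponds to one processor.

# Illustrative sketch of step three: cluster generation with duplication.
def generate_clusters(queue, pred, c, ect, last, lact, fpred):
    assigned, clusters = set(), []
    for start in queue:
        if start in assigned:
            continue
        cluster, i = [], start
        while i is not None:
            cluster.append(i)
            assigned.add(i)
            j = fpred[i]
            if j is None:                          # entry node reached
                break
            if j not in assigned:
                i = j                              # follow unassigned favorite
                continue
            # j is already in another cluster: duplicate it only if critical.
            if (last[i] - lact[j]) < c[(j, i)]:    # criticality test
                # Prefer an unassigned predecessor that ties with j as
                # favorite over duplicating j.
                ties = [k for k in pred[i]
                        if k != j and k not in assigned
                        and ect[k] + c[(k, i)] == ect[j] + c[(j, i)]]
                i = ties[0] if ties else j         # else duplicate j here
            else:
                # Not critical: continue through any unassigned predecessor,
                # or end this cluster if every parent is already scheduled.
                alt = [p for p in pred[i] if p not in assigned]
                i = alt[0] if alt else None
        clusters.append(cluster)                   # one cluster per processor
    return clusters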
2.1 Complexity of the Scheduling Algorithm
The first and second steps of the algorithm traverse each task of the task graph and compute the start and completion times. At each node, the incoming and outgoing edges are examined and, in the worst case, all the edges of the DAG are examined. Since a DAG with |V| tasks has at most O(|V|²) edges, these steps run in O(|V|²) time.
TABLE 1
START AND COMPLETION TIMES FOR THE NODES
Node level est ect last lact fpred
1 21 0 3 0 3 –
2 9 3 5 6 8 1
3 11 3 7 6 10 1
4 18 3 6 3 6 1
5 7 7 9 10 12 3
6 15 6 12 6 12 4
7 11 6 8 8 10 4
8 5 12 14 14 16 6
9 9 12 18 12 18 6
10 3 18 21 18 21 9
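Two invariants are visible in Table 1: ect(i) = est(i) + W(i) and lact(i) = last(i) + W(i), so both differences recover the node's computation cost. The short check below, written for this presentation, encodes the table and verifies them, along with the Section 2 observation that the schedule length (the ect of exit node 10) matches the level of the entry node:

# Sanity check over Table 1 (columns: level, est, ect, last, lact).
table1 = {
    1: (21, 0, 3, 0, 3),    2: (9, 3, 5, 6, 8),     3: (11, 3, 7, 6, 10),
    4: (18, 3, 6, 3, 6),    5: (7, 7, 9, 10, 12),   6: (15, 6, 12, 6, 12),
    7: (11, 6, 8, 8, 10),   8: (5, 12, 14, 14, 16), 9: (9, 12, 18, 12, 18),
    10: (3, 18, 21, 18, 21),
}
for node, (level, est, ect, last, lact) in table1.items():
    # ect - est and lact - last must both equal W(node).
    assert ect - est == lact - last
# The exit node's ect (the schedule length) equals the entry node's level,
# i.e., the schedule meets the lower bound stated in Section 2.
assert table1[10][2] == table1[1][0] == 21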
It will be proven that the start time of task i cannot be lowered by assigning tasks m and n to the same processor if the condition is satisfied. Thus, tasks m and n have to be assigned to different processors. The other predecessors may have any values of computation and communication times, but task i has to wait until ect(m) or ect(n) + c_{n,i}, whichever is higher. Thus, the other tasks will not affect est(i), as long as the condition ect(k) + c_{k,i} ≤ ect(n) + c_{n,i} is satisfied for all k ∈ pred(i), k ≠ m, n. Thus, only tasks m and n need to be considered among all the predecessors of task i. There are two possible cases here.

CASE 1. est(m) ≥ est(n). From the condition stated in Section 2, W(m) ≥ c_{n,i} has to be satisfied. Here, again, there could be two cases:

CASE 1a. ect(m) ≥ ect(n) + c_{n,i}, i.e., est(i) = ect(m). If tasks m, n, and i are assigned to the same processor, est(i) = max(est(m), est(n) + W(n)) + W(m), i.e., est(i) = max(ect(m), ect(n) + W(m)). Thus, est(i) cannot be reduced below ect(m) by assigning m, n, and i to the same processor.

CASE 1b. ect(m) < ect(n) + c_{n,i}, i.e., est(i) = ect(n) + c_{n,i}. If tasks m, n, and i are assigned to the same processor, the earliest task i can start is given by (est(n) + W(n) + W(m)), i.e., ect(n) + W(m). Since W(m) ≥ c_{n,i}, ect(n) + W(m) is greater than or equal to ect(n) + c_{n,i}. Thus, est(i) cannot be lowered.

CASE 2. est(m) < est(n). From the condition stated in Section 2, W(m) ≥ c_{n,i} + est(n) − est(m) has to be satisfied.

CASE 2a. ect(m) ≥ ect(n) + c_{n,i}, i.e., est(i) = ect(m). If tasks m, n, and i are assigned to the same processor, est(i) = est(m) + W(m) + W(n) = ect(m) + W(n). Thus, the start time of task i cannot be lower than ect(m).

CASE 2b. ect(m) < ect(n) + c_{n,i}, i.e., est(i) = ect(n) + c_{n,i}. If m, n, and i are assigned to the same processor, the earliest start time of task i would be est(m) + W(m) + W(n). The start time of task i can be improved only if est(m) + W(m) + W(n) < est(n) + W(n) + c_{n,i}; in other words, if est(m) + W(m) < est(n) + c_{n,i}, or if W(m) < (est(n) − est(m)) + c_{n,i}. But it was assumed that W(m) ≥ (est(n) − est(m)) + c_{n,i}. Thus, the start time of task i cannot be lowered.

This proves that, if the condition given in Section 2 is satisfied by all the join nodes of the DAG, then the TDS algorithm yields the earliest possible start time and, consequently, the earliest possible completion time for all the tasks of the DAG.
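As a concrete illustration of Case 1b, the snippet below (with values chosen here, not taken from the paper) checks Condition 1 and confirms that merging m, n, and i onto one processor cannot lower est(i):

# Condition 1 from Section 2, for the top-two predecessors m and n of i.
def condition1_holds(W_m, est_m, est_n, c_ni):
    if est_m >= est_n:
        return W_m >= c_ni
    return W_m >= c_ni + est_n - est_m

# Example values: est(n)=0, W(n)=2, est(m)=1, W(m)=4, c(n,i)=4.
est_n, W_n, est_m, W_m, c_ni = 0, 2, 1, 4, 4
assert condition1_holds(W_m, est_m, est_n, c_ni)
ect_n, ect_m = est_n + W_n, est_m + W_m            # 2, 5
# Case 1b applies: ect(m) < ect(n) + c(n,i), so est(i) = ect(n) + c(n,i) = 6.
est_i_separate = max(ect_m, ect_n + c_ni)          # 6
# Merged onto one processor, n runs first, then m, then i: the saved
# communication c(n,i) is replaced by m's computation time W(m).
est_i_merged = ect_n + W_m                         # 6
assert est_i_merged >= est_i_separate              # merging cannot lower est(i)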
TABLE 2
RATIO OF SCHEDULE LENGTH GENERATED BY LC OVER TDS ALGORITHM

                          Condition   No. of Processors    Communication to Computation Ratio
Algorithm                 Satisfied   Required by TDS      1        50       100      150      200
Bellman-Ford              Yes         1,171                1.0000   1.0007   1.0005   1.0004   1.0004
Cholesky Decomposition    Yes         342                  1.159    1.272    1.272    1.272    1.272
Master-Slave              No          50                   1.001    1.017    1.031    1.043    1.054
Systolic                  No          97                   1.002    1.077    1.143    1.200    1.245
them to different processors. Thus, like the TDS algorithm, the LC algorithm also does not assign two independent tasks to the same processor. The worst case schedule generated by the TDS algorithm matches the schedule generated by the linear clustering algorithm.

The ratios of the schedule time of LC to TDS have been obtained for each ratio of communication to computation time and are shown in Fig. 11. The communication to computation time ratios are varied from one to 200. It can be observed that the Systolic and Master-Slave algorithms provide a steadily rising schedule for TDS as compared to the LC algorithm because of the structured nature of these DAGs; they are more suited to scheduling algorithms which schedule using the concept of linear clusters, i.e., clusters which do not contain two independent tasks. It can be seen that, for the Bellman-Ford algorithm, TDS is marginally better than LC. The schedules generated by both algorithms are almost the same, which is the reason why the TDS algorithm does not perform much better than the linear clustering algorithm. For the Cholesky decomposition algorithm, the schedules generated by TDS and linear clustering are also similar, but the ratio of schedule time is almost constant at 28 percent for most of the range and may not rise any further. The schedule time primarily consists of communication costs; thus, for higher values of communication to computation time ratios, the ratio of schedule lengths will, in effect, be the ratio of intertask communication costs. Since the communication times for both algorithms increase by the same constant, the ratio between the schedule lengths remains almost constant. In Section 2, it was stated that optimal results can be obtained if the DAG is of coarse granularity. From the results here, it can be noticed that, if the DAGs are of finer granularity, optimal results may not be obtained, but the results are still good compared to the linear clustering algorithm. The speedup achieved by the TDS algorithm over the LC algorithm for five different ratios of communication to computation times is shown in Table 2.

3.4 Comparison with Other Algorithms
The DAG [8] shown in Fig. 12 has been used to compare the TDS algorithm with five other algorithms, namely, the DSC algorithm [8], the Linear Clustering algorithm [11], the Internalization Prepass, the MCP algorithm, and the Threshold Scheduling algorithm; the schedule lengths and complexities are summarized in Table 3.

Fig. 12. DAG for comparison of algorithms.

TABLE 3
COMPARISON OF ALGORITHMS

Algorithm                        Schedule Length   Complexity
Optimal Algorithm                8.5               NP-Complete
Linear Clustering                11.5              O(|V|(|E| + |V|))
MCP Algorithm                    10.5              O(|V|² log |V|)
Internalization Prepass          10.0              O(|E|(|V| + |E|))
Dominant Sequence Clustering     9.0               O((|E| + |V|) log |V|)
Threshold Scheduling Algorithm   10.0              O(|V|²)
TDS Algorithm                    8.5               O(|V|²)

4 CONCLUSIONS
This paper presents a fast algorithm, based on the duplication of tasks, to schedule the tasks of a DAG onto the processors of DMMs. The algorithm has a complexity of O(|V|²), where |V| is the number of tasks of the DAG, and provides the best possible solution if the task graph satisfies a simple condition. Even if the condition is not satisfied, the algorithm provides a good schedule, close to the optimum solution.

The performance of TDS has been observed by scheduling some practical DAGs onto DMMs and comparing against the schedule length obtained by the Linear Clustering algorithm. The TDS algorithm has also been compared to other algorithms in terms of its complexity and the schedule length generated.
REFERENCES
[1] T.L. Adam, K.M. Chandy, and J.R. Dickson, “A Comparison of List Schedules for Parallel Processing Systems,” Comm. ACM, vol. 17, no. 12, pp. 685-690, Dec. 1974.
[2] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
[3] D.P. Bertsekas and J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice-Hall Int’l, 1989.
[4] H.B. Chen, B. Shirazi, K. Kavi, and A.R. Hurson, “Static Scheduling Using Linear Clustering with Task Duplication,” Proc. ISCA Int’l Conf. Parallel and Distributed Computing and Systems, pp. 285-290, Louisville, Ky., Oct. 14-16, 1993.
[5] J.Y. Colin and P. Chrétienne, “C.P.M. Scheduling with Small Communication Delays and Task Duplication,” Operations Research, vol. 39, no. 4, pp. 680-684, July 1991.
[6] S. Darbha and D.P. Agrawal, “SDBS: A Task Duplication Based Optimal Scheduling Algorithm,” Proc. Scalable High Performance Computing Conf., pp. 756-763, Knoxville, Tenn., May 23-25, 1994.
[7] H. El-Rewini and T.G. Lewis, “Scheduling Parallel Program Tasks Onto Arbitrary Target Architectures,” J. Parallel and Distributed Computing, vol. 9, pp. 138-153, 1990.
[8] A. Gerasoulis and T. Yang, “A Comparison of Clustering Heuristics for Scheduling Directed Acyclic Graphs on Multiprocessors,” J. Parallel and Distributed Computing, vol. 16, pp. 276-291, 1992.
[9] R.L. Graham, E.L. Lawler, J.K. Lenstra, and A.H.G. Rinnooy Kan, “Optimization and Approximation in Deterministic Sequencing and Scheduling: A Survey,” Annals of Discrete Mathematics, pp. 287-326, 1979.
[10] O.H. Ibarra and S.M. Sohn, “On Mapping Systolic Algorithms onto the Hypercube,” IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 1, pp. 48-63, Jan. 1990.
[11] S.J. Kim and J.C. Browne, “A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures,” Proc. Int’l Conf. Parallel Processing, vol. 3, pp. 1-8, 1988.
[12] J.P. Kitajima and B. Plateau, “Building Synthetic Parallel Programs: The Project (ALPES),” Proc. IFIP WG 10.3 Workshop on Programming Environments for Parallel Computing, pp. 161-170, Edinburgh, Scotland, Apr. 6-8, 1992.
[13] B. Kruatrachue, “Static Task Scheduling and Grain Packing in Parallel Processing Systems,” PhD thesis, Oregon State Univ., 1987.
[14] Y.-K. Kwok and I. Ahmad, “Exploiting Duplication to Minimize the Execution Times of Parallel Programs on Message-Passing Systems,” Proc. Sixth IEEE Symp. Parallel and Distributed Processing, pp. 426-433, Oct. 26-29, 1994.
[15] S.S. Pande, D.P. Agrawal, and J. Mauney, “A New Threshold Scheduling Strategy for Sisal Programs on Distributed Memory Systems,” J. Parallel and Distributed Computing, vol. 21, no. 2, pp. 223-236, May 1994.
[16] S.S. Pande, D.P. Agrawal, and J. Mauney, “A Scalable Scheduling Method for Functional Parallelism on Distributed Memory Multiprocessors,” IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 4, pp. 388-399, Apr. 1995.
[17] V. Sarkar, Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. Cambridge, Mass.: MIT Press, 1989.
[18] G.C. Sih and E.A. Lee, “A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures,” IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 2, pp. 175-187, Feb. 1993.
[19] M.Y. Wu and D. Gajski, “A Programming Aid for Hypercube Architectures,” J. Supercomputing, vol. 2, pp. 349-372, 1988.
Sekhar Darbha (S’89, M’96) received his BTech degree in electrical engineering from the Institute of Technology, Banaras Hindu University, Varanasi, India, in 1989. He received his MS and PhD degrees in computer engineering from North Carolina State University, Raleigh, North Carolina, in 1991 and 1995, respectively. He has been working as an assistant professor in the Department of Electrical and Computer Engineering at Rutgers University since August 1995. His research interests are in program partitioning and scheduling for multiprocessing systems. He served as a coordinator of the partitioning and scheduling minitrack at the 30th Hawaii International Conference on System Sciences (HICSS), and he will be involved with the 31st HICSS as coordinator of the minitrack on compiling for distributed and embedded systems. He is a member of the IEEE.

Dharma P. Agrawal (M’84-F’87) is a professor in the Department of Electrical and Computer Engineering at North Carolina State University, Raleigh. His research interests include parallelizing and scheduling techniques, routing in multicomputer networks, mobile networks, and system reliability. He has edited a tutorial text on Advanced Computer Architecture (IEEE Computer Society Press, 1986), coedited texts entitled Distributed Computing Network Reliability and Advances in Distributed System Reliability (IEEE Computer Society Press, 1990), and a self-study guide on Parallel Processing (IEEE Press, 1991).
Dr. Agrawal is an editor of the Journal of Parallel and Distributed Systems and the International Journal of High Speed Computing. He has served as an editor of IEEE Computer magazine and the IEEE Transactions on Computers. He has been the program chair for the 1984 International Symposium on Computer Architecture and the 1994 International Conference on Parallel Processing, the workshop chair for the 1995 ICPP Workshop on Challenges for Parallel Processing, and the general chair for the 1993 ISMM International Conference on Parallel and Distributed Computing Systems and the MASCOTS 1996 workshop. Recently, he served as the chair of the IEEE Computer Society Technical Committee on Computer Architecture and is currently chair of the IEEE-CS Harry Goode and McDowell award committees. He is a fellow of the IEEE.