Optimal Scheduling Algorithm For Distributed-Memory Machines
Abstract—Task scheduling is one of the key elements in any distributed-memory machine (DMM), and an efficient algorithm can help reduce the interprocessor communication time. As optimal scheduling of tasks to DMMs is a strong NP-hard problem, many heuristic algorithms have been introduced in the literature. This paper presents a Task Duplication based Scheduling (TDS) algorithm which can schedule directed acyclic graphs (DAGs) with a complexity of O(|V|²), where |V| is the number of tasks in the DAG. This algorithm generates an optimal schedule for a class of DAGs which satisfy a simple cost relationship. The performance of the algorithm has been observed by applying it to some practical DAGs and by comparing it with other existing scheduling schemes in terms of schedule length and algorithm complexity.
Index Terms—Directed acyclic graph, distributed-memory machines, optimal scheduling algorithms, task duplication, task scheduling.
1 INTRODUCTION

The proposed algorithm is described in Section 2 and the results are shown in Section 3. Finally, Section 4 provides the conclusions.

2 TDS ALGORITHM

The motivation behind this work is to introduce a fast algorithm which can schedule tasks such that they finish their execution in optimal time. In this paper, it is assumed that the task graph is available in the form of a DAG defined by the tuple (V, E, W, c), where V is the set of tasks and E is the set of edges. The set W consists of computation costs, and each task i ∈ V has a computation cost represented by W(i). Similarly, c is the set of communication costs, and each edge from task i to task j, e_{i,j} ∈ E, has a cost c_{i,j} associated with it. Without loss of generality, it can be assumed that the DAG has one entry node and one exit node. If there are multiple entry or exit nodes, the multiple nodes can always be connected through a dummy node which has zero computation cost and zero-cost communication edges.

A task is an indivisible unit of work and is nonpreemptive. The underlying target architecture is assumed to be homogeneous, and the communication cost between any pair of processors, for a fixed message length, is the same. The number of processors is assumed to be unbounded, and an I/O coprocessor is available, so that computation and communication can be performed concurrently. It is assumed that memory space is not a constraint and that all tasks are stored at each processor; thus, different processors can execute duplicate copies of the same task using the same initial data. It is also assumed that there are no faults in the system and that the data generated by duplicate copies of the same task are consistent with each other.

The TDS algorithm schedules the tasks based on certain parameters, and the mathematical expressions to evaluate these parameters for a node i of the DAG are given below:

    pred(i) = {j | e_{j,i} ∈ E}    (1)
    succ(i) = {j | e_{i,j} ∈ E}    (2)
    est(i) = 0, if pred(i) = ∅     (3)

The computation of the earliest start and completion times proceeds in a top-down fashion, starting at the entry node and terminating at the exit node. The latest allowable start and completion times are determined in a bottom-up fashion, in which the process starts from the exit node and terminates at the entry node. For each task i, a favorite predecessor fpred(i) is assigned using (6), which signifies that assigning the task and its favorite predecessor to the same processor will result in a lower parallel time. The level of any node is the length of the longest path from the node to the exit node; when calculating the level, or the length of a path, communication times are ignored and only computation times are taken into account. The level of the entry node is thus the sum of the computation costs along the longest linear path, and the schedule length can never be lower than the level of the entry node of the DAG.

The TDS algorithm is shown to yield optimal results if the condition given below is satisfied by all the join nodes (nodes having more than one predecessor) of the DAG.

CONDITION 1. Let m and n be the predecessor tasks of task i which have the highest and second-highest values of {(ect(j) + c_{j,i}) | j ∈ pred(i)}, respectively. Then, one of the following must be satisfied:

• W(m) ≥ c_{n,i}, if est(m) ≥ est(n), or
• W(m) ≥ (c_{n,i} + est(n) − est(m)), if est(m) < est(n).

This condition can be satisfied if the DAG is of coarse grain and the communication requirements are low, but the DAG need not be of coarse grain for the condition to hold. If the condition is satisfied, an optimal solution is guaranteed, as proven in Section 3.

The pseudocode in Fig. 2 shows the steps involved in the algorithm. In step one, the task graph is traversed to compute the est, ect, and fpred of each node. Step two involves computing last, lact, and level for all the nodes of the task graph. These two steps are performed using (1)-(11).

Input:
    DAG (V, E, W, c)
    pred(i): Set of parents of task i.
    succ(i): Set of successor tasks of task i.
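As an illustration of steps one and two, the following is a minimal Python sketch written for this presentation; it is not the authors' Fig. 2 pseudocode. It assumes the DAG is given as predecessor/successor adjacency dicts with cost maps W and c, plus a topological order topo, and it reconstructs the recurrences for ect, fpred, last, lact, and level from their textual descriptions above, since only (1)-(3) are reproduced here.

# Illustrative sketch of steps one and two of TDS (not the paper's Fig. 2).
# W[i]: computation cost of task i; c[(j, i)]: communication cost of edge j -> i.
def compute_parameters(topo, pred, succ, W, c):
    est, ect, fpred = {}, {}, {}
    for i in topo:                                 # top-down pass
        if not pred[i]:                            # entry node, eq. (3)
            est[i], fpred[i] = 0, None
        else:
            # Favorite predecessor: the parent maximizing ect(j) + c(j, i).
            j = max(pred[i], key=lambda p: ect[p] + c[(p, i)])
            fpred[i] = j
            # Co-locating i with j saves c(j, i); the start time is then
            # bounded by ect(j) and by the other parents' arrival times.
            others = [ect[p] + c[(p, i)] for p in pred[i] if p != j]
            est[i] = max([ect[j]] + others)
        ect[i] = est[i] + W[i]
    last, lact, level = {}, {}, {}
    for i in reversed(topo):                       # bottom-up pass
        if not succ[i]:                            # exit node
            lact[i], level[i] = ect[i], W[i]
        else:
            # A successor that favors i may run on the same processor,
            # so no communication delay is charged for it.
            lact[i] = min(last[s] if fpred[s] == i else last[s] - c[(i, s)]
                          for s in succ[i])
            level[i] = W[i] + max(level[s] for s in succ[i])
        last[i] = lact[i] - W[i]
    return est, ect, last, lact, level, fpred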
Step 3, shown in Fig. 3, generates the task clusters and is based on the parameters computed in steps one and two and on the array queue. The elements of the array queue are the nodes of the task graph sorted in smallest-level-first order. Each cluster is assigned to a different processor, and the generation of a cluster is initiated from the first task in the array queue which has not yet been assigned to a processor. The cluster is completed by performing a search, similar to a depth-first search, starting from this initial task. The search traces the path from the initial task, selected from queue, to the entry node by following the favorite predecessors. If the favorite predecessor is unassigned, i.e., not yet assigned to a processor, it is selected. Otherwise, the favorite predecessor may or may not be duplicated onto the current processor. Suppose i is the current task and its favorite predecessor j has been assigned to another processor. In this case, before task j is duplicated onto the current processor, it is important to examine whether task j is critical for i: task j is critical for task i if the condition (last(i) − lact(j)) < c_{j,i} is satisfied. If j is not critical, the process of cluster generation can continue through any other unassigned predecessor of i, which helps reduce the number of tasks that are duplicated. In case j is critical, the other predecessors which have not yet been assigned to a processor are examined to see whether they could initially have been the favorite predecessor; this could have happened if, for another task k (k ∈ pred(i), k ≠ j), (ect(k) + c_{k,i}) = (ect(j) + c_{j,i}). If such a task k exists, the path to the entry node is traced through task k; otherwise, task j is duplicated on the current processor. The generation of a cluster terminates once the path reaches the entry node. The next cluster starts from the first unassigned task in queue. If all tasks are assigned to a processor, the algorithm terminates.
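Continuing the sketch above (again illustrative, not the paper's Fig. 3 pseudocode), the cluster-generation step can be expressed as follows; queue is assumed to hold the tasks in smallest-level-first order, and each returned cluster corresponds to one processor.

# Illustrative sketch of step three: cluster generation with duplication.
def generate_clusters(queue, pred, c, ect, last, lact, fpred):
    assigned, clusters = set(), []
    for start in queue:
        if start in assigned:
            continue
        cluster, i = [], start
        while i is not None:
            cluster.append(i)
            assigned.add(i)
            j = fpred[i]
            if j is None:                          # entry node reached
                break
            if j not in assigned:
                i = j                              # follow unassigned favorite
                continue
            # j is already in another cluster: duplicate it only if critical.
            if (last[i] - lact[j]) < c[(j, i)]:    # criticality test
                # Prefer an unassigned predecessor that ties with j as
                # favorite over duplicating j.
                ties = [k for k in pred[i]
                        if k != j and k not in assigned
                        and ect[k] + c[(k, i)] == ect[j] + c[(j, i)]]
                i = ties[0] if ties else j         # else duplicate j here
            else:
                # Not critical: continue through any unassigned predecessor,
                # or end this cluster if every parent is already scheduled.
                alt = [p for p in pred[i] if p not in assigned]
                i = alt[0] if alt else None
        clusters.append(cluster)                   # one cluster per processor
    return clusters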
2.1 Complexity of the Scheduling Algorithm
The first and second steps of the algorithm traverse each task of the task graph and compute the start and completion times. At each node, the incoming and outgoing edges are examined and, in the worst case, all the edges of the DAG are examined. Since a DAG with |V| tasks has at most O(|V|²) edges, these steps run in O(|V|²) time.
TABLE 1
START AND COMPLETION TIMES FOR THE NODES
Node level est ect last lact fpred
1 21 0 3 0 3 –
2 9 3 5 6 8 1
3 11 3 7 6 10 1
4 18 3 6 3 6 1
5 7 7 9 10 12 3
6 15 6 12 6 12 4
7 11 6 8 8 10 4
8 5 12 14 14 16 6
9 9 12 18 12 18 6
10 3 18 21 18 21 9
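Two invariants are visible in Table 1: ect(i) = est(i) + W(i) and lact(i) = last(i) + W(i), so both differences recover the node's computation cost. The short check below, written for this presentation, encodes the table and verifies them, along with the Section 2 observation that the schedule length (the ect of exit node 10) matches the level of the entry node:

# Sanity check over Table 1 (columns: level, est, ect, last, lact).
table1 = {
    1: (21, 0, 3, 0, 3),    2: (9, 3, 5, 6, 8),     3: (11, 3, 7, 6, 10),
    4: (18, 3, 6, 3, 6),    5: (7, 7, 9, 10, 12),   6: (15, 6, 12, 6, 12),
    7: (11, 6, 8, 8, 10),   8: (5, 12, 14, 14, 16), 9: (9, 12, 18, 12, 18),
    10: (3, 18, 21, 18, 21),
}
for node, (level, est, ect, last, lact) in table1.items():
    # ect - est and lact - last must both equal W(node).
    assert ect - est == lact - last
# The exit node's ect (the schedule length) equals the entry node's level,
# i.e., the schedule meets the lower bound stated in Section 2.
assert table1[10][2] == table1[1][0] == 21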
It will be proven that the start time of task i cannot be lowered by assigning tasks m and n to the same processor if the condition is satisfied. Thus, tasks m and n have to be assigned to different processors. The other predecessors may have any values of computation and communication times, but task i has to wait until ect(m) or ect(n) + c_{n,i}, whichever is higher. Thus, the other tasks will not affect est(i), as long as the condition ect(k) + c_{k,i} ≤ ect(n) + c_{n,i} is satisfied for all k ∈ pred(i), k ≠ m, n. Thus, only tasks m and n need to be considered among all the predecessors of task i. There are two possible cases here.

CASE 1. est(m) ≥ est(n). From the condition stated in Section 2, W(m) ≥ c_{n,i} has to be satisfied. Here, again, there could be two cases:

CASE 1a. ect(m) ≥ ect(n) + c_{n,i}, i.e., est(i) = ect(m). If tasks m, n, and i are assigned to the same processor, est(i) = max(est(m), est(n) + W(n)) + W(m), i.e., est(i) = max(ect(m), ect(n) + W(m)). Thus, est(i) cannot be reduced below ect(m) by assigning m, n, and i to the same processor.

CASE 1b. ect(m) < ect(n) + c_{n,i}, i.e., est(i) = ect(n) + c_{n,i}. If tasks m, n, and i are assigned to the same processor, the earliest task i can start is given by (est(n) + W(n) + W(m)), i.e., ect(n) + W(m). Since W(m) ≥ c_{n,i}, ect(n) + W(m) is greater than or equal to ect(n) + c_{n,i}. Thus, est(i) cannot be lowered.

CASE 2. est(m) < est(n). From the condition stated in Section 2, W(m) ≥ c_{n,i} + est(n) − est(m) has to be satisfied.

CASE 2a. ect(m) ≥ ect(n) + c_{n,i}, i.e., est(i) = ect(m). If tasks m, n, and i are assigned to the same processor, est(i) = est(m) + W(m) + W(n) = ect(m) + W(n). Thus, the start time of task i cannot be lower than ect(m).

CASE 2b. ect(m) < ect(n) + c_{n,i}, i.e., est(i) = ect(n) + c_{n,i}. If m, n, and i are assigned to the same processor, the earliest start time of task i would be est(m) + W(m) + W(n). The start time of task i can be improved only if est(m) + W(m) + W(n) < est(n) + W(n) + c_{n,i}; in other words, if est(m) + W(m) < est(n) + c_{n,i}, or if W(m) < (est(n) − est(m)) + c_{n,i}. But it was assumed that W(m) ≥ (est(n) − est(m)) + c_{n,i}. Thus, the start time of task i cannot be lowered.

This proves that, if the condition given in Section 2 is satisfied by all the join nodes of the DAG, then the TDS algorithm yields the earliest possible start time and, consequently, the earliest possible completion time for all the tasks of the DAG.
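As a concrete illustration of Case 1b, the snippet below (with values chosen here, not taken from the paper) checks Condition 1 and confirms that merging m, n, and i onto one processor cannot lower est(i):

# Condition 1 from Section 2, for the top-two predecessors m and n of i.
def condition1_holds(W_m, est_m, est_n, c_ni):
    if est_m >= est_n:
        return W_m >= c_ni
    return W_m >= c_ni + est_n - est_m

# Example values: est(n)=0, W(n)=2, est(m)=1, W(m)=4, c(n,i)=4.
est_n, W_n, est_m, W_m, c_ni = 0, 2, 1, 4, 4
assert condition1_holds(W_m, est_m, est_n, c_ni)
ect_n, ect_m = est_n + W_n, est_m + W_m            # 2, 5
# Case 1b applies: ect(m) < ect(n) + c(n,i), so est(i) = ect(n) + c(n,i) = 6.
est_i_separate = max(ect_m, ect_n + c_ni)          # 6
# Merged onto one processor, n runs first, then m, then i: the saved
# communication c(n,i) is replaced by m's computation time W(m).
est_i_merged = ect_n + W_m                         # 6
assert est_i_merged >= est_i_separate              # merging cannot lower est(i)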
TABLE 2
RATIO OF SCHEDULE LENGTH GENERATED BY LC OVER TDS ALGORITHM

                          Condition   No. of Processors    Communication to Computation Ratio
Algorithm                 Satisfied   Required by TDS      1        50       100      150      200
Bellman-Ford              Yes         1,171                1.0000   1.0007   1.0005   1.0004   1.0004
Cholesky Decomposition    Yes         342                  1.159    1.272    1.272    1.272    1.272
Master-Slave              No          50                   1.001    1.017    1.031    1.043    1.054
Systolic                  No          97                   1.002    1.077    1.143    1.200    1.245
them to different processors. Thus, like the TDS algorithm, the LC algorithm also does not assign two independent tasks to the same processor. The worst case schedule generated by the TDS algorithm matches the schedule generated by the linear clustering algorithm.

The ratios of the schedule time of LC to TDS have been obtained for each ratio of communication to computation time and are shown in Fig. 11. The communication to computation time ratios are varied from one to 200. It can be observed that the Systolic and Master-Slave algorithms provide a steadily rising schedule for TDS as compared to the LC algorithm because of the structured nature of these DAGs; they are more suited to scheduling algorithms which schedule using the concept of linear clusters, i.e., clusters which do not contain two independent tasks. It can be seen that, for the Bellman-Ford algorithm, TDS is marginally better than LC. The schedules generated by both algorithms are almost the same, which is the reason why the TDS algorithm does not perform much better than the linear clustering algorithm. For the Cholesky decomposition algorithm, the schedules generated by TDS and linear clustering are also similar, but the ratio of schedule time is almost constant at 28 percent for most of the range and may not rise any further. The schedule time primarily consists of communication costs; thus, for higher values of communication to computation time ratios, the ratio of schedule lengths will, in effect, be the ratio of intertask communication costs. Since the communication times for both algorithms increase by the same constant, the ratio between the schedule lengths remains almost constant. In Section 2, it was stated that optimal results can be obtained if the DAG is of coarse granularity. From the results here, it can be noticed that, if the DAGs are of finer granularity, optimal results may not be obtained, but the results are still good compared to the linear clustering algorithm. The speedup achieved by the TDS algorithm over the LC algorithm for five different ratios of communication to computation times is shown in Table 2.

3.4 Comparison with Other Algorithms
The DAG [8] shown in Fig. 12 has been used to compare the TDS algorithm with five other algorithms, namely, the DSC algorithm [8], the Linear Clustering algorithm [11], the Internalization Prepass, the MCP algorithm, and the Threshold Scheduling algorithm; the schedule lengths and complexities are summarized in Table 3.

Fig. 12. DAG for comparison of algorithms.

TABLE 3
COMPARISON OF ALGORITHMS

Algorithm                        Schedule Length   Complexity
Optimal Algorithm                8.5               NP-Complete
Linear Clustering                11.5              O(|V|(|E| + |V|))
MCP Algorithm                    10.5              O(|V|² log |V|)
Internalization Prepass          10.0              O(|E|(|V| + |E|))
Dominant Sequence Clustering     9.0               O((|E| + |V|) log |V|)
Threshold Scheduling Algorithm   10.0              O(|V|²)
TDS Algorithm                    8.5               O(|V|²)

4 CONCLUSIONS
This paper presents a fast algorithm, based on the duplication of tasks, to schedule the tasks of a DAG onto the processors of DMMs. The algorithm has a complexity of O(|V|²), where |V| is the number of tasks of the DAG, and provides the best possible solution if the task graph satisfies a simple condition. Even if the condition is not satisfied, the algorithm provides a good schedule, close to the optimum solution.

The performance of TDS has been observed by scheduling some practical DAGs onto DMMs and comparing against the schedule length obtained by the Linear Clustering algorithm. The TDS algorithm has also been compared to other algorithms in terms of its complexity and the schedule length generated.
REFERENCES
[1] T.L. Adam, K.M. Chandy, and J.R. Dickson, “A Comparison of List Schedules for Parallel Processing Systems,” Comm. ACM, vol. 17, no. 12, pp. 685-690, Dec. 1974.
[2] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
[3] D.P. Bertsekas and J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice-Hall Int’l, 1989.
[4] H.B. Chen, B. Shirazi, K. Kavi, and A.R. Hurson, “Static Scheduling Using Linear Clustering with Task Duplication,” Proc. ISCA Int’l Conf. Parallel and Distributed Computing and Systems, pp. 285-290, Louisville, Ky., Oct. 14-16, 1993.
[5] J.Y. Colin and P. Chrétienne, “C.P.M. Scheduling with Small Communication Delays and Task Duplication,” Operations Research, vol. 39, no. 4, pp. 680-684, July 1991.
[6] S. Darbha and D.P. Agrawal, “SDBS: A Task Duplication Based Optimal Scheduling Algorithm,” Proc. Scalable High Performance Computing Conf., pp. 756-763, Knoxville, Tenn., May 23-25, 1994.
[7] H. El-Rewini and T.G. Lewis, “Scheduling Parallel Program Tasks Onto Arbitrary Target Architectures,” J. Parallel and Distributed Computing, vol. 9, pp. 138-153, 1990.
[8] A. Gerasoulis and T. Yang, “A Comparison of Clustering Heuristics for Scheduling Directed Acyclic Graphs on Multiprocessors,” J. Parallel and Distributed Computing, vol. 16, pp. 276-291, 1992.
[9] R.L. Graham, E.L. Lawler, J.K. Lenstra, and A.H.G. Rinnooy Kan, “Optimization and Approximation in Deterministic Sequencing and Scheduling: A Survey,” Annals of Discrete Mathematics, pp. 287-326, 1979.
[10] O.H. Ibarra and S.M. Sohn, “On Mapping Systolic Algorithms onto the Hypercube,” IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 1, pp. 48-63, Jan. 1990.
[11] S.J. Kim and J.C. Browne, “A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures,” Proc. Int’l Conf. Parallel Processing, vol. 3, pp. 1-8, 1988.
[12] J.P. Kitajima and B. Plateau, “Building Synthetic Parallel Programs: The Project (ALPES),” Proc. IFIP WG 10.3 Workshop on Programming Environments for Parallel Computing, pp. 161-170, Edinburgh, Scotland, Apr. 6-8, 1992.
[13] B. Kruatrachue, “Static Task Scheduling and Grain Packing in Parallel Processing Systems,” PhD thesis, Oregon State Univ., 1987.
[14] Y.-K. Kwok and I. Ahmad, “Exploiting Duplication to Minimize the Execution Times of Parallel Programs on Message-Passing Systems,” Proc. Sixth IEEE Symp. Parallel and Distributed Processing, pp. 426-433, Oct. 26-29, 1994.
[15] S.S. Pande, D.P. Agrawal, and J. Mauney, “A New Threshold Scheduling Strategy for Sisal Programs on Distributed Memory Systems,” J. Parallel and Distributed Computing, vol. 21, no. 2, pp. 223-236, May 1994.
[16] S.S. Pande, D.P. Agrawal, and J. Mauney, “A Scalable Scheduling Method for Functional Parallelism on Distributed Memory Multiprocessors,” IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 4, pp. 388-399, Apr. 1995.
[17] V. Sarkar, Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. Cambridge, Mass.: MIT Press, 1989.
[18] G.C. Sih and E.A. Lee, “A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures,” IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 2, pp. 175-187, Feb. 1993.
[19] M.Y. Wu and D. Gajski, “A Programming Aid for Hypercube Architectures,” J. Supercomputing, vol. 2, pp. 349-372, 1988.
Sekhar Darbha (S’89, M’96) received his BTech degree in electrical engineering from the Institute of Technology, Banaras Hindu University, Varanasi, India, in 1989. He received his MS and PhD degrees in computer engineering from North Carolina State University, Raleigh, North Carolina, in 1991 and 1995, respectively. He has been working as an assistant professor in the Department of Electrical and Computer Engineering at Rutgers University since August 1995. His research interests are in program partitioning and scheduling for multiprocessing systems. He served as a coordinator of the partitioning and scheduling minitrack at the 30th Hawaii International Conference on System Sciences (HICSS), and he will be involved with the 31st HICSS as coordinator of the minitrack on compiling for distributed and embedded systems. He is a member of the IEEE.

Dharma P. Agrawal (M’84-F’87) is a professor in the Department of Electrical and Computer Engineering at North Carolina State University, Raleigh. His research interests include parallelizing and scheduling techniques, routing in multicomputer networks, mobile networks, and system reliability. He has edited a tutorial text on Advanced Computer Architecture (IEEE Computer Society Press, 1986), coedited texts entitled Distributed Computing Network Reliability and Advances in Distributed System Reliability (IEEE Computer Society Press, 1990), and a self-study guide on Parallel Processing (IEEE Press, 1991).
Dr. Agrawal is an editor of the Journal of Parallel and Distributed Systems and the International Journal of High Speed Computing. He has served as an editor of IEEE Computer magazine and the IEEE Transactions on Computers. He has been the program chair for the 1984 International Symposium on Computer Architecture and the 1994 International Conference on Parallel Processing, the workshop chair for the 1995 ICPP Workshop on Challenges for Parallel Processing, and the general chair for the 1993 ISMM International Conference on Parallel and Distributed Computing Systems and the MASCOTS 1996 workshop. Recently, he served as the chair of the IEEE Computer Society Technical Committee on Computer Architecture and is currently chair of the IEEE-CS Harry Goode and McDowell award committees. He is a fellow of the IEEE.